Re: [Qemu-devel] [PATCH for-2.4] tcg/i386: Implement trunc_shr_i32


From: Aurelien Jarno
Subject: Re: [Qemu-devel] [PATCH for-2.4] tcg/i386: Implement trunc_shr_i32
Date: Sun, 19 Jul 2015 13:26:45 +0200
User-agent: Mutt/1.5.23 (2014-03-12)

On 2015-07-18 23:18, Aurelien Jarno wrote:
> On 2015-07-18 08:58, Richard Henderson wrote:
> > Enforce the invariant that 32-bit quantities are zero extended
> > in the register.  This avoids having to re-zero-extend at memory
> > accesses for 32-bit guests.
> > 
> > Signed-off-by: Richard Henderson <address@hidden>
> > ---
> > Here's an alternative to the other things we've been considering.
> > We could even make this conditional on USER_ONLY if you like.
> > 
> > This does in fact fix the mips test case.  Consider the fact that
> > memory operations are probably more common than truncations, and
> > it would seem that we have a net size win by forcing the truncate
> > over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).
> 
> I think we should go with your previous patch for 2.4, and think calmly
> about how to do that better for 2.5. It slightly increases the generated
> code, but only in bytes, not in number of instructions, so I don't think
> the performance impact is huge.
> 
> > Indeed, for 2.5, we could look at dropping the existing zero-extend
> > from the softmmu path.  Also for 2.5, split trunc_shr into two parts,
> 
> From a quick look, we need to move the address to new registers anyway,
> so not zero-extending will mean adding the REXW prefix.

Well, looking at it in more detail, we can move one instruction from the
fast path to the slow path. Here is typical TLB code for a store:

fast-path:
      mov    %rbp,%rdi
      mov    %rbp,%rsi
      shr    $0x7,%rdi
      and    $0xfffffffffffff003,%rsi
      and    $0x1fe0,%edi
      lea    0x4e68(%r14,%rdi,1),%rdi
      cmp    (%rdi),%rsi
      mov    %rbp,%rsi
      jne    0x7f45b8bcc800
      add    0x10(%rdi),%rsi
      mov    %ebx,(%rsi)

slow-path:
      mov    %r14,%rdi
      mov    %ebx,%edx
      mov    $0x22,%ecx
      lea    -0x156(%rip),%r8
      push   %r8
      jmpq   0x7f45cb337010

If we know that %rbp is properly zero-extended when needed, we can change
the end of the fast path into:

      cmp    (%rdi),%rsi
      jne    0x7f45b8bcc800
      mov    0x10(%rdi),%rsi  
      mov    %ebx,(%rsi,%rbp,1)

However, that means that %rsi is no longer loaded with the address, so we
have to load it in the slow path. In the end this moves one instruction
from the fast path to the slow path.

Now, I have no idea what would actually improve performance. A smaller
fast path, so there are fewer instructions to execute? Smaller code in
general, so that the caches are better used?

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
address@hidden                 http://www.aurel32.net


