qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH] target-mips: apply workaround for TCG optimizat


From: Aurelien Jarno
Subject: Re: [Qemu-devel] [PATCH] target-mips: apply workaround for TCG optimizations for MFC1
Date: Wed, 15 Jul 2015 00:09:38 +0200
User-agent: Mutt/1.5.23 (2014-03-12)

On 2015-07-14 22:56, Paolo Bonzini wrote:
> 
> 
> On 14/07/2015 20:37, Aurelien Jarno wrote:
> >> > 
> >> > I certainly don't have a global view, so much that I didn't think at 
> >> > all of the optimizer... Instead, it looks to me like a bug in the 
> >> > register allocator.  In particular this code in tcg_reg_alloc_mov:
> > That's exactly my point when I said that someone doesn't have a global
> > view. I think the fact that we don't check for type when simplifying
> > moves in the register allocator is intentional, the same way we simply
> > transform the trunc op into a mov op (except on sparc). This is done
> > because it's not needed for example on x86 and most architectures,
> > given 32-bit instructions do not care about the high part of the
> > registers.
> > 
> > Basically size changing ops are trunc_i64_i32, ext_i32_i64 and
> > extu_i32_i64. We can be conservative and implement all of them as real
> > instructions in all TCG backends. In that case the mov op never has
> > to deal with registers of different size (just like we enforce that at
> > the TCG frotnend level), and the register allocator and the optimizer
> > do not have to deal with this. However that's suboptimal on some
> > architectures, that's why on x86 we decided to just replace the
> > trunc_i64_i32 by a move. But if we do this simplification it should be
> > done everywhere (in that case, including in the qemu_ld op). And
> > DOCUMENTED somewhere, given different choices can be made for different
> > backends.
> 
> I think there are four cases:

Well I think we should not see it in terms of only the qemu_ld/qemu_st
32-bit ops, but 32-bit ops in general.

> 1) 64-bit processors that do not have loads with 32-bit addresses, and
> do not zero extend on 32-bit operations---possibly because 32-bit
> operations do not exist at all.
> 
>       => qemu_ld/qemu_st must truncate the address
> 
>       ia64, s390, sparc all fall under this group.
> 
> 2) 64-bit processors that have loads with 32-bit addresses.
> 
>       => qemu_ld/qemu_st can use 32-bit addresses to do the
>          truncation
> 
>       aarch64, I think, falls under this group

I don't think that works. We don't want to get a load with a 32-bit
address. We want a load of (guest_base + address), with guest_base
possibly being 64-bit, address being 32-bit and the result likely
being 64-bit.

> 3) Processors that do not have 32-bit loads, and automatically zero
> extend on 32-bit operations
> 
>       => qemu_ld/qemu_st could use 64-bit addresses and no truncation
> 
> x86 currently falls under 3, because it doesn't use ADDR32, but the
> register allocator is breaking case 3 by forcing 64-bit operations when
> loading from a global.

Well the use of ADDR32 is a bit special, it only works because we can't
use %gs to add the guest base address. When we can't use %gs, ADDR32
can't work.

> I am not sure if the optimizer could also break this case, or if it is

Now that we track high bits are "garbage", the optimizer should be safe.

> working by chance.  So, the simplest fix for 2.4 would be to add the
> prefix as suggested in the comment and make x86 fall under 2.

I think it's the way to go, at least until we have a better view of how
the 32 to 64-bit register works.

> If the optimizer is not breaking this case, fixing the register
> allocator would be an option, and then the ADDR32 prefix could be reverted.

I don't think the register allocator is at fault at all. The register
tcg_reg_alloc_mov doesn't check for the register type because a TCG mov
is by definition only between registers of the same size. We have
different ops (trunc, ext, extu) to handle moves between different
register size.

The problem is that we replace the trunc instruction by a mov (except on
sparc) in tcg_gen_trunc_shr_i64_i32 to get more optimized code:

| ...
|     } else if (count == 0) {
|         tcg_gen_mov_i32(ret, MAKE_TCGV_I32(GET_TCGV_I64(arg)));
|     } else {
|         TCGv_i64 t = tcg_temp_new_i64();
|         tcg_gen_shri_i64(t, arg, count);
|         tcg_gen_mov_i32(ret, MAKE_TCGV_I32(GET_TCGV_I64(t)));
|         tcg_temp_free_i64(t);
|     }
| ...

If we actually implement the trunc_shr_i64_i32 instruction on all
targets, we get rid of this problem without having to tweak the register
allocator. But the generated code is then slightly less optimal, as we
emit an, x86 mov instruction to do the zero extension.

> Even if the prefix was added, modifying the register allocator to use
> 32-bit loads would still be useful as an optimization, since on x86
> 32-bit loads are smaller than 64-bit loads.

AFAIK, that's already the case. The REXW prefix is only emitted for
64-bit ops. The user mode qemu_ld/st is a bit different case, because
there you mix values and addresses.

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
address@hidden                 http://www.aurel32.net



reply via email to

[Prev in Thread] Current Thread [Next in Thread]