
Re: [Qemu-devel] [RFC] Streamlining endian handling in TCG


From: Aurelien Jarno
Subject: Re: [Qemu-devel] [RFC] Streamlining endian handling in TCG
Date: Tue, 3 Sep 2013 01:42:23 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Aug 28, 2013 at 08:26:43AM -0700, Richard Henderson wrote:
> On 08/28/2013 07:34 AM, Peter Maydell wrote:
> > On 28 August 2013 15:31, Richard Henderson <address@hidden> wrote:
> >> On 08/28/2013 01:15 AM, Peter Maydell wrote:
> >>> [*] not impossible, we already do something on the ppc
> >>> that's similar; however I'd really want to take the time to
> >>> figure out how to do endianness swapping "properly"
> >>> and what qemu does currently before messing with it.
> >>
> >> I've got a loose plan in my head for how to clean up handling of
> >> reverse-endian load/store instructions at both the translator and
> >> tcg backend levels.
> > 
> > Nice. Will it allow us to get rid of TARGET_WORDS_BIGENDIAN?
> 
> I don't know, as I don't know off-hand what all that implies.
> 
> Let me lay out my idea and see what you think:
> 
> Currently, at the TCG level we have 8 qemu_ld* opcodes, and 4 qemu_st*
> opcodes, that always produce target_ulong sized results, and always in
> the guest declared endianness.
> 
> There are several problems I want to address:
> 
> (1) I want explicit _i32 and _i64 sizes for the loads and stores.  This will
> clean up a number of places in several translators where we have to load to
> _tl and then truncate or extend to an explicit size.

I guess you mean that it would still be possible to do a qemu_ld32u into
an _i64 destination?

This would also be the moment to clean up the big mess with qemu_ld32 for
32-bit guests vs qemu_ld32/qemu_ld32u/qemu_ld32s for 64-bit guests.

> (2) I want explicit endianness for the loads and stores.  E.g. when a sparc
> guest does a byte-swapped store, there's little point in doing two offsetting
> bswaps to make that happen.

That's indeed something which would be nice to fix. This is also the
case for powerpc, which has byte-reversed load/store instructions.
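To make the offsetting-swaps point concrete, here is a plain-C sketch (not QEMU code; bswap32 is a local helper standing in for the swap a backend emits around a memory access):

```c
#include <assert.h>
#include <stdint.h>

/* Plain-C sketch, not QEMU code: bswap32 stands in for the byte swap
 * a TCG backend emits around a memory access. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* A byte-reversed guest store currently swaps once in the translator
 * to reach "guest endianness", and the qemu_st path may swap again for
 * the host: the two swaps cancel, so emitting both is pure overhead. */
static uint32_t double_swap(uint32_t v)
{
    return bswap32(bswap32(v));
}
```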

> (3) For hosts that do not support byte-swapped loads and stores themselves,
> the need to allocate extra registers during the memory operation in order
> to hold the swapped results is an unnecessary burden.  Better to expose the
> bswap operation at the tcg opcode level and let normal register allocation
> happen.

I don't fully agree with that point. For loads, the byte swap is
basically done in place, without using any additional register. For
stores, the bswap has to be done beforehand, but if the value being
stored is not used later in the TB, no additional register is needed.

> Now, naively implementing 1 and 2 would result in 32 opcodes for qemu_ld*.
> That is obviously a non-starter.  However, the very first thing that each tcg
> backend does is map the current 8 opcodes into a bitmask ("opc" and "s_bits"
> in the source).  Let us make that official, and then extend it.

Agreed.
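As a sketch of what making the bitmask official could look like (the names and bit layout are invented for illustration, not the actual QEMU interface), with the size kept as a power of two like today's s_bits:

```c
#include <assert.h>

/* Illustrative only, not the real QEMU interface: encode the memory
 * operation in one constant, as the backends' "opc"/"s_bits" already
 * do implicitly. */
#define MO_SIZE  0x03  /* access is 1 << (opc & MO_SIZE) bytes wide */
#define MO_SIGN  0x04  /* sign-extend the loaded value */

static int memop_bytes(int opc)     { return 1 << (opc & MO_SIZE); }
static int memop_is_signed(int opc) { return (opc & MO_SIGN) != 0; }
```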

> Therefore:
> 
> (A) Compress qemu_ld* into two qemu_ld_{i32,i64}, with an additional constant
> argument that describes the actual load, exactly as "opc" does today.
> Adjusting the translators to match can be done in stages, or we might
> decide to leave the existing translator-level interface in place permanently.

I think this can be done quite quickly, as the conversion is basically a
matter of find and replace once you have identified whether _tl is _i32
or _i64. That would leave a few non-optimized cases, like _i32 loads
later converted to _i64, but that's better than having two interfaces.

> (B) Add an additional bit to the "opc" to indicate which endianness is
> desired.  E.g. 0 = LE, 8 = BE.  Expose the opc interface to the
> translators.  At which point generating a load becomes more like
> 
>     tcg_gen_qemu_ld_tl(dest, addr, size | sign | dc->big_endian);
> 
> and the current endianness of the guest becomes a bit on the TB, to be copied
> into the DisasContext at the beginning of translation.

We should probably define short named constants for that, and why not
all 32 possible combinations, each just a few letters long.

Also, the question is: do we want little-endian/big-endian, or
native-endian/cross-endian?
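The two conventions are interconvertible; a small sketch (hypothetical helper, assuming the endian bit from point B): whichever encoding is chosen, what the backend ultimately needs to know is whether the access's byte order differs from the host's.

```c
#include <assert.h>

/* Hypothetical helper: with absolute LE/BE encoding, the backend needs
 * a bswap exactly when the access endianness differs from the host's.
 * With a native/cross encoding, the "cross" bit is this value directly. */
static int needs_bswap(int access_is_be, int host_is_be)
{
    return access_is_be != host_is_be;
}
```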

> (C) Examine the endian bit in the tcg-op.h expander, and check a
> TCG_TARGET_HAS_foo flag to see if the tcg backend supports reverse endian
> memory ops.  If not, break out the bswap into the opcode stream as a
> temporary.
> 
> The corollary here is that we must have a full set of bi-endian tcg helper
> functions.  At the moment, the helper functions are all keyed to the
> hard-coded guest endianness.  That means the typical LE/BE host/guest
> memory op looks like
> 
>       if (tlb hit) {
>           t = bswap(data);
>           store t;
>       } else {
>           helper_store_be(data);
>       }
> 
> If we hoist the bswap it'll need to be
> 
>       t = bswap(data);
>       if (tlb hit) {
>           store t;
>       } else {
>           helper_store_le(t);
>       }

I am not sure it is worth adding additional complexity (and differences
between targets) there; it basically amounts to rewriting the qemu_ld*
code in each TCG target.

All TCG targets already support cross-endianness, as they can run both
little- and big-endian guests. Only minimal tweaks are needed to support
both endiannesses at once (basically converting a few #ifdefs into ifs).
The problem is at the helper level: we need to either bswap the value
before/after calling the helper, or define more versions of the helpers
to handle big/little endian.

I would personally go for more helpers, as that keeps the slow path
small and also means fewer modifications in the target code.
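A sketch of the "more helpers" option (invented names and plain C, not the real softmmu helper signatures): both entry points share one host-endian store and differ only in when they swap.

```c
#include <stdint.h>
#include <string.h>

static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Detect host endianness at run time, just for this sketch. */
static int host_is_be(void)
{
    uint16_t one = 1;
    uint8_t b;
    memcpy(&b, &one, 1);
    return b == 0;
}

/* Invented names: one host-endian store and two endian-specific entry
 * points, keeping the slow path small. */
static void store_raw(uint8_t *hostp, uint32_t v)
{
    memcpy(hostp, &v, 4);          /* host byte order */
}

static void helper_stl_le(uint8_t *hostp, uint32_t v)
{
    store_raw(hostp, host_is_be() ? bswap32(v) : v);
}

static void helper_stl_be(uint8_t *hostp, uint32_t v)
{
    store_raw(hostp, host_is_be() ? v : bswap32(v));
}
```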

> (D) Profit!  I'm not sure what will be left of TARGET_WORDS_BIGENDIAN at
> this point.  Possibly only if we leave the current translator interface in
> place in step A.

At the TCG level, almost nothing, but probably still a lot at the
board/device level.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
address@hidden                 http://www.aurel32.net


