Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs


From:	Alexander Graf
Subject:	Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
Date:	Wed, 22 Jan 2014 11:52:20 +0100
On 22.01.2014, at 08:26, Victor Kamensky <address@hidden> wrote:

> On 21 January 2014 22:41, Alexander Graf <address@hidden> wrote:
>> 
>> 
>> "Native endian" really is just a shortcut for "target endian"
>> which is LE for ARM and BE for PPC. There shouldn't be
>> a qemu-system-armeb or qemu-system-ppc64le.
> 
> I disagree. Fully functional ARM BE system is what we've
> been working on for last few months. 'We' is Linaro
> Networking Group, Endian subteam and some other guys
> in ARM and across community. Why we do that is a bit
> beyond of this discussion.
> 
> ARM BE patches for both V7 and V8 are already in mainline
> kernel. But ARM BE KVM host is broken now. It is known
> deficiency that I am trying to fix. Please look at [1]. Patches
> for V7 BE KVM were proposed and currently under active
> discussion. Currently I work on ARM V8 BE KVM changes.
> 
> So "native endian" in ARM is value of CPSR register E bit.
> If it is off native endian is LE, if it is on it is BE.
> 
> Once and if we agree on ARM BE KVM host changes, the
> next step would be patches in qemu one of which introduces
> qemu-system-armeb. Please see [2].

I think we're facing an ideology conflict here. Yes, there should be a 
qemu-system-arm that is BE capable. There should also be a qemu-system-ppc64 
that is LE capable. But there is no point in changing the "default endianness" 
for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of 
running in LE or BE mode, the question is just what we declare the "default".

Think about the PPC bootstrap. We start off with a BE firmware, then boot into 
the Linux kernel which calls a hypercall to set the LE bit on every interrupt. 
But there's no reason this little endian kernel couldn't theoretically have big 
endian user space running with access to emulated device registers.

As Peter already pointed out, the actual breakage behind this is that we have a 
"default endianness" at all. But that's a very difficult thing to resolve and I 
don't think it should be our primary goal. Just live with the fact that we declare 
ARM little endian in QEMU and swap things accordingly - then everyone's happy.

This really only ever becomes a problem if you have devices that have awareness 
of the CPU's endian mode. The only one on PPC that I'm aware of that falls into 
this category is virtio and there are patches pending to solve that. I don't 
know if there are any QEMU emulated devices outside of virtio with this issue 
on ARM, but you'll have to make the emulation code for those look at the CPU 
state then.

> 
>> QEMU emulates everything that comes after the CPU, so
>> imagine the ioctl struct as a bus package. Your bus
>> doesn't care what endianness the CPU is in - it just
>> gets data from the CPU.
> 
> I am not sure that I follow above. Suppose I have
> 
> mov r1, #1
> str r1, [r0]
> 
> where r0 is device address. Now depending on CPSR
> E bit value device address will receive 1 as integer either
> in LE order or in BE order. That is how ARM v7 CPU
> works, regardless whether it is emulated or not.
> 
> So if E bit is off (LE case) after str is executed
> byte at r0 address will get 1
> byte at r0 + 1 address will get 0
> byte at r0 + 2 address will get 0
> byte at r0 + 3 address will get 0
> 
> If E bit is on (BE case) after str is executed
> byte at r0 address will get 0
> byte at r0 + 1 address will get 0
> byte at r0 + 2 address will get 0
> byte at r0 + 3 address will get 1
> 
> my point that mmio.data[] just carries bytes for phys_addr
> mmio.data[0] would be value for byte at phys_addr,
> mmio.data[1] would be value for byte at phys_addr + 1, and
> so on.

What we get is an instruction that traps because it wants to "write r1 (which 
has value=1) into address x". So at that point we get the register value.

Then we need to take a look at the E bit to see whether the write was supposed 
to be in non-host endianness because we need to emulate exactly the LE/BE 
difference you're indicating above. The way we implement this on PPC is that we 
simply byte swap the register value when guest_endian != host_endian.
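
As a rough illustration of that swap rule (not the actual kernel code; the 
function names and the boolean parameters are made up for this sketch), the 
value handed out for a trapped store could be computed like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reverse the low `len` bytes of a register value. */
uint64_t swap_bytes(uint64_t v, unsigned len)
{
    uint64_t out = 0;

    for (unsigned i = 0; i < len; i++) {
        out = (out << 8) | (v & 0xff);  /* move lowest byte to the top */
        v >>= 8;
    }
    return out;
}

/*
 * Hypothetical helper: value to report for a trapped store of `len` bytes.
 * The register value is passed through untouched unless the guest's current
 * endian mode differs from the host's, in which case it is byte swapped.
 */
uint64_t mmio_store_value(uint64_t reg, unsigned len,
                          bool guest_is_be, bool host_is_be)
{
    assert(len == 1 || len == 2 || len == 4 || len == 8);
    return (guest_is_be != host_is_be) ? swap_bytes(reg, len) : reg;
}
```

For the "str r1, [r0]" example with r1 = 1, a 4 byte store gives 1 when the 
two modes match and 0x01000000 when they don't.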

With this in place, QEMU can just memcpy() the value into a local register and 
feed it into its emulation code which expects a "register value as if the CPU 
was running in native endianness" as parameter - with "native" meaning "little 
endian" for qemu-system-arm. Device emulation code doesn't know what to do with 
a byte array.

Take a look at QEMU's MMIO handler:

        case KVM_EXIT_MMIO:
            DPRINTF("handle_mmio\n");
            cpu_physical_memory_rw(run->mmio.phys_addr,
                                   run->mmio.data,
                                   run->mmio.len,
                                   run->mmio.is_write);
            ret = 0;
            break;

which translates to

                switch (l) {
                case 8:
                    /* 64 bit write access */
                    val = ldq_p(buf);
                    error |= io_mem_write(mr, addr1, val, 8);
                    break;
                case 4:
                    /* 32 bit write access */
                    val = ldl_p(buf);
                    error |= io_mem_write(mr, addr1, val, 4);
                    break;
                case 2:
                    /* 16 bit write access */
                    val = lduw_p(buf);
                    error |= io_mem_write(mr, addr1, val, 2);
                    break;
                case 1:
                    /* 8 bit write access */
                    val = ldub_p(buf);
                    error |= io_mem_write(mr, addr1, val, 1);
                    break;
                default:
                    abort();
                }

which calls the ldx_p primitives

#if defined(TARGET_WORDS_BIGENDIAN)
#define lduw_p(p) lduw_be_p(p)
#define ldsw_p(p) ldsw_be_p(p)
#define ldl_p(p) ldl_be_p(p)
#define ldq_p(p) ldq_be_p(p)
#define ldfl_p(p) ldfl_be_p(p)
#define ldfq_p(p) ldfq_be_p(p)
#define stw_p(p, v) stw_be_p(p, v)
#define stl_p(p, v) stl_be_p(p, v)
#define stq_p(p, v) stq_be_p(p, v)
#define stfl_p(p, v) stfl_be_p(p, v)
#define stfq_p(p, v) stfq_be_p(p, v)
#else
#define lduw_p(p) lduw_le_p(p)
#define ldsw_p(p) ldsw_le_p(p)
#define ldl_p(p) ldl_le_p(p)
#define ldq_p(p) ldq_le_p(p)
#define ldfl_p(p) ldfl_le_p(p)
#define ldfq_p(p) ldfq_le_p(p)
#define stw_p(p, v) stw_le_p(p, v)
#define stl_p(p, v) stl_le_p(p, v)
#define stq_p(p, v) stq_le_p(p, v)
#define stfl_p(p, v) stfl_le_p(p, v)
#define stfq_p(p, v) stfq_le_p(p, v)
#endif

and then passes the result as "originating register access" to the device 
emulation part of QEMU.


Maybe it becomes more clear if you understand the code flow that TCG is going 
through. With TCG whenever a write traps into MMIO we go through these functions

void
glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                         DATA_TYPE val, int mmu_idx)
{
    helper_te_st_name(env, addr, val, mmu_idx, GETRA());
}

#ifdef TARGET_WORDS_BIGENDIAN
# define TGT_BE(X)  (X)
# define TGT_LE(X)  BSWAP(X)
#else
# define TGT_BE(X)  BSWAP(X)
# define TGT_LE(X)  (X)
#endif

void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                       int mmu_idx, uintptr_t retaddr)
{
[...]
    /* Handle an IO access.  */
    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
        hwaddr ioaddr;
        if ((addr & (DATA_SIZE - 1)) != 0) {
            goto do_unaligned_access;
        }
        ioaddr = env->iotlb[mmu_idx][index];

        /* ??? Note that the io helpers always read data in the target
           byte ordering.  We should push the LE/BE request down into io.  */
        val = TGT_LE(val);
        glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
        return;
    }
    [...]
}

static inline void glue(io_write, SUFFIX)(CPUArchState *env,
                                          hwaddr physaddr,
                                          DATA_TYPE val,
                                          target_ulong addr,
                                          uintptr_t retaddr)
{
    MemoryRegion *mr = iotlb_to_region(physaddr);

    physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
    if (mr != &io_mem_rom && mr != &io_mem_notdirty && !can_do_io(env)) {
        cpu_io_recompile(env, retaddr);
    }

    env->mem_io_vaddr = addr;
    env->mem_io_pc = retaddr;
    io_mem_write(mr, physaddr, val, 1 << SHIFT);
}

which at the end of the chain means if you're running the same endianness on 
guest and host, you get the original register value as function parameter. If 
you run different endianness you get a swapped value as function parameter.

So at the end of all of this, if you're running qemu-system-arm (TCG) on a BE 
host, the request comes in as the register value and stays exactly that way 
until it reaches the IO callback function. Only if you define a specific 
endianness for your device may the callback swizzle it again. If your device 
declares DEVICE_LITTLE_ENDIAN or DEVICE_NATIVE_ENDIAN, it won't.
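
To make the device-side rule concrete, here is a small sketch of the decision. 
The enum and function names are invented for illustration and are not QEMU's; 
the only assumption is that a swap happens exactly when the device declares 
the endianness opposite to the target's default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device's declared endianness. */
enum dev_endian_sketch { DEV_NATIVE, DEV_LITTLE, DEV_BIG };

/* Hypothetical helper: reverse the bytes of a 32 bit value. */
uint32_t dev_bswap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}

/*
 * Value the device callback sees. "Native" never swaps; an explicit
 * endianness swaps only if it is the opposite of the target default.
 */
uint32_t adjust_for_device(uint32_t val, enum dev_endian_sketch dev,
                           bool target_is_be)
{
    assert(dev == DEV_NATIVE || dev == DEV_LITTLE || dev == DEV_BIG);

    bool mismatch = target_is_be ? (dev == DEV_LITTLE) : (dev == DEV_BIG);
    return mismatch ? dev_bswap32(val) : val;
}
```

With target_is_be = false (the qemu-system-arm case), only DEV_BIG changes 
the value.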

What happens when you switch your guest to BE mode (or LE for PPC)? Very 
simple. The TCG frontend swizzles every memory read and write before it hits 
TCG's memory operations.

If you're running qemu-system-arm (KVM) on a BE host, the request comes into 
kvm-all.c, gets read with swapped endianness (ldq_p) and is then passed that 
way into the IO callback function. That's where the bug lies. It should behave 
the same way as TCG, so it needs to know the value the register originally had. 
So instead of doing an ldq_p() it should go through a different path that does 
memcpy().
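
A minimal sketch of the difference, with made-up function names: load32_le() 
stands in for what ldl_p() expands to in a little endian target build 
(ldl_le_p per the macros above), while load_host_order() is the memcpy() path 
I mean. This is an illustration, not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Always decodes the buffer as little endian, whatever the host is. */
uint32_t load32_le(const uint8_t *buf)
{
    return (uint32_t)buf[0] |
           (uint32_t)buf[1] << 8 |
           (uint32_t)buf[2] << 16 |
           (uint32_t)buf[3] << 24;
}

/*
 * memcpy() path: reinterpret the bytes in host order. If the producer
 * stored a host-order register value, this returns exactly that value
 * on both LE and BE hosts.
 */
uint64_t load_host_order(const uint8_t *buf, size_t len)
{
    switch (len) {
    case 1: return buf[0];
    case 2: { uint16_t v; memcpy(&v, buf, sizeof(v)); return v; }
    case 4: { uint32_t v; memcpy(&v, buf, sizeof(v)); return v; }
    case 8: { uint64_t v; memcpy(&v, buf, sizeof(v)); return v; }
    default: abort();
    }
}
```

On an LE host both functions agree. On a BE host load32_le() returns the 
byte-reversed value, which is the swap described above.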

But that doesn't fix the other-endian issue yet, right? Every value now would 
come in as the register value.

Well, unless you do the same thing TCG does inside the kernel. So the kernel 
would swap the reads and writes before it accesses the ioctl struct that 
connects kvm with QEMU. Then all abstraction layers work just fine again and we 
don't need any qemu-system-armeb.
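
Put together, the flow I have in mind looks roughly like the sketch below. 
struct mmio_exit is a simplified stand-in for the mmio part of the shared 
structure, and the function names and the guest_in_other_mode flag are 
invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical stand-in for the shared mmio data. */
struct mmio_exit {
    uint8_t  data[8];
    uint32_t len;
};

/* Hypothetical helper: reverse the bytes of a 32 bit value. */
uint32_t bswap32_sketch(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}

/*
 * "Kernel" side: swap only when the guest runs in the non-default endian
 * mode (the same job the TCG frontend does), then store the value as is.
 */
void kernel_emit_store32(struct mmio_exit *m, uint32_t reg,
                         bool guest_in_other_mode)
{
    uint32_t val = guest_in_other_mode ? bswap32_sketch(reg) : reg;

    memcpy(m->data, &val, sizeof(val));
    m->len = sizeof(val);
}

/* "QEMU" side: plain memcpy(), no endian-dependent load. */
uint32_t qemu_read_store32(const struct mmio_exit *m)
{
    uint32_t val;

    assert(m->len == sizeof(val));
    memcpy(&val, m->data, sizeof(val));
    return val;
}
```

The device emulation then sees 1 for a guest in the default mode and 
0x01000000 for a guest in the other mode, independent of host endianness.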


Alex