Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs


From: Greg Kurz
Subject: Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
Date: Thu, 23 Jan 2014 11:56:21 +0100

On Wed, 22 Jan 2014 20:25:05 -0800
Victor Kamensky <address@hidden> wrote:

> Hi Alex,
> 
> Sorry for the delayed reply, I was focusing on the discussion
> with Peter. Hope you and other folks may get something
> out of it :).
> 
> Please see responses inline
> 
> On 22 January 2014 02:52, Alexander Graf <address@hidden> wrote:
> >
> > On 22.01.2014, at 08:26, Victor Kamensky <address@hidden>
> > wrote:
> >
> >> On 21 January 2014 22:41, Alexander Graf <address@hidden> wrote:
> >>>
> >>>
> >>> "Native endian" really is just a shortcut for "target endian"
> >>> which is LE for ARM and BE for PPC. There shouldn't be
> >>> a qemu-system-armeb or qemu-system-ppc64le.
> >>
> >> I disagree. A fully functional ARM BE system is what we've
> >> been working on for the last few months. 'We' is the Linaro
> >> Networking Group, Endian subteam, and some other folks
> >> in ARM and across the community. Why we are doing that is
> >> a bit beyond this discussion.
> >>
> >> ARM BE patches for both V7 and V8 are already in the mainline
> >> kernel. But the ARM BE KVM host is broken now. It is a known
> >> deficiency that I am trying to fix. Please look at [1]. Patches
> >> for V7 BE KVM have been proposed and are currently under active
> >> discussion. I am currently working on the ARM V8 BE KVM changes.
> >>
> >> So "native endian" in ARM is value of CPSR register E bit.
> >> If it is off native endian is LE, if it is on it is BE.
> >>
> >> Once and if we agree on ARM BE KVM host changes, the
> >> next step would be patches in qemu one of which introduces
> >> qemu-system-armeb. Please see [2].
> >
> > I think we're facing an ideology conflict here. Yes, there
> > should be a qemu-system-arm that is BE capable.
> 
> Maybe it is not an ideology conflict but rather a terminology
> clarity issue :). I am not sure what you mean by "qemu-system-arm
> that is BE capable". In the qemu build system there is just the
> target name 'arm', which is an ARM V7 cpu in LE mode, and the
> 'armeb' target, which is an ARM V7 cpu in BE mode. That is true
> for a lot of open source packages. You could check the [1] patch
> that introduces the armeb target into qemu. A build for the
> arm target produces a qemu-system-arm executable that is
> marked 'ELF 32-bit LSB executable' and can run on
> traditional LE ARM Linux. A build for the armeb target produces
> a qemu-system-armeb executable that is marked 'ELF 32-bit
> MSB executable' and can run on BE ARM Linux. armeb is
> nothing special here, just a build option for qemu that should run
> on BE ARM Linux.
> 

Hmmm... it looks like there is some confusion about the qemu command naming.
The -<target> suffix in qemu-system-<target> has nothing to do with the ELF
information of the command itself.

address@hidden ~]$ file `which qemu-system-arm`
/bin/qemu-system-arm: ELF 64-bit LSB shared object, x86-64, version 1
(SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32,
BuildID[sha1]=0xbcb974847daa8159c17ed74906cd5351387d4097, stripped

It is valid to create a new target if it is substantially different from
existing ones (ppc64 versus ppc, for example). This is not the case with ARM,
since it is the very same CPU that can switch endianness with the 'setend'
instruction (which needs to be emulated anyway when running in TCG mode).

qemu-system-arm is THE command that should be able to emulate an ARM cpu,
whether the guest does 'setend le' or 'setend be'.
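
To make that concrete, here is a toy sketch (not QEMU's actual TCG code; the
names are made up) of why 'setend' is just per-CPU state rather than grounds
for a separate target: memory stays a byte-invariant array, and the store
path simply picks which byte of the register lands at which address
depending on the E bit.

#include <stdint.h>

struct toy_arm_cpu {
    int cpsr_e;     /* 0 after 'setend le', 1 after 'setend be' */
    uint8_t *ram;   /* flat, byte-addressed guest memory */
};

static void toy_str32(struct toy_arm_cpu *cpu, uint32_t addr, uint32_t val)
{
    for (int i = 0; i < 4; i++) {
        /* BE: most significant byte first; LE: least significant byte first */
        int shift = cpu->cpsr_e ? (24 - 8 * i) : (8 * i);
        cpu->ram[addr + i] = (val >> shift) & 0xff;
    }
}
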

> Both qemu-system-arm and qemu-system-armeb should
> be BE/LE capable. I.e. either of them, along with KVM, could
> run either an LE or a BE guest. MarcZ demonstrated that this
> is possible. I've tested both LE and BE guests with
> qemu-system-arm running on traditional LE ARM Linux,
> effectively repeating Marc's setup but with qemu.
> And with my patches I did test both BE and LE guests with
> qemu-system-armeb running on BE ARM Linux.
> 
> > There
> > should also be a qemu-system-ppc64 that is LE capable.
> > But there is no point in changing the "default endiannes"
> > for the virtual CPUs that we plug in there. Both CPUs are
> > perfectly capable of running in LE or BE mode, the
> > question is just what we declare the "default".
> 
> I am not sure what you mean by "default". Is it the initial
> setting of the CPSR E bit and the 'cp15 c1, c0, 0' EE bit? Yes,
> the way it is currently implemented by the committed
> qemu-system-arm code and the proposed qemu-system-armeb
> patches, they are both off. I.e. even qemu-system-armeb
> starts running the vcpu in LE mode, for a reason very similar
> to the one described in your next paragraph:
> qemu-system-armeb has a tiny bootloader that starts
> in LE mode and jumps to the kernel; the kernel switches the cpu
> to run in BE mode with 'setend be', and the EE bit is set just
> before the mmu is enabled.
> 
> > Think about the PPC bootstrap. We start off with a
> > BE firmware, then boot into the Linux kernel which
> > calls a hypercall to set the LE bit on every interrupt.
> 
> We have a very similar situation with BE ARM Linux.
> When we run ARM BE Linux we start with a bootloader
> which is LE, and then the CPU issues 'setend be' very
> soon after it starts executing kernel code; all secondary
> CPUs issue 'setend be' when they come out of the reset pen
> or bootmonitor sleep.
> 
> > But there's no reason this little endian kernel
> > couldn't theoretically have big endian user space running
> > with access to emulated device registers.
> 
> I don't want to go there; it is very, very messy ...
> 
> ------ Just a side note: ------
> Interestingly, half a year before I joined Linaro, my colleague
> and I at Cisco implemented a kernel patch that allowed running
> BE user-space processes as a sort of separate personality on
> top of an LE ARM kernel ... treating it as a kind of multi-ABI system.
> Effectively we had to do byte swaps on all non-trivial
> system calls and ioctls inside the kernel. We converted
> around 30 system calls and around 10 ioctls. Our target process
> used just those and it was working, but the patch was
> very intrusive and unnatural. I think in Linaro there was
> some public version of my presentation circulated that
> explained all this mess. I don't seriously want to consider it.
> 
> The only robust mixed mode, as MarcZ demonstrated,
> can be done only at VM boundaries. I.e. an LE host can
> run a BE guest fine, and a BE host can run an LE guest fine.
> Everything else would be a huge mess. If we want to go
> through the pros and cons of different mixed modes we need
> to start a separate thread.
> ------ End of side note ------------
> 
> > As Peter already pointed out, the actual breakage behind
> > this is that we have a "default endianness" at all. But that's
> > a very difficult thing to resolve and I don't think should be
> > our primary goal. Just live with the fact that we declare
> > ARM little endian in QEMU and swap things
> > accordingly - then everyone's happy.
> 
> I disagree with Peter's point of view, as you saw from our
> long thread :). I strongly believe that the current mmio.data[]
> describes data on the bus perfectly fine as an array of bytes:
> data[0] goes into phys_addr, data[1] goes into phys_addr + 1,
> etc.
> 
> Please check the "Differences between BE-32 and BE-8 buses"
> section in [2]. In a modern ARM CPU the memory bus is byte invariant
> (BE-8). As far as the byte view of the data lines is concerned, it is
> the same between LE and BE-8, which is why IMHO an array-of-bytes view
> is a very good choice. PPC and MIPS CPU memory buses are also byte
> invariant; they have always been that way. I don't think we care about
> BE-32. So for all practical purposes, the mmio structure is a BE-8 bus
> emulation, where the data signals can be defined as an array of bytes.
> If one were to define it as a set of bigger integers, one would need an
> endianness attribute associated with it. If such an attribute were
> implied by default just through the CPU type, in order to work with
> existing cases it would have to be different for different CPU types,
> which means qemu running in the same endianness but on different CPU
> types would act differently when emulating the same device, and that is
> bad IMHO. So I don't see any value in departing from the byte-array
> view of data on the bus.
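
As an aside, here is a minimal sketch of how a device model would consume
that byte-array view, assuming the convention Victor describes (mmio.data[i]
is the byte at phys_addr + i; the helper name is made up): a device whose
registers are defined as little-endian composes the value explicitly,
without ever consulting the CPU type or its current mode.

#include <stdint.h>

static uint32_t le_reg_from_bus_bytes(const uint8_t *data)
{
    /* data[0] is the byte at phys_addr, data[1] at phys_addr + 1, ... */
    return (uint32_t)data[0]
         | ((uint32_t)data[1] << 8)
         | ((uint32_t)data[2] << 16)
         | ((uint32_t)data[3] << 24);
}

A big-endian device register would simply shift the bytes the other way;
either way the bus data itself needs no endianness attribute.
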
> 
> > This really only ever becomes a problem if you have devices
> > that have awareness of the CPUs endian mode. The only one
> > on PPC that I'm aware of that falls into this category is virtio
> > and there are patches pending to solve that. I don't know if there
> > are any QEMU emulated devices outside of virtio with this
> > issue on ARM, but you'll have to make the emulation code
> > for those look at the CPU state then.
> 
> Agreed on native-endianness devices: I don't think we really should
> have them on ARM, and I trust your assertion for PPC. In any case
> those native-endian devices will be very bad for the mixed-endianness
> case. Agreed that virtio issues must be addressed; when I tested mixed
> modes I had to bring in virtio patches.
> 
> Thanks,
> Victor
> 
> [1]
> https://git.linaro.org/people/victor.kamensky/qemu-be.git/commitdiff/9cc68f682d7c25c6749f0137269de0164d666356?hp=bdc07868d30d3362a4ba0215044a185ff7a80bf4
> 
> [2]
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/ch06s05s01.html
> 
> >>
> >>> QEMU emulates everything that comes after the CPU, so
> >>> imagine the ioctl struct as a bus package. Your bus
> >>> doesn't care what endianness the CPU is in - it just
> >>> gets data from the CPU.
> >>
> >> I am not sure that I follow the above. Suppose I have
> >>
> >> mov r1, #1
> >> str r1, [r0]
> >>
> >> where r0 holds a device address. Now depending on the CPSR
> >> E bit value, the device address will receive 1 as an integer either
> >> in LE order or in BE order. That is how an ARM v7 CPU
> >> works, regardless of whether it is emulated or not.
> >>
> >> So if E bit is off (LE case) after str is executed
> >> byte at r0 address will get 1
> >> byte at r0 + 1 address will get 0
> >> byte at r0 + 2 address will get 0
> >> byte at r0 + 3 address will get 0
> >>
> >> If E bit is on (BE case) after str is executed
> >> byte at r0 address will get 0
> >> byte at r0 + 1 address will get 0
> >> byte at r0 + 2 address will get 0
> >> byte at r0 + 3 address will get 1
> >>
> >> my point is that mmio.data[] just carries the bytes for phys_addr:
> >> mmio.data[0] would be the value of the byte at phys_addr,
> >> mmio.data[1] would be the value of the byte at phys_addr + 1, and
> >> so on.
> >
> > What we get is an instruction that traps because it wants to "write r1
> > (which has value=1) into address x". So at that point we get the
> > register value.
> >
> > Then we need to take a look at the E bit to see whether the write was
> > supposed to be in non-host endianness because we need to emulate
> > exactly the LE/BE difference you're indicating above. The way we
> > implement this on PPC is that we simply byte swap the register value
> > when guest_endian != host_endian.
> >
> > With this in place, QEMU can just memcpy() the value into a local
> > register and feed it into its emulation code which expects a "register
> > value as if the CPU was running in native endianness" as parameter -
> > with "native" meaning "little endian" for qemu-system-arm. Device
> > emulation code doesn't know what to do with a byte array.
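
A minimal sketch of that swap-on-mismatch convention; the helper name is
hypothetical, and the exact predicate (guest mode versus host order, or
versus the target's declared default) is precisely what this thread is
arguing about, so treat the condition as illustrative only.

#include <stdint.h>

/* Adjust the trapped register value so that device emulation sees it
 * "as if the CPU were running in native endianness". */
static uint32_t reg_for_device_emulation(uint32_t guest_reg,
                                         int guest_is_be, int native_is_be)
{
    return (guest_is_be == native_is_be) ? guest_reg
                                         : __builtin_bswap32(guest_reg);
}
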
> >
> > Take a look at QEMU's MMIO handler:
> >
> >         case KVM_EXIT_MMIO:
> >             DPRINTF("handle_mmio\n");
> >             cpu_physical_memory_rw(run->mmio.phys_addr,
> >                                    run->mmio.data,
> >                                    run->mmio.len,
> >                                    run->mmio.is_write);
> >             ret = 0;
> >             break;
> >
> > which translates to
> >
> >                 switch (l) {
> >                 case 8:
> >                     /* 64 bit write access */
> >                     val = ldq_p(buf);
> >                     error |= io_mem_write(mr, addr1, val, 8);
> >                     break;
> >                 case 4:
> >                     /* 32 bit write access */
> >                     val = ldl_p(buf);
> >                     error |= io_mem_write(mr, addr1, val, 4);
> >                     break;
> >                 case 2:
> >                     /* 16 bit write access */
> >                     val = lduw_p(buf);
> >                     error |= io_mem_write(mr, addr1, val, 2);
> >                     break;
> >                 case 1:
> >                     /* 8 bit write access */
> >                     val = ldub_p(buf);
> >                     error |= io_mem_write(mr, addr1, val, 1);
> >                     break;
> >                 default:
> >                     abort();
> >                 }
> >
> > which calls the ldx_p primitives
> >
> > #if defined(TARGET_WORDS_BIGENDIAN)
> > #define lduw_p(p) lduw_be_p(p)
> > #define ldsw_p(p) ldsw_be_p(p)
> > #define ldl_p(p) ldl_be_p(p)
> > #define ldq_p(p) ldq_be_p(p)
> > #define ldfl_p(p) ldfl_be_p(p)
> > #define ldfq_p(p) ldfq_be_p(p)
> > #define stw_p(p, v) stw_be_p(p, v)
> > #define stl_p(p, v) stl_be_p(p, v)
> > #define stq_p(p, v) stq_be_p(p, v)
> > #define stfl_p(p, v) stfl_be_p(p, v)
> > #define stfq_p(p, v) stfq_be_p(p, v)
> > #else
> > #define lduw_p(p) lduw_le_p(p)
> > #define ldsw_p(p) ldsw_le_p(p)
> > #define ldl_p(p) ldl_le_p(p)
> > #define ldq_p(p) ldq_le_p(p)
> > #define ldfl_p(p) ldfl_le_p(p)
> > #define ldfq_p(p) ldfq_le_p(p)
> > #define stw_p(p, v) stw_le_p(p, v)
> > #define stl_p(p, v) stl_le_p(p, v)
> > #define stq_p(p, v) stq_le_p(p, v)
> > #define stfl_p(p, v) stfl_le_p(p, v)
> > #define stfq_p(p, v) stfq_le_p(p, v)
> > #endif
> >
> > and then passes the result as "originating register access" to the
> > device emulation part of QEMU.
> >
> >
> > Maybe it becomes clearer if you understand the code flow that TCG
> > goes through. With TCG, whenever a write traps into MMIO we go through
> > these functions:
> >
> > void
> > glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
> >                                          DATA_TYPE val, int mmu_idx)
> > {
> >     helper_te_st_name(env, addr, val, mmu_idx, GETRA());
> > }
> >
> > #ifdef TARGET_WORDS_BIGENDIAN
> > # define TGT_BE(X)  (X)
> > # define TGT_LE(X)  BSWAP(X)
> > #else
> > # define TGT_BE(X)  BSWAP(X)
> > # define TGT_LE(X)  (X)
> > #endif
> >
> > void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
> >                        int mmu_idx, uintptr_t retaddr)
> > {
> > [...]
> >     /* Handle an IO access.  */
> >     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> >         hwaddr ioaddr;
> >         if ((addr & (DATA_SIZE - 1)) != 0) {
> >             goto do_unaligned_access;
> >         }
> >         ioaddr = env->iotlb[mmu_idx][index];
> >
> >         /* ??? Note that the io helpers always read data in the target
> >            byte ordering.  We should push the LE/BE request down into
> >            io.  */
> >         val = TGT_LE(val);
> >         glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
> >         return;
> >     }
> >     [...]
> > }
> >
> > static inline void glue(io_write, SUFFIX)(CPUArchState *env,
> >                                           hwaddr physaddr,
> >                                           DATA_TYPE val,
> >                                           target_ulong addr,
> >                                           uintptr_t retaddr)
> > {
> >     MemoryRegion *mr = iotlb_to_region(physaddr);
> >
> >     physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
> >     if (mr != &io_mem_rom && mr != &io_mem_notdirty && !can_do_io(env)) {
> >         cpu_io_recompile(env, retaddr);
> >     }
> >
> >     env->mem_io_vaddr = addr;
> >     env->mem_io_pc = retaddr;
> >     io_mem_write(mr, physaddr, val, 1 << SHIFT);
> > }
> >
> > which at the end of the chain means that if you're running the same
> > endianness on guest and host, you get the original register value as the
> > function parameter. If you run different endianness, you get a swapped
> > value as the function parameter.
> >
> > So at the end of all of this, if you're running qemu-system-arm (TCG)
> > on a BE host, the request will come in as a register value and stay
> > that way until it reaches the IO callback function. Unless you define
> > a specific endianness for your device, in which case the callback may
> > swizzle it again. But if your device defines DEVICE_LITTLE_ENDIAN or
> > DEVICE_NATIVE_ENDIAN, it won't swizzle it.
> >
> > What happens when you switch your guest to BE mode (or LE for PPC)?
> > Very simple. The TCG frontend swizzles every memory read and write
> > before it hits TCG's memory operations.
> >
> > If you're running qemu-system-arm (KVM) on a BE host, the request will
> > come into kvm-all.c, get read with swapped endianness (ldq_p) and then
> > get passed that way into the IO callback function. That's where the
> > bug lies. It should behave the same way as TCG, so it needs to know the
> > value the register originally had. So instead of doing an ldq_p() it
> > should go through a different path that does memcpy().
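
A minimal sketch of the memcpy() path being suggested here; this is not the
actual kvm-all.c code, and the helper name is made up. The bytes from
run->mmio.data are copied verbatim into a local value, with no
ldq_p()-style target-endian interpretation.

#include <stdint.h>
#include <string.h>

static uint64_t mmio_bytes_to_register(const uint8_t *data, unsigned int len)
{
    uint64_t val = 0;

    if (len > sizeof(val)) {
        len = sizeof(val);
    }
    memcpy(&val, data, len);   /* raw register image, no byte swapping */
    return val;
}
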
> >
> > But that doesn't fix the other-endian issue yet, right? Every value now
> > would come in as the register value.
> >
> > Well, unless you do the same thing TCG does inside the kernel. So the
> > kernel would swap the reads and writes before it accesses the ioctl
> > struct that connects kvm with QEMU. Then all abstraction layers work
> > just fine again and we don't need any qemu-system-armeb.
> >
> >
> > Alex
> >
> 



-- 
Gregory Kurz                                     address@hidden
                                                 address@hidden
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)562 165 496

"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.



