Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size
From: Laszlo Ersek
Subject: Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
Date: Wed, 17 Dec 2014 09:40:34 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
On 12/17/14 09:28, Alexander Graf wrote:
>
>
>
>> On 17.12.2014 at 08:13, Laszlo Ersek <address@hidden> wrote:
>>
>>> On 12/16/14 21:41, Peter Maydell wrote:
>>>> On 16 December 2014 at 19:00, Laszlo Ersek <address@hidden> wrote:
>>>> The root of this question is what each of
>>>>
>>>>   enum device_endian {
>>>>       DEVICE_NATIVE_ENDIAN,
>>>>       DEVICE_BIG_ENDIAN,
>>>>       DEVICE_LITTLE_ENDIAN,
>>>>   };
>>>>
>>>> means.
>>>
>>> An opening remark: endianness is a horribly confusing topic and
>>> support of more than one endianness is even worse. I may have made
>>> some inadvertent errors in this reply; if you think so please
>>> let me know and I'll have another stab at it.
>>>
>>> That said: the device_endian options indicate what a device's
>>> internal idea of its endianness is. This is most relevant if
>>> a device accepts accesses at wider than byte width
>>> (for instance, if you can read a 32-bit value from
>>> address offset 0 and also an 8-bit value from offset 3 then
>>> how do those values line up? If you read a 32-bit value then
>>> which way round is it compared to what the device's io read/write
>>> function return?).
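>>>
>>> (For a concrete case: if the device's backing bytes at offsets 0..3
>>> are 0x11 0x22 0x33 0x44, a DEVICE_BIG_ENDIAN device returns
>>> 0x11223344 for the 32-bit read at offset 0 and a
>>> DEVICE_LITTLE_ENDIAN device returns 0x44332211, while the 8-bit
>>> read at offset 3 returns 0x44 either way.)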
>>>
>>> NATIVE_ENDIAN means "same order as the CPU's main data bus's
>>> natural representation". (Note that this is not necessarily
>>> the same as "the endianness the CPU currently has"; on ARM
>>> you can flip the CPU between LE and BE at runtime, which
>>> is basically inserting a byte-swizzling step between data
>>> accesses and the CPU's data bus, which is always LE for ARMv7+.)
>>>
>>> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
>>> (because the guest vcpu reads it directly with the h/w cpu).
>>>
>>>> Consider the following call tree, which implements the splitting /
>>>> combining of an MMIO read:
>>>>
>>>>   memory_region_dispatch_read() [memory.c]
>>>>     memory_region_dispatch_read1()
>>>>       access_with_adjusted_size()
>>>>         memory_region_big_endian()
>>>>         for each byte in the wide access:
>>>>           memory_region_read_accessor()
>>>>             fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>>>>               fw_cfg_read()
>>>>     adjust_endianness()
>>>>       memory_region_wrong_endianness()
>>>>       bswapXX()
>>>>
>>>> The function access_with_adjusted_size() always iterates over the MMIO
>>>> address range in incrementing address order. However, it can calculate
>>>> the shift count for memory_region_read_accessor() in two ways.
>>>>
>>>> When memory_region_big_endian() returns "true", the shift count
>>>> decreases as the MMIO address increases.
>>>>
>>>> When memory_region_big_endian() returns "false", the shift count
>>>> increases as the MMIO address increases.
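>>>>
>>>> In code, the two shift rules look roughly like this (a simplified
>>>> sketch of the loop in access_with_adjusted_size(); the exact
>>>> parameter list differs across QEMU versions):
>>>>
>>>>   if (memory_region_big_endian(mr)) {
>>>>       for (i = 0; i < size; i += access_size) {
>>>>           /* highest shift first: shift decreases with the address */
>>>>           access(mr, addr + i, value, access_size,
>>>>                  (size - access_size - i) * 8, access_mask);
>>>>       }
>>>>   } else {
>>>>       for (i = 0; i < size; i += access_size) {
>>>>           /* lowest shift first: shift increases with the address */
>>>>           access(mr, addr + i, value, access_size, i * 8,
>>>>                  access_mask);
>>>>       }
>>>>   }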
>>>
>>> Yep, because this is how the device has told us that it thinks
>>> of accesses as being put together.
>>>
>>> The column in your table "host value" is the 16 bit value read from
>>> the device, ie what we have decided (based on device_endian) that
>>> it would have returned us if it had supported a 16 bit read directly
>>> itself. BE devices compose 16 bit values with the high byte first,
>>> LE devices with the low byte first, and native-endian devices
>>> in the same order as guest-endianness.
>>>
>>>> In memory_region_read_accessor(), the shift count is used for a logical
>>>> (ie. numeric) bitwise left shift (<<). That operator works with numeric
>>>> values and hides (ie. accounts for) host endianness.
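>>>>
>>>> (The accessor body is essentially just -- a sketch, not the
>>>> verbatim source:
>>>>
>>>>   tmp = mr->ops->read(mr->opaque, addr, size);
>>>>   *value |= (tmp & mask) << shift;
>>>>
>>>> i.e. a numeric shift-and-or on host C values, so host byte order
>>>> never enters the picture.)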
>>>>
>>>> Let's consider
>>>> - an array of two bytes, [0xaa, 0xbb],
>>>> - a uint16_t access made from the guest,
>>>> - and all twelve possible cases.
>>>>
>>>> In the table below, the "host", "guest" and "device_endian" columns are
>>>> input variables. The columns to the right are calculated / derived
>>>> values.
>>>>
>>>> After memory_region_dispatch_read1() constructs the value that is
>>>> visible in the "host value" column, it is passed to adjust_endianness().
>>>> If memory_region_wrong_endianness() returns "true", then the host value
>>>> is byte-swapped. The result is then passed to the guest.
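>>>>
>>>> For reference, adjust_endianness() is essentially (sketch):
>>>>
>>>>   static void adjust_endianness(MemoryRegion *mr, uint64_t *data,
>>>>                                 unsigned size)
>>>>   {
>>>>       if (memory_region_wrong_endianness(mr)) {
>>>>           switch (size) {
>>>>           case 2: *data = bswap16(*data); break;
>>>>           case 4: *data = bswap32(*data); break;
>>>>           case 8: *data = bswap64(*data); break;
>>>>           }
>>>>       }
>>>>   }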
>>>>
>>>>
>>>>  #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
>>>> --  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
>>>>  1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                  [0xaa, 0xbb]  0xbbaa
>>>>  2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                  [0xaa, 0xbb]  0xbbaa
>>>>  3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                  [0xaa, 0xbb]  0xbbaa
>>>>
>>>>  4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                  [0xbb, 0xaa]  0xbbaa
>>>>  5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                  [0xbb, 0xaa]  0xbbaa
>>>>  6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                  [0xbb, 0xaa]  0xbbaa
>>>>
>>>>  7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                  [0xbb, 0xaa]  0xaabb
>>>>  8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                  [0xbb, 0xaa]  0xaabb
>>>>  9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                  [0xbb, 0xaa]  0xaabb
>>>>
>>>> 10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                  [0xaa, 0xbb]  0xaabb
>>>> 11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                  [0xaa, 0xbb]  0xaabb
>>>> 12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                  [0xaa, 0xbb]  0xaabb
>>>
>>> The column you have labelled 'guest repr' here is the returned data
>>> from io_mem_read, whose API contract is "write the data from the
>>> device into this host C uint16_t (or whatever) such that it is the
>>> value returned by the device read as a native host value". It's
>>> not related to the guest order at all.
>>>
>>> So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
>>> which passes it a local variable "val". So now "val" has the
>>> "guest repr" column's bytes in it, and (as a host C variable) the
>>> value:
>>> 1,2,3 : 0xbbaa
>>> 4,5,6 : 0xaabb
>>> 7,8,9 : 0xbbaa
>>> 10,11,12 : 0xaabb
>>>
>>> As you can see, this depends on the "guest endianness" (which is
>>> kind of the endianness of the bus): a BE guest 16 bit access to
>>> this device would return the 16 bit value 0xaabb, and an LE guest
>>> 0xbbaa, and we have exactly those values in our host C variable.
>>> If this is TCG, then we'll copy that 16 bit host value into the
>>> CPUState struct field corresponding to the destination guest
>>> register as-is. (TCG CPUState fields are always in host-C-order.)
>>>
>>> However, to pass them up to KVM we have to put them into a
>>> buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
>>> cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
>>> in target CPU endianness"], so now buf has the bytes:
>>> 1,2,3 : [0xaa, 0xbb]
>>> 4,5,6 : [0xaa, 0xbb]
>>> 7,8,9 : [0xaa, 0xbb]
>>> 10,11,12 : [0xaa, 0xbb]
>>>
>>> ...which is the same for every case.
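>>>
>>> (stw_p() itself boils down to a target-endianness dispatch -- a
>>> sketch:
>>>
>>>   #if defined(TARGET_WORDS_BIGENDIAN)
>>>       stw_be_p(ptr, v);   /* BE guest: 0xaabb -> [0xaa, 0xbb] */
>>>   #else
>>>       stw_le_p(ptr, v);   /* LE guest: 0xbbaa -> [0xaa, 0xbb] */
>>>   #endif
>>>
>>> which is how the per-guest-endianness values converge on the same
>>> byte array.)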
>>>
>>> This buffer is (for KVM) the run->mmio.data buffer, whose semantics
>>> are "the value as it would appear if the VCPU performed a load or store
>>> of the appropriate width directly to the byte array". Which is what we
>>> want -- your device has two bytes in order 0xaa, 0xbb, and we did
>>> a 16 bit load in the guest CPU, so we should get the same answer as if
>>> we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
>>> 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.
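>>>
>>> (That buffer is the mmio member of struct kvm_run in <linux/kvm.h>:
>>>
>>>   struct {
>>>       __u64 phys_addr;
>>>       __u8  data[8];
>>>       __u32 len;
>>>       __u8  is_write;
>>>   } mmio;
>>>
>>> where data[] holds the bytes exactly as the VCPU would see them in
>>> guest memory.)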
>>>
>>> I think the fact that all of these things come out to the same
>>> set of bytes in the mmio.data buffer is the indication that all
>>> QEMU's byte swapping is correct.
>>>
>>>> Looking at the two rightmost columns, we must conclude:
>>>>
>>>> (a) The "device_endian" field has absolutely no significance wrt. what
>>>> the guest sees. In each triplet of cases, when we go from
>>>> DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>>>> the guest sees the exact same value.
>>>>
>>>> I don't understand this result (it makes me doubt device_endian
>>>> makes any sense). What did I do wrong?
>>>
>>> I think it's because you defined your device as only supporting
>>> byte accesses that you didn't see any difference between the
>>> various device_endian settings. If a device presents itself as
>>> just an array of bytes then it doesn't have an endianness, really.
>>>
>>> As Paolo says, if you make your device support wider accesses
>>> directly and build up the value yourself then you'll see that
>>> setting the device_endian to LE vs BE vs native does change
>>> the values you see in the guest (and that you'll need to set it
>>> to LE and interpret the words in the guest as LE to get the
>>> right behaviour).
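>>>
>>> Something along these lines, say (an illustrative sketch only, using
>>> the existing fw_cfg_read() byte accessor; LE composition plus
>>> DEVICE_LITTLE_ENDIAN on the region -- the details are up to the
>>> actual patch):
>>>
>>>   static uint64_t fw_cfg_data_mem_read(void *opaque, hwaddr addr,
>>>                                        unsigned size)
>>>   {
>>>       FWCfgState *s = opaque;
>>>       uint64_t value = 0;
>>>       unsigned i;
>>>
>>>       for (i = 0; i < size; i++) {
>>>           /* byte at offset i lands in bits i*8..i*8+7: LE layout */
>>>           value |= (uint64_t)fw_cfg_read(s) << (i * 8);
>>>       }
>>>       return value;
>>>   }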
>>>
>>>> This means that this interface is *value preserving*, not
>>>> representation preserving. In other words: when host and guest
>>>> endiannesses are identical, the *array* is transferred okay (the
>>>> guest array matches the initial host array [0xaa, 0xbb]). When guest
>>>> and host differ in endianness, the guest will see an incorrect
>>>> *array*.
>>>
>>> Think of a device which supports only byte accesses as being
>>> like a lump of RAM. A big-endian guest CPU which reads 32 bits
>>> at a time will get different values in registers to an LE guest
>>> which does the same.
>>>
>>> *However*, if both CPUs just do "read 32 bits; write 32 bits to
>>> RAM" (ie a kind of memcpy but with the source being the mmio
>>> register rather than some other bit of RAM) then you should get
>>> a bytewise-correct copy of the data in RAM.
>>>
>>> So I *think* that would be a viable approach: have your QEMU
>>> device code as it is now, and just make sure that if the guest
>>> is doing wider-than-byte accesses it does them as
>>>   do {
>>>       load word from mmio register;
>>>       store word to RAM;
>>>   } while (count);
>>>
>>> and doesn't try to interpret the byte order of the values while
>>> they're in the CPU register in the middle of the loop.
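>>>
>>> (In guest C code, with a hypothetical FW_CFG_DATA 32-bit register
>>> address and dst/count supplied by the caller, that loop is just:
>>>
>>>   volatile uint32_t *reg = (volatile uint32_t *)FW_CFG_DATA;
>>>
>>>   while (count--) {
>>>       *dst++ = *reg;   /* word in, word out, never reinterpreted */
>>>   }
>>>
>>> with no byte-swapping anywhere in between.)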
>>>
>>> Paolo's suggestion would work too, if you prefer that.
>>>
>>>> I apologize for wasting everyone's time, but I think both results are
>>>> very non-intuitive.
>>>
>>> Everything around endianness handling is non-intuitive --
>>> it's the nature of the problem, I'm afraid. (Some of this is
>>> also QEMU's fault for not having good documentation of the
>>> semantics of each of the layers involved in memory accesses.
>>> I have on my todo list the vague idea of trying to write these
>>> all up as a blog post.)
>>
>> Thanks for taking the time to write this up. My analysis must have
>> missed at least two important things then:
>> - the device's read/write function needs to consider address & size, and
>> return values that match host byte order. fw_cfg doesn't conform ATM.
>> - there's one more layer outside the call tree that I checked that can
>> perform endianness conversion.
>>
>> I'll try to implement Paolo's suggestion (ie. support wide accesses in
>> fw_cfg internally, not relying on splitting/combining by memory.c).
>
> Awesome :). Please define it as device little endian while you go as well.
> That should give us fewer headaches if we want to support wide access on ppc.
Definitely; that was actually the first part of Paolo's suggestion. :)
Thanks!
Laszlo