qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size


From: Alexander Graf
Subject: Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
Date: Wed, 17 Dec 2014 09:28:36 +0100



> Am 17.12.2014 um 08:13 schrieb Laszlo Ersek <address@hidden>:
> 
>> On 12/16/14 21:41, Peter Maydell wrote:
>>> On 16 December 2014 at 19:00, Laszlo Ersek <address@hidden> wrote:
>>> The root of this question is what each of
>>> 
>>> enum device_endian {
>>>    DEVICE_NATIVE_ENDIAN,
>>>    DEVICE_BIG_ENDIAN,
>>>    DEVICE_LITTLE_ENDIAN,
>>> };
>>> 
>>> means.
>> 
>> An opening remark: endianness is a horribly confusing topic and
>> support of more than one endianness is even worse. I may have made
>> some inadvertent errors in this reply; if you think so please
>> let me know and I'll have another stab at it.
>> 
>> That said: the device_endian options indicate what a device's
>> internal idea of its endianness is. This is most relevant if
>> a device accepts accesses at wider than byte width
>> (for instance, if you can read a 32-bit value from
>> address offset 0 and also an 8-bit value from offset 3 then
>> how do those values line up? If you read a 32-bit value then
>> which way round is it compared to what the device's io read/write
>> function return?).
>> 
>> NATIVE_ENDIAN means "same order as the CPU's main data bus's
>> natural representation". (Note that this is not necessarily
>> the same as "the endianness the CPU currently has"; on ARM
>> you can flip the CPU between LE and BE at runtime, which
>> is basically inserting a byte-swizzling step between data
>> accesses and the CPU's data bus, which is always LE for ARMv7+.)
>> 
>> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
>> (because the guest vcpu reads it directly with the h/w cpu).
>> 
>>> Consider the following call tree, which implements the splitting /
>>> combining of an MMIO read:
>>> 
>>> memory_region_dispatch_read() [memory.c]
>>>  memory_region_dispatch_read1()
>>>    access_with_adjusted_size()
>>>      memory_region_big_endian()
>>>      for each byte in the wide access:
>>>        memory_region_read_accessor()
>>>          fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>>>            fw_cfg_read()
>>>  adjust_endianness()
>>>    memory_region_wrong_endianness()
>>>    bswapXX()
>>> 
>>> The function access_with_adjusted_size() always iterates over the MMIO
>>> address range in incrementing address order. However, it can calculate
>>> the shift count for memory_region_read_accessor() in two ways.
>>> 
>>> When memory_region_big_endian() returns "true", the shift count
>>> decreases as the MMIO address increases.
>>> 
>>> When memory_region_big_endian() returns "false", the shift count
>>> increases as the MMIO address increases.
>> 
>> Yep, because this is how the device has told us that it thinks
>> of accesses as being put together.
>> 
>> The column in your table "host value" is the 16 bit value read from
>> the device, ie what we have decided (based on device_endian) that
>> it would have returned us if it had supported a 16 bit read directly
>> itself. BE devices compose 16 bit values with the high byte first,
>> LE devices with the low byte first, and native-endian devices
>> in the same order as guest-endianness.
>> 
>>> In memory_region_read_accessor(), the shift count is used for a logical
>>> (ie. numeric) bitwise left shift (<<). That operator works with numeric
>>> values and hides (ie. accounts for) host endianness.
>>> 
>>> Let's consider
>>> - an array of two bytes, [0xaa, 0xbb],
>>> - a uint16_t access made from the guest,
>>> - and all twelve possible cases.
>>> 
>>> In the table below, the "host", "guest" and "device_endian" columns are
>>> input variables. The columns to the right are calculated / derived
>>> values. The arrows above the columns show the data dependencies.
>>> 
>>> After memory_region_dispatch_read1() constructs the value that is
>>> visible in the "host value" column, it is passed to adjust_endianness().
>>> If memory_region_wrong_endianness() returns "true", then the host value
>>> is byte-swapped. The result is then passed to the guest.
>>> 
>>>              
>>> +---------------------------------------------------------------------------------------------------------------+----------+
>>>              |                                                              
>>>                                                  |          |
>>>            +---- 
>>> ------+-------------------------------------------------------------------------+
>>>                            |          |
>>>            | |         |                                                    
>>>                      |                           |          |
>>>      +----------------------------------------------------------+---------+ 
>>>                      |                           |          |
>>>      |     | |         |                                        |         | 
>>>                      |                           |          |
>>>      |   +-----------+-------------------+  +----------------+  |         | 
>>>  +------------------------+-------------------+  |          |
>>>      |   | | |       | |                 |  |                |  |         | 
>>>  |                   |    |                   |  |          |
>>>      |   | | |       | |                 v  |                v  |         v 
>>>  |                   v    |                   v  |          v
>>> #  host  guest  device_endian  memory_region_big_endian()  host value  host 
>>> repr.    memory_region_wrong_endianness()  guest repr.   guest value
>>> --  ----  -----  -------------  --------------------------  ----------  
>>> ------------  --------------------------------  ------------  -----------
>>> 1  LE    LE     native         0                           0xbbaa      
>>> [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>> 2  LE    LE     BE             1                           0xaabb      
>>> [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>>> 3  LE    LE     LE             0                           0xbbaa      
>>> [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>> 
>>> 4  LE    BE     native         1                           0xaabb      
>>> [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>> 5  LE    BE     BE             1                           0xaabb      
>>> [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>> 6  LE    BE     LE             0                           0xbbaa      
>>> [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>>> 
>>> 7  BE    LE     native         0                           0xbbaa      
>>> [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>> 8  BE    LE     BE             1                           0xaabb      
>>> [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>>> 9  BE    LE     LE             0                           0xbbaa      
>>> [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>> 
>>> 10  BE    BE     native         1                           0xaabb      
>>> [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>> 11  BE    BE     BE             1                           0xaabb      
>>> [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>>> 12  BE    BE     LE             0                           0xbbaa      
>>> [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb
>> 
>> The column you have labelled 'guest repr' here is the returned data
>> from io_mem_read, whose API contract is "write the data from the
>> device into this host C uint16_t (or whatever) such that it is the
>> value returned by the device read as a native host value". It's
>> not related to the guest order at all.
>> 
>> So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
>> which passes it a local variable "val". So now "val" has the
>> "guest repr" column's bytes in it, and (as a host C variable) the
>> value:
>> 1,2,3 : 0xbbaa
>> 4,5,6 : 0xaabb
>> 7,8,9 : 0xbbaa
>> 10,11,12 : 0xaabb
>> 
>> As you can see, this depends on the "guest endianness" (which is
>> kind of the endianness of the bus): a BE guest 16 bit access to
>> this device would return the 16 bit value 0xaabb, and an LE guest
>> 0xbbaa, and we have exactly those values in our host C variable.
>> If this is TCG, then we'll copy that 16 bit host value into the
>> CPUState struct field corresponding to the destination guest
>> register as-is. (TCG CPUState fields are always in host-C-order.)
>> 
>> However, to pass them up to KVM we have to put them into a
>> buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
>> cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
>> in target CPU endianness"], so now buf has the bytes:
>> 1,2,3 : [0xaa, 0xbb]
>> 4,5,6 : [0xaa, 0xbb]
>> 7,8,9 : [0xaa, 0xbb]
>> 10,11,12 : [0xaa, 0xbb]
>> 
>> ...which is the same for every case.
>> 
>> This buffer is (for KVM) the run->mmio.data buffer, whose semantics
>> are "the value as it would appear if the VCPU performed a load or store
>> of the appropriate width directly to the byte array". Which is what we
>> want -- your device has two bytes in order 0xaa, 0xbb, and we did
>> a 16 bit load in the guest CPU, so we should get the same answer as if
>> we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
>> 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.
>> 
>> I think the fact that all of these things come out to the same
>> set of bytes in the mmio.data buffer is the indication that all
>> QEMU's byte swapping is correct.
>> 
>>> Looking at the two rightmost columns, we must conclude:
>>> 
>>> (a) The "device_endian" field has absolutely no significance wrt. what
>>>    the guest sees. In each triplet of cases, when we go from
>>>    DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>>>    the guest sees the exact same value.
>>> 
>>>    I don't understand this result (it makes me doubt device_endian
>>>    makes any sense). What did I do wrong?
>> 
>> I think it's because you defined your device as only supporting
>> byte accesses that you didn't see any difference between the
>> various device_endian settings. If a device presents itself as
>> just an array of bytes then it doesn't have an endianness, really.
>> 
>> As Paolo says, if you make your device support wider accesses
>> directly and build up the value yourself then you'll see that
>> setting the device_endian to LE vs BE vs native does change
>> the values you see in the guest (and that you'll need to set it
>> to LE and interpret the words in the guest as LE to get the
>> right behaviour).
>> 
>>>    This means that this interface is *value preserving*, not
>>>    representation preserving. In other words: when host and guest
>>>    endiannesses are identical, the *array* is transferred okay (the
>>>    guest array matches the initial host array [0xaa, 0xbb]). When guest
>>>    and host differ in endianness, the guest will see an incorrect
>>>    *array*.
>> 
>> Think of a device which supports only byte accesses as being
>> like a lump of RAM. A big-endian guest CPU which reads 32 bits
>> at a time will get different values in registers to an LE guest
>> which does the same.
>> 
>> *However*, if both CPUs just do "read 32 bits; write 32 bits to
>> RAM" (ie a kind of memcpy but with the source being the mmio
>> register rather than some other bit of RAM) then you should get
>> a bytewise-correct copy of the data in RAM.
>> 
>> So I *think* that would be a viable approach: have your QEMU
>> device code as it is now, and just make sure that if the guest
>> is doing wider-than-byte accesses it does them as
>>  do {
>>     load word from mmio register;
>>     store word to RAM;
>>  } while (count);
>> 
>> and doesn't try to interpret the byte order of the values while
>> they're in the CPU register in the middle of the loop.
>> 
>> Paolo's suggestion would work too, if you prefer that.
>> 
>>> I apologize for wasting everyone's time, but I think both results are
>>> very non-intuitive.
>> 
>> Everything around endianness handling is non-intuitive --
>> it's the nature of the problem, I'm afraid. (Some of this is
>> also QEMU's fault for not having good documentation of the
>> semantics of each of the layers involved in memory accesses.
>> I have on my todo list the vague idea of trying to write these
>> all up as a blog post.)
> 
> Thanks for taking the time to write this up. My analysis must have
> missed at least two important things then:
> - the device's read/write function needs to consider address & size, and
>  return values that match host byte order. fw_cfg doesn't conform ATM.
> - there's one more layer outside the call tree that I checked that can
>  perform endianness conversion.
> 
> I'll try to implement Paolo's suggestion (ie. support wide accesses in
> fw_cfg internally, not relying on splitting/combining by memory.c).

Awesome :). Please define it as device little endian while you go as well. That 
should give us fewer headaches if we want to support wide access on ppc.


Alex




reply via email to

[Prev in Thread] Current Thread [Next in Thread]