Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size

From:	Peter Maydell
Subject:	Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
Date:	Tue, 16 Dec 2014 20:41:07 +0000

On 16 December 2014 at 19:00, Laszlo Ersek <address@hidden> wrote:
> The root of this question is what each of
>
> enum device_endian {
>     DEVICE_NATIVE_ENDIAN,
>     DEVICE_BIG_ENDIAN,
>     DEVICE_LITTLE_ENDIAN,
> };
>
> means.

An opening remark: endianness is a horribly confusing topic and
support of more than one endianness is even worse. I may have made
some inadvertent errors in this reply; if you think so please
let me know and I'll have another stab at it.

That said: the device_endian options indicate what a device's
internal idea of its endianness is. This is most relevant if
a device accepts accesses at wider than byte width
(for instance, if you can read a 32-bit value from
address offset 0 and also an 8-bit value from offset 3 then
how do those values line up? If you read a 32-bit value then
which way round is it compared to what the device's io read/write
function return?).

NATIVE_ENDIAN means "same order as the CPU's main data bus's
natural representation". (Note that this is not necessarily
the same as "the endianness the CPU currently has"; on ARM
you can flip the CPU between LE and BE at runtime, which
is basically inserting a byte-swizzling step between data
accesses and the CPU's data bus, which is always LE for ARMv7+.)

Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
(because the guest vcpu reads it directly with the h/w cpu).

> Consider the following call tree, which implements the splitting /
> combining of an MMIO read:
>
> memory_region_dispatch_read() [memory.c]
>   memory_region_dispatch_read1()
>     access_with_adjusted_size()
>       memory_region_big_endian()
>       for each byte in the wide access:
>         memory_region_read_accessor()
>           fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>             fw_cfg_read()
>   adjust_endianness()
>     memory_region_wrong_endianness()
>     bswapXX()
>
> The function access_with_adjusted_size() always iterates over the MMIO
> address range in incrementing address order. However, it can calculate
> the shift count for memory_region_read_accessor() in two ways.
>
> When memory_region_big_endian() returns "true", the shift count
> decreases as the MMIO address increases.
>
> When memory_region_big_endian() returns "false", the shift count
> increases as the MMIO address increases.

Yep, because this is how the device has told us that it thinks
of accesses as being put together.

The column in your table "host value" is the 16 bit value read from
the device, ie what we have decided (based on device_endian) that
it would have returned us if it had supported a 16 bit read directly
itself. BE devices compose 16 bit values with the high byte first,
LE devices with the low byte first, and native-endian devices
in the same order as guest-endianness.

> In memory_region_read_accessor(), the shift count is used for a logical
> (ie. numeric) bitwise left shift (<<). That operator works with numeric
> values and hides (ie. accounts for) host endianness.
>
> Let's consider
> - an array of two bytes, [0xaa, 0xbb],
> - a uint16_t access made from the guest,
> - and all twelve possible cases.
>
> In the table below, the "host", "guest" and "device_endian" columns are
> input variables. The columns to the right are calculated / derived
> values. The arrows above the columns show the data dependencies.
>
> After memory_region_dispatch_read1() constructs the value that is
> visible in the "host value" column, it is passed to adjust_endianness().
> If memory_region_wrong_endianness() returns "true", then the host value
> is byte-swapped. The result is then passed to the guest.
>
>               
> +---------------------------------------------------------------------------------------------------------------+----------+
>               |                                                               
>                                                 |          |
>             +---- 
> ------+-------------------------------------------------------------------------+
>                            |          |
>             | |         |                                                     
>                     |                           |          |
>       +----------------------------------------------------------+---------+  
>                     |                           |          |
>       |     | |         |                                        |         |  
>                     |                           |          |
>       |   +-----------+-------------------+  +----------------+  |         |  
> +------------------------+-------------------+  |          |
>       |   | | |       | |                 |  |                |  |         |  
> |                   |    |                   |  |          |
>       |   | | |       | |                 v  |                v  |         v  
> |                   v    |                   v  |          v
>  #  host  guest  device_endian  memory_region_big_endian()  host value  host 
> repr.    memory_region_wrong_endianness()  guest repr.   guest value
> --  ----  -----  -------------  --------------------------  ----------  
> ------------  --------------------------------  ------------  -----------
>  1  LE    LE     native         0                           0xbbaa      
> [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>  2  LE    LE     BE             1                           0xaabb      
> [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>  3  LE    LE     LE             0                           0xbbaa      
> [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>
>  4  LE    BE     native         1                           0xaabb      
> [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>  5  LE    BE     BE             1                           0xaabb      
> [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>  6  LE    BE     LE             0                           0xbbaa      
> [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>
>  7  BE    LE     native         0                           0xbbaa      
> [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>  8  BE    LE     BE             1                           0xaabb      
> [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>  9  BE    LE     LE             0                           0xbbaa      
> [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>
> 10  BE    BE     native         1                           0xaabb      
> [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
> 11  BE    BE     BE             1                           0xaabb      
> [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
> 12  BE    BE     LE             0                           0xbbaa      
> [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb

The column you have labelled 'guest repr' here is the returned data
from io_mem_read, whose API contract is "write the data from the
device into this host C uint16_t (or whatever) such that it is the
value returned by the device read as a native host value". It's
not related to the guest order at all.

So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
which passes it a local variable "val". So now "val" has the
"guest repr" column's bytes in it, and (as a host C variable) the
value:
1,2,3 : 0xbbaa
4,5,6 : 0xaabb
7,8,9 : 0xbbaa
10,11,12 : 0xaabb

As you can see, this depends on the "guest endianness" (which is
kind of the endianness of the bus): a BE guest 16 bit access to
this device would return the 16 bit value 0xaabb, and an LE guest
0xbbaa, and we have exactly those values in our host C variable.
If this is TCG, then we'll copy that 16 bit host value into the
CPUState struct field corresponding to the destination guest
register as-is. (TCG CPUState fields are always in host-C-order.)

However, to pass them up to KVM we have to put them into a
buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
in target CPU endianness"], so now buf has the bytes:
1,2,3 : [0xaa, 0xbb]
4,5,6 : [0xaa, 0xbb]
7,8,9 : [0xaa, 0xbb]
10,11,12 : [0xaa, 0xbb]

...which is the same for every case.

This buffer is (for KVM) the run->mmio.data buffer, whose semantics
are "the value as it would appear if the VCPU performed a load or store
of the appropriate width directly to the byte array". Which is what we
want -- your device has two bytes in order 0xaa, 0xbb, and we did
a 16 bit load in the guest CPU, so we should get the same answer as if
we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.

I think the fact that all of these things come out to the same
set of bytes in the mmio.data buffer is the indication that all
QEMU's byte swapping is correct.

> Looking at the two rightmost columns, we must conclude:
>
> (a) The "device_endian" field has absolutely no significance wrt. what
>     the guest sees. In each triplet of cases, when we go from
>     DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>     the guest sees the exact same value.
>
>     I don't understand this result (it makes me doubt device_endian
>     makes any sense). What did I do wrong?

I think it's because you defined your device as only supporting
byte accesses that you didn't see any difference between the
various device_endian settings. If a device presents itself as
just an array of bytes then it doesn't have an endianness, really.

As Paolo says, if you make your device support wider accesses
directly and build up the value yourself then you'll see that
setting the device_endian to LE vs BE vs native does change
the values you see in the guest (and that you'll need to set it
to LE and interpret the words in the guest as LE to get the
right behaviour).

>     This means that this interface is *value preserving*, not
>     representation preserving. In other words: when host and guest
>     endiannesses are identical, the *array* is transferred okay (the
>     guest array matches the initial host array [0xaa, 0xbb]). When guest
>     and host differ in endianness, the guest will see an incorrect
>     *array*.

Think of a device which supports only byte accesses as being
like a lump of RAM. A big-endian guest CPU which reads 32 bits
at a time will get different values in registers to an LE guest
which does the same.

*However*, if both CPUs just do "read 32 bits; write 32 bits to
RAM" (ie a kind of memcpy but with the source being the mmio
register rather than some other bit of RAM) then you should get
a bytewise-correct copy of the data in RAM.

So I *think* that would be a viable approach: have your QEMU
device code as it is now, and just make sure that if the guest
is doing wider-than-byte accesses it does them as
  do {
     load word from mmio register;
     store word to RAM;
  } while (count);

and doesn't try to interpret the byte order of the values while
they're in the CPU register in the middle of the loop.

Paolo's suggestion would work too, if you prefer that.

> I apologize for wasting everyone's time, but I think both results are
> very non-intuitive.

Everything around endianness handling is non-intuitive --
it's the nature of the problem, I'm afraid. (Some of this is
also QEMU's fault for not having good documentation of the
semantics of each of the layers involved in memory accesses.
I have on my todo list the vague idea of trying to write these
all up as a blog post.)

thanks
-- PMM

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg, (continued)
- [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property, Laszlo Ersek, 2014/12/12
  - Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property, Alexander Graf, 2014/12/16
    - Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property, Laszlo Ersek, 2014/12/16
    - Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property, Laszlo Ersek, 2014/12/16
    - Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property, Peter Maydell, 2014/12/16
    - Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property, Alexander Graf, 2014/12/16
    - Re: [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property, Laszlo Ersek, 2014/12/16

Prev by Date: Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
Next by Date: [Qemu-devel] [PATCH] hw/net/xen_nic.c: Free 'netdev->txs' when map 'netdev->rxs' fails
Previous by thread: Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
Next by thread: Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
Index(es):
- Date
- Thread