qemu-devel

Re: [Qemu-devel] device assignment for embedded Power


From: Benjamin Herrenschmidt
Subject: Re: [Qemu-devel] device assignment for embedded Power
Date: Fri, 01 Jul 2011 10:58:14 +1000

On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
> One feature we need for QEMU/KVM on embedded Power Architecture is the 
> ability to do passthru assignment of SoC I/O devices and memory.  An 
> important use case in embedded is creating static partitions-- 
> taking physical memory and I/O devices (non-PCI) and partitioning
> them between the host Linux and several virtual machines.   Things like
> live migration would not be needed or supported in these types of scenarios.
> 
> SoC devices do not sit on a probeable bus and there are no identifiers 
> like 01:00.0 with PCI that we can use to identify devices--  the host
> Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
> device tree structure passed at boot.   QEMU needs to generate a
> device tree to pass to the guest as well with all the guest's virtual
> and physical resources.  Today a number of mostly complete guest device
> trees are kept under ./pc-bios in QEMU, but this is too static and
> inflexible.
> 
> Some new mechanism is needed to assign SoC devices to guests, and we
> (FSL + Alex Graf) have been discussing a few possible approaches
> for doing this from QEMU and would like some feedback.
> 
> Some possibilities:
> 
> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>    by device tree path
>
>      -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/address@hidden
> 
>    /soc/address@hidden is the device tree path to the assigned device.
>    The device node 'address@hidden' has some number of properties (e.g. 
>    address, interrupt info) and possibly subnodes under
>    it.   QEMU copies that node when generating the guest dev tree.
>    See snippet of entire node:  http://paste2.org/p/1496460

Yuck (see below)

> 2. Option 2.  Pass the entire assigned device node as a string to
>    QEMU
> 
>      -device assigned-soc-dev,dev=/address@hidden,dev-node='#address-cells = <1>;
>       #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>       reg = <0xffe03000 0x100>; interrupts = <43 2>;
>       interrupt-parent = <&mpic>; dfsrr;'

Beuark ! (see below)

>    This avoids needing to pass the host device tree, but could 
>    get awkward-- the i2c example above is very simple, some device
>    nodes are very large with a complex hierarchy of subnodes and 
>    could be hundreds of lines of text to represent a single
>    node.
> 
> It gets more complicated...


So, from a qemu command line perspective, all you should have to do is
pass qemu the device-tree -path- to the device you want to pass through
(you may support passing a full hierarchy here).

That is for normal MMIO mapped SoC devices. Something else (individual
i2c, usb, ...) will use specific virtualization of the corresponding
busses.

Anything else sucks too much really.

From there, well, there are several approaches inside qemu/kvm to handle
that path. If you want to do things at the qemu level you can probably
parse /proc/device-tree. But I'd personally just make it a kernel thing.

I.e. I would have an ioctl to "instantiate" a pass-through device, that
takes that path as an argument. I would make it return an anonymous fd
which you can then use to mmap the resources, etc...
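
For the qemu-level alternative mentioned above (parsing
/proc/device-tree), a minimal sketch of turning a node's "reg" property
into (address, size) pairs could look like the following. It is an
illustration only, not existing qemu code: the function names
decode_reg, read_cell and node_regions are hypothetical, and the
addresses it returns are relative to the parent bus, so a real
implementation would still have to translate them through each
ancestor's "ranges" property before anything could be mmap'ed.

```python
import struct
from pathlib import Path


def decode_reg(raw: bytes, address_cells: int, size_cells: int):
    """Split a raw 'reg' property into (address, size) pairs.

    Device tree cells are 32-bit big-endian integers; an address or
    size may span several cells.
    """
    if len(raw) % 4:
        raise ValueError("reg length is not a multiple of 4 bytes")
    cell_count = len(raw) // 4
    cells = struct.unpack(f">{cell_count}I", raw)
    stride = address_cells + size_cells
    if stride == 0 or cell_count % stride:
        raise ValueError("reg does not match #address-cells/#size-cells")

    def join(group):
        # Combine consecutive 32-bit cells into one integer.
        value = 0
        for cell in group:
            value = (value << 32) | cell
        return value

    return [
        (join(cells[i:i + address_cells]),
         join(cells[i + address_cells:i + stride]))
        for i in range(0, cell_count, stride)
    ]


def read_cell(path: Path, default: int) -> int:
    """Read a single-cell property, falling back to a default."""
    try:
        return struct.unpack(">I", path.read_bytes())[0]
    except FileNotFoundError:
        return default


def node_regions(dt_path: str, root: str = "/proc/device-tree"):
    """Return bus-relative (address, size) pairs for one node.

    The cell counts come from the parent node; 2 and 1 are the
    defaults the device tree specification gives when they are absent.
    """
    node = Path(root) / dt_path.lstrip("/")
    address_cells = read_cell(node.parent / "#address-cells", 2)
    size_cells = read_cell(node.parent / "#size-cells", 1)
    return decode_reg((node / "reg").read_bytes(), address_cells, size_cells)
```

Feeding decode_reg the i2c example quoted earlier, reg = <0xffe03000
0x100> with one address cell and one size cell, yields a single region
at 0xffe03000 of size 0x100.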

> In some cases, modifications to device tree nodes may be needed.
> An example-- sometimes a device tree property references another node 
> and that relationship may not exist when assigned to a guest.
> A "phy-handle" property may need to be deleted and a "fixed-link"
> property added to a node representing a network device.

That's fishy. Why wouldn't you give full access to the MDIO? Is it
shared? Such things are so device-specific that they would have to be
handled by device-specific quirks, which can live either in qemu or in
the kernel.
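
A rough sketch of what such a device-specific quirk could look like if
it lived on the qemu side: a table keyed by "compatible" string, applied
to a copy of the host node when the guest tree is generated. Everything
here is an assumption made for illustration: the dict-based node model,
the names QUIRKS, quirk_fixed_link and apply_quirks, and the
"example,ethernet" compatible string are all hypothetical. The
fixed-link value is the one from the quoted example.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory node model: property name -> value.
Node = Dict[str, object]


def quirk_fixed_link(node: Node) -> None:
    """Replace the PHY reference with a fixed link description."""
    node.pop("phy-handle", None)
    node["fixed-link"] = [2, 1, 1000, 0, 0]


# Quirks keyed by compatible string; "example,ethernet" is made up.
QUIRKS: Dict[str, List[Callable[[Node], None]]] = {
    "example,ethernet": [quirk_fixed_link],
}


def apply_quirks(host_node: Node) -> Node:
    """Return a guest copy of the node with matching quirks applied.

    The host node itself is left untouched.
    """
    guest_node = dict(host_node)
    for compatible in guest_node.get("compatible", []):
        for quirk in QUIRKS.get(compatible, []):
            quirk(guest_node)
    return guest_node
```

The point of the design is that the user only names the device; the
knowledge about which properties must change stays with the code that
understands that device.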

> So in addition to assigning a device, a mechanism is needed to update 
> device tree nodes.  So for the above example, maybe--
> 
>  -device assigned-soc-dev,dev=/soc/address@hidden,delete-prop=phy-handle,
>   node-update="fixed-link = <2 1 1000 0 0>"

That's just so gross and error prone, borderline insane.

> The types of modifications needed--  deleting nodes, deleting properties, 
> adding nodes, adding properties, adding properties that reference other
> nodes, changing properties. This device tree transformation mechanism
> needed is general enough that it could apply to any device tree based
> embedded platform (e.g. ARM, MIPS)
>
> Another complexity relates to the IOMMU.  Here things get very company 
> and IOMMU specific. Freescale has a proprietary IOMMU.

Look at the work currently being done for a generic qemu iommu layer. We
need it for server power as well and from what I last saw coming from
Eduardo and David, it's not PCI specific.

> Devices have 1 or more logical I/O device numbers used to index into 
> the IOMMU table. The IOMMU is limited in that it is designed to only 
> support large, physically contiguous mappings per device.  It does not 
> support any kind of page table.  The IOMMU hardware architecture 
> assumes DMAs are typically targeted to just a few address regions.  
> So, a common IOMMU setup for a device would be a device with a single 
> IOMMU mapping covering the guest's main memory segment.  However, 
> there are many much more complicated IOMMU setups that are common as 
> well, such as doing "operation translations" where a device's write 
> transaction is translated to "stash" directly into CPU caches.  We 
> can't assume that all memory slots belonging to the guest are targets 
> of DMA.
> 
> So for Freescale we would need some very Freescale-specific 
> configuration mechanism to set up the IOMMU.  Here I think we would 
> need the new qcfg approach to expressing nested
> structures (http://wiki.qemu.org/Features/QCFG).   Device
> assignment with IOMMU set up might look like the examples
> below:

Cheers,
Ben.

> # device with multiple logical i/o device numbers
> 
> -device assigned-soc-dev,dev=/qman-portals/address@hidden,
> vcpu=1,fsl,iommu.stash-mem={
> dma-window.guest-addr=0x0,
> dma-window.size=0x100000000,
> liodn-index=1,
> operation-mapping=0,
> stash-dest=1},
> fsl,iommu.stash-dqrr={
> dma-window.guest-addr=0xff4200000,
> dma-window.size=0x4000,
> liodn-index=0,
> operation-mapping=0,
> stash-dest=1}
> 
> # assign pci-bus to a guest with multiple memory regions
> #    addr       size
> #    0x0         512MB
> #    0x20000000  4KB  (for MSIs)
> #    0x40000000  16MB (shared memory)
> #    0xc0000000  64MB (shared memory)
> 
> -device assigned-soc-dev,dev=/address@hidden,
> fsl,iommu={dma-window.guest-addr=0x0,
> dma-window.size=0x100000000,
> dma-window.subwindow-count=8,
> dma-window.sub-window.0.guest-addr=0x0,
> dma-window.sub-window.0.size=0x20000000,
> dma-window.sub-window.1.guest-addr=0x20000000,
> dma-window.sub-window.1.size=0x4000,
> dma-window.sub-window.1.pci-msi-subwindow,
> dma-window.sub-window.2.guest-addr=0x40000000,
> dma-window.sub-window.2.size=0x01000000,
> dma-window.sub-window.3.guest-addr=0xc0000000,
> dma-window.sub-window.3.size=0x04000000}
> 
> The above are from some real examples based on the SoC device 
> assignment mechanisms in the Freescale Embedded Hypervisor.
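
Whatever syntax ends up carrying a configuration like the pci-bus
example above, it would presumably want a sanity check before being
programmed into the IOMMU. The sketch below is one hedged way to do
that, not Freescale's or qemu's actual logic: it only verifies that
each sub-window lies inside the parent DMA window and that no two
sub-windows overlap. The name check_subwindows and the
(guest_addr, size) tuple layout are assumptions made for this example.

```python
from typing import List, Tuple

# (guest_addr, size) in bytes; this layout is an assumption.
Window = Tuple[int, int]


def check_subwindows(parent: Window, subs: List[Window]) -> List[str]:
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    parent_start, parent_size = parent
    parent_end = parent_start + parent_size
    ordered = sorted(subs)

    for start, size in ordered:
        if size <= 0:
            problems.append(f"sub-window at {start:#x} has no size")
        if start < parent_start or start + size > parent_end:
            problems.append(f"sub-window at {start:#x} leaves the DMA window")

    # After sorting by start address, only neighbours can overlap.
    for (a_start, a_size), (b_start, _) in zip(ordered, ordered[1:]):
        if a_start + a_size > b_start:
            problems.append(
                f"sub-windows at {a_start:#x} and {b_start:#x} overlap")

    return problems


# Values taken from the pci-bus example above.
example_parent = (0x0, 0x100000000)
example_subs = [
    (0x0, 0x20000000),
    (0x20000000, 0x4000),
    (0x40000000, 0x01000000),
    (0xC0000000, 0x04000000),
]
```

Calling check_subwindows(example_parent, example_subs) on the values
from the example reports no problems.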
> 
> A final thing...
> 
> Both options 1 and 2 above introduce an implementation complexity--
> both need to be able to parse text device tree syntax format: option 2
> because the entire node is passed as text, and both options in order to
> do complex node updates.  QEMU would need to do syntactic and semantic
> parsing of DTS syntax, basically needing parts of the front end of
> dtc (the device tree compiler-- http://git.jdl.com/gitweb/).
> 
> Option 3.  So a 3rd approach could be an extension of options 1
> or 2.  Instead of expressing nodes in ascii DTS format requiring
> parsing, pass a compiled file in device tree binary format to QEMU
> that expresses the Qdev properties.
> 
> So instead of:
>  -device assigned-soc-dev,dev=/soc/address@hidden,delete-prop=phy-handle,
>   node-update="fixed-link = <2 1 1000 0 0>"
> 
> You might have a config file containing:
> 
> ethernet0 {
>    compatible = "device";
>    type = "assigned-soc-dev";
>    dev = "/soc/address@hidden";
>    node-update {
>       delete-prop="phy-handle";
>       fixed-link = <2 1 1000 0 0>;
>    }; 
> };
> 
> You would compile the file into a DTB and then pass it to QEMU:
> 
>    -config-dtb ./myguest.dtb
> 
> The above is a very simple example-- the benefit of this approach is
> in the much more complicated node updates that are sometimes needed.
> 
> The config-dtb is just an alternate way of getting complex
> device tree data into QEMU.  It supplements and does not change
> existing QEMU architecture.
> 
> Some pluses of this approach:
>    -avoids pulling in substantial complexity for parsing DTS
>     syntax
>    -device tree nodes are represented in their "native" DTB
>     format
>    -an available user space library (libfdt) is already part
>     of QEMU for parsing DTBs
>    -greatly simplifies handling node updates where nodes reference other
>     nodes
>    -could use either option 1 (assign node by reference) or option 2
>     (assign node by
>    -we've used an approach similar to this in the Freescale Embedded
>     Hypervisor for 3+ years now and it's held up well
> 
> 
> Regards,
> Stuart Yoder




