From: Neo Jia
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Tue, 26 Jan 2016 14:28:30 -0800
User-agent: Mutt/1.5.24 (2015-08-30)

On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:
> > On Mon, Jan 25, 2016 at 09:45:14PM +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:address@hidden
>
> > Hi Alex, Kevin and Jike,
>
> > (Seems I shouldn't use attachments; resending to the list with the patches
> > inline at the end.)
>
> > Thanks for adding me to this technical discussion, a great opportunity
> > for us to design together and bring both the Intel and NVIDIA vGPU
> > solutions to the KVM platform.
>
> > Instead of directly jumping to the proposal that we have been working on
> > recently for NVIDIA vGPU on KVM, I think it is better for me to put out a
> > couple of quick comments/thoughts on the existing discussion in this
> > thread, as fundamentally I think we are solving the same problems: DMA,
> > interrupts and MMIO.
>
> > Then we can look at what we have; hopefully we can reach some consensus
> > soon.
>
> > > Yes, and since you're creating and destroying the vgpu here, this is
> > > where I'd expect a struct device to be created and added to an IOMMU
> > > group.  The lifecycle management should really include links between
> > > the vGPU and physical GPU, which would be much, much easier to do with
> > > struct devices create here rather than at the point where we start
> > > doing vfio "stuff".
>
> > In fact, to keep vfio-vgpu more generic, vgpu device creation and
> > management can be centralized and done in vfio-vgpu. That also includes
> > adding the device to the IOMMU group and the VFIO group.
> 
> Is this really a good idea?  The concept of a vgpu is not unique to
> vfio, we want vfio to be a driver for a vgpu, not an integral part of
> the lifecycle of a vgpu.  That certainly doesn't exclude adding
> infrastructure to make lifecycle management of a vgpu more consistent
> between drivers, but it should be done independently of vfio.  I'll go
> back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
> does not create the VF, that's done in coordination with the PF making
> use of some PCI infrastructure for consistency between drivers.
> 
> It seems like we need to take more advantage of the class and driver
> core support to perhaps setup a vgpu bus and class with vfio-vgpu just
> being a driver for those devices.
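
Just to make sure I am reading the bus/class suggestion correctly, you mean
something like the minimal skeleton below inside a vgpu core module, with
vfio-vgpu then being just one driver that binds to devices on that bus? All
names here are placeholders of mine, not from our current patches:

/*
 * Sketch only: a "vgpu" bus and class, so that vfio-vgpu can be just
 * another driver bound to vgpu devices, similar to vfio-pci and SR-IOV VFs.
 */
#include <linux/device.h>
#include <linux/module.h>

static int vgpu_bus_match(struct device *dev, struct device_driver *drv)
{
        /* for the sketch, let any vgpu driver bind to any vgpu device */
        return 1;
}

static struct bus_type vgpu_bus_type = {
        .name  = "vgpu",
        .match = vgpu_bus_match,
};

static struct class vgpu_class = {
        .name  = "vgpu",
        .owner = THIS_MODULE,
};

static int __init vgpu_core_init(void)
{
        int ret = bus_register(&vgpu_bus_type);

        if (ret)
                return ret;

        ret = class_register(&vgpu_class);
        if (ret)
                bus_unregister(&vgpu_bus_type);
        return ret;
}

static void __exit vgpu_core_exit(void)
{
        class_unregister(&vgpu_class);
        bus_unregister(&vgpu_bus_type);
}

module_init(vgpu_core_init);
module_exit(vgpu_core_exit);
MODULE_LICENSE("GPL");

The vendor drivers would then create devices on that bus from their create
path, and vfio-vgpu would simply be a driver for them.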
> 
> > The graphics driver can register with vfio-vgpu to get management and
> > emulation callbacks.
>
> > We already have struct vgpu_device in our proposal, which keeps a pointer
> > to the physical device.
>
> > > - vfio_pci will inject an IRQ to guest only when physical IRQ
> > > generated; whereas vfio_vgpu may inject an IRQ for emulation
> > > purpose. Anyway they can share the same injection interface;
>
> > The eventfd used to inject the interrupt is known to vfio-vgpu; that fd
> > should be made available to the graphics driver so that it can inject
> > interrupts directly when the physical device triggers an interrupt.
>
> > Here is the proposal we have, please review.
>
> > Please note the patches we have put out here are mainly for POC purposes,
> > to verify our understanding, reduce confusion and speed up our design,
> > although we are very happy to refine them into something that can
> > eventually be used by both parties and upstreamed.
>
> > Linux vGPU kernel design
> > ==================================================================================
>
> > Here we are proposing a generic Linux kernel module based on the VFIO
> > framework which allows different GPU vendors to plug in and provide their
> > GPU virtualization solution on KVM. The benefits of having such a generic
> > kernel module are:
>
> > 1) Reuse QEMU VFIO driver, supporting VFIO UAPI
>
> > 2) GPU HW agnostic management API for upper layer software such as libvirt
>
> > 3) No duplicated VFIO kernel logic reimplemented by different GPU driver
> > vendors
>
> > 0. High level overview
> > ==================================================================================
>
> >
> >   user space:
> >                                 +-----------+  VFIO IOMMU IOCTLs
> >                       +---------| QEMU VFIO |-------------------------+
> >         VFIO IOCTLs   |         +-----------+                         |
> >                       |                                               |
> >  ---------------------|-----------------------------------------------|---------
> >                       |                                               |
> >   kernel space:       |  +--->----------->---+  (callback)            V
> >                       |  |                   v                 +------V-----+
> >   +----------+   +----V--^--+          +--+--+-----+           | VGPU       |
> >   |          |   |          |     +----| nvidia.ko +----->-----> TYPE1 IOMMU|
> >   | VFIO Bus <===| VGPU.ko  |<----|    +-----------+     |     +---++-------+
> >   |          |   |          |     | (register)           ^         ||
> >   +----------+   +-------+--+     |    +-----------+     |         ||
> >                          V        +----| i915.ko   +-----+     +---VV-------+
> >                          |             +-----^-----+           | TYPE1      |
> >                          |  (callback)       |                 | IOMMU      |
> >                          +-->------------>---+                 +------------+
> >  access flow:
>
> >   Guest MMIO / PCI config access
> >   |
> >   -------------------------------------------------
> >   |
> >   +-----> KVM VM_EXITs  (kernel)
> >           |
> >   -------------------------------------------------
> >           |
> >           +-----> QEMU VFIO driver (user)
> >                   | 
> >   -------------------------------------------------
> >                   |
> >                   +---->  VGPU kernel driver (kernel)
> >                           |  
> >                           | 
> >                           +----> vendor driver callback
>
>
> > 1. VGPU management interface
> > ==================================================================================
>
> > This is the interface that allows upper-layer software (mostly libvirt) to
> > query and configure virtual GPU devices in a HW-agnostic fashion. This
> > management interface also gives the underlying GPU vendor the flexibility
> > to support virtual device hotplug, multiple virtual devices per VM,
> > multiple virtual devices from different physical devices, etc.
>
> > 1.1 Under per-physical device sysfs:
> > ----------------------------------------------------------------------------------
>
> > vgpu_supported_types - RO, lists the currently supported virtual GPU types
> > and their VGPU_IDs. VGPU_ID - a vGPU type identifier returned from reads of
> > "vgpu_supported_types".
> >
> > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, creates a virtual GPU
> > device on a target physical GPU. idx: virtual device index inside a VM.
>
> > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroys a virtual GPU
> > device on a target physical GPU.
> 
> 
> I've noted in previous discussions that we need to separate user policy
> from kernel policy here, the kernel policy should not require a "VM
> UUID".  A UUID simply represents a set of one or more devices and an
> index picks the device within the set.  Whether that UUID matches a VM
> or is independently used is up to the user policy when creating the
> device.
> 
> Personally I'd also prefer to get rid of the concept of indexes within a
> UUID set of devices and instead have each device be independent.  This
> seems to be an imposition of the nvidia implementation onto the kernel
> interface design.
> 

Hi Alex,

I agree with you that we should not put the UUID concept into a kernel API. At
this point (without any prototyping), I am thinking of using a list of virtual
devices instead of a UUID.

> 
> > 1.3 Under vgpu class sysfs:
> > ----------------------------------------------------------------------------------
>
> > vgpu_start - WO, input syntax <VM_UUID>, this will trigger the registration
> > interface to notify the GPU vendor driver to commit virtual GPU resources
> > for this target VM.
>
> > Also, vgpu_start is a synchronous call; a successful return indicates that
> > all the requested vGPU resources have been fully committed and the VMM may
> > continue.
>
> > vgpu_shutdown - WO, input syntax <VM_UUID>, this will trigger the
> > registration interface to notify the GPU vendor driver to release the
> > virtual GPU resources of this target VM.
>
> > 1.4 Virtual device Hotplug
> > ----------------------------------------------------------------------------------
>
> > To support virtual device hotplug, <vgpu_create> and <vgpu_destroy> can be
> > accessed during VM runtime, and the corresponding registration callback
> > will be invoked to allow the GPU vendor to support hotplug.
>
> > To support hotplug, the vendor driver takes the necessary action to handle
> > the situation where vgpu_create is done on a VM_UUID after vgpu_start; that
> > implies both create and start for that vgpu device.
>
> > Similarly, vgpu_destroy implies a vgpu_shutdown on a running VM, but only
> > if the vendor driver supports vgpu hotplug.
>
> > If hotplug is not supported and the VM is still running, the vendor driver
> > can return an error code to indicate that it is not supported.
>
> > Separating create from start gives the flexibility to have:
>
> > - multiple vgpu instances for a single VM, and
> > - the hotplug feature.
>
> > 2. GPU driver vendor registration interface
> > ==================================================================================
>
> > 2.1 Registration interface definition (include/linux/vgpu.h)
> > ----------------------------------------------------------------------------------
>
> > extern int vgpu_register_device(struct pci_dev *dev, 
> >                                 const struct gpu_device_ops *ops);
>
> > extern void vgpu_unregister_device(struct pci_dev *dev);
>
> > /**
> >  * struct gpu_device_ops - Structure to be registered for each physical
> >  * GPU to register the device to the vgpu module.
> >  *
> >  * @owner:                      The module owner.
> >  * @vgpu_supported_config:      Called to get information about supported
> >  *                              vgpu types.
> >  *                              @dev: pci device structure of physical GPU.
> >  *                              @config: should return a string listing the
> >  *                              supported config.
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_create:                Called to allocate basic resources in the
> >  *                              graphics driver for a particular vgpu.
> >  *                              @dev: physical pci device structure on
> >  *                              which the vgpu should be created.
> >  *                              @vm_uuid: uuid of the VM for which the vgpu
> >  *                              is intended.
> >  *                              @instance: vgpu instance in that VM.
> >  *                              @vgpu_id: the type of vgpu to be created.
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_destroy:               Called to free resources in the graphics
> >  *                              driver for a vgpu instance of that VM.
> >  *                              @dev: physical pci device structure to
> >  *                              which this vgpu points.
> >  *                              @vm_uuid: uuid of the VM to which the vgpu
> >  *                              belongs.
> >  *                              @instance: vgpu instance in that VM.
> >  *                              Returns integer: success (0) or error (< 0)
> >  *                              If the VM is running and vgpu_destroy is
> >  *                              called, the vGPU is being hot-unplugged.
> >  *                              Return an error if the VM is running and
> >  *                              the graphics driver doesn't support vgpu
> >  *                              hotplug.
> >  * @vgpu_start:                 Called to initiate the vGPU initialization
> >  *                              process in the graphics driver when the VM
> >  *                              boots, before qemu starts.
> >  *                              @vm_uuid: UUID of the VM which is booting.
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @vgpu_shutdown:              Called to tear down vGPU related resources
> >  *                              for the VM.
> >  *                              @vm_uuid: UUID of the VM which is shutting
> >  *                              down.
> >  *                              Returns integer: success (0) or error (< 0)
> >  * @read:                       Read emulation callback.
> >  *                              @vdev: vgpu device structure
> >  *                              @buf: read buffer
> >  *                              @count: number of bytes to read
> >  *                              @address_space: specifies for which address
> >  *                              space the request is: pci_config_space, IO
> >  *                              register space or MMIO space.
> >  *                              Returns number of bytes read on success or
> >  *                              error.
> >  * @write:                      Write emulation callback.
> >  *                              @vdev: vgpu device structure
> >  *                              @buf: write buffer
> >  *                              @count: number of bytes to be written
> >  *                              @address_space: specifies for which address
> >  *                              space the request is: pci_config_space, IO
> >  *                              register space or MMIO space.
> >  *                              Returns number of bytes written on success
> >  *                              or error.
> >  * @vgpu_set_irqs:              Called to pass on the interrupt
> >  *                              configuration information set by qemu.
> >  *                              @vdev: vgpu device structure
> >  *                              @flags, index, start, count and *data: same
> >  *                              as for struct vfio_irq_set of the
> >  *                              VFIO_DEVICE_SET_IRQS API.
> >  *
> >  * A physical GPU that supports vGPU should be registered with the vgpu
> >  * module with the gpu_device_ops structure.
> >  */
>
> > struct gpu_device_ops {
> >         struct module   *owner;
> >         int     (*vgpu_supported_config)(struct pci_dev *dev, char *config);
> >         int     (*vgpu_create)(struct pci_dev *dev, uuid_le vm_uuid,
> >                                uint32_t instance, uint32_t vgpu_id);
> >         int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le vm_uuid,
> >                                 uint32_t instance);
> >         int     (*vgpu_start)(uuid_le vm_uuid);
> >         int     (*vgpu_shutdown)(uuid_le vm_uuid);
> >         ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
> >                          uint32_t address_space, loff_t pos);
> >         ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
> >                          uint32_t address_space, loff_t pos);
> >         int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
> >                                  unsigned index, unsigned start,
> >                                  unsigned count, void *data);
>
> > };
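
To illustrate how we expect a vendor driver to consume this interface, here
is a hypothetical sketch (the xyz_* names are made up, the callback bodies
are placeholders, only create/destroy are shown, and error handling is
trimmed):

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/uuid.h>
#include <linux/vgpu.h>         /* proposed header from 2.1 */

static int xyz_vgpu_create(struct pci_dev *dev, uuid_le vm_uuid,
                           uint32_t instance, uint32_t vgpu_id)
{
        /* allocate per-vgpu state in the vendor driver */
        return 0;
}

static int xyz_vgpu_destroy(struct pci_dev *dev, uuid_le vm_uuid,
                            uint32_t instance)
{
        /*
         * free per-vgpu state; return an error if the VM is running and
         * hot-unplug is not supported
         */
        return 0;
}

static const struct gpu_device_ops xyz_gpu_ops = {
        .owner        = THIS_MODULE,
        .vgpu_create  = xyz_vgpu_create,
        .vgpu_destroy = xyz_vgpu_destroy,
        /*
         * .vgpu_supported_config, .vgpu_start, .vgpu_shutdown, .read,
         * .write and .vgpu_set_irqs are filled in the same way
         */
};

/* called from the vendor driver's existing PCI probe()/remove() path */
static int xyz_register(struct pci_dev *pdev)
{
        return vgpu_register_device(pdev, &xyz_gpu_ops);
}

static void xyz_unregister(struct pci_dev *pdev)
{
        vgpu_unregister_device(pdev);
}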
> 
> 
> I wonder if it shouldn't be vfio-vgpu sub-drivers (ie, Intel and Nvidia)
> that register these ops with the main vfio-vgpu driver and they should
> also include a probe() function which allows us to associate a given
> vgpu device with a set of vendor ops.
> 
> 
>
> > 2.2 Details for callbacks we haven't mentioned above.
> > ---------------------------------------------------------------------------------
>
> > vgpu_supported_config: allows the vendor driver to specify the supported
> >                        vGPU type/configuration
>
> > vgpu_create          : create a virtual GPU device, can be used for device
> >                        hotplug
>
> > vgpu_destroy         : destroy a virtual GPU device, can be used for device
> >                        hotplug
>
> > vgpu_start           : callback to notify the vendor driver that the vgpu
> >                        device has come to life for a given virtual machine
>
> > vgpu_shutdown        : callback to notify the vendor driver to tear down
> >                        the vgpu device for a given virtual machine
>
> > read                 : callback to the vendor driver to handle virtual
> >                        device config space or MMIO read access
>
> > write                : callback to the vendor driver to handle virtual
> >                        device config space or MMIO write access
>
> > vgpu_set_irqs        : callback to the vendor driver to pass along the
> >                        interrupt information for the target virtual device,
> >                        so the vendor driver can inject interrupts into the
> >                        virtual machine for this device
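
To make the read/write flow above concrete, this is roughly what we have in
mind inside vgpu.ko when QEMU performs a region access (simplified sketch
only; the gpu_ops pointer on struct vgpu_device and the VGPU_SPACE_MMIO
constant are placeholder names, not final):

/* VFIO device read path in vgpu.ko, forwarding to the vendor callback.
 * A real implementation would decode *ppos into the address space and
 * offset within it; only the MMIO case is shown here. */
static ssize_t vgpu_dev_read(void *device_data, char __user *ubuf,
                             size_t count, loff_t *ppos)
{
        struct vgpu_device *vdev = device_data;
        char buf[8];
        ssize_t ret;

        if (count > sizeof(buf))
                count = sizeof(buf);

        /* hand the access to the vendor driver (nvidia.ko / i915.ko) */
        ret = vdev->gpu_ops->read(vdev, buf, count, VGPU_SPACE_MMIO, *ppos);
        if (ret <= 0)
                return ret;

        if (copy_to_user(ubuf, buf, ret))
                return -EFAULT;

        *ppos += ret;
        return ret;
}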
>
> > 2.3 Potential additional virtual device configuration registration interface:
> > ---------------------------------------------------------------------------------
>
> > callback function to describe the MMAP behavior of the virtual GPU
>
> > callback function to allow the GPU vendor driver to provide PCI config
> > space backing memory.
>
> > 3. VGPU TYPE1 IOMMU
> > ==================================================================================
>
> > Here we are providing a TYPE1 IOMMU for vGPU which will basically keep
> > track of the <iova, hva, size, flag> tuples and save the QEMU mm for later
> > reference.
>
> > You can find the quick/ugly implementation in the attached patch file,
> > which is actually just a simplified version of Alex's type1 IOMMU without
> > the actual mapping when IOMMU_MAP_DMA / IOMMU_UNMAP_DMA is called.
>
> > We have thought about providing another vendor driver registration
> > interface so that such tracking information is sent to the vendor driver,
> > which then uses the QEMU mm to do get_user_pages / remap_pfn_range when
> > required. After doing a quick implementation within our driver, I noticed
> > the following issues:
>
> > 1) It puts OS/VFIO logic into the vendor driver, which will be a
> > maintenance issue.
>
> > 2) Every driver vendor has to implement their own RB tree, instead of
> > reusing the existing common VFIO code (vfio_find/link/unlink_dma)
>
> > 3) IOMMU_UNMAP_DMA is expected to return the "unmapped bytes" to the
> > caller/QEMU, so it is better not to have anything inside a vendor driver
> > that the VFIO caller immediately depends on.
>
> > Based on the above considerations, we decided to implement the DMA
> > tracking logic within the VGPU TYPE1 IOMMU code (ideally, this should be
> > merged into the current TYPE1 IOMMU code) and expose two symbols for MMIO
> > mapping and for page translation and pinning.
>
> > Also, with an mmap MMIO interface between virtual and physical, a
> > para-virtualized guest driver can access its virtual MMIO without taking
> > an mmap fault hit, and we can support different MMIO sizes between the
> > virtual and the physical device.
>
> > int vgpu_map_virtual_bar
> > (
> >     uint64_t virt_bar_addr,
> >     uint64_t phys_bar_addr,
> >     uint32_t len,
> >     uint32_t flags
> > )
>
> > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> 
> 
> Per the implementation provided, this needs to be implemented in the
> vfio device driver, not in the iommu interface.  Finding the DMA mapping
> of the device and replacing it is wrong.  It should be remapped at the
> vfio device file interface using vm_ops.
> 

So you are basically suggesting that we take an mmap fault and, within that
fault handler, go into the vendor driver to look up the "pre-registered"
mapping and remap there.

Is my understanding correct?
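
i.e. something along these lines inside vgpu.ko (just a sketch to make sure
we are talking about the same thing; vgpu_find_phys_mapping() is a made-up
helper standing in for whatever looks up the pre-registered
virtual-to-physical BAR mapping):

#include <linux/mm.h>

static int vgpu_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        struct vgpu_device *vdev = vma->vm_private_data;
        unsigned long offset = vmf->pgoff << PAGE_SHIFT;
        unsigned long pfn;

        /* look up the "pre-registered" virtual -> physical BAR mapping */
        if (vgpu_find_phys_mapping(vdev, offset, &pfn))
                return VM_FAULT_SIGBUS;

        if (vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn))
                return VM_FAULT_SIGBUS;

        return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_mmio_ops = {
        .fault = vgpu_mmio_fault,
};

/* and in the vgpu device mmap() handler, roughly:                       */
/*     vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; */
/*     vma->vm_ops = &vgpu_mmio_ops;                                     */
/*     vma->vm_private_data = vdev;                                      */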

> 
> > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
>
> > EXPORT_SYMBOL(vgpu_dma_do_translate);
>
> > There is still a lot to be added and modified, such as supporting multiple
> > VMs and multiple virtual devices, tracking the mapped / pinned regions
> > within the VGPU IOMMU kernel driver, error handling, roll-back and locked
> > memory size per user, etc.
> 
> Particularly, handling of mapping changes is completely missing.  This
> cannot be a point in time translation, the user is free to remap
> addresses whenever they wish and device translations need to be updated
> accordingly.
> 

When you say "user", do you mean QEMU? Here, any DMA that the guest driver is
going to launch will first be pinned within the VM and then registered to QEMU
(and therefore to the IOMMU memory listener); eventually the pages will be
pinned by the GPU or DMA engine.

Since we are keeping the upper-level code the same, and thinking about the
passthrough case, where the GPU has already put the real IOVA into its PTEs, I
don't know how QEMU can change that mapping without causing an IOMMU fault on
an active DMA device.
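
For reference, this is roughly how I expect our vendor driver to use the
translation/pinning symbol when it builds the GPU page tables for such a DMA
buffer (sketch only; xyz_pin_guest_buffer() is a made-up name, and the
in-place replacement of guest pfns with pinned host addresses is my reading
of the intended semantics):

/* vendor-driver side: translate and pin the guest frame numbers of a DMA
 * buffer before programming the per-vgpu GPU page tables */
static int xyz_pin_guest_buffer(dma_addr_t *gfn_buffer, uint32_t count)
{
        int ret;

        /*
         * gfn_buffer[] holds guest pfns on entry; assuming success (0),
         * the vgpu TYPE1 IOMMU replaces them with pinned host addresses
         */
        ret = vgpu_dma_do_translate(gfn_buffer, count);
        if (ret)
                return ret;

        /* now gfn_buffer[] can be written into the GPU page tables */
        return 0;
}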

> 
> > 4. Modules
> > ==================================================================================
>
> > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
>
> > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
> >                            TYPE1 v1 and v2 interface. 
> 
> Depending on how intrusive it is, this can possibly be done within the
> existing type1 driver.  Either that or we can split out common code for
> use by a separate module.
> 
> > vgpu.ko                  - provide registration interface and virtual device
> >                            VFIO access.
>
> > 5. QEMU note
> > ==================================================================================
>
> > To allow us to focus on the VGPU kernel driver prototyping, we have
> > introduced a new VFIO class - vgpu - inside QEMU, so we don't have to
> > change the existing vfio/pci.c file and can use it as a reference for our
> > implementation. It is basically just a quick copy & paste from vfio/pci.c
> > to quickly meet our needs.
>
> > Once this proposal is finalized, we will move to vfio/pci.c instead of a
> > new class, and probably the only thing required is to have a new way to
> > discover the device.
>
> > 6. Examples
> > ==================================================================================
>
> > On this server, we have two NVIDIA M60 GPUs.
>
> > address@hidden ~]# lspci -d 10de:13f2
> > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
> > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
>
> > After nvidia.ko gets initialized, we can query the supported vGPU types by
> > reading "vgpu_supported_types" as follows:
>
> > address@hidden ~]# cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > 11:GRID M60-0B
> > 12:GRID M60-0Q
> > 13:GRID M60-1B
> > 14:GRID M60-1Q
> > 15:GRID M60-2B
> > 16:GRID M60-2Q
> > 17:GRID M60-4Q
> > 18:GRID M60-8Q
>
> > For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and we
> > would like to create a "GRID M60-4Q" vGPU for it on this physical GPU:
>
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
>
> > Note: the number 0 here is the vGPU device index. So far the change has
> > not been tested with multiple vgpu devices yet, but we will support that.
>
> > At this moment, if you query "vgpu_supported_types" it will still show all
> > supported virtual GPU types, as no virtual GPU resources have been
> > committed yet.
>
> > Starting VM:
>
> > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
>
> > then, the supported vGPU type query will return:
>
> > address@hidden /home/cjia]$
> > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > 17:GRID M60-4Q
>
> > So vgpu_supported_config needs to be called whenever a new virtual device
> > gets created, as the underlying HW might limit the supported types if
> > there are any existing VMs running.
>
> > Then, when the VM gets shut down, a write to /sys/class/vgpu/vgpu_shutdown
> > will inform the GPU vendor driver to clean up resources.
>
> > Eventually, those virtual GPUs can be removed by writing to vgpu_destroy
> > under the device sysfs.
> 
> 
> I'd like to hear Intel's thoughts on this interface.  Are there
> different vgpu capacities or priority classes that would necessitate
> different types of vgpus on Intel?
> 
> I think there are some gaps in translating from named vgpu types to
> indexes here, along with my previous mention of the UUID/set oddity.
> 
> Does Intel have a need for start and shutdown interfaces?
> 
> Neo, wasn't there at some point information about how many of each type
> could be supported through these interfaces?  How does a user know their
> capacity limits?
> 

Thanks for reminding me of that; I think we probably forgot to include that
*important* information in the output of "vgpu_supported_types".

Regarding capacity, we can provide the frame buffer size as part of the
"vgpu_supported_types" output as well; I would imagine those will eventually
show up in the OpenStack management interface or virt-manager.

Basically, yes, there would be a separate column showing the number of
instances you can create for each type of vGPU on a specific physical GPU.
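
For example, the output could look something like this (format and numbers
are purely illustrative, nothing is final yet):

  # cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
  11:GRID M60-0B:512M:16
  12:GRID M60-0Q:512M:16
  ...
  18:GRID M60-8Q:8192M:1

i.e. <VGPU_ID>:<name>:<frame buffer size>:<max instances per physical GPU>.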

Thanks,
Neo


> Thanks,
> Alex
> 


