From: Neo Jia
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Wed, 27 Jan 2016 01:14:53 -0800
User-agent: Mutt/1.5.24 (2015-08-30)

On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote:
> On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote:
> > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote:
> > > > 1.1 Under per-physical device sysfs:
> > > > ----------------------------------------------------------------------------------
> > > >  
> > > > vgpu_supported_types - RO, lists the currently supported virtual GPU types
> > > > and their VGPU_IDs. VGPU_ID is a vGPU type identifier returned from reads
> > > > of "vgpu_supported_types".
> > > >                             
> > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual
> > > > gpu device on a target physical GPU. idx: virtual device index inside a 
> > > > VM
> > > >  
> > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu 
> > > > device on a
> > > > target physical GPU
> > > 
> > > 
> > > I've noted in previous discussions that we need to separate user policy
> > > from kernel policy here, the kernel policy should not require a "VM
> > > UUID".  A UUID simply represents a set of one or more devices and an
> > > index picks the device within the set.  Whether that UUID matches a VM
> > > or is independently used is up to the user policy when creating the
> > > device.
> > > 
> > > Personally I'd also prefer to get rid of the concept of indexes within a
> > > UUID set of devices and instead have each device be independent.  This
> > > seems to be an imposition on the nvidia implementation into the kernel
> > > interface design.
> > > 
>
> > Hi Alex,
>
> > I agree with you that we should not put the UUID concept into a kernel API.
> > At this point (without any prototyping), I am thinking of using a list of
> > virtual devices instead of a UUID.
> 
> Hi Neo,
> 
> A UUID is a perfectly fine name, so long as we let it be just a UUID and
> not the UUID matching some specific use case.
> 
> > > >  
> > > > int vgpu_map_virtual_bar
> > > > (
> > > >     uint64_t virt_bar_addr,
> > > >     uint64_t phys_bar_addr,
> > > >     uint32_t len,
> > > >     uint32_t flags
> > > > )
> > > >  
> > > > EXPORT_SYMBOL(vgpu_map_virtual_bar);
> > > 
> > > 
> > > Per the implementation provided, this needs to be implemented in the
> > > vfio device driver, not in the iommu interface.  Finding the DMA mapping
> > > of the device and replacing it is wrong.  It should be remapped at the
> > > vfio device file interface using vm_ops.
> > > 
>
> > So you are basically suggesting that we are going to take an mmap fault and,
> > within that fault handler, go into the vendor driver to look up the
> > "pre-registered" mapping and remap there.
>
> > Is my understanding correct?
> 
> Essentially, hopefully the vendor driver will have already registered
> the backing for the mmap prior to the fault, but either way could work.
> I think the key though is that you want to remap it onto the vma
> accessing the vfio device file, not scanning it out of an IOVA mapping
> that might be dynamic and doing a vma lookup based on the point in time
> mapping of the BAR.  The latter doesn't give me much confidence that
> mappings couldn't change while the former should be a one time fault.

Hi Alex,

The fact is that the vendor driver can only prevent such an mmap fault by
looking up the <iova, hva> mapping table that we have saved from the IOMMU
memory listener when the guest region gets programmed. Also, as you have
mentioned below, the mapping between iova and hva shouldn't change once the
SBIOS and guest OS are done with their job.

Yes, you are right that it is a one-time fault, but the GPU work is heavily
pipelined.

Probably we should just limit this interface to the guest MMIO region, and we
could cross-check with the VFIO driver, which has been monitoring the config
space accesses, to make sure nothing gets moved around?
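
To make that concrete, below is a very rough sketch of the kind of vm_ops-based
fault handler we have in mind for the vfio device file; "struct vgpu_dev" and
vgpu_vendor_bar_to_pfn() are made-up placeholders for whatever the vendor
driver would actually provide, not real code:

/*
 * Rough sketch only -- at fault time the vendor driver decides which
 * physical BAR page backs the faulting offset of the virtual BAR.
 */
#include <linux/mm.h>

struct vgpu_dev;                                /* placeholder device state */
unsigned long vgpu_vendor_bar_to_pfn(struct vgpu_dev *vdev,
                                     unsigned long offset);

static int vgpu_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        struct vgpu_dev *vdev = vma->vm_private_data;
        unsigned long vaddr = (unsigned long)vmf->virtual_address;
        unsigned long pfn;

        /* which page of the virtual BAR is being touched */
        pfn = vgpu_vendor_bar_to_pfn(vdev, vaddr - vma->vm_start);

        if (vm_insert_pfn(vma, vaddr, pfn))
                return VM_FAULT_SIGBUS;

        return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_mmio_ops = {
        .fault = vgpu_mmio_fault,
};

static int vgpu_mmap(void *device_data, struct vm_area_struct *vma)
{
        /* no backing yet -- the real pfn is only inserted at fault time */
        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        vma->vm_ops = &vgpu_mmio_ops;
        vma->vm_private_data = device_data;
        return 0;
}

That way the remap happens on the vma of the vfio device file itself, rather
than us scanning an IOVA mapping that might change underneath us.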

> 
> In case it's not clear to folks at Intel, the purpose of this is that a
> vGPU may directly map a segment of the physical GPU MMIO space, but we
> may not know what segment that is at setup time, when QEMU does an mmap
> of the vfio device file descriptor.  The thought is that we can create
> an invalid mapping when QEMU calls mmap(), knowing that it won't be
> accessed until later, then we can fault in the real mmap on demand.  Do
> you need anything similar?
> 
> > > 
> > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> > > >  
> > > > EXPORT_SYMBOL(vgpu_dma_do_translate);
> > > >  
> > > > Still a lot to be added and modified, such as supporting multiple VMs 
> > > > and 
> > > > multiple virtual devices, tracking the mapped / pinned region within 
> > > > VGPU IOMMU 
> > > > kernel driver, error handling, roll-back and locked memory size per 
> > > > user, etc. 
> > > 
> > > Particularly, handling of mapping changes is completely missing.  This
> > > cannot be a point in time translation, the user is free to remap
> > > addresses whenever they wish and device translations need to be updated
> > > accordingly.
> > > 
>
> > When you say "user", do you mean the QEMU?
> 
> vfio is a generic userspace driver interface, QEMU is a very, very
> important user of the interface, but not the only user.  So for this
> conversation, we're mostly talking about QEMU as the user, but we should
> be careful about assuming QEMU is the only user.
> 

Understood. I have to say that our focus at this moment is to support QEMU and
KVM, but I know the VFIO interface is much more than that, and that is why I
think it is right to leverage this framework so we can explore future use
cases in userland together.


> > Here, whatever DMA the guest driver is going to launch will first be pinned
> > within the VM and then registered to QEMU, and therefore to the IOMMU memory
> > listener; eventually the pages will be pinned by the GPU or DMA engine.
>
> > Since we are keeping the upper-level code the same, think about the
> > passthrough case, where the GPU has already put the real IOVA into its PTEs:
> > I don't know how QEMU can change that mapping without causing an IOMMU fault
> > on an active DMA device.
> 
> For the virtual BAR mapping above, it's easy to imagine that mapping a
> BAR to a given address is at the guest discretion, it may be mapped and
> unmapped, it may be mapped to different addresses at different points in
> time, the guest BIOS may choose to map it at yet another address, etc.
> So if somehow we were trying to setup a mapping for peer-to-peer, there
> are lots of ways that IOVA could change.  But even with RAM, we can
> support memory hotplug in a VM.  What was once a DMA target may be
> removed or may now be backed by something else.  Chipset configuration
> on the emulated platform may change how guest physical memory appears
> and that might change between VM boots.
> 
> Currently with physical device assignment the memory listener watches
> for both maps and unmaps and updates the iotlb to match.  Just like real
> hardware doing these same sorts of things, we rely on the guest to stop
> using memory that's going to be moved as a DMA target prior to moving
> it.

Right, you can only do that when the device is quiescent.

As long as the guest is notified, I think we should be able to support it,
although the real implementation will depend on how the device gets into a
quiescent state.

This is definitely a very interesting feature we should explore, but I hope we
can first focus on the most basic functionality.
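
Just so the mapping-change part isn't left completely hanging, here is a
hedged sketch of what the unmap side of the vGPU TYPE1 backend could look
like; the tracking structure and the vgpu_vendor_invalidate() callback are
hypothetical placeholders, only meant to show the ordering (invalidate the
device translations first, then drop the pins and the accounting):

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/types.h>

struct vgpu_iommu;                              /* placeholder backend state */
void vgpu_vendor_invalidate(struct vgpu_iommu *iommu,
                            dma_addr_t iova, size_t size);

/* one entry per range we pinned and handed to the vendor driver */
struct vgpu_pinned_region {
        struct list_head next;
        dma_addr_t iova;
        size_t size;
        struct page **pages;
        unsigned long npages;
};

static void vgpu_iommu_unmap_region(struct vgpu_iommu *iommu,
                                    struct vgpu_pinned_region *region)
{
        unsigned long i;

        /* 1. the vendor driver must stop using these translations */
        vgpu_vendor_invalidate(iommu, region->iova, region->size);

        /* 2. only then release the references we took when pinning */
        for (i = 0; i < region->npages; i++)
                put_page(region->pages[i]);

        /* 3. update tracking; locked-memory accounting would go here too */
        list_del(&region->next);
        kfree(region->pages);
        kfree(region);
}

The map side would be the mirror image: pin, charge against the per-user
locked memory limit, record the region, and hand the translations to the
vendor driver.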

Thanks,
Neo

> 
> > > > 4. Modules
> > > > ==================================================================================
> > > >  
> > > > Two new modules are introduced: vfio_iommu_type1_vgpu.ko and vgpu.ko
> > > >  
> > > > vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
> > > >                            TYPE1 v1 and v2 interface. 
> > > 
> > > Depending on how intrusive it is, this can possibly be done within the
> > > existing type1 driver.  Either that or we can split out common code for
> > > use by a separate module.
> > > 
> > > > vgpu.ko                  - provide registration interface and virtual 
> > > > device
> > > >                            VFIO access.
> > > >  
> > > > 5. QEMU note
> > > > ==================================================================================
> > > >  
> > > > To allow us to focus on the vGPU kernel driver prototyping, we have
> > > > introduced a new VFIO class - vgpu - inside QEMU, so we don't have to
> > > > change the existing vfio/pci.c file and can use it as a reference for
> > > > our implementation. It is basically just a quick copy & paste from
> > > > vfio/pci.c to quickly meet our needs.
> > > >  
> > > > Once this proposal is finalized, we will move to vfio/pci.c instead of 
> > > > a new
> > > > class, and probably the only thing required is to have a new way to 
> > > > discover the
> > > > device.
> > > >  
> > > > 6. Examples
> > > > ==================================================================================
> > > >  
> > > > On this server, we have two NVIDIA M60 GPUs.
> > > >  
> > > > address@hidden ~]# lspci -d 10de:13f2
> > > > 86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev 
> > > > a1)
> > > > 87:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev 
> > > > a1)
> > > >  
> > > > After nvidia.ko gets initialized, we can query the supported vGPU types
> > > > by reading "vgpu_supported_types" as follows:
> > > >  
> > > > address@hidden ~]# cat 
> > > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types 
> > > > 11:GRID M60-0B
> > > > 12:GRID M60-0Q
> > > > 13:GRID M60-1B
> > > > 14:GRID M60-1Q
> > > > 15:GRID M60-2B
> > > > 16:GRID M60-2Q
> > > > 17:GRID M60-4Q
> > > > 18:GRID M60-8Q
> > > >  
> > > > For example, the VM_UUID is c0b26072-dd1b-4340-84fe-bf338c510818, and
> > > > we would like to create a "GRID M60-4Q" vGPU on it.
> > > >  
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818:0:17" >
> > > > /sys/bus/pci/devices/0000\:86\:00.0/vgpu_create
> > > >  
> > > > Note: the number 0 here is the vGPU device index. So far the change has
> > > > not been tested with multiple vGPU devices yet, but we will support it.
> > > >  
> > > > At this moment, if you query "vgpu_supported_types" it will still show
> > > > all supported virtual GPU types, as no virtual GPU resources have been
> > > > committed yet.
> > > >  
> > > > Starting VM:
> > > >  
> > > > echo "c0b26072-dd1b-4340-84fe-bf338c510818" > /sys/class/vgpu/vgpu_start
> > > >  
> > > > then, the supported vGPU type query will return:
> > > >  
> > > > address@hidden /home/cjia]$
> > > > > cat /sys/bus/pci/devices/0000\:86\:00.0/vgpu_supported_types
> > > > 17:GRID M60-4Q
> > > >  
> > > > So vgpu_supported_config needs to be called whenever a new virtual
> > > > device gets created, as the underlying HW might limit the supported
> > > > types if there are any existing VMs running.
> > > >  
> > > > Then, when the VM gets shut down, writes to /sys/class/vgpu/vgpu_shutdown
> > > > will inform the GPU vendor driver to clean up resources.
> > > >  
> > > > Eventually, those virtual GPUs can be removed by writing to 
> > > > vgpu_destroy under
> > > > device sysfs.
> > > 
> > > 
> > > I'd like to hear Intel's thoughts on this interface.  Are there
> > > different vgpu capacities or priority classes that would necessitate
> > > different types of vgpus on Intel?
> > > 
> > > I think there are some gaps in translating from named vgpu types to
> > > indexes here, along with my previous mention of the UUID/set oddity.
> > > 
> > > Does Intel have a need for start and shutdown interfaces?
> > > 
> > > Neo, wasn't there at some point information about how many of each type
> > > could be supported through these interfaces?  How does a user know their
> > > capacity limits?
> > > 
>
> > Thanks for reminding me of that, I think we probably forgot to put that
> > *important* information in the output of "vgpu_supported_types".
>
> > Regarding the capacity, we can provide the frame buffer size as part of the
> > "vgpu_supported_types" output as well; I would imagine those will eventually
> > show up in the OpenStack management interface or virt-mgr.
>
> > Basically, yes, there would be a separate column to show the number of
> > instances you can create for each type of vGPU on a specific physical GPU.
> 
> Ok, Thanks,
> 
> Alex
> 