Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

From: Jike Song
Subject: Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)
Date: Mon, 18 Jan 2016 16:56:13 +0800
User-agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:17.0) Gecko/20130801 Thunderbird/17.0.8
On 01/18/2016 12:47 PM, Alex Williamson wrote:
> Hi Jike,
>
> On Mon, 2016-01-18 at 10:39 +0800, Jike Song wrote:
>> Hi Alex, let's continue with a new thread :)
>>
>> Basically we agree with you: exposing vGPU via VFIO lets
>> QEMU share as much code as possible with pcidev (PF or VF) assignment.
>> And yes, different vGPU vendors can share quite a lot of the
>> QEMU part, which will benefit upper layers such as libvirt.
>>
>>
>> To achieve this, there is quite a lot to do; I'll summarize
>> it below. I have dived into VFIO for a while but may still have
>> misunderstood things, so please correct me :)
>>
>>
>>
>> First, let me illustrate my understanding of current VFIO
>> framework used to pass through a pcidev to guest:
>>
>>
>>             +----------------------------------+
>>             |             vfio qemu            |
>>             +-----+------------------------+---+
>>                   |DMA                  ^  |CFG
>>              QEMU |map               IRQ|  |
>> ------------------|---------------------|--|------------
>> KERNEL   +--------|---------------------|--|-----------+
>>          | VFIO   |                     |  |           |
>>          |        v                     |  v           |
>>          |  +-------------------+ +-----+-----------+  |
>> IOMMU    |  | vfio iommu driver | | vfio bus driver |  |
>> API <-------+                   | |                 |  |
>> Layer    |  | e.g. type1        | | e.g. vfio_pci   |  |
>>          |  +-------------------+ +-----------------+  |
>>          +---------------------------------------------+
>>
>>
>> Here, when a particular pcidev is passed through to a KVM guest,
>> it is attached to the vfio_pci driver in the host, and guest memory
>> is mapped into the IOMMU via the type1 iommu driver.
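
For reference, a minimal userspace sketch of that current flow, following Documentation/vfio.txt; the group number, the single mapping at iova 0, and the omitted error handling are illustrative only, not taken from real QEMU code:

  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static void map_guest_ram(void *guest_ram, size_t guest_ram_size)
  {
          int container = open("/dev/vfio/vfio", O_RDWR);
          int group = open("/dev/vfio/26", O_RDWR);  /* iommu group of the assigned pcidev */
          struct vfio_iommu_type1_dma_map map = {
                  .argsz = sizeof(map),
                  .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                  .vaddr = (unsigned long)guest_ram,  /* QEMU's userspace mapping of guest RAM */
                  .iova  = 0,                         /* guest-physical address */
                  .size  = guest_ram_size,
          };

          ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); /* attach group to container */
          ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); /* select the type1 iommu driver */
          ioctl(container, VFIO_IOMMU_MAP_DMA, &map);         /* pin pages, program the hw iommu */
  }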
>>
>>
>> Then, the draft infrastructure of future VFIO-based vgpu:
>>
>>
>>
>>              +-------------------------------------+
>>              |              vfio qemu              |
>>              +----+-------------------------+------+
>>                   |DMA                   ^  |CFG
>>              QEMU |map                IRQ|  |
>> ------------------|----------------------|--|------------
>> KERNEL   +--------|----------------------|--|--------------+
>>          | VFIO   |                      |  |              |
>>          |        v                      |  v              |
>>          |  +--------------------+ +-----+-----------+     |
>> DMA      |  | vfio iommu driver  | | vfio bus driver |     |
>> API <-------+                    | |                 |     |
>> Layer    |  | e.g. vfio_type2    | | e.g. vfio_vgpu  |     |
>>          |  +--------------------+ +-----------------+     |
>>          |         |  ^                      |  ^          |
>>          +---------|--|----------------------|--|----------+
>>                    |  |                      |  |
>>                    |  |                      v  |
>>            +-------|--|------------+  +---------------------+
>>            | +-----v--+----------+ |  |                     |
>>            | |                   | |  |                     |
>>            | |       KVMGT       | |  |                     |
>>            | |                   | |  |   host gfx driver   |
>>            | +-------------------+ |  |                     |
>>            |    KVM hypervisor     |  |                     |
>>            +-----------------------+  +---------------------+
>>
>> NOTE vfio_type2 and vfio_vgpu are only *logically* parts
>> of VFIO; they may be implemented in the KVM hypervisor
>> or the host gfx driver.
>>
>>
>>
>> Here we need to implement a new vfio IOMMU driver instead of type1;
>> let's call it vfio_type2 temporarily. The main difference from pcidev
>> assignment is that a vGPU doesn't have its own DMA requester id, so it
>> has to share mappings with the host and other vGPUs.
>>
>> - type1 iommu driver maps gpa to hpa for passing through;
>> whereas type2 maps iova to hpa;
>>
>> - hardware iommu is always needed by type1, whereas for
>> type2, hardware iommu is optional;
>>
>> - type1 will invoke the low-level IOMMU API (iommu_map et al) to
>> set up the IOMMU page table directly, whereas type2 doesn't (it only
>> needs to invoke a higher-level DMA API like dma_map_page); a sketch
>> of this contrast follows below;
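
To make that contrast concrete, here is a hypothetical sketch of the map path of such a type2 backend; struct vgpu_iommu, struct vgpu_dma and vgpu_iommu_map() are illustrative names, not an existing API. Unlike type1, nothing is pinned and no hardware iommu is programmed at map time; the iova->vaddr translation is merely recorded:

  #include <linux/list.h>
  #include <linux/mutex.h>
  #include <linux/slab.h>
  #include <linux/types.h>

  struct vgpu_iommu {
          struct list_head dma_list;     /* recorded iova -> vaddr translations */
          struct mutex lock;
  };

  struct vgpu_dma {
          struct list_head next;
          dma_addr_t iova;               /* vGPU device address */
          unsigned long vaddr;           /* QEMU process virtual address */
          size_t size;
  };

  static int vgpu_iommu_map(struct vgpu_iommu *iommu, dma_addr_t iova,
                            unsigned long vaddr, size_t size)
  {
          struct vgpu_dma *dma = kzalloc(sizeof(*dma), GFP_KERNEL);

          if (!dma)
                  return -ENOMEM;

          dma->iova  = iova;
          dma->vaddr = vaddr;
          dma->size  = size;

          mutex_lock(&iommu->lock);
          list_add(&dma->next, &iommu->dma_list); /* record only; no iommu_map(), no pinning */
          mutex_unlock(&iommu->lock);
          return 0;
  }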
>
> Yes, the current type1 implementation is not compatible with vgpu since
> there are not separate requester IDs on the bus and you probably don't
> want or need to pin all of guest memory like we do for direct
> assignment. However, let's separate the type1 user API from the
> current implementation. It's quite easy within the vfio code to
> consider "type1" to be an API specification that may have multiple
> implementations. A minor code change would allow us to continue
> looking for compatible iommu backends if the group we're trying to
> attach is rejected.
Would you elaborate a bit on 'iommu backends' here? Previously
I thought the entire type1 driver would be duplicated. If not, what is
supposed to be added, a new vfio_dma_do_map?
> The benefit here is that QEMU could work
> unmodified, using the type1 vfio-iommu API regardless of whether a
> device is directly assigned or virtual.
>
> Let's look at the type1 interface; we have simple map and unmap
> interfaces which map and unmap process virtual address space (vaddr) to
> the device address space (iova). The host physical address is obtained
> by pinning the vaddr. In the current implementation, a map operation
> pins pages and populates the hardware iommu. A vgpu compatible
> implementation might simply register the translation into a kernel-
> based database to be called upon later. When the host graphics driver
> needs to enable dma for the vgpu, it doesn't need to go to QEMU for the
> translation, it already possesses the iova to vaddr mapping, which
> becomes iova to hpa after a pinning operation.
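
A similarly hedged sketch of that on-demand path, reusing the illustrative vgpu_iommu/vgpu_dma records from the earlier sketch (get_user_pages_fast() is shown with its v4.4-era signature); the host gfx driver would call something like this only when it actually needs to DMA on behalf of the vGPU:

  #include <linux/mm.h>

  /* Translate a vGPU iova to a pinned host physical address.
   * Looks up the recorded iova -> vaddr mapping, then pins the backing
   * page; the caller is responsible for put_page() once DMA completes.
   */
  static int vgpu_iommu_pin(struct vgpu_iommu *iommu, dma_addr_t iova,
                            phys_addr_t *hpa)
  {
          struct vgpu_dma *dma;
          struct page *page;
          unsigned long vaddr;
          int ret = -EINVAL;             /* iova not found in the database */

          mutex_lock(&iommu->lock);
          list_for_each_entry(dma, &iommu->dma_list, next) {
                  if (iova < dma->iova || iova >= dma->iova + dma->size)
                          continue;

                  vaddr = dma->vaddr + (iova - dma->iova);
                  if (get_user_pages_fast(vaddr & PAGE_MASK, 1, 1, &page) != 1) {
                          ret = -EFAULT;
                          break;
                  }
                  *hpa = ((phys_addr_t)page_to_pfn(page) << PAGE_SHIFT) |
                         (iova & ~PAGE_MASK);
                  ret = 0;
                  break;
          }
          mutex_unlock(&iommu->lock);
          return ret;
  }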
>
> So, I would encourage you to look at creating a vgpu vfio iommu
> backend that makes use of the type1 API, since it will reduce the
> changes necessary for userspace.
>
Yes, keeping the type1 API sounds like a great idea.
>> We also need to implement a new 'bus' driver instead of vfio_pci,
>> let's call it vfio_vgpu temporarily:
>>
>> - vfio_pci is a real pci driver; it has a probe method called
>> during device attaching. vfio_vgpu, in contrast, is a pseudo
>> driver; it won't attach to any device - the GPU is always owned by
>> the host gfx driver. It has to do its 'probing' elsewhere, but
>> still in the host gfx driver attached to the device;
>>
>> - a pcidev (PF or VF) attached to vfio_pci has a natural path
>> in sysfs, whereas a vgpu is purely a software concept:
>> vfio_vgpu needs to create/destroy vgpu instances and
>> maintain their paths in sysfs (e.g. "/sys/class/vgpu/intel/vgpu0"),
>> etc. There should be something added in a higher layer
>> to do this (VFIO or DRM);
>>
>> - vfio_pci in most cases will allow QEMU to access pcidev
>> hardware, whereas vfio_vgpu is to access virtual resources
>> emulated by another device model;
>>
>> - vfio_pci will inject an IRQ to the guest only when a physical IRQ
>> is generated, whereas vfio_vgpu may inject an IRQ for emulation
>> purposes. Either way, they can share the same injection interface;
>
> Here too, I think you're making assumptions based on an implementation
> path. Personally, I think each vgpu should be a struct device and that
> an iommu group should be created for each. I think this is a valid
> abstraction; dma isolation is provided through something other than a
> system-level iommu, but it's still provided. Without this, the entire
> vfio core would need to be aware of vgpu, since the core operates on
> devices and groups. I believe creating a struct device also gives you
> basic probe and release support for a driver.
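
A hedged sketch of that shape using the ordinary driver-core and iommu-group APIs; struct vgpu_dev, vgpu_register() and vgpu_release() are illustrative names, not an existing interface:

  #include <linux/device.h>
  #include <linux/err.h>
  #include <linux/iommu.h>
  #include <linux/slab.h>

  struct vgpu_dev {
          struct device dev;             /* one struct device per vgpu */
          struct iommu_group *group;     /* one iommu group per vgpu */
  };

  static void vgpu_release(struct device *dev)
  {
          kfree(container_of(dev, struct vgpu_dev, dev)); /* vgpu was kmalloc'd by its creator */
  }

  static int vgpu_register(struct vgpu_dev *vgpu, struct device *parent, int id)
  {
          int ret;

          device_initialize(&vgpu->dev);
          vgpu->dev.parent  = parent;    /* the physical GPU */
          vgpu->dev.release = vgpu_release;
          dev_set_name(&vgpu->dev, "vgpu%d", id);

          ret = device_add(&vgpu->dev);
          if (ret)
                  return ret;

          /* dma isolation comes from the GPU itself, not a system iommu,
           * but the group still lets the vfio core treat this like any device */
          vgpu->group = iommu_group_alloc();
          if (IS_ERR(vgpu->group))
                  return PTR_ERR(vgpu->group);

          return iommu_group_add_device(vgpu->group, &vgpu->dev);
  }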
>
Indeed.
BTW, that should be done in the 'bus' driver, right?
> There will be a need for some sort of lifecycle management of a vgpu.
> How is it created? Destroyed? Can it be given more or less resources
> than other vgpus, etc. This could be implemented in sysfs for each
> physical gpu with vgpu support, sort of like how we support sr-iov now,
> the PF exports controls for creating VFs. The more commonality we can
> get for lifecycle and device access for userspace, the better.
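
As a rough illustration of the sr-iov analogy, the physical GPU's driver could expose a write-only sysfs attribute for instance creation, similar in spirit to sriov_numvfs; intel_vgpu_create() and the attribute name here are purely hypothetical:

  #include <linux/device.h>
  #include <linux/kernel.h>

  extern int intel_vgpu_create(struct device *gpu, unsigned int type); /* assumed vendor hook */

  static ssize_t vgpu_create_store(struct device *dev,
                                   struct device_attribute *attr,
                                   const char *buf, size_t count)
  {
          unsigned int type;
          int ret;

          if (kstrtouint(buf, 10, &type))
                  return -EINVAL;

          ret = intel_vgpu_create(dev, type);   /* create one vgpu of the requested type */
          return ret ? ret : count;
  }
  static DEVICE_ATTR_WO(vgpu_create);

  /* usage (paths illustrative): echo 1 > /sys/bus/pci/devices/0000:00:02.0/vgpu_create */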
>
Will have a look at the VF management, thanks for the info.
> As for virtual vs physical resources and interrupts, part of the
> purpose of vfio is to abstract a device into basic components. It's up
> to the bus driver how accesses to each space map to the physical
> device. Take for instance PCI config space, the existing vfio-pci
> driver emulates some portions of config space for the user.
>
>> Questions:
>>
>> [1] For VFIO No-IOMMU mode (!iommu_present), I saw it was reverted
>> upstream in ae5515d66362 (Revert: "vfio: Include No-IOMMU mode").
>> In my opinion, vfio_type2 doesn't rely on it to support the No-IOMMU
>> case; instead it needs a new implementation which fits both
>> w/ and w/o a hardware IOMMU. Is this correct?
>>
>
> vfio no-iommu has also been re-added for v4.5 (03a76b60f8ba), this was
> simply a case that the kernel development outpaced the intended user
> and I didn't want to commit to the user api changes until it had been
> completely vetted. In any case, vgpu should have no dependency
> whatsoever on no-iommu. As above, I think vgpu should create virtual
> devices and add them to an iommu group, similar to how no-iommu does,
> but without the kernel tainting because you are actually providing
> isolation through other means than a system iommu.
>
Thanks for confirmation.
>> For things not mentioned above, we might have them discussed in
>> other threads, or temporarily maintained in a TODO list (we might get
>> back to them after the big picture gets agreed):
>>
>>
>> - How to expose guest framebuffer via VFIO for SPICE;
>
> Potentially through a new, device specific region, which I think can be
> done within the existing vfio API. The API can already expose an
> arbitrary number of regions to the user, it's just a matter of how we
> tell the user the purpose of a region index beyond the fixed set we map
> to PCI resources.
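
A hedged userspace sketch of how that could look with the API as it stands today; the extra index value and what it means ("guest framebuffer") are exactly the part still to be defined:

  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Ask the vfio device for a region beyond the fixed PCI set
   * (BARs, ROM, config); size/offset tell us where to mmap() or read() it. */
  static int query_extra_region(int device_fd, unsigned int extra,
                                struct vfio_region_info *info)
  {
          memset(info, 0, sizeof(*info));
          info->argsz = sizeof(*info);
          info->index = VFIO_PCI_NUM_REGIONS + extra;

          return ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info);
  }

If such a region were advertised as mmap-capable, QEMU (or a SPICE integration layer) could then map it at info->offset much as it does for BARs.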
>
>> - How to avoid double translation with two stages (GTT + IOMMU):
>> whether an identity map is possible, and if yes, how to make it
>> more efficient;
>>
>> - Application acceleration
>> You mentioned that with VFIO, a vGPU may be used by
>> applications to get GPU acceleration. It's a potential
>> opportunity to use vGPU in containers, worthy of
>> further investigation.
>
> Yes, interesting topics. Thanks,
>
Looks like things are getting clearer overall, with small exceptions.
Thanks for the advice :)
> Alex
>
--
Thanks,
Jike