[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu
From: |
Avi Kivity |
Subject: |
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu |
Date: |
Tue, 31 Jul 2012 15:34:38 +0300 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120615 Thunderbird/13.0.1 |
On 07/31/2012 01:29 AM, Alex Williamson wrote:
>>
>> If the region size is zero, then both memory_region_del_subregion()
>> (assuming the region is parented) and munmap() do nothing. So you could
>> call this unconditionally.
>
> I suppose parenting them is the key. I'm counting on memory_region_size
> of zero for an uninitialized, g_malloc0() MemoryRegion.
That's a no-no. We have APIs for a reason. Maybe I'll start encrypting
the contents by xoring with a private variable.
> Initializing
> them just to have a parent so we can unconditionally remove them here
> seems like it's just shifting complexity from one function to another.
> The majority of BARs aren't even implemented, so we'd actually be
> setting up a lot of dummy infrastructure for a slightly cleaner unmap
> function. I'll keep looking at this, but I'm not optimistic there's an
> overall simplification here.
Ok. But use your own bool, don't overload an something from MemoryRegion.
>
>> >> > +
>> >> > + if (vdev->msix && vdev->msix->table_bar == nr) {
>> >> > + size = memory_region_size(&vdev->msix->mmap_mem);
>> >> > + if (size) {
>> >> > + memory_region_del_subregion(&bar->mem,
>> >> > &vdev->msix->mmap_mem);
>> >> > + munmap(vdev->msix->mmap, size);
>> >> > + }
>> >> > + }
>> >
>> > And this one potentially unmaps the overlap after the vector table if
>> > there's any space for one.
>> >
>> >> Are the three size checks needed? Everything should work without them
>> >> from the memory core point of view.
>> >
>> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
>> > NULL... no?
>>
>> NULL isn't the problem (well some kernels protect against mmaping NULL
>> to avoid kernel exploits), but it seems the kernel doesn't like a zero
>> length.
>
> in mm/mmap.c:do_munmap() I see:
>
> if ((len = PAGE_ALIGN(len)) == 0)
> return -EINVAL;
>
> Before anything scary happens, so that should be ok. It's not really
> worthwhile to call the munmaps unconditionally if we already have the
> condition tests because the subregions are unparented though.
Yeah.
>
>> >> > +
>> >> > + /*
>> >> > + * We can't mmap areas overlapping the MSIX vector table, so we
>> >> > + * potentially insert a direct-mapped subregion before and after
>> >> > it.
>> >> > + */
>> >>
>> >> This splitting is what the memory core really enjoys. You can just
>> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
>> >
>> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
>> > reasons. It might be worthwhile to someday make VFIO insert an
>> > anonymous page over the MSI-X table to allow this, but it didn't look
>> > trivial for my novice mm abilities. Easy to add a flag from the VFIO
>> > kernel structure where we learn about this BAR if we add it in the
>> > future.
>>
>> I meant due it purely in qemu. Instead of an emulated region overlaid
>> by two assigned regions, have an assigned region overlaid by the
>> emulated region. The regions seen by the vfio listener will be the same.
>
> Sure, that's what KVM device assignment does, but it requires being able
> to mmap the whole BAR, including an MSI-X table. The VFIO kernel side
> can't assume userspace isn't malicious so it has to prevent this.
I wonder whether it should prevent the mmap(), or let it though and just
SIGBUS on accesses.
>> >
>> > This is actually kind of complicated. Opening /dev/vfio/vfio gives us
>> > an instance of a container in the kernel. A group can only be attached
>> > to one container. So whoever calls us with passed fds needs to track
>> > this very carefully. This is also why I've dropped any kind of shared
>> > IOMMU option to give us a hint whether to try to cram everything in the
>> > same container (~= iommu domain). It's too easy to pass conflicting
>> > info to share a container for one device, but not another... yet they
>> > may be in the same group. I'll work on the fd passing though and try to
>> > come up with a reasonable model.
>>
>> I didn't really follow the container stuff so I can't comment here. But
>> suppose all assigned devices are done via fd passing, isn't it
>> sufficient to just pass the fd for the device (and keep the iommu group
>> fd in the managment tool)?
>
> Nope.
>
> containerfd = open(/dev/vfio/vfio)
> groupfd = open(/dev/vfio/$GROUPID)
> devicefd = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)
>
> The container provides access to the iommu, the group is the unit of
> ownership and privilege, and device cannot be accessed without iommu
> protection. Therefore to get to a devicefd, we first need to privilege
> the container by attaching a group to it, that let's us initialize the
> iommu, which allows us to get the device fd. At a minimum, we'd need
> both container and device fds, which means libvirt would be responsible
> for determining what type of iommu interface to initialize. Doing that
> makes adding another device tenuous. It's not impossible, but VFIO is
> design such that /dev/vfio/vfio is completely harmless on it's own, safe
> for mode 0666 access, just like /dev/kvm. The groupfd is the important
> access point, so maybe it's sufficient that libvirt could pass only that
> and let qemu open /dev/vfio/vfio on it's own. The only problem then is
> that libvirt needs to pass the same groupfd for each device that gets
> assigned within a group.
What I was thinking was that libvirt would do all the setup, including
attaching the iommu, then pass something that is safe to qemu. I don't
see an issue with libvirt keeping tracks of groups; libvirt is supposed
to be doing the host-side management anyway. But I'm not familiar with
the API so I guess it can't be done. Maybe an extension?
>
>> >> > +
>> >> > +
>> >> > +typedef struct MSIVector {
>> >> > + EventNotifier interrupt; /* eventfd triggered on interrupt */
>> >> > + struct VFIODevice *vdev; /* back pointer to device */
>> >> > + int vector; /* the vector number for this element */
>> >> > + int virq; /* KVM irqchip route for Qemu bypass */
>> >>
>> >> This calls for an abstraction (don't we have a cache where we look those
>> >> up?)
>> >
>> > I haven't see one, pointer? I tried to follow vhost's lead here.
>>
>> See kvm_irqchip_send_msi(). But this isn't integrated with irqfd yet.
>
> Right, the irqfd is what we're really after.
Ok, I guess both vhost and vfio could use a qemu_irq_eventfd() which
creates an irqfd if available, or emulates it by adding a listener to
that eventfd and injecting the interrupt (either through tcg or kvm) itself.
>
>> >> > + bool use;
>> >> > +} MSIVector;
>> >> > +
>> >> > +
>> >> > +typedef struct VFIOContainer {
>> >> > + int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> >> > + struct {
>> >> > + /* enable abstraction to support various iommu backends */
>> >> > + union {
>> >> > + MemoryListener listener; /* Used by type1 iommu */
>> >> > + };
>> >>
>> >> The usual was is to have a Type1VFIOContainer deriving from
>> >> VFIOContainer and adding a MemoryListener.
>> >
>> > Yep, that would work too. It gets a bit more complicated that way
>> > though because we have to know when the container is allocated what type
>> > it's going to be. This way we can step though possible iommu types and
>> > support the right one. Eventually there may be more than one type
>> > supported on the same platform (ex. one that enables PRI). Do-able, but
>> > I'm not sure it's worth it at this point.
>>
>> An alternative alternative is to put a pointer to an abstract type here,
>> then you can defer the decision on the concrete type later. But I agree
>> it's not worth it at this point. Maybe just drop the union and decide
>> later when a second iommu type is added.
>
> A pointer doesn't allow us to use container_of to get back to the
> VFIOContainer from the memory listener callback, so we'd have to create
> some new struct just to hold that back pointer. Alexey's proposed POWER
> support for VFIO already makes use of the union, so it seems like a
> sufficient solution for now. We'll have to re-evaluate if it's getting
> unwieldy after we get a few though.
Ok.
--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu, Alex Williamson, 2012/07/26
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu, Avi Kivity, 2012/07/26
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu, Blue Swirl, 2012/07/27