[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
From: |
Alex Williamson |
Subject: |
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework |
Date: |
Wed, 09 Nov 2011 16:40:39 -0700 |
On Wed, 2011-11-09 at 15:08 -0600, Christian Benvenuti (benve) wrote:
<snip>
> > > > +
> > > > +struct vfio_group {
> > > > + dev_t devt;
> > > > + unsigned int groupid;
> > >
> > > This groupid is returned by the device_group callback you recently
> > added
> > > with a separate (not yet in tree) IOMMU patch.
> > > Is it correct to say that the scope of this ID is the bus the iommu
> > > belongs too (but you use it as if it was global)?
> > > I believe there is nothing right now to ensure the uniqueness of such
> > > ID across bus types (assuming there will be other bus drivers in the
> > > future besides vfio-pci).
> > > If that's the case, the vfio.group_list global list and the
> > __vfio_lookup_dev
> > > routine should be changed to account for the bus too?
> > > Ops, I just saw the error msg in vfio_group_add_dev about the group
> > id conflict.
> > > Is that warning related to what I mentioned above?
> >
> > Yeah, this is a concern, but I can't think of a system where we would
> > manifest a collision. The IOMMU driver is expected to provide unique
> > groupids for all devices below them, but we could imagine a system that
> > implements two different bus_types, each with a different IOMMU driver
> > and we have no coordination between them. Perhaps since we have
> > iommu_ops per bus, we should also expose the bus in the vfio group
> > path,
> > ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..). This
> > means userspace would need to do a readlink of the subsystem entry
> > where
> > it finds the iommu_group to find the vfio group. Reasonable?
>
> Most probably we won't see use cases with multiple buses anytime soon, but
> this scheme you proposed (with the per-bus subdir) looks good to me.
Ok, I think that's easier than any scheme of trying to organize globally
unique groupids instead of just bus_type unique. That makes group
objects internally matched by the {groupid, bus} pair.
<snip>
> > >
> > > I looked at how you take care of ref counts ...
> > >
> > > This is how the tree of vfio_iommu/vfio_group/vfio_device data
> > > Structures is organized (I'll use just iommu/group/dev to make
> > > the graph smaller):
> > >
> > > iommu
> > > / \
> > > / \
> > > group ... group
> > > / \ / \
> > > / \ / \
> > > dev .. dev dev .. dev
> > >
> > > This is how you get a file descriptor for the three kind of objects:
> > >
> > > - group : open /dev/vfio/xxx for group xxx
> > > - iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD
> > > - device: group ioctl VFIO_GROUP_GET_DEVICE_FD
> > >
> > > Given the above topology, I would assume that:
> > >
> > > (1) an iommu is 'inuse' if : a) iommu refcnt > 0, or
> > > b) any of its groups is 'inuse'
> > >
> > > (2) a group is 'inuse' if : a) group refcnt > 0, or
> > > b) any of its devices is 'inuse'
> > >
> > > (3) a device is 'inuse' if : a) device refcnt > 0
> >
> > (2) is a bit debatable. I've wrestled with this one for a while. The
> > vfio_iommu serves two purposes. First, it is the object we use for
> > managing iommu domains, which includes allocating domains and attaching
> > devices to domains. Groups objects aren't involved here, they just
> > manage the set of devices. The second role is to manage merged groups,
> > because whether or not groups can be merged is a function of iommu
> > domain compatibility.
> >
> > So if we look at "is the iommu in use?" ie. can I destroy the mapping
> > context, detach devices and free the domain, the reference count on the
> > group is irrelevant. The user has to have a device or iommu file
> > descriptor opened somewhere, across the group or merged group, for that
> > context to be maintained. A reasonable requirement, I think.
>
> OK, then if you close all devices and the iommu, keeping the group open
> Would not protect the iommu domain mapping. This means that if you (or
> A management application) need to close all devices+iommu and reopen
> right away again the same devices+iommu you may get a failure on the
> iommu domain creation (supposing the system goes out of resources).
> Is this just a very unlikely scenario?
Can you think of a use case that would require such? I can't.
> I guess in this case you would simply have to avoid releasing the iommu
> fd, right?
Right. We could also debate whether we should drop all iommu mappings
when the iommu refcnt goes to zero. We don't currently do that, but it
might make sense.
>
> > However, if we ask "is the group in use?" ie. can I not only destroy
> > the
> > mappings above, but also automatically tear apart merged groups, then I
> > think we need to look at the group refcnt.
>
> Correct.
>
> > There's also a symmetry factor, the group is a benign entry point to
> > device access. It's only when device or iommu access is granted that
> > the group gains any real power. Therefore, shouldn't that power also
> > be
> > removed when those access points are closed?
> >
> > > You have coded the 'inuse' logic with these three routines:
> > >
> > > __vfio_iommu_inuse, which implements (1) above
> > >
> > > and
> > > __vfio_iommu_groups_inuse
> >
> > Implements (2.a)
>
> Yes, but for al groups at once.
Right
> > > __vfio_group_devs_inuse
> >
> > Implements (2.b)
>
> Yes
>
> > > which are used by __vfio_iommu_inuse.
> > > Why don't you check the group refcnt in __vfio_iommu_groups_inuse?
> >
> > Hopefully explained above, but open for discussion.
> >
> > > Would it make sense (and the code more readable) to structure the
> > > nested refcnt/inuse check like this?
> > > (The numbers (1)(2)(3) refer to the three 'inuse' conditions above)
> > >
> > > (1)__vfio_iommu_inuse
> > > |
> > > +-> check iommu refcnt
> > > +-> __vfio_iommu_groups_inuse
> > > |
> > > +->LOOP: (2)__vfio_iommu_group_inuse<--MISSING
> > > |
> > > +-> check group refcnt<--MISSING
> > > +-> __vfio_group_devs_inuse()
> > > |
> > > +-> LOOP: (3)__vfio_group_dev_inuse<--MISSING
> > > |
> > > +-> check device refcnt
> >
> > We currently do:
> >
> > (1)__vfio_iommu_inuse
> > |
> > +-> check iommu refcnt
> > +-> __vfio_group_devs_inuse
> > |
> > +->LOOP: (2.b)__vfio_group_devs_inuse
> > |
> > +-> LOOP: (3) check device refcnt
> >
> > If that passes, the iommu context can be dissolved and we follow up
> > with:
> >
> > __vfio_iommu_groups_inuse
> > |
> > +-> LOOP: (2.a)__vfio_iommu_groups_inuse
> > |
> > +-> check group refcnt
> >
> > If that passes, groups can also be umerged.
> >
> > Is this right?
>
> Yes, assuming we stick to the "benign" role of groups you
> described above.
Ok, no change then. Thanks for looking at that so closely.
<snip>
> > > > +static int vfio_group_merge(struct vfio_group *group, int fd)
> > >
> > > The documentation in vfio.txt explains clearly the logic implemented
> > by
> > > the merge/unmerge group ioctls.
> > > However, what you are doing is not merging groups, but rather
> > adding/removing
> > > groups to/from iommus (and creating flat lists of groups).
> > > For example, when you do
> > >
> > > merge(A,B)
> > >
> > > you actually mean to say "merge B to the list of groups assigned to
> > the
> > > same iommu as group A".
> >
> > It's actually a little more than that. After you've merged B into A,
> > you can close the file descriptor for B and access all of the devices
> > for the merged group from A.
>
> It is actually more...
>
> Scenario 1:
>
> create_grp(A)
> create_grp(B)
> ...
> merge_grp(A,B)
> create_grp(C)
> merge_grp(C,B) ... this works, right?
No, but merge_grp(B,C) does. I currently require that the incoming
group has no open device or iommu file descriptors and is a singular
group. The device/iommu is a hard requirement since we'll be changing
the iommu context and can't leave an attack window. The singular group
is an implementation detail. Given the iommu/device requirement, it's
just as easy for userspace to tear apart the group and pass each
individually.
> Scenario 2:
>
> create_grp(A)
> create_grp(B)
> fd_x = get_dev_fd(B,x)
> ...
> merge_grp(A,B)
NAK, fails no open device test. Again, merge_grp(B,A) is supported.
> create_grp(C)
> merge_grp(A,C)
Yep, this works.
> fd_x = get_dev_fd(C,x)
Yep, and if x is they same in both cases, you'll get 2 different file
descriptors backed by the same device.
> Those two examples seems to suggest me more of a list-abstraction than a
> merge abstraction.
> However, if it fits into the agreed syntax/logic it is ok, as long as we
> document it
> properly.
Can you suggest documentation changes that would make this more clear?
> > > For the same reason, you do not really need to provide the group you
> > want
> > > to unmerge from, which means that instead of
> > >
> > > unmerge(A,B)
> > >
> > > you would just need
> > >
> > > unmerge(B)
> >
> > Good point, we can avoid the awkward reference via file descriptor for
> > the unmerge.
> >
> > > I understand the reason why it is not a real merge/unmerge (ie, to
> > keep the
> > > original groups so that you can unmerge later)
> >
> > Right, we still need to have visibility of the groups comprising the
> > merged group, but the abstraction provided to the user seems to be
> > deeper than you're thinking.
> >
> > > ... however I just wonder if
> > > it wouldn't be more natural to implement the
> > VFIO_IOMMU_ADD_GROUP/DEL_GROUP
> > > iommu ioctls instead? (the relationships between the data structure
> > would
> > > remain the same)
> > > I guess you already discarded this option for some reasons, right?
> > What was
> > > the reason?
> >
> > It's a possibility, I'm not sure it was discussed or really what
> > advantage it provides. It seems like we'd logically lose the ability
> > to
> > access devices from other groups,
>
> What is the real (immediate) benefit of this capability?
Mostly convenience, but also promotes the peer idea where merged groups
simply create a "super" group that can access the iommu and all the
devices of the member groups. On x86 we expect that merging groups will
always succeed and groups will typically have a single device, so a
driver could merge them all together, throw away all the extra group
file descriptors and manage the whole super group via a single group fd.
> > whether that's good or bad, I don't know. I think the notion of "merge"
> > promotes the idea that the groups
> > are peers and an iommu_add/del feels a bit more hierarchical.
>
> I agree.
<snip>
> > > > + if (!device) {
> > > > + if (__vfio_group_devs_inuse(group) ||
> > > > + (group->iommu && group->iommu->refcnt)) {
> > > > + printk(KERN_WARNING
> > > > + "Adding device %s to group %u while
> > > > group is
> > > > already in use!!\n",
> > > > + dev_name(dev), group->groupid);
> > > > + /* XXX How to prevent other drivers from
> > > > claiming? */
> > >
> > > Here we are adding a device (not yet assigned to a vfio bus) to a
> > group
> > > that is already in use.
> > > Given that it would not be acceptable for this device to get assigned
> > > to a non vfio driver, why not forcing such assignment here then?
> >
> > Exactly, I just don't know the mechanics of how to make that happen and
> > was hoping for suggestions...
> >
> > > I am not sure though what the best way to do it would be.
> > > What about something like this:
> > >
> > > - when the bus vfio-pci processes the BUS_NOTIFY_ADD_DEVICE
> > > notification it assigns to the device a PCI ID that will make sure
> > > the vfio-pci's probe routine will be invoked (and no other driver
> > can
> > > therefore claim the device). That PCI ID would have to be added
> > > to the vfio_pci_driver's id_table (it would be the exception to the
> > > "only dynamic IDs" rule). Too hackish?
> >
> > Presumably some other driver also has the ID in it's id_table, how do
> > we make sure we win?
>
> By mangling such ID (when processing the BUS_NOTIFY_ADD_DEVICE notification)
> to
> match against a 'fake' ID registered in the vfio-pci table (it would be like a
> sort of driver redirect/divert). The vfio-pci's probe routine would restore
> the original ID (we do not want to confuse userspace). This is hackish, I
> agree.
>
> What about this:
> - When vfio-pci processes the BUS_NOTIFY_ADD_DEVICE notification it can
> pre-initialize the driver pointer (via an API). We would then need to change
> the match/probe PCI mechanism too: for example, the PCI core will have to
> check
> and honor such pre-driver-initialization when present (and give it higher
> priority over the match callbacks).
> How to do this? For example, when vfio_group_add_dev is invoked, it checks
> whether the device is getting added to an already existent group where
> the other devices (well, you would need to check just one of the devices in
> the group) are already assigned to vfio-pci, and in such a case it
> pre-initialize the driver to vfio-pci.
It's ok to make a group "non-viable", we only want to intervene if the
iommu is inuse (iommu or device refcnt > 0).
>
> NOTE: By "preinit" I mean "save into the device a reference to a driver before
> the 'match' callbacks".
>
> This would be the timeline:
>
> |
> +-> new device gets added to (PCI) bus
> |
> +-> PCI: send BUS_NOTIFIER_ADD_DEVICE notification
> |
> +-> VFIO:vfio_pci_device_notifier
> | |
> | +-> BUS_NOTIFIER_ADD_DEVICE: vfio_group_add_dev
> | |
> | +->iommu_device_group(dev,&groupid)
> | +->group = <search groupid in vfio.group_list>
> | +->if (group && group_is_vfio(group))
> | | <preinit device driver to vfio-pci>
> | ...
> |
> +-> PCI: xxx
> | |
> | +-> if (!device_driver_is_preinit(dev))
> | | probe=<search driver's probe callback using 'match'>
> | | else
> | | probe=<get it from preint driver config>
> | | (+fallback to 'match' if preinit driver disappeared?)
> | |
> | +-> rc = probe(...)
> | |
> | ...
> v
> ...
>
> Of course, what if multiple drivers decide to preinit the device ?
Yep, we'd have to have a policy to BUG_ON if the preinit driver is
already set.
> One way to make it cleaner would be to:
> - have the PCI layer export an API that allows (for example) the bus
> notification callbacks (like vfio_pci_device_notifier) to preinit a driver
> - make such API reject calls on devices that already have a preinit
> driver.
> - make VFIO detect the case where vfio_pci_device_notifier can not
> preinit the driver (to vfio-pci) for the new device (because already
> preinited) and raise an error/warning.
>
> Would this look a bit cleaner?
It looks like there might already be infrastructure that we can set
dev->driver and call the driver probe() function, so maybe we're only in
trouble if dev->driver is already set when we get the bus add
notification. I just wasn't sure if that was entirely kosher. I'll
have to try that and figure out how to test it; fake hotplug maybe.
<snip>
> > > This fn below does not return any error code. Ok ...
> > > However, there are a number of errors case that you test, for example
> > > - device that does not belong to any group (according to iommu API)
> > > - device that belongs to a group but that does not appear in the list
> > > of devices of the vfio_group structure.
> > > Are the above two errors checks just paranoia or are those errors
> > actually possible?
> > > If they were possible, shouldn't we generate a warning (most probably
> > > it would be a bug in the code)?
> >
> > They're all vfio-bus driver bugs of some sort, so it's just a matter of
> > how much we want to scream about them. I'll comments on each below.
> >
> > > > +void vfio_group_del_dev(struct device *dev)
> > > > +{
> > > > + struct list_head *pos;
> > > > + struct vfio_group *group = NULL;
> > > > + struct vfio_device *device = NULL;
> > > > + unsigned int groupid;
> > > > +
> > > > + if (iommu_device_group(dev, &groupid))
> > > > + return;
> >
> > Here the bus driver is probably just sitting on a notifier list for
> > their bus_type and a device is getting removed. Unless we want to
> > require the bus driver to track everything it's attempted to add and
> > whether it worked, we can just ignore this.
>
> OK, I see what you mean. If vfio_group_add_dev fails for some reasons we
> do not keep track of it. Right?
The primary thing I'm thinking of here is not vfio_group_add_dev()
failing for "some reason", but specifically failing because the device
doesn't have a groupid, ie. it's not behind an iommu. In that case it's
just a random device that can't be used by vfio.
> Would it make sense to add one special group to vfio.group_list (or better
> On a separate field of the vfio structure) whose goal
> would be just that: keep track of those devices that failed to be added
> to the VFIO framework (can it help for debugging too?)?
For the above case, no, we shouldn't need to track those. But it does
seem like there's a gap for devices that fail vfio_group_add_dev() for
other reasons. I don't think we want a special group for them, because
that isolates them from other devices that are potentially in the same
group. I think instead what we want to do is set a taint flag on the
group. We can do a BUG_ON not being able to allocate a group, then a
WARN_ON if we fail elsewhere and mark the group tainted so it's
effectively never viable.
<snip>
> > > > + if (!vfio_device_removeable(device)) {
> > > > + /* XXX signal for all devices in group to be removed or
> > > > + * resort to killing the process holding the device fds.
> > > > + * For now just block waiting for releases to wake us.
> > > > */
> > > > + wait_event(vfio.release_q,
> > > > vfio_device_removeable(device));
> > >
> > > Any new idea/proposal on how to handle this situation?
> > > The last one I remember was to leave the soft/hard/etc timeout
> > handling in
> > > userspace and implement it as a sort of policy. Is that one still the
> > most
> > > likely candidate solution to handle this situation?
> >
> > I haven't heard any new proposals. I think we need the hard timeout
> > handling in the kernel. We can't leave it to userspace to decide they
> > get to keep the device. We could have this tunable via an ioctl, but I
> > don't see how we wouldn't require CAP_SYS_ADMIN (or similar) to tweak
> > it. I was intending to re-implement the netlink interface to signal
> > the
> > removal, but expect to get allergic reactions to that.
>
> (I personally like the async netlink signaling, but I am OK with an ioctl
> based
> mechanism if it provides the same flexibility)
>
> What would be a reasonable hard timeout?
I think we were looking at 10s of seconds in the old vfio code. Tough
call though. Could potentially provide a module_param override so an
admin that trusts their users could set long/infinite timeout. Thanks,
Alex
Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework, Konrad Rzeszutek Wilk, 2011/11/11