From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
Date: Tue, 28 Jun 2016 20:46:00 -0600

On Tue, 28 Jun 2016 18:32:44 +0530
Kirti Wankhede <address@hidden> wrote:

> On 6/22/2016 9:16 AM, Alex Williamson wrote:
> > On Mon, 20 Jun 2016 22:01:48 +0530
> > Kirti Wankhede <address@hidden> wrote:
> >   
> >>  
> >>  struct vfio_iommu {
> >>    struct list_head        domain_list;
> >> +  struct vfio_domain      *mediated_domain;  
> > 
> > I'm not really a fan of how this is so often used to special case the
> > code...
> >   
> >>    struct mutex            lock;
> >>    struct rb_root          dma_list;
> >>    bool                    v2;
> >> @@ -67,6 +69,13 @@ struct vfio_domain {
> >>    struct list_head        group_list;
> >>    int                     prot;           /* IOMMU_CACHE */
> >>    bool                    fgsp;           /* Fine-grained super pages */
> >> +
> >> +  /* Domain for mediated device which is without physical IOMMU */
> >> +  bool                    mediated_device;  
> > 
> > But sometimes we use this to special case the code and other times we
> > use domain_list being empty.  I thought the argument against pulling
> > code out to a shared file was that this approach could be made
> > maintainable.
> >   
> 
> Functions that take a struct vfio_domain *domain argument and are
> intended to operate on that domain only check
> (domain->mediated_device), for example map_try_harder(),
> vfio_iommu_replay() and vfio_test_domain_fgsp(). The checks in these
> functions could be removed, but then it would be the caller's
> responsibility to make sure they are never called for the mediated_domain.
> Functions that take a struct vfio_iommu *iommu argument and traverse
> domain_list to find a domain, or operate on each domain in domain_list,
> check (list_empty(&iommu->domain_list)) instead, for example
> vfio_unmap_unpin(), vfio_iommu_map() and vfio_dma_do_map().

My point is that we have different test elements at different points in
the data structures and they all need to be kept in sync and the right
one used at the right place, which makes the code all that much more
complex versus the alternative approach of finding commonality,
extracting it into a shared file, and creating a mediated version of
the type1 iommu that doesn't try to overload dual functionality into a
single code block. 
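
(For illustration only, a minimal sketch of the two test styles under
discussion, using names from the quoted patch; function bodies elided:)

/* Per-domain flag: used by functions that operate on one vfio_domain */
static int vfio_iommu_replay(struct vfio_iommu *iommu,
                             struct vfio_domain *domain)
{
        if (domain->mediated_device)    /* no hardware IOMMU behind this */
                return 0;
        /* ... replay DMA mappings into the hardware IOMMU domain ... */
        return 0;
}

/* Empty domain_list: used by functions that operate on the vfio_iommu */
static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
        if (list_empty(&iommu->domain_list))    /* mediated-only container */
                return;
        /* ... unmap and unpin through each hardware IOMMU domain ... */
}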

> >> +
> >> +  struct mm_struct        *mm;
> >> +  struct rb_root          pfn_list;       /* pinned Host pfn list */
> >> +  struct mutex            pfn_list_lock;  /* mutex for pfn_list */  
> > 
> > Seems like we could reduce overhead for the existing use cases by just
> > adding a pointer here and making these last 3 entries part of the
> > structure that gets pointed to.  Existence of the pointer would replace
> > @mediated_device.
> >  
> 
> Ok.
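
(A minimal sketch of the layout being suggested; the struct name
vfio_mdev_data and the mdev_data field are hypothetical, the other
vfio_domain fields are as in the existing type1 code:)

/* Allocated only for a mediated domain; its presence replaces
 * @mediated_device */
struct vfio_mdev_data {
        struct mm_struct        *mm;
        struct rb_root          pfn_list;       /* pinned host pfn list */
        struct mutex            pfn_list_lock;  /* mutex for pfn_list */
};

struct vfio_domain {
        struct iommu_domain     *domain;
        struct list_head        next;
        struct list_head        group_list;
        int                     prot;           /* IOMMU_CACHE */
        bool                    fgsp;           /* Fine-grained super pages */
        struct vfio_mdev_data   *mdev_data;     /* non-NULL => mediated */
};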
> 
> >>  };
> >>  
> >>  struct vfio_dma {
> >> @@ -79,10 +88,26 @@ struct vfio_dma {
> >>  
> >>  struct vfio_group {
> >>    struct iommu_group      *iommu_group;
> >> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)  
> > 
> > Where does CONFIG_MDEV_MODULE come from?
> > 
> > Plus, all the #ifdefs... <cringe>
> >   
> 
> The MDEV config option is tristate; when it is selected as a module,
> CONFIG_MDEV_MODULE is set in include/generated/autoconf.h.
> The symbols mdev_bus_type, mdev_get_device_by_group() and
> mdev_put_device() are only available when the MDEV option is selected
> as built-in or modular. If the MDEV option is not selected, the
> vfio_iommu_type1 module should still work for direct device assignment.
> Without these #ifdefs, the vfio_iommu_type1 module fails to load with
> undefined symbols when MDEV is not selected.

I guess I just hadn't seen the _MODULE define used before, but it does
appear to be fairly common.  Another option might be to provide stubs
or static inline abstractions in a header file so the #ifdefs can be
isolated.  It also seems like this is going to mean that type1 now
depends on and will autoload the mdev module even for physical
assignment.  That's not terribly desirable.
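
(One possible shape for such a header, sketched under the assumption that
the mdev symbols keep the signatures used in this series; the mdev_bus()
wrapper is hypothetical and would replace the open-coded
bus == &mdev_bus_type test:)

/* mdev.h (sketch): IS_ENABLED() covers both built-in and modular MDEV */
#if IS_ENABLED(CONFIG_MDEV)
struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
void mdev_put_device(struct mdev_device *mdev);
bool mdev_bus(struct bus_type *bus);
#else
static inline struct mdev_device *
mdev_get_device_by_group(struct iommu_group *group)
{
        return NULL;
}
static inline void mdev_put_device(struct mdev_device *mdev)
{
}
static inline bool mdev_bus(struct bus_type *bus)
{
        return false;
}
#endif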

> >> +  struct mdev_device      *mdev;  
> > 
> > This gets set on attach_group where we use the iommu_group to lookup
> > the mdev, so why can't we do that on the other paths that make use of
> > this?  I think this is just holding a reference.
> >   
> 
> mdev is retrieved from attach_group for 2 reasons:
> 1. To increase the ref count of the mdev, via mdev_get_device_by_group(),
> when its iommu_group is attached. That count is decremented, via
> mdev_put_device(), when its iommu_group is detached. This makes sure
> that the mdev is not freed until its iommu_group is detached from the
> container.
> 
> 2. To save a reference to iommu_data, which the vendor driver uses to
> call vfio_pin_pages() and vfio_unpin_pages(). More details below.
> 
> 
> 
> >> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >> +                   int prot, unsigned long *pfn)
> >>  {
> >>    struct page *page[1];
> >>    struct vm_area_struct *vma;
> >> +  struct mm_struct *local_mm = mm;
> >>    int ret = -EFAULT;
> >>  
> >> -  if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> >> +  if (!local_mm && !current->mm)
> >> +          return -ENODEV;
> >> +
> >> +  if (!local_mm)
> >> +          local_mm = current->mm;  
> > 
> > The above would be much more concise if we just initialized local_mm
> > as: mm ? mm : current->mm
> >   
> >> +
> >> +  down_read(&local_mm->mmap_sem);
> >> +  if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
> >> +                          !!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {  
> > 
> > Um, the comment for get_user_pages_remote says:
> > 
> > "See also get_user_pages_fast, for performance critical applications."
> > 
> > So what penalty are we imposing on the existing behavior of type1
> > here?  Previously we only needed to acquire mmap_sem if
> > get_user_pages_fast() didn't work, so the existing use case seems to be
> > compromised.
> >  
> 
> Yes.
> get_user_pages_fast() pins pages from current->mm, but for a mediated
> device the mm could be different from current->mm.
> 
> This penalty for the existing behavior could be avoided by:
> if (!mm && current->mm)
>     get_user_pages_fast(); //take fast path
> else
>     get_user_pages_remote(); // take slow path


How to avoid it is pretty obvious; the concern is that the overhead of
the existing use case isn't being prioritized.
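
(A sketch of the shape being described above: keep get_user_pages_fast()
for the current-process case and only take mmap_sem and the remote
variant when an external mm is supplied. Signatures follow the kernel
APIs already quoted in this thread; the PFNMAP fallback of the existing
code is elided:)

static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
                         int prot, unsigned long *pfn)
{
        struct page *page[1];
        int ret;

        if (!mm && !current->mm)
                return -ENODEV;

        if (!mm || mm == current->mm) {
                /* existing type1 fast path, no mmap_sem needed */
                ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
                                          page);
        } else {
                /* mediated case: pin on behalf of another process's mm */
                down_read(&mm->mmap_sem);
                ret = get_user_pages_remote(NULL, mm, vaddr, 1,
                                            !!(prot & IOMMU_WRITE), 0,
                                            page, NULL);
                up_read(&mm->mmap_sem);
        }

        if (ret == 1) {
                *pfn = page_to_pfn(page[0]);
                return 0;
        }

        /* ... fall back to follow_pfn() for PFNMAP vmas, as before ... */
        return -EFAULT;
}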

> >> +long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
> >> +               int prot)
> >> +{
> >> +  struct vfio_iommu *iommu = iommu_data;
> >> +  struct vfio_domain *domain = NULL;
> >> +  long unlocked = 0;
> >> +  int i;
> >> +
> >> +  if (!iommu || !pfn)
> >> +          return -EINVAL;
> >> +
> >> +  if (!iommu->mediated_domain)
> >> +          return -EINVAL;
> >> +
> >> +  domain = iommu->mediated_domain;  
> > 
> > Again, domain is already validated here.
> >   
> >> +
> >> +  for (i = 0; i < npage; i++) {
> >> +          struct vfio_pfn *p;
> >> +
> >> +          /* verify if pfn exist in pfn_list */
> >> +          p = vfio_find_pfn(domain, *(pfn + i));  
> > 
> > Why are we using array indexing above and array math here?  Were these
> > functions written by different people?
> >   
> 
> No, the input argument to vfio_unpin_pages() has always been an array
> of pfns to be unpinned.

My comment is in regard to how the code added for vfio_pin_pages() uses
array indexing (i.e., pfn[i]) while here we use array math (i.e.,
*(pfn + i)).  Don't arbitrarily mix them; be consistent.

> >> +          if (!p)
> >> +                  continue;  
> > 
> > Hmm, this seems like more of a bad thing than a continue.
> >   
> 
> The callers of vfio_unpin_pages() are other modules. I feel it's better
> to do a sanity check than to crash.

Who said anything about crashing?  We have a return value for a reason,
right?
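
(In other words, something along these lines; a sketch rather than the
posted code, with the actual unpin and accounting elided:)

        for (i = 0; i < npage; i++) {
                struct vfio_pfn *p;

                p = vfio_find_pfn(domain, pfn[i]);
                if (!p) {
                        unlocked = -EINVAL;  /* report bad pfn to the caller */
                        break;
                }
                /* ... unpin p and update accounting as in the patch ... */
        }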

> >>  
> >>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>  {
> >> @@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>  
> >>    if (!dma->size)
> >>            return;
> >> +
> >> +  if (list_empty(&iommu->domain_list))
> >> +          return;  
> > 
> > Huh?  This would be a serious consistency error if this happened for
> > the existing use case.
> >  
> 
> This will not happen for the existing use case, i.e. direct device
> assignment. This case only arises when only mediated devices are
> assigned and there are no directly assigned devices.

Which is sort of my point; you're using an arbitrary property to
identify a mediated device vfio_iommu vs a directly assigned one.  This
fits in with my complaint about how many different special cases are
being thrown into the code, which increases the overall code complexity.

> >> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>    uint64_t mask;
> >>    struct vfio_dma *dma;
> >>    unsigned long pfn;
> >> +  struct vfio_domain *domain = NULL;
> >>  
> >>    /* Verify that none of our __u64 fields overflow */
> >>    if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> >> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>    /* Insert zero-sized and grow as we map chunks of it */
> >>    vfio_link_dma(iommu, dma);
> >>  
> >> +  /*
> >> +   * Skip pin and map if domain list is empty
> >> +   */
> >> +  if (list_empty(&iommu->domain_list)) {
> >> +          dma->size = size;
> >> +          goto map_done;
> >> +  }  
> > 
> > Again, this would be a serious consistency error for the existing use
> > case.  Let's use indicators that are explicit.
> >  
> 
> Why? For the existing use case (i.e. direct device assignment)
> domain_list will not be empty; domain_list will only be empty when
> there are mediated devices assigned and no directly assigned devices.

I'm trying to be cautious about whether it actually makes sense to
dual-purpose the existing type1 iommu backend.  What's the benefit of
putting an exit path in the middle of a function versus splitting it
into two separate functions with two separate callers, one of which
only calls the first function?  What's the benefit of a struct
vfio_iommu that hosts both mediated and directly assigned devices?  Is
the benefit to the functionality or to the code base?  Should the fact
that they use the same iommu API dictate that they're also managed in
the same data structures?  When we have too many special case exits or
branches, the code complexity increases, bugs are harder to flush out,
and possible exploits are more likely.  Convince me that this is the
right approach.
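
(Schematically, the kind of split being asked about; all function names
below are hypothetical, not from the series:)

/* Mediated backend: only the bookkeeping half is needed */
static int vfio_mediated_dma_map(struct vfio_iommu *iommu,
                                 struct vfio_iommu_type1_dma_map *map)
{
        return vfio_dma_record(iommu, map);     /* track iova/vaddr/size */
}

/* Direct-assignment backend: bookkeeping, then pin and IOMMU map */
static int vfio_direct_dma_map(struct vfio_iommu *iommu,
                               struct vfio_iommu_type1_dma_map *map)
{
        int ret = vfio_dma_record(iommu, map);

        return ret ? ret : vfio_dma_pin_and_map(iommu, map);
}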
 
> >>  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>                                     struct iommu_group *iommu_group)
> >>  {
> >>    struct vfio_iommu *iommu = iommu_data;
> >> -  struct vfio_group *group, *g;
> >> +  struct vfio_group *group;
> >>    struct vfio_domain *domain, *d;
> >>    struct bus_type *bus = NULL;
> >>    int ret;
> >> @@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>    mutex_lock(&iommu->lock);
> >>  
> >>    list_for_each_entry(d, &iommu->domain_list, next) {
> >> -          list_for_each_entry(g, &d->group_list, next) {
> >> -                  if (g->iommu_group != iommu_group)
> >> -                          continue;
> >> +          if (is_iommu_group_present(d, iommu_group)) {
> >> +                  mutex_unlock(&iommu->lock);
> >> +                  return -EINVAL;
> >> +          }
> >> +  }
> >>  
> >> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> >> +  if (iommu->mediated_domain) {
> >> +          if (is_iommu_group_present(iommu->mediated_domain,
> >> +                                     iommu_group)) {
> >>                    mutex_unlock(&iommu->lock);
> >>                    return -EINVAL;
> >>            }
> >>    }
> >> +#endif
> >>  
> >>    group = kzalloc(sizeof(*group), GFP_KERNEL);
> >>    domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> >> @@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>    if (ret)
> >>            goto out_free;
> >>  
> >> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> >> +  if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
> >> +          struct mdev_device *mdev = NULL;  
> > 
> > Unnecessary initialization.
> >   
> >> +
> >> +          mdev = mdev_get_device_by_group(iommu_group);
> >> +          if (!mdev)
> >> +                  goto out_free;
> >> +
> >> +          mdev->iommu_data = iommu;  
> > 
> > This looks rather sketchy to me, we don't have a mediated driver in
> > this series, but presumably the driver blindly calls vfio_pin_pages
> > passing mdev->iommu_data and hoping that it's either NULL to generate
> > an error or relevant to this iommu backend.  How would we add a second
> > mediated driver iommu backend?  We're currently assuming the user
> > configured this backend.    
> 
> If I understand correctly, your question is about two different mediated
> devices assigned to the same container. In that case, the two mediated
> devices will have different iommu_groups and will be added to the
> mediated_domain's group_list (iommu->mediated_domain->group_list).

No, my concern is that mdev->iommu_data is opaque data specific to the
type1 extensions here, but vfio_pin_pages() is effectively a completely
separate API.  Mediated devices end up with sort of a side channel
into this one iommu, which breaks the modularity of vfio iommus.  So
let's say we create a type2 interface that also supports mediated
devices, do the mediated drivers still call vfio_pin_pages()?  Do we
need to export new interfaces for every iommu backend to support this?
The interface should probably be through the vfio container, IOW
extensions to the vfio_iommu_driver_ops or maybe direct use of the
ioctl callback within that interface such that the pinning is actually
paired with the correct driver and extensible.
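
(For reference, a sketch of what such an extension might look like; the
pin_pages/unpin_pages callbacks are hypothetical additions, not existing
members of vfio_iommu_driver_ops:)

/* include/linux/vfio.h (sketch) */
struct vfio_iommu_driver_ops {
        char            *name;
        struct module   *owner;
        void            *(*open)(unsigned long arg);
        void            (*release)(void *iommu_data);
        long            (*ioctl)(void *iommu_data, unsigned int cmd,
                                 unsigned long arg);
        int             (*attach_group)(void *iommu_data,
                                        struct iommu_group *group);
        void            (*detach_group)(void *iommu_data,
                                        struct iommu_group *group);
        /* hypothetical additions for mediated device pinning: */
        long            (*pin_pages)(void *iommu_data,
                                     unsigned long *user_pfn, long npage,
                                     int prot, unsigned long *phys_pfn);
        long            (*unpin_pages)(void *iommu_data, unsigned long *pfn,
                                       long npage);
};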
 
> > Should vfio_pin_pages instead have a struct
> > device* parameter from which we would lookup the iommu_group and get to
> > the vfio_domain?  That's a bit heavy weight, but we need something
> > along those lines.
> >   
> 
> There could be multiple mdev devices from the same mediated vendor
> driver in one container. In that case, that vendor driver needs a
> reference to the container or container->iommu_data to pin and unpin
> pages. Similarly, there could be multiple mdev devices from different
> mediated vendor drivers in one container; in that case each vendor
> driver needs a reference to the container or container->iommu_data to
> pin and unpin pages in its driver.

As above, I think something like this is necessary, the proposed
interface here is a bit of a side-channel hack.

> >> +          mdev->iommu_data = iommu;  
> With the above line, a reference to container->iommu_data is kept in the
> mdev structure when the iommu_group is attached to a container, so that
> vendor drivers can find the reference they need to pin and unpin pages.
> 
> If struct device* is passed as an argument to vfio_pin_pages, then to
> find the container of struct device *dev we have to find the
> vfio_device/vfio_group from dev, which means traversing vfio.group_list
> for each pin and unpin call. This list would be long when there are many
> mdev devices in the system.
> 
> Is there any better way to find a reference to the container from a
> struct device*?

struct vfio_device *device = dev_get_drvdata(mdev->dev);

device->group->container

Note of course that vfio_device and vfio_group are private to vfio.c.
We do already have a vfio_device_get_from_dev() function though, so the
vendor could call into a vfio.c function with a reference to the
vfio_device.
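
(Putting those pieces together, a vfio.c entry point might look roughly
like this; locking around the container and its iommu_driver is omitted,
and the pin_pages callback is the hypothetical driver-ops extension
sketched earlier, not code from this series:)

long vfio_pin_pages(struct device *dev, unsigned long *user_pfn, long npage,
                    int prot, unsigned long *phys_pfn)
{
        struct vfio_device *device = vfio_device_get_from_dev(dev);
        struct vfio_container *container;
        long ret = -ENODEV;

        if (!device)
                return -ENODEV;

        container = device->group->container;
        if (container && container->iommu_driver &&
            container->iommu_driver->ops->pin_pages)
                ret = container->iommu_driver->ops->pin_pages(
                                container->iommu_data, user_pfn, npage,
                                prot, phys_pfn);

        vfio_device_put(device);
        return ret;
}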

> >>  
> >> @@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
> >>    struct vfio_domain *domain, *domain_tmp;
> >>    struct vfio_group *group, *group_tmp;
> >>  
> >> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> >> +  if (iommu->mediated_domain) {
> >> +          domain = iommu->mediated_domain;
> >> +          list_for_each_entry_safe(group, group_tmp,
> >> +                                   &domain->group_list, next) {
> >> +                  if (group->mdev) {
> >> +                          group->mdev->iommu_data = NULL;
> >> +                          mdev_put_device(group->mdev);
> >> +                  }
> >> +                  list_del(&group->next);
> >> +                  kfree(group);
> >> +          }
> >> +          vfio_iommu_unpin_api_domain(domain);
> >> +          kfree(domain);
> >> +          iommu->mediated_domain = NULL;
> >> +  }
> >> +#endif  
> > 
> > I'm not really seeing how this is all that much more maintainable than
> > what was proposed previously, has this aspect been worked on since last
> > I reviewed this patch?
> >   
> 
> There aren't many changes from v4 to v5 of this patch.
> Can you be more specific about your concerns regarding maintainability?
> I'll definitely address them.

In reply to comments on v3 of the series you said you'd prefer to work
on making the bimodal approach more maintainable rather than splitting
out common code and creating a separate module for type1-direct vs
type1-mediated.  To make the code more maintainable, I think we need to
see fewer special cases and clean data paths for each type, with some
attention paid to how we detect one type vs the other.  The killer
feature that would really justify the complexity of this approach would
be locked page accounting that avoids duplicate counts between
interfaces.  As it is, vfio_pin_pages() calls vfio_pin_pages_internal(),
which adds to the user's locked_vm regardless of whether those pages are
already locked by a directly assigned device, or even by a previous call
to vfio_pin_pages() for the same range, perhaps from a different
mediated device.  So we end up with userspace needing to set the locked
memory limit once for any number of directly assigned devices, but then
bump it up for each mediated device, by a value which may depend on the
type of mediated device.  That's clearly a rat's nest for userspace to
have to guess what it needs to do, and solving it would pretty strongly
justify such tight integration.  Thanks,

Alex
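
(Editorial aside: one way the double counting could be avoided is to have
the pinning path consult the pfn_list rb-tree before charging locked_vm.
vfio_find_pfn() is from the quoted series; the ref_count handling,
vfio_add_to_pfn_list() and vfio_lock_acct() below are schematic
stand-ins for whatever helpers the final code uses:)

        /* In the mediated pin path, charge locked_vm only on first pin */
        p = vfio_find_pfn(domain, pfn);
        if (p) {
                p->ref_count++;                 /* already pinned and counted */
        } else {
                vfio_add_to_pfn_list(domain, iova, pfn, prot);
                vfio_lock_acct(1);              /* first pin of this page */
        }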


