From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices
Date: Thu, 13 Oct 2016 11:12:01 -0600

On Thu, 13 Oct 2016 20:04:43 +0530
Kirti Wankhede <address@hidden> wrote:

> On 10/12/2016 3:36 AM, Alex Williamson wrote:
> > On Tue, 11 Oct 2016 01:58:34 +0530
> > Kirti Wankhede <address@hidden> wrote:
> >  
> ...
> 
> 
> >> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
> >> +{
> >> +  struct vfio_device *device;
> >> +  struct vfio_group *group;
> >> +  int ret;
> >> +
> >> +  device = vfio_device_get_from_dev(dev);  
> >
> > Note how this does dev->iommu_group->vfio_group->vfio_device and then
> > we back out one level to get the vfio_group; it's not a terribly
> > lightweight path.  Perhaps we should have:
> >
> > struct vfio_group *vfio_group_get_from_dev(struct device *dev)
> > {
> >         struct iommu_group *iommu_group;
> >         struct vfio_group *group;
> >
> >         iommu_group = iommu_group_get(dev);
> >         if (!iommu_group)
> >                 return NULL;
> >
> >         group = vfio_group_get_from_iommu(iommu_group);
> >         iommu_group_put(iommu_group);
> >
> >         return group;
> > }
> >
> > vfio_device_get_from_dev() would make use of this.
> >
> > Then create a separate:
> >
> > static int vfio_group_add_container_user(struct vfio_group *group)
> > {
> >         if (!atomic_inc_not_zero(&group->container_users))
> >                 return -EINVAL;
> >
> >         if (group->noiommu) {
> >                 atomic_dec(&group->container_users);
> >                 return -EPERM;
> >         }
> >
> >         if (!group->container->iommu_driver ||
> >             !vfio_group_viable(group)) {
> >                 atomic_dec(&group->container_users);
> >                 return -EINVAL;
> >         }
> >
> >         return 0;
> > }
> >
> > vfio_group_get_external_user() would be updated to use this.  In fact,
> > creating these two functions and updating the existing code to use
> > these should be a separate patch.
> >  
> 
> Ok. I'll update.
> 
> 
> > Note that your version did not hold a group reference while doing the
> > pin/unpin operations below, which seems like a bug.
> >  
> 
> container->group_lock is held for pin/unpin. I think we then don't have
> to hold a reference to the group, because groups are attached and detached
> while holding this lock, right?
> 
> 
> >> +
> >> +err_ret:
> >> +  vfio_device_put(device);
> >> +  return ERR_PTR(ret);
> >> +}
> >> +
> >> +/*
> >> + * Pin a set of guest PFNs and return their associated host PFNs for local
> >> + * domain only.
> >> + * @dev [in] : device
> >> + * @user_pfn [in]: array of user/guest PFNs
> >> + * @npage [in]: count of array elements
> >> + * @prot [in] : protection flags
> >> + * @phys_pfn[out] : array of host PFNs
> >> + */
> >> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> >> +              long npage, int prot, unsigned long *phys_pfn)
> >> +{
> >> +  struct vfio_container *container;
> >> +  struct vfio_group *group;
> >> +  struct vfio_iommu_driver *driver;
> >> +  ssize_t ret = -EINVAL;
> >> +
> >> +  if (!dev || !user_pfn || !phys_pfn)
> >> +          return -EINVAL;
> >> +
> >> +  group = vfio_group_from_dev(dev);
> >> +  if (IS_ERR(group))
> >> +          return PTR_ERR(group);  
> >
> > As suggested above:
> >
> >     group = vfio_group_get_from_dev(dev);
> >     if (!group)
> >             return -ENODEV;
> >
> >     ret = vfio_group_add_container_user(group);
> >     if (ret) {
> >             vfio_group_put(group);
> >             return ret;
> >     }
> >  
> 
> Ok.
> 
> 
> >> +
> >> +  container = group->container;
> >> +  if (IS_ERR(container))
> >> +          return PTR_ERR(container);
> >> +
> >> +  down_read(&container->group_lock);
> >> +
> >> +  driver = container->iommu_driver;
> >> +  if (likely(driver && driver->ops->pin_pages))
> >> +          ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> >> +                                       npage, prot, phys_pfn);
> >> +
> >> +  up_read(&container->group_lock);
> >> +  vfio_group_try_dissolve_container(group);  
> >
> > Even if you're considering that the container_user reference holds the
> > driver, I think we need a group reference throughout this and this
> > should end with a vfio_group_put(group);
> >  
> 
> Same as I mentioned above, container->group_lock is held here.

What allows you to assume that your @group pointer is valid when you
finish with vfio_group_try_dissolve_container()?  You have no reference
to the group, and the only device reference you have is the struct device,
not the vfio_device, so the device might have been unbound from vfio.  I'm
still inclined to believe you need to hold the reference to the group.
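
Roughly what I have in mind, as an untested sketch that assumes the
vfio_group_get_from_dev() and vfio_group_add_container_user() helpers
suggested above:

long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
                    long npage, int prot, unsigned long *phys_pfn)
{
        struct vfio_container *container;
        struct vfio_group *group;
        struct vfio_iommu_driver *driver;
        long ret;

        if (!dev || !user_pfn || !phys_pfn)
                return -EINVAL;

        /* hold a group reference for the entire operation */
        group = vfio_group_get_from_dev(dev);
        if (!group)
                return -ENODEV;

        ret = vfio_group_add_container_user(group);
        if (ret)
                goto out_put;

        container = group->container;
        down_read(&container->group_lock);

        driver = container->iommu_driver;
        if (likely(driver && driver->ops->pin_pages))
                ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
                                             npage, prot, phys_pfn);
        else
                ret = -ENOTTY;

        up_read(&container->group_lock);
        vfio_group_try_dissolve_container(group);
out_put:
        vfio_group_put(group);          /* group stays valid until here */
        return ret;
}

That way there's no window where the group could disappear between the
container_users increment and vfio_group_try_dissolve_container().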

> >> +
> >> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
> >> +                                   long npage)
> >> +{
> >> +  struct vfio_iommu *iommu = iommu_data;
> >> +  struct vfio_domain *domain = NULL;
> >> +  long unlocked = 0;
> >> +  int i;
> >> +
> >> +  if (!iommu || !pfn)
> >> +          return -EINVAL;
> >> +  
> >
> > We need iommu->lock here, right?
> >  
> 
> Oh, yes.
> 
> >> +  domain = iommu->local_domain;
> >> +
> >> +  for (i = 0; i < npage; i++) {
> >> +          struct vfio_pfn *p;
> >> +
> >> +          mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> +          /* verify if pfn exist in pfn_list */
> >> +          p = vfio_find_pfn(domain, pfn[i]);
> >> +          if (p)
> >> +                  unlocked += vfio_unpin_pfn(domain, p, true);
> >> +
> >> +          mutex_unlock(&domain->local_addr_space->pfn_list_lock);  
> >
> > We hold this mutex outside the loop in the pin unwind case, why is it
> > different here?
> >  
> 
> pin_unwind is an error condition, so it should be done in one go.
> This is not an error case; holding the lock for long could block other
> threads if there are multiple threads.


Ok, iommu->lock will need to be inside the loop then too or else
there's likely no gain anyway.
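
i.e., something like this (just a sketch, reusing the names from your
patch):

static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
                                         long npage)
{
        struct vfio_iommu *iommu = iommu_data;
        long unlocked = 0;
        int i;

        if (!iommu || !pfn)
                return -EINVAL;

        for (i = 0; i < npage; i++) {
                struct vfio_domain *domain;
                struct vfio_pfn *p;

                /* take and drop both locks per iteration so other
                 * map/unmap/unpin callers can interleave */
                mutex_lock(&iommu->lock);
                domain = iommu->local_domain;
                if (!domain) {
                        mutex_unlock(&iommu->lock);
                        break;
                }

                mutex_lock(&domain->local_addr_space->pfn_list_lock);
                /* verify the pfn exists in pfn_list before unpinning */
                p = vfio_find_pfn(domain, pfn[i]);
                if (p)
                        unlocked += vfio_unpin_pfn(domain, p, true);
                mutex_unlock(&domain->local_addr_space->pfn_list_lock);

                mutex_unlock(&iommu->lock);
        }

        return unlocked;
}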


> >> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> >> +                      size_t map_size)
> >> +{
> >> +  dma_addr_t iova = dma->iova;
> >> +  unsigned long vaddr = dma->vaddr;
> >> +  size_t size = map_size, dma_size = 0;
> >> +  long npage;
> >> +  unsigned long pfn;
> >> +  int ret = 0;
> >> +
> >> +  while (size) {
> >> +          /* Pin a contiguous chunk of memory */
> >> +          npage = __vfio_pin_pages_remote(vaddr + dma_size,
> >> +                                          size >> PAGE_SHIFT, dma->prot,
> >> +                                          &pfn);
> >> +          if (npage <= 0) {
> >> +                  WARN_ON(!npage);
> >> +                  ret = (int)npage;
> >> +                  break;
> >> +          }
> >> +
> >> +          /* Map it! */
> >> +          ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
> >> +                               dma->prot);
> >> +          if (ret) {
> >> +                  __vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
> >> +                  break;
> >> +          }
> >> +
> >> +          size -= npage << PAGE_SHIFT;
> >> +          dma_size += npage << PAGE_SHIFT;
> >> +  }
> >> +
> >> +  if (ret)
> >> +          vfio_remove_dma(iommu, dma);  
> >
> >
> > There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> > accurate to the point of failure, it's not updated until the success
> > branch below, so it's never going to unmap/unpin anything.
> >  
> 
> Ops, yes. I'll fix this.
> 
> >> +  else {
> >> +          dma->size = dma_size;
> >> +          dma->iommu_mapped = true;
> >> +          vfio_update_accounting(iommu, dma);  
> >
> > I'm confused how this works, when called from vfio_dma_do_map() we're
> > populating a vfio_dma, that is we're populating part of the iova space
> > of the device.  How could we have pinned pfns in the local address
> > space that overlap that?  It would be invalid to have such pinned pfns
> > since that part of the iova space was not previously mapped.
> >
> > Another issue is that if there were existing overlaps, userspace would
> > need to have locked memory limits sufficient for this temporary double
> > accounting.  I'm not sure how they'd come up with heuristics to handle
> > that since we're potentially looking at the bulk of VM memory in a
> > single vfio_dma entry.
> >  
> 
> I see that when QEMU boots a VM where a vGPU device is attached first and
> a pass-through device is attached later, pin and iommu_map are skipped on
> the first call to vfio_dma_do_map(). Then, when the pass-through device is
> attached, all mappings are unmapped and vfio_dma_do_map() is called again.
> At that moment an IOMMU-capable domain is present, so pin and iommu_map()
> are done on all system memory. Now, if any pages are pinned by the vendor
> driver between these two device attaches, the accounting should be updated.

So that actually points out something that was on my todo list to check
in this patch: when an unmap occurs, we need to invalidate the vendor
driver mappings.  For the period you describe above, the mappings the
vendor driver holds are invalid; we cannot assume that they will
return and certainly cannot assume they will have the same GPA to HVA
mapping.  So the sequence should be that the unmap causes invalidation
of any potential vendor mappings, and then there's no reason the pfn
path would need to update accounting on a vfio_dma_do_map(): it should
not be possible that anything is currently pinned within that IOVA
range.
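
Very roughly, I'd expect the unmap path to call something like the below
before tearing down a vfio_dma (sketch only; it assumes the local
pfn_list is the rb-tree in local_addr_space and that each vfio_pfn
records the iova it was pinned for, which may mean extending your
structures):

static void vfio_local_unpin_range(struct vfio_iommu *iommu,
                                   dma_addr_t iova, size_t size)
{
        struct vfio_domain *domain = iommu->local_domain;
        struct rb_node *n;

        if (!domain)
                return;

        mutex_lock(&domain->local_addr_space->pfn_list_lock);
        n = rb_first(&domain->local_addr_space->pfn_list);
        while (n) {
                struct vfio_pfn *p = rb_entry(n, struct vfio_pfn, node);

                n = rb_next(n);
                /* any vendor pinning within the unmapped iova range
                 * is now stale and must be dropped */
                if (p->iova >= iova && p->iova < iova + size)
                        vfio_unpin_pfn(domain, p, true);
        }
        mutex_unlock(&domain->local_addr_space->pfn_list_lock);
}

Whether that's a direct walk like this or a callback into the vendor
driver is a separate question, but the vendor driver can't be left
holding pfns for an iova range the user has unmapped.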

> >> +  }
> >> +
> >> +  return ret;
> >> +}
> >> +
> >>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>                       struct vfio_iommu_type1_dma_map *map)
> >>  {
> >>    dma_addr_t iova = map->iova;
> >>    unsigned long vaddr = map->vaddr;
> >>    size_t size = map->size;
> >> -  long npage;
> >>    int ret = 0, prot = 0;
> >>    uint64_t mask;
> >>    struct vfio_dma *dma;
> >> -  unsigned long pfn;
> >>
> >>    /* Verify that none of our __u64 fields overflow */
> >>    if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> >> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>    /* Insert zero-sized and grow as we map chunks of it */
> >>    vfio_link_dma(iommu, dma);
> >>
> >> -  while (size) {
> >> -          /* Pin a contiguous chunk of memory */
> >> -          npage = vfio_pin_pages(vaddr + dma->size,
> >> -                                 size >> PAGE_SHIFT, prot, &pfn);
> >> -          if (npage <= 0) {
> >> -                  WARN_ON(!npage);
> >> -                  ret = (int)npage;
> >> -                  break;
> >> -          }
> >> -
> >> -          /* Map it! */
> >> -          ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
> >> -          if (ret) {
> >> -                  vfio_unpin_pages(pfn, npage, prot, true);
> >> -                  break;
> >> -          }
> >> -
> >> -          size -= npage << PAGE_SHIFT;
> >> -          dma->size += npage << PAGE_SHIFT;
> >> -  }
> >> -
> >> -  if (ret)
> >> -          vfio_remove_dma(iommu, dma);
> >> +  /* Don't pin and map if container doesn't contain IOMMU capable domain */
> >> +  if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> >> +          dma->size = size;
> >> +  else
> >> +          ret = vfio_pin_map_dma(iommu, dma, size);
> >>
> >>    mutex_unlock(&iommu->lock);
> >>    return ret;
> >> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >>    d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
> >>    n = rb_first(&iommu->dma_list);
> >>
> >> -  /* If there's not a domain, there better not be any mappings */
> >> -  if (WARN_ON(n && !d))
> >> -          return -EINVAL;
> >> -
> >>    for (; n; n = rb_next(n)) {
> >>            struct vfio_dma *dma;
> >>            dma_addr_t iova;
> >> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >>            iova = dma->iova;
> >>
> >>            while (iova < dma->iova + dma->size) {
> >> -                  phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> >> +                  phys_addr_t phys;
> >>                    size_t size;
> >>
> >> -                  if (WARN_ON(!phys)) {
> >> -                          iova += PAGE_SIZE;
> >> -                          continue;
> >> -                  }
> >> +                  if (dma->iommu_mapped) {
> >> +                          phys = iommu_iova_to_phys(d->domain, iova);
> >> +
> >> +                          if (WARN_ON(!phys)) {
> >> +                                  iova += PAGE_SIZE;
> >> +                                  continue;
> >> +                          }
> >>
> >> -                  size = PAGE_SIZE;
> >> +                          size = PAGE_SIZE;
> >>
> >> -                  while (iova + size < dma->iova + dma->size &&
> >> -                         phys + size == iommu_iova_to_phys(d->domain,
> >> +                          while (iova + size < dma->iova + dma->size &&
> >> +                              phys + size == iommu_iova_to_phys(d->domain,
> >>                                                             iova + size))
> >> -                          size += PAGE_SIZE;
> >> +                                  size += PAGE_SIZE;
> >> +                  } else {
> >> +                          unsigned long pfn;
> >> +                          unsigned long vaddr = dma->vaddr +
> >> +                                               (iova - dma->iova);
> >> +                          size_t n = dma->iova + dma->size - iova;
> >> +                          long npage;
> >> +
> >> +                          npage = __vfio_pin_pages_remote(vaddr,
> >> +                                                          n >> PAGE_SHIFT,
> >> +                                                          dma->prot,
> >> +                                                          &pfn);
> >> +                          if (npage <= 0) {
> >> +                                  WARN_ON(!npage);
> >> +                                  ret = (int)npage;
> >> +                                  return ret;
> >> +                          }
> >> +
> >> +                          phys = pfn << PAGE_SHIFT;
> >> +                          size = npage << PAGE_SHIFT;
> >> +                  }
> >>
> >>                    ret = iommu_map(domain->domain, iova, phys,
> >>                                    size, dma->prot | domain->prot);
> >> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >>
> >>                    iova += size;
> >>            }
> >> +
> >> +          if (!dma->iommu_mapped) {
> >> +                  dma->iommu_mapped = true;
> >> +                  vfio_update_accounting(iommu, dma);
> >> +          }  
> >
> > This is the case where we potentially have pinned pfns and we've added
> > an iommu mapped device and need to adjust accounting.  But we've fully
> > pinned and accounted the entire iommu mapped space while still holding
> > the accounting for any pfn mapped space.  So for a time, assuming some
> > pfn pinned pages, we have duplicate accounting.  How does userspace
> > deal with that?  For instance, if I'm using an mdev device where the
> > vendor driver has pinned 512MB of guest memory, then I hot-add an
> > assigned NIC and the entire VM address space gets pinned, that pinning
> > will fail unless my locked memory limits are at least 512MB in excess
> > of my VM size.  Additionally, the user doesn't know how much memory the
> > vendor driver is going to pin, it might be the whole VM address space,
> > so the user would need 2x the locked memory limits.
> >  
> 
> Is RLIMIT_MEMLOCK set that low? I get your point. I'll update
> __vfio_pin_pages_remote() to check whether the page being pinned has
> already been accounted, within __vfio_pin_pages_remote() itself.

I believe we currently support running a VM with RLIMIT set to exactly
the VM memory size.  We should not regress from that.  libvirt provides
a small "fudge factor", but we shouldn't count on it.
 
> >>    }
> >>
> >>    return 0;
> >> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
> >>    __free_pages(pages, order);
> >>  }
> >>
> >> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> >> +                             struct iommu_group *iommu_group)
> >> +{
> >> +  struct vfio_group *g;
> >> +
> >> +  list_for_each_entry(g, &domain->group_list, next) {
> >> +          if (g->iommu_group == iommu_group)
> >> +                  return g;
> >> +  }
> >> +
> >> +  return NULL;
> >> +}  
> >
> > It would make review easier if changes like splitting this into a
> > separate function with no functional change on the calling path could
> > be a separate patch.
> >  
> 
> OK.
> 
> Thanks,
> Kirti
> 



