Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices


From: Kirti Wankhede
Subject: Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices
Date: Thu, 13 Oct 2016 20:04:43 +0530


On 10/12/2016 3:36 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:34 +0530
> Kirti Wankhede <address@hidden> wrote:
>
...


>> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
>> +{
>> +    struct vfio_device *device;
>> +    struct vfio_group *group;
>> +    int ret;
>> +
>> +    device = vfio_device_get_from_dev(dev);
>
> Note how this does dev->iommu_group->vfio_group->vfio_device and then
> we back out one level to get the vfio_group, it's not a terribly
> lightweight path.  Perhaps we should have:
>
> struct vfio_group *vfio_group_get_from_dev(struct device *dev)
> {
>         struct iommu_group *iommu_group;
>         struct vfio_group *group;
>
>         iommu_group = iommu_group_get(dev);
>         if (!iommu_group)
>                 return NULL;
>
>         group = vfio_group_get_from_iommu(iommu_group);
>       iommu_group_put(iommu_group);
>
>       return group;
> }
>
> vfio_device_get_from_dev() would make use of this.
>
> Then create a separate:
>
> static int vfio_group_add_container_user(struct vfio_group *group)
> {
>
>> +    if (!atomic_inc_not_zero(&group->container_users)) {
>               return -EINVAL;
>> +    }
>> +
>> +    if (group->noiommu) {
>> +            atomic_dec(&group->container_users);
>               return -EPERM;
>> +    }
>> +
>> +    if (!group->container->iommu_driver ||
>> +        !vfio_group_viable(group)) {
>> +            atomic_dec(&group->container_users);
>               return -EINVAL;
>> +    }
>> +
>       return 0;
> }
>
> vfio_group_get_external_user() would be updated to use this.  In fact,
> creating these two functions and updating the existing code to use
> these should be a separate patch.
>

Ok. I'll update.
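
Just to confirm I understood the refactoring, the two existing callers would
then look roughly like this (only a sketch of my understanding, untested):

struct vfio_device *vfio_device_get_from_dev(struct device *dev)
{
	struct vfio_group *group;
	struct vfio_device *device;

	group = vfio_group_get_from_dev(dev);
	if (!group)
		return NULL;

	device = vfio_group_get_device(group, dev);
	vfio_group_put(group);

	return device;
}

struct vfio_group *vfio_group_get_external_user(struct file *filep)
{
	struct vfio_group *group = filep->private_data;
	int ret;

	if (filep->f_op != &vfio_group_fops)
		return ERR_PTR(-EINVAL);

	ret = vfio_group_add_container_user(group);
	if (ret)
		return ERR_PTR(ret);

	vfio_group_get(group);

	return group;
}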


> Note that your version did not hold a group reference while doing the
> pin/unpin operations below, which seems like a bug.
>

container->group_lock is held across pin/unpin. I think we then don't need to
hold a reference to the group, because groups are attached and detached while
holding this lock, right?


>> +
>> +err_ret:
>> +    vfio_device_put(device);
>> +    return ERR_PTR(ret);
>> +}
>> +
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for local
>> + * domain only.
>> + * @dev [in] : device
>> + * @user_pfn [in]: array of user/guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @phys_pfn[out] : array of host PFNs
>> + */
>> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>> +                long npage, int prot, unsigned long *phys_pfn)
>> +{
>> +    struct vfio_container *container;
>> +    struct vfio_group *group;
>> +    struct vfio_iommu_driver *driver;
>> +    ssize_t ret = -EINVAL;
>> +
>> +    if (!dev || !user_pfn || !phys_pfn)
>> +            return -EINVAL;
>> +
>> +    group = vfio_group_from_dev(dev);
>> +    if (IS_ERR(group))
>> +            return PTR_ERR(group);
>
> As suggested above:
>
>       group = vfio_group_get_from_dev(dev);
>       if (!group)
>               return -ENODEV;
>
>       ret = vfio_group_add_container_user(group);
>       if (ret) {
>               vfio_group_put(group);
>               return ret;
>       }
>

Ok.


>> +
>> +    container = group->container;
>> +    if (IS_ERR(container))
>> +            return PTR_ERR(container);
>> +
>> +    down_read(&container->group_lock);
>> +
>> +    driver = container->iommu_driver;
>> +    if (likely(driver && driver->ops->pin_pages))
>> +            ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>> +                                         npage, prot, phys_pfn);
>> +
>> +    up_read(&container->group_lock);
>> +    vfio_group_try_dissolve_container(group);
>
> Even if you're considering that the container_user reference holds the
> driver, I think we need a group reference throughout this and this
> should end with a vfio_group_put(group);
>

Same as I mentioned above, container->group_lock is held here.

...

>> +
>> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
>> +                                     long npage)
>> +{
>> +    struct vfio_iommu *iommu = iommu_data;
>> +    struct vfio_domain *domain = NULL;
>> +    long unlocked = 0;
>> +    int i;
>> +
>> +    if (!iommu || !pfn)
>> +            return -EINVAL;
>> +
>
> We need iommu->lock here, right?
>

Oh, yes.

>> +    domain = iommu->local_domain;
>> +
>> +    for (i = 0; i < npage; i++) {
>> +            struct vfio_pfn *p;
>> +
>> +            mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +            /* verify if pfn exist in pfn_list */
>> +            p = vfio_find_pfn(domain, pfn[i]);
>> +            if (p)
>> +                    unlocked += vfio_unpin_pfn(domain, p, true);
>> +
>> +            mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>
> We hold this mutex outside the loop in the pin unwind case, why is it
> different here?
>

pin_unwind is the error path, so it should be done in one go. This is not an
error case; holding the lock across the whole loop could block other threads
if multiple threads are pinning/unpinning concurrently.
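
With the iommu->lock added as you pointed out above, I'm thinking of
something along these lines (rough sketch only, the per-page pfn_list_lock
kept as in v8):

static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
					 long npage)
{
	struct vfio_iommu *iommu = iommu_data;
	struct vfio_domain *domain;
	long unlocked = 0;
	int i;

	if (!iommu || !pfn)
		return -EINVAL;

	mutex_lock(&iommu->lock);

	domain = iommu->local_domain;

	for (i = 0; i < npage; i++) {
		struct vfio_pfn *p;

		mutex_lock(&domain->local_addr_space->pfn_list_lock);

		/* verify if pfn exists in pfn_list */
		p = vfio_find_pfn(domain, pfn[i]);
		if (p)
			unlocked += vfio_unpin_pfn(domain, p, true);

		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
	}

	mutex_unlock(&iommu->lock);

	/* accounting update on 'unlocked' stays as in v8 */
	return unlocked;
}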



>> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>> +                        size_t map_size)
>> +{
>> +    dma_addr_t iova = dma->iova;
>> +    unsigned long vaddr = dma->vaddr;
>> +    size_t size = map_size, dma_size = 0;
>> +    long npage;
>> +    unsigned long pfn;
>> +    int ret = 0;
>> +
>> +    while (size) {
>> +            /* Pin a contiguous chunk of memory */
>> +            npage = __vfio_pin_pages_remote(vaddr + dma_size,
>> +                                            size >> PAGE_SHIFT, dma->prot,
>> +                                            &pfn);
>> +            if (npage <= 0) {
>> +                    WARN_ON(!npage);
>> +                    ret = (int)npage;
>> +                    break;
>> +            }
>> +
>> +            /* Map it! */
>> +            ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
>> +                                 dma->prot);
>> +            if (ret) {
>> +                    __vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
>> +                    break;
>> +            }
>> +
>> +            size -= npage << PAGE_SHIFT;
>> +            dma_size += npage << PAGE_SHIFT;
>> +    }
>> +
>> +    if (ret)
>> +            vfio_remove_dma(iommu, dma);
>
>
> There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> accurate to the point of failure, it's not updated until the success
> branch below, so it's never going to unmap/unpin anything.
>

Oops, yes. I'll fix this.
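
For the fix I plan to grow dma->size inside the loop (as the original
vfio_dma_do_map() loop did) so that vfio_remove_dma() sees the size that was
actually mapped at the point of failure. Rough sketch, assuming dma->size is
0 on entry as it is when called from vfio_dma_do_map():

static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
			    size_t map_size)
{
	dma_addr_t iova = dma->iova;
	unsigned long vaddr = dma->vaddr;
	size_t size = map_size;
	long npage;
	unsigned long pfn;
	int ret = 0;

	while (size) {
		/* Pin a contiguous chunk of memory */
		npage = __vfio_pin_pages_remote(vaddr + dma->size,
						size >> PAGE_SHIFT, dma->prot,
						&pfn);
		if (npage <= 0) {
			WARN_ON(!npage);
			ret = (int)npage;
			break;
		}

		/* Map it! */
		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
				     dma->prot);
		if (ret) {
			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
			break;
		}

		size -= npage << PAGE_SHIFT;
		/* keep dma->size accurate for vfio_remove_dma() on failure */
		dma->size += npage << PAGE_SHIFT;
	}

	if (ret)
		vfio_remove_dma(iommu, dma);
	else {
		dma->iommu_mapped = true;
		vfio_update_accounting(iommu, dma);
	}

	return ret;
}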

>> +    else {
>> +            dma->size = dma_size;
>> +            dma->iommu_mapped = true;
>> +            vfio_update_accounting(iommu, dma);
>
> I'm confused how this works, when called from vfio_dma_do_map() we're
> populating a vfio_dma, that is we're populating part of the iova space
> of the device.  How could we have pinned pfns in the local address
> space that overlap that?  It would be invalid to have such pinned pfns
> since that part of the iova space was not previously mapped.
>
> Another issue is that if there were existing overlaps, userspace would
> need to have locked memory limits sufficient for this temporary double
> accounting.  I'm not sure how they'd come up with heuristics to handle
> that since we're potentially looking at the bulk of VM memory in a
> single vfio_dma entry.
>

I see that when QEMU boots a VM with a vGPU device attached first and a
pass-through device attached later, pinning and iommu_map() are skipped on
the first call to vfio_dma_do_map(). When the pass-through device is then
attached, all mappings are unmapped and vfio_dma_do_map() is called again.
At that point an IOMMU-capable domain is present, so pinning and iommu_map()
are done on all system memory. If the vendor driver pins any pages between
these two device attaches, the accounting should be updated.


>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>                         struct vfio_iommu_type1_dma_map *map)
>>  {
>>      dma_addr_t iova = map->iova;
>>      unsigned long vaddr = map->vaddr;
>>      size_t size = map->size;
>> -    long npage;
>>      int ret = 0, prot = 0;
>>      uint64_t mask;
>>      struct vfio_dma *dma;
>> -    unsigned long pfn;
>>
>>      /* Verify that none of our __u64 fields overflow */
>>      if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>      /* Insert zero-sized and grow as we map chunks of it */
>>      vfio_link_dma(iommu, dma);
>>
>> -    while (size) {
>> -            /* Pin a contiguous chunk of memory */
>> -            npage = vfio_pin_pages(vaddr + dma->size,
>> -                                   size >> PAGE_SHIFT, prot, &pfn);
>> -            if (npage <= 0) {
>> -                    WARN_ON(!npage);
>> -                    ret = (int)npage;
>> -                    break;
>> -            }
>> -
>> -            /* Map it! */
>> -            ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>> -            if (ret) {
>> -                    vfio_unpin_pages(pfn, npage, prot, true);
>> -                    break;
>> -            }
>> -
>> -            size -= npage << PAGE_SHIFT;
>> -            dma->size += npage << PAGE_SHIFT;
>> -    }
>> -
>> -    if (ret)
>> -            vfio_remove_dma(iommu, dma);
>> +    /* Don't pin and map if container doesn't contain IOMMU capable domain*/
>> +    if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>> +            dma->size = size;
>> +    else
>> +            ret = vfio_pin_map_dma(iommu, dma, size);
>>
>>      mutex_unlock(&iommu->lock);
>>      return ret;
>> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>      d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>>      n = rb_first(&iommu->dma_list);
>>
>> -    /* If there's not a domain, there better not be any mappings */
>> -    if (WARN_ON(n && !d))
>> -            return -EINVAL;
>> -
>>      for (; n; n = rb_next(n)) {
>>              struct vfio_dma *dma;
>>              dma_addr_t iova;
>> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>              iova = dma->iova;
>>
>>              while (iova < dma->iova + dma->size) {
>> -                    phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
>> +                    phys_addr_t phys;
>>                      size_t size;
>>
>> -                    if (WARN_ON(!phys)) {
>> -                            iova += PAGE_SIZE;
>> -                            continue;
>> -                    }
>> +                    if (dma->iommu_mapped) {
>> +                            phys = iommu_iova_to_phys(d->domain, iova);
>> +
>> +                            if (WARN_ON(!phys)) {
>> +                                    iova += PAGE_SIZE;
>> +                                    continue;
>> +                            }
>>
>> -                    size = PAGE_SIZE;
>> +                            size = PAGE_SIZE;
>>
>> -                    while (iova + size < dma->iova + dma->size &&
>> -                           phys + size == iommu_iova_to_phys(d->domain,
>> +                            while (iova + size < dma->iova + dma->size &&
>> +                                phys + size == iommu_iova_to_phys(d->domain,
>>                                                               iova + size))
>> -                            size += PAGE_SIZE;
>> +                                    size += PAGE_SIZE;
>> +                    } else {
>> +                            unsigned long pfn;
>> +                            unsigned long vaddr = dma->vaddr +
>> +                                                 (iova - dma->iova);
>> +                            size_t n = dma->iova + dma->size - iova;
>> +                            long npage;
>> +
>> +                            npage = __vfio_pin_pages_remote(vaddr,
>> +                                                            n >> PAGE_SHIFT,
>> +                                                            dma->prot,
>> +                                                            &pfn);
>> +                            if (npage <= 0) {
>> +                                    WARN_ON(!npage);
>> +                                    ret = (int)npage;
>> +                                    return ret;
>> +                            }
>> +
>> +                            phys = pfn << PAGE_SHIFT;
>> +                            size = npage << PAGE_SHIFT;
>> +                    }
>>
>>                      ret = iommu_map(domain->domain, iova, phys,
>>                                      size, dma->prot | domain->prot);
>> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>
>>                      iova += size;
>>              }
>> +
>> +            if (!dma->iommu_mapped) {
>> +                    dma->iommu_mapped = true;
>> +                    vfio_update_accounting(iommu, dma);
>> +            }
>
> This is the case where we potentially have pinned pfns and we've added
> an iommu mapped device and need to adjust accounting.  But we've fully
> pinned and accounted the entire iommu mapped space while still holding
> the accounting for any pfn mapped space.  So for a time, assuming some
> pfn pinned pages, we have duplicate accounting.  How does userspace
> deal with that?  For instance, if I'm using an mdev device where the
> vendor driver has pinned 512MB of guest memory, then I hot-add an
> assigned NIC and the entire VM address space gets pinned, that pinning
> will fail unless my locked memory limits are at least 512MB in excess
> of my VM size.  Additionally, the user doesn't know how much memory the
> vendor driver is going to pin, it might be the whole VM address space,
> so the user would need 2x the locked memory limits.
>

Is RLIMIT_MEMLOCK really set that low? Anyway, I get your point. I'll update
__vfio_pin_pages_remote() to check, within __vfio_pin_pages_remote() itself,
whether the page being pinned has already been accounted.
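
Roughly what I have in mind is something like the helper below; it is only a
sketch, and __vfio_pin_pages_remote() does not currently have access to the
local domain, so the plumbing (and the helper name) is hypothetical:

/*
 * Hypothetical helper: return true if @pfn was already pinned (and thus
 * accounted against the locked memory limit) through the local pfn_list
 * path, so the remote pin path should not account it a second time.
 */
static bool vfio_pfn_already_accounted(struct vfio_domain *local_domain,
				       unsigned long pfn)
{
	struct vfio_pfn *p;
	bool found;

	if (!local_domain)
		return false;

	mutex_lock(&local_domain->local_addr_space->pfn_list_lock);
	p = vfio_find_pfn(local_domain, pfn);
	found = (p != NULL);
	mutex_unlock(&local_domain->local_addr_space->pfn_list_lock);

	return found;
}

__vfio_pin_pages_remote() would then only add a page to the locked memory
accounting when this returns false, which should avoid the temporary double
accounting you describe.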


>>      }
>>
>>      return 0;
>> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>>      __free_pages(pages, order);
>>  }
>>
>> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
>> +                               struct iommu_group *iommu_group)
>> +{
>> +    struct vfio_group *g;
>> +
>> +    list_for_each_entry(g, &domain->group_list, next) {
>> +            if (g->iommu_group == iommu_group)
>> +                    return g;
>> +    }
>> +
>> +    return NULL;
>> +}
>
> It would make review easier if changes like splitting this into a
> separate function with no functional change on the calling path could
> be a separate patch.
>

OK.

Thanks,
Kirti


