Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated dev
From: Kirti Wankhede
Subject: Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices
Date: Thu, 13 Oct 2016 20:04:43 +0530
On 10/12/2016 3:36 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:34 +0530
> Kirti Wankhede <address@hidden> wrote:
>
...
>> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
>> +{
>> + struct vfio_device *device;
>> + struct vfio_group *group;
>> + int ret;
>> +
>> + device = vfio_device_get_from_dev(dev);
>
> Note how this does dev->iommu_group->vfio_group->vfio_device and then
> we back out one level to get the vfio_group, it's not a terribly
> lightweight path. Perhaps we should have:
>
> struct vfio_group *vfio_group_get_from_dev(struct device *dev)
> {
>         struct iommu_group *iommu_group;
>         struct vfio_group *group;
>
>         iommu_group = iommu_group_get(dev);
>         if (!iommu_group)
>                 return NULL;
>
>         group = vfio_group_get_from_iommu(iommu_group);
>         iommu_group_put(iommu_group);
>
>         return group;
> }
>
> vfio_device_get_from_dev() would make use of this.
>
> Then create a separate:
>
> static int vfio_group_add_container_user(struct vfio_group *group)
> {
>
>> +        if (!atomic_inc_not_zero(&group->container_users)) {
>                 return -EINVAL;
>> +        }
>> +
>> +        if (group->noiommu) {
>> +                atomic_dec(&group->container_users);
>                 return -EPERM;
>> +        }
>> +
>> +        if (!group->container->iommu_driver ||
>> +            !vfio_group_viable(group)) {
>> +                atomic_dec(&group->container_users);
>                 return -EINVAL;
>> +        }
>> +
>         return 0;
> }
>
> vfio_group_get_external_user() would be updated to use this. In fact,
> creating these two functions and updating the existing code to use
> these should be a separate patch.
>
Ok. I'll update.
> Note that your version did not hold a group reference while doing the
> pin/unpin operations below, which seems like a bug.
>
container->group_lock is held for pin/unpin, so I don't think we need to
hold a reference to the group: groups are attached and detached while
holding this lock, right?
>> +
>> +err_ret:
>> + vfio_device_put(device);
>> + return ERR_PTR(ret);
>> +}
>> +
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for local
>> + * domain only.
>> + * @dev [in] : device
>> + * @user_pfn [in]: array of user/guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @phys_pfn[out] : array of host PFNs
>> + */
>> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>> + long npage, int prot, unsigned long *phys_pfn)
>> +{
>> + struct vfio_container *container;
>> + struct vfio_group *group;
>> + struct vfio_iommu_driver *driver;
>> + ssize_t ret = -EINVAL;
>> +
>> + if (!dev || !user_pfn || !phys_pfn)
>> + return -EINVAL;
>> +
>> + group = vfio_group_from_dev(dev);
>> + if (IS_ERR(group))
>> + return PTR_ERR(group);
>
> As suggested above:
>
> group = vfio_group_get_from_dev(dev);
> if (!group)
>         return -ENODEV;
>
> ret = vfio_group_add_container_user(group);
> if (ret) {
>         vfio_group_put(group);
>         return ret;
> }
>
Ok.
>> +
>> + container = group->container;
>> + if (IS_ERR(container))
>> + return PTR_ERR(container);
>> +
>> + down_read(&container->group_lock);
>> +
>> + driver = container->iommu_driver;
>> + if (likely(driver && driver->ops->pin_pages))
>> + ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>> + npage, prot, phys_pfn);
>> +
>> + up_read(&container->group_lock);
>> + vfio_group_try_dissolve_container(group);
>
> Even if you're considering that the container_user reference holds the
> driver, I think we need a group reference throughout this and this
> should end with a vfio_group_put(group);
>
Same as I mentioned above, container->group_lock is held here.
...
>> +
>> +static long vfio_iommu_type1_unpin_pages(void *iommu_data,
>> +                                         unsigned long *pfn, long npage)
>> +{
>> + struct vfio_iommu *iommu = iommu_data;
>> + struct vfio_domain *domain = NULL;
>> + long unlocked = 0;
>> + int i;
>> +
>> + if (!iommu || !pfn)
>> + return -EINVAL;
>> +
>
> We need iommu->lock here, right?
>
Oh, yes.
>> + domain = iommu->local_domain;
>> +
>> + for (i = 0; i < npage; i++) {
>> + struct vfio_pfn *p;
>> +
>> + mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> + /* verify if pfn exist in pfn_list */
>> + p = vfio_find_pfn(domain, pfn[i]);
>> + if (p)
>> + unlocked += vfio_unpin_pfn(domain, p, true);
>> +
>> + mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>
> We hold this mutex outside the loop in the pin unwind case, why is it
> different here?
>
pin_unwind is an error path, so it should be done in one go. This is not
an error case; holding the lock across the whole loop could block other
threads when multiple threads are unpinning.
>> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>> +                            size_t map_size)
>> +{
>> + dma_addr_t iova = dma->iova;
>> + unsigned long vaddr = dma->vaddr;
>> + size_t size = map_size, dma_size = 0;
>> + long npage;
>> + unsigned long pfn;
>> + int ret = 0;
>> +
>> + while (size) {
>> + /* Pin a contiguous chunk of memory */
>> + npage = __vfio_pin_pages_remote(vaddr + dma_size,
>> + size >> PAGE_SHIFT, dma->prot,
>> + &pfn);
>> + if (npage <= 0) {
>> + WARN_ON(!npage);
>> + ret = (int)npage;
>> + break;
>> + }
>> +
>> + /* Map it! */
>> + ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
>> + dma->prot);
>> + if (ret) {
>> + __vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
>> + break;
>> + }
>> +
>> + size -= npage << PAGE_SHIFT;
>> + dma_size += npage << PAGE_SHIFT;
>> + }
>> +
>> + if (ret)
>> + vfio_remove_dma(iommu, dma);
>
>
> There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> accurate to the point of failure, it's not updated until the success
> branch below, so it's never going to unmap/unpin anything.
>
Oops, yes. I'll fix this.
>> + else {
>> + dma->size = dma_size;
>> + dma->iommu_mapped = true;
>> + vfio_update_accounting(iommu, dma);
>
> I'm confused how this works, when called from vfio_dma_do_map() we're
> populating a vfio_dma, that is we're populating part of the iova space
> of the device. How could we have pinned pfns in the local address
> space that overlap that? It would be invalid to have such pinned pfns
> since that part of the iova space was not previously mapped.
>
> Another issue is that if there were existing overlaps, userspace would
> need to have locked memory limits sufficient for this temporary double
> accounting. I'm not sure how they'd come up with heuristics to handle
> that since we're potentially looking at the bulk of VM memory in a
> single vfio_dma entry.
>
I see that when QEMU boots a VM with a vGPU device attached first and a
pass-through device attached second, the first call to vfio_dma_do_map()
skips pinning and iommu_map(). When the pass-through device is then
attached, all mappings are unmapped and vfio_dma_do_map() is called
again; at that point an IOMMU-capable domain is present, so pinning and
iommu_map() are done on all system memory. If the vendor driver pins any
pages between these two device attaches, the accounting has to be
updated.
>> + }
>> +
>> + return ret;
>> +}
>> +
>> static int vfio_dma_do_map(struct vfio_iommu *iommu,
>> struct vfio_iommu_type1_dma_map *map)
>> {
>> dma_addr_t iova = map->iova;
>> unsigned long vaddr = map->vaddr;
>> size_t size = map->size;
>> - long npage;
>> int ret = 0, prot = 0;
>> uint64_t mask;
>> struct vfio_dma *dma;
>> - unsigned long pfn;
>>
>> /* Verify that none of our __u64 fields overflow */
>> if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>> /* Insert zero-sized and grow as we map chunks of it */
>> vfio_link_dma(iommu, dma);
>>
>> - while (size) {
>> - /* Pin a contiguous chunk of memory */
>> - npage = vfio_pin_pages(vaddr + dma->size,
>> - size >> PAGE_SHIFT, prot, &pfn);
>> - if (npage <= 0) {
>> - WARN_ON(!npage);
>> - ret = (int)npage;
>> - break;
>> - }
>> -
>> - /* Map it! */
>> - ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>> - if (ret) {
>> - vfio_unpin_pages(pfn, npage, prot, true);
>> - break;
>> - }
>> -
>> - size -= npage << PAGE_SHIFT;
>> - dma->size += npage << PAGE_SHIFT;
>> - }
>> -
>> - if (ret)
>> - vfio_remove_dma(iommu, dma);
>> + /* Don't pin and map if container doesn't contain IOMMU capable domain */
>> + if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>> + dma->size = size;
>> + else
>> + ret = vfio_pin_map_dma(iommu, dma, size);
>>
>> mutex_unlock(&iommu->lock);
>> return ret;
>> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>> d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>> n = rb_first(&iommu->dma_list);
>>
>> - /* If there's not a domain, there better not be any mappings */
>> - if (WARN_ON(n && !d))
>> - return -EINVAL;
>> -
>> for (; n; n = rb_next(n)) {
>> struct vfio_dma *dma;
>> dma_addr_t iova;
>> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>> iova = dma->iova;
>>
>> while (iova < dma->iova + dma->size) {
>> - phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
>> + phys_addr_t phys;
>> size_t size;
>>
>> - if (WARN_ON(!phys)) {
>> - iova += PAGE_SIZE;
>> - continue;
>> - }
>> + if (dma->iommu_mapped) {
>> + phys = iommu_iova_to_phys(d->domain, iova);
>> +
>> + if (WARN_ON(!phys)) {
>> + iova += PAGE_SIZE;
>> + continue;
>> + }
>>
>> - size = PAGE_SIZE;
>> + size = PAGE_SIZE;
>>
>> - while (iova + size < dma->iova + dma->size &&
>> - phys + size == iommu_iova_to_phys(d->domain,
>> + while (iova + size < dma->iova + dma->size &&
>> + phys + size == iommu_iova_to_phys(d->domain,
>> iova + size))
>> - size += PAGE_SIZE;
>> + size += PAGE_SIZE;
>> + } else {
>> + unsigned long pfn;
>> + unsigned long vaddr = dma->vaddr +
>> + (iova - dma->iova);
>> + size_t n = dma->iova + dma->size - iova;
>> + long npage;
>> +
>> + npage = __vfio_pin_pages_remote(vaddr,
>> + n >> PAGE_SHIFT,
>> + dma->prot,
>> + &pfn);
>> + if (npage <= 0) {
>> + WARN_ON(!npage);
>> + ret = (int)npage;
>> + return ret;
>> + }
>> +
>> + phys = pfn << PAGE_SHIFT;
>> + size = npage << PAGE_SHIFT;
>> + }
>>
>> ret = iommu_map(domain->domain, iova, phys,
>> size, dma->prot | domain->prot);
>> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>
>> iova += size;
>> }
>> +
>> + if (!dma->iommu_mapped) {
>> + dma->iommu_mapped = true;
>> + vfio_update_accounting(iommu, dma);
>> + }
>
> This is the case where we potentially have pinned pfns and we've added
> an iommu mapped device and need to adjust accounting. But we've fully
> pinned and accounted the entire iommu mapped space while still holding
> the accounting for any pfn mapped space. So for a time, assuming some
> pfn pinned pages, we have duplicate accounting. How does userspace
> deal with that? For instance, if I'm using an mdev device where the
> vendor driver has pinned 512MB of guest memory, then I hot-add an
> assigned NIC and the entire VM address space gets pinned, that pinning
> will fail unless my locked memory limits are at least 512MB in excess
> of my VM size. Additionally, the user doesn't know how much memory the
> vendor driver is going to pin, it might be the whole VM address space,
> so the user would need 2x the locked memory limits.
>
Is RLIMIT_MEMLOCK really set that low? But I get your point. I'll update
__vfio_pin_pages_remote() so that it checks whether the page being
pinned is already accounted for within __vfio_pin_pages_remote() itself.
>> }
>>
>> return 0;
>> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>> __free_pages(pages, order);
>> }
>>
>> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
>> + struct iommu_group *iommu_group)
>> +{
>> + struct vfio_group *g;
>> +
>> + list_for_each_entry(g, &domain->group_list, next) {
>> + if (g->iommu_group == iommu_group)
>> + return g;
>> + }
>> +
>> + return NULL;
>> +}
>
> It would make review easier if changes like splitting this into a
> separate function with no functional change on the calling path could
> be a separate patch.
>
OK.
Thanks,
Kirti