qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotpl


From: Michael Roth
Subject: Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
Date: Fri, 10 Jul 2015 16:33:01 -0500
User-agent: alot/0.3.6

Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
> sPAPR IOMMU is managing two copies of an TCE table:
> 1) a guest view of the table - this is what emulated devices use and
> this is where H_GET_TCE reads from;
> 2) a hardware TCE table - only present if there is at least one vfio-pci
> device on a PHB; it is updated via a memory listener on a PHB address
> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
> 
> At the moment presence of vfio-pci devices on a bus affect the way
> the guest view table is allocated. If there is no vfio-pci on a PHB
> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> is allocated in KVM. However, if there is vfio-pci and we do yet not
> support KVM acceleration for these, the table has to be allocated
> by the userspace.
> 
> When vfio-pci device is hotplugged and there were no vfio-pci devices
> already, the guest view table could have been allocated by KVM which
> means that H_PUT_TCE is handled by the host kernel and since we
> do not support vfio-pci in KVM, the hardware table will not be updated.
> 
> This reallocates the guest view table in QEMU if the first vfio-pci
> device has just been plugged. spapr_tce_realloc_userspace() handles this.
> 
> This replays all the mappings to make sure that the tables are in sync.
> This will not have a visible effect though as for a new device
> the guest kernel will allocate-and-map new addresses and therefore
> existing mappings from emulated devices will not be used by vfio-pci
> devices.
> 
> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
> hooks.
> 
> Signed-off-by: Alexey Kardashevskiy <address@hidden>
> ---
> Changes:
> v10:
> * removed unnecessary  memory_region_del_subregion() and
> memory_region_add_subregion() as
> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
> notifiers in a more correct way
> 
> v9:
> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitely rather than
> via object_child_foreach()
> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
> called and we end up with vfio_iommu_map_notify registered twice (comments 
> welcome!)
> if we do hotplug+hotunplug+hotplug of the same device.
> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO 
> device.
> Actual VFIO PCI device release happens from rcu and since we add ours later,
> it gets executed later and we are good.
> ---
>  hw/ppc/spapr_iommu.c        | 51 
> ++++++++++++++++++++++++++++++++++++++++++---
>  hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci-host/spapr.h |  1 +
>  include/hw/ppc/spapr.h      |  2 ++
>  trace-events                |  2 ++
>  5 files changed, 100 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 45c00d8..2d99c3b 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
>                                         uint32_t nb_table,
>                                         uint32_t page_shift,
>                                         int *fd,
> -                                       bool vfio_accel)
> +                                       bool vfio_accel,
> +                                       bool force_userspace)
>  {
>      uint64_t *table = NULL;
>      uint64_t window_size = (uint64_t)nb_table << page_shift;
> 
> -    if (kvm_enabled() && !(window_size >> 32)) {
> +    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
>          table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
>      }
> 
> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable 
> *tcet, bool vfio_accel)
>                                          tcet->nb_table,
>                                          tcet->page_shift,
>                                          &tcet->fd,
> -                                        vfio_accel);
> +                                        vfio_accel,
> +                                        false);
> 
>      memory_region_set_size(&tcet->iommu,
>                             (uint64_t)tcet->nb_table << tcet->page_shift);
> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char 
> *propname,
>      return 0;
>  }
> 
> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
> +{
> +    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
> +    long i, ret = 0;
> +
> +    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
> +        ret = put_tce_emu(tcet, ioba, table[i]);
> +        if (ret) {
> +            break;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +int spapr_tce_replay(sPAPRTCETable *tcet)
> +{
> +    return spapr_tce_do_replay(tcet, tcet->table);
> +}
> +
> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
> +{
> +    int ret = 0, oldfd;
> +    uint64_t *oldtable;
> +
> +    oldtable = tcet->table;
> +    oldfd = tcet->fd;
> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> +                                        tcet->nb_table,
> +                                        tcet->page_shift,
> +                                        &tcet->fd,
> +                                        false,
> +                                        true); /* force_userspace */
> +
> +    if (replay) {
> +        ret = spapr_tce_do_replay(tcet, oldtable);
> +    }
> +
> +    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
> +
> +    return ret;
> +}
> +
>  int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>                        sPAPRTCETable *tcet)
>  {
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 76c988f..d1fa157 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>      return 0;
>  }
> 
> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
> +{
> +    int ret = 0, i;
> +    bool had_vfio = sphb->has_vfio;
> +    sPAPRTCETable *tcet;
> +
> +    spapr_phb_dma_capabilities_update(sphb);

So, in the unplug case, we update caps, but has_vfio = false so we don't do
anything else below.

Does that mean our KVM-accelerated TCE table won't get restored until reboot?
Would it make sense to re-enable it here?

> +
> +    if (!had_vfio && sphb->has_vfio) {
> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> +            if (!tcet || !tcet->enabled) {
> +                continue;
> +            }
> +            if (tcet->fd >= 0) {
> +                /*
> +                 * We got first vfio-pci device on accelerated table.
> +                 * VFIO acceleration is not possible.
> +                 * Reallocate table in userspace and replay mappings.
> +                 */
> +                ret = spapr_tce_realloc_userspace(tcet, true);
> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
> +            } else {
> +                /* There was no acceleration, so just replay mappings. */
> +                ret = spapr_tce_replay(tcet);
> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
> +            }
> +            if (ret) {
> +                break;
> +            }
> +        }
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector 
> *drc,
>              error_setg(errp, "Failed to create pci child device tree node");
>              goto out;
>          }
> +        spapr_phb_hotplug_dma_sync(phb);
>      }
> 
>      drck->attach(drc, DEVICE(pdev),
> @@ -1116,6 +1154,12 @@ out:
>      }
>  }
> 
> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
> +{
> +    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
> +    spapr_phb_hotplug_dma_sync(sphb);
> +}
> +
>  static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>  {
>      /* some version guests do not wait for completion of a device
> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState 
> *dev, void *opaque)
>       */
>      pci_device_reset(PCI_DEVICE(dev));
>      object_unparent(OBJECT(dev));
> +
> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);

Hmm... can't think of any reason this wouldn't work, but would be nice
if there was something a bit more straightforward...

When the device is actually finalized, it does:

  static void vfio_instance_finalize(Object *obj)
  {
      PCIDevice *pci_dev = PCI_DEVICE(obj);
      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
      VFIOGroup *group = vdev->vbasedev.group;

      ...
  
      vfio_put_device(vdev);
      vfio_put_group(group);
  }

When all the groups are removed from a VFIO container, there's a
call to container->iommu_data.release(container). This is the
event we really care about, not so much the fact that a device
got released.

Right now all it does it remove the memory listener, but maybe it
makes sense to allow an additional callback/opaque to register for
the event. Not sure what the best way to do that is though...

And, kind of a separate topic, but if we could do something
similar for the initial group attach, we could drop *all* the
plug/unplug hooks, and the hooks themselves could drop all
the !had_vfio / has_vfio logic/probing, since that would then
be clear from the context.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]