From: David Gibson
Subject: Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
Date: Mon, 13 Jul 2015 11:10:12 +1000
User-agent: Mutt/1.5.23 (2014-03-12)

On Sun, Jul 12, 2015 at 09:41:27AM -0500, Michael Roth wrote:
> Quoting Alexey Kardashevskiy (2015-07-11 23:59:45)
> > On 07/11/2015 07:33 AM, Michael Roth wrote:
> > > Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
> > >> sPAPR IOMMU maintains two copies of a TCE table:
> > >> 1) a guest view of the table - this is what emulated devices use and
> > >> this is where H_GET_TCE reads from;
> > >> 2) a hardware TCE table - only present if there is at least one vfio-pci
> > >> device on a PHB; it is updated via a memory listener on a PHB address
> > >> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
> > >>
> > >> At the moment the presence of vfio-pci devices on a bus affects the way
> > >> the guest view table is allocated. If there is no vfio-pci on a PHB
> > >> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> > >> is allocated in KVM. However, if there is vfio-pci and we do not yet
> > >> support KVM acceleration for it, the table has to be allocated
> > >> in userspace.
> > >>
> > >> When a vfio-pci device is hotplugged and there were no vfio-pci devices
> > >> before, the guest view table could have been allocated by KVM, which
> > >> means that H_PUT_TCE is handled by the host kernel and, since we
> > >> do not support vfio-pci in KVM, the hardware table will not be updated.
> > >>
> > >> This reallocates the guest view table in QEMU if the first vfio-pci
> > >> device has just been plugged. spapr_tce_realloc_userspace() handles this.
> > >>
> > >> This replays all the mappings to make sure that the tables are in sync.
> > >> This will not have a visible effect though, as for a new device
> > >> the guest kernel will allocate-and-map new addresses and therefore
> > >> existing mappings from emulated devices will not be used by vfio-pci
> > >> devices.
> > >>
> > >> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
> > >> hooks.
> > >>
> > >> Signed-off-by: Alexey Kardashevskiy <address@hidden>
> > >> ---
> > >> Changes:
> > >> v10:
> > >> * removed unnecessary memory_region_del_subregion() and
> > >> memory_region_add_subregion() as "vfio: Unregister IOMMU notifiers
> > >> when container is destroyed" removes notifiers in a more correct way
> > >>
> > >> v9:
> > >> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitly
> > >> rather than via object_child_foreach()
> > >> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
> > >> memory_region_add_subregion() as otherwise vfio_listener_region_del()
> > >> is not called and we end up with vfio_iommu_map_notify registered
> > >> twice (comments welcome!) if we do hotplug+hotunplug+hotplug of the
> > >> same device.
> > >> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before
> > >> calling spapr_phb_hotplug_dma_sync(), we need VFIO to release the
> > >> container, otherwise spapr_phb_dma_capabilities_update() will decide
> > >> that the PHB still has a VFIO device.
> > >> Actual VFIO PCI device release happens from rcu and since we add ours
> > >> later, it gets executed later and we are good.
> > >> ---
> > >>   hw/ppc/spapr_iommu.c        | 51 ++++++++++++++++++++++++++++++++++++++++++---
> > >>   hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
> > >>   include/hw/pci-host/spapr.h |  1 +
> > >>   include/hw/ppc/spapr.h      |  2 ++
> > >>   trace-events                |  2 ++
> > >>   5 files changed, 100 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> > >> index 45c00d8..2d99c3b 100644
> > >> --- a/hw/ppc/spapr_iommu.c
> > >> +++ b/hw/ppc/spapr_iommu.c
> > >> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
> > >>                                          uint32_t nb_table,
> > >>                                          uint32_t page_shift,
> > >>                                          int *fd,
> > >> -                                       bool vfio_accel)
> > >> +                                       bool vfio_accel,
> > >> +                                       bool force_userspace)
> > >>   {
> > >>       uint64_t *table = NULL;
> > >>       uint64_t window_size = (uint64_t)nb_table << page_shift;
> > >>
> > >> -    if (kvm_enabled() && !(window_size >> 32)) {
> > >> +    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
> > >>           table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
> > >>       }
> > >>
> > >> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
> > >>                                           tcet->nb_table,
> > >>                                           tcet->page_shift,
> > >>                                           &tcet->fd,
> > >> -                                        vfio_accel);
> > >> +                                        vfio_accel,
> > >> +                                        false);
> > >>
> > >>       memory_region_set_size(&tcet->iommu,
> > >>                              (uint64_t)tcet->nb_table << tcet->page_shift);
> > >> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> > >>       return 0;
> > >>   }
> > >>
> > >> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
> > >> +{
> > >> +    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
> > >> +    long i, ret = 0;
> > >> +
> > >> +    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
> > >> +        ret = put_tce_emu(tcet, ioba, table[i]);
> > >> +        if (ret) {
> > >> +            break;
> > >> +        }
> > >> +    }
> > >> +
> > >> +    return ret;
> > >> +}
> > >> +
> > >> +int spapr_tce_replay(sPAPRTCETable *tcet)
> > >> +{
> > >> +    return spapr_tce_do_replay(tcet, tcet->table);
> > >> +}
> > >> +
> > >> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
> > >> +{
> > >> +    int ret = 0, oldfd;
> > >> +    uint64_t *oldtable;
> > >> +
> > >> +    oldtable = tcet->table;
> > >> +    oldfd = tcet->fd;
> > >> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
> > >> +                                        tcet->nb_table,
> > >> +                                        tcet->page_shift,
> > >> +                                        &tcet->fd,
> > >> +                                        false,
> > >> +                                        true); /* force_userspace */
> > >> +
> > >> +    if (replay) {
> > >> +        ret = spapr_tce_do_replay(tcet, oldtable);
> > >> +    }
> > >> +
> > >> +    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
> > >> +
> > >> +    return ret;
> > >> +}
> > >> +
> > >>   int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> > >>                         sPAPRTCETable *tcet)
> > >>   {
> > >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> > >> index 76c988f..d1fa157 100644
> > >> --- a/hw/ppc/spapr_pci.c
> > >> +++ b/hw/ppc/spapr_pci.c
> > >> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
> > >>       return 0;
> > >>   }
> > >>
> > >> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
> > >> +{
> > >> +    int ret = 0, i;
> > >> +    bool had_vfio = sphb->has_vfio;
> > >> +    sPAPRTCETable *tcet;
> > >> +
> > >> +    spapr_phb_dma_capabilities_update(sphb);
> > >
> > > So, in the unplug case, we update caps, but has_vfio = false so we
> > > don't do anything else below.
> > 
> > Yes.
> > 
> > 
> > > Does that mean our KVM-accelerated TCE table won't get restored
> > > until reboot? Would it make sense to re-enable it here?
> > 
> > No, it should be re-enabled as DMA config is completely reset during the
> > machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of
> > DMA config when resetting PHB"
> 
> We don't get a PHB-level reset for PCI hotplug though, so it wouldn't
> get re-enabled till guest system reset. I'm not sure how big a deal that
> is performance-wise, but it seems a little unexpected.

Right.  I'm not particularly fussed by this though.  If you really
want to optimize performance, you're still better off segregating vfio
devices and emulated devices onto different PHBs.

It seems to me that if you plug a vfio device onto a bus once, you might
well do so again in the future, which would mean ejecting all the
kernel-accelerated emulated devices again.
> > >> +
> > >> +    if (!had_vfio && sphb->has_vfio) {
> > >> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> > >> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> > >> +            if (!tcet || !tcet->enabled) {
> > >> +                continue;
> > >> +            }
> > >> +            if (tcet->fd >= 0) {
> > >> +                /*
> > >> +                 * We got the first vfio-pci device on an accelerated table.
> > >> +                 * VFIO acceleration is not possible.
> > >> +                 * Reallocate table in userspace and replay mappings.
> > >> +                 */
> > >> +                ret = spapr_tce_realloc_userspace(tcet, true);
> > >> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
> > >> +            } else {
> > >> +                /* There was no acceleration, so just replay mappings. */
> > >> +                ret = spapr_tce_replay(tcet);
> > >> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
> > >> +            }
> > >> +            if (ret) {
> > >> +                break;
> > >> +            }
> > >> +        }
> > >> +        return ret;
> > >> +    }
> > >> +
> > >> +    return 0;
> > >> +}
> > >> +
> > >>   /* Macros to operate with address in OF binding to PCI */
> > >>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
> > >>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
> > >> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> > >>               error_setg(errp, "Failed to create pci child device tree node");
> > >>               goto out;
> > >>           }
> > >> +        spapr_phb_hotplug_dma_sync(phb);
> > >>       }
> > >>
> > >>       drck->attach(drc, DEVICE(pdev),
> > >> @@ -1116,6 +1154,12 @@ out:
> > >>       }
> > >>   }
> > >>
> > >> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
> > >> +{
> > >> +    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
> > >> +    spapr_phb_hotplug_dma_sync(sphb);
> > >> +}
> > >> +
> > >>   static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
> > >>   {
> > >>       /* some version guests do not wait for completion of a device
> > >> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
> > >>        */
> > >>       pci_device_reset(PCI_DEVICE(dev));
> > >>       object_unparent(OBJECT(dev));
> > >> +
> > >> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
> > >> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
> > >
> > > Hmm... can't think of any reason this wouldn't work, but would be nice
> > > if there was something a bit more straightforward...
> > >
> > > When the device is actually finalized, it does:
> > 
> > The problem is with "when". I looked in gdb: vfio_instance_finalize()
> > is called from an RCU handler, because the last reference is dropped when
> > some memory region is removed, and that removal was postponed to RCU.
> > 
> > If object_unparent(OBJECT(dev)) called vfio_put_group() on the same
> > stack, I would not need this call_rcu1.
> 
> Right, object_unparent() has no guarantee of immediately finalizing it,
> but you *do* have the guarantee that
> vfio_instance_finalize()->vfio_put_group() will only be called once the
> device is actually finalized, regardless of whether or not it's kicked
> off by the RCU thread. So it seems more straightforward to hook into
> that rather than needing to employ internal knowledge of
> object_unparent().
> 
> > 
> > >    static void vfio_instance_finalize(Object *obj)
> > >    {
> > >        PCIDevice *pci_dev = PCI_DEVICE(obj);
> > >        VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
> > >        VFIOGroup *group = vdev->vbasedev.group;
> > >
> > >        ...
> > >
> > >        vfio_put_device(vdev);
> > >        vfio_put_group(group);
> > >    }
> > >
> > > When all the groups are removed from a VFIO container, there's a
> > > call to container->iommu_data.release(container). This is the
> > > event we really care about, not so much the fact that a device
> > > got released.
> > >
> > > Right now all it does is remove the memory listener, but maybe it
> > > makes sense to allow an additional callback/opaque to register for
> > > the event. Not sure what the best way to do that is though...
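For illustration, the callback/opaque registration being suggested could look something like the sketch below: a notifier list on the VFIO container, fired from the existing iommu_data.release() path when the last group is detached. All names here (ReleaseNotifier, container_add_release_notifier, etc.) are hypothetical, not QEMU's actual API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: a per-container notifier list invoked when the
 * container is released (last group detached).  Illustrative only. */

typedef void (*ContainerReleaseFn)(void *opaque);

typedef struct ReleaseNotifier {
    ContainerReleaseFn cb;
    void *opaque;
    struct ReleaseNotifier *next;
} ReleaseNotifier;

typedef struct Container {
    ReleaseNotifier *release_notifiers;
} Container;

static void container_add_release_notifier(Container *c, ReleaseNotifier *n)
{
    /* Prepend to the container's notifier list */
    n->next = c->release_notifiers;
    c->release_notifiers = n;
}

/* Would be called from the existing iommu_data.release() path when the
 * last group is removed from the container. */
static void container_release(Container *c)
{
    ReleaseNotifier *n;

    for (n = c->release_notifiers; n; n = n->next) {
        n->cb(n->opaque);
    }
    c->release_notifiers = NULL;
}

/* Example consumer: the sPAPR PHB could clear its has_vfio flag here
 * instead of probing from a hotplug hook. */
static void phb_vfio_released(void *opaque)
{
    int *has_vfio = opaque;
    *has_vfio = 0;
}
```

The PHB would register once when it first sees a container, and the callback replaces the "probe capabilities and compare" logic in the hotplug hooks.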
> > 
> > In this context I rather care about the container's fd being closed, so
> > VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and that way I
> > know that there are no more VFIO devices.
> 
> VFIO container getting closed also corresponds to the last group being
> removed though. Even if it didn't, I think VFIO_IOMMU_SPAPR_TCE_GET_INFO
> would fail unless at least one iommu group was attached to the
> container? So knowing when the first/last group is removed seems to
> be the real main event.

So, I think what you're basically suggesting here is that the
"vfio-active" state of the PHB should be tied to the presence of an
active container associated with the PHB, rather than to whether there
are actually vfio devices present on the bus.

The container approach does seem more conceptually correct to me, as
well as possibly simplifying the implementation a bit.

> > > And, kind of a separate topic, but if we could do something
> > > similar for the initial group attach, we could drop *all* the
> > > plug/unplug hooks, and the hooks themselves could drop all
> > > the !had_vfio / has_vfio logic/probing, since that would then
> > > be clear from the context.
> > 
> > Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Aren't
> > they what HMP calls on "device_add"?
> 
> I mean all the places we call into code that ends up doing:
>   spapr_phb_dma_capabilities_update(sphb);
>   /* do something special if has_vfio changed */
> 
> We currently have one in PCI plug, PCI unplug, and PHB reset. If we
> hooked into vfio_put_group(), we could drop each of those hooks. PHB
> reset would still have the special case of restoring default 32-bit
> window config, but it wouldn't need to care about has_vfio status
> anymore, all that code could be handled by
> vfio_put_group/vfio_get_group callbacks.
> 
> I wouldn't hold up the series for it, but I think it would greatly
> simplify tracking has_vfio changes.
> 
> But I do think spapr_phb_hotplug_dma_sync() should have some logic
> squashed in for re-enabling TCE acceleration on
> (had_vfio && !has_vfio).
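The transitions being discussed can be sketched as a small decision table; this is purely illustrative (the enum and helper are mine, not the patch's), showing the re-enable case that spapr_phb_hotplug_dma_sync() currently leaves out:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative transition table for spapr_phb_hotplug_dma_sync():
 * what to do with the guest-view TCE table when the PHB's vfio state
 * changes across a hotplug/hotunplug.  The patch handles only the
 * first transition (spapr_tce_realloc_userspace() + replay); the
 * second is the re-enable case under discussion. */
typedef enum {
    DMA_SYNC_NOTHING,        /* vfio state unchanged */
    DMA_SYNC_REALLOC_USER,   /* first vfio device: move table to userspace */
    DMA_SYNC_REENABLE_KVM,   /* last vfio device gone: KVM table possible again */
} DmaSyncAction;

static DmaSyncAction dma_sync_action(bool had_vfio, bool has_vfio)
{
    if (!had_vfio && has_vfio) {
        return DMA_SYNC_REALLOC_USER;
    }
    if (had_vfio && !has_vfio) {
        return DMA_SYNC_REENABLE_KVM;
    }
    return DMA_SYNC_NOTHING;
}
```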
> 
> > 
> > 
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
