qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Suppor


From: Alexey Kardashevskiy
Subject: Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
Date: Mon, 6 Jul 2015 21:27:42 +1000
User-agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.0.1

On 07/06/2015 09:06 PM, David Gibson wrote:
On Mon, Jul 06, 2015 at 12:11:10PM +1000, Alexey Kardashevskiy wrote:
This adds support for Dynamic DMA Windows (DDW) option defined by
the SPAPR specification which allows to have additional DMA window(s)

This implements DDW for emulated and VFIO devices. As all TCE root regions
are mapped at 0 and 64bit long (and actual tables are child regions),
this replaces memory_region_add_subregion() with _overlap() to make
QEMU memory API happy.

This reserves RTAS token numbers for DDW calls.

This implements helpers to interact with VFIO kernel interface.

This changes the TCE table migration descriptor to support dynamic
tables as from now on, PHB will create as many stub TCE table objects
as PHB can possibly support but not all of them might be initialized at
the time of migration because DDW might or might not be requested by
the guest.

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.3 machine and older disable it.

This implements DDW for VFIO. The host kernel support is required.
This adds a "levels" property to PHB to control the number of levels
in the actual TCE table allocated by the host kernel, 0 is the default
value to tell QEMU to calculate the correct value. Current hardware
supports up to 5 levels.

The existing linux guests try creating one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
the guest switches to dma_direct_ops and never calls TCE hypercalls
(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
and not waste time on map/unmap later. This adds a "dma64_win_addr"
property which is a bus address for the 64bit window and by default
set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
uses and this allows having emulated and VFIO devices on the same bus.

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.

Signed-off-by: Alexey Kardashevskiy <address@hidden>
---
Changes:
v10:
* added dma64_win_addr property to PHB
* removed redundand check for "!migtable" in spapr_tce_table_post_load()

v9:
* fixed default 64bit window start (from mdroth)
* fixed type cast in dma window update code (from mdroth)
* spapr_phb_dma_update() now can fail and cause hotplug failure if
hardware TCE table cannot be mapped to the same bus address as the emulated one

v7:
* fixed uninitialized variables

v6:
* rework as there is no more special device for VFIO PHB

v5:
* total rework
* enabled for machines >2.3
* fixed migration
* merged rtas handlers here

v4:
* reset handler is back in generalized form

v3:
* removed reset
* windows_num is now 1 or bigger rather than 0-based value and it is only
changed in PHB code, not in RTAS
* added page mask check in create()
* added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
created

v2:
* tested on hacked emulated E1000
* implemented DDW reset on the PHB reset
* spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
---
  hw/ppc/Makefile.objs        |   3 +
  hw/ppc/spapr.c              |   5 +
  hw/ppc/spapr_iommu.c        |  32 ++++-
  hw/ppc/spapr_pci.c          | 110 ++++++++++++++--
  hw/ppc/spapr_pci_vfio.c     |  88 +++++++++++++
  hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
  hw/vfio/common.c            |   2 +
  include/hw/pci-host/spapr.h |  21 +++-
  include/hw/ppc/spapr.h      |  17 ++-
  trace-events                |   6 +
  10 files changed, 568 insertions(+), 16 deletions(-)
  create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c8ab06e..0b2ff6d 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
  obj-y += spapr_pci_vfio.o
  endif
+ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
+obj-y += spapr_rtas_ddw.o
+endif
  # PowerPC 4xx boards
  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
  obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 5ca817c..d50d50b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1860,6 +1860,11 @@ static const TypeInfo spapr_machine_info = {
              .driver   = "spapr-pci-host-bridge",\
              .property = "dynamic-reconfiguration",\
              .value    = "off",\
+        },\
+        {\
+            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+            .property = "ddw",\
+            .value    = stringify(off),\
          },

  #define SPAPR_COMPAT_2_2 \
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 2d99c3b..b54c3d8 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -136,6 +136,15 @@ static IOMMUTLBEntry 
spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
      return ret;
  }

+static void spapr_tce_table_pre_save(void *opaque)
+{
+    sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
+
+    tcet->migtable = tcet->table;
+}
+
+static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
+
  static int spapr_tce_table_post_load(void *opaque, int version_id)
  {
      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -144,22 +153,39 @@ static int spapr_tce_table_post_load(void *opaque, int 
version_id)
          spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
      }

+    if (tcet->enabled) {
+        if (!tcet->table) {
+            tcet->enabled = false;
+            /* VFIO does not migrate so pass vfio_accel == false */
+            spapr_tce_table_do_enable(tcet, false);
+        }
+        memcpy(tcet->table, tcet->migtable,
+               tcet->nb_table * sizeof(tcet->table[0]));
+        free(tcet->migtable);
+        tcet->migtable = NULL;
+    }
+
      return 0;
  }

  static const VMStateDescription vmstate_spapr_tce_table = {
      .name = "spapr_iommu",
-    .version_id = 2,
+    .version_id = 3,
      .minimum_version_id = 2,
+    .pre_save = spapr_tce_table_pre_save,
      .post_load = spapr_tce_table_post_load,
      .fields      = (VMStateField []) {
          /* Sanity check */
          VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
-        VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),

          /* IOMMU state */
+        VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
+        VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
+        VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
+        VMSTATE_UINT32(nb_table, sPAPRTCETable),
          VMSTATE_BOOL(bypass, sPAPRTCETable),
-        VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, 
vmstate_info_uint64, uint64_t),
+        VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
+                                    vmstate_info_uint64, uint64_t),

          VMSTATE_END_OF_LIST()
      },
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index d1fa157..b7113b5 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -778,6 +778,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState 
*sphb)

      sphb->dma32_window_start = 0;
      sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
+    sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
+    sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
+    sphb->dma64_window_size = pow2ceil(ram_size);

      ret = spapr_phb_vfio_dma_capabilities_update(sphb);
      sphb->has_vfio = (ret == 0);
@@ -785,12 +788,35 @@ static int 
spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
      return 0;
  }

-static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
-                                     uint32_t liobn, uint32_t page_shift,
-                                     uint64_t window_size)
+int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
+                              uint32_t liobn, uint32_t page_shift,
+                              uint64_t window_size)
  {
      uint64_t bus_offset = sphb->dma32_window_start;
      sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+    int ret;
+
+    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
+        return -1;
+    }
+
+    if (sphb->ddw_enabled) {
+        if (sphb->has_vfio) {
+            ret = spapr_phb_vfio_dma_init_window(sphb,
+                                                 page_shift, window_size,
+                                                 &bus_offset);
+            if (ret) {
+                return ret;
+            }
+        } else if (SPAPR_PCI_DMA_WINDOW_NUM(liobn)) {
+            /*
+             * There is no VFIO so we choose a huge window address.
+             * If VFIO is added later, spapr_phb_dma_update() will fail
+             * and cause hotplug failure.
+             */
+            bus_offset = sphb->dma64_window_start;
+        }
+    }

      spapr_tce_table_enable(tcet, bus_offset, page_shift,
                             window_size >> page_shift,
@@ -802,9 +828,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
  int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
                                  sPAPRTCETable *tcet)
  {
+    int ret = 0;
+
+    if (sphb->has_vfio && sphb->ddw_enabled) {
+        ret = spapr_phb_vfio_dma_remove_window(sphb, tcet);
+    }
      spapr_tce_table_disable(tcet);

-    return 0;
+    return ret;
  }

  int spapr_phb_dma_reset(sPAPRPHBState *sphb)
@@ -832,15 +863,46 @@ static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
      int ret = 0, i;
      bool had_vfio = sphb->has_vfio;
      sPAPRTCETable *tcet;
+    uint64_t bus_offset = 0;

      spapr_phb_dma_capabilities_update(sphb);

+    /*
+     * PHB got first VFIO device or lost last VFIO device;
+     * If it is the last VFIO device, we do not need windows anymore so
+     * remove them.
+     * If it is the first VFIO device, we have to remove them as
+     * we cannot request a specific window from the host kernel so we
+     * remove all windows and recreate them later if necessary.

Am I right in thinking that there never should be (VFIO enabled)
windows when the first VFIO device is added though?
>
If you're removing the windows when VFIO devices are removed, and any
windows created while !has_vfio shouldn't result in the kernel being
requested from the kernel..?

Yes, I do not need this chunk. I realized that when I posted the patchset :(




--
Alexey



reply via email to

[Prev in Thread] Current Thread [Next in Thread]