Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW

From:	Alexander Graf
Subject:	Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
Date:	Fri, 15 Aug 2014 09:37:27 +0200
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Thunderbird/31.0


On 15.08.14 05:16, Alexey Kardashevskiy wrote:

On 08/14/2014 11:38 PM, Alexander Graf wrote:

On 13.08.14 02:18, Alexey Kardashevskiy wrote:

On 08/13/2014 01:28 AM, Alexander Graf wrote:

On 12.08.14 17:10, Alexey Kardashevskiy wrote:

On 08/12/2014 07:37 PM, Alexander Graf wrote:

On 12.08.14 02:03, Alexey Kardashevskiy wrote:

On 08/12/2014 03:30 AM, Alexander Graf wrote:

On 11.08.14 17:01, Alexey Kardashevskiy wrote:

On 08/11/2014 10:02 PM, Alexander Graf wrote:

On 31.07.14 11:34, Alexey Kardashevskiy wrote:

This implements DDW for VFIO. Host kernel support is required for
this.

Signed-off-by: Alexey Kardashevskiy <address@hidden>
---
       hw/ppc/spapr_pci_vfio.c | 75
+++++++++++++++++++++++++++++++++++++++++++++++++
       1 file changed, 75 insertions(+)

diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
index d3bddf2..dc443e2 100644
--- a/hw/ppc/spapr_pci_vfio.c
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -69,6 +69,77 @@ static void
spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
           /* Register default 32bit DMA window */
           memory_region_add_subregion(&sphb->iommu_root,
tcet->bus_offset,
                                       spapr_tce_get_iommu(tcet));
+
+    sphb->ddw_supported = !!(info.flags &
VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
+}
+
+static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
+                                    uint32_t *windows_available,
+                                    uint32_t *page_size_mask)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_query query = { .argsz =
sizeof(query) };
+    int ret;
+
+    ret = vfio_container_ioctl(&sphb->iommu_as,
svphb->iommugroupid,
+                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
+    if (ret) {
+        return ret;
+    }
+
+    *windows_available = query.windows_available;
+    *page_size_mask = query.page_size_mask;
+
+    return ret;
+}
+
+static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t
page_shift,
+                                     uint32_t window_shift,
uint32_t
liobn,
+                                     sPAPRTCETable **ptcet)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_create create = {
+        .argsz = sizeof(create),
+        .page_shift = page_shift,
+        .window_shift = window_shift,
+        .start_addr = 0
+    };
+    int ret;
+
+    ret = vfio_container_ioctl(&sphb->iommu_as,
svphb->iommugroupid,
+                               VFIO_IOMMU_SPAPR_TCE_CREATE,
&create);
+    if (ret) {
+        return ret;
+    }
+
+    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn,
create.start_addr,
+                                 page_shift, 1 << (window_shift -
page_shift),

I spot a 1 without ULL again - this time it might work out ok, but
please
just always use ULL when you pass around addresses.

My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)

Please walk me though the abstraction levels on what each page size
honoration means. If I use THP, what page size granularity can I use
for
TCE entries?

[RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
support

+        const struct { int shift; uint32_t mask; } masks[] = {
+            { 12, DDW_PGSIZE_4K },
+            { 16, DDW_PGSIZE_64K },
+            { 24, DDW_PGSIZE_16M },
+            { 25, DDW_PGSIZE_32M },
+            { 26, DDW_PGSIZE_64M },
+            { 27, DDW_PGSIZE_128M },
+            { 28, DDW_PGSIZE_256M },
+            { 34, DDW_PGSIZE_16G },
+        };


Supported page sizes are returned by the host kernel via "query". For
16MB
pages, page shift will return
DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
Or I did not understand the question...

Why do we care about the sizes? Anything bigger than what we support
should
always work, no? What happens if the guest creates a 16MB map but my
pages
are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?

It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
pages, I have to make sure these 16M are continuous - there will be one
TCE
entry for it and no more translations besides IOMMU. What do I miss now?

Who does the shadow translation where? Does it exist at all?

IOMMU? I am not sure I am following you... This IOMMU will look as direct
DMA for the guest but the real IOMMU table is sparse and it is populated
via a bunch of H_PUT_TCE calls as the default small window.

There is a direct mapping in the host called "bypass window" but it is not
used here as sPAPR does not define that for paravirtualization.

Ok, imagine I have 16MB of guest physical memory that is in reality backed
by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
this (from its point of view contiguous) chunk of memory.

Do we allow this?

No, we do not. We tell the guest what it can use.

Or do we force the guest to create 64k TCE entries?

16MB TCE pages are only allowed if qemu is running with hugepages.

That's unfortunate ;)


This is my limitation, not SPAPR spec or anything like that.

but as long as we have to pin TCEd memory anyway, I
guess it doesn't hurt as badly.

Yep.

If we allow it, why would we ever put any restriction at the upper end of
TCE entry sizes? If we already implement enough logic to map things lazily
around, we could as well have the guest create a 256M TCE entry and just
split it on the host view to 64k TCE entries.

Oh, thiiiiiis is what you meant...

Well, we could, just for now current linux guests support 4K/64K/16M only
and they choose depending on what hypervisor supports - look at
enable_ddw() in the guest. What you suggest seems to be an unnecessary code
duplication for 16MB pages case. For bigger page sizes - for example, for
64GB guest, a TCE table with 16MB TCEs will be 32KB which is already
awesome enough, no?

In "normal" invironments guests won't be backed by 16M pages, but by 64k
pages with the occasional THP huge page merge that you can't rely on.

That's why I figured it'd be smart to support 16MB TCEs even when the
underlying memory is only backed by 64k pages.


Real TCE table will be using 64K pages anyway - whether we fake 16MB pages
to the guest and split them silently in QEMU/host OR guest itself uses 64K
pages, in any case we end up with a bigger 64K TCE table so I do not see
what we win if we fake 16MB pages for the guest.

Well, what if you for example want to live migrate from a hugepagebacked guest to a non-hugepage backed guest? In that case you have tomigrate the TCEs as well, but the new location can't use the 16MB ones,right?

We also win in that we have fewer hypercalls to add TCEs, but I supposethat part is almost negligible.



Alex

My point it the code is already in the guest, it can do 64K and 16MB pages,
if it was not there, yes, everything you suggest would make sense to
implement :)

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW, (continued)
- Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW, David Gibson, 2014/08/11

Prev by Date: [Qemu-devel] [Question] Why doesn't PCIe hotplug work for Q35 machine?
Next by Date: Re: [Qemu-devel] [PATCH 1/2] qapi.py: avoid Python 2.5+ any() function
Previous by thread: Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
Next by thread: Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
Index(es):
- Date
- Thread