Re: [PATCH 1/1] docs: adding NUMA documentation for pseries
From: Greg Kurz
Subject: Re: [PATCH 1/1] docs: adding NUMA documentation for pseries
Date: Mon, 3 Aug 2020 14:53:11 +0200
On Mon, 3 Aug 2020 09:14:22 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
>
>
> On 8/3/20 8:49 AM, Greg Kurz wrote:
> > On Thu, 30 Jul 2020 10:58:52 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> >> On Wed, Jul 29, 2020 at 09:57:56AM -0300, Daniel Henrique Barboza wrote:
> >>> This patch adds a new documentation file, ppc-spapr-numa.rst,
> >>> informing what developers and users can expect of the NUMA distance
> >>> support for the pseries machine, up to QEMU 5.1.
> >>>
> >>> In the (hopefully soon) future, when we rework the NUMA mechanics
> >>> of the pseries machine to at least attempt to contemplate user
> >>> choice, this doc will be extended to inform about the new
> >>> support.
> >>>
> >>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> >>
> >> Applied to ppc-for-5.2, thanks.
> >>
> >
> > I'm now hitting this:
> >
> > Warning, treated as error:
> > docs/specs/ppc-spapr-numa.rst:document isn't included in any toctree
>
> How are you hitting this? I can't reproduce this error. Tried running
> ./autogen.sh and 'make' and didn't see it.
>
I do out-of-tree builds and my configure line is:
configure \
--enable-docs \
--disable-strip \
--disable-xen \
--enable-trace-backend=log \
--enable-kvm \
--enable-linux-aio \
--enable-vhost-net \
--enable-virtfs \
--enable-seccomp \
--target-list='ppc64-softmmu'
> Checking what other docs are doing I figure that this might be missing:
>
> $ git diff
> diff --git a/docs/specs/index.rst b/docs/specs/index.rst
> index 426632a475..1b0eb979d5 100644
> --- a/docs/specs/index.rst
> +++ b/docs/specs/index.rst
> @@ -12,6 +12,7 @@ Contents:
>
> ppc-xive
> ppc-spapr-xive
> + ppc-spapr-numa
> acpi_hw_reduced_hotplug
> tpm
> acpi_hest_ghes
>
>
>
> Can you please check if this solves the error?
>
Yes it does! Thanks!
>
>
> Thanks,
>
>
> Daniel
>
> >
> >>> ---
> >>> docs/specs/ppc-spapr-numa.rst | 191 ++++++++++++++++++++++++++++++++++
> >>> 1 file changed, 191 insertions(+)
> >>> create mode 100644 docs/specs/ppc-spapr-numa.rst
> >>>
> >>> diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
> >>> new file mode 100644
> >>> index 0000000000..e762038022
> >>> --- /dev/null
> >>> +++ b/docs/specs/ppc-spapr-numa.rst
> >>> @@ -0,0 +1,191 @@
> >>> +
> >>> +NUMA mechanics for sPAPR (pseries machines)
> >>> +============================================
> >>> +
> >>> +NUMA in sPAPR works differently from the System Locality Distance
> >>> +Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR
> >>> +1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This
> >>> +document aims to complement this specification, providing details
> >>> +of the elements that impact how QEMU views NUMA in pseries.
> >>> +
> >>> +Associativity and ibm,associativity property
> >>> +--------------------------------------------
> >>> +
> >>> +Associativity is defined as a group of platform resources that have
> >>> +similar mean performance (or, in our context here, distance) relative to
> >>> +everyone else outside of the group.
> >>> +
> >>> +The format of the ibm,associativity property varies with the value of
> >>> +bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
> >>> +bit 0 equal to zero is deprecated. In the current format, with bit 0
> >>> +equal to one, the ibm,associativity property represents the physical
> >>> +hierarchy of the platform as one or more lists, each starting with the
> >>> +highest level grouping and going down to the smallest. Consider the
> >>> +following topology:
> >>> +
> >>> +::
> >>> +
> >>> + Mem M1 ---- Proc P1 |
> >>> + ----------------- | Socket S1 ---|
> >>> + chip C1 | |
> >>> + | HW module 1 (MOD1)
> >>> + Mem M2 ---- Proc P2 | |
> >>> + ----------------- | Socket S2 ---|
> >>> + chip C2 |
> >>> +
> >>> +The ibm,associativity property for the processors would be:
> >>> +
> >>> +* P1: {MOD1, S1, C1, P1}
> >>> +* P2: {MOD1, S2, C2, P2}
> >>> +
> >>> +Each allocable resource has an ibm,associativity property. The LOPAPR
> >>> +specification allows multiple lists to be present in this property,
> >>> +considering that the same resource can have multiple connections to the
> >>> +platform.
> >>> +
> >>> +Relative Performance Distance and ibm,associativity-reference-points
> >>> +--------------------------------------------------------------------
> >>> +
> >>> +The ibm,associativity-reference-points property is an array that is used
> >>> +to define the relevant performance/distance related boundaries, defining
> >>> +the NUMA levels for the platform.
> >>> +
> >>> +The definition of its elements also varies with the value of bit 0 of
> >>> +byte 5 of the ibm,architecture-vec-5 property. The format with bit 0
> >>> +equal to zero is also deprecated. With the current format, each integer
> >>> +of the ibm,associativity-reference-points represents a 1-based ordinal
> >>> +index (i.e. the first element is 1) of the ibm,associativity array. The
> >>> +first boundary is the most significant to application performance,
> >>> +followed by less significant boundaries. Allocated resources that belong
> >>> +to the same performance boundaries are expected to have relative NUMA
> >>> +distance that matches the relevancy of the boundary itself. Resources
> >>> +that belong to the same first boundary will have the shortest distance
> >>> +from each other. Subsequent boundaries represent greater distances and
> >>> +degraded performance.
> >>> +
> >>> +Using the previous example, the following reference points setting
> >>> +defines three NUMA levels:
> >>> +
> >>> +* ibm,associativity-reference-points = {0x3, 0x2, 0x1}
> >>> +
> >>> +The first NUMA level (0x3) is interpreted as the third element of each
> >>> +ibm,associativity array, the second level is the second element and
> >>> +the third level is the first element. Let's also consider that elements
> >>> +belonging to the first NUMA level have distance equal to 10 from each
> >>> +other, and each NUMA level doubles the distance from the previous. This
> >>> +means that the second would be 20 and the third level 40. For the P1 and
> >>> +P2 processors, we would have the following NUMA levels:
> >>> +
> >>> +::
> >>> +
> >>> + * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
> >>> +
> >>> + * P1: associativity{MOD1, S1, C1, P1}
> >>> +
> >>> + First NUMA level (0x3) => associativity[2] = C1
> >>> + Second NUMA level (0x2) => associativity[1] = S1
> >>> + Third NUMA level (0x1) => associativity[0] = MOD1
> >>> +
> >>> + * P2: associativity{MOD1, S2, C2, P2}
> >>> +
> >>> + First NUMA level (0x3) => associativity[2] = C2
> >>> + Second NUMA level (0x2) => associativity[1] = S2
> >>> + Third NUMA level (0x1) => associativity[0] = MOD1
> >>> +
> >>> + P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40
> >>> +
> >>> +Changing the ibm,associativity-reference-points array changes the
> >>> +performance distance attributes for the same associativity arrays, as
> >>> +the following example illustrates:
> >>> +
> >>> +::
> >>> +
> >>> + * ibm,associativity-reference-points = {0x2}
> >>> +
> >>> + * P1: associativity{MOD1, S1, C1, P1}
> >>> +
> >>> + First NUMA level (0x2) => associativity[1] = S1
> >>> +
> >>> + * P2: associativity{MOD1, S2, C2, P2}
> >>> +
> >>> + First NUMA level (0x2) => associativity[1] = S2
> >>> +
> >>> + P1 and P2 do not have a common performance boundary. Since this is a
> >>> + one level NUMA configuration, the distance between them is one boundary
> >>> + above the first level, 20.
> >>> +
> >>> +
> >>> +In a hypothetical platform where all resources inside the same hardware
> >>> +module are considered to be on the same performance boundary:
> >>> +
> >>> +::
> >>> +
> >>> + * ibm,associativity-reference-points = {0x1}
> >>> +
> >>> + * P1: associativity{MOD1, S1, C1, P1}
> >>> +
> >>> + First NUMA level (0x1) => associativity[0] = MOD1
> >>> +
> >>> + * P2: associativity{MOD1, S2, C2, P2}
> >>> +
> >>> + First NUMA level (0x1) => associativity[0] = MOD1
> >>> +
> >>> + P1 and P2 belong to the same first order boundary. The distance between
> >>> + them is 10.
> >>> +
> >>> +
> >>> +How the pseries Linux guest calculates NUMA distances
> >>> +=====================================================
> >>> +
> >>> +Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is
> >>> +how the distances are expressed. The SLIT table provides the NUMA distance
> >>> +value between the relevant resources. LOPAPR does not provide a standard
> >>> +way to calculate it. We have the ibm,associativity for each resource,
> >>> +which provides a common-performance hierarchy, and the
> >>> +ibm,associativity-reference-points array that tells which level of
> >>> +associativity is considered to be relevant or not.
> >>> +
> >>> +The result is that each OS is free to implement and to interpret the
> >>> +distance as it sees fit. For the pseries Linux guest, each level of NUMA
> >>> +doubles the distance of the previous level, and the maximum number of
> >>> +levels is limited to MAX_DISTANCE_REF_POINTS = 4 (from
> >>> +arch/powerpc/mm/numa.c in the kernel tree). This results in the
> >>> +following distances:
> >>> +
> >>> +* both resources in the first NUMA level: 10
> >>> +* resources one NUMA level apart: 20
> >>> +* resources two NUMA levels apart: 40
> >>> +* resources three NUMA levels apart: 80
> >>> +* resources four NUMA levels apart: 160
> >>> +
> >>> +
> >>> +Consequences for QEMU NUMA tuning
> >>> +---------------------------------
> >>> +
> >>> +The way the pseries Linux guest calculates NUMA distances has a direct
> >>> +effect on what QEMU users can expect when doing NUMA tuning. As of QEMU
> >>> +5.1, this is the default ibm,associativity-reference-points being used
> >>> +in the pseries machine:
> >>> +
> >>> +ibm,associativity-reference-points = {0x4, 0x4, 0x2}
> >>> +
> >>> +The first and second levels are equal, 0x4, and a third one was added in
> >>> +commit a6030d7e0b35 exclusively for NVLink GPU support. This means that
> >>> +regardless of how the ibm,associativity properties are being created in
> >>> +the device tree, the pseries Linux guest will only recognize three
> >>> +scenarios as far as NUMA distance goes:
> >>> +
> >>> +* resources that belong to the same first NUMA level have distance = 10
> >>> +* the second level is skipped since it's equal to the first
> >>> +* all resources that aren't an NVLink GPU are guaranteed to belong to
> >>> + the same third NUMA level, having distance = 40
> >>> +* for NVLink GPUs, distance = 80 from everything else
> >>> +
> >>> +In short, we can summarize the NUMA distances seen in pseries Linux
> >>> +guests, using QEMU up to 5.1, as follows:
> >>> +
> >>> +* local distance, i.e. the distance of the resource to its own NUMA
> >>> + node: 10
> >>> +* if it's an NVLink GPU device, distance: 80
> >>> +* every other resource, distance: 40
> >>> +
> >>> +This also means that user input in QEMU command line does not change the
> >>> +NUMA distancing inside the guest for the pseries machine.
> >>
> >
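
For readers following the quoted document, the distance rule it describes
for the pseries Linux guest can be sketched in a few lines of Python. This
is only an illustration of the rule as worded in the text (start at 10,
walk the reference points in order, double the distance at each level where
the two resources differ), not the kernel code itself. The function name
numa_distance and the list representation of the associativity arrays are
assumptions made for this example.

```python
# Illustrative sketch only: names and data layout are assumptions,
# the rule follows the wording of the quoted document.

LOCAL_DISTANCE = 10          # resources sharing the first NUMA level
MAX_DISTANCE_REF_POINTS = 4  # limit cited from arch/powerpc/mm/numa.c


def numa_distance(assoc_a, assoc_b, reference_points):
    """Return the distance between two resources.

    assoc_a, assoc_b: ibm,associativity arrays, highest level grouping first.
    reference_points: ibm,associativity-reference-points, 1-based indexes
                      into the associativity arrays, most significant first.
    """
    distance = LOCAL_DISTANCE
    for ref in reference_points[:MAX_DISTANCE_REF_POINTS]:
        # Reference points are 1-based, Python lists are 0-based.
        if assoc_a[ref - 1] == assoc_b[ref - 1]:
            break  # common boundary found, stop here
        distance *= 2  # one NUMA level further apart
    return distance


# Associativity arrays of the two processors in the example topology.
P1 = ["MOD1", "S1", "C1", "P1"]
P2 = ["MOD1", "S2", "C2", "P2"]

print(numa_distance(P1, P2, [3, 2, 1]))  # three NUMA levels
print(numa_distance(P1, P2, [2]))        # one level, no common boundary
print(numa_distance(P1, P2, [1]))        # one level, same hardware module
```

With the three reference-point settings used in the document, this gives
40, 20 and 10 respectively, matching the worked examples in the text.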