Re: [Qemu-ppc] qemu-ppc and NUMA topology
From: Nishanth Aravamudan
Subject: Re: [Qemu-ppc] qemu-ppc and NUMA topology
Date: Wed, 28 May 2014 20:39:28 -0700
User-agent: Mutt/1.5.21 (2010-09-15)

On 29.05.2014 [11:57:01 +1000], Alexey Kardashevskiy wrote:
> On 05/29/2014 05:57 AM, Nishanth Aravamudan wrote:
> > On 27.05.2014 [14:59:03 -0700], Nishanth Aravamudan wrote:
> >> On 20.05.2014 [12:44:15 +1000], Alexey Kardashevskiy wrote:
> >>> On 05/20/2014 10:06 AM, Nishanth Aravamudan wrote:
> >>>> On 19.05.2014 [15:37:52 -0700], Nishanth Aravamudan wrote:
> >>>>> Hi Alexey,
> >>>>>
> >>>>> I've been looking at hw/ppc/spapr.c::spapr_populate_memory() and ran
> >>>>> into a few questions:
> >>>>>
> >>>>> 1) The values from 1 to nb_numa_nodes are used as indices into the
> >>>>> node_mem array, but that is not populated, necessarily, linearly.
> >>>>> vl.c::add_node() uses the nodeid parameter as the index into node_mem,
> >>>>> if it is specified.
> >>>>>
> >>>>> 2) The node ID is based upon the index into the array, but it seems like
> >>>>> it should actually be based upon the nodeid specified, if any. That is,
> >>>>> we set the value at index 4 (which is statically the reference point in
> >>>>> 'ibm,associativity-reference-points') of 'ibm,associativity' for each
> >>>>> 'ibm,address@hidden' node to the index we are currently at. But as
> >>>>> mentioned in 1) above that index isn't necessarily currently the nodeid
> >>>>> specified on the command-line.
> >>>>>
> >>>>> What this all means, is that if I specify something like:
> >>>>>
> >>>>> -numa node,nodeid=1,cpus=0-7,mem=2048 -numa
> >>>>> node,nodeid=5,cpus=8-15,mem=0 -numa node,nodeid=9,mem=2048
> >>>>>
> >>>>> Linux sees:
> >>>>>
> >>>>> numactl --hardware
> >>>>> available: 3 nodes (0-2)
> >>>>> node 0 cpus: 8 9 10 11 12 13 14 15
> >>>>> node 0 size: 0 MB
> >>>>> node 0 free: 0 MB
> >>>>> node 1 cpus: 0 1 2 3 4 5 6 7
> >>>>> node 1 size: 2024 MB
> >>>>> node 1 free: 1560 MB
> >>>>> node 2 cpus:
> >>>>> node 2 size: 0 MB
> >>>>> node 2 free: 0 MB
> >>>>>
> >>>>> Maybe we don't really care about this, but I just noticed it when trying
> >>>>> to reproduce some really weird topologies from PowerVM.
> >>>>
> >>>> Upon further investigation into node_mem, it seems like this assumption
> >>>> is present throughout the qemu code, e.g., the qemu monitor 'info numa'
> >>>> command. Will just document it for myself as a weird way to make
> >>>> memoryless nodes show up :)
> >>>
> >>> I never looked closely at this NUMA business so I know as much as you do
> >>> :)
> >>> You seem to be right, vl.c seems to get things right (it uses nodeid as an
> >>> index) but spapr.c is broken and we probably should fix it but it does not
> >>> sound very urgent to me...
> >>
> >> Well, and looking at it more, it feels like perhaps none of the
> >> qemu code is particularly careful about this -- and since you can
> >> explicitly assign 0 memory to a node, you can't simply check for 0 in
> >> node_mem for an unassigned node (and node_mem is an unsigned array).
> >>
> >> I'll look at the behavior on x86 and get back to you.
> >
> > Well, it looks like ppc is no worse off than x86 here -- passing a
> > similar command-line to qemu-system-x86_64, I get the same result in the
> > VM (nodes numbered starting at 0, etc).
> >
> > Perhaps it makes sense to not allow non-sequential NUMA node ordering,
> > since it isn't really supported anyways? I'm not entirely sure I see why
> > it'd be necessary for a guest in any case.
>
>
> How urgent is that thing? I do not have much time for experiments now (but
> I am still planning this) but I do not really think we need to put a new
> limit here (even if x86 does the same thing). If phyp can do non-sequential
> nodes, then guests most probably support it and all we have to do is cook
> a correct device tree...

I can try to work on this, too, as I'm trying to fix up the memoryless
node/NUMA support upstream for the kernel.

As to the device tree, I think the basic change is in how
ibm,associativity is populated, so that it carries the nodeid given on
the command line rather than the loop index. But we also need to be able
to tell which entries of node_mem were actually set. I guess we could
use an all-1s value to flag an entry as unset? Otherwise, I see no way
in the current code to distinguish between a node that was explicitly
assigned 0 memory and a node that was not specified at all.
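
To make the "all 1s as unset" idea concrete, here is a rough,
standalone sketch. It is not the actual qemu code: MAX_NODES here is an
arbitrary illustrative value, and NODE_MEM_UNSET, node_mem_init(),
node_mem_set(), node_is_configured() and count_configured() are
hypothetical names I made up for the example. It only assumes that
node_mem is an unsigned 64-bit array indexed by nodeid.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES      16          /* illustrative only, not qemu's value */
#define NODE_MEM_UNSET UINT64_MAX  /* hypothetical "all 1s" sentinel */

static uint64_t node_mem[MAX_NODES];

/* Mark every slot as "never specified on the command line". */
static void node_mem_init(void)
{
    for (unsigned i = 0; i < MAX_NODES; i++) {
        node_mem[i] = NODE_MEM_UNSET;
    }
}

/*
 * Stand-in for storing a size under an explicit nodeid. The sentinel
 * itself is rejected as a size so it can never be confused with a
 * real assignment.
 */
static int node_mem_set(unsigned nodeid, uint64_t size_mb)
{
    if (nodeid >= MAX_NODES || size_mb == NODE_MEM_UNSET) {
        return -1;
    }
    node_mem[nodeid] = size_mb;
    return 0;
}

/* A node with 0 MB is still configured; only the sentinel means unset. */
static bool node_is_configured(unsigned nodeid)
{
    return nodeid < MAX_NODES && node_mem[nodeid] != NODE_MEM_UNSET;
}

/* Count configured nodes, including explicitly memoryless ones. */
static unsigned count_configured(void)
{
    unsigned n = 0;

    for (unsigned i = 0; i < MAX_NODES; i++) {
        if (node_is_configured(i)) {
            n++;
        }
    }
    return n;
}
```

With the nodeid=1/5/9 command line from earlier in the thread, a loop
over 0..MAX_NODES-1 that skips unconfigured slots would visit exactly
nodes 1, 5 and 9, and node 5 would be seen as configured with 0 MB
rather than as missing. The loop index is then the real nodeid and
could be used directly when filling in ibm,associativity.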

In any case, it's not urgent :)

-Nish