From: Dipankar Sarma
Subject: Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Thu, 1 Dec 2011 22:55:20 +0530
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Nov 30, 2011 at 06:41:13PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> > create the guest topology correctly and optimize for NUMA. This
> > would work for us.
> 
> Even on the case of 1 guest that fits in one node, you're not going to
> max out the full bandwidth of all memory channels with this.
> 
> All qemu can do with ms_mbind/tbind is to create a vtopology that
> matches the hardware topology. It has these limits:
> 
> 1) requires all userland applications to be modified to scan either
>    the physical topology if run on host, or the vtopology if run on
>    guest to get the full benefit.

I am not sure why you would need that. qemu can reflect the
topology, based on the -numa specifications and the corresponding
ms_tbind/mbind calls, in the FDT (in the case of Power; I guess ACPI
tables for x86), and the guest kernel would detect this virtualized
topology. So there is no need for two types of topologies, afaics.
It will all be reflected in /sys/devices/system/node in the guest.
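
To illustrate what "reflected in /sys/devices/system/node" could mean for
a guest application, here is a minimal Python sketch that lists the NUMA
nodes the kernel exposes and the CPUs belonging to each. The function
names (parse_cpulist, read_numa_topology) and the root parameter are
hypothetical and exist only for this illustration. It assumes the usual
Linux layout of one nodeN directory per node containing a cpulist file,
and it is not part of the patch under discussion.

```python
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist means a node with no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]} for every nodeN directory under root.

    The root argument is an assumption added for this sketch so that the
    function can be pointed at a copy of the sysfs tree.
    """
    topology = {}
    if not os.path.isdir(root):
        return topology  # no NUMA information exposed at this path
    for entry in sorted(os.listdir(root)):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue  # skip files such as 'online' or 'possible'
        node_id = int(match.group(1))
        try:
            with open(os.path.join(root, entry, "cpulist")) as f:
                topology[node_id] = parse_cpulist(f.read())
        except OSError:
            topology[node_id] = []  # unreadable entry, record the node anyway
    return topology


if __name__ == "__main__":
    for node_id, cpus in read_numa_topology().items():
        print(f"node{node_id}: cpus {cpus}")
```

Run inside the guest, this shows the virtualized topology; run on the
host, the same code shows the physical one, so an application would not
need separate logic for the two cases.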

> 
> 2) breaks across live migration if host physical topology changes

That is indeed an issue. Either the VM placement software needs to
be really smart and migrate VMs only to hosts where they fit well or,
more likely, we will have to find a way to make guest kernels aware of
topology changes. But the latter has an impact on userspace
as well, for applications that may have been optimized for NUMA.

> 3) 1 small guest on an idle numa system that fits in one numa node will
>    not give enough information to the host kernel
> 
> 4) if used outside of qemu and one thread allocates more memory than
>    what fits in one node it won't give enough info to the host kernel.
> 
> About 3): if you've just one guest that fits in one node, each vcpu
> should be spread across all the nodes probably, and behave like
> MADV_INTERLEAVE if the guest CPU scheduler migrate guests processes in
> reverse, the global memory bandwidth will still be used full even if
> they will both access remote memory. I've just seen benchmarks where
> no pinning runs more than _twice_ as fast than pinning with just 1
> guest and only 10 vcpu threads, probably because of that.

I agree. Specifying a NUMA topology for the guest can result in
sub-optimal performance in some cases; it is a tradeoff.


> In short it's an incremental step that moves some logic to the kernel
> but I don't see it solving all situations optimally and it shares a
> lot of the limits of the hard bindings.

Agreed.

Thanks
Dipankar



