Re: [Qemu-devel] [PATCH QEMU] transparent hugepage support


From: Andrea Arcangeli
Subject: Re: [Qemu-devel] [PATCH QEMU] transparent hugepage support
Date: Fri, 12 Mar 2010 15:52:28 +0100

On Fri, Mar 12, 2010 at 11:36:33AM +0000, Paul Brook wrote:
> > On Thu, Mar 11, 2010 at 05:55:10PM +0000, Paul Brook wrote:
> > > sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.
> > 
> > There's not just one hugepage size 
> 
> We only have one madvise flag...

Transparent hugepage support means _really_ transparent: it's not up
to userland to know what hugepage size the kernel uses. There is no
way for userland to notice anything except that it runs faster.

There is only one madvise flag, and it exists for one reason: embedded
systems may want to turn the transparency feature off to avoid the
risk of using a little more memory during anonymous-memory
copy-on-writes after fork or similar. But for things like kvm there is
absolutely zero memory waste in enabling hugepages, so even embedded
definitely wants to enable transparent hugepage and run faster on its
underpowered CPU.

If it weren't for embedded, the madvise flag would need to be dropped
as it would be pointless. It's not about the page size at all.
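
For reference, a minimal sketch of what the hint looks like from
userland (plain C, nothing qemu specific; the fallback #define of
MADV_HUGEPAGE is an assumption for older headers):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14    /* assumption: value in recent kernel headers */
#endif

/* Hint that an anonymous region may be backed by transparent
 * hugepages. This is only a preference: userland never learns the
 * hugepage size, it just (hopefully) runs faster. */
static int hint_hugepages(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_HUGEPAGE) < 0) {
        perror("madvise(MADV_HUGEPAGE)");
        return -1;
    }
    return 0;
}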

> > and that thing doesn't exist yet
> > plus it'd require mangling over glibc too. If it existed I could use
> > it but I think this is better:
>  
> > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > 2097152
> 
> Is "pmd" x86 specific?

It's linux specific; this is common code, nothing x86 specific. In
fact on x86 it's not called pmd but Page Directory. I've actually no
idea what pmd stands for, but it's definitely not x86 specific and
it's just about the linux common code shared by all archs. The reason
this is called hpage_pmd_size is that it's a #define HPAGE_PMD_SIZE in
the kernel code. So this entirely matches the kernel-internal
_common_code_.
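
Just to make it concrete, a minimal sketch of how userland could pick
the value up (assuming the sysfs file above is present, with a
hardcoded 2M fallback otherwise; not part of the actual patch):

#include <stdio.h>

/* Read the transparent hugepage size the kernel exports; fall back to
 * 2M if the file is missing (older kernel or THP not configured). */
static unsigned long thp_pmd_size(void)
{
    unsigned long size = 2UL * 1024 * 1024;    /* assumption: fallback */
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

    if (f) {
        if (fscanf(f, "%lu", &size) != 1)
            size = 2UL * 1024 * 1024;
        fclose(f);
    }
    return size;
}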

> > > If the allocation size is not a multiple of the preferred alignment, then
> > > you probably lose either way, and we shouldn't be requesting increased
> > > alignment.
> > 
> > That's probably a good idea. Also note, if we were to allocate the
> > 0-640k and 1m-end ranges separately, for NPT to work we'd need to
> > start the second block misaligned, at a 1m address. So maybe I should
> > move the alignment out of qemu_ram_alloc and have it in the caller?
> 
> I think the only viable solution if you care about EPT/NPT is to not do that. 
> With your current code the 1m-end region will be misaligned - your code 

Well, with my current code on top of the current qemu code, there is
no risk of misalignment because the whole 0-4G range is allocated by a
single qemu_ram_alloc. I'm sure it works right because
/debugfs/kvm/largepages shows all ram in largepages, and otherwise I
wouldn't get a reproducible 6% boost on kernel compiles in the guest
even on a common $150 quad core workstation (without even considering
the boost on huge systems).

> allocates it on a 2M boundary. I suspect you actually want (base % 2M) == 1M. 
> Aligning on a 1M boundary will only DTRT half the time.

The 1m-end case is a hypothetical worry that came to mind as I was
discussing the issue with you. Basically my point is that if the pc.c
code changes and starts calling qemu_ram_alloc for the 0-640k and
1M-4G ranges separately (this is _not_ what qemu does right now), the
alignment in qemu_ram_alloc that works right now would then stop
working.

This is why I thought maybe it's more correct (and less
virtual-ram-wasteful) to move the alignment into the caller, even if
the patch will grow in size and become pc.c specific (which it
wouldn't need to be if other archs supported transparent hugepage).

I think with what you're saying above you're basically agreeing with
me that I should move the alignment into the caller. Correct me if I
misunderstood.
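
To make that concrete, a hypothetical sketch (invented names, not the
actual pc.c or qemu_ram_alloc code) of what the caller-side adjustment
could look like: the host virtual address should be congruent to the
guest physical base modulo the hugepage size, so guest and host 2M
boundaries line up.

#include <stdint.h>

/* Hypothetical helper: given a host buffer over-allocated by at least
 * hpage_size bytes, return the first address inside it whose offset
 * modulo hpage_size equals the guest physical base's offset. For a
 * 1M-4G block that means (host_addr % 2M) == 1M, as discussed above.
 * hpage_size must be a power of two. */
static uintptr_t hpage_align_for(uintptr_t host_addr,
                                 uint64_t guest_phys_base,
                                 unsigned long hpage_size)
{
    uintptr_t want = guest_phys_base & (hpage_size - 1);
    uintptr_t have = host_addr & (hpage_size - 1);

    return host_addr + ((want - have) & (hpage_size - 1));
}

The caller would then simply over-allocate by hpage_size bytes so the
adjustment always fits.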

> But that's only going to happen if you align the allocation.

Yep, this is why I agree with you: it's better to always align, even
when kvm_enabled() == 0.

> It can't choose what align to use, but it can (should?) choose how to achieve 
> that alignment.

Ok, but I don't see a problem in how it achieves it; in fact I think
it's more efficient than a kernel-assisted alignment, which would then
force the vma to be split, generating a micro-slowdown.
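
For completeness, the usual userland-only way to get the alignment (a
sketch assuming anonymous mmap backs guest RAM, which isn't
necessarily what qemu_ram_alloc does today): over-map by one hugepage
and unmap the slack.

#include <stdint.h>
#include <sys/mman.h>

/* Over-map by one hugepage, then trim the leading and trailing slack
 * so the returned region starts on a hugepage boundary. hpage must be
 * a power of two (e.g. the 2M value read from sysfs above) and size is
 * assumed to be page aligned, as guest RAM always is. */
static void *alloc_hpage_aligned(size_t size, size_t hpage)
{
    size_t map_size = size + hpage;
    char *p = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *aligned;
    size_t head, tail;

    if (p == MAP_FAILED)
        return NULL;

    aligned = (char *)(((uintptr_t)p + hpage - 1) & ~(uintptr_t)(hpage - 1));
    head = aligned - p;
    tail = map_size - head - size;

    if (head)
        munmap(p, head);
    if (tail)
        munmap(aligned + size, tail);
    return aligned;
}

The trimming happens once at setup, and the guest RAM itself stays in
a single vma.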



