From: Andrea Arcangeli
Subject: Re: [Qemu-devel] [PATCH 0/17 v3] Localhost migration with side channel for ram
Date: Wed, 27 Nov 2013 17:48:53 +0100

On Tue, Nov 26, 2013 at 12:17:09PM +0100, Paolo Bonzini wrote:
> On 26/11/2013 12:07, Lei Li wrote:
> > For this, I am not quite sure I understand it correctly, seems the latest
> > update of post copy migration was sent on last Oct, would you please give
> > some insights on what else could I do for the coupling with postcopy
> > migration?
> 
> I don't know the state exactly.  Orit and Andrea should know.

Ok, about the last update that was sent: I'm not optimistic that its
kernel backend is good, because it uses a device driver that allocates
the memory locally and effectively disables THP, KSM, swap,
compression, overcommit and automatic NUMA balancing.

I wrote a new kernel backend by introducing two new kernel features:

1) MADV_USERFAULT (to deliver the KVM/qemu page fault to qemu userland)

2) remap_anon_pages (new syscall that qemu will use inside the
   migration thread that gets out of band events from the userland
   page fault, and also to do the background network transfer of all
   RAM while the guest already runs on the destination node)

Now you use vmsplice so you don't need remap_anon_pages in your case.

You only need MADV_USERFAULT.
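
For comparison, the vmsplice path mentioned above pushes pages through
a pipe. The following is only a minimal sketch of that plumbing, not
the code from the patch series: pipe_send_pages and pipe_recv are
hypothetical helper names, and the receiving side here uses a plain
read(), which copies the data.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: queue len bytes starting at addr into the write
 * end of a pipe. Without SPLICE_F_GIFT the pipe only references the
 * user pages, so the caller must not modify them until they are read. */
ssize_t pipe_send_pages(int pipe_wr, void *addr, size_t len)
{
    struct iovec iov = { .iov_base = addr, .iov_len = len };
    return vmsplice(pipe_wr, &iov, 1, 0);
}

/* Hypothetical helper: drain exactly len bytes from the read end,
 * stopping early on EOF. Returns bytes read or -1 on error. */
ssize_t pipe_recv(int pipe_rd, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(pipe_rd, (char *)buf + done, len - done);
        if (n < 0)
            return -1;
        if (n == 0)
            break;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

The sender never copies the payload itself; the pipe is the hop that
the remap_anon_pages approach is meant to avoid.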

I added a FOLL_USERFAULT flag too: if it's KVM that traps on the
fault, it will have to deliver the fault to qemu through a vmexit, and
it's not doing that yet. KVM page faults that call gup_fast will have
to use FOLL_USERFAULT. This also means changing the API of all
gup_fast variants to take a "foll" parameter, but we need to do that
anyway to remove the FOLL_GET and fix /dev/mem mapped as guest
physical memory (FOLL_GET on /dev/mem backfires), and also to speed up
the page fault by avoiding the useless get_page/put_page during every
fault (MMU notifiers don't require FOLL_GET or any page reference at
any time, as long as the page goes in the spte and the proper spte
locks are held to serialize against the MMU notifier events).

For the non-local case, remap_anon_pages should be faster than
vmsplice as it doesn't need to pass through a pipe and just mangles
two pagetables and two pmds based on the virtual address given as
parameter.
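
To make the "mangles pagetables instead of copying" point concrete,
here is a sketch that uses mremap(), which exists on mainline, as a
stand-in. remap_anon_pages is a proposed syscall and its interface is
not reproduced here; mremap with MREMAP_FIXED moves a whole mapping
(and unmaps the source range) rather than moving pages between two
existing vmas, so it only approximates the behaviour. The function
name move_pages_nocopy is hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: move len bytes of page-aligned anonymous memory
 * from src to dst by relinking page tables, without copying the data.
 * Anything previously mapped at dst is replaced and src is left
 * unmapped. Returns 0 on success, -1 on error. */
int move_pages_nocopy(void *dst, void *src, size_t len)
{
    void *r = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
    return r == dst ? 0 : -1;
}
```

In a postcopy-like flow, src would be a temporary buffer that just
received page data and dst the matching guest physical address range.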

If you want to review the kernel backend I implemented for postcopy,
it is up to date in my latest aa.git tree:

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=e69e1067f1d7e0f441c0c222a1017a07afe0bfc9
http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=d182b5118e2b22dd73018b75dce027c4ebabce14

I also looked into sharing code with the volatile range work (for
Android temporary page mappings that can be discarded), but that has
various reasons to want to put placeholders into the pagetables. The
functionality is different too, which is after all why the volatile
range needs to put placeholders into the empty pagetables...

I don't think we can use the volatile range because that would discard
the pages too. MADV_USERFAULT is also somewhat simpler and it provides
just the user fault functionality (it cannot discard the pages). It
sends a SIGBUS instead of mapping a zero page and it doesn't even
require allocating empty pagetables for the userfault range.
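
As a rough illustration of the userland side of this idea, here is a
minimal sketch that runs on a mainline kernel. MADV_USERFAULT is only
proposed in the aa.git tree, so the sketch swaps in a different
mechanism: the range is mapped PROT_NONE, the first touch of each page
raises SIGSEGV (not the SIGBUS described above), and the handler fills
the page before the access is retried. All names (uf_region_create,
uf_handler, uf_fault_count) are hypothetical, and the in-memory src
buffer is an assumed stand-in for data received from the migration
source.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical state for illustration; this is not qemu or kernel code. */
static uint8_t *uf_base;        /* start of the emulated "guest RAM" */
static size_t uf_len;           /* length of the range in bytes */
static const uint8_t *uf_src;   /* stand-in for pages fetched remotely */
static size_t uf_page;          /* system page size */
static volatile sig_atomic_t uf_faults;

static void uf_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)uf_base;

    if (addr < base || addr >= base + uf_len)
        _exit(1);               /* a genuine crash, not one of our faults */

    uintptr_t page = addr & ~(uintptr_t)(uf_page - 1);
    /* mprotect() is not formally async-signal-safe; calling it from a
     * synchronous fault handler is a common practice on Linux. */
    if (mprotect((void *)page, uf_page, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
    memcpy((void *)page, uf_src + (page - base), uf_page);
    uf_faults++;
}

/* Map len bytes (a multiple of the page size) with no access rights
 * and arrange for each page to be filled from src on first touch.
 * Returns the region, or NULL on error. */
void *uf_region_create(const uint8_t *src, size_t len)
{
    struct sigaction sa;
    void *p;

    uf_page = (size_t)sysconf(_SC_PAGESIZE);
    p = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    uf_base = p;
    uf_len = len;
    uf_src = src;
    uf_faults = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = uf_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}

/* Number of pages filled on demand so far. */
int uf_fault_count(void)
{
    return (int)uf_faults;
}
```

A caller creates the region once and then simply reads or writes it;
each page costs one fault on first touch and none afterwards. Unlike
this emulation, the proposed flag would keep the range as ordinary
anonymous memory for the whole time.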

Once the live migration is complete, MADV_USERFAULT should be cleared
from the vma simply with an madvise call, and any sign of it will go
away (unlike the device driver, which stays forever). And once postcopy
completes, all RAM is already entirely anonymous and already backed by
THP (if the out of band network transfers are 2M large it'll create 2M
pages in zero copy and there will never be any sign of 4k pages for
the whole duration of migration), and the userfaulted memory can be
NUMA migrated or swapped out at any time. MADV_USERFAULT doesn't
interfere with swapouts.

remap_anon_pages also doesn't interfere with swapouts or automatic
NUMA migrations: if the received page gets swapped out before the
migration thread maps it in the guest physical address space, the
swap entry is transferred from the temporary address to the guest
physical address, still with a single copy that reads and writes 8
bytes (just 1 cacheline written, modulo PT locks), and no I/O
is triggered.

It would have been possible to also extend remap_file_pages to work on
anonymous memory instead of only nonlinear file mappings; however, that
would alter the API as it wouldn't return -EINVAL anymore. It's easy
to change things if we want to use remap_file_pages for anonymous
memory too. Some larger discussion on the API details will be needed,
but we're not at that point yet I think, and currently I'm more
interested in sorting out the low-level details first. The kernel
backend API should be frozen at the last possible moment, I think.

The qemu userland details of postcopy using the new kernel features are
still not finished, but conceptually the design is pretty clear.

This is far from definitive; if somebody has better ideas, please
comment, of course.

Thanks,
Andrea


