From: Michael R. Hines
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Wed, 19 Feb 2014 09:40:07 +0800
User-agent: Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Thunderbird/24.3.0

On 02/18/2014 08:45 PM, Dr. David Alan Gilbert wrote:
+The Micro-Checkpointing Process
+Basic Algorithm
+Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can 
effectively be understood as a "live migration that never ends". As such, 
iteration rounds happen at the granularity of 10s of milliseconds and perform the 
following steps:
+
+1. After N milliseconds, stop the VM.
+2. Generate a MC by invoking the live migration software path to identify and 
copy dirty memory into a local staging area inside QEMU.
+3. Resume the VM immediately so that it can make forward progress.
+4. Transmit the checkpoint to the destination.
+5. Repeat.
+Upon failure, load the contents of the last MC at the destination back into 
memory and run the VM normally.
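
To make the loop concrete, here is a minimal sketch in C of the cycle
described above; every name below is an illustrative stand-in, not an
actual symbol from this patch:

    /* Hypothetical sketch of the micro-checkpointing loop; names are
     * illustrative, not the patch's actual symbols. */
    typedef struct VM VM;
    typedef struct Checkpoint Checkpoint;

    extern void vm_stop(VM *vm);
    extern void vm_resume(VM *vm);
    extern Checkpoint *mc_copy_dirty_to_staging(VM *vm); /* live-migration path */
    extern void mc_transmit(Checkpoint *mc, int dest_fd);
    extern void sleep_ms(int ms);

    static void mc_loop(VM *vm, int delay_ms, int dest_fd)
    {
        for (;;) {
            sleep_ms(delay_ms);                            /* run for N ms      */
            vm_stop(vm);                                   /* 1. stop the VM    */
            Checkpoint *mc = mc_copy_dirty_to_staging(vm); /* 2. generate MC    */
            vm_resume(vm);                                 /* 3. resume at once */
            mc_transmit(mc, dest_fd);                      /* 4. send to dest   */
        }                                                  /* 5. repeat         */
    }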
Later you talk about the memory allocation and how you grow the memory as needed
to fit the checkpoint; have you tried going the other way and triggering the
checkpoints sooner if they're taking too much memory?

There is a 'knob' in this patch called "mc-set-delay", which was designed
to solve exactly that problem. It allows policy or management software
to make an independent decision about what the frequency of the
checkpoints should be.

I wasn't comfortable implementing policy directly inside the patch, as
that seemed less likely to get accepted by the community quickly.
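
For illustration only - assuming the knob is exposed as a monitor command
of the same name taking the interval in milliseconds (the exact syntax here
is an assumption, not taken from the patch) - a policy daemon might react
to growing checkpoints like this:

    (qemu) mc-set-delay 100     <- checkpoint every 100 ms
    (qemu) mc-set-delay 20      <- tighten the interval when MCs grow too large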

+1. MC over TCP/IP: Once the socket connection breaks, we assume
failure. This is detected very soon after the loss of the latest MC, not
only because a very large number of bytes is typically being sequenced in
a TCP stream, but perhaps also because of the timeout on acknowledgement
of the receipt of a commit message by the destination.
+
+2. MC over RDMA: Since InfiniBand does not provide any underlying
timeout mechanisms, this implementation enhances QEMU's RDMA migration
protocol to include a simple keep-alive. Upon the loss of multiple
keep-alive messages, the sender is deemed to have failed.
+
+In both cases, whether due to a failed TCP socket connection or a lost group 
of RDMA keep-alives, either the sender or the receiver can be deemed to have failed.
+
+If the sender is deemed to have failed, the destination takes over immediately 
using the contents of the last checkpoint.
+
+If the destination is deemed to be lost, we perform the same action
as a live migration: resume the sender normally and wait for management
software to make a policy decision about whether or not to re-protect
the VM, which may involve a third party identifying a new destination
host to use as a backup for the VM.
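
As a rough illustration of the keep-alive detection described in (2)
above - the names, interval, and threshold here are assumptions for the
sketch, not the patch's actual protocol:

    #include <stdbool.h>

    #define KEEPALIVE_INTERVAL_MS 100  /* assumed probe interval           */
    #define MAX_MISSED            5    /* assumed losses before declaring  */

    /* Hypothetical: waits up to timeout_ms for one keep-alive message. */
    extern bool rdma_recv_keepalive(int timeout_ms);

    /* Blocks until the peer is deemed to have failed. */
    static void wait_for_peer_failure(void)
    {
        int missed = 0;
        while (missed < MAX_MISSED) {
            if (rdma_recv_keepalive(KEEPALIVE_INTERVAL_MS)) {
                missed = 0;   /* peer is alive; reset the counter          */
            } else {
                missed++;     /* another interval with no keep-alive       */
            }
        }
        /* MAX_MISSED consecutive losses: peer declared failed. */
    }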
In this world, what is making the decision about whether the sender/destination
should win - how do you avoid a split-brain situation where both
VMs are running but the only thing that failed is the comms between them?
Is there any guarantee that you'll have received knowledge of the comms
failure before you pull the plug out and enable the corked packets to be
sent on the sender side?

Good question in general - I'll add it to the FAQ. The patch implements
a basic 'transaction' mechanism in coordination with an outbound I/O
buffer (documented further down). With these two things in
place, split-brain is not possible because the destination is not running.
We don't allow the destination to resume execution until a committed
transaction has been acknowledged by the destination, and only then
do we allow any outbound network traffic to be released to the
outside world.
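
A condensed sketch of that ordering with hypothetical names (the actual
transaction protocol is documented further down in the patch):

    #include <stdbool.h>

    typedef struct Checkpoint Checkpoint;
    typedef struct NetBuffer NetBuffer;   /* corked outbound packets */

    extern void send_checkpoint(Checkpoint *mc);
    extern bool wait_for_ack(int timeout_ms);
    extern void release_buffered_packets(NetBuffer *nb);

    static void commit_epoch(Checkpoint *mc, NetBuffer *nb)
    {
        send_checkpoint(mc);               /* transmit the MC                */
        if (wait_for_ack(5000)) {          /* destination committed it       */
            release_buffered_packets(nb);  /* safe: dest state covers output */
        }
        /* No ACK: the packets stay corked, and the destination holds only
         * the last *committed* checkpoint, so both sides can never be
         * visibly running at once. */
    }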

<snip>

+RDMA is used for two different reasons:
+
+1. Checkpoint generation (RDMA-based memcpy):
+2. Checkpoint transmission
+Checkpoint generation must be done while the VM is paused. In the
worst case, the size of the checkpoint can be equal to the total amount
of memory in use by the VM. In order to resume VM execution as
fast as possible, the checkpoint is consistently copied into a local
staging area before transmission. A standard memcpy() of such a
potentially large amount of memory not only gets no benefit from the
CPU cache but also clogs up the CPU pipeline, which would otherwise be
available to other neighboring VMs on the same physical node that could
be scheduled for execution. To minimize the effect on neighboring VMs, we
use RDMA to perform a "local" memcpy(), bypassing the host processor. On
more recent processors, a 'beefy' enough memory bus architecture can
move memory just as fast as (sometimes faster than) a pure-software,
CPU-only optimized memcpy() from libc. However, on older computers, this
feature only gives you the benefit of lower CPU-utilization at the expense of
Isn't there a generic kernel DMA ABI for doing this type of thing? (I
think there was at one point; people have suggested things like using
graphics cards to do it, but I don't know if it ever happened.)
The other question is: do you always need to copy - what about something
like COWing the pages?

Excellent question! Responding in two parts:

1) The kernel ABI 'vmsplice' is what I think you're referring to. Correct
   me if I'm wrong, but vmsplice was actually designed to avoid copies
   entirely between two userspace programs, to move memory more
   efficiently - whereas a fault-tolerant system actually *needs* a copy
   to be made.

2) Using COW: Actually, I think that's an excellent idea. I've bounced
   that around with my colleagues, but we simply didn't have the manpower
   to implement it and benchmark it. There was also some concern about
   performance: would the writable working set of the guest be so
   active/busy that COW would not get you much benefit? I think it's
   worth a try. Patches welcome =)
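
For what it's worth, one common way to prototype the COW idea without
kernel changes is fork(): the child inherits a copy-on-write snapshot of
the address space and can transmit it while the parent resumes the guest.
A minimal illustration, not part of this patch:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern void transmit_snapshot(void);   /* hypothetical: walks guest RAM */

    static void cow_checkpoint(void)
    {
        pid_t pid = fork();                /* COW snapshot, no bulk copy    */
        if (pid == 0) {
            transmit_snapshot();           /* child reads the frozen pages  */
            _exit(0);
        } else if (pid > 0) {
            /* Parent resumes the guest immediately; subsequent guest writes
             * fault and copy pages - exactly the "writable working set"
             * cost questioned above. */
            waitpid(pid, NULL, 0);
        } else {
            perror("fork");
        }
    }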

- Michael



