qemu-devel

Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency


From: Hongyang Yang
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Date: Fri, 12 Sep 2014 09:24:17 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0



On 09/12/2014 01:44 AM, Dr. David Alan Gilbert wrote:
(I've cc'd in Fam, Stefan, and Kevin for Block stuff, and
               Yang and Eddie for Colo)

* Walid Nouri (address@hidden) wrote:
Hello Michael, Hello Paolo
I have "studied" the available documentation/information and tried to
get an idea of the QEMU live block operation possibilities.

I think the MC protocol doesn't need synchronous block device replication
because the primary and secondary VMs are not synchronous. The state of the
primary is always ahead of the state of the secondary. When the primary is
in epoch(n) the secondary is in epoch(n-1).

What MC needs is a block device agnostic, controlled and asynchronous
approach for replicating the contents of block devices and their state changes
to the secondary VM while the primary VM is running. Asynchronous block
transfer is important to allow maximum performance for the primary VM, while
keeping the secondary VM updated with state changes.

The block device replication should be possible in two stages or modes.

The first stage is the live copy of all block devices of the primary to the
secondary. This is necessary if the secondary doesn't have an existing
image which is in sync with the primary at the time MC is started. This is
not very convenient, but as far as I know there is currently no mechanism for
a persistent dirty bitmap in QEMU.

The second stage (mode) is the replication of block device state changes
(modified blocks) to keep the image on the secondary in sync with the
primary. The mirrored blocks must be buffered in ram (block buffer) until
the complete Checkpoint (RAM, vCPU, device state) can be committed.
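
The two stages can be pictured with a small sketch. It is only an
illustration under stated assumptions, not QEMU code: TrackedDisk, full_copy
and incremental_sync are hypothetical names, the dirty set is a plain
in-memory stand-in for a dirty bitmap, and changes are applied to the
secondary directly, leaving out the RAM buffering of a checkpoint.

```python
BLOCK_SIZE = 4  # tiny block size so the example stays readable


class TrackedDisk:
    """In-memory disk that remembers which blocks changed since the last sync."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.dirty = set()  # hypothetical, non-persistent dirty bitmap

    def write(self, index, data):
        self.blocks[index] = data
        self.dirty.add(index)


def full_copy(primary, secondary_blocks):
    """Stage 1: start tracking from a clean state, then copy every block."""
    primary.dirty.clear()
    for index, data in enumerate(primary.blocks):
        secondary_blocks[index] = data
    return len(primary.blocks)


def incremental_sync(primary, secondary_blocks):
    """Stage 2: send only the blocks modified since the previous sync."""
    sent = sorted(primary.dirty)
    for index in sent:
        secondary_blocks[index] = primary.blocks[index]
    primary.dirty.clear()
    return sent


primary = TrackedDisk(8)
primary.write(2, b"aaaa")

secondary = [None] * 8
copied = full_copy(primary, secondary)          # stage 1: all 8 blocks

primary.write(5, b"bbbb")
primary.write(2, b"cccc")
resent = incremental_sync(primary, secondary)   # stage 2: only blocks 2 and 5
```

Because the dirty set lives only in memory, a restart loses it and stage 1
has to be repeated, which is the inconvenience mentioned above.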

For keeping the complete system state consistent on the secondary system
there must be a possibility for MC to commit/discard block device state
changes. In normal operation the mirrored block device state changes (block
buffer) are committed to disk when the complete checkpoint is committed. In
case of a crash of the primary system while transferring a checkpoint the
data in the block buffer corresponding to the failed Checkpoint must be
discarded.
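
A minimal sketch of the commit/discard behaviour described above, assuming
each mirrored write is tagged with the epoch of the checkpoint it belongs
to. CheckpointBlockBuffer and its methods are hypothetical names chosen for
illustration, and a dict stands in for the secondary's disk image.

```python
class CheckpointBlockBuffer:
    """Holds mirrored writes in RAM until their checkpoint is committed."""

    def __init__(self, image):
        self.image = image   # dict: sector number -> bytes (stand-in for the disk image)
        self.pending = {}    # epoch -> list of (sector, data) in arrival order

    def receive(self, epoch, sector, data):
        """Buffer one mirrored write; the image is not touched yet."""
        self.pending.setdefault(epoch, []).append((sector, data))

    def commit(self, epoch):
        """Checkpoint complete: apply the epoch's writes in arrival order."""
        for sector, data in self.pending.pop(epoch, []):
            self.image[sector] = data

    def discard(self, epoch):
        """Checkpoint failed: drop the epoch's writes, leaving the image as it was."""
        self.pending.pop(epoch, None)


image = {0: b"boot", 1: b"old1"}
buf = CheckpointBlockBuffer(image)

# Epoch 1 arrives completely and is committed.
buf.receive(1, 1, b"new1")
buf.receive(1, 2, b"new2")
before_commit = dict(image)
buf.commit(1)
after_commit = dict(image)

# Epoch 2 is interrupted (primary crashed mid-checkpoint) and is discarded.
buf.receive(2, 0, b"torn")
buf.discard(2)
after_discard = dict(image)
```

The image only ever reflects fully committed checkpoints, so after a
failover the disk state matches the last RAM/vCPU/device state the
secondary accepted.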

I think for COLO there's a requirement that the secondary can do reads/writes
in parallel with the primary, and the secondary can discard those reads/writes
- and that doesn't happen in MC (Yang or Eddie should be able to confirm that).

Exactly, COLO needs this functionality to ensure consistency.


The storage architecture should be "shared nothing" so that no shared
storage is required and primary/secondary can have separate block device
images.

MC/COLO with shared storage still needs some stuff like this; but it's subtly
different.   They still need to be able to buffer/release modifications
to the shared storage; if any of this code can also be used in the
shared-storage configurations it would be good.

Shared storage is more complicated; we don't support shared storage currently...


I think this can be achieved by drive-mirror and a filter block driver.
Another approach could be to exploit the block migration functionality of
live migration with a filter block driver.

Drive-mirror (and live migration) does not rely on shared storage and
allows live block device copy and incremental syncing.

A block buffer can be implemented with a QEMU filter block driver. It should
sit at the same position as the Quorum driver in the block driver hierarchy.
With the block filter approach, MC will be transparent and block device
agnostic.

The block buffer filter must have an interface which allows MC to control the
commits or discards of block device state changes. I have no idea where to
put such an interface so that it conforms to the QEMU coding style.
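
To illustrate what "transparent and block device agnostic" could mean for
such a filter, here is a rough sketch in which the same filter is stacked
on two unrelated backends. This is not QEMU's block driver API: BlockBackend,
DictBackend, FlatBackend and BufferFilter are hypothetical names, and
letting reads see buffered data is an assumption of this sketch, not
something the proposal specifies.

```python
from typing import Protocol


class BlockBackend(Protocol):
    """The only thing the filter assumes about the layer below it."""

    def read(self, sector: int) -> bytes: ...
    def write(self, sector: int, data: bytes) -> None: ...


class DictBackend:
    """Sparse backend; unwritten sectors read as zeroes."""

    def __init__(self, sector_size=4):
        self.sector_size = sector_size
        self.sectors = {}

    def read(self, sector):
        return self.sectors.get(sector, bytes(self.sector_size))

    def write(self, sector, data):
        self.sectors[sector] = data


class FlatBackend:
    """Contiguous backend stored in a single bytearray."""

    def __init__(self, num_sectors, sector_size=4):
        self.sector_size = sector_size
        self.data = bytearray(num_sectors * sector_size)

    def read(self, sector):
        start = sector * self.sector_size
        return bytes(self.data[start:start + self.sector_size])

    def write(self, sector, data):
        start = sector * self.sector_size
        self.data[start:start + self.sector_size] = data


class BufferFilter:
    """Sits above any backend and exposes commit/discard as its control interface."""

    def __init__(self, child: BlockBackend):
        self.child = child
        self.buffered = {}  # sector -> latest uncommitted data

    def write(self, sector, data):
        self.buffered[sector] = data  # held back until commit()

    def read(self, sector):
        # Assumption: uncommitted data is visible to readers of the filter.
        if sector in self.buffered:
            return self.buffered[sector]
        return self.child.read(sector)

    def commit(self):
        for sector, data in self.buffered.items():
            self.child.write(sector, data)
        self.buffered.clear()

    def discard(self):
        self.buffered.clear()


results = []
for backend in (DictBackend(), FlatBackend(num_sectors=4)):
    flt = BufferFilter(backend)
    flt.write(1, b"abcd")
    flt.commit()           # reaches the backend
    flt.write(1, b"zzzz")
    flt.discard()          # never reaches the backend
    results.append(backend.read(1))
```

The filter never looks inside the layer below it, so the same
commit/discard control works whatever image format or protocol sits
underneath.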


I'm sure there are alternative and better approaches and I'm open to
any ideas.


Walid

On 17.08.2014 11:52, Paolo Bonzini wrote:
On 11/08/2014 22:15, Michael R. Hines wrote:
Excellent question: QEMU does have a feature called "drive-mirror"
in block/mirror.c that was introduced a couple of years ago. I'm not
sure what the adoption rate of the feature is, but I would start with
that one.

block/mirror.c is asynchronous, and there's no support for communicating
checkpoints back to the master.  However, the quorum disk driver could
be what you need.

There's also a series on the mailing list that lets quorum read only
from the primary, so that quorum can still do replication and fault
tolerance, but skip fault detection.

Paolo

There is also a second fault tolerance implementation that works a
little differently called "COLO" - you may have seen those emails on
the list too, but their method does not require a disk replication
solution, if I recall correctly.



--
Dr. David Alan Gilbert / address@hidden / Manchester, UK
.


--
Thanks,
Yang.


