Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description

From:	Dr. David Alan Gilbert
Subject:	Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
Date:	Thu, 5 Mar 2015 19:04:17 +0000
User-agent:	Mutt/1.5.23 (2014-03-12)

* Wen Congyang (address@hidden) wrote:
> On 03/05/2015 12:35 AM, Dr. David Alan Gilbert wrote:
> > * Wen Congyang (address@hidden) wrote:
> >> Signed-off-by: Wen Congyang <address@hidden>
> >> Signed-off-by: Paolo Bonzini <address@hidden>
> >> Signed-off-by: Yang Hongyang <address@hidden>
> >> Signed-off-by: zhanghailiang <address@hidden>
> >> Signed-off-by: Gonglei <address@hidden>
> > 
> > Hi,
> > 
> >> ---
> >>  docs/block-replication.txt | 129 
> >> +++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 129 insertions(+)
> >>  create mode 100644 docs/block-replication.txt
> >>
> >> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> >> new file mode 100644
> >> index 0000000..59150b8
> >> --- /dev/null
> >> +++ b/docs/block-replication.txt
> >> @@ -0,0 +1,129 @@
> >> +Block replication
> >> +----------------------------------------
> >> +Copyright Fujitsu, Corp. 2015
> >> +Copyright (c) 2015 Intel Corporation
> >> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> >> +
> >> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> +See the COPYING file in the top-level directory.
> >> +
> >> +The block replication is used for continuous checkpoints. It is designed
> >> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> >> +scene that Secondary VM is not running.
> >> +
> >> +This document gives an overview of block replication's design.
> >> +
> >> +== Background ==
> >> +High availability solutions such as micro checkpoint and COLO will do
> >> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> >> +identical right after a VM checkpoint, but becomes different as the VM
> >> +executes till the next checkpoint. To support disk contents checkpoint,
> >> +the modified disk contents in the Secondary VM must be buffered, and are
> >> +only dropped at next checkpoint time. To reduce the network transportation
> >> +effort at the time of checkpoint, the disk modification operations of
> >> +Primary disk are asynchronously forwarded to the Secondary node.
> > 
> > Can you explain how the block data is synchronised with the main checkpoint
> > stream?  i.e. when the secondary receives a new checkpoint how does it know
> > it's received all of the block writes from the primary associated with that
> > checkpoint and that all the following writes that it receives are for the
> > next checkpoint period?
> 
> NBD server will do it. Writing to NBD client will return after NBD server 
> replies
> the result(ACK or error).

Ah OK, so if the NBD client is synchronous then yes I can see that;
(I was confused by the word 'asynchronously' in your description above
but I guess that means asynchronous to the checkpoint stream).
I see that 'do_colo_transaction' keeps the primary stopped until after
the secondary does blk_do_checkpoint and then sends 'LOADED'.

I think yes that should work; although potentially you could make it faster;
since the primary doesn't need to know that it's write has been commited
until the next checkpoint, and if you could mark the separation in the two
checkpoints, then you could start the primary running again earlier.  But that's
all more complicated; this should work OK.

Thanks for the explanation,

Dave

> Thanks
> Wen Congyang
> 
> > 
> > Dave
> > 
> >> +
> >> +== Workflow ==
> >> +The following is the image of block replication workflow:
> >> +
> >> +        +----------------------+            +------------------------+
> >> +        |Primary Write Requests|            |Secondary Write Requests|
> >> +        +----------------------+            +------------------------+
> >> +                  |                                       |
> >> +                  |                                      (4)
> >> +                  |                                       V
> >> +                  |                              /-------------\
> >> +                  |      Copy and Forward        |             |
> >> +                  |---------(1)----------+       | Disk Buffer |
> >> +                  |                      |       |             |
> >> +                  |                     (3)      \-------------/
> >> +                  |                 speculative      ^
> >> +                  |                write through    (2)
> >> +                  |                      |           |
> >> +                  V                      V           |
> >> +           +--------------+           +----------------+
> >> +           | Primary Disk |           | Secondary Disk |
> >> +           +--------------+           +----------------+
> >> +
> >> +    1) Primary write requests will be copied and forwarded to Secondary
> >> +       QEMU.
> >> +    2) Before Primary write requests are written to Secondary disk, the
> >> +       original sector content will be read from Secondary disk and
> >> +       buffered in the Disk buffer, but it will not overwrite the existing
> >> +       sector content in the Disk buffer.
> >> +    3) Primary write requests will be written to Secondary disk.
> >> +    4) Secondary write requests will be buffered in the Disk buffer and it
> >> +       will overwrite the existing sector content in the buffer.
> >> +
> >> +== Architecture ==
> >> +We are going to implement COLO block replication from many basic
> >> +blocks that are already in QEMU.
> >> +
> >> +         virtio-blk       ||
> >> +             ^            ||                            .----------
> >> +             |            ||                            | Secondary
> >> +        1 Quorum          ||                            '----------
> >> +         /      \         ||
> >> +        /        \        ||
> >> +   Primary      2 NBD  ------->  2 NBD
> >> +     disk       client    ||     server                  virtio-blk
> >> +                          ||        ^                         ^
> >> +--------.                 ||        |                         |
> >> +Primary |                 ||  Secondary disk <--------- COLO buffer 3
> >> +--------'                 ||                   backing
> >> +
> >> +1) The disk on the primary is represented by a block device with two
> >> +children, providing replication between a primary disk and the host that
> >> +runs the secondary VM. The read pattern for quorum can be extended to
> >> +make the primary always read from the local disk instead of going through
> >> +NBD.
> >> +
> >> +2) The secondary disk receives writes from the primary VM through QEMU's
> >> +embedded NBD server (speculative write-through).
> >> +
> >> +3) The disk on the secondary is represented by a custom block device
> >> +("COLO buffer"). The disk buffer's backing image is the secondary disk,
> >> +and the disk buffer uses bdrv_add_before_write_notifier to implement
> >> +copy-on-write, similar to block/backup.c.
> >> +
> >> +== New block driver interface ==
> >> +We add three block driver interfaces to control block replication:
> >> +a. bdrv_start_replication()
> >> +   Start block replication, called in migration/checkpoint thread.
> >> +   We must call bdrv_start_replication() in secondary QEMU before
> >> +   calling bdrv_start_replication() in primary QEMU.
> >> +b. bdrv_do_checkpoint()
> >> +   This interface is called after all VM state is transfered to
> >> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> >> +c. bdrv_stop_replication()
> >> +   It is called when failover. We will flush the Disk buffer into
> >> +   Secondary Disk and stop block replication.
> >> +
> >> +== Usage ==
> >> +Primary:
> >> +  -drive if=xxx,driver=quorum,read-pattern=first,\
> >> +         children.0.file.filename=1.raw,\
> >> +         children.0.driver=raw,\
> >> +         children.1.file.driver=nbd+colo,\
> >> +         children.1.file.host=xxx,\
> >> +         children.1.file.port=xxx,\
> >> +         children.1.file.export=xxx,\
> >> +         children.1.driver=raw
> >> +  Note:
> >> +  1. NBD Client should not be the first child of quorum.
> >> +  2. There should be only one NBD Client.
> >> +  3. host is the secondary physical machine's hostname or IP
> >> +  4. Each disk must have its own export name.
> >> +
> >> +Secondary:
> >> +  -drive if=xxx,driver=blkcolo,export=xxx,\
> >> +         backing.file.filename=1.raw,\
> >> +         backing.driver=raw
> >> +  Then run qmp command:
> >> +    nbd_server_start host:port
> >> +  Note:
> >> +  1. The export name for the same disk must be the same in primary
> >> +     and secondary QEMU command line
> >> +  2. The qmp command nbd_server_start must be run before running the
> >> +     qmp command migrate on primary QEMU
> >> +  3. Don't use nbd_server_start's other options
> >> -- 
> >> 2.1.0
> >>
> > --
> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> > .
> > 
> 
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description, (continued)
- Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description, Dr. David Alan Gilbert, 2015/03/04
  - Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description, Wen Congyang, 2015/03/04
    - Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description, Dr. David Alan Gilbert <=

Prev by Date: Re: [Qemu-devel] [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation
Next by Date: Re: [Qemu-devel] [PATCH v4 2/5] target-i386: Remove unused APIC ID default code
Previous by thread: Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
Next by thread: [Qemu-devel] [Bug ?]Qemu segfault because of non-initial kvm_state variable
Index(es):
- Date
- Thread