Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description


From: Fam Zheng
Subject: Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
Date: Thu, 12 Feb 2015 17:46:21 +0800
User-agent: Mutt/1.5.23 (2014-03-12)

On Thu, 02/12 17:36, Hongyang Yang wrote:
> Hi Fam,
> 
> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >On Thu, 02/12 15:40, Wen Congyang wrote:
> >>On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>Hi Congyang,
> >>>
> >>>On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>+== Workflow ==
> >>>>+The following is the image of block replication workflow:
> >>>>+
> >>>>+        +----------------------+            +------------------------+
> >>>>+        |Primary Write Requests|            |Secondary Write Requests|
> >>>>+        +----------------------+            +------------------------+
> >>>>+                  |                                       |
> >>>>+                  |                                      (4)
> >>>>+                  |                                       V
> >>>>+                  |                              /-------------\
> >>>>+                  |      Copy and Forward        |             |
> >>>>+                  |---------(1)----------+       | Disk Buffer |
> >>>>+                  |                      |       |             |
> >>>>+                  |                     (3)      \-------------/
> >>>>+                  |                 speculative      ^
> >>>>+                  |                write through    (2)
> >>>>+                  |                      |           |
> >>>>+                  V                      V           |
> >>>>+           +--------------+           +----------------+
> >>>>+           | Primary Disk |           | Secondary Disk |
> >>>>+           +--------------+           +----------------+
> >>>>+
> >>>>+    1) Primary write requests will be copied and forwarded to Secondary
> >>>>+       QEMU.
> >>>>+    2) Before Primary write requests are written to Secondary disk, the
> >>>>+       original sector content will be read from Secondary disk and
> >>>>+       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>+       sector content in the Disk buffer.
> >>>
> >>>I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>reading them as "s/will be/are/g"
> >>>
> >>>Why do you need this buffer?
> >>
> >>We only sync the disk until the next checkpoint. Before the next checkpoint,
> >>the secondary VM writes to the buffer.
> >>
> >>>
> >>>If both primary and secondary write to the same sector, what is saved in the
> >>>buffer?
> >>
> >>The primary content will be written to the secondary disk, and the secondary
> >>content is saved in the buffer.
> >
> >I wonder if, alternatively, this is possible with an imaginary "writable
> >backing image" feature, as described below.
> >
> >When we have a normal backing chain,
> >
> >                {virtio-blk dev 'foo'}
> >                          |
> >                          |
> >                          |
> >     [base] <- [mid] <- (foo)
> >
> >Where [base] and [mid] are read-only and (foo) is writable. When we add an
> >overlay on top of an existing image,
> >
> >                {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
> >                          |                              |
> >                          |                              |
> >                          |                              |
> >     [base] <- [mid] <- (foo)  <---------------------- (bar)
> >
> >It's important to make sure that writes to 'foo' don't break the data for
> >'bar'.
> >We can utilize an automatic hidden drive-backup target:
> >
> >                {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
> >                          |                                                          |
> >                          |                                                          |
> >                          v                                                          v
> >
> >     [base] <- [mid] <- (foo)  <----------------- (hidden target) <--------------- (bar)
> >
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          v                              ^
> >                          >>>> drive-backup sync=none >>>>
> >
> >So when the guest writes to 'foo', the old data is moved to (hidden target),
> >which remains unchanged from (bar)'s PoV.
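> >
> >A rough QMP sketch of how that copy-before-write job could be started (this is
> >the existing drive-backup command with sync=none; the device and file names are
> >only placeholders, not something defined by this series):
> >
> >    { "execute": "drive-backup",
> >      "arguments": { "device": "foo",
> >                     "target": "hidden-target.qcow2",
> >                     "format": "qcow2",
> >                     "mode": "existing",
> >                     "sync": "none" } }
> >
> >With sync=none nothing is copied up front; only sectors that are about to be
> >overwritten on 'foo' are copied into the target first.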
> >
> >The drive in the middle is called "hidden" because QEMU creates it
> >automatically; the naming is arbitrary.
> >
> >It is interesting because it is a more general case of image fleecing, where
> >the (hidden target) is exposed via an NBD server for data-scanning (read-only)
> >purposes.
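> >
> >For the fleecing case, exposing the hidden target could look roughly like this
> >(nbd-server-start and nbd-server-add are existing commands; the drive name
> >"hidden0" and the address are made up for illustration):
> >
> >    { "execute": "nbd-server-start",
> >      "arguments": { "addr": { "type": "inet",
> >                               "data": { "host": "0.0.0.0",
> >                                         "port": "10809" } } } }
> >    { "execute": "nbd-server-add",
> >      "arguments": { "device": "hidden0", "writable": false } }
> >
> >A client can then scan the point-in-time data over NBD while the guest keeps
> >writing to 'foo'.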
> >
> >More interestingly, with the above facility, it is also possible to create a
> >guest-visible live snapshot (disk 'bar') of an existing device (disk 'foo')
> >very cheaply. Or call it a shadow copy if you will.
> >
> >Back to the COLO case, the configuration will be very similar:
> >
> >
> >                       {primary wr}                                                 {secondary vm}
> >                             |                                                            |
> >                             |                                                            |
> >                             |                                                            |
> >                             v                                                            v
> >
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> >
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             v                              ^
> >                             >>>> drive-backup sync=none >>>>
> >
> >The workflow analogue is:
> >
> >>>>+    1) Primary write requests will be copied and forwarded to Secondary
> >>>>+       QEMU.
> >
> >Primary write requests are forwarded to secondary QEMU as well.
> >
> >>>>+    2) Before Primary write requests are written to Secondary disk, the
> >>>>+       original sector content will be read from Secondary disk and
> >>>>+       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>+       sector content in the Disk buffer.
> >
> >Before Primary write requests are written to (nbd target), aka the Secondary
> >disk, the original sector content is read from it and copied to (hidden buf
> >disk) by drive-backup. It obviously will not overwrite the data in (active
> >disk).
> >
> >>>>+    3) Primary write requests will be written to Secondary disk.
> >
> >Primary write requests are written to (nbd target).
> >
> >>>>+    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>>+       will overwrite the existing sector content in the buffer.
> >
> >Secondary write requests are written to (active disk) as usual.
> >
> >Finally, when a checkpoint arrives, if you want to sync with the primary, just
> >drop the data in (hidden buf disk) and (active disk); when failover happens, if
> >you want to promote the secondary VM, you can commit (active disk) to (nbd
> >target), and drop the data in (hidden buf disk).
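> >
> >The failover step could probably reuse the existing commit machinery, roughly
> >like this (device and file names are placeholders; dropping the hidden buf
> >disk would still need COLO-specific handling):
> >
> >    { "execute": "block-commit",
> >      "arguments": { "device": "active-disk0",
> >                     "base": "secondary-disk.img" } }
> >
> >i.e. the contents of (active disk) are merged down into (nbd target), so the
> >secondary image alone reflects the promoted VM's state.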
> 
> If I understand correctly, you split the Disk Buffer into a hidden buf disk +
> an active disk. So all we need to implement is a buf disk (to be used as the
> hidden buf disk and the active disk mentioned above); apart from that, we can
> use existing mechanisms like backing files and drive-backup?
> 

Yes, but you need a separate driver to take care of the buffer logic, as
introduced in this series; it is less generic, but it does the same thing we
will need in the image fleecing use case.

Fam


