Re: [Qemu-block] [PATCH 00/21] new backup architecture
From: John Snow
Subject: Re: [Qemu-block] [PATCH 00/21] new backup architecture
Date: Tue, 10 Jan 2017 13:48:51 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.6.0
On 01/10/2017 01:05 AM, Jeff Cody wrote:
> On Mon, Jan 09, 2017 at 11:04:27AM +0000, Stefan Hajnoczi wrote:
>> On Fri, Dec 23, 2016 at 05:28:43PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>
>> Jeff or John: are you reviewing this?
>
> It's in my review queue, but it would probably be a good one for John to
> review as well if he has time.
>
Sorry, just back from vacation. It may take me a bit to get to it.
>>
>>> This is a new architecture for backup. It solves several current problems:
>>> 1. intersecting requests: currently, at request start we wait for all
>>> intersecting requests, which means that
>>>    a. we may wait even for clusters unrelated to our request
>>>    b. we are not fully async: if we are going to copy clusters 1,2,3,4
>>> while 2 and 4 are in flight, why should we wait for 2 and 4 to be fully
>>> copied? Why not start 1 and 3 in parallel with 2 and 4?
>>>
>>> 2. a notifier request is internally synchronous: if a notifier starts
>>> copying clusters 1,2,3,4, they will be copied one by one in a synchronous
>>> loop.
>>>
>>> 3. a notifier waits for the full copy of the corresponding clusters (when
>>> actually it only needs to wait for the _read_ operations to finish)
>>>
>>> In short, what is done:
>>> 1. fully async scheme
>>> 2. no intersecting requests
>>> 3. notifiers wait only for reads, not for writes
>>> 4. notifiers wait only for the corresponding clusters
>>> 5. time limit for notifiers
>>> 6. skip unallocated clusters in full mode
>>> 7. use HBitmap as the main backup bitmap and just init it from the dirty
>>> bitmap in the incremental case
>>> 8. retrying: do not re-read after a failed write
>>>
>>> # Intro
>>>
>>> Instead of sync copying + async notifiers as in the old backup, or aio
>>> requests as in mirror, this scheme just starts 24 workers - separate
>>> coroutines, each of them copying clusters synchronously. Copying is done
>>> one cluster at a time; there are no large requests.
>>> The only difference for clusters awaited by write notifiers is a higher
>>> priority. So notifiers do not start io requests themselves; they just
>>> mark some clusters as awaited and yield. Then, when some worker completes
>>> the read of the last cluster awaited by a notifier, it enters that
>>> notifier's coroutine.
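>>>
>>> Roughly, each worker looks like this (a sketch; get_work() is described
>>> below, and copy_cluster() just stands in for the real per-cluster copy
>>> code):
>>>
>>>     static void coroutine_fn backup_worker_co(void *opaque)
>>>     {
>>>         BackupBlockJob *job = opaque;
>>>         int64_t cluster;
>>>
>>>         /* one cluster per iteration, never a large request */
>>>         while ((cluster = get_work(job)) >= 0) {
>>>             copy_cluster(job, cluster);
>>>         }
>>>     }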
>>>
>>> # Some data structures
>>>
>>> Instead of done_bitmap there is copy_bitmap, as in mirror.
>>> HBitmap copy_bitmap
>>> Tracks exactly what should be copied:
>>> 0 - may mean one of three things:
>>>     - this cluster should not be copied at all
>>>     - this cluster is in flight
>>>     - this cluster is already copied
>>> 1 - means that the cluster should be copied but has not been touched yet
>>> (no async io exists for it)
>>>
>>> New bitmap: notif_wait_bitmap - not an HBitmap, just a Bitmap.
>>> Tracks exactly the in-flight clusters waiting for their read operation:
>>> 0 - may mean one of three things:
>>>     - this cluster should not be copied at all
>>>     - this cluster is in flight and has _already_ been read into memory
>>>     - this cluster is already copied
>>> 1 - means that the cluster is in flight, but the read operation has not
>>> finished yet
>>> The only exception is none mode: in this case 1 means in flight, in io
>>> read or write. This is needed for image fleecing.
>>>
>>> Cluster states (copy_bitmap, notif_wait_bitmap)
>>>
>>> 0, 0 - Ignored (should not be copied at all), or In flight (read done),
>>>        or Copied
>>> 0, 1 - In flight, read not finished (or write not finished, in none mode)
>>> 1, 0 - Should be copied, but not touched yet
>>> 1, 1 - Impossible state
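>>>
>>> For illustration, decoding a cluster's state from the two bits (the enum
>>> names are just for this example, not from the series):
>>>
>>>     #include <assert.h>
>>>     #include <stdbool.h>
>>>
>>>     typedef enum {
>>>         CLUSTER_DONE_OR_IGNORED, /* 0,0: ignored, read done, or copied */
>>>         CLUSTER_IN_FLIGHT_READ,  /* 0,1: read (or write, none mode) pending */
>>>         CLUSTER_TO_COPY,         /* 1,0: should be copied, untouched */
>>>     } ClusterState;
>>>
>>>     static ClusterState cluster_state(bool copy_bit, bool notif_wait_bit)
>>>     {
>>>         assert(!(copy_bit && notif_wait_bit)); /* 1,1 is impossible */
>>>         return copy_bit ? CLUSTER_TO_COPY
>>>                         : notif_wait_bit ? CLUSTER_IN_FLIGHT_READ
>>>                                          : CLUSTER_DONE_OR_IGNORED;
>>>     }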
>>>
>>> NotifierRequest - a request from a notifier; it changes the order in
>>> which workers copy clusters.
>>> NotifierRequest {
>>>     int64_t start;
>>>     int64_t end;
>>>     int nb_wait; /* nb clusters (in the specified range) that should be
>>>                     copied but are not yet read, i.e. clusters awaited by
>>>                     this notifier */
>>>     Coroutine *notif; /* corresponding notifier coroutine */
>>> }
>>>
>>> notifier_reqs - list of notifier requests
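>>>
>>> The interplay might look like this (a sketch: the QLIST link field and
>>> the count_unread_clusters() helper are assumed here, not in the series):
>>>
>>>     /* notifier side: register interest in [start, end) and sleep */
>>>     static void coroutine_fn notifier_wait(BackupBlockJob *job,
>>>                                            int64_t start, int64_t end)
>>>     {
>>>         NotifierRequest req = {
>>>             .start   = start,
>>>             .end     = end,
>>>             .nb_wait = count_unread_clusters(job, start, end),
>>>             .notif   = qemu_coroutine_self(),
>>>         };
>>>
>>>         if (req.nb_wait > 0) {
>>>             QLIST_INSERT_HEAD(&job->notifier_reqs, &req, link);
>>>             qemu_coroutine_yield(); /* a worker enters us later */
>>>             QLIST_REMOVE(&req, link);
>>>         }
>>>     }
>>>
>>>     /* worker side: called after the read of @cluster completes */
>>>     static void notify_readers(BackupBlockJob *job, int64_t cluster)
>>>     {
>>>         NotifierRequest *req, *tmp;
>>>
>>>         QLIST_FOREACH_SAFE(req, &job->notifier_reqs, link, tmp) {
>>>             if (cluster >= req->start && cluster < req->end &&
>>>                 --req->nb_wait == 0) {
>>>                 qemu_coroutine_enter(req->notif); /* last awaited read */
>>>             }
>>>         }
>>>     }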
>>>
>>> # More info
>>>
>>> At backup start copy_bitmap is initialized from sync_bitmap for
>>> incremental backup. For top/full backup it is initialized to all ones,
>>> but in parallel with the workers the main coroutine skips unallocated
>>> clusters.
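>>>
>>> Roughly (hbitmap_set/hbitmap_reset are the real util/hbitmap.c API; the
>>> two helpers are only illustrative):
>>>
>>>     if (job->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>>         init_copy_bitmap_from_sync_bitmap(job);
>>>     } else {
>>>         /* FULL/TOP: mark everything for copying... */
>>>         hbitmap_set(job->copy_bitmap, 0, job->nb_clusters);
>>>         /* ...then, in parallel with the workers, drop clusters that
>>>          * are not allocated in the source */
>>>         for (int64_t c = 0; c < job->nb_clusters; c++) {
>>>             if (!cluster_is_allocated(job, c)) {
>>>                 hbitmap_reset(job->copy_bitmap, c, 1);
>>>             }
>>>         }
>>>     }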
>>>
>>> Worker coroutines copy clusters, preferring those awaited by notifiers
>>> (those for which a NotifierRequest exists in the list). The function
>>> get_work helps them.
>>> Workers will copy clusters awaited by notifiers even if the block job is
>>> paused - the same behaviour as in the old architecture.
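>>>
>>> get_work could be sketched like this (illustrative; note it also performs
>>> the 1,0 -> 0,1 state transition from the table above):
>>>
>>>     static int64_t get_work(BackupBlockJob *job)
>>>     {
>>>         NotifierRequest *req;
>>>         int64_t cluster;
>>>
>>>         /* clusters awaited by a write notifier come first, even while
>>>          * the job is user-paused */
>>>         QLIST_FOREACH(req, &job->notifier_reqs, link) {
>>>             for (cluster = req->start; cluster < req->end; cluster++) {
>>>                 if (hbitmap_get(job->copy_bitmap, cluster)) {
>>>                     goto found;
>>>                 }
>>>             }
>>>         }
>>>
>>>         /* otherwise take the next cluster still marked for copying */
>>>         cluster = hbitmap_iter_next(&job->iter);
>>>         if (cluster < 0) {
>>>             return -1; /* nothing left: the worker exits */
>>>         }
>>>
>>>     found:
>>>         /* claim it: clear the copy bit, mark the read as pending */
>>>         hbitmap_reset(job->copy_bitmap, cluster, 1);
>>>         set_bit(cluster, job->notif_wait_bitmap);
>>>         return cluster;
>>>     }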
>>>
>>> The old backup fails a guest write if the notifier fails to back up the
>>> corresponding clusters. The new scheme differs a little: the notifier
>>> just waits for 5s, and if the backup can't copy all corresponding
>>> clusters in that time, the guest write fails.
>>> Error scenarios were discussed on the list; the final decision was to
>>> give the user the possibility to choose what should fail: the backup or
>>> the guest write. I'll add this later.
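>>>
>>> The 5s cap can be done with a timer that re-enters the sleeping notifier
>>> (a sketch only; it glosses over the race between worker wakeup and timer
>>> expiry, and ctx is the job's AioContext):
>>>
>>>     typedef struct NotifierTimeout {
>>>         Coroutine *co;
>>>         bool timed_out;
>>>     } NotifierTimeout;
>>>
>>>     static void notifier_timeout_cb(void *opaque)
>>>     {
>>>         NotifierTimeout *t = opaque;
>>>         t->timed_out = true;
>>>         qemu_coroutine_enter(t->co); /* wake the sleeping notifier */
>>>     }
>>>
>>>     /* in the notifier, instead of yielding unconditionally: */
>>>     NotifierTimeout t = { .co = qemu_coroutine_self() };
>>>     QEMUTimer *timer = aio_timer_new(ctx, QEMU_CLOCK_REALTIME, SCALE_MS,
>>>                                      notifier_timeout_cb, &t);
>>>     timer_mod(timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 5000);
>>>     qemu_coroutine_yield(); /* woken by a worker or by the timer */
>>>     timer_del(timer);
>>>     timer_free(timer);
>>>     if (t.timed_out) {
>>>         /* clusters were not copied in time: fail the guest write */
>>>     }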
>>>
>>> A worker can exit (no more clusters to copy, or a fatal error) or pause
>>> (error, user pause, or throttling). When the last worker goes to pause,
>>> it wakes up the main block-job coroutine, which will handle the user
>>> pause or the errors. We need to handle errors in the main coroutine
>>> because of the nature of block_job_error_action, which may yield.
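>>>
>>> The pause bookkeeping could look like this (paused_workers/nb_workers
>>> are assumed fields; job->common.co is the main block-job coroutine):
>>>
>>>     static void coroutine_fn worker_pause(BackupBlockJob *job)
>>>     {
>>>         if (++job->paused_workers == job->nb_workers) {
>>>             /* last worker to pause wakes the main coroutine, which can
>>>              * safely call block_job_error_action() and yield */
>>>             qemu_coroutine_enter(job->common.co);
>>>         }
>>>         qemu_coroutine_yield(); /* sleep until resumed */
>>>         job->paused_workers--;
>>>     }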
>>>
>>> There is also a bonus: a new io-retrying scheme: if there is an error on
>>> read or write, the worker just yields in the retrying loop, and if it is
>>> resumed (with job->error_exit = false) it continues from the same place.
>>> So if we have a failed write after a successful read, we will not re-read.
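>>>
>>> In a sketch (read_cluster/write_cluster and CLUSTER_SIZE stand in for
>>> the real code; job->error_exit is as above):
>>>
>>>     static int coroutine_fn copy_cluster(BackupBlockJob *job,
>>>                                          int64_t cluster)
>>>     {
>>>         uint8_t buf[CLUSTER_SIZE];
>>>         bool read_done = false;
>>>
>>>         for (;;) {
>>>             int ret = read_done ? write_cluster(job, cluster, buf)
>>>                                 : read_cluster(job, cluster, buf);
>>>             if (ret >= 0) {
>>>                 if (read_done) {
>>>                     return 0;     /* cluster fully copied */
>>>                 }
>>>                 read_done = true; /* keep buf: never re-read */
>>>                 continue;
>>>             }
>>>             qemu_coroutine_yield(); /* pause here on error */
>>>             if (job->error_exit) {
>>>                 return ret;       /* give up */
>>>             }
>>>             /* resumed with error_exit == false: retry the same op */
>>>         }
>>>     }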
>>>
>>> Vladimir Sementsov-Ogievskiy (21):
>>> backup: move from done_bitmap to copy_bitmap
>>> backup: init copy_bitmap from sync_bitmap for incremental
>>> backup: improve non-dirty bits progress processing
>>> backup: use copy_bitmap in incremental backup
>>> hbitmap: improve dirty iter
>>> backup: rewrite top mode cluster skipping
>>> backup: refactor: merge top/full/incremental backup code
>>> backup: skip unallocated clusters for full mode
>>> backup: separate copy function
>>> backup: refactor backup_copy_cluster()
>>> backup: move r/w error handling code to r/w functions
>>> iotests: add supported_cache_modes to main function
>>> coroutine: add qemu_coroutine_add_next
>>> block: add trace point on bdrv_close_all
>>> bitmap: add bitmap_count_between() function
>>> hbitmap: add hbitmap_count_between() function
>>> backup: make all reads not serializing
>>> backup: new async architecture
>>> backup: refactor backup_do_cow
>>> backup: move bitmap handling from backup_do_cow to get_work
>>> backup: refactor: remove backup_do_cow()
>>>
>>>  block.c                       |   1 +
>>>  block/backup.c                | 871 +++++++++++++++++++++++++++++++-----------
>>>  block/trace-events            |  34 +-
>>>  blockjob.c                    |  29 +-
>>>  include/block/blockjob.h      |  15 +-
>>>  include/qemu/bitmap.h         |   4 +
>>>  include/qemu/coroutine.h      |   2 +
>>>  include/qemu/hbitmap.h        |  26 +-
>>>  tests/qemu-iotests/055        |   4 +-
>>>  tests/qemu-iotests/129        |   6 +-
>>>  tests/qemu-iotests/iotests.py |   7 +-
>>>  util/bitmap.c                 |  27 ++
>>>  util/hbitmap.c                |  32 +-
>>>  util/qemu-coroutine.c         |   7 +
>>>  14 files changed, 805 insertions(+), 260 deletions(-)
>>>
>>> --
>>> 1.8.3.1
>>>
>
>