Re: [Qemu-devel] [PATCH 1/2 v7] block: add-cow file format

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 1/2 v7] block: add-cow file format

From:	Stefan Hajnoczi
Subject:	Re: [Qemu-devel] [PATCH 1/2 v7] block: add-cow file format
Date:	Thu, 8 Mar 2012 10:50:49 +0000

On Thu, Mar 8, 2012 at 2:27 AM, Dong Xu Wang <address@hidden> wrote:
> On Wed, Mar 7, 2012 at 00:59, Stefan Hajnoczi <address@hidden> wrote:
>> On Thu, Mar 1, 2012 at 2:56 AM, Dong Xu Wang <address@hidden> wrote:
>>> +static int add_cow_update_bitmap(BlockDriverState *bs, int64_t sector_num,
>>> +        int nb_sectors)
>>> +{
>>> +    BDRVAddCowState *s = bs->opaque;
>>> +    uint8_t *bitmap;
>>> +
>>> +    int i, ret = 0;
>>> +    for (i = 0; i < nb_sectors; i++) {
>>> +        int ret = add_cow_cache_get(bs, s->bitmap_cache,
>>> +            (sector_num + i) / 8 & ~(ADD_COW_CLUSTER_SIZE - 1), (void 
>>> **)&bitmap);
>>> +        if (ret < 0) {
>>> +            abort();
>>> +        }
>>> +        *(bitmap + ((sector_num + i) / 8 & (ADD_COW_CLUSTER_SIZE - 1))) |=
>>> +                    (1 << ((sector_num + i) % 8));
>>> +        add_cow_cache_entry_mark_dirty(s->bitmap_cache, bitmap);
>>> +
>>> +    }
>>> +    ret = add_cow_cache_flush(bs, s->bitmap_cache);
>>
>> Data must be flushed to the cow file before writing metadata updates,
>> otherwise a crash in between could result in zero sectors instead of
>> backing file sectors.
>>
> Sorry, what does it mean?  The correct steps shoud be:
> 1. flush to the image file
> 2. if 1 succeeds, flush to add-cow?
> That means metadata updates should in image  file flushing step, not
> in add_cow_co_writev?

You only need to flush when there are metadata updates.  So when the
guest writes to a sector that is already allocated in the image file,
no metadata updates are necessary and we don't need to flush.  The
guest is expect to issue a flush operation if it really wants to
flush.

However, there is a data integrity problem in the case where metadata
updates *are* required.  When we allocate new sectors in the image
file we need to make sure that they have been populated and are safely
on disk.  Otherwise we risk data corruption if the system crashes.
Here is the scenario:

1. Guest issues a write, this will require allocating new sectors in
the image file.  Therefore we'll need to update the bitmap in the
.add-cow file.
2. Write guest data buffer to image file.
3. Write bitmap update to .add-cow file.

If there is no flush between #2 and #3 then the write operations are
unordered.  #2 could appear before #3.  Or #3 could appear before #2.

I say "appear" because the write operation will complete in order, but
we're not guaranteed to "see" the data on disk after crash due to
volatile disk write caches or the host OS page cache (when -drive
...,cache=writeback).

If we want to ensure that #2 happens before #3 then we need to flush
before #3.  This will force all layers of caching to put the data
safely onto disk.

The reason *why* this is necessary is because we must avoid putting #3
on disk before #2.  If the system crashes and #2 never makes it to
disk but #3 did, then we will have allocated sectors in the image file
which are empty - they will be full of zeroes.  That is incorrect
because the guest either should see the data from the backing file if
the write didn't complete during the crash, or it should see the data
it was writing if the write did complete during the crash.

>>> +static int bdrv_add_cow_truncate(BlockDriverState *bs, int64_t offset)
>>> +{
>>> +    int ret = 0;
>>> +    int64_t image_sectors = offset / BDRV_SECTOR_SIZE;
>>> +    BDRVAddCowState *s = bs->opaque;
>>> +    int64_t old_image_sector = s->image_hd->total_sectors;
>>> +
>>> +    ret = bdrv_truncate(bs->file, sizeof(AddCowHeader) + ((image_sectors + 
>>> 7) >> 3));
>>> +    if (ret < 0) {
>>> +        bdrv_truncate(s->image_hd, old_image_sector * BDRV_SECTOR_SIZE);
>>> +        return ret;
>>> +    }
>>> +    return ret;
>>> +}
>>
>> I'm surprised that we truncate the image file...but only in the error case.
>>
>> Also, do we need to resize the cow cache?
>>
> cow cache should be resized, because the function will be called while 
> creating.
>
> Kevin gave comments in v2:
> "bs->file only contains metadata, so why is this correct? Shoudln't you
> update the header with the new size and truncate s->image_hd?"
>
> So I want to know should image_file have the same virtual size as the
> backing_file?

qcow2 and qed do not require backing file and image to have the same
size.  However, when you truncate it changes the size of the image
file.

I think it's good to have the flexibility of a smaller (or larger)
backing file.  Supporting this is useful because there are some cases,
like provisioning new VMs where it can be advantageous to take a
compact, small backing file and create a larger image from it which
might append a data partition at the end of the disk.

>> 2. Code that accesses the bitmap first divides by ADD_COW_CLUSTER_SIZE
>> to operate on an entire cluster.  Right now the lookup code seems to
>> do it *during* the bitmap lookup instead of before.
> Sorry, I do not understand this comment clearly. Could you explain more?

If you implement clusters (for example, 64 KB size) then each bit in
the bitmap covers 64 KB of data in the image file.  This is why I
expected the code to divide by cluster size before accessing the
bitmap - because we discard the information about which exact sector
we're accessing and only work at 64 KB granularity.

Stefan

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [Qemu-devel] [PATCH 1/2 v7] block: add-cow file format, Stefan Hajnoczi, 2012/03/06
- Re: [Qemu-devel] [PATCH 1/2 v7] block: add-cow file format, Dong Xu Wang, 2012/03/07
  - Re: [Qemu-devel] [PATCH 1/2 v7] block: add-cow file format, Stefan Hajnoczi <=

Prev by Date: [Qemu-devel] [PATCH] implement vnc_dpy_setdata
Next by Date: Re: [Qemu-devel] [PATCH 2/2] block: add-cow support snapshot_blkdev
Previous by thread: Re: [Qemu-devel] [PATCH 1/2 v7] block: add-cow file format
Next by thread: [Qemu-devel] Regression: more 0.12 regression (SeaBIOS related?)
Index(es):
- Date
- Thread