Re: [Qemu-devel] [PATCH 1/2 v8] add-cow file format

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 1/2 v8] add-cow file format

From:	Stefan Hajnoczi
Subject:	Re: [Qemu-devel] [PATCH 1/2 v8] add-cow file format
Date:	Wed, 25 Apr 2012 17:09:47 +0100

On Wed, Apr 18, 2012 at 9:18 AM, Dong Xu Wang
<address@hidden> wrote:

QEMU normally does not cache data, only metadata.  There are a couple
of reasons:

1. The host page cache already provides this, we don't need to
duplicate this in QEMU.  The user can decide whether to cache data in
host memory or not using the cache=writeback|writethrough vs
cache=none|directsync options.

2. It increases the risk of data loss since anything in QEMU memory is
likely to be lost in a crash or power failure.  When we use the host
page cache at least guest data is not lost when the QEMU process
crashes.

One consequence of caching 8 64 KB data clusters is that the cache
cannot effectively reduce metadata I/O without increasing QEMU's
memory footprint a lot.  Since a cache entry only holds 8 bits of the
COW bitmap we probably have a lot of seeks required to access bitmap
metadata.  We could increase the cache size but each cache entry will
take more than 8 * 64 KB = 512 KB.

For these reasons I suggest only caching the COW bitmap.  Each entry
can cache 64 KB of the COW bitmap, that means 64 KB bitmap * 8 bits *
64 KB data bytes = allocation information for a 32 GB region of the
disk.  Since a single cache entry covers 32 GB of the disk we can
expect pretty good c

> +static coroutine_fn int add_cow_co_writev(BlockDriverState *bs,
> +        int64_t sector_num, int remaining_sectors, QEMUIOVector *qiov)
> +{
> +    BDRVAddCowState *s = bs->opaque;
> +    int ret = 0;
> +    int cur_nr_sectors;
> +    QEMUIOVector hd_qiov;
> +    uint64_t bytes_done = 0;
> +    uint8_t *table;
> +    uint8_t bitmap;
> +    int64_t index;
> +    int i;
> +    uint8_t *cluster_data = NULL;
> +    qemu_co_mutex_lock(&s->lock);
> +    qemu_iovec_init(&hd_qiov, qiov->niov);
> +    while (remaining_sectors != 0) {
> +        index = sector_num & 1023;

Please use a constant to explain what this calculation does instead of 1023.

> +        cur_nr_sectors = MIN(remaining_sectors,
> +            (sector_num | 1023) - sector_num + 1);

A clearer expression would be: MIN(remaining_sectors,
SECTORS_PER_CLUSTER - index)

I just invented SECTORS_PER_CLUSTER but you should be able to define
something like that if you don't have it already.

> +
> +        ret = add_cow_cache_get(bs, s->bitmap_cache,
> +            sector_num & ~1023, &bitmap, (void **)&table);
> +        if (ret < 0) {
> +            goto fail;
> +        }

Here I wonder why add_cow_cache_get() doesn't allow us to pass in
sector_num (unmodified) and tells us whether or not this COW bit is
set.  Instead it fills in a uint8_t that we need to index into, which
requires every caller to duplicate the bitmap indexing code.

An even better strategy is to use bdrv_co_is_allocated() because it
tells you how many contiguous clusters are allocated.  If you're lucky
and they are all allocated/unallocated you can handle the entire I/O
request in one host I/O operation.  It's more efficient than doing
things 1 cluster at a time.

> +
> +        cluster_data = qemu_blockalign(bs, BDRV_SECTOR_SIZE * 
> cur_nr_sectors);
> +        qemu_iovec_reset(&hd_qiov);
> +        qemu_iovec_copy(&hd_qiov, qiov, bytes_done,
> +            cur_nr_sectors * BDRV_SECTOR_SIZE);
> +        qemu_iovec_to_buffer(&hd_qiov, cluster_data);
> +
> +        memcpy(table + index * BDRV_SECTOR_SIZE,
> +            cluster_data,
> +            BDRV_SECTOR_SIZE * cur_nr_sectors);
> +        for (i = index / 128;

Please use a constant.  It should be clear what 128 means (I guess 128
x 512 byte sectors = 64 KB cluster size).

> +            i <= (index + cur_nr_sectors - 1) / 128;
> +            i++) {
> +                bitmap |= 1 << i;
> +        }
> +        add_cow_cache_entry_mark_dirty(s->bitmap_cache,
> +            bitmap,
> +            table);

It seems you always mark this cached COW bitmap entry dirty, even when
the cluster was already allocated.  When you switch to a pure metadata
cache this will cause unnecessary I/O when flushing or evicting cache
entries.  We should only mark the table dirty if a COW bit
transitioned from 0 -> 1.

> +        remaining_sectors -= cur_nr_sectors;
> +        sector_num += cur_nr_sectors;
> +        bytes_done += cur_nr_sectors * BDRV_SECTOR_SIZE;
> +    }
> +    ret = 0;
> +fail:
> +    ret = add_cow_cache_flush(bs, s->bitmap_cache);
> +    if (ret < 0) {
> +        goto fail;
> +    }

There are two options with metadata flushing:

1. If BDRV_O_CACHE_WB is set then we should follow the rule that
metadata updates are buffered in memory (for speed).  Data writes are
issued immediately (we don't buffer them because we don't want to use
too much memory for guest data).  If a flush is needed, then we must
first ensure all data is written and flushed to the .raw file.  Then
we write out metadata and flush it.

2. If not BDRV_O_CACHE_WB then we need to write out data to the .raw
file, flush the .raw file, and then write out metadata.

Note that if no COW bits transitioned from 0 -> 1 then we have no
metadata updates and can simply perform the data writes to the .raw
file!

> +    qemu_co_mutex_unlock(&s->lock);
> +    qemu_vfree(cluster_data);

It seems like cluster_data gets leaked since the allocation is inside
the loop and may happen multiple times but we only free once.

> +    qemu_iovec_destroy(&hd_qiov);
> +    return ret;
> +}
> +
> +static int bdrv_add_cow_truncate(BlockDriverState *bs, int64_t offset)
> +{
> +    return bdrv_truncate(bs->file,
> +        sizeof(AddCowHeader) + ((offset / BDRV_SECTOR_SIZE + 1023) >> 10));

This calculation is unclear to me.  I think this is saying that each
COW bit covers 64 KB of image data.  Please rewrite the expression
using constants from add-cow.h.  Don't precompute parts of the
expression, the compiler will do that for you and it's more important
to show where this calculation comes from to the reader of the code.

I think the .raw file should also be truncated.  Since add_cow_open()
uses the .raw file size we need to keep the .raw file correctly sized
at all times.

> +=Specification=
> +
> +The file format looks like this:
> +
> + +---------------+--------------------------+
> + |     Header    |           Data           |
> + +---------------+--------------------------+

'Metadata' or even 'COW bitmap' is clearer than 'Data'.  No guest data
is stored in the .add-cow file, only metadata.

[Prev in Thread]

Current Thread

[Next in Thread]

[Qemu-devel] [PATCH 1/2 v8] add-cow file format, Dong Xu Wang, 2012/04/18
- [Qemu-devel] [PATCH 2/2 v8] add-cow: support snapshot_blkdev, Dong Xu Wang, 2012/04/18
- Re: [Qemu-devel] [PATCH 1/2 v8] add-cow file format, Stefan Hajnoczi <=

Prev by Date: [Qemu-devel] [PATCH v4 3/5] block: change block-job-set-speed argument from 'value' to 'speed'
Next by Date: [Qemu-devel] [PATCH] softfloat: float32_to_float16() should do inexact instead of underflow for rounding case
Previous by thread: [Qemu-devel] [PATCH 2/2 v8] add-cow: support snapshot_blkdev
Next by thread: [Qemu-devel] [PATCH] qxl-render: fix broken vnc+spice since commit f934493
Index(es):
- Date
- Thread