Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
Date: Thu, 23 Oct 2014 10:41:13 +0200
User-agent: Mutt/1.5.21 (2010-09-15)
On 2014-10-23 at 10:36, Max Reitz wrote:
> On 2014-10-23 at 10:29, Kevin Wolf wrote:
> >On 2014-10-23 at 09:46, Max Reitz wrote:
> >>On 2014-10-22 at 20:35, Kevin Wolf wrote:
> >>>On 2014-10-22 at 14:51, Max Reitz wrote:
> >>>>bdrv_make_empty() is currently only called if the current image
> >>>>represents an external snapshot that has been committed to its base
> >>>>image; it is therefore unlikely to have internal snapshots. In this
> >>>>case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
> >>>>refcount table (while having the dirty flag set, which only works for
> >>>>compat=1.1) and creating a trivial refcount structure.
> >>>>
> >>>>If there are snapshots or for compat=0.10, fall back to the simple
> >>>>implementation (discard all clusters).
> >>>>
> >>>>Signed-off-by: Max Reitz <address@hidden>
> >>>Hey, this feels actually reviewable this time. :-)
> >>I'm still unsure which version I like more. If it wasn't for the
> >>math, I'd prefer the other version.
> >>
> >>>>diff --git a/block/blkdebug.c b/block/blkdebug.c
> >>>>index e046b92..862d93b 100644
> >>>>--- a/block/blkdebug.c
> >>>>+++ b/block/blkdebug.c
> >>>>@@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
> >>>> [BLKDBG_PWRITEV] = "pwritev",
> >>>> [BLKDBG_PWRITEV_ZERO] = "pwritev_zero",
> >>>> [BLKDBG_PWRITEV_DONE] = "pwritev_done",
> >>>>+
> >>>>+ [BLKDBG_EMPTY_IMAGE_PREPARE] = "empty_image_prepare",
> >>>> };
> >>>> static int get_event_by_name(const char *name, BlkDebugEvent *event)
> >>>>diff --git a/block/qcow2.c b/block/qcow2.c
> >>>>index 1ef3a5f..16dece2 100644
> >>>>--- a/block/qcow2.c
> >>>>+++ b/block/qcow2.c
> >>>>@@ -2232,24 +2232,137 @@ fail:
> >>>> static int qcow2_make_empty(BlockDriverState *bs)
> >>>> {
> >>>>+ BDRVQcowState *s = bs->opaque;
> >>>> int ret = 0;
> >>>>- uint64_t start_sector;
> >>>>- int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> >>>>- for (start_sector = 0; start_sector < bs->total_sectors;
> >>>>- start_sector += sector_step)
> >>>>- {
> >>>>-    /* As this function is generally used after committing an external
> >>>>-     * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> >>>>-     * default action for this kind of discard is to pass the discard,
> >>>>-     * which will ideally result in an actually smaller image file, as
> >>>>-     * is probably desired. */
> >>>>-    ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> >>>>-                                 MIN(sector_step,
> >>>>-                                     bs->total_sectors - start_sector),
> >>>>-                                 QCOW2_DISCARD_SNAPSHOT, true);
> >>>>+ if (s->snapshots || s->qcow_version < 3) {
> >>>>+ uint64_t start_sector;
> >>>>+ int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> >>>>+
> >>>>+        /* If there are snapshots, every active cluster has to be
> >>>>+         * discarded; and because compat=0.10 does not support setting
> >>>>+         * the dirty flag, we have to use this fallback there as well */
> >>>>+
> >>>>+ for (start_sector = 0; start_sector < bs->total_sectors;
> >>>>+ start_sector += sector_step)
> >>>>+ {
> >>>>+            /* As this function is generally used after committing an
> >>>>+             * external snapshot, QCOW2_DISCARD_SNAPSHOT seems
> >>>>+             * appropriate. Also, the default action for this kind of
> >>>>+             * discard is to pass the discard, which will ideally
> >>>>+             * result in an actually smaller image file, as is probably
> >>>>+             * desired. */
> >>>>+            ret = qcow2_discard_clusters(bs,
> >>>>+                                         start_sector * BDRV_SECTOR_SIZE,
> >>>>+                                         MIN(sector_step,
> >>>>+                                             bs->total_sectors -
> >>>>+                                             start_sector),
> >>>>+                                         QCOW2_DISCARD_SNAPSHOT, true);
> >>>>+ if (ret < 0) {
> >>>>+ break;
> >>>>+ }
> >>>>+ }
> >>>My first thought was to add a return here, so the indentation level for
> >>>the rest is one less. Probably a matter of taste, though.
> >>I'd rather put the second branch into its own function.
> >Works for me.
> >
> >>>>+ } else {
> >>>>+ int l1_clusters;
> >>>>+ int64_t offset;
> >>>>+ uint64_t *new_reftable;
> >>>>+ uint64_t rt_entry;
> >>>>+ struct {
> >>>>+ uint64_t l1_offset;
> >>>>+ uint64_t reftable_offset;
> >>>>+ uint32_t reftable_clusters;
> >>>>+ } QEMU_PACKED l1_ofs_rt_ofs_cls;
> >>>>+
> >>>>+ ret = qcow2_cache_empty(bs, s->l2_table_cache);
> >>>> if (ret < 0) {
> >>>>- break;
> >>>>+ return ret;
> >>>>+ }
> >>>>+
> >>>>+ ret = qcow2_cache_empty(bs, s->refcount_block_cache);
> >>>>+ if (ret < 0) {
> >>>>+ return ret;
> >>>>+ }
> >>>>+
> >>>>+ /* Refcounts will be broken utterly */
> >>>>+ ret = qcow2_mark_dirty(bs);
> >>>>+ if (ret < 0) {
> >>>>+ return ret;
> >>>>+ }
> >>>>+
> >>>>+ l1_clusters = DIV_ROUND_UP(s->l1_size,
> >>>>+ s->cluster_size / sizeof(uint64_t));
> >>>>+        new_reftable = g_try_new0(uint64_t,
> >>>>+                                  s->cluster_size / sizeof(uint64_t));
> >>>>+ if (!new_reftable) {
> >>>>+ return -ENOMEM;
> >>>>+ }
> >>>>+
> >>>>+ BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
> >>>Until here, the failure cases are trivially okay. The worst thing that
> >>>could happen is that the image is needlessly marked as dirty.
> >>>
> >>>>+        /* Overwrite enough clusters at the beginning of the sectors to
> >>>>+         * place the refcount table, a refcount block and the L1 table
> >>>>+         * in; this may overwrite parts of the existing refcount and L1
> >>>>+         * table, which is not an issue because the dirty flag is set,
> >>>>+         * complete data loss is in fact desired and partial data loss
> >>>>+         * is consequently fine as well */
> >>>>+        ret = bdrv_write_zeroes(bs->file,
> >>>>+                                s->cluster_size / BDRV_SECTOR_SIZE,
> >>>>+                                (2 + l1_clusters) * s->cluster_size /
> >>>>+                                BDRV_SECTOR_SIZE, 0);
> >>>If we crash at this point, we're _not_ okay any more. --verbose follows:
> >>>
> >>>On disk, we may have overwritten a refcount table or a refcount block
> >>>with zeros. This is fine, we have the dirty flag set, so destroying any
> >>>refcounting structure does no harm.
> >>>
> >>>We may also have overwritten an L1 or L2 table. As the commit correctly
> >>>explains, this is doing partially what the function is supposed to do
> >>>for the whole image. Affected data clusters are now read from the
> >>>backing file. Good.
> >>>
> >>>However, we may also have overwritten data clusters that are still
> >>>accessible using an L1/L2 table that hasn't been hit by this write
> >>>operation. We're reading corrupted (zeroed out) data now instead of
> >>>going to the backing file. Bug!
> >>Oh, right, I forgot about the L1 table not always being at the start
> >>of the file.
> >>
> >>>In my original suggestion I had an item where the L1 table was zeroed
> >>>out first before the start of the image is zeroed. This would have
> >>>avoided the bug.
> >>>
> >>>>+ if (ret < 0) {
> >>>>+ g_free(new_reftable);
> >>>>+ return ret;
> >>>>+ }
> >>>If we fail here (without crashing), the first clusters could be in their
> >>>original state or partially zeroed. Assuming that you fixed the above
> >>>bug, the on-disk state would be okay if we opened the image now because
> >>>the dirty flag would trigger an image repair; but we don't get the
> >>>repair when taking this failure path and we may have zeroed a refcount
> >>>table/block. This is probably a problem and we may have to make the BDS
> >>>unusable.
> >>>
> >>>The in-memory state of the L1 table is hopefully zeroed out, so it's
> >>>consistent with what is on disk.
> >>>
> >>>The in-memory state of the refcount table looks like it's not in sync
> >>>with the on-disk state. Note that while the dirty flag allows that the
> >>>on-disk state can be anything, the in-memory state is what we keep using
> >>>after a failure. The in-memory state isn't accurate at this point, but
> >>>we only create leaks. Lots of them, because we zeroed the L1 table, but
> >>>that's good enough. If refcounts are updated later, the old offsets
> >>>should still be valid.
> >>If we set at least parts of the in-memory reftable to zero,
> >>everything probably breaks. Try to allocate a new cluster while the
> >>beginning of the reftable is zero. So we cannot take the on-disk
> >>reftable into memory.
> >>
> >>Doing it the other way around, writing the in-memory reftable to
> >>disk on error won't work either. The refblocks may have been zeroed
> >>out, so we have exactly the same problem.
> >>
> >>Therefore, to make the BDS usable after error, we have to (in the
> >>error path) read the on-disk reftable into memory, call the qcow2
> >>repair function and hope for the best.
> >>
> >>Or, you know, we could go back to v11 which had my other version of
> >>this patch which always kept everything consistent. :-P
> >I'm not sure how important it is to keep the BDS usable after such an
> >unlikely error.
> >
> >I started to write up a suggestion that we could do without
> >qcow2_alloc_clusters(), but just build up the first refcount block
> >ourselves. Doesn't really help us, because the new state is not what we
> >need here, but it made me aware that it assumes that the L1 table is
> >small enough to fit in the first refcount block and we would have to
> >fall back to discard if it isn't. Come to think of it, I suspect you
> >already make the same assumption without checking it.
> >
> >
> >Anyway, if we want to keep the BDS usable what is the right state?
> >We need references for the header, for the L1 table, for the refcount
> >table and all refcount blocks. If we decide to build a new in-memory
> >refcount table, the refcount blocks are only the refcount blocks that
> >are still referenced there, i.e. we don't have to worry about the
> >location of refcount blocks on the disk.
> >
> >In fact, we can also safely change the refcount table offset to
> >cluster 1 and handle it together with the header. We'll then need one
> >refcount block for header/reftable and potentially another one for the
> >L1 table, which still must be in its original place (adding the
> >assumption that the L1 table doesn't cross a refblock boundary :-/).
> >
> >Our refblock cache can hold 4 tables at least, so creating two refblocks
> >purely in memory is doable.
> >
> >Leaves the question, is it worth the hassle?
>
> Probably not. Would you be fine with setting bs->drv to NULL in the
> problematic error paths?
By calling qcow2_signal_corruption(), right? I think that's fine.
What about the assumption that 3 + l1_clusters fits in one refcount block?
Do we need to check it even now?
Kevin