qemu-devel

Re: [Qemu-devel] [PATCH v5 1/3] qcow2: Add qcow2_shrink_l1_and_l2_table


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v5 1/3] qcow2: Add qcow2_shrink_l1_and_l2_table for qcow2 shrinking
Date: Fri, 21 Nov 2014 11:56:13 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0

On 2014-10-26 at 16:20, Jun Li wrote:
This patch implements the new function qcow2_shrink_l1_and_l2_table.
This function shrinks/discards the L1 and L2 tables when shrinking a qcow2 image.

Signed-off-by: Jun Li <address@hidden>
---
v5:
   Made some modifications based on Max's suggestions. Thanks, Max.
   In v5, L2 shrinking is done first, then L1 shrinking, in the function
qcow2_shrink_l1_and_l2_table. As L1 shrinking needs to allocate some clusters
for the new L1 table, in v5 it can re-use the freed clusters that come from L2
shrinking.

v4:
   Handle COW clusters in the L2 table. When COW is used, some values of
(l2_entry >> s->cluster_bits) will be larger than s->refcount_table_size, so
these L2 entries need to be discarded.

v3:
   Fixed host cluster leak.
---
  block/qcow2-cluster.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++++
  block/qcow2.c         |  37 +++++++++-
  block/qcow2.h         |   2 +
  3 files changed, 218 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 4d888c7..28d2d62 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -29,6 +29,9 @@
  #include "block/qcow2.h"
  #include "trace.h"
+static int l2_load(BlockDriverState *bs, uint64_t l2_offset,
+                   uint64_t **l2_table);
+
  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                          bool exact_size)
  {
@@ -135,6 +138,185 @@ int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
      return ret;
  }
+int qcow2_shrink_l1_and_l2_table(BlockDriverState *bs, uint64_t new_l1_size,
+                                 int new_l2_index, int64_t boundary_size)
+{
+    BDRVQcowState *s = bs->opaque;
+    int new_l1_size2, ret, i;

gcc gives me a warning (which is made an error by -Werror) here due to uninitialized use of "ret". "ret" is not necessarily set in the while () loop and there are some "goto fail"s afterwards which do not set it.

+    uint64_t *new_l1_table;
+    int64_t new_l1_table_offset;
+    int64_t old_l1_table_offset, old_l1_size;
+    uint8_t data[12];
+    uint64_t l2_offset;
+    uint64_t *l2_table, l2_entry;
+    int64_t l2_free_entry; /* The entry of l2 table need to free from */
+    uint64_t *old_l1_table = s->l1_table;
+    int num = s->l1_size - new_l1_size;
+
+    assert(new_l1_size <= s->l1_size);
+    while ((num >= -1) && (s->l1_size + num - 1 >= 0)) {
+        l2_free_entry = 0;
+        l2_offset = old_l1_table[s->l1_size + num - 1] & L1E_OFFSET_MASK;
+
+        if (l2_offset == 0) {
+            goto retry;
+        }
+
+        if (num == 0) {
+            if (new_l2_index == 0) {
+                goto retry;
+            }
+            l2_free_entry = new_l2_index;
+        }
+
+        /* load l2_table into cache */
+        ret = l2_load(bs, l2_offset, &l2_table);
+
+        if (ret < 0) {
+            return ret;
+        }
+
+        for (i = s->l2_size - 1; i >= 0; i--) {
+            l2_entry = be64_to_cpu(l2_table[i]);

The & L2E_OFFSET_MASK is missing here. Without it, OFLAG_COPIED will break this.

+
+            /* Due to COW, the clusters in l2 table will
+             * not in sequential order, so there will be
+             * some l2_entry >= boundary_size when perform shrinking.
+             */
+            if (num == -1) {
+                if (l2_entry >= boundary_size) {

boundary_size is set to "offset" by the caller. "offset" is the guest disk size. l2_entry is (or should be) a host offset. They are not comparable.

+                    goto free_cluster;
+                } else {
+                    continue;
+                }
+            }
+
+            /* Deal with COW clusters in l2 table when num == 0 */
+            if (i <= l2_free_entry - 1) {
+                if (l2_entry >= boundary_size) {
+                    goto free_cluster;
+                }
+                continue;
+            }
+
+            switch (qcow2_get_cluster_type(l2_entry)) {
+            case QCOW2_CLUSTER_UNALLOCATED:
+                if (!bs->backing_hd) {
+                    continue;
+                }
+                break;
+
+            case QCOW2_CLUSTER_ZERO:
+                continue;
+
+            case QCOW2_CLUSTER_NORMAL:
+            case QCOW2_CLUSTER_COMPRESSED:
+                break;
+
+            default:
+                abort();
+            }
+
+        free_cluster:
+            qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
+
+            if (s->qcow_version >= 3) {
+                l2_table[i] = cpu_to_be64(QCOW_OFLAG_ZERO);
+            } else {
+                l2_table[i] = cpu_to_be64(0);
+            }
+
+            /* Then decrease the refcount */
+            qcow2_free_any_clusters(bs, l2_entry, 1, QCOW2_DISCARD_MAX);
+        }
+
+        ret = qcow2_cache_put(bs, s->l2_table_cache, (void **) &l2_table);
+        if (ret < 0) {
+            return ret;
+        }
+        if (l2_free_entry == 0 && num != -1) {
+            qemu_vfree(l2_table);
+            qcow2_free_clusters(bs, l2_offset, s->cluster_size - 1,
+                                QCOW2_DISCARD_OTHER);

You should probably flush the L2 cache before you do that.

+        }
+    retry:
+        num--;
+    }
+
+    new_l1_size2 = sizeof(uint64_t) * new_l1_size;
+    new_l1_table = qemu_try_blockalign(bs->file,
+                                       align_offset(new_l1_size2, 512));
+    if (new_l1_table == NULL) {
+        return -ENOMEM;
+    }
+    memset(new_l1_table, 0, align_offset(new_l1_size2, 512));
+
+    /* shrinking l1 table */
+    memcpy(new_l1_table, s->l1_table, new_l1_size2);
+
+    /* write new table (align to cluster) */
+    new_l1_table_offset = qcow2_alloc_clusters(bs, new_l1_size2);
+
+    if ((new_l1_table_offset) >= boundary_size) {
+        goto fail;
+    }
+
+    ret = qcow2_cache_flush(bs, s->refcount_block_cache);
+    if (ret < 0) {
+        goto fail;
+    }

You don't need to flush the refblock cache.

+
+    /* the L1 position has not yet been updated, so these clusters must
+     * indeed be completely free */
+    ret = qcow2_pre_write_overlap_check(bs, 0, new_l1_table_offset,
+                                        new_l1_size2);
+
+    if (ret < 0) {
+        goto fail;
+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
+
+    for (i = 0; i < new_l1_size; i++) {
+        new_l1_table[i] = cpu_to_be64(new_l1_table[i]);
+    }
+
+    ret = bdrv_pwrite_sync(bs->file, new_l1_table_offset,
+                           new_l1_table, new_l1_size2);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    for (i = 0; i < new_l1_size; i++) {
+        new_l1_table[i] = be64_to_cpu(new_l1_table[i]);
+    }
+
+    /* set new table */
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ACTIVATE_TABLE);
+    cpu_to_be32w((uint32_t *)data, new_l1_size);
+    stq_be_p(data + 4, new_l1_table_offset);
+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
+                           data, sizeof(data));
+    if (ret < 0) {
+        goto fail;
+    }

I understand your intent with writing a new L1 table. But I don't see the need. With 64 kB clusters, a 64 kB L1 table can serve an image with 4 TB guest size (8192 L2 tables, each having 8192 entries). I don't think we'll gain a lot from shrinking the L1 table. It's too much effort.

+
+    qemu_vfree(s->l1_table);
+    old_l1_table_offset = s->l1_table_offset;
+    s->l1_table_offset = new_l1_table_offset;
+    s->l1_table = new_l1_table;
+    old_l1_size = s->l1_size;
+    s->l1_size = new_l1_size;
+    qcow2_free_clusters(bs, old_l1_table_offset, old_l1_size * sizeof(uint64_t),
+                        QCOW2_DISCARD_OTHER);
+    return 0;
+ fail:
+    qemu_vfree(new_l1_table);
+    qcow2_free_clusters(bs, new_l1_table_offset, new_l1_size2,
+                        QCOW2_DISCARD_OTHER);
+    return ret;
+}
+
  /*
   * l2_load
   *
diff --git a/block/qcow2.c b/block/qcow2.c
index d031515..d2b0dfe 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2111,10 +2111,41 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
          return -ENOTSUP;
      }
- /* shrinking is currently not supported */
+    /* shrinking image */
      if (offset < bs->total_sectors * 512) {
-        error_report("qcow2 doesn't support shrinking images yet");
-        return -ENOTSUP;
+        /* As l1 table, l2 table, refcount table, refcount block table
+         * and file header of the qcow2 image need to use some clusters,
+         * so should subtract these metadata from offset.
+         */
+        int64_t nb_l1 = DIV_ROUND_UP((uint64_t)s->l1_size * sizeof(uint64_t),
+                                     s->cluster_size);
+        int64_t nb_l2 = DIV_ROUND_UP(offset, (uint64_t)s->l2_size <<
+                                     s->cluster_bits);
+        int64_t nb_refcount_block_table = DIV_ROUND_UP(offset, (uint64_t)
+                                                       s->cluster_size <<
+                                                       s->refcount_block_bits);
+        int64_t nb_refcount_table = DIV_ROUND_UP(nb_refcount_block_table << 3,
+                                                 s->cluster_size);
+        int64_t total_nb = 2 * nb_l2 + nb_l1 + nb_refcount_block_table +
+                           nb_refcount_table + 1;
+        int64_t offset_for_shrink = offset - (total_nb << s->cluster_bits);

Okay, what does total_nb have to do with offset?

total_nb is some host value. offset is a guest value. Don't mix them.

+        int new_l2_index = offset_to_l2_index(s, offset_for_shrink);
+
+        new_l1_size = size_to_l1(s, offset_for_shrink);
+        ret = qcow2_shrink_l1_and_l2_table(bs, new_l1_size, new_l2_index,
+                                           offset);
+        if (ret < 0) {
+            return ret;
+        }
+
+        int64_t actual_size = bdrv_get_allocated_file_size(bs);
+
+        if (offset < actual_size) {
+            ret = bdrv_truncate(bs->file, offset);
+            if (ret < 0) {
+                return ret;
+            }
+        }

actual_size is not some offset you can use. It's the number of bytes allocated in the file. The file can contain holes. Therefore, comparing an offset against actual_size is wrong. Moreover, offset is the guest disk size and actual_size is the host disk size; they are not comparable.

Second, why are you truncating the file to offset if offset is smaller than actual_size? That is always wrong. You are blindly assuming:
(1) that the image consists of only data and no metadata, because you're using the guest disk size (offset) to shrink the host file;
(2) that all that data is tightly packed at the beginning of the file.

(1) is obviously wrong. (2) is wrong as well, because clusters may be spread all over the host file. We would need defragmentation to solve that issue.

Therefore, to shorten the host file, you'd need to walk through the image and find the highest *host* cluster actually in use. Then you can truncate to after that cluster. However, I'd strongly advise against that because it doesn't bring any actual benefit.

Today's file systems support holes. Just discard all the clusters which get freed during shrinking, that will free up all that space. No need to defragment, no need to truncate. Yes, the file will look larger than it actually is, but nobody should care. We will need defragmentation to tackle that issue.

      }
new_l1_size = size_to_l1(s, offset);
diff --git a/block/qcow2.h b/block/qcow2.h
index 577ccd1..be1237d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -516,6 +516,8 @@ int qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t offset,
  /* qcow2-cluster.c functions */
  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                          bool exact_size);
+int qcow2_shrink_l1_and_l2_table(BlockDriverState *bs, uint64_t new_l1_size,
+                                 int new_l2_index, int64_t boundary_size);
  int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
  void qcow2_l2_cache_reset(BlockDriverState *bs);
  int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset);

So, as for what I think we do need to do when shrinking (and keep in mind: The offset given to qcow2_truncate() is the guest size! NOT the host image size!):

(1) Determine the first L2 table and the first entry in the table which will lie beyond the new guest disk size.
(2) Discard all clusters beginning from there.
(3) Discard all L2 tables which are then completely empty.
(4) Update the header size.

And that's it. We can either speed up step (2) by implementing it manually, or we just use bdrv_discard() on the qcow2 BDS; in the simplest case: bdrv_discard(bs, DIV_ROUND_UP(offset, BDRV_SECTOR_SIZE), bs->total_sectors - DIV_ROUND_UP(offset, BDRV_SECTOR_SIZE));

We can incorporate step (3) by extending qcow2_discard_clusters() to free L2 tables when they are empty after discard_single_l2(). But we don't even have to do that now; it's an optimization we can go about later.

So, we can do (1), (2) and (3) in a single step: Just one bdrv_discard() call. But it's probably better to use qcow2_discard_clusters() instead and set the full_discard parameter to true.

So: qcow2_discard_clusters(bs, offset, bs->total_sectors - offset / BDRV_SECTOR_SIZE, true);. Then update the guest disk size field in the header. And we're done.

There are four problems with this approach:
- qcow2_discard_clusters() might be slower than optimal. I personally don't care at all.
- If "bs->total_sectors * BDRV_SECTOR_SIZE - offset" is greater than INT_MAX, this won't work. Trivially solvable by encapsulating the qcow2_discard_clusters() call in a loop which limits nb_clusters to INT_MAX / BDRV_SECTOR_SIZE.
- The L1 table is not resized. Should not matter in practice at all.
- The file is not truncated. Does not matter either (because holes in the file are good enough), and we can't actually solve this problem without defragmentation anyway.

There is one advantage:
- It's extremely simple. It's literally below ten lines of code.

I think the advantage far outweighs the disadvantages. But I may be wrong. What do you think?

Max


