qemu-devel

Re: [Qemu-devel] [PATCH v5 1/3] qcow2: Add qcow2_shrink_l1_and_l2_table


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v5 1/3] qcow2: Add qcow2_shrink_l1_and_l2_table for qcow2 shrinking
Date: Fri, 21 Nov 2014 11:56:13 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0

On 2014-10-26 at 16:20, Jun Li wrote:
This patch implements the new function qcow2_shrink_l1_and_l2_table.
This function shrinks/discards the L1 and L2 tables when shrinking a qcow2 image.

Signed-off-by: Jun Li <address@hidden>
---
v5:
   Made some modifications based on Max's suggestions. Thanks, Max.
   In v5, L2 shrinking is done first, then L1 shrinking, in the function
qcow2_shrink_l1_and_l2_table. As L1 shrinking needs to allocate some clusters
for the new L1 table, in v5 it can re-use the freed clusters that come from L2
shrinking.

v4:
   Handle COW clusters in the L2 table. When COW is used, some values of
(l2_entry >> s->cluster_bits) will be larger than s->refcount_table_size, so
these L2 entries need to be discarded.

v3:
   Fixed host cluster leak.
---
  block/qcow2-cluster.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++++
  block/qcow2.c         |  37 +++++++++-
  block/qcow2.h         |   2 +
  3 files changed, 218 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 4d888c7..28d2d62 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -29,6 +29,9 @@
  #include "block/qcow2.h"
  #include "trace.h"
+static int l2_load(BlockDriverState *bs, uint64_t l2_offset,
+                   uint64_t **l2_table);
+
  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                          bool exact_size)
  {
@@ -135,6 +138,185 @@ int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
      return ret;
  }
+int qcow2_shrink_l1_and_l2_table(BlockDriverState *bs, uint64_t new_l1_size,
+                                 int new_l2_index, int64_t boundary_size)
+{
+    BDRVQcowState *s = bs->opaque;
+    int new_l1_size2, ret, i;

gcc gives me a warning (which is made an error by -Werror) here due to uninitialized use of "ret". "ret" is not necessarily set in the while () loop and there are some "goto fail"s afterwards which do not set it.

+    uint64_t *new_l1_table;
+    int64_t new_l1_table_offset;
+    int64_t old_l1_table_offset, old_l1_size;
+    uint8_t data[12];
+    uint64_t l2_offset;
+    uint64_t *l2_table, l2_entry;
+    int64_t l2_free_entry; /* The entry of l2 table need to free from */
+    uint64_t *old_l1_table = s->l1_table;
+    int num = s->l1_size - new_l1_size;
+
+    assert(new_l1_size <= s->l1_size);
+    while ((num >= -1) && (s->l1_size + num - 1 >= 0)) {
+        l2_free_entry = 0;
+        l2_offset = old_l1_table[s->l1_size + num - 1] & L1E_OFFSET_MASK;
+
+        if (l2_offset == 0) {
+            goto retry;
+        }
+
+        if (num == 0) {
+            if (new_l2_index == 0) {
+                goto retry;
+            }
+            l2_free_entry = new_l2_index;
+        }
+
+        /* load l2_table into cache */
+        ret = l2_load(bs, l2_offset, &l2_table);
+
+        if (ret < 0) {
+            return ret;
+        }
+
+        for (i = s->l2_size - 1; i >= 0; i--) {
+            l2_entry = be64_to_cpu(l2_table[i]);

The & L2E_OFFSET_MASK is missing here. Without it, OFLAG_COPIED will break this.

+
+            /* Due to COW, the clusters in l2 table will
+             * not in sequential order, so there will be
+             * some l2_entry >= boundary_size when perform shrinking.
+             */
+            if (num == -1) {
+                if (l2_entry >= boundary_size) {

boundary_size is set to "offset" by the caller. "offset" is the guest disk size. l2_entry is (or should be) a host offset. They are not comparable.

+                    goto free_cluster;
+                } else {
+                    continue;
+                }
+            }
+
+            /* Deal with COW clusters in l2 table when num == 0 */
+            if (i <= l2_free_entry - 1) {
+                if (l2_entry >= boundary_size) {
+                    goto free_cluster;
+                }
+                continue;
+            }
+
+            switch (qcow2_get_cluster_type(l2_entry)) {
+            case QCOW2_CLUSTER_UNALLOCATED:
+                if (!bs->backing_hd) {
+                    continue;
+                }
+                break;
+
+            case QCOW2_CLUSTER_ZERO:
+                continue;
+
+            case QCOW2_CLUSTER_NORMAL:
+            case QCOW2_CLUSTER_COMPRESSED:
+                break;
+
+            default:
+                abort();
+            }
+
+        free_cluster:
+            qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
+
+            if (s->qcow_version >= 3) {
+                l2_table[i] = cpu_to_be64(QCOW_OFLAG_ZERO);
+            } else {
+                l2_table[i] = cpu_to_be64(0);
+            }
+
+            /* Then decrease the refcount */
+            qcow2_free_any_clusters(bs, l2_entry, 1, QCOW2_DISCARD_MAX);
+        }
+
+        ret = qcow2_cache_put(bs, s->l2_table_cache, (void **) &l2_table);
+        if (ret < 0) {
+            return ret;
+        }
+        if (l2_free_entry == 0 && num != -1) {
+            qemu_vfree(l2_table);
+            qcow2_free_clusters(bs, l2_offset, s->cluster_size - 1,
+                                QCOW2_DISCARD_OTHER);

You should probably flush the L2 cache before you do that.

+        }
+    retry:
+        num--;
+    }
+
+    new_l1_size2 = sizeof(uint64_t) * new_l1_size;
+    new_l1_table = qemu_try_blockalign(bs->file,
+                                       align_offset(new_l1_size2, 512));
+    if (new_l1_table == NULL) {
+        return -ENOMEM;
+    }
+    memset(new_l1_table, 0, align_offset(new_l1_size2, 512));
+
+    /* shrinking l1 table */
+    memcpy(new_l1_table, s->l1_table, new_l1_size2);
+
+    /* write new table (align to cluster) */
+    new_l1_table_offset = qcow2_alloc_clusters(bs, new_l1_size2);
+
+    if ((new_l1_table_offset) >= boundary_size) {
+        goto fail;
+    }
+
+    ret = qcow2_cache_flush(bs, s->refcount_block_cache);
+    if (ret < 0) {
+        goto fail;
+    }

You don't need to flush the refblock cache.

+
+    /* the L1 position has not yet been updated, so these clusters must
+     * indeed be completely free */
+    ret = qcow2_pre_write_overlap_check(bs, 0, new_l1_table_offset,
+                                        new_l1_size2);
+
+    if (ret < 0) {
+        goto fail;
+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
+
+    for (i = 0; i < new_l1_size; i++) {
+        new_l1_table[i] = cpu_to_be64(new_l1_table[i]);
+    }
+
+    ret = bdrv_pwrite_sync(bs->file, new_l1_table_offset,
+                           new_l1_table, new_l1_size2);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    for (i = 0; i < new_l1_size; i++) {
+        new_l1_table[i] = be64_to_cpu(new_l1_table[i]);
+    }
+
+    /* set new table */
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ACTIVATE_TABLE);
+    cpu_to_be32w((uint32_t *)data, new_l1_size);
+    stq_be_p(data + 4, new_l1_table_offset);
+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
+                           data, sizeof(data));
+    if (ret < 0) {
+        goto fail;
+    }

I understand your intent with writing a new L1 table. But I don't see the need. With 64 kB clusters, a 64 kB L1 table can serve an image with 4 TB guest size (8192 L2 tables, each having 8192 entries). I don't think we'll gain a lot from shrinking the L1 table. It's too much effort.

+
+    qemu_vfree(s->l1_table);
+    old_l1_table_offset = s->l1_table_offset;
+    s->l1_table_offset = new_l1_table_offset;
+    s->l1_table = new_l1_table;
+    old_l1_size = s->l1_size;
+    s->l1_size = new_l1_size;
+    qcow2_free_clusters(bs, old_l1_table_offset, old_l1_size * sizeof(uint64_t),
+                        QCOW2_DISCARD_OTHER);
+    return 0;
+ fail:
+    qemu_vfree(new_l1_table);
+    qcow2_free_clusters(bs, new_l1_table_offset, new_l1_size2,
+                        QCOW2_DISCARD_OTHER);
+    return ret;
+}
+
  /*
   * l2_load
   *
diff --git a/block/qcow2.c b/block/qcow2.c
index d031515..d2b0dfe 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2111,10 +2111,41 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
          return -ENOTSUP;
      }
- /* shrinking is currently not supported */
+    /* shrinking image */
      if (offset < bs->total_sectors * 512) {
-        error_report("qcow2 doesn't support shrinking images yet");
-        return -ENOTSUP;
+        /* As l1 table, l2 table, refcount table, refcount block table
+         * and file header of the qcow2 image need to use some clusters,
+         * so should subtract these metadata from offset.
+         */
+        int64_t nb_l1 = DIV_ROUND_UP((uint64_t)s->l1_size * sizeof(uint64_t),
+                                     s->cluster_size);
+        int64_t nb_l2 = DIV_ROUND_UP(offset, (uint64_t)s->l2_size <<
+                                     s->cluster_bits);
+        int64_t nb_refcount_block_table = DIV_ROUND_UP(offset, (uint64_t)
+                                                       s->cluster_size <<
+                                                       s->refcount_block_bits);
+        int64_t nb_refcount_table = DIV_ROUND_UP(nb_refcount_block_table << 3,
+                                                 s->cluster_size);
+        int64_t total_nb = 2 * nb_l2 + nb_l1 + nb_refcount_block_table +
+                           nb_refcount_table + 1;
+        int64_t offset_for_shrink = offset - (total_nb << s->cluster_bits);

Okay, what does total_nb have to do with offset?

total_nb is some host value. offset is a guest value. Don't mix them.

+        int new_l2_index = offset_to_l2_index(s, offset_for_shrink);
+
+        new_l1_size = size_to_l1(s, offset_for_shrink);
+        ret = qcow2_shrink_l1_and_l2_table(bs, new_l1_size, new_l2_index,
+                                           offset);
+        if (ret < 0) {
+            return ret;
+        }
+
+        int64_t actual_size = bdrv_get_allocated_file_size(bs);
+
+        if (offset < actual_size) {
+            ret = bdrv_truncate(bs->file, offset);
+            if (ret < 0) {
+                return ret;
+            }
+        }

actual_size is not some offset you can use. It's the number of bytes allocated in the file. The file can contain holes. Therefore, comparing an offset against actual_size is wrong. Moreover, offset is the guest disk size and actual_size is the host disk size; they are not comparable.

Second, why are you truncating the file to offset if offset is smaller than actual_size? That is always wrong. You are blindly assuming:
(1) that the image consists of only data and no metadata, because you're using the guest disk size (offset) to shrink the host file;
(2) that all that data is tightly packed at the beginning of the file.

(1) is obviously wrong. (2) is wrong as well, because clusters may be spread all over the host file. We would need defragmentation to solve that issue.

Therefore, to shorten the host file, you'd need to walk through the image and find the highest *host* cluster actually in use. Then you can truncate to after that cluster. However, I'd strongly advise against that because it doesn't bring any actual benefit.

Today's file systems support holes. Just discard all the clusters which get freed during shrinking, that will free up all that space. No need to defragment, no need to truncate. Yes, the file will look larger than it actually is, but nobody should care. We will need defragmentation to tackle that issue.

      }
new_l1_size = size_to_l1(s, offset);
diff --git a/block/qcow2.h b/block/qcow2.h
index 577ccd1..be1237d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -516,6 +516,8 @@ int qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t offset,
  /* qcow2-cluster.c functions */
  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                          bool exact_size);
+int qcow2_shrink_l1_and_l2_table(BlockDriverState *bs, uint64_t new_l1_size,
+                                 int new_l2_index, int64_t boundary_size);
  int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
  void qcow2_l2_cache_reset(BlockDriverState *bs);
  int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset);

So, as for what I think we do need to do when shrinking (and keep in mind: The offset given to qcow2_truncate() is the guest size! NOT the host image size!):

(1) Determine the first L2 table and the first entry in the table which will lie beyond the new guest disk size.
(2) Discard all clusters beginning from there.
(3) Discard all L2 tables which are then completely empty.
(4) Update the header size.

And that's it. We can either speed up step (2) by implementing it manually, or we just use bdrv_discard() on the qcow2 BDS; in the simplest case: bdrv_discard(bs, DIV_ROUND_UP(offset, BDRV_SECTOR_SIZE), bs->total_sectors - DIV_ROUND_UP(offset, BDRV_SECTOR_SIZE));

We can incorporate step (3) by extending qcow2_discard_clusters() to free L2 tables when they are empty after discard_single_l2(). But we don't even have to do that now; it's an optimization we can go about later.

So, we can do (1), (2) and (3) in a single step: Just one bdrv_discard() call. But it's probably better to use qcow2_discard_clusters() instead and set the full_discard parameter to true.

So: qcow2_discard_clusters(bs, offset, bs->total_sectors - offset / BDRV_SECTOR_SIZE, true);. Then update the guest disk size field in the header. And we're done.

There are four problems with this approach:
- qcow2_discard_clusters() might be slower than optimal. I personally don't care at all.
- If "bs->total_sectors * BDRV_SECTOR_SIZE - offset" is greater than INT_MAX, this won't work. Trivially solvable by encapsulating the qcow2_discard_clusters() call in a loop which limits nb_clusters to INT_MAX / BDRV_SECTOR_SIZE.
- The L1 table is not resized. Should not matter in practice at all.
- The file is not truncated. Does not matter either (because holes in the file are good enough), and we can't actually solve this problem without defragmentation anyway.

There is one advantage:
- It's extremely simple. It's literally below ten lines of code.

I think the advantage far outweighs the disadvantages. But I may be wrong. What do you think?

Max


