
Re: [Qemu-devel] [PATCH v8 03/14] qcow2: Optimize bdrv_make_empty()


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v8 03/14] qcow2: Optimize bdrv_make_empty()
Date: Thu, 10 Jul 2014 01:23:12 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

On 30.06.2014 13:33, Kevin Wolf wrote:
> On 07.06.2014 at 20:51, Max Reitz wrote:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots. In this
>> case, bdrv_make_empty() can be greatly sped up by creating an empty L1
>> table and dropping all data clusters at once by recreating the refcount
>> structure accordingly, instead of discarding every cluster individually.
>>
>> If there are snapshots, fall back to the simple implementation (discard
>> all clusters).
>>
>> Signed-off-by: Max Reitz <address@hidden>
>> Reviewed-by: Eric Blake <address@hidden>
> This approach looks a bit too complicated to me, and calculating the
> required metadata size seems error-prone.
>
> How about this:
>
> 1. Set the dirty flag in the header so we can mess with the L1 table
>     without keeping the refcounts consistent
>
> 2. Overwrite the L1 table with zeros
>
> 3. Overwrite the first n clusters after the header with zeros
>     (n = 2 + l1_clusters).
>
> 4. Update the header:
>     refcount_table_offset = cluster_size
>     refcount_table_clusters = 1
>     l1_table_offset = 3 * cluster_size
>
> 6. bdrv_truncate to n + 1 clusters
>
> 7. Now update the first 8 bytes at cluster_size (the first new refcount
>     table entry) to point to 2 * cluster_size (new refcount block)
>
> 8. Reset refcount block and L2 cache
>
> 9. Allocate n + 1 clusters (the header, too) and make sure you get
>     offset 0
>
> 10. Remove the dirty flag

Okay, after some fixing and getting it to work, I noticed what seems to me a rather big problem: if something bad happens between steps 3 and 7 (especially between 4 and 7), the image cannot be repaired. The reason is that the refcount table is empty and a new refcount block cannot be allocated, because the consistency checks correctly signal an overlap with the refcount table (I would have expected the image header instead, but well...); nothing is allocated, so the first cluster offset returned by an allocation will probably be zero (the image header) or cluster_size (where the reftable resides).

So I think we absolutely have to make sure that whenever refcount_table_offset is changed on disk, the reftable it points to already contains a valid offset. We could pull step 7 before step 4, but then we would have to guarantee that step 3 did not already overwrite the reftable (which it probably does). Maybe we could change step 3 so that it checks whether the reftable is already part of that area, and if it is, overwrite its first entry not with zero but with 2 * cluster_size (unless the reftable itself resides at 2 * cluster_size, in which case we would have to take some other offset). Then we could either try to write a new reftable anyway, or just place everything behind that old reftable and ignore the "lost" space.

In any case, I doubt it will be much shorter overall once these additional checks are added. The current code has 340 LOC with extremely verbose comments; my new code (which fails to address the problem described above) has 100 LOC without any comments.

So I guess the main issue is how *complicated* the code actually is. In my opinion, the most complicated and hardest-to-review piece of code in this patch (patch v8 3/14) is minimal_blob_size(), which I think we will need in one form or another eventually anyway. create_refcount_l1() is pretty long, but thanks to the comments it should be easy to follow.

In any case, I still have the code for your proposal here and I'd be absolutely fine with continuing to work on it. So if you think it's worth it anyway (I don't have a strong opinion either way), I'll continue.

Max


