From: Eric Blake
Subject: Re: [Qemu-devel] [PATCH 18/21] qcow2: Add function for refcount order amendment
Date: Wed, 12 Nov 2014 06:50:12 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0

On 11/12/2014 02:55 AM, Max Reitz wrote:
> On 2014-11-12 at 05:15, Eric Blake wrote:
>> On 11/10/2014 06:45 AM, Max Reitz wrote:
>>> Add a function qcow2_change_refcount_order() which allows changing the
>>> refcount order of a qcow2 image.
>>>
>>> Signed-off-by: Max Reitz <address@hidden>
>>> ---
>>>   block/qcow2-refcount.c | 424 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>   block/qcow2.h          |   4 +
>>>   2 files changed, 428 insertions(+)
>> This is fairly big; you may want to get a second review from a
>> maintainer rather than blindly trusting me.
> 
> I didn't really see the point in splitting it up. Introducing the static
> helper functions first would not be very helpful either, so I thought
> about what I'd do as a reviewer, and that was to "apply the patch and just
> read through all the code". Splitting it into multiple patches would not have
> helped there (and I don't see how I could split this patch into logical
> changes, where at the start we have some inefficient but simple
> implementation and it gets better over time).

Yes, I agree that your patch is not divisible.  Big doesn't always mean
bad :)


>>> +                refcount = s->get_refcount(refblock, refblock_index);
>>> +                if (new_refcount_bits < 64 && refcount >> new_refcount_bits) {
>> Technically, this get_refcount() call is dead code on the second walk,
>> since the first walk already validated things, so you could push all of
>> this code...
>>
>>> +                    uint64_t offset;
>>> +
>>> +                    qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
>>> +
>>> +                    offset = ((reftable_index << s->refcount_block_bits)
>>> +                              + refblock_index) << s->cluster_bits;
>>> +
>>> +                    error_setg(errp, "Cannot decrease refcount entry width to "
>>> +                               "%i bits: Cluster at offset %#" PRIx64 " has a "
>>> +                               "refcount of %" PRIu64, new_refcount_bits,
>>> +                               offset, refcount);
>>> +                    return -EINVAL;
>>> +                }
>>> +
>>> +                if (new_set_refcount) {
>>> +                    new_set_refcount(new_refblock, new_refblock_index++, refcount);
>>> +                } else {
>> ...here, in the branch only run on the first walk.
> 
> Well, yes, but I wanted to keep this function as agnostic to what the
> caller wants to do with it as possible. I'd rather decide depending on
> whether index == 0 because that's a better way of discerning the first
> walk.

I was thinking a bit more about how to avoid the allocation corner case
bug, and wonder if three walks instead of 2 is the right solution, in
which case the first two walks are both allocations, and index==0 is no
longer a reliable witness of whether these checks are needed.  More on
that below.

> 
>>> +                    new_refblock_index++;
>>> +                }
>>> +                new_refblock_empty = new_refblock_empty && refcount == 0;
>> Worth condensing to 'new_refblock_empty &= !refcount'?  Maybe not.
> 
> I personally would find that harder to read.

No need to change it then.


>>> +                if (new_set_refcount) {
>>> +                    new_set_refcount(new_refblock, new_refblock_index++, 0);
>> Would it be worth guaranteeing that every new refblock is 0-initialized
>> when allocated, so that you can skip setting a refcount to 0?  This
>> question depends on the answer about block allocation asked at [4] above.
> 
> This function sets a value in the buffer new_refblock, not in the
> cluster on disk. Therefore, in order to be able to omit this call, we'd
> have to call a memset() with 0 on new_refblock after each call to
> operation(). I don't think it's worth it. This is more explicit and
> won't cost much performance.
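
(For reference, a minimal sketch of the rejected alternative, reusing
new_refblock and s->cluster_size from the patch; where exactly the call
would sit in the loop is an assumption:

    /* hypothetical variant: wipe the whole buffer before filling it, so
     * entries that are never written are already zero and the explicit
     * new_set_refcount(new_refblock, new_refblock_index++, 0) call in the
     * else branch could be dropped */
    memset(new_refblock, 0, s->cluster_size);

As noted above, the explicit store is clearer and costs next to nothing.)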

Yeah, that's about the same conclusion I came to after finishing the
whole review, although I didn't state it very well (this was one of my
earlier comments in my non-linear review that I didn't touch up later).


>>> +    /* The new_reftable_size is now valid and will not be changed anymore,
>>> +     * so we can now allocate the reftable */
>>> +    new_reftable_offset = qcow2_alloc_clusters(bs, new_reftable_size *
>>> +                                                   sizeof(uint64_t));
>> And here is your bug, that I hinted at with the mention of [3] above.
>> This allocation can potentially cause an overflow of the existing
>> reftable to occupy one more cluster.
> 
> An additional bug is that the new reftable may be referenced by an
> existing refblock which was completely empty, though (or at least
> referenced by part of an existing refblock which was to be turned into a
> new refblock which was then completely empty, and thus omitted from
> allocation).
> 
>> Remember my thought experiment
>> above, how an 8 megabyte image rolls from 1 to 2 clusters during the
>> course of allocating refblocks for the new table?  What if the original
>> image wasn't completely full, but is sized so perfectly that all of the
>> refblock allocations done during the first walk still fit in the free
>> clusters, and it is only this final allocation of the new reftable that
>> causes the rollover, at which point we've failed to account for the new
>> refblock size?  That is, I think I could craft an
>> image that would trigger either an assertion failure or an out-of-bounds
>> array access during the second walk.
> 
> You're completely right; this will be a pain to fix, though... The
> simplest way would probably be to check whether the new_reftable_size
> needs to be increased due to this operation and, if it does, rerun
> walk_over_reftable() with the alloc_refblock() function only allocating
> refblocks if the new reftable does not already point to a refblock for a
> certain index. This would be repeated until the new_reftable_size stays
> constant. And the simplest incarnation of this would be to have a flag
> indicating whether any allocations were done, and to repeat until none are.
> 
> Another way would be to somehow integrate the allocation of the new
> reftable into walk_over_reftable() and then to only mind the additional
> reftable entries.
> 
> But probably the first way is the correct one because due to
> reallocation of the old reftable, some intermediate refblocks which were
> empty before are now filled.
> 
> I'll have to craft some test images myself, not least to be able to
> include them in the iotest.
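
(A minimal sketch of the "repeat until nothing new gets allocated" approach
Max describes above; walk_over_reftable() and alloc_refblock() are names from
the patch, but the argument list and the allocated flag here are assumptions
made purely for illustration:

    bool allocated;

    do {
        allocated = false;
        /* alloc_refblock() would skip every index for which the new
         * reftable already points to a refblock, and set *allocated
         * whenever it does have to allocate one; each allocation may in
         * turn grow new_reftable_size */
        ret = walk_over_reftable(bs, &alloc_refblock, &allocated, errp);
        if (ret < 0) {
            return ret;
        }
    } while (allocated);

The loop terminates because each pass only fills in reftable entries that
are still empty, so eventually a pass allocates nothing.)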

Overnight, I had been thinking about it (I don't know if it's a good or
a bad thing when a patch is so mentally engaging that it becomes the
thing on my mind).  My initial idea was to teach the walk function how
to start and stop at given limits, maybe by passing an address to
dereference as the end point.  Then do something like:

original_limit = s->refcount_table_size;
walk allocations with limits of 0, &original_limit
allocate a contiguous block of limit+1 clusters for the new reftable
walk allocations with limits of original_limit, &s->refcount_table_size
if limit+1 is too big after all (most of the time), then free the tail
walk refcount assignment with limits of 0, &s->refcount_table_size
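
(Rendered as hedged C, that plan might look like the sketch below;
walk_allocations() and walk_assignments() are hypothetical stand-ins for a
limit-aware walk function, and extra_cluster_unused/reftable_bytes are made
up too -- only qcow2_alloc_clusters() and qcow2_free_clusters() are existing
helpers:

    uint64_t original_limit = s->refcount_table_size;

    /* walk 1: allocate new refblocks for every cluster that exists now */
    ret = walk_allocations(bs, 0, &original_limit, errp);
    if (ret < 0) {
        return ret;
    }

    /* reserve one cluster more than new_reftable_size currently needs, so
     * this allocation can never outgrow what has been accounted for */
    new_reftable_offset = qcow2_alloc_clusters(bs,
            new_reftable_size * sizeof(uint64_t) + s->cluster_size);

    /* walk 2: cover only the clusters added by walk 1 and by the reftable
     * allocation above */
    ret = walk_allocations(bs, original_limit, &s->refcount_table_size, errp);
    if (ret < 0) {
        return ret;
    }

    /* if the extra cluster turns out to be unneeded (the common case),
     * give it back */
    if (extra_cluster_unused) {
        qcow2_free_clusters(bs, new_reftable_offset + reftable_bytes,
                            s->cluster_size, QCOW2_DISCARD_OTHER);
    }

    /* walk 3: write the actual refcounts into the new refblocks */
    ret = walk_assignments(bs, 0, &s->refcount_table_size, errp);

The point is only to show where each step would go, not to pin down the real
interfaces.)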

Allocating limit+1 clusters for the new reftable should always be safe.  In
the majority of cases it overallocates; but since freeing the original
reftable will already leave several small holes in the image, it won't hurt
to have one more cluster-sized hole.  It's better to have the algorithm be
safe and waste a few clusters than to figure out how to pack everything into
the most efficient space possible at the expense of more code complexity.
And maybe someday we'll implement a cluster defragmenter for packing an
image down until it has no holes.  In the minority of cases where the extra
space is actually needed, the +1 should be sufficient to cover any
additional allocation spillover.

But you raise a good point: the original image may have holes during the
first walk, and allocating refblocks for the new table can fill those holes
with clusters whose refcounts must then be picked up.  So I think you are
right that you have to walk the ENTIRE image on the second pass, rather than
just the tail of the image beyond where the first pass stopped early.

Good luck on coming up with a plan to tackle it.

>> Interesting patch.  Hope my review helps you prepare a better v2.
> 
> If everything else fails, I'll just split the amend stuff from this
> series. But I'll work it out somehow. And your review will definitely
> help, thanks a lot!

Glad to hear it.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
