From: Kevin Wolf
Subject: Re: [Qemu-devel] [RFC] qcow2: 2 way to improve performance updating refcount
Date: Fri, 22 Jul 2011 11:30:59 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110621 Fedora/3.1.11-1.fc15 Thunderbird/3.1.11

On 22.07.2011 11:13, Frediano Ziglio wrote:
> 2011/7/22 Kevin Wolf <address@hidden>:
>> On 21.07.2011 18:17, Frediano Ziglio wrote:
>>> Hi,
>>>   after a snapshot is taken, many write operations are currently
>>> quite slow due to:
>>> - refcount updates (decrement old and increment new)
>>> - cluster allocation and file expansion
>>> - read-modify-write on partial clusters
>>>
>>> I found 2 ways to improve refcount performance:
>>>
>>> Method 1 - Lazy count
>>> Mainly, do not take the current snapshot into account in the
>>> refcounts; that is, the current snapshot counts as 0. This would
>>> require adding a current_snapshot field to the header and updating
>>> refcounts when the current snapshot changes. So for these operations:
>>> - creating a snapshot: performance is the same, just increment for
>>> the old snapshot instead of the new one
>>> - normal write operations: as the current snapshot counts as 0, there
>>> is nothing to do here, so no refcount data is written
>>> - changing the current snapshot: this is the worst case, you have to
>>> increment for the current snapshot and decrement for the new one, so
>>> it will take twice as long
>>> - deleting a snapshot: if it is the current one, just set
>>> current_snapshot to a dummy non-existing value; if it is not the
>>> current one, just decrement the counters, no performance change
>>
>> How would you do cluster allocation if you no longer have refcounts
>> that can tell you whether a cluster is used or not?
>>
> 
> You still have refcounts, it's only that the current snapshot counts
> as 0. An example may help: start with a snapshot "A". A counts as
> zero, so all refcounts are 0. Now we create a snapshot "B" and make it
> current, so refcounts are 1:
> 
> A --- B
> 
> If you change a cluster in snapshot "B", counts are still 1. If you go
> back to "A", counters are incremented (because you leave B) and then
> decremented (because you enter A).
> 
> Perhaps the problem is how to distinguish 0 ("allocated in current")
> from "not allocated". Yes, with what I propose above that is a
> problem, but we can easily use -1 for "not allocated". If the cluster
> belongs to the current snapshot and its refcount is 0, mark it as -1;
> if it does not, we would have to increment the counters of the current
> snapshot, mark the current one as -1 and then decrement for deletion;
> yes, in this case it takes twice the time.

Yes, this is the problem that I meant. If you use -1 for not allocated,
you're back to our current situation, just with refcount - 1 for each
cluster. In particular, you now need to update refcounts again on writes
(in order to change from -1 to 0).
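
To make that concrete, here is a minimal sketch of the -1 scheme (the
names are made up, this is not actual qcow2 code):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define UNALLOCATED ((int64_t)-1)

    /* The stored refcount excludes the current snapshot. */
    static bool cluster_is_free(int64_t stored)
    {
        return stored == UNALLOCATED;
    }

    /* Allocating a cluster for the current snapshot still costs a
     * metadata update: the table entry changes from -1 to 0. */
    static int64_t allocate_for_current(int64_t stored)
    {
        assert(cluster_is_free(stored));
        return 0;
    }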

>>> Method 2 - Read-only parent
>>> Here parents are read-only; instead of storing a refcount, store a
>>> numeric id of the owning snapshot. If the owner is not the current
>>> snapshot, copy the cluster and change the copy. Consider this
>>> situation:
>>>
>>> A --- B --- C
>>>
>>> B cannot be changed, so in order to "change" B you have to create a
>>> new snapshot:
>>>
>>> A --- B --- C
>>>          \--- D
>>>
>>> and change D. It can take more space because in this case you have an
>>> additional snapshot.
>>>
>>> Operations:
>>> - creating a snapshot: really fast, as you don't have to change any
>>> ownership
>>> - normal write operations: if the owner is not the current snapshot,
>>> allocate a new cluster and store the current owner for the new
>>> cluster. Ownership of past-the-end clusters could also all be set to
>>> the current owner in order to collapse allocations
>>> - changing the current snapshot: no changes required for owners
>>> - deleting a snapshot: only possible if it has no child or a single
>>> child; it will require scanning all L2 tables, merging them and
>>> updating owners
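>>>
>>> Roughly, the write path check could look like this (the struct and
>>> helper names are invented just for illustration):
>>>
>>>     #include <stdint.h>
>>>
>>>     struct image {
>>>         uint64_t current_owner;   /* id of the active snapshot */
>>>     };
>>>
>>>     /* Assumed primitives, declarations only. */
>>>     uint64_t alloc_cluster(struct image *img);
>>>     void copy_cluster(struct image *img, uint64_t from, uint64_t to);
>>>
>>>     /* Return the cluster to write to, copying on write when it is
>>>      * owned by a read-only parent snapshot. */
>>>     static uint64_t cluster_for_write(struct image *img,
>>>                                       uint64_t cluster, uint64_t owner)
>>>     {
>>>         if (owner == img->current_owner) {
>>>             return cluster;       /* already ours, write in place */
>>>         }
>>>         uint64_t copy = alloc_cluster(img);
>>>         copy_cluster(img, cluster, copy);
>>>         return copy;
>>>     }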
>>
>> I think this has similar characteristics to what we have with external
>> snapshots (i.e. backing files). The advantage of applying it to
>> internal snapshots is that when deleting a snapshot you don't have to
>> copy around all the data.
>>
>> Probably this change could even be done transparently for the user, so
>> that B still appears to be writeable, but in fact refers to D now.
>>
>>
>> Anyway, have you checked how bad the refcount work really is? I think
>> that writing the VM state takes a lot longer, so that optimising the
>> refcount update may be the wrong approach, especially if it requires a
>> format change. My results with qemu-img snapshot suggest that it's not
>> worth it:
>>
>> address@hidden:~/images$ ~/source/qemu/qemu-img info scratch.qcow2
>> image: scratch.qcow2
>> file format: qcow2
>> virtual size: 8.0G (8589934592 bytes)
>> disk size: 4.0G
>> cluster_size: 65536
>> address@hidden:~/images$ time ~/source/qemu/qemu-img snapshot -c test
>> scratch.qcow2
>>
>> real    0m0.116s
>> user    0m0.009s
>> sys     0m0.040s
>> address@hidden:~/images$ time ~/source/qemu/qemu-img snapshot -d test
>> scratch.qcow2
>>
>> real    0m0.084s
>> user    0m0.011s
>> sys     0m0.044s
>>
>> Kevin
> 
> I'm not worried about the time it takes to create a snapshot, but
> rather about normal use after a snapshot has been taken. As you
> stated, while taking a snapshot you can disable writethrough caching,
> making it very fast, but during normal operations you can't.

Well, the obvious solution is not using writethrough in this case. You
need it only for some broken guest OSes.

The other solution is adding a dirty flag which says that the refcounts
on disk may not be accurate and must be rebuilt after a crash. In this
case you can run the metadata cache in writeback mode even with
cache=writethrough. This dirty flag is included in my proposal for
qcow2v3.
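
For illustration, the open-time check could be as simple as this (the
field and flag names are made up, not necessarily what qcow2v3 will
use):

    #include <stdbool.h>
    #include <stdint.h>

    #define QCOW2_DIRTY_FLAG (1ull << 0)   /* hypothetical feature bit */

    struct qcow2_header_v3 {
        uint64_t incompatible_features;
    };

    /* If the image wasn't cleanly shut down, the on-disk refcounts may
     * be stale and must be rebuilt before the image is used. */
    static bool needs_refcount_rebuild(const struct qcow2_header_v3 *h)
    {
        return h->incompatible_features & QCOW2_DIRTY_FLAG;
    }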

> Personally I'm pondering a log too, to allow collapsing metadata
> updates. Possibly even an external (separate file) full log (including
> data), to try to reduce the overhead caused by read-modify-write
> during partial cluster updates and to reduce file fragmentation. But
> as you can see from my patches, I'm still getting familiar with the
> QEMU code.

A journal is something to consider, yes. It's something that requires
some development effort, but long term I think it could provide some
nice advantages. I'm not sure if using it for the full data will help,
but for metadata it would certainly make sense.
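
For illustration, a metadata journal entry would not need to contain
much more than this (the layout is purely hypothetical):

    #include <stdint.h>

    /* Hypothetical on-disk journal record: replayed after a crash,
     * discarded once the metadata it describes has hit the disk. */
    struct qcow2_journal_entry {
        uint32_t magic;      /* identifies a valid entry */
        uint32_t checksum;   /* guards against torn writes */
        uint64_t offset;     /* where the metadata update belongs */
        uint32_t length;     /* size of the payload that follows */
    } __attribute__((packed));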

Kevin


