Re: [Qemu-devel] [RFC] qcow2: 2 way to improve performance updating refcount

From: Kevin Wolf
Date: Fri, 22 Jul 2011 11:30:59 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110621 Fedora/3.1.11-1.fc15 Thunderbird/3.1.11
On 22.07.2011 11:13, Frediano Ziglio wrote:
> 2011/7/22 Kevin Wolf <address@hidden>:
>> On 21.07.2011 18:17, Frediano Ziglio wrote:
>>> Hi,
>>> after a snapshot is taken, many write operations are currently
>>> quite slow due to:
>>> - refcount updates (decrement old and increment new)
>>> - cluster allocation and file expansion
>>> - read-modify-write on partial clusters
>>>
>>> I found two ways to improve refcount performance:
>>>
>>> Method 1 - Lazy count
>>> The main idea is to not account for the current snapshot, that is,
>>> the current snapshot counts as 0. This would require adding a
>>> current_snapshot field to the header and updating refcounts when
>>> the current snapshot changes. So for these operations (see the
>>> sketch after this list):
>>> - creating a snapshot: performance is the same, just increment for
>>> the old snapshot instead of the new one
>>> - normal write operations: as the current snapshot counts as 0,
>>> there is nothing to do here, so no refcount data is written
>>> - changing the current snapshot: this is the worst case, you have
>>> to increment for the current snapshot and decrement for the new
>>> one, so it takes twice as long
>>> - deleting a snapshot: if it is the current one, just set
>>> current_snapshot to a dummy non-existing value; if it is not the
>>> current one, just decrement the counters, no performance change
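A minimal C sketch of the bookkeeping Method 1 implies; the names
(struct lazy_refcounts, effective_refcount, used_by_current) are
hypothetical, not actual qcow2 code:

#include <stdbool.h>
#include <stdint.h>

struct lazy_refcounts {
    uint16_t *table;            /* on-disk counts, current snapshot excluded */
    uint64_t current_snapshot;  /* would be a new qcow2 header field */
};

/* The effective count is the stored count plus one if the cluster is
 * used by the current snapshot. Note that deciding used_by_current
 * without a stored count is exactly the open question raised below. */
static uint16_t effective_refcount(const struct lazy_refcounts *r,
                                   uint64_t cluster, bool used_by_current)
{
    return (uint16_t)(r->table[cluster] + (used_by_current ? 1 : 0));
}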
>>
>> How would you do cluster allocation if you don't have refcounts any more
>> that can tell you if a cluster is used or not?
>>
>
> You still have refcounts, it's just that the current snapshot counts
> as 0. An example may help: start with a snapshot "A". Snapshot A
> counts as zero, so all refcounts are 0. Now we create a snapshot "B"
> and make it current, so the refcounts are 1.
>
> A --- B
>
> If you change a cluster in snapshot "B", the counts are still 1. If
> you go back to "A", the counters are incremented (because you leave
> B) and then decremented (because you enter A).
>
> Perhaps the problem is how to distinguish 0 meaning "allocated in
> current" from "not allocated". Yes, with what I propose above that is
> a problem, but we can easily use -1 for "not allocated". If a cluster
> belongs to the current snapshot and has refcount 0, mark it as -1; if
> not, we would have to increment the counters of the current snapshot,
> mark the current one as -1, and then decrement for the deletion, so
> yes, in this case it takes twice the time.
Yes, this is the problem that I meant. If you use -1 for not allocated,
you're back to our current situation, just with refcount - 1 for each
cluster. In particular, you now need to update refcounts again on writes
(in order to change from -1 to 0).
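To make that cost concrete, a hypothetical allocation path under the
-1 scheme (UNALLOCATED and alloc_cluster_lazy are made-up names):

#include <stdint.h>

#define UNALLOCATED ((int16_t)-1)   /* hypothetical sentinel for free clusters */

/* Even though the current snapshot "counts as 0", allocating a cluster
 * still has to persist the -1 -> 0 transition: one metadata write per
 * allocation, the same cost as today's 0 -> 1 transition. */
static void alloc_cluster_lazy(int16_t *refcounts, uint64_t cluster)
{
    refcounts[cluster] = 0;   /* was UNALLOCATED; must be flushed to disk */
}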
>>> Method 2 - Read-only parent
>>> Here parents are read-only; instead of storing a refcount, store a
>>> numeric id of the owner. If the owner is not the current snapshot,
>>> copy the cluster and change the copy. Consider this situation:
>>>
>>> A --- B --- C
>>>
>>> B cannot be changed, so in order to "change" B you have to create a
>>> new snapshot
>>>
>>> A --- B --- C
>>>        \--- D
>>>
>>> and change D. It can take more space because in this case you have
>>> an additional snapshot.
>>>
>>> Operations (see the sketch after this list):
>>> - creating a snapshot: really fast, as you don't have to change any
>>> ownership
>>> - normal write operations: if the owner is not the current one,
>>> allocate a new cluster and just store the current owner for the new
>>> cluster. The ownership of past-the-end clusters could also all be
>>> set to the current owner in order to collapse allocations
>>> - changing the current snapshot: no changes required for owners
>>> - deleting a snapshot: only possible if it has no child or a single
>>> child. It will require scanning all L2 tables, merging, and
>>> updating owners.
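A rough C sketch of the write path Method 2 describes; struct
owner_image and cow_for_write are illustrative names, not qcow2 code:

#include <stdint.h>

struct owner_image {
    uint64_t *owner;       /* per-cluster owner snapshot id */
    uint64_t  current_id;  /* id of the writable head snapshot */
};

/* Each cluster records the id of the snapshot that allocated it; a
 * write to a cluster owned by an older, read-only snapshot is
 * redirected to a freshly allocated copy owned by the current one. */
static uint64_t cow_for_write(struct owner_image *img, uint64_t cluster,
                              uint64_t (*alloc_cluster)(struct owner_image *),
                              void (*copy_cluster)(uint64_t from, uint64_t to))
{
    if (img->owner[cluster] == img->current_id) {
        return cluster;                        /* already ours, write in place */
    }
    uint64_t new_cluster = alloc_cluster(img); /* extends the file */
    copy_cluster(cluster, new_cluster);        /* the read-modify-write cost */
    img->owner[new_cluster] = img->current_id;
    return new_cluster;                        /* caller repoints the L2 entry */
}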
>>
>> I think this has similar characteristics to what we have with
>> external snapshots (i.e. backing files). The advantage of applying
>> it to internal snapshots is that when deleting a snapshot you don't
>> have to copy around all the data.
>>
>> Probably this change could even be done transparently for the user, so
>> that B still appears to be writeable, but in fact refers to D now.
>>
>>
>> Anyway, have you checked how bad the refcount work really is? I think
>> that writing the VM state takes a lot longer, so that optimising the
>> refcount update may be the wrong approach, especially if it requires a
>> format change. My results with qemu-img snapshot suggest that it's not
>> worth it:
>>
>> address@hidden:~/images$ ~/source/qemu/qemu-img info scratch.qcow2
>> image: scratch.qcow2
>> file format: qcow2
>> virtual size: 8.0G (8589934592 bytes)
>> disk size: 4.0G
>> cluster_size: 65536
>> address@hidden:~/images$ time ~/source/qemu/qemu-img snapshot -c test
>> scratch.qcow2
>>
>> real 0m0.116s
>> user 0m0.009s
>> sys 0m0.040s
>> address@hidden:~/images$ time ~/source/qemu/qemu-img snapshot -d test
>> scratch.qcow2
>>
>> real 0m0.084s
>> user 0m0.011s
>> sys 0m0.044s
>>
>> Kevin
>
> I'm not worried about the time it takes to take a snapshot, but
> rather about normal use after a snapshot has been taken. As you
> stated, while taking a snapshot you can disable writethrough caching,
> making it very fast, but during normal operation you can't.
Well, the obvious solution is not using writethrough in this case. You
need it only for some broken guest OSes.
The other solution is adding a dirty flag which says that the
refcounts on disk may not be accurate and must be rebuilt after a
crash. In this case you can drive the metadata cache in writeback mode
even with cache=writethrough. This dirty flag is included in my
proposal for qcow2v3.
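To illustrate the dirty-flag idea (the flag name, bit position, and
helper callbacks below are made up, not the actual qcow2v3 layout):

#include <stdint.h>

#define REFCOUNTS_DIRTY (1ULL << 0)  /* illustrative bit position */

struct qheader {
    uint64_t incompatible_features;
};

/* On open: if the flag survived, the image crashed with writeback
 * refcount updates pending, so rebuild refcounts from the L1/L2
 * tables. Then set the flag and flush the header before caching any
 * refcount update. */
static void image_open(struct qheader *h, void (*rebuild_refcounts)(void),
                       void (*sync_header)(struct qheader *))
{
    if (h->incompatible_features & REFCOUNTS_DIRTY) {
        rebuild_refcounts();
    }
    h->incompatible_features |= REFCOUNTS_DIRTY;
    sync_header(h);  /* must hit the disk before metadata goes writeback */
}

/* On clean shutdown: flush cached metadata, then clear the flag so
 * the next open can trust the on-disk refcounts. */
static void image_close(struct qheader *h, void (*flush_metadata)(void),
                        void (*sync_header)(struct qheader *))
{
    flush_metadata();
    h->incompatible_features &= ~REFCOUNTS_DIRTY;
    sync_header(h);
}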
> Personally, I'm also pondering a log to allow collapsing metadata
> updates, possibly even an external (separate file) full log (with
> data) to reduce the overhead caused by read-modify-write during
> partial cluster updates and to reduce file fragmentation. But as you
> can see from my patches, I'm still getting familiar with the QEMU
> code.
A journal is something to consider, yes. It's something that requires
some development effort, but long term I think it could provide some
nice advantages. I'm not sure if using it for the full data will help,
but for metadata it would certainly make sense.
Kevin
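For reference, one possible shape of a metadata journal record along
the lines discussed above (purely illustrative; no such structure
exists in qcow2):

#include <stdint.h>

/* Updates are appended sequentially and checkpointed into the
 * refcount and L2 tables later, turning many scattered writethrough
 * updates into one sequential append plus a periodic batch. */
struct journal_entry {
    uint32_t magic;      /* identifies a valid entry during replay */
    uint32_t checksum;   /* guards against torn writes */
    uint64_t sequence;   /* establishes replay order */
    uint64_t offset;     /* which metadata cluster is updated */
    uint32_t length;     /* payload size in bytes */
    /* followed by 'length' bytes of new metadata contents */
};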