From: Denis V. Lunev
Subject: Re: [Qemu-block] [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Thu, 13 Apr 2017 16:09:53 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 04/13/2017 04:05 PM, Kevin Wolf wrote:
> On 13.04.2017 at 14:44, Denis V. Lunev wrote:
>> On 04/13/2017 02:58 PM, Alberto Garcia wrote:
>>> On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote:
>>>> My opinion of this approach is very negative, as the problem could be
>>>> (partially) solved in a much better way.
>>> Hmm... it seems to me that (some of) the problems you are describing are
>>> different from the ones this proposal tries to address. Not that I
>>> disagree with them! I think you are giving useful feedback :)
>>>
>>>> 1) current L2 cache management seems very wrong to me. Each cache
>>>>     miss means that we have to read an entire L2 table cluster. This
>>>>     means that in the worst case (when the dataset of the test does not
>>>>     fit the L2 cache size) we read 64 kb of L2 table for each 4 kb read.
>>>>
>>>>     The situation gets MUCH worse once we start increasing the
>>>>     cluster size. For 1 MB clusters we have to read 1 MB on each
>>>>     cache miss.
>>>>
>>>>     The situation can be cured immediately once we start reading the
>>>>     L2 cache in 4 or 8 KB chunks. We have a patchset for this in our
>>>>     downstream and are preparing it for upstream.
>>> Correct, although the impact of this depends on whether you are using
>>> an SSD or an HDD.
>>>
>>> With an SSD what you want to minimize is the number of unnecessary
>>> reads, so reading small chunks will likely increase performance when
>>> there's a cache miss.
>>>
>>> With an HDD what you want is to minimize the number of seeks. Once you
>>> have moved the disk head to the location where the cluster is, reading
>>> the whole cluster is relatively inexpensive, so (leaving the memory
>>> requirements aside) you generally want to read as much as possible.
>> no! This greatly helps for HDD too!
>>
>> The reason is that you cover areas of the virtual disk much more precisely.
>> Here is a very simple example. Let us assume that I have, e.g., a 1 TB
>> virtual HDD with a 1 MB cluster size. As far as I understand, right now
>> the L2 cache for that case consists of 4 L2 clusters.
>>
>> So I can exhaust the current cache with only 5 requests, and each actual
>> read will then cost an extra L2 table read. This is a real problem, and
>> this condition can easily occur on a fragmented FS.
>>
>> With my proposal the situation is MUCH better. All accesses will be taken
>> from the cache after the first run.
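
Just to put numbers on the example above, a back-of-the-envelope sketch
(illustrative only, not actual qcow2 code):

/* Rough arithmetic for a 1 TB image with 1 MB clusters (illustrative only). */
#include <stdio.h>

int main(void)
{
    const unsigned long long disk_size    = 1ULL << 40;  /* 1 TB virtual disk  */
    const unsigned long long cluster_size = 1ULL << 20;  /* 1 MB clusters      */
    const unsigned long long l2_entry     = 8;           /* bytes per L2 entry */

    /* One L2 table occupies one cluster and maps this much guest data: */
    unsigned long long covered_per_l2 = (cluster_size / l2_entry) * cluster_size;
    /* Number of L2 tables needed for the whole disk: */
    unsigned long long l2_tables = disk_size / covered_per_l2;

    printf("guest data mapped per L2 table: %llu GB\n", covered_per_l2 >> 30); /* 128 */
    printf("L2 tables for the whole disk:   %llu\n", l2_tables);               /* 8   */

    /* So a cache of 4 whole L2 tables covers only half of the disk, and
     * every miss costs a full 1 MB read; caching the same memory in 4 KB
     * chunks would give 1024 independently cached pieces instead. */
    return 0;
}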
>>
>>>> 2) yet another terrible thing in cluster allocation is its allocation
>>>>     strategy.
>>>>     The current QCOW2 codebase implies that we need 5 (five) I/O
>>>>     operations to complete a COW operation: reading the head, writing
>>>>     the head, reading the tail, writing the tail, and writing the
>>>>     actual data. This could easily be reduced to 3 I/O operations.
>>> That sounds right, but I'm not sure if this is really incompatible with
>>> my proposal :)
>> the problem is code complexity, which is already very high right now.
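
To illustrate the 3-I/O variant (a simplified sketch with made-up helper
names, no error handling, and not the actual qcow2 code path):

/* COW with 2 reads + 1 contiguous write instead of read head / write head /
 * read tail / write tail / write data. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int cow_write(int src_fd, int dst_fd, off_t src_off, off_t dst_off,
                     size_t cluster_size, const void *data,
                     size_t data_off, size_t data_len)
{
    char *buf = malloc(cluster_size);
    if (!buf) {
        return -1;
    }

    /* 1st I/O: read the head of the old cluster */
    pread(src_fd, buf, data_off, src_off);
    /* 2nd I/O: read the tail of the old cluster */
    pread(src_fd, buf + data_off + data_len,
          cluster_size - data_off - data_len,
          src_off + data_off + data_len);
    /* merge the guest data in between */
    memcpy(buf + data_off, data, data_len);

    /* 3rd I/O: one contiguous write of the whole new cluster */
    pwrite(dst_fd, buf, cluster_size, dst_off);

    free(buf);
    return 0;
}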
>>
>>
>>>>     Another problem is the amount of data written. We are writing the
>>>>     entire cluster in the write operation, and this is also insane. It
>>>>     is possible to perform fallocate() plus the actual data write on a
>>>>     normal modern filesystem.
>>> But that only works when filling the cluster with zeroes, doesn't it? If
>>> there's a backing image you need to bring all the contents from there.
>> Yes, backing images are a problem. Though even with sub-clusters we
>> will suffer exactly the same amount of I/O, as the head and tail still
>> have to be read. If you are speaking about subclusters equal to the FS
>> block size and avoiding COW entirely, this would be terribly slow later
>> on with sequential reading: in such an approach a sequential read turns
>> into random reads.
>>
>> Guest OSes are written with the assumption that adjacent LBAs are really
>> adjacent and that reading them sequentially is a very good idea. This
>> invariant will be broken in the subcluster case.
> How so?
>
> Given the same cluster size, subclustered and traditional images behave
> _exactly_ the same regarding fragmentation. Subclusters only have an
> effect on it (and a positive one) when you take them as a reason that
> you can now afford to increase the cluster size.
>
> I see subclusters and fragmentation as mostly independent topics.
>
>> For today's SSDs we are facing problems somewhere else. Right now I
>> can achieve only 100k IOPS on an SSD capable of 350-550k. A 1 MB
>> cluster size with preallocation and a fragmented L2 cache gives the
>> same 100k. Tests on an initially empty image give around 80k for us.
> Preallocated images aren't particularly interesting to me. qcow2 is used
> mainly for two reasons. One of them is sparseness (initially small file
> size) mostly for desktop use cases with no serious I/O, so not that
> interesting either. The other one is snapshots, i.e. backing files,
> which don't work with preallocation (yet).
>
> Actually, preallocation with backing files is something that subclusters
> would automatically enable: You could already reserve the space for a
> cluster, but still leave all subclusters marked as unallocated.

I am speaking about fallocate() for the entire cluster before the actual
write() for an originally empty image. This increases the performance of 4k
random writes by 10+ times. In this case we can just write those 4k and do
nothing else.
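
Roughly like this, as a sketch (hypothetical function, hypothetical offsets,
error handling omitted):

/* Reserve the whole, originally empty cluster with fallocate() and then
 * write only the guest data (e.g. a single 4k block) instead of writing
 * the full cluster. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static void alloc_cluster_and_write(int fd, off_t cluster_off,
                                    size_t cluster_size, const void *data,
                                    off_t off_in_cluster, size_t len)
{
    /* Reserve space for the whole cluster; the filesystem reads back
     * zeroes for the parts we never touch. */
    fallocate(fd, 0, cluster_off, cluster_size);

    /* Write only the 4k of guest data. */
    pwrite(fd, data, len, cluster_off + off_in_cluster);
}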

Den


