From: Denis V. Lunev
Subject: Re: [Qemu-block] [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Thu, 13 Apr 2017 16:30:43 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 04/13/2017 04:21 PM, Alberto Garcia wrote:
> On Thu 13 Apr 2017 02:44:51 PM CEST, Denis V. Lunev wrote:
>>>> 1) Current L2 cache management seems very wrong to me. Each cache
>>>>     miss means that we have to read an entire L2 table (one cluster).
>>>>     This means that in the worst case (when the test's dataset does
>>>>     not fit into the L2 cache) we read 64 KB of L2 table for each
>>>>     4 KB read.
>>>>
>>>>     The situation gets MUCH worse once we start increasing the
>>>>     cluster size. For 1 MB clusters we have to read 1 MB on each
>>>>     cache miss.
>>>>
>>>>     The situation can be cured immediately once we start reading the
>>>>     L2 cache in 4 or 8 KB chunks. We have a patchset for this in our
>>>>     downstream and are preparing it for upstream.
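
For illustration, a minimal sketch of the chunked-read idea described
above, assuming 8-byte L2 entries and a 4 KiB chunk size. The helper
name and layout here are made up and are not the downstream patchset:

    /* Sketch: on a cache miss, read only the 4 KiB chunk of the L2 table
     * that holds the needed entry instead of the whole table (a full
     * cluster). Illustrative only; not QEMU's actual cache code. */
    #include <stdint.h>
    #include <unistd.h>

    #define L2_READ_CHUNK  4096u   /* assumed chunk size (4 KiB) */
    #define L2_ENTRY_SIZE  8u      /* size of one qcow2 L2 entry */

    /* Read into buf (L2_READ_CHUNK bytes) the chunk of the L2 table at
     * l2_table_offset that covers entry number l2_index. */
    static int l2_read_chunk(int fd, uint64_t l2_table_offset,
                             unsigned l2_index, uint8_t *buf)
    {
        uint64_t byte_in_table = (uint64_t)l2_index * L2_ENTRY_SIZE;
        uint64_t chunk_offset = byte_in_table & ~(uint64_t)(L2_READ_CHUNK - 1);
        ssize_t n = pread(fd, buf, L2_READ_CHUNK,
                          (off_t)(l2_table_offset + chunk_offset));
        return n == (ssize_t)L2_READ_CHUNK ? 0 : -1;
    }
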
>>> Correct, although the impact of this depends on whether you are using
>>> an SSD or an HDD.
>>>
>>> With an SSD what you want to minimize is the number of unnecessary
>>> reads, so reading small chunks will likely increase the performance
>>> when there's a cache miss.
>>>
>>> With an HDD what you want is to minimize the number of seeks. Once you
>>> have moved the disk head to the location where the cluster is, reading
>>> the whole cluster is relatively inexpensive, so (leaving the memory
>>> requirements aside) you generally want to read as much as possible.
>> no! This greatly helps for HDD too!
>>
>> The reason is that you cover areas of the virtual disk much more
>> precisely. Here is a very simple example. Let us assume that I have,
>> e.g., a 1 TB virtual HDD with a 1 MB cluster size. As far as I
>> understand, right now the L2 cache for that case consists of 4 L2
>> clusters.
>>
>> So I can exhaust the current cache with only 5 requests, and then each
>> actual read costs an extra L2 table read. This is a real problem. This
>> condition can easily happen on a fragmented FS.
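
A back-of-the-envelope check of those numbers, assuming the usual qcow2
layout of 8-byte L2 entries and one cluster per L2 table:

    cluster size             = 1 MiB
    entries per L2 table     = 1 MiB / 8 B     = 131072
    virtual range per table  = 131072 * 1 MiB  = 128 GiB
    L2 tables for 1 TiB      = 1 TiB / 128 GiB = 8 tables (8 MiB of metadata)

With a cache that holds only 4 of those 8 tables, five requests that each
land under a different table already force an eviction, and every later
miss re-reads a full 1 MiB L2 table to serve a 4 KiB guest read.
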
> But what you're saying is that this makes more efficient use of cache
> memory.
>
> If the guest OS has a lot of unused space but is very fragmented then
> you don't want to fill up your cache with L2 entries that are not going
> to be used. It's better to read smaller chunks from the L2 table so
> there are fewer chances of having to evict entries from the
> cache. Therefore this results in fewer cache misses and better I/O
> performance.
>
> Ok, this sounds perfectly reasonable to me.
>
> If, however, the cache is big enough for the whole disk then you never
> need to evict entries, so with an HDD you actually want to take
> advantage of disk seeks and read as many L2 entries as possible.
>
> However it's true that in this case this will only affect the initial
> reads. Once the cache is full there's no need to read the L2 tables from
> disk anymore and the performance will be the same, so your point remains
> valid.
>
> Still, one of the goals of my proposal is to reduce the amount of
> metadata needed for the image. No matter how efficient you make the
> cache, the only way to reduce the number of L2 entries is to increase
> the cluster size. And increasing the cluster size results in slower COW
> and less efficient use of disk space.
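
As a rough illustration of that trade-off (again assuming 8-byte L2
entries), for a 1 TiB image:

    64 KiB clusters: 1 TiB / 64 KiB = 16M L2 entries -> 128 MiB of L2 tables
    1 MiB clusters:  1 TiB / 1 MiB  =  1M L2 entries ->   8 MiB of L2 tables

so growing the cluster shrinks the L2 metadata 16x, at the price of a
16x larger COW unit.
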
Actually, we can still read whole clusters while the cache is empty or
nearly empty. Yes, the block size should be increased; I am in perfect
agreement with you there. But I think we could do that with a plain
increase of the cluster size, without any further tricks. Sub-clusters
as such will help if we are able to avoid COW. With COW I do not see
much difference.

But in the case where COW is avoided, later sequential reads will be
broken by the fragmentation of the file in the host. That is the point.
We should try to avoid host fragmentation altogether.


>>>>     Another problem is the amount of data written. We are writing an
>>>>     entire cluster per write operation, and this is also insane. It
>>>>     is possible to perform fallocate() plus the actual data write on
>>>>     a normal modern filesystem.
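
A rough sketch of that idea for the case where no backing-file data is
needed: reserve the cluster with fallocate() and write only the guest's
bytes, instead of padding the write out to a full cluster. The function
below is illustrative only, not QEMU's code:

    /* Allocate the new cluster without writing any padding, then write
     * only the guest data; the unwritten parts of the extent read back
     * as zeroes on filesystems such as ext4 and XFS. Error handling is
     * trimmed. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int write_new_cluster(int fd, off_t cluster_off,
                                 off_t cluster_size, const void *data,
                                 size_t len, off_t off_in_cluster)
    {
        if (fallocate(fd, 0, cluster_off, cluster_size) < 0) {
            return -1;                            /* allocation failed */
        }
        if (pwrite(fd, data, len, cluster_off + off_in_cluster)
                != (ssize_t)len) {
            return -1;                            /* short or failed write */
        }
        return 0;
    }
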
>>> But that only works when filling the cluster with zeroes, doesn't it? If
>>> there's a backing image you need to bring all the contents from there.
>> Yes, backing images are a problem. Though even with sub-clusters we
>> will suffer exactly the same number of IOPS, since the head and tail
>> still have to be read. If you are talking about subclusters equal to
>> the FS block size so that COW is avoided entirely, this would be
>> terribly slow later on with sequential reading. With such an approach
>> sequential reads turn into random reads.
>>
>> Guest OSes are written keeping in mind that adjacent LBAs are really
>> adjacent and that reading them sequentially is a very good idea. This
>> invariant will be broken in the case of subclusters.
> This invariant is already broken by the very design of the qcow2 format,
> subclusters don't really add anything new there. For any given cluster
> size you can write 4k in every odd cluster, then do the same in every
> even cluster, and you'll get an equally fragmented image.
>
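To make the odd/even example concrete (a hypothetical allocation trace,
not measured data): qcow2 allocates clusters in the order they are first
written, so that pattern gives roughly

    write order (guest clusters):  1, 3, 5, 7, ...  then  0, 2, 4, 6, ...
    resulting host file layout:    [1][3][5][7]...[0][2][4][6]...

and a guest-sequential read of clusters 0, 1, 2, 3 then costs one host
seek per cluster.
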
The size of the cluster matters! Our experiments with older Parallels
showed that with a 1 MB continuous (!) cluster this invariant is
"almost" kept, and this works fine for sequential ops.

Den



