Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation


From: Alberto Garcia
Subject: Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Thu, 13 Apr 2017 15:21:30 +0200
User-agent: Notmuch/0.18.2 (http://notmuchmail.org) Emacs/24.4.1 (i586-pc-linux-gnu)

On Thu 13 Apr 2017 02:44:51 PM CEST, Denis V. Lunev wrote:
>>> 1) current L2 cache management seems very wrong to me. Each cache
>>>     miss means that we have to read the entire L2 table. This means
>>>     that in the worst case (when the dataset of the test does not fit
>>>     the L2 cache size) we read 64 KB of L2 table for each 4 KB read.
>>>
>>>     The situation is MUCH worse once we start increasing the cluster
>>>     size. For 1 MB clusters we have to read 1 MB on each cache miss.
>>>
>>>     The situation can be cured immediately once we start reading the
>>>     L2 cache in 4 or 8 KB chunks. We have a patchset for this in our
>>>     downstream and are preparing it for upstream.
>> Correct, although the impact of this depends on whether you are using
>> an SSD or an HDD.
>>
>> With an SSD what you want to minimize is the number of unnecessary
>> reads, so reading small chunks will likely increase the performance
>> when there's a cache miss.
>>
>> With an HDD what you want is to minimize the number of seeks. Once you
>> have moved the disk head to the location where the cluster is, reading
>> the whole cluster is relatively inexpensive, so (leaving the memory
>> requirements aside) you generally want to read as much as possible.
> no! This greatly helps for HDD too!
>
> The reason is that you cover areas of the virtual disk much more
> precisely. Here is a very simple example. Let us assume that I have
> e.g. a 1 TB virtual HDD with a 1 MB cluster size. As far as I
> understand, right now the L2 cache for that case consists of 4 L2
> clusters.
>
> So I can exhaust the current cache with only 5 requests, and each
> actual read will then cost an L2 table read. This is a real problem,
> and the condition can easily occur on a fragmented FS.

But what you're saying is that this makes more efficient use of the
cache memory.

If the guest OS has a lot of unused space but is very fragmented then
you don't want to fill up your cache with L2 entries that are not going
to be used. It's better to read smaller chunks from the L2 table so
there are fewer chances of having to evict entries from the
cache. Therefore this results in fewer cache misses and better I/O
performance.

Ok, this sounds perfectly reasonable to me.
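
To put rough numbers on your example (a quick sketch, nothing more: the
4-table cache size is the figure from your message, I haven't checked
what the current defaults actually give you; the rest is just qcow2
format math with 8-byte L2 entries and one L2 table per cluster):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t disk_size        = 1ULL << 40;  /* 1 TB virtual disk, as in the example */
    uint64_t cluster_size     = 1ULL << 20;  /* 1 MB clusters */
    uint64_t cached_l2_tables = 4;           /* cache size taken from Denis's figures */

    uint64_t entries_per_l2 = cluster_size / 8;              /* 8-byte L2 entries */
    uint64_t bytes_per_l2   = entries_per_l2 * cluster_size; /* virtual range per L2 table */
    uint64_t l2_tables      = disk_size / bytes_per_l2;

    printf("one cached L2 table covers %" PRIu64 " GB of virtual disk\n",
           bytes_per_l2 >> 30);
    printf("%" PRIu64 " L2 tables cover the whole disk, the cache holds %" PRIu64 "\n",
           l2_tables, cached_l2_tables);
    printf("every cache miss reads %" PRIu64 " KB of metadata for a 4 KB guest read\n",
           cluster_size >> 10);
    return 0;
}

So touching five of those eight 128 GB regions is already enough to
start evicting, and every subsequent miss re-reads a full megabyte of
metadata to serve a 4 KB guest read; reading the table in 4 or 8 KB
chunks divides that per-miss cost by 256 or 128.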

If, however, the cache is big enough for the whole disk then you never
need to evict entries, so with an HDD you actually want to take
advantage of each seek and read as many L2 entries as possible.

However, it's true that in this case this only affects the initial
reads. Once the cache is full there's no need to read the L2 tables from
disk anymore and the performance will be the same, so your point remains
valid.

Still, one of the goals of my proposal is to reduce the amount of
metadata needed for the image. No matter how efficient you make the
cache, the only way to reduce the number of L2 entries is to increase
the cluster size, and increasing the cluster size results in slower COW
and less efficient use of disk space.
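
To make that trade-off concrete, here is the same kind of format math
for a 1 TB image (again just a sketch: the L2 column uses today's
8-byte entries, and the last column simply assumes a 32-way split of
the cluster as one possible subcluster layout, not necessarily what the
final format will use):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t disk_size = 1ULL << 40;   /* 1 TB image */
    uint64_t sizes[]   = { 64 << 10, 256 << 10, 1 << 20, 2 << 20 };

    printf("cluster   L2 metadata/1TB   COW per alloc.   COW w/ 32 subclusters\n");
    for (int i = 0; i < 4; i++) {
        uint64_t cs       = sizes[i];
        uint64_t clusters = disk_size / cs;
        uint64_t l2_bytes = clusters * 8;   /* 8-byte L2 entries, today's format */

        /* Worst case COW copies roughly the whole allocation unit
         * (minus the 4 KB actually written by the guest). */
        printf("%6" PRIu64 "K %14" PRIu64 "M %16" PRIu64 "K %20" PRIu64 "K\n",
               cs >> 10, l2_bytes >> 20, cs >> 10, (cs / 32) >> 10);
    }
    return 0;
}

Going from 64 KB to 1 MB clusters cuts the L2 metadata from 128 MB to
8 MB but multiplies the COW unit by 16; splitting the cluster for
allocation purposes is what lets you keep a small COW unit with the
larger cluster size (ignoring whatever extra bits the subcluster
bitmap itself adds to the L2 entry).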

>>>     Another problem is the amount of data written. We are writing the
>>>     entire cluster on each allocating write operation and this is also
>>>     insane. It is possible to perform fallocate() and the actual data
>>>     write on a normal modern filesystem.
>> But that only works when filling the cluster with zeroes, doesn't it? If
>> there's a backing image you need to bring all the contents from there.
>
> Yes. Backing images are a problem. Though, even with subclusters, we
> will suffer exactly the same in terms of IOPS, as even then the head
> and tail have to be read. If you are talking about subclusters equal
> to the FS block size and avoiding COW entirely, that would be terribly
> slow later on for sequential reading. With such an approach sequential
> reads will turn into random reads.
>
> Guest OSes are written with the assumption that adjacent LBAs are
> really adjacent and that reading them sequentially is a very good
> idea. This invariant will be broken in the case of subclusters.

This invariant is already broken by the very design of the qcow2 format;
subclusters don't really add anything new there. For any given cluster
size you can write 4k in every odd cluster, then do the same in every
even cluster, and you'll get an equally fragmented image.
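
And going back to the fallocate() point earlier in the thread: if I
understood it correctly, the idea is something like this sketch
(assuming Linux and a filesystem such as ext4 or XFS where an
fallocate()d range is an unwritten extent that reads back as zeroes;
the file name and offsets are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Allocate a brand new cluster and write only the guest data into it,
 * instead of writing out the whole cluster.  The untouched head and
 * tail read back as zeroes from the unwritten extent, so this only
 * works when there is no backing file to copy from. */
static int write_into_new_cluster(int fd, off_t cluster_off, off_t cluster_size,
                                  const void *buf, size_t len, off_t off_in_cluster)
{
    if (fallocate(fd, 0, cluster_off, cluster_size) < 0) {
        return -1;   /* no support: fall back to writing the full cluster */
    }
    if (pwrite(fd, buf, len, cluster_off + off_in_cluster) != (ssize_t)len) {
        return -1;
    }
    return 0;
}

int main(void)
{
    char data[4096];
    int fd = open("cluster-demo.img", O_RDWR | O_CREAT | O_TRUNC, 0600);

    memset(data, 0xaa, sizeof(data));
    /* 4 KB of guest data at offset 64 KB inside the first 1 MB cluster */
    if (fd < 0 ||
        write_into_new_cluster(fd, 0, 1 << 20, data, sizeof(data), 64 << 10) < 0) {
        return 1;
    }
    close(fd);
    return 0;
}

Whether that actually wins over a plain 1 MB write depends on the
filesystem, and it obviously doesn't help when the cluster has to be
populated from a backing image, which is exactly the case subclusters
are meant to make cheaper.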

Berto


