
Re: [Qemu-block] [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation


From: Denis V. Lunev
Subject: Re: [Qemu-block] [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Thu, 13 Apr 2017 15:44:51 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 04/13/2017 02:58 PM, Alberto Garcia wrote:
> On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote:
>> My opinion about this approach is very negative as the problem could
>> be (partially) solved in a much better way.
> Hmm... it seems to me that (some of) the problems you are describing are
> different from the ones this proposal tries to address. Not that I
> disagree with them! I think you are giving useful feedback :)
>
>> 1) current L2 cache management seems very wrong to me. Each cache
>>     miss means that we have to read an entire L2 cache block. This means
>>     that in the worst case (when the dataset of the test does not fit the
>>     L2 cache size) we read 64 KB of L2 table for each 4 KB read.
>>
>>     The situation is MUCH worse once we start to increase the cluster
>>     size. For 1 MB clusters we have to read 1 MB on each cache miss.
>>
>>     The situation can be cured immediately once we start reading the
>>     L2 cache in 4 or 8 KB chunks. We have a patchset for this in our
>>     downstream and are preparing it for upstream.
> Correct, although the impact of this depends on whether you are using
> an SSD or an HDD.
>
> With an SSD what you want to minimize is the number of unnecessary
> reads, so reading small chunks will likely increase performance when
> there's a cache miss.
>
> With an HDD what you want is to minimize the number of seeks. Once you
> have moved the disk head to the location where the cluster is, reading
> the whole cluster is relatively inexpensive, so (leaving the memory
> requirements aside) you generally want to read as much as possible.
No, this greatly helps for HDDs too!

The reason is that you cover areas of the virtual disk much more precisely.
Here is a very simple example. Assume I have, for instance, a 1 TB virtual
HDD with a 1 MB cluster size. As far as I understand, the L2 cache in that
case currently consists of 4 L2 clusters.

So I can exhaust the current cache with only 5 requests, and each actual
read then costs an extra L2 table read. This is a real problem for reads,
and the condition can easily occur on a fragmented filesystem.

With my proposal the situation is MUCH better: after the first run, all
accesses are served from the cache. The arithmetic is sketched below.
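To make the arithmetic behind this example explicit, here is a minimal
back-of-the-envelope sketch in Python. It assumes the standard 8-byte
qcow2 L2 entry size and takes the 4-table L2 cache and the 4 KB read
chunks mentioned above as given; the variable names are only for
illustration.

MiB, GiB, TiB = 2**20, 2**30, 2**40

cluster_size  = 1 * MiB      # cluster size from the example above
l2_entry_size = 8            # bytes per qcow2 L2 entry
disk_size     = 1 * TiB      # virtual disk size
cached_tables = 4            # L2 tables the cache holds, as stated above

entries_per_l2   = cluster_size // l2_entry_size    # 131072 entries
coverage_per_l2  = entries_per_l2 * cluster_size    # 128 GiB of guest data
l2_tables_needed = disk_size // coverage_per_l2     # 8 tables for 1 TiB
cached_coverage  = cached_tables * coverage_per_l2  # 512 GiB fits in cache

print(f"each L2 table covers {coverage_per_l2 // GiB} GiB of the disk")
print(f"{l2_tables_needed} tables needed, {cached_tables} cached "
      f"({cached_coverage // GiB} GiB of {disk_size // TiB} TiB)")
print("so 5 requests to distinct 128 GiB regions already thrash the cache,")
print(f"and every miss re-reads a whole {cluster_size // MiB} MiB L2 cluster")

# With chunked L2 reads, a miss costs only one small read, and a single
# 4 KiB chunk still maps 512 entries = 512 MiB of contiguous guest data.
chunk = 4 * 1024
print(f"a 4 KiB chunk maps {(chunk // l2_entry_size) * cluster_size // MiB} MiB")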

>> 2) yet another terrible thing in cluster allocation is its allocation
>>     strategy. The current QCOW2 codebase implies that we need 5 (five)
>>     I/O operations to complete a COW operation: we read the head, write
>>     the head, read the tail, write the tail, and write the actual guest
>>     data. This could easily be reduced to 3 I/O operations.
> That sounds right, but I'm not sure if this is really incompatible with
> my proposal :)
The problem is code complexity, which is already very high right now;
see the sketch below.
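To illustrate the 5-vs-3 point (this is only my reading of the idea, not
the actual QEMU code; Img, cow_current and cow_merged are made-up names,
and pread/pwrite stand in for the real block-layer calls):

class Img:
    """Toy in-memory image; counts I/O operations."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.ops = 0
    def pread(self, off, length):
        self.ops += 1
        return bytes(self.buf[off:off + length])
    def pwrite(self, off, data):
        self.ops += 1
        self.buf[off:off + len(data)] = data

def cow_current(backing, dst, data, off, cluster_size):
    # 5 ops: read head, write head, read tail, write tail, write guest data
    head = backing.pread(0, off)
    dst.pwrite(0, head)
    tail = backing.pread(off + len(data), cluster_size - off - len(data))
    dst.pwrite(off + len(data), tail)
    dst.pwrite(off, data)

def cow_merged(backing, dst, data, off, cluster_size):
    # 3 ops: read head, read tail, one contiguous write of the whole cluster
    head = backing.pread(0, off)
    tail = backing.pread(off + len(data), cluster_size - off - len(data))
    dst.pwrite(0, head + data + tail)

backing, dst = Img(1 << 20), Img(1 << 20)
cow_merged(backing, dst, b"x" * 4096, off=64 * 1024, cluster_size=1 << 20)
print(backing.ops + dst.ops)   # 3, versus 5 with cow_current()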


>>     Another problem is the amount of data written. We are writing the
>>     entire cluster on a write operation and this is also insane. It is
>>     possible to perform fallocate() plus the actual data write on a
>>     normal modern filesystem.
> But that only works when filling the cluster with zeroes, doesn't it? If
> there's a backing image you need to bring all the contents from there.

Yes, backing images are a problem. Though even with sub-clusters we will
suffer exactly the same number of I/O operations, as the head and tail
still have to be read. If you are talking about subclusters equal to the
FS block size and avoiding COW entirely, that would be terribly slow
later on: with such an approach sequential reads turn into random reads.
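For the fallocate() point quoted above, here is a minimal sketch of the
idea for the no-backing-file case on Linux, where the rest of the cluster
can legitimately stay unallocated; the file name, offsets and sizes are
only illustrative, this is not how QEMU itself does it:

import os

CLUSTER = 1 << 20          # 1 MiB cluster, as in the example above
data    = b"guest data"    # the bytes the guest actually wrote
off     = 4096             # offset of that write inside the cluster
host    = 0                # host offset of the newly allocated cluster

fd = os.open("image.raw", os.O_RDWR | os.O_CREAT, 0o644)
try:
    # Reserve the whole cluster without writing a megabyte of zeroes ...
    os.posix_fallocate(fd, host, CLUSTER)
    # ... then write only the guest data itself instead of the full cluster.
    os.pwrite(fd, data, host + off)
finally:
    os.close(fd)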

Guest OSes are written keeping in mind that adjacent LBAs are really
adjacent and that reading them sequentially is a very good idea. This
invariant will be broken in the case of subclusters.

For today's SSDs we are facing problems elsewhere. Right now I can
achieve only 100k IOPS on an SSD capable of 350-550k. A 1 MB cluster
size with preallocation and a fragmented L2 cache gives the same 100k.
Tests on an initially empty image give us around 80k.

Maybe I have too good hardware for these tests. Right now I am quoting
numbers from our last run on top-end Intel SSDs. We will remeasure this
on something slower before submission ;)

Den



