From: Denis Lunev
Subject: Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
Date: Thu, 27 Jun 2019 14:19:25 +0000

On 6/27/19 4:59 PM, Alberto Garcia wrote:
> Hi all,
>
> a couple of years ago I came to the mailing list with a proposal to
> extend the qcow2 format to add subcluster allocation.
>
> You can read the original message (and the discussion thread that came
> afterwards) here:
>
>    https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
>
> The description of the problem from the original proposal is still
> valid so I won't repeat it here.
>
> What I have been doing during the past few weeks is picking up the
> code that I wrote in 2017, making it work with the latest QEMU and
> fixing many of its bugs. I once again have a working prototype which
> is by no means complete, but it gives us up-to-date information about
> what we can expect from this feature.
>
> My goal with this message is to reopen the discussion and re-evaluate
> whether this is a feature that we'd like for QEMU in light of the test
> results and all the changes that we have had in the past couple of
> years.
>
> === Test results ===
>
> I ran these tests with the same hardware configuration as in 2017: an
> SSD drive and random 4KB write requests to an empty 40GB qcow2 image.
>
> Here are the results when the qcow2 file is backed by a fully
> populated image. There are 8 subclusters per cluster and the
> subcluster size is in brackets:
>
> |-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
> |   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
> | 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
> | 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
> | 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
> |  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
> |  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
> |   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------|
>
> Some comments about the results, after comparing them with those from
> 2017:
>
> - As expected, 32KB clusters / 4 KB subclusters give the best results
>   because that matches the size of the write request and therefore
>   there's no copy-on-write involved (see the numbers below).
>
> - Allocation is generally faster now in all cases (between 20-90%,
>   depending on the case). We have made several optimizations to the
>   code since last time, and I suppose that the COW changes made in
>   commits b3cf1c7cf8 and ee22a9d869 are probably the main factor
>   behind these improvements.
>
> - Apart from the 64KB/8KB case (which is much faster), the patterns are
>   generally the same: subcluster allocation offers similar performance
>   benefits compared to last time, so there are no surprises in this
>   area.
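>
> To put the first comment above in numbers: a 4 KB write that goes to
> an unallocated cluster backed by data has to copy roughly (allocation
> unit - 4 KB) from the backing file, i.e. about 2 MB per request with
> 2 MB clusters, 252 KB with 256 KB subclusters, 4 KB with 8 KB
> subclusters, and nothing at all with 4 KB subclusters.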
>
> Then I ran the tests again using the same environment but without a
> backing image. The goal is to measure the impact of subcluster
> allocation on completely empty images.
>
> Here we have an important change: since commit c8bb23cbdb empty
> clusters are preallocated and filled with zeroes using an efficient
> operation (typically fallocate() with FALLOC_FL_ZERO_RANGE) instead of
> writing the zeroes with the usual pwrite() call.
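>
> As a rough illustration of the two paths (a minimal standalone C
> sketch, not QEMU's actual code):
>
>   #define _GNU_SOURCE
>   #include <fcntl.h>           /* fallocate() */
>   #include <linux/falloc.h>    /* FALLOC_FL_ZERO_RANGE */
>   #include <stdlib.h>
>   #include <unistd.h>          /* pwrite() */
>
>   /* New path: ask the filesystem to expose the range as zeroes;
>    * no data blocks need to be written at all. */
>   int zero_cluster_fallocate(int fd, off_t off, off_t len)
>   {
>       return fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len);
>   }
>
>   /* Old path: explicitly write len bytes of zeroes to the file. */
>   int zero_cluster_pwrite(int fd, off_t off, size_t len)
>   {
>       char *buf = calloc(1, len);
>       ssize_t ret = buf ? pwrite(fd, buf, len, off) : -1;
>       free(buf);
>       return ret == (ssize_t)len ? 0 : -1;
>   }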
>
> The effects of this are dramatic, so I decided to run two sets of
> tests: one with this optimization and one without it.
>
> Here are the results:
>
> |-----------------+----------------+-----------------+----------------+-----------------|
> |                 | Initialization with fallocate()  | Initialization with pwritev()    |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |   2 MB (256 KB) | 14468 IOPS     | 14776 IOPS      |  1181 IOPS     |  260 IOPS       |
> |   1 MB (128 KB) | 13752 IOPS     | 14956 IOPS      |  1916 IOPS     |  358 IOPS       |
> | 512 KB  (64 KB) | 12961 IOPS     | 14776 IOPS      |  4038 IOPS     |  684 IOPS       |
> | 256 KB  (32 KB) | 12790 IOPS     | 14534 IOPS      |  6172 IOPS     | 1213 IOPS       |
> | 128 KB  (16 KB) | 12550 IOPS     | 13967 IOPS      |  8700 IOPS     | 1976 IOPS       |
> |  64 KB   (8 KB) | 12491 IOPS     | 13432 IOPS      | 11735 IOPS     | 4267 IOPS       |
> |  32 KB   (4 KB) | 13203 IOPS     | 11752 IOPS      | 12366 IOPS     | 6306 IOPS       |
> |   4 KB (512 B)  |   103 IOPS     |   101 IOPS      |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------+----------------+-----------------|
>
> Comments:
>
> - With the old-style allocation method using pwritev() we get similar
>   benefits as we did last time. The comments from the test with a
>   backing image apply to this one as well.
>
> - However, the new allocation method is so efficient that having
>   subclusters does not offer any performance benefit. It even slows
>   things down a bit in most cases, so we'd probably need to fine-tune
>   the algorithm in order to get similar results.
>
> - In light of these numbers I also think that even when there's a
>   backing image we could preallocate the full cluster but only do COW
>   on the affected subclusters. This would leave the rest of the
>   cluster preallocated on disk but unallocated in the bitmap. This would
>   probably reduce on-disk fragmentation, which was one of the concerns
>   raised during the original discussion.
>
> I also ran some tests on a rotating HDD. Here having subclusters
> doesn't make a big difference regardless of whether there is a backing
> image or not, so we can ignore this scenario.
>
> === Changes to the on-disk format ===
>
> In my original proposal I described 3 different alternatives for
> storing the subcluster bitmaps. I'm naming them here, but refer to
> that message for more details.
>
> (1) Storing the bitmap inside the 64-bit entry
> (2) Making L2 entries 128-bit wide
> (3) Storing the bitmap somewhere else
>
> I used (1) for this implementation for simplicity, but I think (2) is
> probably the best one.
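>
> Just to make option (2) a bit more concrete, here is one possible
> layout in C terms (purely hypothetical, the exact format is still an
> open question): the entry grows to 128 bits and the extra word holds
> the subcluster allocation bitmap.
>
>   #include <stdint.h>
>
>   /* Hypothetical 128-bit L2 entry for option (2), illustration only:
>    * the first word keeps the current cluster descriptor, the second
>    * word holds one allocation bit per subcluster. */
>   typedef struct Qcow2L2Entry128 {
>       uint64_t cluster_descriptor;  /* host offset + flags, as today  */
>       uint64_t subcluster_bitmap;   /* bit n set => subcluster n used */
>   } Qcow2L2Entry128;
>
>   static inline int subcluster_is_allocated(const Qcow2L2Entry128 *e,
>                                             unsigned n)
>   {
>       return (e->subcluster_bitmap >> n) & 1;
>   }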
>
> ===========================
>
> And I think that's all. As you can see I didn't want to go much into
> the open technical questions (I think the on-disk format would be the
> main one); the first goal should be to decide whether this is still an
> interesting feature or not.
>
> So, any questions or comments will be much appreciated.
>
> Berto
I would like to add my $0.02 here from a slightly different
point of view.

Right now QCOW2 with the default cluster size (64k) is not very
efficient when you want fast performance with big disks. Nowadays
people use really BIG images, and 1, 2, 3 or even 8 TB disks are
really common. Unfortunately people want fast random IO too. Thus
the metadata cache should be kept entirely in memory, as in any
other case IOPS are halved (one operation to read the metadata and
one operation for the real read). For an 8 TB image this results in
1 GB of RAM for that cache. With a 1 MB cluster size we get 64 MB,
which is much more reasonable.
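(The arithmetic behind those figures, assuming the usual 8 bytes of
L2 metadata per data cluster: 8 TB / 64 KB = 2^27 L2 entries * 8 bytes
= 1 GB of L2 tables, while 8 TB / 1 MB = 2^23 entries * 8 bytes =
64 MB.)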

However, with a 1 MB cluster size the reclaim process becomes much,
much worse. I cannot give exact numbers, unfortunately; AFAIR the
image occupies 30-50% more space. Guys, I would appreciate it if you
could correct me here with real numbers.

Thus, with respect to these patterns, subclusters could give us the
benefits of both fast random IO and a good reclaim rate. I would
consider 64k clusters / 8k subclusters too extreme: in reality we
would end up with a completely fragmented image very soon, and
without preallocation sequential reads would become random VERY
quickly. Though this still makes some sense for COW. But, again, in
such a case subclusters should not be left as holes, as required by
the scenario I mentioned first.

Den
