Re: [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache
From: Denis V. Lunev
Subject: Re: [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache
Date: Tue, 19 Sep 2017 18:18:17 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0
On 09/19/2017 06:07 PM, Alberto Garcia wrote:
> Hi everyone,
>
> over the past few weeks I have been testing the effects of reducing
> the size of the entries in the qcow2 L2 cache. This was briefly
> mentioned by Denis in the same thread where we discussed subcluster
> allocation back in April, but I'll describe here the problem and the
> proposal in detail.
>
> === Problem ===
>
> In the qcow2 file format guest addresses are mapped to host addresses
> using the so-called L1 and L2 tables. The size of an L2 table is the
> same as the cluster size, therefore a larger cluster means more L2
> entries in a table, and because of that an L2 table can map a larger
> portion of the address space (not only because it contains more
> entries, but also because the data cluster that each one of those
> entries points at is larger).
>
> There are two consequences of this:
>
> 1) If you double the cluster size of a qcow2 image then the maximum
> space needed for all L2 tables is divided by two (i.e. you need
> half the metadata).
>
> 2) If you double the cluster size of a qcow2 image then each one of
> the L2 tables will map four times as much disk space.
>
> With the default cluster size of 64KB, each L2 table maps 512MB of
> contiguous disk space. This table shows what happens when you change
> the cluster size:
>
> |--------------+------------------|
> | Cluster size | An L2 table maps |
> |--------------+------------------|
> | 512 B | 32 KB |
> | 1 KB | 128 KB |
> | 2 KB | 512 KB |
> | 4 KB | 2 MB |
> | 8 KB | 8 MB |
> | 16 KB | 32 MB |
> | 32 KB | 128 MB |
> | 64 KB | 512 MB |
> | 128 KB | 2 GB |
> | 256 KB | 8 GB |
> | 512 KB | 32 GB |
> | 1 MB | 128 GB |
> | 2 MB | 512 GB |
> |--------------+------------------|
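The table above follows directly from the on-disk format: an L2 entry is 8 bytes and an L2 table occupies exactly one cluster, so coverage grows quadratically with the cluster size. A quick sketch of the arithmetic (an illustration, not QEMU code):

```python
# Each L2 entry is 8 bytes in the qcow2 format, and an L2 table
# occupies exactly one cluster, so an L2 table's coverage is
# (cluster_size / 8) * cluster_size, i.e. quadratic in cluster_size.
L2_ENTRY_SIZE = 8  # bytes per L2 entry

def l2_table_coverage(cluster_size):
    """Bytes of virtual disk mapped by one L2 table."""
    entries_per_table = cluster_size // L2_ENTRY_SIZE
    return entries_per_table * cluster_size

# With the default 64 KB clusters, one L2 table maps 512 MB:
print(l2_table_coverage(64 * 1024) // (1024 * 1024), "MB")  # prints "512 MB"
```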
>
> When QEMU wants to convert a guest address into a host address, it
> needs to read the entry from the corresponding L2 table. The qcow2
> driver doesn't read those entries directly; instead it loads the
> tables into the L2 cache, so they can be kept in memory in case they
> are needed later.
>
> The problem here is that the L2 cache (and the qcow2 driver in
> general) always works with complete L2 tables: if QEMU needs a
> particular L2 entry then the whole cluster containing the L2 table is
> read from disk, and if the cache is full then a cluster worth of
> cached data has to be discarded.
>
> The consequences of this are worse the larger the cluster size is, not
> only because we're reading (and discarding) larger amounts of data,
> but also because we're using that memory in a very inefficient way.
>
> Example: with 1MB clusters each L2 table maps 128GB of contiguous
> virtual disk, so that's the granularity of our cache. If we're
> performing I/O in a 4GB area that overlaps two of those 128GB chunks,
> we need to have in the cache two complete L2 tables (2MB) even when in
> practice we're only using 32KB of those 2MB (32KB contain enough L2
> entries to map the 4GB that we're using).
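To make the numbers in that example concrete, here is the same calculation written out (an illustrative sketch of the arithmetic only):

```python
# Worked numbers for the example above: 1 MB clusters, a 4 GB I/O
# area straddling two 128 GB chunks, so two full L2 tables cached.
cluster_size = 1024 * 1024
l2_entry_size = 8

table_size = cluster_size                       # an L2 table is one cluster
clusters_touched = (4 * 1024 ** 3) // cluster_size   # 4096 clusters
bytes_of_entries_used = clusters_touched * l2_entry_size  # 32 KB of entries
cached = 2 * table_size                         # 2 MB held in the cache

print(bytes_of_entries_used // 1024, "KB used of",
      cached // (1024 * 1024), "MB cached")
```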
>
> === The proposal ===
>
> One way to solve the problems described above is to decouple the L2
> table size (which is equal to the cluster size) from the cache entry
> size.
>
> The qcow2 cache doesn't actually know anything about the data that
> it's loading, it just receives a disk offset and checks that it is
> properly aligned. It's perfectly possible to make it load data blocks
> smaller than a cluster.
>
> I already have a working prototype, and I was doing tests using a 4KB
> cache entry size. 4KB is small enough to allow a more flexible use of
> the cache; it's also a common file system block size, and it can hold
> enough L2 entries to cover substantial amounts of disk space
> (especially with large clusters).
>
> |--------------+-----------------------|
> | Cluster size | 4KB of L2 entries map |
> |--------------+-----------------------|
> | 64 KB | 32 MB |
> | 128 KB | 64 MB |
> | 256 KB | 128 MB |
> | 512 KB | 256 MB |
> | 1 MB | 512 MB |
> | 2 MB | 1 GB |
> |--------------+-----------------------|
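A 4 KB cache entry holds 4096 / 8 = 512 L2 entries, so its coverage is simply 512 times the cluster size, which reproduces this table (again an illustrative sketch, not QEMU code):

```python
CACHE_ENTRY_SIZE = 4096  # proposed cache entry size in bytes
L2_ENTRY_SIZE = 8        # bytes per L2 entry in the qcow2 format

def slice_coverage(cluster_size):
    """Bytes of virtual disk mapped by one 4 KB cache entry."""
    entries_per_slice = CACHE_ENTRY_SIZE // L2_ENTRY_SIZE  # 512
    return entries_per_slice * cluster_size

# 512 entries per 4 KB slice, so 2 MB clusters give 512 * 2 MB = 1 GB.
print(slice_coverage(2 * 1024 * 1024) // 1024 ** 3, "GB")  # prints "1 GB"
```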
>
> Some results from my tests (using an SSD drive and random 4K reads):
>
> |-----------+--------------+-------------+---------------+--------------|
> | Disk size | Cluster size | L2 cache | Standard QEMU | Patched QEMU |
> |-----------+--------------+-------------+---------------+--------------|
> | 16 GB | 64 KB | 1 MB [8 GB] | 5000 IOPS | 12700 IOPS |
> | 2 TB | 2 MB | 4 MB [1 TB] | 576 IOPS | 11000 IOPS |
> |-----------+--------------+-------------+---------------+--------------|
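The bracketed figures in the "L2 cache" column are how much of the virtual disk each cache size can keep mapped at once: cache_size / 8 entries, times the cluster size. A sketch of that calculation (illustration only):

```python
L2_ENTRY_SIZE = 8  # bytes per L2 entry

def cache_coverage(cache_size, cluster_size):
    """Bytes of virtual disk whose L2 entries fit in a cache of cache_size."""
    return (cache_size // L2_ENTRY_SIZE) * cluster_size

# 1 MB cache with 64 KB clusters covers 8 GB;
# 4 MB cache with 2 MB clusters covers 1 TB.
print(cache_coverage(1024 * 1024, 64 * 1024) // 1024 ** 3, "GB")
print(cache_coverage(4 * 1024 * 1024, 2 * 1024 * 1024) // 1024 ** 4, "TB")
```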
>
> The improvements are clearly visible, but it's important to point out
> a couple of things:
>
> - L2 cache size is always < total L2 metadata on disk (otherwise
> this wouldn't make sense). Increasing the L2 cache size improves
> performance a lot (and makes the effect of these patches
> disappear), but it requires more RAM.
> - Doing random reads over the whole disk is probably not a very
> realistic scenario. During normal usage only certain areas of the
> disk need to be accessed, so performance should be much better
> with the same amount of cache.
> - I wrote a best-case scenario test (several I/O jobs each accessing
> a part of the disk that requires loading its own L2 table) and my
> patched version is 20x faster even with 64KB clusters.
>
> === What needs to change? ===
>
> Not so much, fortunately:
>
> - The on-disk format does not need any change, qcow2 images remain
> the same.
> - The qcow2 cache driver needs almost no changes: the entry size is
> no longer assumed to be equal to the cluster size, and it has to
> be explicitly set. Other than that it remains the same (it can
> even be simplified).
>
> The QEMU qcow2 driver does need a few changes:
>
> - qcow2_get_cluster_offset() and get_cluster_table() simply need to
> be aware that they're not loading full L2 tables anymore.
> - handle_copied(), handle_alloc(), discard_single_l2() and
> zero_single_l2() only need to update the calculation of
> nb_clusters.
> - l2_allocate() and qcow2_update_snapshot_refcount() cannot load a
> full L2 table in memory at once, they need to loop over the
> sub-tables. Other than wrapping the core of those functions
> inside a loop I haven't detected any major problem.
> - We need a proper name for these sub-tables that we are loading
> now. I'm actually still struggling with this :-) I can't think of
> any name that is clear enough and not too cumbersome to use (L2
> subtables? => Confusing. L3 tables? => they're not really that).
>
> === What about the refcount cache? ===
>
> This proposal would not touch this at all, the assumption would remain
> that "refcount cache entry size = cluster size".
>
> ===========================
>
> I think I haven't forgotten anything. As I said I have a working
> prototype of this and if you like the idea I'd like to publish it
> soon. Any questions or comments will be appreciated.
>
> Thanks!
>
> Berto
Very good analysis :)

A couple of words about how realistic this scenario is: there is a very
common test among our customers in which they create a 1/2/4/16 TB disk
and start random I/O over the whole disk, trying to emulate database
access on an all-flash drive.

IMHO this scenario is important and I like this approach very much.

Thank you in advance,
Den