Re: [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache


From: Denis V. Lunev
Subject: Re: [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache
Date: Tue, 19 Sep 2017 18:18:17 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0

On 09/19/2017 06:07 PM, Alberto Garcia wrote:
> Hi everyone,
>
> over the past few weeks I have been testing the effects of reducing
> the size of the entries in the qcow2 L2 cache. This was briefly
> mentioned by Denis in the same thread where we discussed subcluster
> allocation back in April, but I'll describe here the problem and the
> proposal in detail.
>
> === Problem ===
>
> In the qcow2 file format guest addresses are mapped to host addresses
> using the so-called L1 and L2 tables. The size of an L2 table is the
> same as the cluster size, therefore a larger cluster means more L2
> entries in a table, and because of that an L2 table can map a larger
> portion of the address space (not only because it contains more
> entries, but also because the data cluster that each one of those
> entries points at is larger).
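>
> (A quick rule of thumb, since each L2 entry is 8 bytes: an L2 table
> holds cluster_size / 8 entries, so it maps
>
>      (cluster_size / 8) * cluster_size
>
> bytes of virtual disk. With 64KB clusters that is 65536 / 8 = 8192
> entries, and 8192 * 64KB = 512MB.)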
>
> There are two consequences of this:
>
>    1) If you double the cluster size of a qcow2 image then the maximum
>       space needed for all L2 tables is divided by two (i.e. you need
>       half the metadata).
>
>    2) If you double the cluster size of a qcow2 image then each one of
>       the L2 tables will map four times as much disk space.
>
> With the default cluster size of 64KB, each L2 table maps 512MB of
> contiguous disk space. This table shows what happens when you change
> the cluster size:
>
>      |--------------+------------------|
>      | Cluster size | An L2 table maps |
>      |--------------+------------------|
>      |       512  B |            32 KB |
>      |         1 KB |           128 KB |
>      |         2 KB |           512 KB |
>      |         4 KB |             2 MB |
>      |         8 KB |             8 MB |
>      |        16 KB |            32 MB |
>      |        32 KB |           128 MB |
>      |        64 KB |           512 MB |
>      |       128 KB |             2 GB |
>      |       256 KB |             8 GB |
>      |       512 KB |            32 GB |
>      |         1 MB |           128 GB |
>      |         2 MB |           512 GB |
>      |--------------+------------------|
>
> When QEMU wants to convert a guest address into a host address, it
> needs to read the entry from the corresponding L2 table. The qcow2
> driver doesn't read those entries directly; instead it loads the
> tables into the L2 cache so they can be kept in memory in case they
> are needed later.
>
> The problem here is that the L2 cache (and the qcow2 driver in
> general) always works with complete L2 tables: if QEMU needs a
> particular L2 entry then the whole cluster containing the L2 table is
> read from disk, and if the cache is full then a cluster worth of
> cached data has to be discarded.
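>
> (To make the index math concrete, here is a rough, self-contained
> sketch in C. The names are made up for illustration and the real code
> in qcow2-cluster.c is of course more involved:
>
>      #include <stdint.h>
>
>      #define L2_ENTRY_SIZE 8   /* each L2 entry is 64 bits */
>
>      /* Split a guest offset into the L1 index (which L2 table),
>       * the L2 index (which entry inside that table) and the offset
>       * inside the data cluster. */
>      static void split_guest_offset(uint64_t guest_offset,
>                                     uint64_t cluster_size,
>                                     uint64_t *l1_index,
>                                     uint64_t *l2_index,
>                                     uint64_t *offset_in_cluster)
>      {
>          uint64_t l2_entries = cluster_size / L2_ENTRY_SIZE;
>          uint64_t cluster_index = guest_offset / cluster_size;
>
>          *offset_in_cluster = guest_offset % cluster_size;
>          *l2_index = cluster_index % l2_entries;
>          *l1_index = cluster_index / l2_entries;
>      }
>
> The point is that even though we only want one 8-byte entry (l2_index),
> the cache can only fetch the whole cluster-sized L2 table that
> l1_index refers to.)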
>
> The consequences of this are worse the larger the cluster size is, not
> only because we're reading (and discarding) larger amounts of data,
> but also because we're using that memory in a very inefficient way.
>
> Example: with 1MB clusters each L2 table maps 128GB of contiguous
> virtual disk, so that's the granularity of our cache. If we're
> performing I/O in a 4GB area that overlaps two of those 128GB chunks,
> we need to have in the cache two complete L2 tables (2MB) even when in
> practice we're only using 32KB of those 2MB (32KB contain enough L2
> entries to map the 4GB that we're using).
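>
> (The arithmetic behind that 32KB figure: 4GB / 1MB = 4096 data
> clusters, and 4096 L2 entries * 8 bytes = 32KB, yet we have to keep
> the two complete 1MB tables in the cache.)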
>
> === The proposal ===
>
> One way to solve the problems described above is to decouple the L2
> table size (which is equal to the cluster size) from the cache entry
> size.
>
> The qcow2 cache doesn't actually know anything about the data that
> it's loading, it just receives a disk offset and checks that it is
> properly aligned. It's perfectly possible to make it load data blocks
> smaller than a cluster.
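>
> (Continuing the sketch from above, again with made-up names: with a
> configurable cache entry size the cache only needs to read the slice
> of the L2 table that contains the entry we want,
>
>      /* Host offset that the cache would read from disk: the
>       * cache_entry_size-aligned slice of the L2 table that holds
>       * entry l2_index. */
>      static uint64_t l2_slice_offset(uint64_t l2_table_offset,
>                                      uint64_t l2_index,
>                                      uint64_t cache_entry_size)
>      {
>          uint64_t entries_per_slice = cache_entry_size / L2_ENTRY_SIZE;
>
>          return l2_table_offset +
>                 (l2_index / entries_per_slice) * cache_entry_size;
>      }
>
> instead of always reading cluster_size bytes starting at
> l2_table_offset.)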
>
> I already have a working prototype, and I was doing tests using a 4KB
> cache entry size. 4KB is small enough to allow a more flexible use of
> the cache, it's a common file system block size, and it can hold
> enough L2 entries to cover substantial amounts of disk space
> (especially with large clusters).
>
>      |--------------+-----------------------|
>      | Cluster size | 4KB of L2 entries map |
>      |--------------+-----------------------|
>      | 64 KB        | 32 MB                 |
>      | 128 KB       | 64 MB                 |
>      | 256 KB       | 128 MB                |
>      | 512 KB       | 256 MB                |
>      | 1 MB         | 512 MB                |
>      | 2 MB         | 1 GB                  |
>      |--------------+-----------------------|
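>
> (The arithmetic here: 4KB / 8 bytes per entry = 512 L2 entries, and
> they map 512 * cluster_size of virtual disk, e.g. 512 * 2MB = 1GB.)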
>
> Some results from my tests (using an SSD drive and random 4K reads):
>
> |-----------+--------------+-------------------+---------------+--------------|
> | Disk size | Cluster size | L2 cache [covers] | Standard QEMU | Patched QEMU |
> |-----------+--------------+-------------------+---------------+--------------|
> | 16 GB     | 64 KB        | 1 MB [8 GB]       | 5000 IOPS     | 12700 IOPS   |
> |  2 TB     |  2 MB        | 4 MB [1 TB]       |  576 IOPS     | 11000 IOPS   |
> |-----------+--------------+-------------------+---------------+--------------|
>
> The improvements are clearly visible, but it's important to point out
> a couple of things:
>
>    - L2 cache size is always < total L2 metadata on disk (otherwise
>      this wouldn't make sense). Increasing the L2 cache size improves
>      performance a lot (and makes the effect of these patches
>      disappear), but it requires more RAM.
>    - Doing random reads over the whole disk is probably not a very
>      realistic scenario. During normal usage only certain areas of the
>      disk need to be accessed, so performance should be much better
>      with the same amount of cache.
>    - I wrote a best-case scenario test (several I/O jobs, each
>      accessing a part of the disk that requires loading its own L2
>      table) and my patched version is 20x faster even with 64KB
>      clusters.
>
> === What needs to change? ===
>
> Not so much, fortunately:
>
>    - The on-disk format does not need any change, qcow2 images remain
>      the same.
>    - The qcow2 cache driver needs almost no changes: the entry size
>      is no longer assumed to be equal to the cluster size and has to
>      be set explicitly. Other than that it remains the same (it can
>      even be simplified).
>
> The QEMU qcow2 driver does need a few changes:
>
>    - qcow2_get_cluster_offset() and get_cluster_table() simply need to
>      be aware that they're not loading full L2 tables anymore.
>    - handle_copied(), handle_alloc(), discard_single_l2() and
>      zero_single_l2() only need to update the calculation of
>      nb_clusters (roughly as in the sketch after this list).
>    - l2_allocate() and qcow2_update_snapshot_refcount() cannot load a
>      full L2 table in memory at once, they need to loop over the
>      sub-tables. Other than wrapping the core of those functions
>      inside a loop I haven't detected any major problem.
>    - We need a proper name for these sub-tables that we are loading
>      now. I'm actually still struggling with this :-) I can't think of
>      any name that is clear enough and not too cumbersome to use (L2
>      subtables? => Confusing. L3 tables? => they're not really that).
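>
> (The sketch referenced above, building on the earlier ones and again
> with made-up names: the number of contiguous entries that can be
> processed at once is now capped by the end of the loaded slice rather
> than by the end of the L2 table,
>
>      /* Entries left in the current slice, starting at l2_index */
>      static uint64_t entries_until_end_of_slice(uint64_t l2_index,
>                                                 uint64_t cache_entry_size)
>      {
>          uint64_t entries_per_slice = cache_entry_size / L2_ENTRY_SIZE;
>
>          return entries_per_slice - (l2_index % entries_per_slice);
>      }
>
> and l2_allocate() / qcow2_update_snapshot_refcount() would loop over
> the slices in a similar way.)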
>
> === What about the refcount cache? ===
>
> This proposal would not touch this at all, the assumption would remain
> that "refcount cache entry size = cluster size".
>
> ===========================
>
> I think I haven't forgotten anything. As I said, I have a working
> prototype of this, and if you like the idea I'd like to publish it
> soon. Any questions or comments will be appreciated.
>
> Thanks!
>
> Berto
very good analysis :)

I have a couple of words about how realistic this scenario is. There is
a very common test done by our customers: they create a 1/2/4/16 TB
disk and start random I/O over the whole disk, trying to emulate
database access on all-flash storage.

IMHO this scenario is important, and I like this proposal very much.

Thank you in advance,
    Den


