From: Kevin Wolf
Subject: Re: [Qemu-block] [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Thu, 13 Apr 2017 15:05:55 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On 13.04.2017 at 14:44, Denis V. Lunev wrote:
> On 04/13/2017 02:58 PM, Alberto Garcia wrote:
> > On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote:
> >> My opinion about this approach is very negative as the problem could
> >> be (partially) solved in a much better way.
> > Hmm... it seems to me that (some of) the problems you are describing are
> > different from the ones this proposal tries to address. Not that I
> > disagree with them! I think you are giving useful feedback :)
> >
> >> 1) Current L2 cache management seems very wrong to me. Each cache
> >>     miss means that we have to read an entire cluster-sized L2 table.
> >>     This means that in the worst case (when the dataset of the test
> >>     does not fit into the L2 cache) we read 64 KB of L2 table for each
> >>     4 KB read.
> >>
> >>     The situation is MUCH worse once we start increasing the cluster
> >>     size. With 1 MB clusters we have to read 1 MB on each cache
> >>     miss.
> >>
> >>     The situation can be cured immediately once we start reading the
> >>     L2 cache in 4 or 8 KB chunks. We have a patchset for this in our
> >>     downstream and are preparing it for upstream.
> > Correct, although the impact of this depends on whether you are using
> > an SSD or an HDD.
> >
> > With an SSD what you want to minimize is the number of unnecessary
> > reads, so reading small chunks will likely increase the performance
> > when there's a cache miss.
> >
> > With an HDD what you want is to minimize the number of seeks. Once you
> > have moved the disk head to the location where the cluster is, reading
> > the whole cluster is relatively inexpensive, so (leaving the memory
> > requirements aside) you generally want to read as much as possible.
> No! This greatly helps for HDDs too!
> 
> The reason is that you cover areas of the virtual disk much more
> precisely. Here is a very simple example. Let us assume that I have,
> e.g., a 1 TB virtual HDD with a 1 MB cluster size. As far as I
> understand, right now the L2 cache for this case consists of 4 L2
> clusters.
> 
> So I can exhaust the current cache with only 5 requests, and then each
> actual read costs an extra L2 table read. This is a real problem. This
> condition can easily happen on a fragmented FS.
> 
> With my proposal the situation is MUCH better. All accesses will be taken
> from the cache after the first run.
> 
> >> 2) Yet another terrible thing in cluster allocation is the allocation
> >>     strategy itself.
> >>     The current QCOW2 codebase implies that we need 5 (five) I/O
> >>     operations to complete a COW operation: reading the head, writing
> >>     the head, reading the tail, writing the tail, and writing the
> >>     actual data. This could easily be reduced to 3 I/O operations.
> > That sounds right, but I'm not sure if this is really incompatible with
> > my proposal :)
> the problem is code complexity, which is already very high right now.
> 
> 
> >>     Another problem is the amount of data written. We are writing the
> >>     entire cluster in the write operation, and this is also insane. On
> >>     a normal modern filesystem it is possible to perform an fallocate()
> >>     plus the actual data write instead.
> > But that only works when filling the cluster with zeroes, doesn't it? If
> > there's a backing image you need to bring all the contents from there.
> 
> Yes, backing images are a problem. Though even with subclusters we
> will suffer exactly the same number of I/O operations, as the head and
> the tail still have to be read. If you are speaking about subclusters
> equal to the FS block size to avoid COW entirely, this would be
> terribly slow later on with sequential reading: in such an approach a
> sequential read turns into random reads.
> 
> Guest OSes are written keeping in mind that adjacent LBAs are really
> adjacent and that reading them sequentially is a very good idea. This
> invariant would be broken in the case of subclusters.

How so?

Given the same cluster size, subclustered and traditional images behave
_exactly_ the same regarding fragmentation. Subclusters only have an
effect on it (and a positive one) when you take them as a reason that
you can now afford to increase the cluster size.

I see subclusters and fragmentation as mostly independent topics.
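
To put rough numbers on the L2 cache example above, here is a small
back-of-the-envelope sketch (Python, purely illustrative). The 8 bytes
per L2 entry come from the qcow2 spec; the 4 MiB cache size is only an
assumption matching the "4 L2 clusters" figure from the example, not a
statement about any particular QEMU default.

# Back-of-the-envelope arithmetic, not QEMU code: how much guest data one
# L2 table maps and how many L2 tables fit into a given metadata cache.
# One L2 entry is 8 bytes (qcow2 spec); one L2 table occupies one cluster.

MiB = 1024 * 1024
GiB = 1024 * MiB

def l2_coverage(cluster_size, cache_size):
    entries_per_table = cluster_size // 8          # 8-byte L2 entries
    data_per_table = entries_per_table * cluster_size
    tables_in_cache = cache_size // cluster_size   # one table per cluster
    return data_per_table, tables_in_cache, data_per_table * tables_in_cache

cache = 4 * MiB                                    # assumed cache size
for cs in (64 * 1024, 1 * MiB):
    per_table, tables, covered = l2_coverage(cs, cache)
    print(f"{cs // 1024:>5} KiB clusters: one L2 table maps "
          f"{per_table // MiB} MiB, {tables} tables fit in a "
          f"{cache // MiB} MiB cache, {covered // GiB} GiB covered; "
          f"a cache miss reads {cs // 1024} KiB of metadata")

The last column is the point of the discussion: with large clusters each
cache miss currently reads a whole L2 table (cluster_size bytes) unless
the cache loads it in smaller chunks, which is what the mentioned
patchset changes.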

> For today's SSDs we are facing problems somewhere else. Right now I
> can achieve only 100k IOPS on an SSD capable of 350-550k. A 1 MB
> cluster size with preallocation and a fragmented L2 cache gives the
> same 100k. Tests on an initially empty image give around 80k for us.

Preallocated images aren't particularly interesting to me. qcow2 is used
mainly for two reasons. One of them is sparseness (initially small file
size), mostly for desktop use cases with no serious I/O, so not that
interesting either. The other one is snapshots, i.e. backing files,
which don't work with preallocation (yet).

Actually, preallocation with backing files is something that subclusters
would automatically enable: You could already reserve the space for a
cluster, but still leave all subclusters marked as unallocated.
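
For illustration, a minimal sketch of that state (Python, with made-up
names; this is not the on-disk layout proposed in the RFC, just the
idea): the cluster already has a host offset reserved, but the
per-subcluster allocation bitmap is still empty, so reads keep going to
the backing file until a subcluster is actually written.

# Hypothetical model only -- not the RFC's proposed L2 entry format.
SUBCLUSTERS_PER_CLUSTER = 32        # assumed subcluster count per cluster

class L2Entry:
    def __init__(self, host_offset):
        self.host_offset = host_offset   # space in the image file is reserved
        self.alloc_bitmap = 0            # but no subcluster is allocated yet

    def write_subcluster(self, index):
        assert 0 <= index < SUBCLUSTERS_PER_CLUSTER
        self.alloc_bitmap |= 1 << index

    def reads_from_backing(self, index):
        # unallocated subcluster: data still comes from the backing file
        return not (self.alloc_bitmap & (1 << index))

entry = L2Entry(host_offset=0x500000)      # "preallocated" cluster
entry.write_subcluster(3)                  # guest writes one subcluster
assert entry.reads_from_backing(0)         # untouched: served by backing file
assert not entry.reads_from_backing(3)     # written: served by this image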

Kevin


