
Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format


From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Sun, 12 Sep 2010 12:09:34 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.12) Gecko/20100826 Lightning/1.0b1 Thunderbird/3.0.7

On 09/12/2010 10:56 AM, Avi Kivity wrote:
No, the worst case is 0.003% allocated disk, with the allocated clusters distributed uniformly. That means all your L2s are allocated, but almost none of your clusters are.

But in this case, you're so sparse that your metadata is pretty much co-located which means seek performance won't matter much.


But since you have to boot before you can run any serious test, if it takes 5 seconds to do an fsck(), it's highly likely that it's not even noticeable.

What if it takes 300 seconds?

That means for a 1TB disk you're taking 500ms per L2 entry, you're fully allocated and yet still doing an fsck. That seems awfully unlikely.

    if l2.committed:
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.unlock()
    else:
        l2.mutex.lock()
        l2cache[l2.pos] = l2
        l2.mutex.unlock()

The in-memory L2 is created by defaultdict(). I did omit linking L2 into L1, but that's a function call. With a state machine, it's a new string of states and calls.

But you have to write the L2 to disk first before you link it so it's not purely in memory.
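(To make the ordering concrete, here's a minimal sketch, not the qed code; the offsets and the 8-byte little-endian entry layout are assumptions. The point is that the new L2 table has to be durable before the L1 entry can reference it:)

    import os

    def link_new_l2(fd, l1_offset, l1_index, l2_bytes, l2_offset):
        # Write the new L2 table at its allocated offset...
        os.pwrite(fd, l2_bytes, l2_offset)
        # ...and make it durable before anything points at it.
        os.fsync(fd)
        # Only then update the L1 entry to reference the new table.
        os.pwrite(fd, l2_offset.to_bytes(8, 'little'), l1_offset + 8 * l1_index)
        os.fsync(fd)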

It's far easier to just avoid internal snapshots altogether and this is exactly the thought process that led to QED. Once you drop support for internal snapshots, you can dramatically simplify.

The amount of metadata is O(nb_L2 * nb_snapshots). For qed, nb_snapshots = 1 but nb_L2 can be still quite large. If fsck is too long for one, it is too long for the other.

nb_L2 is very small. It's exactly n / 2GB + 1 where n is image size. Since image size is typically < 100GB, practically speaking it's less than 50.
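(Purely illustrative arithmetic, using the n / 2GB figure above:)

    GB = 1 << 30
    for n in (10 * GB, 100 * GB, 1024 * GB):
        nb_l2 = n // (2 * GB) + 1
        print(n // GB, "GB image ->", nb_l2, "L2 tables")
    # 10 GB image -> 6 L2 tables
    # 100 GB image -> 51 L2 tables
    # 1024 GB image -> 513 L2 tables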

OTOH, nb_snapshots in qcow2 can be very large. In fact, it's not unrealistic for nb_snapshots to be >> 50. What that means is that instead of metadata being O(n) as it is today, it's at least O(n^2).

Why is it n^2? It's still n*m. If your image is 4TB instead of 100GB, the time increases by a factor of 40 for both.

It's n*m, but either n ~= m, in which case it's n^2; or m << n, in which case it's just O(n); or m >> n, in which case it's just O(m).

This is where asymptotic complexity ends up not being terribly helpful :-)

Let me put this another way though, if you support internal snapshots, what's a reasonable number of snapshots to expect reasonable performance with? 10? 100? 1000? 10000?

Not doing qed-on-lvm is definitely a limitation. The one use case I've heard is qcow2 on top of clustered LVM as clustered LVM is simpler than a clustered filesystem. I don't know the space well enough so I need to think more about it.

I don't either. If this use case survives, and if qed isn't changed to accommodate it, it means that that's another place where qed can't supplant qcow2.

I'm okay with that. An image file should require a file system. If I was going to design an image file to be used on top of raw storage, I would take an entirely different approach.

That spreads our efforts further.

No. I don't think we should be in the business of designing on top of raw storage. Either assume fixed partitions, LVM, or a file system. We shouldn't reinvent the wheel at every opportunity (just the carefully chosen opportunities).

Refcount table. See above discussion for my thoughts on refcount table.

Ok. It boils down to "is fsck on startup acceptable". Without a freelist, you need fsck for both unclean shutdown and for UNMAP.

To rebuild the free list on unclean shutdown.
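(A minimal sketch of what that rebuild amounts to, operating on an in-memory view of the tables; the data structures here are made up for illustration:)

    def rebuild_free_list(l1, l2_tables, file_size, cluster_size):
        # l1: list of L2 table offsets (0 = unallocated)
        # l2_tables: dict mapping an L2 table offset to its data cluster offsets
        # (the header and L1 clusters themselves are ignored here for brevity)
        used = set()
        for l2_off in l1:
            if not l2_off:
                continue
            used.add(l2_off // cluster_size)
            for data_off in l2_tables[l2_off]:
                if data_off:
                    used.add(data_off // cluster_size)
        n_clusters = file_size // cluster_size
        return sorted(set(range(n_clusters)) - used)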

If you have an on-disk compact freelist, you don't need that fsck.

"If you have an on-disk compact [consistent] freelist, you don't need that fsck."

Consistency is the key point. We go out of our way to avoid a consistent freelist in QED because it's the path to best performance. The key goal for a file format should be to have exactly as much consistency as required and not one bit more, since consistency always means worse performance.

On the other hand, allocating a cluster in qcow2 as it is now requires scanning the refcount table. Not very pretty. Kevin, how does that perform?

(an aside: with cache!=none we're bouncing in the kernel as well; we really need to make it work for cache=none, perhaps use O_DIRECT for data and writeback for metadata and shared backing images).

QED achieves zero-copy with cache=none today. In fact, our performance testing that we'll publish RSN is exclusively with cache=none.

In this case, preallocation should really be cheap, since there isn't a ton of dirty data that needs to be flushed. You issue an extra flush once in a while so your truncate (or physical image size in the header) gets to disk, but that doesn't block new writes.

It makes qed/lvm work, and it replaces the need to fsck for the next allocation with the need for a background scrubber to reclaim storage (you need that anyway for UNMAP). It makes the whole thing a lot more attractive IMO.

For a 1PB disk image with qcow2, the reference count table is 128GB. For a 1TB image, the reference count table is 128MB. For a 128GB image, the reference table is 16MB which is why we get away with it today.
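(Sanity-checking those numbers, assuming 64KB clusters and the 8-byte, 64-bit refcount entries discussed below:)

    KB, GB, TB, PB = 1 << 10, 1 << 30, 1 << 40, 1 << 50
    cluster_size, entry_size = 64 * KB, 8
    for image_size in (128 * GB, 1 * TB, 1 * PB):
        table_size = (image_size // cluster_size) * entry_size
        print(image_size // GB, "GB image ->", table_size // (1 << 20), "MB refcount table")
    # 128 GB image -> 16 MB refcount table
    # 1024 GB image -> 128 MB refcount table
    # 1048576 GB image -> 131072 MB refcount table (i.e. 128 GB)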

Anytime you grow the freelist with qcow2, you have to write a brand new freelist table and update the metadata synchronously to point to a new version of it. That means for a 1TB image, you're potentially writing out 128MB of data just to allocate a new cluster.

s/freelist/refcount table/ to translate to current qcow2 nomenclature. This is certainly not fast. You can add a bunch of free blocks each time to mitigate the growth, but I can't think of many circumstances where a 128MB write isn't going to be noticeable. And it only gets worse as time moves on because 1TB disk images are already in use today.
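(A rough sketch of the growth sequence being described, using plain pwrite/fsync; the offsets are hypothetical and this is not qcow2's actual code:)

    import os

    REFCOUNT_ENTRY = 8  # assuming the 64-bit entries discussed here

    def grow_refcount_table(fd, old_off, old_entries, new_off, new_entries,
                            header_field_off):
        # Copy the existing table into a larger region elsewhere in the file.
        table = os.pread(fd, old_entries * REFCOUNT_ENTRY, old_off)
        table += b"\0" * ((new_entries - old_entries) * REFCOUNT_ENTRY)
        os.pwrite(fd, table, new_off)   # ~128MB of writes for a 1TB image
        os.fsync(fd)                    # the new table must be durable first...
        # ...before the header is switched over to point at it.
        os.pwrite(fd, new_off.to_bytes(8, 'little'), header_field_off)
        os.fsync(fd)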

NB, with a 64-bit refcount table, the size of the refcount table is almost exactly the same size as the L1/L2 table in QED. IOW, the cost of traversing the refcount table to allocate a cluster is exactly the cost of traversing all of the L1/L2 metadata to build a freelist. IOW, you're doing the equivalent of an fsck every time you open a qcow2 file today.

It's very easy to neglect the details in something like qcow2. We've been talking like the refcount table is basically free to read and write but it's absolutely not. With large disk images, you're caching an awful lot of metadata to read the refcount table in fully.

If you reduce the reference count table to exactly two bits, you can store that within the L1/L2 metadata since we have an extra 12 bits worth of storage space. Since you need the L1/L2 metadata anyway, we might as well just use that space as the authoritative source of the free list information.
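(Illustrative only: packing a 2-bit count into the low bits of a 64-bit table entry. The "extra 12 bits" figure is from the paragraph above; the exact mask layout here is invented:)

    REFCOUNT_MASK = 0x3      # 2 bits of refcount / free-list state
    OFFSET_MASK = 0xFFFFFFFFFFFFFFFF ^ REFCOUNT_MASK

    def pack_entry(cluster_offset, refcount):
        assert cluster_offset & REFCOUNT_MASK == 0   # offsets are cluster-aligned
        return cluster_offset | (refcount & REFCOUNT_MASK)

    def unpack_entry(entry):
        return entry & OFFSET_MASK, entry & REFCOUNT_MASK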

The only difference between qcow2 and qed is that since we use an on-demand table for L1/L2, our free list may be non-contiguous. Since we store virtual -> physical instead of physical -> virtual, you have to do a full traversal with QED whereas with qcow2 you may get lucky. However, the fact that the reference count table is contiguous in qcow2 is a design flaw IMHO because it makes growth extremely painful with large images, to the point where I'll claim that qcow2 is probably unusable by design with > 1TB disk images.

We can optimize qed by having a contiguous freelist mapping physical -> virtual (that's just a bitmap, and therefore considerably smaller) but making the freelist not authoritative. That makes it much faster because we don't add another sync, and it lets us fall back to the L1/L2 table for authoritative information if we had an unclean shutdown.
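(A minimal sketch of such a non-authoritative bitmap; for a 1TB image with 64KB clusters that's 16M bits, i.e. a 2MB hint:)

    class FreeHintBitmap:
        """One bit per physical cluster; a hint only, rebuilt from L1/L2
        after an unclean shutdown rather than kept in sync with extra flushes."""

        def __init__(self, n_clusters):
            self.bits = bytearray((n_clusters + 7) // 8)

        def mark_used(self, cluster):
            self.bits[cluster >> 3] |= 1 << (cluster & 7)

        def mark_free(self, cluster):
            self.bits[cluster >> 3] &= 0xFF ^ (1 << (cluster & 7))

        def is_free(self, cluster):
            return not (self.bits[cluster >> 3] & (1 << (cluster & 7)))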

It's a good compromise for performance and it validates the qed philosophy. By starting with a correct and performant approach that scales to large disk images, we can add features (like unmap) without sacrificing either.

Regards,

Anthony Liguori



Yes, you'll want to have that regardless. But adding new things to qcow2 has all the problems of introducing a new image format.

Just some of them. On mount, rewrite the image format as qcow3. On clean shutdown, write it back to qcow2. So now there's no risk of data corruption (but there is reduced usability).

It means on unclean shutdown, you can't move images to older versions. That means a management tool can't rely on the mobility of images which means it's a new format for all practical purposes.

QED started its life as qcow3. You start with qcow3, remove the features that are poorly thought out and make correctness hard, add some future proofing, and you're left with QED.

We're fully backwards compatible with qcow2 (by virtue that qcow2 is still in tree) but new images require new versions of QEMU. That said, we have a conversion tool to convert new images to the old format if mobility is truly required.

So it's the same story that you're telling above from an end-user perspective.

It's not exactly the same story (you can enable it selectively, or you can run fsck before moving) but I agree it isn't a good thing.


They are once you copy the image. And power loss is the same thing as unexpected exit because you're not simply talking about delaying a sync, you're talking about staging future I/O operations purely within QEMU.

qed is susceptible to the same problem. If you have a 100MB write and qemu exits before it updates L2s, then those 100MB are leaked. You could alleviate the problem by writing L2 at intermediate points, but even then, a power loss can leak those 100MB.
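(A sketch of the "intermediate points" idea: commit the staged L2 updates every N clusters during a big write so at most N clusters' worth can leak. The callbacks are stand-ins, not qed functions:)

    def write_large_region(clusters, write_data, stage_l2_update, flush_l2,
                           commit_interval=256):
        pending = 0
        for cluster in clusters:
            write_data(cluster)          # the guest data itself
            stage_l2_update(cluster)     # in-memory L2 update only
            pending += 1
            if pending >= commit_interval:
                flush_l2()               # bounds how much can leak on a crash
                pending = 0
        if pending:
            flush_l2()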

qed trades off the freelist for the file size (anything beyond the file size is free), it doesn't eliminate it completely. So you still have some of its problems, but you don't get its benefits.

I think you've just established that qcow2 and qed both require an fsck. I don't disagree :-)

There's a difference between a background scrubber and a foreground fsck.

The difference between qcow2 and qed is that qed relies on the file size and qcow2 uses a bitmap.

The bitmap grows synchronously whereas in qed, we're not relying on synchronous file growth. If we did, there would be no need for an fsck.

If you attempt to grow the refcount table in qcow2 without doing a sync(), you risk corruption; you need an fsync to avoid it.

qcow2 doesn't have an advantage, it's just not trying to be as sophisticated as qed is.

The difference is between preallocation and leaking, on one hand, and uncommitted allocation and later rebuilds, on the other. It isn't a difference between formats, but between implementations.




