Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From:	Avi Kivity
Subject:	Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date:	Mon, 13 Sep 2010 11:24:58 +0200
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Lightning/1.0b3pre Thunderbird/3.1.3

 On 09/12/2010 10:18 PM, Anthony Liguori wrote:

But since you have to boot before you can run any serious test, ifit takes 5 seconds to do an fsck(), it's highly likely that it'snot even noticeable.
What if it takes 300 seconds?
That means for a 1TB disk you're taking 500ms per L2 entry, you'refully allocated and yet still doing an fsck. That seems awfullyunlikely.
I meant for a fully populated L1.  That's 10ms per L2.
Your math is off. A single L2 entry covers 2GB worth of logicalspace. That means a 1TB image consists of 512 L2s. 300 / 512 == .585which is 585ms.
That's a fully populated L1 on a 1TB image.


I meant fully populated L1, all 32K entries.  Not the 1TB disk.

But since that's 64TB, that's unlikely too. It can still take 10sfor a 2TB disk.
Ah, you're talking about a 64TB image. Recall that we can read L2s inparallel. I have trouble imaging that we'd get serialized performancewith a 64TB backing store. It's much more likely you've got more thanone spindle in this scenario.

When you have more than one spindle, you also have more than one guest.If they're all fscking at the same time, it actually gets slower.

Why is in n^2? It's still n*m. If your image is 4TB instead of100GB, the time increases by a factor of 40 for both.
It's n*m but either n ~= m in which case it's n^2 or m << n, inwhich case, it's just n, or m >> n in which case, it's just O(m).
This is where asymptotic complexity ends up not being terriblyhelpful :-)
Let me put this another way though, if you support internalsnapshots, what's a reasonable number of snapshots to expectreasonable performance with? 10? 100? 1000? 10000?
I'd say 10. Not that I really want to support internal snapshots, itdoesn't work well with multiple disks.
I don't think that's reasonable. The folks that I've talked to aboutsnapshots seem to want to do crazy things like use it forcheckpointing. TBH, I think they're looking for the ability to dothousands of checkpoints with an efficient way to release oldcheckpoints.
I imagine that's the design point things like btrfs are trying toachieve.

Well, qcow2 isn't going to get anywhere close to that. For that manyyou basically have to have extents and back references, or your metadataexplodes.

On the other hand, linear L2 (which now become L1) means your fsck isjust a linear scan of the table, which is probably faster than qcow2allocation...
And this is just a data layout optimization which is the sort of thingthat we should let performance data drive.


Well, if you drop L1, the code becomes simpler.

For a 1PB disk image with qcow2, the reference count table is128GB. For a 1TB image, the reference count table is 128MB. For a128GB image, the reference table is 16MB which is why we get awaywith it today.
Anytime you grow the freelist with qcow2, you have to write a brandnew freelist table and update the metadata synchronously to point toa new version of it. That means for a 1TB image, you're potentiallywriting out 128MB of data just to allocate a new cluster.
s/freelist/refcount table/ to translate to current qcow2nomenclature. This is certainly not fast. You can add a bunch offree blocks each time you mitigate the growth but I can't of manycircumstances where a 128MB write isn't going to be noticeable. Andit only gets worse as time moves on because 1TB disk images arealready in use today.
That's a strong point. qcow2 doubles on each allocation, itamortizes, but the delay is certainly going to be noticable.
You can do it ahead of time (so guest writes don't need to wait) butit's still expensive.
The trouble is, safe growth of the reference count table is hardbecause it's contiguous. That means you need to copy the table toanother location all at once instead of just creating a new L1 tableand reusing most of the existing L2 entries.

You can use the same technique as streaming. You initiate a copy in thebackground. Updates to the region already copied go to both copies,updates to the region not yet copied go to the original copy.

It's a damning flaw in the format for large images. You canpreallocate the whole thing up front to try to avoid the cost at runtime but even then, that's a huge cost to pay in disk space up front.

I'm more or less convinced it qcow2 is unusable for large images withoutincompatible format changes.

It's very easy to neglect the details in something like qcow2.We've been talking like the refcount table is basically free to readand write but it's absolutely not. With large disk images, you'recaching an awful lot of metadata to read the refcount table in fully.
If you reduce the reference count table to exactly two bits, you canstore that within the L1/L2 metadata since we have an extra 12 bitsworth of storage space. Since you need the L1/L2 metadata anyway,we might as well just use that space as the authoritative source ofthe free list information.
The only difference between qcow2 and qed is that since we use anon-demand table for L1/L2, our free list may be non-contiguous.Since we store virtual -> physical instead of physical->virtual, youhave to do a full transversal with QED whereas with qcow2 you mayget lucky. However, the fact that the reference count table iscontiguous in qcow2 is a design flaw IMHO because it makes growthextremely painful with large images to the point where I'll claimthat qcow2 is probably unusable by design with > 1TB disk images.
If you grow it in the background, it should be usable; since ithappens once every 1TB worth of writes, it's not such a huge load.I'll agree this is increasing complexity.
Trouble is, the reference count table is your authoritative source ofwhether something is free. You can grow it in the background but ifyou need to allocate clusters before your done growing, you have tostall the request. Those stalls can get very long on large disk images.


No, you update both copies.  It's hardly fun though.

We can optimize qed by having a contiguous freelist mappingphysical->virtual (that's just a bitmap, and therefore considerablysmaller) but making the freelist not authoritative. That makes itmuch faster because we don't add another sync and let's us fallbackto the L1/L2 table for authoritative information if we had anunclean shutdown.
It's a good compromise for performance and it validates the qedphilosophy. By starting with a correct and performant approach thatscales to large disk images, we can add features (like unmap)without sacrificing either.
How would you implement the bitmap as a compatible feature?
First, a refresher on the consistency model.
Metadata is not sync'd to disk when initially written because we tryto avoid having stronger metadata consistency than data consistency.We are forced to sync to enforce ordering when updating L1 with a newL2 but that's rare. If a guest attempts to sync data, we syncmetadata too.
Because we may have unsync'd metadata, all L1 and L2 entries need tobe scanned to try to find unreachable entries based on the start upfile size *before* allocating any new clusters upon start-up (notingthat we can read and rewrite existing clusters).
This is a correct design and performant this is where we start from.
The start up cost is undesirable, so to reduce the need to scan allentries up front, we add a feature that introduces a meta-data cleanflag in the header. The meta-data clean flag tells an implementationthat there were no cluster allocations since the last sync() whichmeans cluster allocation can now happen without searching for andcorrecting unreachable entries.

I don't think you can add it as a compatible feature. Implementationswithout the feature will load a clean image and dirty it withoutclearing the flag. So it has to be in from day 1.

An implementation can now set the meta-data clean flag right beforeany sync() operation

What happens if the clean flag reaches the disk before a metadata write(or truncate) makes it? That's a false-clean which can lead us to avoida scan which would be needed.

and unset the flag before the first cluster allocation (being carefulto sync the unsetting of the flag). This eliminates a lot ofunnecessary scans in the case of safe shut down.

Why do you need to sync clearing the flag? Worst case it is lost andyou have a false-dirty, leading to an unnecessary scan.

We can also set a timer after unsetting the meta-data clean flag tosync() after, say, 5 minutes. This further decreases the number oftimes we have to scan so that the fsck() window only happens whenpower failure occurs within 5 minutes of the last cluster allocation.
N.B., this flag is purely an optional feature that is backwardscompatible. If an implementation ignores the meta-data clean flag, itjust always scans for unreachable entries at startup.


No, you can get a false-clean by an old qemu crashing.

I think this sort of stuff can be done using a mount count. Every mountincrements the mount count. "clean" sets a clean counter equal to themount count. "dirty" sets the clean counter to 0. You could do thisfor any number of recoverable metadata structures (like a dirty bitmap).

This is still a correct design and we've eliminated the vast majorityof start up scans.
The freelist would also be an optional feature. If the freelistpointer is non-zero in the header, it means that an implementation canfind free blocks by reading the location of the freelist pointer inthe header.

The same issue - old qemu mounts, updates disk but ignores freelist, newqemu mounts and uses stale freelist.

A mount count fixes this (an alternative to a mount count is a new flagsfield with 1 bit per optional feature, any unrecognized compatiblefeature is cleared in the flags field, letting a new implementationdetect that an old qemu saw and ignored the feature).

We maintain the freelist in memory and we write it out right beforeany sync() operation. Adding to the freelist in memory (i.e. UNMAP)would clear the freelist pointer on disk. We could use this as anopportunity to schedule a future sync() just like as above.
We would actually write free list changes to disk while the freelistpointer was set to 0 such that when we did a sync(), we were justmarking the freelist valid again.
An implementation that doesn't know about free list has to do a fullscan to determine free blocks. That said, an implementation can alsojust ignore free blocks entirely and always allocate from the end.
The freelist would be a cluster that also happens to be an free'dentry. It would then contain a list of free clusters and would looksomething like:
struct freelist
{
     uint64_t next_cluster; /* zero to represent EOL */
     uint64_t num_entries;
     uint64_t entry[num_entries]; /* cluster offsets of free blocks */
};

A bitmap would work better vs fragmentation - the worst case for thisstructure is pretty bad. For 64k clusters, each 1TB needs just 2MB ofspace, which is easy to scan and even reasonable to hold in memory.

It's important to not represent the full list in a contiguous regionfor the reasons discussed in other notes re: refcount table.


Background copying can work around the issue, but yes.

This remains a completely correct implementation. We add noadditional syncs above what we already have and we write out thefreelist changes on demand. The only window that would require arebuild of the freelist would be a hard power failure 5 minutes sincethe last unmap.

After the last allocation; unmap only results in a leak (which can berecovered in the background).

One gotcha is that the freelist is double metadata which means thatfreeing a block requires an ordered write with respect to the L2 tableupdate. Otherwise, you could allocate something off of the freelistthat an L2 entry still pointed to.
So there's a sync() in the UNMAP path, but that's unavoidable withouta lot of cleverness. You could potentially delay the sync until youreallocate the block but it's not clear that UNMAP needs to be fast(yet).

I don't see why you need the sync. If the freelist update is lost,you've leaked a block which you can recover with a rebuild. If the L2update is lost, you have the freelist pointing into a used cluster, butyou can recover using fsck. If both are lost, nothing happened.


--
error compiling committee.c: too many arguments to function

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format, (continued)

Prev by Date: Re: [Qemu-devel] virtio block device and sysfs
Next by Date: [Qemu-devel] Re: [PATCH] trace: Fix user emulator dependency on trace objects
Previous by thread: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Next by thread: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Index(es):
- Date
- Thread