Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From:	Anthony Liguori
Subject:	Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date:	Tue, 07 Sep 2010 17:27:55 -0500
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.11) Gecko/20100713 Lightning/1.0b1 Thunderbird/3.0.6

On 09/07/2010 11:25 AM, Anthony Liguori wrote:

On 09/07/2010 11:09 AM, Avi Kivity wrote:
 On 09/07/2010 06:40 PM, Anthony Liguori wrote:
Need a checksum for the header.
Is that not a bit overkill for what we're doing?  What's the benefit?
Make sure we're not looking at a header write interrupted by a crash.
Couldn't hurt I guess. I don't think it's actually needed for L1/L2 tables FWIW.
The L2 link '''should''' be made after the data is in place on storage. However, when no ordering is enforced the worst case scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not committed, and L2 now points at a free cluster.
An fsync() will make sure the physical file size is committed. The metadata does not carry any additional integrity guarantees over the actual disk data except that in order to avoid internal corruption, we have to order the L2 and L1 writes.
I was referring to "when no ordering is enforced, the worst case scenario is an L2 link to an unwritten cluster". This isn't true - worst case you point to an unallocated cluster which can then be claimed by data or metadata.
Right, it's necessary to do an fsync to protect against this. To make this user friendly, we could have a dirty bit in the header which gets set on first metadata write and then cleared on clean shutdown.
Upon startup, if the dirty bit is set, we do an fsck.
We can remove this requirement by copying-on-write any metadata write, and keeping two copies of the header (with version numbers and checksums).
QED has a property today that all metadata or cluster locations have a single location on the disk format that is immutable. Defrag would relax this but defrag can be slow.
Having an immutable on-disk location is a powerful property which eliminates a lot of complexity with respect to reference counting and dealing with free lists.
However, it exposes the format to "writes may corrupt overwritten data".
No, you never write an L2 entry once it's been set. If an L2 entry isn't set, the contents of the cluster is all zeros.
If you write data to allocate an L2 entry, until you do a flush(), the data can either be what was written or all zeros.
For the initial design I would avoid introducing something like this. One of the nice things about features is that we can introduce multi-level trees as a future feature if we really think it's the right thing to do.
But we should start at a simple design with high confidence and high performance, and then introduce features with the burden that we're absolutely sure that we don't regress integrity or performance.
For most things, yes. Metadata checksums should be designed in though (since we need to double the pointer size).
Variable height trees have the nice property that you don't need multi cluster allocation. It's nice to avoid large L2s for very large disks.
FWIW, L2s are 256K at the moment and with a two level table, it can support 5PB of data.

I clearly suck at basic math today. The image supports 64TB today. Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
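
The corrected figures can be checked with a quick calculation. This is only a sketch: it assumes 64 KiB clusters and 8-byte table entries, neither of which is stated in this message, and the function name is made up for illustration.

```python
KIB = 1024
TIB = 1024 ** 4


def max_image_size(table_size, cluster_size=64 * KIB, entry_size=8):
    """Bytes addressable by a two-level (L1 -> L2) table layout.

    Assumes L1 and L2 tables are the same size, and that every entry
    is `entry_size` bytes (an assumption, not taken from the post).
    """
    entries = table_size // entry_size
    # Each L1 entry points at one L2 table; each L2 entry maps one cluster.
    return entries * entries * cluster_size


for table_kib in (256, 128, 64):
    size = max_image_size(table_kib * KIB)
    print(f"{table_kib}K tables -> {size // TIB} TB")
# prints:
# 256K tables -> 64 TB
# 128K tables -> 16 TB
# 64K tables -> 4 TB
```

Under those assumptions, halving the table size halves the entry count at both levels, so the maximum image size drops by a factor of four each time.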

BTW, I don't think your checksumming idea is sound. If you store a 64-bit checksum alongside each pointer, it becomes necessary to update the parent pointer every time the table changes. This introduces an ordering requirement which means you need to sync() the file every time you update an L2 entry.

Today, we only need to sync() when we first allocate an L2 entry (because their locations never change). From a performance perspective, it's the difference between an fsync() every 64k vs. every 2GB.
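
To put rough numbers on the 64k vs. 2GB comparison, here is a small sketch. It reads the 2GB figure as the amount of data mapped by one 256 KiB L2 table, and it assumes 64 KiB clusters and 8-byte entries; those sizes and the helper name are assumptions for illustration, not taken from the QED code.

```python
KIB = 1024
GIB = 1024 ** 3

CLUSTER = 64 * KIB                          # assumed cluster size
L2_COVERAGE = (256 * KIB // 8) * CLUSTER    # data mapped by one L2 table


def syncs_needed(bytes_written, sync_interval):
    """Syncs for a sequential allocating write, one per sync_interval bytes."""
    return -(-bytes_written // sync_interval)  # ceiling division


# Sequentially filling 10 GiB of a fresh image:
per_cluster = syncs_needed(10 * GIB, CLUSTER)      # sync on every L2 entry update
per_table = syncs_needed(10 * GIB, L2_COVERAGE)    # sync once per new L2 table

print(L2_COVERAGE // GIB, per_cluster, per_table)
# prints: 2 163840 5
```

This ignores batching and any syncs the guest requests itself, so it only illustrates the ratio between the two schemes.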

Plus, doesn't btrfs do block level checksumming? IOW, if you run a workload where you care about this level of data integrity validation, if you did btrfs + qed, you would be fine.

Since the majority of file systems don't do metadata checksumming, it's not obvious to me that we should be. I think one of the critical flaws in qcow2 was trying to invent a better filesystem within qemu instead of just sticking to a very simple and obviously correct format and letting the FS folks do the really fancy stuff.


Regards,

Anthony Liguori
