
Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format


From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 07 Sep 2010 11:12:15 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.11) Gecko/20100713 Lightning/1.0b1 Thunderbird/3.0.6

On 09/07/2010 09:51 AM, Avi Kivity wrote:
     /* if (features & QED_F_BACKING_FILE) */
     uint32_t backing_file_offset; /* in bytes from start of header */
     uint32_t backing_file_size;   /* in bytes */

It's really the filename size, not the file size. Also, make a note that it is not zero terminated.
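
A minimal sketch of consuming these two fields, given that the name is not NUL-terminated (the buffer handling here is illustrative, not the QEMU code):

 /* Copy the backing filename out of an in-memory header buffer and
  * NUL-terminate it, since the on-disk field is not zero terminated. */
 #include <stdint.h>
 #include <stdlib.h>
 #include <string.h>

 char *read_backing_filename(const uint8_t *header_buf,
                             uint32_t backing_file_offset,
                             uint32_t backing_file_size)
 {
     char *name = malloc(backing_file_size + 1);
     if (!name) {
         return NULL;
     }
     memcpy(name, header_buf + backing_file_offset, backing_file_size);
     name[backing_file_size] = '\0';   /* add the terminator the field lacks */
     return name;
 }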


     /* if (compat_features & QED_CF_BACKING_FORMAT) */
     uint32_t backing_fmt_offset;  /* in bytes from start of header */
     uint32_t backing_fmt_size;    /* in bytes */

Why not make it mandatory?

You mean, why not make it:

/* if (features & QED_F_BACKING_FILE) */

As opposed to an independent compat feature? Mandatory features mean that you cannot read an image if you don't understand the feature. In the context of backing_format, it means all of the possible values have to be fully defined.

IOW, what are valid values for backing_fmt? "raw" and "qed" are obvious, but what does it mean from a formal specification perspective to have "vmdk"? Is that VMDK v3 or v4? What if there's a v5?

If we make backing_fmt a suggestion, it gives us the flexibility to leave this loosely defined, and an implementation can fall back to probing if there's any doubt.

For the spec, I'd like to define "raw" and "qed". I'd like to modify the qemu implementation to refuse to load an image as raw unless backing_fmt is raw, but otherwise to just probe.

For image creation, if an explicit backing format isn't specified by the user, I'd like to insert backing_fmt=raw for probed raw images and otherwise not specify a backing_fmt.

Regards,

Anthony Liguori


 }

Need a checksum for the header.
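
One possible scheme, sketched with zlib's crc32() (the header size, the checksum field offset, and the function itself are assumptions for illustration, not part of the draft): compute the CRC over the header bytes with the checksum field zeroed, and verify it on open.

 /* Sketch: verify a hypothetical header checksum field. */
 #include <stdint.h>
 #include <string.h>
 #include <zlib.h>

 #define HDR_SIZE     4096   /* assumption: header occupies one 4 KB cluster */
 #define CSUM_OFFSET  60     /* assumption: offset of a uint32_t checksum field */

 static int header_csum_ok(const uint8_t *hdr)
 {
     uint8_t tmp[HDR_SIZE];
     uint32_t stored;

     memcpy(&stored, hdr + CSUM_OFFSET, sizeof(stored));
     memcpy(tmp, hdr, HDR_SIZE);
     memset(tmp + CSUM_OFFSET, 0, sizeof(uint32_t));   /* field excludes itself */

     return stored == (uint32_t)crc32(0, tmp, HDR_SIZE);
 }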


==Extent table==

 #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))

 Table {
     uint64_t offsets[TABLE_NOFFSETS];
 }

It's fashionable to put checksums here.

Do we want a real extent-based format like modern filesystems? So after defragmentation a full image has O(1) metadata?


The extent tables are organized as follows:

                    +----------+
                    | L1 table |
                    +----------+
               ,------'  |  '------.
          +----------+   |    +----------+
          | L2 table |  ...   | L2 table |
          +----------+        +----------+
      ,------'  |  '------.
 +----------+   |    +----------+
 |   Data   |  ...   |   Data   |
 +----------+        +----------+

The table_size field allows tables to be multiples of the cluster size. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
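
The arithmetic that falls out of the definitions above, for the cluster_size=64 KB, table_size=4 example (a worked calculation only, not spec text):

 /* Worked example: entries per table, maximum image size, and how a
  * virtual offset splits into L1 index, L2 index and intra-cluster offset. */
 #include <inttypes.h>
 #include <stdint.h>
 #include <stdio.h>

 int main(void)
 {
     uint64_t cluster_size = 64 * 1024;
     uint64_t table_size   = 4;
     uint64_t noffsets     = table_size * cluster_size / sizeof(uint64_t); /* 32768 */

     /* One L1 table of L2 tables can address this many bytes. */
     uint64_t max_size = noffsets * noffsets * cluster_size;               /* 64 TB */

     uint64_t pos           = 5ULL << 30;               /* example offset: 5 GB */
     uint64_t l1_index      = pos / (noffsets * cluster_size);
     uint64_t l2_index      = (pos / cluster_size) % noffsets;
     uint64_t intra_cluster = pos % cluster_size;

     printf("entries/table=%" PRIu64 " max=%" PRIu64 " L1=%" PRIu64
            " L2=%" PRIu64 " offset=%" PRIu64 "\n",
            noffsets, max_size, l1_index, l2_index, intra_cluster);
     return 0;
 }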

=Operations=

==Read==
# If L2 table is not present in L1, read from backing image.
# If data cluster is not present in L2, read from backing image.
# Otherwise read data from cluster.

If not in backing image, provide zeros
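
A rough sketch of that read path; every helper here is a hypothetical stand-in for illustration, not the QEMU API:

 #include <stdbool.h>
 #include <stdint.h>
 #include <string.h>

 /* Hypothetical helpers, assumed for illustration only. */
 bool l1_has_l2(uint64_t l1_index);
 bool l2_has_cluster(uint64_t l1_index, uint64_t l2_index);
 void read_cluster(uint64_t l1_index, uint64_t l2_index,
                   uint64_t intra_offset, void *buf, size_t len);
 bool backing_read(uint64_t pos, void *buf, size_t len); /* false past backing EOF */

 void qed_read_sketch(uint64_t pos, void *buf, size_t len,
                      uint64_t l1_index, uint64_t l2_index, uint64_t intra_offset)
 {
     if (!l1_has_l2(l1_index) || !l2_has_cluster(l1_index, l2_index)) {
         /* Unallocated cluster: defer to the backing image, or return
          * zeros when there is no backing image (or it is shorter). */
         if (!backing_read(pos, buf, len)) {
             memset(buf, 0, len);
         }
         return;
     }
     read_cluster(l1_index, l2_index, intra_offset, buf, len);
 }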


==Write==
# If L2 table is not present in L1, allocate new cluster and L2. Perform L2 and L1 link after writing data.
# If data cluster is not present in L2, allocate new cluster. Perform L2 link after writing data.
# Otherwise overwrite data cluster.

Detail copy-on-write from backing image.

On a partial write without a backing file, do we recommend zero-filling the cluster (to avoid intra-cluster fragmentation)?


The L2 link '''should''' be made after the data is in place on storage. However, when no ordering is enforced, the worst-case scenario is an L2 link to an unwritten cluster.

Or it may cause corruption if the physical file size is not committed, and L2 now points at a free cluster.


The L1 link '''must''' be made after the L2 cluster is in place on storage. If the order is reversed then the L1 table may point to a bogus L2 table. (Is this a problem since clusters are allocated at the end of the file?)
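
A sketch of an allocating write that honours both ordering rules (helpers are hypothetical; a real implementation would batch updates rather than flush per request):

 #include <stdint.h>

 /* Hypothetical helpers for illustration only. */
 uint64_t alloc_cluster(void);                         /* grows the file */
 void zero_cluster(uint64_t cluster);                  /* init a new L2 table */
 void write_data(uint64_t cluster, const void *buf);
 void flush(void);                                     /* e.g. fdatasync() */
 void l2_link(uint64_t l2, uint64_t l2_index, uint64_t data_cluster);
 void l1_link(uint64_t l1_index, uint64_t l2);

 void allocating_write_sketch(uint64_t l1_index, uint64_t l2_index, const void *buf)
 {
     uint64_t data = alloc_cluster();   /* new data cluster at end of file */
     uint64_t l2   = alloc_cluster();   /* new L2 table cluster */

     zero_cluster(l2);
     write_data(data, buf);
     flush();                       /* data and the empty L2 table are stable */
     l2_link(l2, l2_index, data);   /* L2 link only after the data is in place */
     flush();                       /* L2 table is stable */
     l1_link(l1_index, l2);         /* L1 link last, so it never points at a bogus L2 */
 }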

==Grow==
# If table_size * TABLE_NOFFSETS < new_image_size, fail -EOVERFLOW. The L1 table is not big enough.

With a variable-height tree, we allocate a new root, link its first entry to the old root, and write the new header with updated root and height.

# Write new image_size header field.
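
A sketch of the grow check, assuming the intent is that new_image_size must fit within what one L1 table of L2 tables can address (TABLE_NOFFSETS * TABLE_NOFFSETS * cluster_size bytes); the exact form of the limit is my reading, not spec text:

 #include <errno.h>
 #include <stdint.h>

 int grow_check_sketch(uint64_t cluster_size, uint64_t table_size,
                       uint64_t new_image_size)
 {
     uint64_t noffsets = table_size * cluster_size / sizeof(uint64_t);
     uint64_t max_size = noffsets * noffsets * cluster_size;

     if (new_image_size > max_size) {
         return -EOVERFLOW;   /* the L1 table is not big enough */
     }
     /* ...then write the new image_size header field and flush it... */
     return 0;
 }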

=Data integrity=
==Write==
Writes that complete before a flush must be stable when the flush completes.

If storage is interrupted (e.g. power outage) then writes in progress may be lost, stable, or partially completed. The storage must not be otherwise corrupted or inaccessible after it is restarted.

We can remove this requirement by copying-on-write any metadata write, and keeping two copies of the header (with version numbers and checksums). Enterprise storage will not corrupt on writes, but commodity storage may.
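
A sketch of the two-copy header idea (struct layout, field names and selection policy are assumptions, not part of the draft): each copy carries a version number and a checksum, and open picks the newest copy that verifies.

 #include <stddef.h>
 #include <stdint.h>

 struct hdr_copy {
     uint64_t version;    /* bumped on every header update */
     uint32_t checksum;   /* over the copy with this field zeroed */
     /* ... remaining header fields ... */
 };

 int csum_ok(const struct hdr_copy *h);   /* hypothetical verifier */

 const struct hdr_copy *pick_header(const struct hdr_copy *a,
                                    const struct hdr_copy *b)
 {
     int a_ok = csum_ok(a), b_ok = csum_ok(b);

     if (a_ok && b_ok) {
         return a->version >= b->version ? a : b;   /* newest valid copy wins */
     }
     if (a_ok) {
         return a;
     }
     return b_ok ? b : NULL;   /* NULL: both copies are corrupt */
 }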




