From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Wed, 8 Sep 2010 16:07:46 +0100

On Tue, Sep 7, 2010 at 3:51 PM, Avi Kivity <address@hidden> wrote:
>  On 09/06/2010 04:06 PM, Anthony Liguori wrote:
>>
>> Another point worth mentioning is that our intention is to have a formal
>> specification of the format before merging.  A start of that is located at
>> http://wiki.qemu.org/Features/QED
>>
>
>> =Specification=
>>
>> The file format looks like this:
>>
>>  +---------+---------+---------+-----+
>>  | extent0 | extent1 | extent2 | ... |
>>  +---------+---------+---------+-----+
>>
>> The first extent contains a header.  The header contains information about
>> the first data extent.  A data extent may be a data cluster, an L2, or an L1
>> table.  L1 and L2 tables are composed of one or more contiguous extents.
>>
>> ==Header==
>>  Header {
>>     uint32_t magic;               /* QED\0 */
>
> Endianness?

Little-endian for all metadata.  Updated on wiki page.
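For anyone following along, here is a minimal sketch (mine, not code from
the patch) of decoding little-endian metadata fields portably, regardless
of host byte order:

  #include <stdint.h>

  /* Decode little-endian integers from a raw metadata buffer,
   * independent of the host CPU's byte order. */
  static uint32_t get_le32(const uint8_t *p)
  {
      return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
             ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
  }

  static uint64_t get_le64(const uint8_t *p)
  {
      return (uint64_t)get_le32(p) | ((uint64_t)get_le32(p + 4) << 32);
  }

  /* Example: the magic occupies the first 4 bytes of the header. */
  static int has_qed_magic(const uint8_t *header)
  {
      return get_le32(header) == 0x00444551u;   /* 'Q' 'E' 'D' '\0' */
  }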

>>
>>     uint32_t cluster_size;        /* in bytes */
>
> Does cluster == extent?  If so, use the same terminology.  If not, explain.
>
> Usually extent is a variable size structure.

QED does not use extents.  It uses fixed-size clusters, 64 KB by
default but configurable at image creation time.  The wiki page has
been fleshed out to describe the cluster-based layout in more detail.

>>     uint32_t table_size;          /* table size, in clusters */
>
> Presumably L1 table size?  Or any table size?
>
> Hm.  It would be nicer not to require contiguous sectors anywhere.  How
> about a variable- or fixed-height tree?

Neither extents nor fancier trees fit the philosophy, which is to
keep things straightforward and fast by doing less.  With extents and
trees you end up with something that looks much more like a full-blown
filesystem.  Is there an essential feature or characteristic that QED
cannot provide in its current design?
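To make the two-level lookup concrete, here is a rough sketch (my own
illustration) of how a logical offset splits into an L1 index, an L2
index and an offset within a cluster, using the TABLE_NOFFSETS
definition quoted further down.  With cluster_size=64 KB and
table_size=4, each table holds 4 * 65536 / 8 = 32768 offsets, enough to
address 32768 * 32768 * 64 KB = 64 TB of logical space.

  #include <stdint.h>

  struct qed_pos {
      uint64_t l1_index;      /* index into the L1 table */
      uint64_t l2_index;      /* index into the selected L2 table */
      uint64_t cluster_off;   /* byte offset within the data cluster */
  };

  /* noffsets = table_size * cluster_size / sizeof(uint64_t), per the spec */
  static struct qed_pos qed_split(uint64_t offset, uint32_t cluster_size,
                                  uint32_t table_size)
  {
      uint64_t noffsets = (uint64_t)table_size * cluster_size / sizeof(uint64_t);
      struct qed_pos pos;

      pos.cluster_off = offset % cluster_size;
      pos.l2_index    = (offset / cluster_size) % noffsets;
      pos.l1_index    = (offset / cluster_size) / noffsets;
      return pos;
  }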

>>     uint32_t first_cluster;       /* in clusters */
>
> First cluster of what?
>
>>
>>     uint64_t features;            /* format feature bits */
>>     uint64_t compat_features;     /* compat feature bits */
>>     uint64_t l1_table_offset;     /* L1 table offset, in clusters */
>>     uint64_t image_size;          /* total image size, in clusters */
>
> Logical, yes?

Yes.  Wiki updated.

> Is the physical image size always derived from the host file metadata?  Is
> this always safe?

My email summarizing crash scenarios and recovery covers the bases,
and I think it is safe to rely on the host file size as the physical
image size.  The drawback is that you need a host filesystem and
cannot use a bare block device directly.  I think that is acceptable
for a sparse format; otherwise we'd be using raw.
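As an illustration of relying on host file metadata (a sketch, not the
driver code), the physical size comes straight from fstat(), and since
clusters are only ever allocated at the end of the file, the next free
cluster is just that size rounded up to a cluster boundary:

  #include <stdint.h>
  #include <sys/stat.h>

  /* Sketch: QED keeps no allocation bitmap, so the physical image size
   * is taken from the host file itself. */
  static int64_t qed_physical_size(int fd)
  {
      struct stat st;

      if (fstat(fd, &st) < 0) {
          return -1;
      }
      return st.st_size;
  }

  static int64_t qed_next_free_cluster(int fd, uint32_t cluster_size)
  {
      int64_t size = qed_physical_size(fd);

      if (size < 0) {
          return -1;
      }
      /* New clusters are appended at the cluster-aligned end of file. */
      return (size + cluster_size - 1) / cluster_size * cluster_size;
  }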

>
>>     /* if (features & QED_F_BACKING_FILE) */
>>     uint32_t backing_file_offset; /* in bytes from start of header */
>>     uint32_t backing_file_size;   /* in bytes */
>
> It's really the filename size, not the file size.  Also, make a note that it
> is not zero terminated.

Fixed both on wiki.
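Since backing_file_size is the length of the name and the string is not
zero-terminated, a reader has to copy it out and terminate it itself.
A hedged sketch, assuming the header starts at byte 0 of the image file
so header-relative offsets are also file offsets:

  #include <stdint.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Sketch: read the backing filename given backing_file_offset (bytes
   * from the start of the header) and backing_file_size (name length in
   * bytes, no trailing NUL).  Returns a malloc'd NUL-terminated string
   * or NULL on error. */
  static char *qed_read_backing_file(int fd, uint32_t offset, uint32_t size)
  {
      char *name = malloc((size_t)size + 1);

      if (!name) {
          return NULL;
      }
      if (pread(fd, name, size, offset) != (ssize_t)size) {
          free(name);
          return NULL;
      }
      name[size] = '\0';
      return name;
  }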

>
>>
>>     /* if (compat_features & QED_CF_BACKING_FORMAT) */
>>     uint32_t backing_fmt_offset;  /* in bytes from start of header */
>>     uint32_t backing_fmt_size;    /* in bytes */
>
> Why not make it mandatory?
>
>
>>  }
>
> Need a checksum for the header.
>
>>
>> ==Extent table==
>>
>>  #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
>>
>>  Table {
>>     uint64_t offsets[TABLE_NOFFSETS];
>>  }
>
> It's fashionable to put checksums here.
>
> Do we want a real extent-based format like modern filesystems?  So after
> defragmentation a full image has O(1) metadata?
>
>>
>> The extent tables are organized as follows:
>>
>>                    +----------+
>>                    | L1 table |
>>                    +----------+
>>               ,------'  |  '------.
>>          +----------+   |    +----------+
>>          | L2 table |  ...   | L2 table |
>>          +----------+        +----------+
>>      ,------'  |  '------.
>>  +----------+   |    +----------+
>>  |   Data   |  ...   |   Data   |
>>  +----------+        +----------+
>>
>> The table_size field allows tables to be multiples of the cluster size.
>>  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
>>
>> =Operations=
>>
>> ==Read==
>> # If L2 table is not present in L1, read from backing image.
>> # If data cluster is not present in L2, read from backing image.
>> # Otherwise read data from cluster.
>
> If not in backing image, provide zeros

Wiki updated.
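Putting the read rules together, including the provide-zeros case noted
above, here is a rough sketch of the lookup; the qed_load_l2(),
qed_read_cluster() and qed_read_backing() helpers are hypothetical names
for illustration, and the request is assumed not to cross a cluster
boundary:

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical helpers, for illustration only. */
  uint64_t *qed_load_l2(uint64_t l2_offset);
  int qed_read_cluster(uint64_t cluster, uint64_t off, void *buf, size_t len);
  int qed_read_backing(uint64_t offset, void *buf, size_t len);

  int qed_read(uint64_t *l1_table, uint64_t noffsets, uint32_t cluster_size,
               int has_backing, uint64_t offset, void *buf, size_t len)
  {
      uint64_t l1_index = offset / cluster_size / noffsets;
      uint64_t l2_index = (offset / cluster_size) % noffsets;
      uint64_t data_cluster = 0;

      if (l1_table[l1_index] != 0) {              /* L2 table present in L1? */
          uint64_t *l2_table = qed_load_l2(l1_table[l1_index]);
          data_cluster = l2_table[l2_index];      /* data cluster present? */
      }

      if (data_cluster != 0) {
          return qed_read_cluster(data_cluster, offset % cluster_size, buf, len);
      }
      if (has_backing) {
          return qed_read_backing(offset, buf, len);
      }
      memset(buf, 0, len);                        /* no backing image: zeros */
      return 0;
  }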

>
>>
>> ==Write==
>> # If L2 table is not present in L1, allocate new cluster and L2.  Perform
>> L2 and L1 link after writing data.
>> # If data cluster is not present in L2, allocate new cluster.  Perform L2
>> link after writing data.
>> # Otherwise overwrite data cluster.
>
> Detail copy-on-write from backing image.
>
> On a partial write without a backing file, do we recommend zero-filling the
> cluster (to avoid intra-cluster fragmentation)?

Currently zeroes are written, and with a 64 KB cluster size that
hopefully isn't too painful.
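Tying that to the allocation steps above and the ordering rules below,
here is a rough sketch of an allocating write; every helper name is made
up for illustration, and the flush simply reflects the "L2 link after the
data is in place" rule rather than anything mandated by the patch:

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical helpers, for illustration only. */
  int64_t qed_alloc_cluster(void);                       /* append new cluster */
  int qed_fill_cow(int64_t cluster, uint64_t virt_base); /* backing data/zeroes */
  int qed_write_cluster(int64_t cluster, uint64_t off, const void *buf,
                        size_t len);
  int qed_flush(void);
  int qed_link_l2(int64_t l2_cluster, uint64_t l2_index, int64_t data_cluster);

  /* Allocating write: the data cluster is missing from L2.  Fill the rest
   * of the new cluster from the backing file or with zeroes, write the
   * guest data, then link the cluster into the L2 table. */
  int qed_allocating_write(int64_t l2_cluster, uint64_t l2_index,
                           uint64_t offset, uint32_t cluster_size,
                           const void *buf, size_t len)
  {
      int64_t data_cluster = qed_alloc_cluster();

      if (data_cluster < 0) {
          return -1;
      }
      qed_fill_cow(data_cluster, offset - offset % cluster_size);
      qed_write_cluster(data_cluster, offset % cluster_size, buf, len);
      qed_flush();                        /* data in place before the L2 link */
      return qed_link_l2(l2_cluster, l2_index, data_cluster);
  }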

>
>>
>> The L2 link '''should''' be made after the data is in place on storage.
>>  However, when no ordering is enforced the worst case scenario is an L2 link
>> to an unwritten cluster.
>
> Or it may cause corruption if the physical file size is not committed, and
> L2 now points at a free cluster.
>
>>
>> The L1 link '''must''' be made after the L2 cluster is in place on
>> storage.  If the order is reversed then the L1 table may point to a bogus L2
>> table.  (Is this a problem since clusters are allocated at the end of the
>> file?)
>>
>> ==Grow==
>> # If table_size * TABLE_NOFFSETS < new_image_size, fail -EOVERFLOW.  The
>> L1 table is not big enough.
>
> With a variable-height tree, we allocate a new root, link its first entry to
> the old root, and write the new header with updated root and height.
>
>> # Write new image_size header field.
>>
>> =Data integrity=
>> ==Write==
>> Writes that complete before a flush must be stable when the flush
>> completes.
>>
>> If storage is interrupted (e.g. power outage) then writes in progress may
>> be lost, stable, or partially completed.  The storage must not be otherwise
>> corrupted or inaccessible after it is restarted.
>
> We can remove this requirement by copying-on-write any metadata write, and
> keeping two copies of the header (with version numbers and checksums).
>  Enterprise storage will not corrupt on writes, but commodity storage may.
>
> --
> error compiling committee.c: too many arguments to function
>
>
>


