[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
From: |
Stefan Hajnoczi |
Subject: |
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification |
Date: |
Mon, 11 Oct 2010 16:30:40 +0100 |
User-agent: |
Mutt/1.5.20 (2009-06-14) |
On Mon, Oct 11, 2010 at 03:58:07PM +0200, Kevin Wolf wrote:
> Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> > Signed-off-by: Stefan Hajnoczi <address@hidden>
> > ---
> > docs/specs/qed_spec.txt | 94
> > +++++++++++++++++++++++++++++++++++++++++++++++
> > 1 files changed, 94 insertions(+), 0 deletions(-)
> > create mode 100644 docs/specs/qed_spec.txt
> >
> > diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
> > new file mode 100644
> > index 0000000..c942b8e
> > --- /dev/null
> > +++ b/docs/specs/qed_spec.txt
> > @@ -0,0 +1,94 @@
> > +=Specification=
> > +
> > +The file format looks like this:
> > +
> > + +----------+----------+----------+-----+
> > + | cluster0 | cluster1 | cluster2 | ... |
> > + +----------+----------+----------+-----+
> > +
> > +The first cluster begins with the '''header'''. The header contains
> > information about where regular clusters start; this allows the header to
> > be extensible and store extra information about the image file. A regular
> > cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1
> > and L2 tables are composed of one or more contiguous clusters.
> > +
> > +Normally the file size will be a multiple of the cluster size. If the
> > file size is not a multiple, extra information after the last cluster may
> > not be preserved if data is written. Legitimate extra information should
> > use space between the header and the first regular cluster.
> > +
> > +All fields are little-endian.
> > +
> > +==Header==
> > + Header {
> > + uint32_t magic; /* QED\0 */
> > +
> > + uint32_t cluster_size; /* in bytes */
> > + uint32_t table_size; /* for L1 and L2 tables, in clusters */
> > + uint32_t header_size; /* in clusters */
> > +
> > + uint64_t features; /* format feature bits */
> > + uint64_t compat_features; /* compat feature bits */
> > + uint64_t l1_table_offset; /* in bytes */
> > + uint64_t image_size; /* total logical image size, in bytes */
> > +
> > + /* if (features & QED_F_BACKING_FILE) */
> > + uint32_t backing_filename_offset; /* in bytes from start of header */
> > + uint32_t backing_filename_size; /* in bytes */
> > +
> > + /* if (compat_features & QED_CF_BACKING_FORMAT) */
> > + uint32_t backing_fmt_offset; /* in bytes from start of header */
> > + uint32_t backing_fmt_size; /* in bytes */
>
> It was discussed before, but I don't think we came to a conclusion. Are
> there any circumstances under which you don't want to set the
> QED_CF_BACKING_FORMAT flag?
I suggest the following:
QED_CF_BACKING_FORMAT_RAW = 0x1
When set, the backing file is a raw image and should not be probed for
its file format. The default (unset) means that the backing image file
format may be probed.
Now the backing_fmt_{offset,size} are no longer necessary.
>
> Also it's unclear what this "if" actually means: If the flag isn't set,
> are the fields zero, are they undefined or are they even completely
> missing and the offsets of the following fields must be adjusted?
I have updated the wiki:
"Fields predicated on a feature bit are only used when that feature is
set. The fields always take up header space, regardless of whether or
not the feature bit is set."
>
> > + }
> > +
> > +Field descriptions:
> > +* cluster_size must be a power of 2 in range [2^12, 2^26].
> > +* table_size must be a power of 2 in range [1, 16].
>
> Is there a reason why this must be a power of two?
The power of two makes logical-to-cluster offset translation easy and
cheap:
l2_table = get_l2_table(l1_table[(logical >> l2_shift) & l2_mask])
cluster = l2_table[logical >> l1_shift] + (logical & cluster_mask)
>
> > +* header_size is the number of clusters used by the header and any
> > additional information stored before regular clusters.
> > +* features and compat_features are bitmaps where active file format
> > features can be selectively enabled. The difference between the two is
> > that an image file that uses unknown compat_features bits can be safely
> > opened without knowing how to interpret those bits. If an image file has
> > an unsupported features bit set then it is not possible to open that image
> > (the image is not backwards-compatible).
> > +* l1_table_offset must be a multiple of cluster_size.
>
> And it is the offset of the first byte of the L1 table in the image file.
Updated, thanks.
>
> > +* image_size is the block device size seen by the guest and must be a
> > multiple of cluster_size.
>
> So there are image sizes that can't be accurately represented in QED? I
> think that's a bad idea. Even more so because I can't see how it greatly
> simplifies implementation (you save the operation for rounding up on
> open/create, that's it) - it looks like a completely arbitrary restriction.
Good point. I will try to lift this restriction in v3.
>
> > +* backing_filename and backing_fmt are both strings in (byte offset, byte
> > size) form. They are not NUL-terminated and do not have alignment
> > constraints.
>
> A description of the meaning of these strings is missing.
Update:
"The backing filename string is given in the
backing_filename_{offset,size} fields and may be an absolute path or
relative to the image file."
>
> > +
> > +Feature bits:
> > +* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
> > +* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
> > +* QED_CF_BACKING_FORMAT = 0x01. The image has a specific backing file
> > format stored.
>
> I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
> twice is confusing at first sight.
Updated, thanks.
>
> > +
> > +==Tables==
> > +
> > +Tables provide the translation from logical offsets in the block device to
> > cluster offsets in the file.
> > +
> > + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
> > +
> > + Table {
> > + uint64_t offsets[TABLE_NOFFSETS];
> > + }
> > +
> > +The tables are organized as follows:
> > +
> > + +----------+
> > + | L1 table |
> > + +----------+
> > + ,------' | '------.
> > + +----------+ | +----------+
> > + | L2 table | ... | L2 table |
> > + +----------+ +----------+
> > + ,------' | '------.
> > + +----------+ | +----------+
> > + | Data | ... | Data |
> > + +----------+ +----------+
> > +
> > +A table is made up of one or more contiguous clusters. The table_size
> > header field determines table size for an image file. For example,
> > cluster_size=64 KB and table_size=4 results in 256 KB tables.
> > +
> > +The logical image size must be less than or equal to the maximum possible
> > size of clusters rooted by the L1 table:
> > + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
> > +
> > +Logical offsets are translated into cluster offsets as follows:
> > +
> > + table_bits table_bits cluster_bits
> > + <--------> <--------> <--------------->
> > + +----------+----------+-----------------+
> > + | L1 index | L2 index | byte offset |
> > + +----------+----------+-----------------+
> > +
> > + Structure of a logical offset
> > +
> > + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
> > + l2_offset = l1_table[l1_index]
> > + l2_table = load_table(l2_offset)
> > + cluster_offset = l2_table[l2_index]
> > + return cluster_offset + byte_offset
>
> Should we reserve some bits in the table entries in case we need some
> flags later? Also, I suppose all table entries must be cluster aligned?
Yes, let's do that. At least for sparse zero cluster tracking we need a
bit. The minimum 4k cluster size gives us 12 bits to play with.
>
> What happened to the other sections that older versions of the spec
> contained? For example, this version doesn't specify any more what the
> semantics of unallocated clusters and backing files is.
I removed them because they don't describe the on-disk layout and were
more of a way to think through the implementation than a format
specification. It was more a decision to focus my effort on improving
the on-disk layout specification than anything else.
Do you want the semantics in the specification, or is it okay to leave
that part on the wiki only?
Stefan
- [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, (continued)
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Anthony Liguori, 2010/10/11
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Avi Kivity, 2010/10/11
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Anthony Liguori, 2010/10/11
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Avi Kivity, 2010/10/11
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Anthony Liguori, 2010/10/11
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Avi Kivity, 2010/10/12
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Kevin Wolf, 2010/10/11
- [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification,
Stefan Hajnoczi <=
- [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Avi Kivity, 2010/10/11
- [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Stefan Hajnoczi, 2010/10/11
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Anthony Liguori, 2010/10/11
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Anthony Liguori, 2010/10/11
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Kevin Wolf, 2010/10/12
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Stefan Hajnoczi, 2010/10/12
- Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Anthony Liguori, 2010/10/12
[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification, Kevin Wolf, 2010/10/11
[Qemu-devel] [PATCH v2 4/7] qed: Add QEMU Enhanced Disk image format, Stefan Hajnoczi, 2010/10/08