Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format


From: Daniel P. Berrange
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Mon, 6 Sep 2010 12:18:21 +0100
User-agent: Mutt/1.4.1i

On Mon, Sep 06, 2010 at 11:04:38AM +0100, Stefan Hajnoczi wrote:
> QEMU Enhanced Disk format is a disk image format that forgoes features
> found in qcow2 in favor of better performance and data integrity.  Due to
> its simpler on-disk layout, metadata updates can be performed safely and
> more efficiently.
> 
> Installations, suspend-to-disk, and other allocation-heavy I/O workloads
> will see increased performance due to fewer I/Os and syncs.  Workloads
> that do not cause new clusters to be allocated will perform similarly to
> raw images due to in-memory metadata caching.
> 
> The format supports sparse disk images.  It does not rely on the host
> filesystem holes feature, making it a good choice for sparse disk images
> that need to be transferred over channels where holes are not supported.
> 
> Backing files are supported, so only the deltas against a base image need
> to be stored.
> 
> The file format is extensible so that additional features can be added
> later with graceful compatibility handling.
> 
> Internal snapshots are not supported.  This eliminates the need for
> additional metadata to track copy-on-write clusters.
> 
> Compression and encryption are not supported.  They add complexity and
> can be implemented at other layers in the stack (e.g. inside the guest
> or on the host).

I agree with ditching compression, but encryption is an important
capability which cannot be satisfactorily added at other layers
in the stack. While block devices / local filesystems can layer
in dm-crypt on the host, this is not possible with network/cluster
filesystems, which account for a non-trivial target audience. Adding
encryption inside the guest is sub-optimal because you cannot do
secure automation of guest startup. Either you require manual
intervention to enter the key at every guest startup, or, if you
hardcode the key, anyone who can access the guest disk image
can start the guest. The qcow2 encryption is the perfect solution
for this problem: it guarantees data security even when the
storage system / network transport offers no security, and it
allows secure control over guest startup. Further, adding encryption
does not add any serious complexity to the on-disk format - just
one extra header field - nor to the implementation - just pass the
data blocks through an encrypt/decrypt filter, with no extra I/O
paths.
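
To illustrate how little machinery that filter needs, here is a minimal
sketch assuming a qcow2-style scheme - AES-128-CBC over each 512-byte
sector, with the sector number as IV. The QEDCryptoState type and the
qed_crypt_sectors() name are hypothetical, not part of this patch; the AES
calls are the legacy OpenSSL API that qcow2 itself uses:

    #include <stdint.h>
    #include <string.h>
    #include <openssl/aes.h>

    #define QED_SECTOR_SIZE 512

    /* Hypothetical per-image crypto state, keyed once at open time with
     * AES_set_encrypt_key()/AES_set_decrypt_key(). */
    typedef struct {
        AES_KEY encrypt_key;
        AES_KEY decrypt_key;
    } QEDCryptoState;

    /* Encrypt or decrypt nb_sectors of guest data on the I/O path.  Using
     * the sector number as IV keeps the filter stateless, so no extra
     * metadata or extra I/O is needed. */
    static void qed_crypt_sectors(QEDCryptoState *c, int64_t sector_num,
                                  uint8_t *out, const uint8_t *in,
                                  int nb_sectors, int enc)
    {
        int i;

        for (i = 0; i < nb_sectors; i++) {
            uint8_t iv[16];
            uint64_t sector = sector_num + i;

            /* qcow2 stores the little-endian sector number in the IV;
             * assume a little-endian host here for brevity. */
            memset(iv, 0, sizeof(iv));
            memcpy(iv, &sector, sizeof(sector));

            AES_cbc_encrypt(in + i * QED_SECTOR_SIZE,
                            out + i * QED_SECTOR_SIZE,
                            QED_SECTOR_SIZE,
                            enc ? &c->encrypt_key : &c->decrypt_key,
                            iv, enc ? AES_ENCRYPT : AES_DECRYPT);
        }
    }

The write path would call this just before submitting the host I/O and the
read path just after it completes - which is exactly the "no extra I/O
paths" property claimed above.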

> diff --git a/block/qed-cluster.c b/block/qed-cluster.c
> new file mode 100644
> index 0000000..6deea27
> --- /dev/null
> +++ b/block/qed-cluster.c
> @@ -0,0 +1,136 @@
> +/*
> + * QEMU Enhanced Disk Format Cluster functions
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <address@hidden>
> + *  Anthony Liguori   <address@hidden>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qed.h"
> +
> +/**
> + * Count the number of contiguous data clusters
> + *
> + * @s:              QED state
> + * @table:          L2 table
> + * @index:          First cluster index
> + * @n:              Maximum number of clusters
> + * @offset:         Set to first cluster offset
> + *
> + * This function scans tables for contiguous allocated or free clusters.
> + */
> +static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
> +                                                  QEDTable *table,
> +                                                  unsigned int index,
> +                                                  unsigned int n,
> +                                                  uint64_t *offset)
> +{
> +    unsigned int end = MIN(index + n, s->table_nelems);
> +    uint64_t last = table->offsets[index];
> +    unsigned int i;
> +
> +    *offset = last;
> +
> +    for (i = index + 1; i < end; i++) {
> +        if (last == 0) {
> +            /* Counting free clusters */
> +            if (table->offsets[i] != 0) {
> +                break;
> +            }
> +        } else {
> +            /* Counting allocated clusters */
> +            if (table->offsets[i] != last + s->header.cluster_size) {
> +                break;
> +            }
> +            last = table->offsets[i];
> +        }
> +    }
> +    return i - index;
> +}
> +
> +typedef struct {
> +    BDRVQEDState *s;
> +    uint64_t pos;
> +    size_t len;
> +
> +    QEDRequest *request;
> +
> +    /* User callback */
> +    QEDFindClusterFunc *cb;
> +    void *opaque;
> +} QEDFindClusterCB;
> +
> +static void qed_find_cluster_cb(void *opaque, int ret)
> +{
> +    QEDFindClusterCB *find_cluster_cb = opaque;
> +    BDRVQEDState *s = find_cluster_cb->s;
> +    QEDRequest *request = find_cluster_cb->request;
> +    uint64_t offset = 0;
> +    size_t len = 0;
> +    unsigned int index;
> +    unsigned int n;
> +
> +    if (ret) {
> +        ret = QED_CLUSTER_ERROR;
> +        goto out;
> +    }
> +
> +    index = qed_l2_index(s, find_cluster_cb->pos);
> +    n = qed_bytes_to_clusters(s,
> +                              qed_offset_into_cluster(s, find_cluster_cb->pos) +
> +                              find_cluster_cb->len);
> +    n = qed_count_contiguous_clusters(s, request->l2_table->table,
> +                                      index, n, &offset);
> +
> +    ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
> +    len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
> +              qed_offset_into_cluster(s, find_cluster_cb->pos));
> +
> +out:
> +    find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
> +    qemu_free(find_cluster_cb);
> +}
> +
> +/**
> + * Find the offset of a data cluster
> + *
> + * @s:          QED state
> + * @pos:        Byte position in device
> + * @len:        Number of bytes
> + * @cb:         Completion function
> + * @opaque:     User data for completion function
> + */
> +void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
> +                      size_t len, QEDFindClusterFunc *cb, void *opaque)
> +{
> +    QEDFindClusterCB *find_cluster_cb;
> +    uint64_t l2_offset;
> +
> +    /* Limit length to L2 boundary.  Requests are broken up at the L2 boundary
> +     * so that a request acts on one L2 table at a time.
> +     */
> +    len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
> +
> +    l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
> +    if (!l2_offset) {
> +        cb(opaque, QED_CLUSTER_L1, 0, len);
> +        return;
> +    }
> +
> +    find_cluster_cb = qemu_malloc(sizeof(*find_cluster_cb));
> +    find_cluster_cb->s = s;
> +    find_cluster_cb->pos = pos;
> +    find_cluster_cb->len = len;
> +    find_cluster_cb->cb = cb;
> +    find_cluster_cb->opaque = opaque;
> +    find_cluster_cb->request = request;
> +
> +    qed_read_l2_table(s, request, l2_offset,
> +                      qed_find_cluster_cb, find_cluster_cb);
> +}
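
(Note the clamp above rounds pos up to the next multiple of 2^l1_shift and
subtracts pos, so with the default geometry - l2_shift = 16, 32768-entry
tables, hence l1_shift = 31 - requests are split at 2 GB boundaries, i.e. at
L2 table granularity.)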
> diff --git a/block/qed-gencb.c b/block/qed-gencb.c
> new file mode 100644
> index 0000000..d389e12
> --- /dev/null
> +++ b/block/qed-gencb.c
> @@ -0,0 +1,32 @@
> +/*
> + * QEMU Enhanced Disk Format
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <address@hidden>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qed.h"
> +
> +void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    GenericCB *gencb = qemu_malloc(len);
> +    gencb->cb = cb;
> +    gencb->opaque = opaque;
> +    return gencb;
> +}
> +
> +void gencb_complete(void *opaque, int ret)
> +{
> +    GenericCB *gencb = opaque;
> +    BlockDriverCompletionFunc *cb = gencb->cb;
> +    void *user_opaque = gencb->opaque;
> +
> +    qemu_free(gencb);
> +    cb(user_opaque, ret);
> +}
> diff --git a/block/qed-l2-cache.c b/block/qed-l2-cache.c
> new file mode 100644
> index 0000000..747a629
> --- /dev/null
> +++ b/block/qed-l2-cache.c
> @@ -0,0 +1,131 @@
> +/*
> + * QEMU Enhanced Disk Format L2 Cache
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Anthony Liguori   <address@hidden>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qed.h"
> +
> +/* Each L2 holds 2GB so this lets us fully cache a 100GB disk */
> +#define MAX_L2_CACHE_SIZE 50
> +
> +/**
> + * Initialize the L2 cache
> + */
> +void qed_init_l2_cache(L2TableCache *l2_cache,
> +                       L2TableAllocFunc *alloc_l2_table,
> +                       void *alloc_l2_table_opaque)
> +{
> +    QTAILQ_INIT(&l2_cache->entries);
> +    l2_cache->n_entries = 0;
> +    l2_cache->alloc_l2_table = alloc_l2_table;
> +    l2_cache->alloc_l2_table_opaque = alloc_l2_table_opaque;
> +}
> +
> +/**
> + * Free the L2 cache
> + */
> +void qed_free_l2_cache(L2TableCache *l2_cache)
> +{
> +    CachedL2Table *entry, *next_entry;
> +
> +    QTAILQ_FOREACH_SAFE(entry, &l2_cache->entries, node, next_entry) {
> +        qemu_free(entry->table);
> +        qemu_free(entry);
> +    }
> +}
> +
> +/**
> + * Allocate an uninitialized entry from the cache
> + *
> + * The returned entry has a reference count of 1 and is owned by the caller.
> + */
> +CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache)
> +{
> +    CachedL2Table *entry;
> +
> +    entry = qemu_mallocz(sizeof(*entry));
> +    entry->table = l2_cache->alloc_l2_table(l2_cache->alloc_l2_table_opaque);
> +    entry->ref++;
> +
> +    return entry;
> +}
> +
> +/**
> + * Decrease an entry's reference count and free the entry when the reference
> + * count drops to zero.
> + */
> +void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry)
> +{
> +    if (!entry) {
> +        return;
> +    }
> +
> +    entry->ref--;
> +    if (entry->ref == 0) {
> +        qemu_free(entry->table);
> +        qemu_free(entry);
> +    }
> +}
> +
> +/**
> + * Find an entry in the L2 cache.  This may return NULL and it's up to the
> + * caller to satisfy the cache miss.
> + *
> + * For a cached entry, this function increases the reference count and returns
> + * the entry.
> + */
> +CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset)
> +{
> +    CachedL2Table *entry;
> +
> +    QTAILQ_FOREACH(entry, &l2_cache->entries, node) {
> +        if (entry->offset == offset) {
> +            entry->ref++;
> +            return entry;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +/**
> + * Commit an L2 cache entry into the cache.  This is meant to be used as part of
> + * the process to satisfy a cache miss.  A caller would allocate an entry which
> + * is not actually in the L2 cache and then once the entry was valid and
> + * present on disk, the entry can be committed into the cache.
> + *
> + * Since the cache is write-through, it's important that this function is not
> + * called until the entry is present on disk and the L1 has been updated to
> + * point to the entry.
> + *
> + * This function will take a reference to the entry so the caller is still
> + * responsible for unreferencing the entry.
> + */
> +void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table)
> +{
> +    CachedL2Table *entry;
> +
> +    entry = qed_find_l2_cache_entry(l2_cache, l2_table->offset);
> +    if (entry) {
> +        qed_unref_l2_cache_entry(l2_cache, entry);
> +        return;
> +    }
> +
> +    if (l2_cache->n_entries >= MAX_L2_CACHE_SIZE) {
> +        entry = QTAILQ_FIRST(&l2_cache->entries);
> +        QTAILQ_REMOVE(&l2_cache->entries, entry, node);
> +        l2_cache->n_entries--;
> +        qed_unref_l2_cache_entry(l2_cache, entry);
> +    }
> +
> +    l2_table->ref++;
> +    l2_cache->n_entries++;
> +    QTAILQ_INSERT_TAIL(&l2_cache->entries, l2_table, node);
> +}
> diff --git a/block/qed-table.c b/block/qed-table.c
> new file mode 100644
> index 0000000..9a72582
> --- /dev/null
> +++ b/block/qed-table.c
> @@ -0,0 +1,242 @@
> +/*
> + * QEMU Enhanced Disk Format Table I/O
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <address@hidden>
> + *  Anthony Liguori   <address@hidden>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qed.h"
> +
> +typedef struct {
> +    GenericCB gencb;
> +    BDRVQEDState *s;
> +    QEDTable *table;
> +
> +    struct iovec iov;
> +    QEMUIOVector qiov;
> +} QEDReadTableCB;
> +
> +static void qed_read_table_cb(void *opaque, int ret)
> +{
> +    QEDReadTableCB *read_table_cb = opaque;
> +    QEDTable *table = read_table_cb->table;
> +    int noffsets = read_table_cb->iov.iov_len / sizeof(uint64_t);
> +    int i;
> +
> +    /* Handle I/O error */
> +    if (ret) {
> +        goto out;
> +    }
> +
> +    /* Byteswap offsets */
> +    for (i = 0; i < noffsets; i++) {
> +        table->offsets[i] = le64_to_cpu(table->offsets[i]);
> +    }
> +
> +out:
> +    /* Completion */
> +    gencb_complete(&read_table_cb->gencb, ret);
> +}
> +
> +static void qed_read_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
> +                           BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    QEDReadTableCB *read_table_cb = gencb_alloc(sizeof(*read_table_cb),
> +                                                cb, opaque);
> +    QEMUIOVector *qiov = &read_table_cb->qiov;
> +    BlockDriverAIOCB *aiocb;
> +
> +    read_table_cb->s = s;
> +    read_table_cb->table = table;
> +    read_table_cb->iov.iov_base = table->offsets;
> +    read_table_cb->iov.iov_len = s->header.cluster_size * s->header.table_size;
> +
> +    qemu_iovec_init_external(qiov, &read_table_cb->iov, 1);
> +    aiocb = bdrv_aio_readv(s->bs->file, offset / BDRV_SECTOR_SIZE, qiov,
> +                           read_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
> +                           qed_read_table_cb, read_table_cb);
> +    if (!aiocb) {
> +        qed_read_table_cb(read_table_cb, -EIO);
> +    }
> +}
> +
> +typedef struct {
> +    GenericCB gencb;
> +    BDRVQEDState *s;
> +    QEDTable *orig_table;
> +    bool flush;             /* flush after write? */
> +
> +    struct iovec iov;
> +    QEMUIOVector qiov;
> +
> +    QEDTable table;
> +} QEDWriteTableCB;
> +
> +static void qed_write_table_cb(void *opaque, int ret)
> +{
> +    QEDWriteTableCB *write_table_cb = opaque;
> +
> +    if (ret) {
> +        goto out;
> +    }
> +
> +    if (write_table_cb->flush) {
> +        /* We still need to flush first */
> +        write_table_cb->flush = false;
> +        bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
> +                       write_table_cb);
> +        return;
> +    }
> +
> +out:
> +    gencb_complete(&write_table_cb->gencb, ret);
> +    return;
> +}
> +
> +/**
> + * Write out an updated part or all of a table
> + *
> + * @s:          QED state
> + * @offset:     Offset of table in image file, in bytes
> + * @table:      Table
> + * @index:      Index of first element
> + * @n:          Number of elements
> + * @flush:      Whether or not to sync to disk
> + * @cb:         Completion function
> + * @opaque:     Argument for completion function
> + */
> +static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
> +                            unsigned int index, unsigned int n, bool flush,
> +                            BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    QEDWriteTableCB *write_table_cb;
> +    BlockDriverAIOCB *aiocb;
> +    unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
> +    unsigned int start, end, i;
> +    size_t len_bytes;
> +
> +    /* Calculate indices of the first and one after last elements */
> +    start = index & ~sector_mask;
> +    end = (index + n + sector_mask) & ~sector_mask;
> +
> +    len_bytes = (end - start) * sizeof(uint64_t);
> +
> +    write_table_cb = gencb_alloc(sizeof(*write_table_cb) + len_bytes,
> +                                 cb, opaque);
> +    write_table_cb->s = s;
> +    write_table_cb->orig_table = table;
> +    write_table_cb->flush = flush;
> +    write_table_cb->iov.iov_base = write_table_cb->table.offsets;
> +    write_table_cb->iov.iov_len = len_bytes;
> +    qemu_iovec_init_external(&write_table_cb->qiov, &write_table_cb->iov, 1);
> +
> +    /* Byteswap table */
> +    for (i = start; i < end; i++) {
> +        write_table_cb->table.offsets[i - start] = cpu_to_le64(table->offsets[i]);
> +    }
> +
> +    /* Adjust for offset into table */
> +    offset += start * sizeof(uint64_t);
> +
> +    aiocb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
> +                            &write_table_cb->qiov,
> +                            write_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
> +                            qed_write_table_cb, write_table_cb);
> +    if (!aiocb) {
> +        qed_write_table_cb(write_table_cb, -EIO);
> +    }
> +}
> +
> +static void qed_read_l1_table_cb(void *opaque, int ret)
> +{
> +    *(int *)opaque = ret;
> +}
> +
> +/**
> + * Read the L1 table synchronously
> + */
> +int qed_read_l1_table(BDRVQEDState *s)
> +{
> +    int ret = -EINPROGRESS;
> +
> +    /* TODO push/pop async context? */
> +
> +    qed_read_table(s, s->header.l1_table_offset,
> +                   s->l1_table, qed_read_l1_table_cb, &ret);
> +    while (ret == -EINPROGRESS) {
> +        qemu_aio_wait();
> +    }
> +    return ret;
> +}
> +
> +void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
> +                        BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    qed_write_table(s, s->header.l1_table_offset,
> +                    s->l1_table, index, n, false, cb, opaque);
> +}
> +
> +typedef struct {
> +    GenericCB gencb;
> +    BDRVQEDState *s;
> +    uint64_t l2_offset;
> +    QEDRequest *request;
> +} QEDReadL2TableCB;
> +
> +static void qed_read_l2_table_cb(void *opaque, int ret)
> +{
> +    QEDReadL2TableCB *read_l2_table_cb = opaque;
> +    QEDRequest *request = read_l2_table_cb->request;
> +    BDRVQEDState *s = read_l2_table_cb->s;
> +
> +    if (ret) {
> +        /* can't trust loaded L2 table anymore */
> +        qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
> +        request->l2_table = NULL;
> +    } else {
> +        request->l2_table->offset = read_l2_table_cb->l2_offset;
> +        qed_commit_l2_cache_entry(&s->l2_cache, request->l2_table);
> +    }
> +
> +    gencb_complete(&read_l2_table_cb->gencb, ret);
> +}
> +
> +void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
> +                       BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    QEDReadL2TableCB *read_l2_table_cb;
> +
> +    qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
> +
> +    /* Check for cached L2 entry */
> +    request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
> +    if (request->l2_table) {
> +        cb(opaque, 0);
> +        return;
> +    }
> +
> +    request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
> +
> +    read_l2_table_cb = gencb_alloc(sizeof(*read_l2_table_cb), cb, opaque);
> +    read_l2_table_cb->s = s;
> +    read_l2_table_cb->l2_offset = offset;
> +    read_l2_table_cb->request = request;
> +
> +    qed_read_table(s, offset, request->l2_table->table,
> +                   qed_read_l2_table_cb, read_l2_table_cb);
> +}
> +
> +void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
> +                        unsigned int index, unsigned int n, bool flush,
> +                        BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    qed_write_table(s, request->l2_table->offset,
> +                    request->l2_table->table, index, n, flush, cb, opaque);
> +}
> diff --git a/block/qed.c b/block/qed.c
> new file mode 100644
> index 0000000..cf64418
> --- /dev/null
> +++ b/block/qed.c
> @@ -0,0 +1,1103 @@
> +/*
> + * QEMU Enhanced Disk Format
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <address@hidden>
> + *  Anthony Liguori   <address@hidden>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qed.h"
> +
> +/* TODO blkdebug support */
> +/* TODO BlockDriverState::buffer_alignment */
> +/* TODO check L2 table sizes before accessing them? */
> +/* TODO skip zero prefill since the filesystem should zero the sectors anyway */
> +/* TODO if a table element's offset is invalid then the image is broken.  If
> + * there was a power failure and the table update reached storage but the data
> + * being pointed to did not, forget about the lost data by clearing the offset.
> + * However, need to be careful to detect invalid offsets for tables that are
> + * read *after* more clusters have been allocated. */
> +
> +enum {
> +    QED_MAGIC = 'Q' | 'E' << 8 | 'D' << 16 | '\0' << 24,
> +
> +    /* The image supports a backing file */
> +    QED_F_BACKING_FILE = 0x01,
> +
> +    /* The image has the backing file format */
> +    QED_CF_BACKING_FORMAT = 0x01,
> +
> +    /* Feature bits must be used when the on-disk format changes */
> +    QED_FEATURE_MASK = QED_F_BACKING_FILE,            /* supported feature bits */
> +    QED_COMPAT_FEATURE_MASK = QED_CF_BACKING_FORMAT,  /* supported compat feature bits */
> +
> +    /* Data is stored in groups of sectors called clusters.  Cluster size must
> +     * be large to avoid keeping too much metadata.  I/O requests that have
> +     * sub-cluster size will require read-modify-write.
> +     */
> +    QED_MIN_CLUSTER_SIZE = 4 * 1024, /* in bytes */
> +    QED_MAX_CLUSTER_SIZE = 64 * 1024 * 1024,
> +    QED_DEFAULT_CLUSTER_SIZE = 64 * 1024,
> +
> +    /* Allocated clusters are tracked using a 2-level pagetable.  Table size is
> +     * a multiple of clusters so large maximum image sizes can be supported
> +     * without jacking up the cluster size too much.
> +     */
> +    QED_MIN_TABLE_SIZE = 1,        /* in clusters */
> +    QED_MAX_TABLE_SIZE = 16,
> +    QED_DEFAULT_TABLE_SIZE = 4,
> +};
> +
> +static void qed_aio_cancel(BlockDriverAIOCB *acb)
> +{
> +    qemu_aio_release(acb);
> +}
> +
> +static AIOPool qed_aio_pool = {
> +    .aiocb_size         = sizeof(QEDAIOCB),
> +    .cancel             = qed_aio_cancel,
> +};
> +
> +/**
> + * Allocate memory that satisfies image file and backing file alignment requirements
> + *
> + * TODO make this common and consider propagating max buffer_alignment to the root image
> + */
> +static void *qed_memalign(BDRVQEDState *s, size_t len)
> +{
> +    size_t align = s->bs->file->buffer_alignment;
> +    BlockDriverState *backing_hd = s->bs->backing_hd;
> +
> +    if (backing_hd && backing_hd->buffer_alignment > align) {
> +        align = backing_hd->buffer_alignment;
> +    }
> +
> +    return qemu_memalign(align, len);
> +}
> +
> +static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
> +                          const char *filename)
> +{
> +    const QEDHeader *header = (const void *)buf;
> +
> +    if (buf_size < sizeof(*header)) {
> +        return 0;
> +    }
> +    if (le32_to_cpu(header->magic) != QED_MAGIC) {
> +        return 0;
> +    }
> +    return 100;
> +}
> +
> +static void qed_header_le_to_cpu(const QEDHeader *le, QEDHeader *cpu)
> +{
> +    cpu->magic = le32_to_cpu(le->magic);
> +    cpu->cluster_size = le32_to_cpu(le->cluster_size);
> +    cpu->table_size = le32_to_cpu(le->table_size);
> +    cpu->first_cluster = le32_to_cpu(le->first_cluster);
> +    cpu->features = le64_to_cpu(le->features);
> +    cpu->compat_features = le64_to_cpu(le->compat_features);
> +    cpu->l1_table_offset = le64_to_cpu(le->l1_table_offset);
> +    cpu->image_size = le64_to_cpu(le->image_size);
> +    cpu->backing_file_offset = le32_to_cpu(le->backing_file_offset);
> +    cpu->backing_file_size = le32_to_cpu(le->backing_file_size);
> +    cpu->backing_fmt_offset = le32_to_cpu(le->backing_fmt_offset);
> +    cpu->backing_fmt_size = le32_to_cpu(le->backing_fmt_size);
> +}
> +
> +static void qed_header_cpu_to_le(const QEDHeader *cpu, QEDHeader *le)
> +{
> +    le->magic = cpu_to_le32(cpu->magic);
> +    le->cluster_size = cpu_to_le32(cpu->cluster_size);
> +    le->table_size = cpu_to_le32(cpu->table_size);
> +    le->first_cluster = cpu_to_le32(cpu->first_cluster);
> +    le->features = cpu_to_le64(cpu->features);
> +    le->compat_features = cpu_to_le64(cpu->compat_features);
> +    le->l1_table_offset = cpu_to_le64(cpu->l1_table_offset);
> +    le->image_size = cpu_to_le64(cpu->image_size);
> +    le->backing_file_offset = cpu_to_le32(cpu->backing_file_offset);
> +    le->backing_file_size = cpu_to_le32(cpu->backing_file_size);
> +    le->backing_fmt_offset = cpu_to_le32(cpu->backing_fmt_offset);
> +    le->backing_fmt_size = cpu_to_le32(cpu->backing_fmt_size);
> +}
> +
> +static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
> +{
> +    uint64_t table_entries;
> +    uint64_t l2_size;
> +
> +    table_entries = (table_size * cluster_size) / 8;
> +    l2_size = table_entries * cluster_size;
> +
> +    return l2_size * table_entries;
> +}
> +
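
(For concreteness: with the defaults above - 64 KB clusters and table_size
= 4 - table_entries = 4 * 65536 / 8 = 32768, so one L2 table maps 32768 *
64 KB = 2 GB and the maximum image size is 32768 * 2 GB = 64 TB. This also
matches the "each L2 holds 2GB" comment in qed-l2-cache.c.)
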
> +static bool qed_is_cluster_size_valid(uint32_t cluster_size)
> +{
> +    if (cluster_size < QED_MIN_CLUSTER_SIZE ||
> +        cluster_size > QED_MAX_CLUSTER_SIZE) {
> +        return false;
> +    }
> +    if (cluster_size & (cluster_size - 1)) {
> +        return false; /* not power of 2 */
> +    }
> +    return true;
> +}
> +
> +static bool qed_is_table_size_valid(uint32_t table_size)
> +{
> +    if (table_size < QED_MIN_TABLE_SIZE ||
> +        table_size > QED_MAX_TABLE_SIZE) {
> +        return false;
> +    }
> +    if (table_size & (table_size - 1)) {
> +        return false; /* not power of 2 */
> +    }
> +    return true;
> +}
> +
> +static bool qed_is_image_size_valid(uint64_t image_size, uint32_t cluster_size,
> +                                    uint32_t table_size)
> +{
> +    if (image_size == 0) {
> +        /* Supporting zero size images makes life harder because even the L1
> +         * table is not needed.  Make life simple and forbid zero size images.
> +         */
> +        return false;
> +    }
> +    if (image_size & (cluster_size - 1)) {
> +        return false; /* not multiple of cluster size */
> +    }
> +    if (image_size > qed_max_image_size(cluster_size, table_size)) {
> +        return false; /* image is too large */
> +    }
> +    return true;
> +}
> +
> +/**
> + * Test if a byte offset is cluster aligned and within the image file
> + */
> +static bool qed_check_byte_offset(BDRVQEDState *s, uint64_t offset)
> +{
> +    if (offset & (s->header.cluster_size - 1)) {
> +        return false;
> +    }
> +    if (offset == 0) {
> +        return false; /* first cluster contains the header and is not valid */
> +    }
> +    return offset < s->file_size;
> +}
> +
> +/**
> + * Read a string of known length from the image file
> + *
> + * @file:       Image file
> + * @offset:     File offset to start of string, in bytes
> + * @n:          String length in bytes
> + * @buf:        Destination buffer
> + * @buflen:     Destination buffer length in bytes
> + *
> + * The string is NUL-terminated.
> + */
> +static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
> +                           char *buf, size_t buflen)
> +{
> +    int ret;
> +    if (n >= buflen) {
> +        return -EINVAL;
> +    }
> +    ret = bdrv_pread(file, offset, buf, n);
> +    if (ret != n) {
> +        return ret < 0 ? ret : -EIO; /* treat short reads as I/O errors */
> +    }
> +    buf[n] = '\0';
> +    return 0;
> +}
> +
> +/**
> + * Allocate new clusters
> + *
> + * @s:          QED state
> + * @n:          Number of contiguous clusters to allocate
> + * @offset:     Offset of first allocated cluster, filled in on success
> + */
> +static int qed_alloc_clusters(BDRVQEDState *s, unsigned int n, uint64_t *offset)
> +{
> +    *offset = s->file_size;
> +    s->file_size += n * s->header.cluster_size;
> +    return 0;
> +}
> +
> +static QEDTable *qed_alloc_table(void *opaque)
> +{
> +    BDRVQEDState *s = opaque;
> +
> +    /* Honor O_DIRECT memory alignment requirements */
> +    return qed_memalign(s, s->header.cluster_size * s->header.table_size);
> +}
> +
> +/**
> + * Allocate a new zeroed L2 table
> + */
> +static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
> +{
> +    uint64_t offset;
> +    int ret;
> +    CachedL2Table *l2_table;
> +
> +    ret = qed_alloc_clusters(s, s->header.table_size, &offset);
> +    if (ret) {
> +        return NULL;
> +    }
> +
> +    l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
> +    l2_table->offset = offset;
> +
> +    memset(l2_table->table->offsets, 0,
> +           s->header.cluster_size * s->header.table_size);
> +    return l2_table;
> +}
> +
> +static int bdrv_qed_open(BlockDriverState *bs, int flags)
> +{
> +    BDRVQEDState *s = bs->opaque;
> +    QEDHeader le_header;
> +    int64_t file_size;
> +    int ret;
> +
> +    s->bs = bs;
> +    QSIMPLEQ_INIT(&s->allocating_write_reqs);
> +
> +    ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
> +    if (ret != sizeof(le_header)) {
> +        return ret < 0 ? ret : -EIO;
> +    }
> +    qed_header_le_to_cpu(&le_header, &s->header);
> +
> +    if (s->header.magic != QED_MAGIC) {
> +        return -ENOENT;
> +    }
> +    if (s->header.features & ~QED_FEATURE_MASK) {
> +        return -ENOTSUP; /* image uses unsupported feature bits */
> +    }
> +    if (!qed_is_cluster_size_valid(s->header.cluster_size)) {
> +        return -EINVAL;
> +    }
> +
> +    /* Round up file size to the next cluster */
> +    file_size = bdrv_getlength(bs->file);
> +    if (file_size < 0) {
> +        return file_size;
> +    }
> +    s->file_size = qed_start_of_cluster(s, file_size + s->header.cluster_size - 1);
> +
> +    if (!qed_is_table_size_valid(s->header.table_size)) {
> +        return -EINVAL;
> +    }
> +    if (!qed_is_image_size_valid(s->header.image_size,
> +                                 s->header.cluster_size,
> +                                 s->header.table_size)) {
> +        return -EINVAL;
> +    }
> +    if (!qed_check_byte_offset(s, s->header.l1_table_offset)) {
> +        return -EINVAL;
> +    }
> +
> +    s->table_nelems = (s->header.cluster_size * s->header.table_size) /
> +        sizeof(s->l1_table->offsets[0]);
> +    s->l2_shift = get_bits_from_size(s->header.cluster_size);
> +    s->l2_mask = s->table_nelems - 1;
> +    s->l1_shift = s->l2_shift + get_bits_from_size(s->l2_mask + 1);
> +
> +    if ((s->header.features & QED_F_BACKING_FILE)) {
> +        ret = qed_read_string(bs->file, s->header.backing_file_offset,
> +                              s->header.backing_file_size, bs->backing_file,
> +                              sizeof(bs->backing_file));
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        if ((s->header.compat_features & QED_CF_BACKING_FORMAT)) {
> +            ret = qed_read_string(bs->file, s->header.backing_fmt_offset,
> +                                  s->header.backing_fmt_size,
> +                                  bs->backing_format,
> +                                  sizeof(bs->backing_format));
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }

IMHO we should make the backing format compulsory when a
backing file is used. The only time probing is required is when
initially creating the child image; thereafter there's no
benefit to probing again.
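
Concretely, such a check could live in bdrv_qed_open() - a minimal sketch
using only the header bits already defined in this patch (whether the
format name should instead sit behind a mandatory feature bit is a
separate question):

    if ((s->header.features & QED_F_BACKING_FILE) &&
        !(s->header.compat_features & QED_CF_BACKING_FORMAT)) {
        return -EINVAL; /* backing file present but no backing format given */
    }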

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|


