From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] Re: [PATCH 2/2] qcow2: Implement bdrv_truncate() for growing images
Date: Mon, 26 Apr 2010 15:01:42 +0100

On Mon, Apr 26, 2010 at 12:46 PM, Kevin Wolf <address@hidden> wrote:
> On 24.04.2010 10:12, Stefan Hajnoczi wrote:
>> This patch adds the ability to grow qcow2 images in-place using
>> bdrv_truncate().  This enables qemu-img resize command support for
>> qcow2.
>>
>> Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
>> The notion of resizing an image with snapshots could lead to confusion:
>> users may expect snapshots to remain unchanged, but this is not possible
>> with the current qcow2 on-disk format where the header.size field is
>> global instead of per-snapshot.  Others may expect snapshots to change
>> size along with the current image data.  I think it is safest to not
>> support snapshots and perhaps add behavior later if there is a
>> consensus.
>
> I think it would make most sense if we kept the disk size per snapshot
> (so your first option). The snapshot header is extensible, so it should
> be possible.
>
> Just disabling it is okay with me, too, but in that case this patch is
> much more complicated than it needs to be: If there are no snapshots,
> there is no vmstate area.

Good point: things are simplified greatly if there is no vm state
data.  I haven't read the save/load vm code yet and failed to make the
connection here ;).

I will eliminate most of the code as you explained and send a v2
patch.  I have responded below because I hope to learn more by
discussing the comments with you, but most of the issues will not
apply to a slimmed-down v2, so feel free to ignore them if you are busy.

>> Backing images continue to work.  If the image is now larger than its
>> backing image, zeroes are read when accessing beyond the end of the
>> backing image.
>>
>> Signed-off-by: Stefan Hajnoczi <address@hidden>
>> ---
>> This applies to kevin/block.
>>
>>  block/qcow2-cluster.c  |   64 ++++++++++++++++++++++++++++++++++++++---------
>>  block/qcow2-snapshot.c |    2 +-
>>  block/qcow2.c          |   43 +++++++++++++++++++++++++++++---
>>  block/qcow2.h          |    9 ++++++-
>>  4 files changed, 99 insertions(+), 19 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index c11680d..20c8426 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -28,30 +28,39 @@
>>  #include "block_int.h"
>>  #include "block/qcow2.h"
>>
>> -int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>> +/*
>> + * qcow2_grow_l1_table_common
>> + *
>> + * Grows the L1 table and updates the header on disk.
>> + *
>> + * Setting new_l1_vm_state_index to s->l1_vm_state_index grows the vm state
>> + * area.
>> + *
>> + * Setting new_l1_vm_state_index to the new end of image grows the image data
>> + * area.
>
> I don't understand this distinction. I'm sure you describe something
> trivial here, but it sounds like magic.
>
> There's really not much difference between data and vmstate clusters, and
> callers don't know which of the two they are working on.

The new_l1_vm_state_index argument is used to position the vm state
data in the new L1 table.  If we're growing the L1 table to make room
for more vm state data, then new_l1_vm_state_index should keep its old
value (s->l1_vm_state_index).  If we're growing the image data and
need to move the vm state out of the way, then new_l1_vm_state_index
should be recalculated for the new image size.

The function allocates the new L1 table and performs two separate
copies: one for the image data L1 entries and one for the vm state L1
entries.  The new_l1_vm_state_index argument controls where the vm
state entries are copied; it provides the ability to move the vm state
data to make room for new image data entries.
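
To illustrate with the patch's own code, these are the two copies in
qcow2_grow_l1_table_common() (error handling and the on-disk header
update omitted):

    /* Image data L1 entries keep their position at the start of the
     * new table */
    memcpy(new_l1_table, s->l1_table,
           s->l1_vm_state_index * sizeof(uint64_t));

    /* Vm state L1 entries land at new_l1_vm_state_index: either the old
     * value (growing the vm state area) or the index just past the new
     * end of the image data (growing the data area) */
    memcpy(&new_l1_table[new_l1_vm_state_index],
           &s->l1_table[s->l1_vm_state_index],
           (s->l1_size - s->l1_vm_state_index) * sizeof(uint64_t));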

>> + *
>> + * Returns 0 on success, -errno in failure case.
>> + */
>> +static int qcow2_grow_l1_table_common(BlockDriverState *bs,
>> +                                      int new_l1_vm_state_index,
>> +                                      int new_l1_size)
>>  {
>>      BDRVQcowState *s = bs->opaque;
>> -    int new_l1_size, new_l1_size2, ret, i;
>> +    int new_l1_size2, ret, i;
>>      uint64_t *new_l1_table;
>>      int64_t new_l1_table_offset;
>>      uint8_t data[12];
>>
>> -    new_l1_size = s->l1_size;
>> -    if (min_size <= new_l1_size)
>> -        return 0;
>> -    if (new_l1_size == 0) {
>> -        new_l1_size = 1;
>> -    }
>> -    while (min_size > new_l1_size) {
>> -        new_l1_size = (new_l1_size * 3 + 1) / 2;
>> -    }
>>  #ifdef DEBUG_ALLOC2
>>      printf("grow l1_table from %d to %d\n", s->l1_size, new_l1_size);
>>  #endif
>>
>>      new_l1_size2 = sizeof(uint64_t) * new_l1_size;
>>      new_l1_table = qemu_mallocz(align_offset(new_l1_size2, 512));
>> -    memcpy(new_l1_table, s->l1_table, s->l1_size * sizeof(uint64_t));
>> +    memcpy(new_l1_table, s->l1_table, s->l1_vm_state_index * sizeof(uint64_t));
>
> This change smells like a buffer overflow. It both can read beyond the
> end of the old L1 table and can overflow the new one if the L1 table
> hasn't yet reached its full size.
>
> I think this is the main problem of this patch: You seem to assume that
> the L1 table is created with the full size needed to cover all of the
> data area. Which seems to be true at least for images created by
> qemu-img, but I don't think we can rely on it (nor do I think we should,
> because it introduces a special case where there is none).
>
> I'd rather consider this preallocation a performance optimization. If we
> stop treating it as such, we would need to refuse opening any images
> that don't fulfill this requirement.

From qcow_open():

/* read the level 1 table */
s->l1_size = header.l1_size;
shift = s->cluster_bits + s->l2_bits;
s->l1_vm_state_index = (header.size + (1LL << shift) - 1) >> shift;
/* the L1 table must contain at least enough entries to put
   header.size bytes */
if (s->l1_size < s->l1_vm_state_index)
    goto fail;

Images where L1 size is not at least l1_vm_state_index cannot be
opened by QEMU.  Right now the special case exists and perhaps some
qcow2 code already relies on this assumption.
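
As a concrete example, assuming the default 64 KiB clusters
(cluster_bits = 16, so an L2 table holds 65536 / 8 = 8192 entries and
l2_bits = 13), each L1 entry covers 512 MB and a 4 GB image needs:

    int shift = 16 + 13;        /* cluster_bits + l2_bits */
    uint64_t size = 4ULL << 30; /* header.size */
    /* (2^32 + 2^29 - 1) >> 29 == 8 L1 entries for image data */
    int l1_vm_state_index = (size + (1LL << shift) - 1) >> shift;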

>> +    memcpy(&new_l1_table[new_l1_vm_state_index],
>> +           &s->l1_table[s->l1_vm_state_index],
>> +           (s->l1_size - s->l1_vm_state_index) * sizeof(uint64_t));
>>
>>      /* write new table (align to cluster) */
>>      BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ALLOC_TABLE);
>> @@ -83,6 +92,7 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>>      s->l1_table_offset = new_l1_table_offset;
>>      s->l1_table = new_l1_table;
>>      s->l1_size = new_l1_size;
>> +    s->l1_vm_state_index = new_l1_vm_state_index;
>>      return 0;
>>   fail:
>>      qemu_free(new_l1_table);
>> @@ -90,6 +100,34 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>>      return ret < 0 ? ret : -EIO;
>>  }
>>
>> +int qcow2_grow_l1_vm_state(BlockDriverState *bs, int min_size)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int new_l1_size;
>> +
>> +    new_l1_size = s->l1_size;
>> +    if (min_size <= new_l1_size)
>> +        return 0;
>> +    if (new_l1_size == 0) {
>> +        new_l1_size = 1;
>> +    }
>> +    while (min_size > new_l1_size) {
>> +        new_l1_size = (new_l1_size * 3 + 1) / 2;
>> +    }
>> +
>> +    return qcow2_grow_l1_table_common(bs, s->l1_vm_state_index, new_l1_size);
>> +}
>> +
>> +int qcow2_grow_l1_image_data(BlockDriverState *bs, int new_l1_size)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int new_l1_vm_state_index;
>> +
>> +    new_l1_vm_state_index = new_l1_size;
>> +    new_l1_size += s->l1_size - s->l1_vm_state_index;
>
> Same assumption as above. This might turn into a negative number added
> to new_l1_size.
>
>> +    return qcow2_grow_l1_table_common(bs, new_l1_vm_state_index, new_l1_size);
>> +}
>> +
>>  void qcow2_l2_cache_reset(BlockDriverState *bs)
>>  {
>>      BDRVQcowState *s = bs->opaque;
>> @@ -524,7 +562,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
>>
>>      l1_index = offset >> (s->l2_bits + s->cluster_bits);
>>      if (l1_index >= s->l1_size) {
>> -        ret = qcow2_grow_l1_table(bs, l1_index + 1);
>> +        ret = qcow2_grow_l1_vm_state(bs, l1_index + 1);
>
> Once you agree that get_cluster_table can't know if it's working on data
> or vmstate, qcow2_grow_l1_vm_state is a misnomer at least.
>
>>          if (ret < 0) {
>>              return ret;
>>          }
>> diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
>> index 2a21c17..7f0d810 100644
>> --- a/block/qcow2-snapshot.c
>> +++ b/block/qcow2-snapshot.c
>> @@ -326,7 +326,7 @@ int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
>>      if (qcow2_update_snapshot_refcount(bs, s->l1_table_offset, s->l1_size, -1) < 0)
>>          goto fail;
>>
>> -    if (qcow2_grow_l1_table(bs, sn->l1_size) < 0)
>> +    if (qcow2_grow_l1_vm_state(bs, sn->l1_size) < 0)
>>          goto fail;
>>
>>      s->l1_size = sn->l1_size;
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 4a7ab66..ab622a2 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -140,7 +140,7 @@ static int qcow_read_extensions(BlockDriverState *bs, uint64_t start_offset,
>>  static int qcow_open(BlockDriverState *bs, int flags)
>>  {
>>      BDRVQcowState *s = bs->opaque;
>> -    int len, i, shift;
>> +    int len, i;
>>      QCowHeader header;
>>      uint64_t ext_end;
>>
>> @@ -188,8 +188,7 @@ static int qcow_open(BlockDriverState *bs, int flags)
>>
>>      /* read the level 1 table */
>>      s->l1_size = header.l1_size;
>> -    shift = s->cluster_bits + s->l2_bits;
>> -    s->l1_vm_state_index = (header.size + (1LL << shift) - 1) >> shift;
>> +    s->l1_vm_state_index = size_to_l1(s, header.size);
>>      /* the L1 table must contain at least enough entries to put
>>         header.size bytes */
>>      if (s->l1_size < s->l1_vm_state_index)
>> @@ -1095,6 +1094,40 @@ static int qcow_make_empty(BlockDriverState *bs)
>>      return 0;
>>  }
>>
>> +static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int ret, new_l1_size;
>> +
>> +    if (offset & 511) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* cannot proceed if image has snapshots */
>> +    if (s->nb_snapshots) {
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    /* shrinking is currently not supported */
>> +    if (offset < bs->total_sectors << BDRV_SECTOR_BITS) {
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    new_l1_size = size_to_l1(s, offset);
>> +    ret = qcow2_grow_l1_image_data(bs, new_l1_size);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>
> Instead of the confusing changes above you could just increase the L1
> table size using the old function and keep the data/vmstate thing local
> to qcow2_truncate (or maybe better a qcow2_move_vmstate(bs, offset)
> which internally grows the L1 table as needed).
>
> Actually, I think this is not that different from your patch (you called
> an almost identical function qcow2_grow_l1_image_data and avoided the
> normal calculation of the next L1 table size for some reason), but it
> would probably be a lot less confusing.

I like the qcow2_move_vmstate() name.  It is clearer than
qcow2_grow_l1_image_data().
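
To check my understanding, here is a rough sketch of what I think
qcow2_move_vmstate() would do (untested; the rewrite of the L1 table on
disk and the error paths are omitted):

    static int qcow2_move_vmstate(BlockDriverState *bs, int64_t offset)
    {
        BDRVQcowState *s = bs->opaque;
        int new_l1_vm_state_index = size_to_l1(s, offset);
        int nb_vm_state = s->l1_size - s->l1_vm_state_index;
        int ret;

        /* Grow the table so it covers the enlarged data area plus the
         * relocated vm state entries */
        ret = qcow2_grow_l1_table(bs, new_l1_vm_state_index + nb_vm_state);
        if (ret < 0) {
            return ret;
        }

        /* Slide the vm state entries up to the end of the new data area
         * and clear the gap they leave behind */
        memmove(&s->l1_table[new_l1_vm_state_index],
                &s->l1_table[s->l1_vm_state_index],
                nb_vm_state * sizeof(uint64_t));
        memset(&s->l1_table[s->l1_vm_state_index], 0,
               (new_l1_vm_state_index - s->l1_vm_state_index) *
               sizeof(uint64_t));
        s->l1_vm_state_index = new_l1_vm_state_index;
        return 0;
    }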

If I understand correctly, the next L1 table size calculation tries to
grow the table in steps large enough so that the grow operation does
not need to be performed frequently.  This makes sense when appending
arbitrary vm state data, but the truncate operation knows the exact
new size of the image and doesn't need to speculatively allocate more
L1 table space.
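
For comparison, this is the speculative sizing loop (condensed from the
patch); from a single entry it grows through 1, 2, 3, 5, 8, 12, 18, ...:

    int new_l1_size = s->l1_size ? s->l1_size : 1;
    while (min_size > new_l1_size) {
        new_l1_size = (new_l1_size * 3 + 1) / 2;
    }

qcow2_truncate() already knows the final size, so it can pass
size_to_l1(s, offset) directly.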

>> +
>> +    /* write updated header.size */
>> +    offset = cpu_to_be64(offset);
>> +    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, size), &offset,
>> +                    sizeof(uint64_t)) != sizeof(uint64_t)) {
>> +        return -EIO;
>> +    }
>
> The vmstate_offset field needs to be updated somewhere, too. In my
> suggestion this would be in qcow2_move_vmstate.

Which field do you mean?  I can't find a vmstate_offset field.  The
patch updates s->l1_vm_state_index in qcow2_grow_l1_table_common(); is
that the one?

Stefan



