qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 1/2] qemu-img convert: Rewrite copying logic


From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH 1/2] qemu-img convert: Rewrite copying logic
Date: Thu, 19 Feb 2015 16:47:26 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

Am 18.02.2015 um 18:18 hat Max Reitz geschrieben:
> On 2015-02-18 at 10:19, Kevin Wolf wrote:
> >The implementation of qemu-img convert is (a) messy, (b) buggy, and
> >(c) less efficient than possible. The changes required to beat some
> >sense into it are massive enough that incremental changes would only
> >make my and the reviewers' life harder. So throw it away and reimplement
> >it from scratch.
> >
> >Let me give some examples what I mean by messy, buggy and inefficient:
> >
> >(a) The copying logic of qemu-img convert has two separate branches for
> >     compressed and normal target images, which roughly do the same -
> >     except for a little code that handles actual differences between
> >     compressed and uncompressed images, and much more code that
> >     implements just a different set of optimisations and bugs. This is
> >     unnecessary code duplication, and makes the code for compressed
> >     output (unsurprisingly) suffer from bitrot.
> >
> >     The code for uncompressed ouput is run twice to count the the total
> >     length for the progress bar. In the first run it just takes a
> >     shortcut and runs only half the loop, and when it's done, it toggles
> >     a boolean, jumps out of the loop with a backwards goto and starts
> >     over. Works, but pretty is something different.
> >
> >(b) Converting while keeping a backing file (-B option) is broken in
> >     several ways. This includes not writing to the image file if the
> >     input has zero clusters or data filled with zeros (ignoring that the
> >     backing file will be visible instead).
> >
> >     It also doesn't correctly limit every iteration of the copy loop to
> >     sectors of the same status so that too many sectors may be copied to
> >     in the target image. For -B this gives an unexpected result, for
> >     other images it just does more work than necessary.
> >
> >     Conversion with a compressed target completely ignores any target
> >     backing file.
> >
> >(c) qemu-img convert skips reading and writing an area if it knows from
> >     metadata that copying isn't needed (except for the bug mentioned
> >     above that ignores a status change in some cases). It does, however,
> >     read from the source even if it knows that it will read zeros, and
> >     then search for non-zero bytes in the read buffer, if it's possible
> >     that a write might be needed.
> >
> >This reimplementation of the copying core reorganises the code to remove
> >the duplication and have a much more obvious code flow, by essentially
> >splitting the copy iteration loop into three parts:
> >
> >1. Find the number of contiguous sectors of the same status at the
> >    current offset (This can also be called in a separate loop for the
> >    copying loop in order to determine the total sectors for the progress
> >    bar.)
> >
> >2. Read sectors. If the status implies that there is no data there to
> >    read (zero or unallocated cluster), don't do anything.
> >
> >3. Write sectors depending on the status. If it's data, write it. If
> >    we want the backing file to be visible (with -B), don't write it. If
> >    it's zeroed, skip it if you can, otherwise use bdrv_write_zeroes() to
> >    optimise the write at least where possible.
> >
> >Signed-off-by: Kevin Wolf <address@hidden>
> >---
> >  qemu-img.c | 511 
> > ++++++++++++++++++++++++++++++++++++-------------------------
> >  1 file changed, 305 insertions(+), 206 deletions(-)
> 
> You have been very careful not to write "Rewrite img_convert()" or
> something like that in the subject, so I can't really complain that
> there are still bugs in the function (which are not related to the
> copying logic), for instance:
> 
> $ ./qemu-img create -f qcow2 i1.qcow2 64M
> Formatting 'i1.qcow2', fmt=qcow2 size=67108864 encryption=off
> cluster_size=65536 lazy_refcounts=off
> $ ./qemu-img create -f qcow2 i2.qcow2 64M
> Formatting 'i2.qcow2', fmt=qcow2 size=67108864 encryption=off
> cluster_size=65536 lazy_refcounts=off
> $ ./qemu-img snapshot -c foo i1.qcow2
> $ ./qemu-img snapshot -c foo i2.qcow2
> $ ./qemu-io -c 'write 0 64M' i2.qcow2
> wrote 67108864/67108864 bytes at offset 0
> 64 MiB, 1 ops; 0:00:01.32 (48.152 MiB/sec and 0.7524 ops/sec)
> $ ./qemu-img convert -l snapshot.name=foo -O qcow2 i{1,2}.qcow2 o.qcow2
> qemu-img: error while reading sector 131072: Input/output error
> 
> ("No support for concatenating multiple snapshot" should be enforced
> for sn_opts != NULL)

Probably worth addressing, though not in this patch.

> >diff --git a/qemu-img.c b/qemu-img.c
> >index e148af8..5c8386e 100644
> >--- a/qemu-img.c
> >+++ b/qemu-img.c
> >@@ -1306,20 +1306,307 @@ out3:
> >      return ret;
> >  }
> >+enum ImgConvertBlockStatus {
> >+    BLK_DATA,
> >+    BLK_ZERO,
> >+    BLK_BACKING_FILE,
> >+};
> >+
> >+typedef struct ImgConvertState {
> >+    BlockDriverState **src;
> >+    int64_t *src_sectors;
> >+    int src_cur, src_num;
> >+    int64_t src_cur_offset;
> >+    int64_t total_sectors;
> >+    int64_t allocated_sectors;
> >+    enum ImgConvertBlockStatus status;
> >+    int64_t sector_next_status;
> >+    BlockDriverState *target;
> >+    bool has_zero_init;
> >+    bool compressed;
> >+    bool out_backing;
> 
> Maybe "out_backed" (or "out_is_backed" or "out_has_backing") would
> be better; "out_backing" to me sounds like this should describe the
> backing file.

Okay, will rename as target_has_backing.

> >+    int min_sparse;
> >+    size_t cluster_sectors;
> >+    size_t buf_sectors;
> >+} ImgConvertState;
> >+
> >+static void convert_select_part(ImgConvertState *s, int64_t sector_num)
> >+{
> >+    while (sector_num - s->src_cur_offset >= s->src_sectors[s->src_cur]) {
> >+        s->src_cur_offset += s->src_sectors[s->src_cur];
> >+        s->src_cur++;
> >+        assert(s->src_cur < s->src_num);
> >+    }
> >+}
> >+
> >+static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
> >+{
> >+    int64_t ret;
> >+    int n;
> >+
> >+    convert_select_part(s, sector_num);
> >+
> >+    assert(s->total_sectors > sector_num);
> >+    n = MIN(s->total_sectors - sector_num, INT_MAX);
> 
> Maybe it would be better to use INT_MAX / BDRV_SECTOR_SIZE here (and
> in other places, too)... In practice, this would only be relevant to
> reads and writes, though, and those are limited by s->buf_sectors.

I'll make that BDRV_REQUEST_MAX_SECTORS.

> >+
> >+    if (s->sector_next_status <= sector_num) {
> >+        ret = bdrv_get_block_status(s->src[s->src_cur],
> >+                                    sector_num - s->src_cur_offset,
> >+                                    n, &n);
> >+        if (ret < 0) {
> >+            return ret;
> >+        }
> >+
> >+        if (ret & BDRV_BLOCK_ZERO) {
> >+            s->status = BLK_ZERO;
> >+        } else if (ret & BDRV_BLOCK_DATA) {
> >+            s->status = BLK_DATA;
> >+        } else if (!s->out_backing) {
> >+            /* Without a target backing file we must copy over the contents 
> >of
> >+             * the backing file as well. */
> >+            /* TODO Check block status of the backing file chain to avoid
> >+             * needlessly reading zeroes and limiting the iteration to the
> >+             * buffer size */
> >+            s->status = BLK_DATA;
> >+        } else {
> >+            s->status = BLK_BACKING_FILE;
> >+        }
> >+
> >+        s->sector_next_status = sector_num + n;
> >+    }
> >+
> >+    n = MIN(n, s->sector_next_status - sector_num);
> >+    if (s->status == BLK_DATA) {
> >+        n = MIN(n, s->buf_sectors);
> >+    }
> >+
> >+    /* We need to write complete clusters for compressed images, so if an
> >+     * unallocated area is shorter than that, we must consider the whole
> >+     * cluster allocated. */
> >+    if (s->compressed) {
> >+        if (n < s->cluster_sectors) {
> >+            n = MIN(s->cluster_sectors, s->total_sectors - sector_num);
> >+            s->status = BLK_DATA;
> >+        } else {
> >+            n = QEMU_ALIGN_DOWN(n, s->cluster_sectors);
> >+        }
> >+    }
> >+
> >+    return n;
> >+}
> >+
> >+static int convert_read(ImgConvertState *s, int64_t sector_num, int 
> >nb_sectors,
> >+                        uint8_t *buf)
> >+{
> >+    int n;
> >+    int ret;
> >+
> >+    if (s->status == BLK_ZERO || s->status == BLK_BACKING_FILE) {
> >+        return 0;
> >+    }
> >+
> >+    assert(nb_sectors <= s->buf_sectors);
> >+    while (nb_sectors > 0) {
> >+        BlockDriverState *bs;
> >+        int64_t bs_sectors;
> >+
> >+        /* In the case of compression with multiple source files, we can 
> >get a
> >+         * nb_sectors that spreads into the next part. So we must be able to
> >+         * read across multiple BDSes for one convert_read() call. */
> >+        convert_select_part(s, sector_num);
> >+        bs = s->src[s->src_cur];
> >+        bs_sectors = s->src_sectors[s->src_cur];
> >+
> >+        n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));
> >+        ret = bdrv_read(bs, sector_num, buf, n);
> 
> Shouldn't this be sector_num - s->src_cur_offset?

I suppose it should be, thanks for catching this. Not a good sign with
respect to our qemu-iotests coverage.

> >+        if (ret < 0) {
> >+            return ret;
> >+        }
> >+
> >+        sector_num += n;
> >+        nb_sectors -= n;
> >+        buf += n * BDRV_SECTOR_SIZE;
> >+    }
> >+
> >+    return 0;
> >+}
> >+
> >+static int convert_write(ImgConvertState *s, int64_t sector_num, int 
> >nb_sectors,
> >+                         const uint8_t *buf)
> >+{
> >+    int ret;
> >+
> >+    while (nb_sectors > 0) {
> >+
> 
> This empty line looks a bit strange to me...

Removed.

> >+        int n = nb_sectors;
> >+
> >+        switch (s->status) {
> >+        case BLK_BACKING_FILE:
> >+            /* If we have a backing file, leave clusters unallocated that 
> >are
> >+             * unallocated in the source image, so that the backing file is
> >+             * visible at the respective offset. */
> >+            assert(s->out_backing);
> >+            break;
> >+
> >+        case BLK_DATA:
> >+            /* We must always write compressed clusters as a whole, so don't
> >+             * try to find zeroed parts in the buffer. We can only save the
> >+             * write if the buffer is completely zeroed. */
> 
> But you shouldn't be doing that if n < s->min_sparse, I think (if
> people specify -S 0, they don't want you to zero anything, and
> that's guaranteed by the man page). Considering that
> is_allocated_sectors_min() basically has a min = MIN(min, n) part,
> too, I'd be fine with making -S 0 a special case here ("if
> (s->has_zero_init && s->min_sparse && buffer_is_zero(...))").

Hm, okay. With a compressed target n isn't variable, so there's only
always or never, depending on min_sparse < cluster_size. But I can do
it.

I'm actually not really sure what the use case for -S with a compressed
target would be.

> Actually, now that I'm looking at is_allocated_sectors_min(), if the
> first sectors is not allocated, it will return false and thus both
> the current qemu-img convert and the new one after this patch won't
> write data. That's a bug, I think (because it is guaranteed by the
> man page).

Or the description in the man page is wrong.

The intention with -S was that we avoid splitting up writes into too
many small chunks because that costs a lot of time. If you look at it
from that angle, it's doing exactly the right thing because skipping
zeros at the start doesn't increase the number of write requests.

> >+            if (s->compressed) {
> >+                if (s->has_zero_init &&
> >+                    buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
> >+                {
> >+                    assert(!s->out_backing);
> >+                    break;
> >+                }

I wonder whether !s->has_zero_init shouldn't actually treat it as
BLK_ZERO. That should "compress" even better than writing a gzip
compressed cluster of zeros.

> >+                ret = bdrv_write_compressed(s->target, sector_num, buf, n);
> >+                if (ret < 0) {
> >+                    return ret;
> >+                }
> >+                break;
> >+            }
> >+
> >+            /* If there is real non-zero data, we must write it. Otherwise 
> >we
> >+             * can treat it as zero sectors. */
> >+            if (is_allocated_sectors_min(buf, n, &n, s->min_sparse)) {
> 
> So I think this should be "if (!s->min_sparse || ...)".

I'd rather change is_allocated_sectors_min(). But only if there is a use
case, otherwise we should fix the description in the man page.

> >+                ret = bdrv_write(s->target, sector_num, buf, n);
> >+                if (ret < 0) {
> >+                    return ret;
> >+                }
> >+                break;
> >+            }
> >+            /* fall-through */
> >+
> >+        case BLK_ZERO:
> >+            if (s->has_zero_init) {
> 
> And I don't really know what to do about this. Should we write zeros
> if !s->min_sparse? Or is this a special case and the user can't
> expect qemu-img convert to really write zeros if the source image
> contained unallocated/zero clusters? That would make sense, but once
> again, the man page clearly states that "If sparse_size is 0, […]
> the destination image will always be fully allocated." Maybe we
> could argue that "fully" in this case means "as fully as the source
> image was allocated".

FWIW, bdrv_is_allocated() considers zero clusters allocated. ;-)

I don't think that -S 0 should mean full preallocation. If you want
that, I think you can get it with -o (at the cost of writing some parts
of the image twice).

> >+                break;
> >+            }
> >+            /* TODO Should this use BDRV_REQ_MAY_UNMAP? */
> 
> Can't we remove this comment? If BDRV_REQ_MAP_UNMAP results in zeros
> read, bdrv_make_zero() below will have been invoked and thus
> s->has_zero_init = true (unless you take my suggestion and leave
> s->has_zero_init false if bdrv_make_zero() failed, but I wouldn't
> really mind that case here).

Good point.

(I wanted to remove the comment either way, it was just there to get
input during the review.)

> >+            ret = bdrv_write_zeroes(s->target, sector_num, n, 0);
> >+            if (ret < 0) {
> >+                return ret;
> >+            }
> >+            break;
> >+        }
> >+
> >+        sector_num += n;
> >+        nb_sectors -= n;
> >+        buf += n * BDRV_SECTOR_SIZE;
> >+    }
> >+
> >+    return 0;
> >+}
> >+
> >+static int convert_do_copy(ImgConvertState *s)
> >+{
> >+    uint8_t *buf = NULL;
> >+    int64_t sector_num, allocated_done;
> >+    int ret;
> >+    int n;
> >+
> >+    /* Check whether we have zero initialisation or can get it efficiently 
> >*/
> >+    s->has_zero_init = s->min_sparse && !s->out_backing
> >+                     ? bdrv_has_zero_init(s->target)
> >+                     : false;
> >+
> >+    if (!s->has_zero_init && !s->out_backing &&
> >+        bdrv_can_write_zeroes_with_unmap(s->target))
> >+    {
> >+        ret = bdrv_make_zero(s->target, BDRV_REQ_MAY_UNMAP);
> >+        if (ret < 0) {
> >+            goto fail;
> 
> I think just leaving s->has_zero_init false should be fine, too
> (instead of failing).
> 
> >+        }
> >+        s->has_zero_init = 1;
> 
> s/1/true/

Yes.

> >+    }
> >+
> >+    /* Allocate buffer for copied data. For compressed images, only one 
> >cluster
> >+     * can be copied at a time. */
> >+    if (s->compressed) {
> >+        if (s->cluster_sectors <= 0 || s->cluster_sectors > s->buf_sectors) 
> >{
> >+            error_report("invalid cluster size");
> >+            ret = -EINVAL;
> >+            goto fail;
> >+        }
> >+        s->buf_sectors = s->cluster_sectors;
> >+    }
> >+    buf = qemu_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
> >+
> >+    /* Calculate allocated sectors for progress */
> >+    s->allocated_sectors = 0;
> >+    sector_num = 0;
> >+    while (sector_num < s->total_sectors) {
> >+        n = convert_iteration_sectors(s, sector_num);
> >+        if (n < 0) {
> >+            ret = n;
> >+            goto fail;
> >+        }
> >+        if (s->status == BLK_DATA) {
> >+            s->allocated_sectors += n;
> >+        }
> >+        sector_num += n;
> >+    }
> >+
> >+    /* Do the copy */
> >+    s->src_cur = 0;
> >+    s->src_cur_offset = 0;
> >+    s->sector_next_status = 0;
> >+
> >+    sector_num = 0;
> >+    allocated_done = 0;
> >+
> >+    while (sector_num < s->total_sectors) {
> >+        n = convert_iteration_sectors(s, sector_num);
> >+        if (n < 0) {
> >+            ret = n;
> >+            goto fail;
> >+        }
> >+        if (s->status == BLK_DATA) {
> >+            allocated_done += n;
> >+            qemu_progress_print(100.0 * allocated_done / 
> >s->allocated_sectors,
> >+                                0);
> >+        }
> >+
> >+        ret = convert_read(s, sector_num, n, buf);
> >+        if (ret < 0) {
> >+            error_report("error while reading sector %" PRId64
> >+                         ": %s", sector_num, strerror(-ret));
> >+            goto fail;
> >+        }
> >+
> >+        ret = convert_write(s, sector_num, n, buf);
> >+        if (ret < 0) {
> >+            error_report("error while writing sector %" PRId64
> >+                         ": %s", sector_num, strerror(-ret));
> >+            goto fail;
> >+        }
> >+
> >+        sector_num += n;
> >+    }
> >+
> >+    if (s->compressed) {
> >+        /* signal EOF to align */
> >+        bdrv_write_compressed(s->target, 0, NULL, 0);
> 
> Is there a reason for ignoring the return value other than "That's
> how img_convert() used to do it"?

No. Isn't that one good enough? ;-)

So the code in qcow2 says this:

    if (nb_sectors == 0) {
        /* align end of file to a sector boundary to ease reading with
           sector based I/Os */
        cluster_offset = bdrv_getlength(bs->file);
        return bdrv_truncate(bs->file, cluster_offset);
    }

I don't think we have any such restrictions any more, so it's mostly
useless. Perhaps ancient qemu versions would fail to read such an image,
but recent ones shouldn't.

In fact, our bdrv_pwrite() currently maps to sector-aligned functions in
the protocol driver, so I think at the moment we already get the
alignment automatically. This might change again once we convert block
drivers to byte offsets.

Kevin



reply via email to

[Prev in Thread] Current Thread [Next in Thread]