Re: [Qemu-devel] [PATCH v5 1/3] qcow2: Add qcow2_shrink_l1_and_l2_table


From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v5 1/3] qcow2: Add qcow2_shrink_l1_and_l2_table for qcow2 shrinking
Date: Thu, 22 Jan 2015 14:14:06 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 2015-01-19 at 08:16, Jun Li wrote:
On Thu, 01/15 13:47, Max Reitz wrote:
On 2015-01-03 at 07:23, Jun Li wrote:
On Fri, 11/21 11:56, Max Reitz wrote:
So, as for what I think we do need to do when shrinking (and keep in mind:
The offset given to qcow2_truncate() is the guest size! NOT the host image
size!):

(1) Determine the first L2 table and the first entry in the table which will
lie beyond the new guest disk size.
This is not always correct. Due to COW, using the offset to calculate the
first entry of the first L2 table can give the wrong result.
Again: This is *not* about the host disk size or the host offset of some
cluster, but about the *guest* disk size.

Let's make up an example. You have a 2 GB disk but you want to resize it to
1.25 GB. The cluster size is 64 kB, therefore we have 2 GB / 64 kB = 32,768
data clusters (as long as there aren't any internal snapshots, which is a
prerequisite for resizing qcow2 images).

Every L2 table contains 65,536 / 8 = 8,192 entries; there are thus 32,768 /
8,192 = 4 L2 tables.

As you can see, one can directly derive the number of data clusters and L2
tables from the guest disk size (as long as there aren't any internal
snapshots).

So of course we can do the same for the target disk size: 1.25 GB / 64 kB =
20,480 data clusters; 20,480 / 8,192 = 2.5 L2 tables, therefore we need
three L2 tables but only half of the last one (4,096 entries).
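The arithmetic above can be checked mechanically. Here is a small bash sketch (numbers taken straight from the example: 64 kB clusters, 8-byte L2 entries):

```shell
cluster_size=65536                            # 64 kB clusters, as in the example
l2_entries=$((cluster_size / 8))              # 8,192 entries per L2 table

old_size=$((2 * 1024 * 1024 * 1024))          # 2 GB guest disk
old_clusters=$((old_size / cluster_size))     # 32,768 data clusters
echo "old L2 tables: $(( (old_clusters + l2_entries - 1) / l2_entries ))"   # 4

new_size=1342177280                           # 1.25 GB guest disk
new_clusters=$((new_size / cluster_size))     # 20,480 data clusters
echo "new L2 tables: $(( (new_clusters + l2_entries - 1) / l2_entries ))"   # 3
```

The rounding up in the last line is exactly the "2.5 L2 tables, therefore we need three" step: the third table exists but only its first 4,096 entries are in use.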

Sorry, last time was my misunderstanding. If we do not use qcow2_truncate(),
I think the above issue does not exist.

What I originally meant to say is:
Sometimes the second L2 table will contain an entry whose pointer points to a
cluster whose host address is larger than 1.25 GB.

Correct.

So if we do not use qcow2_truncate(), the clusters whose host addresses are
larger than 1.25 GB will not be discarded.

I'm sorry, I can't really follow what you are trying to say here, so I'll just try to reply with things that may or may not be what you wanted to talk about.

If you are using qemu-img resize and thus subsequently qcow2_truncate() to shrink an image, you cannot expect the image to shrink to the specified file length, for several reasons.

First, if you shrink it to 1 GB, but only half of that is actually used, the image might of course very well have a length below 1 GB.

Second, there is metadata overhead. So if you are changing the guest disk size to 1 GB (all of which is occupied), the host file size will exceed 1 GB because of that overhead.

Third, I keep repeating myself here, but file length is not file size. So you may observe a file length of 10 GB or more because the clusters are spread all over the image file. This is something we'd have to combat with defragmentation; but the question is whether we really need to (see below for more on that). The point is that it doesn't matter whether the image has a file length of 10 GB; the file size will be around 1 GB anyway.
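The length/size distinction is not qcow2-specific; any sparse file shows it. A quick demonstration with plain coreutils (the file name is made up for the demo):

```shell
# A sparse file: large length, almost no actual size.
truncate -s 10G sparse-demo.img
ls -l sparse-demo.img    # reports the file *length*: 10 GB
du -h sparse-demo.img    # reports the file *size* (allocated blocks): near zero
rm sparse-demo.img
```

A qcow2 file with discarded clusters behaves the same way: the length stays large, but the blocks backing the discarded clusters are no longer allocated.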

But I still have another worry.

Suppose "virtual size" and "disk size" are both 2 GB. After we resize the image
to 1.25 GB, it seems we will get a "virtual size" of 1.25 GB while the "disk
size" is still 2 GB

No, it won't. I can prove it to you:

$ qemu-img create -f qcow2 test.qcow2 64M
$ qemu-io -c 'write 0 64M' test.qcow2
$ qemu-img info test.qcow2
...
disk size: 64M
...

Okay, so far it's just what we'd expect. Now let's implement my proposal for truncation: Let's assume the image should be shrunk to 32 MB, so we discard all clusters starting at 32 MB (guest offset) (which is 64 MB - 32 MB = 32 MB of data):

$ qemu-io -c 'discard 32M 32M' test.qcow2
$ qemu-img info test.qcow2
...
disk size: 32M
...

Great!

if we do not use qcow2_truncate() to truncate the file (yes, I know that using
qcow2_truncate() is not a solution). This seems strange, not quite perfect.

We know that every cluster references somewhere after that limit (that is,
every entry in the fourth L2 table and every entry starting with index 4,096
in the third L2 table) is a data cluster with a guest offset somewhere
beyond 1.25 GB, so we don't need it anymore.
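Where exactly that limit falls follows from the same arithmetic as before. A small bash sketch (same assumed geometry: 64 kB clusters, 8,192 entries per L2 table):

```shell
cluster_size=65536
l2_entries=8192
new_size=1342177280                           # 1.25 GB guest disk
first_unused=$((new_size / cluster_size))     # first guest cluster to drop: 20,480
l2_table=$((first_unused / l2_entries))       # table index 2, i.e. the third L2 table
l2_entry=$((first_unused % l2_entries))       # entry index 4,096 within that table
echo "cut at L2 table $l2_table, entry $l2_entry"
```

Everything at or beyond that cut point (the rest of the third table, plus all of the fourth) maps guest offsets beyond the new disk size.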

Thus, we simply discard all those data clusters and after that we can
discard the fourth L2 table. That's it.

If we really want to we can calculate the highest cluster host offset in use
and truncate the image accordingly. But that's optional, see the last point
in my "problems with this approach" list (having discarded the clusters
should save us all the space already). Furthermore, as I'm saying in that
list, to really solve this issue, we'd need qcow2 defragmentation.

Do we already have a "qcow2 defragmentation" implementation?

No, we don't. The only way to defragment a qcow2 image right now is using qemu-img convert to create a (defragmented) copy and then delete the old image, which has the disadvantage of temporarily requiring double the disk space and being an offline operation.

So far, nobody has implemented online defragmentation, mainly for two reasons: First, it would probably be pretty complicated (it'd probably need to be a block job which links into a pretty low-level function provided by qcow2 (defragment_some_clusters or something)); and second, so far there has been little demand. Disk space is not an issue (as said before), because it doesn't really matter to a modern file system whether your file has a length of 100 MB or 100 GB; that's just some number. What really matters is how much of that space is actually used; and if all unused clusters are discarded, there won't be any space used for them (well, maybe there is some metadata overhead, but that should be negligible).

There are a couple of reasons why you'd want to defragment an image:

First, it makes you feel better. I can relate to that, but it's not a real reason.

Second, it may improve performance: The guest may expect consecutive reads to be fast; but if the clusters are sprinkled all over the host, consecutive guest reads no longer necessarily translate to consecutive reads on the host (same for writes, of course). Defragmentation would probably fix that, but if you want to rely on this, you'd better use preallocated image files.

Third, it looks better. People expect the file length to be a direct indicator of the file size. However, for me this is related to "it makes you feel better", because this also is not a really good reason.

Fourth, using a non-modern file system may let your file size explode because suddenly, file length is actually equal to the file size. But I think, in this case you should just use a better file system.

I don't know whether "cp" copies holes in files; its manpage says it can create sparse copies, but I don't know how well that works; I just assume it works well enough.
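For what it's worth, GNU cp has an explicit knob for this; a quick check (GNU coreutils assumed, file names made up for the demo):

```shell
truncate -s 1G sparse-src.img                 # 1 GB of holes, no data
cp --sparse=always sparse-src.img sparse-dst.img
du -h sparse-src.img sparse-dst.img           # both should report (almost) nothing
rm sparse-src.img sparse-dst.img
```

With --sparse=always, cp punches holes in the destination wherever the source reads back as zeros, so even a copy of a fully-allocated-but-zero file comes out sparse.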

Max

Jun Li


