[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-alloc
From: |
John Snow |
Subject: |
Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images |
Date: |
Wed, 21 Feb 2018 13:32:34 -0500 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 |
On 02/21/2018 10:59 AM, Eric Blake wrote:
> On 02/21/2018 09:00 AM, Eric Blake wrote:
>> On 02/21/2018 04:04 AM, Alberto Garcia wrote:
>>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>>>
>>> I was also preparing a patch to change this, but you arrived first :-)
>>>
>>>> So, it's time to cut back on the waste. A compressed cluster
>>>> will NEVER occupy more than an uncompressed cluster
>
>
>> And here, csize can only get smaller than the length picked by
>> nb_csectors. Therefore, csize is GUARANTEED to be <= c->sector_size.
>>
>>>
>>> - Solution a: check that csize < s->cluster_size and return an error if
>>> it's not. However! although QEMU won't produce an image with a
>>> compressed cluster that is larger than the uncompressed one, the
>>> qcow2
>>> on-disk format in principle allows for that, so arguably we should
>>> accept it.
>>
>> No, the qcow2 on-disk format does not have enough bits reserved for
>> that. It is impossible to store an inflated cluster, because you
>> don't have enough bits to record it.
>
> Okay, the spec is WRONG, compared to our code base.
>
>>
>> That said, we MAY have a bug, more likely to be visible in 512-byte
>> clusters but possible on any size. While the 'number sectors' field
>> IS sufficient for any compressed cluster starting at offset 0 in the
>> cluster, we run into issues if the starting offset is later in the
>> cluster. That is, even though the compressed length of a 512-byte
>> cluster is <= 512 (or we wouldn't compress it), if we start at offset
>> 510, we NEED to read the next cluster to get the rest of the
>> compressed stream - but at 512-byte clusters, there are 0 bits
>> reserved for 'number sectors'. Per the above calculations with the
>> example offset of 510, nb_csectors would be 1 (it can't be anything
>> else for a 512-byte cluster image), and csize would then be 2 bytes,
>> which is insufficient for reading back enough data to reconstruct the
>> cluster.
>
> In fact, here's a demonstration of a discrepancy, where qemu-img and
> John's qcheck tool [1] disagree about the validity of an image created
> by qemu (although it may just be that qcheck hasn't yet learned about
> compressed clusters):
>
> [1]https://github.com/jnsnow/qcheck
>
I wouldn't trust my tool's ability to understand compressed clusters :)
I didn't get very far, though I did run across the fact that there
appeared to be a discrepancy between the spec and the actual
implementation, IIRC.
It looked like you came to the same conclusion when you stepped through
it manually.
> $ f=12345678
> $ f=$f$f$f$f # 32
> $ f=$f$f$f$f # 128
> $ f=$f$f$f$f # 512
> $ f=$f$f$f$f # 2k
> $ f=$f$f$f$f # 8k
> $ f=$f$f$f$f # 32k
> $ f=$f$f$f$f # 128k
> $ printf "$f" > data
> $ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
> data data.qcow2
> $ qemu-img check data.qcow2
> No errors were found on the image.
> 256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed
> clusters
> Image end offset: 18944
> $ ./qcheck data.qcow2
> ...
> == L2 Tables ==
>
> == L2 cluster l1[0] : 0x0000000000000800 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[1] : 0x0000000000000e00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[2] : 0x0000000000001400 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[3] : 0x0000000000001a00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> L2 tables: Could not complete analysis, 257 problems found
>
>
> == Reference Count Analysis ==
>
> Refcount analysis: 00 vacant clusters
> Refcount analysis: 04 leaked clusters
> Refcount analysis: 00 ghost clusters
> Refcount analysis: 04 miscounted clusters
> Refcount analysis: 8 problems found
>
>
> == Cluster Counts ==
>
> Metadata: 0x1000
> Data: 0x800
> Leaked: 0x800
> Vacant: 0x0
> total: 0x2000
> qcheck: 73 problems found
>
>
>>
>> Not true. It is (cluster_bits - 9) or (cluster_size / 512).
>> Remember, x = 62 - (cluster_bits - 8); for a 512-byte cluster, x =
>> 61. The 'number sectors' field is then bits x+1 - 61 (but you can't
>> have a bitfield occupying bit 62 upto 61; especially since bit 62 is
>> the bit for compressed cluster).
>
> So instead of blindly reading the spec, I'm now going to single-stepping
> through the 'qemu-img convert' command line above, to see what REALLY
> happens:
>
> Line numbers from commit a6e0344fa0
> $ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512
> data data.qcow2
> ...
> (gdb) b qcow2_alloc_bytes
> Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
> (gdb) r
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
> address@hidden, address@hidden)
> at block/qcow2-refcount.c:1052
> 1052 {
> (gdb)
>
> So we are compressing 512 bytes down to 15 every time, which means that
> after 34 clusters compressed, we should be at offset 510. Let's resume
> debugging:
>
> (gdb) c 34
> Will ignore next 33 crossings of breakpoint 1. Continuing.
> [Thread 0x7fffe3cfe700 (LWP 32229) exited]
> [New Thread 0x7fffe3cfe700 (LWP 32300)]
> [New Thread 0x7fffe25ed700 (LWP 32301)]
>
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
> address@hidden, address@hidden)
> at block/qcow2-refcount.c:1052
> 1052 {
> (gdb) n
> 1053 BDRVQcow2State *s = bs->opaque;
> (gdb)
> 1058 BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
> (gdb)
> 1059 assert(size > 0 && size <= s->cluster_size);
> (gdb) p s->free_byte_offset
> $2 = 3070
> (gdb) p 3070%512
> $3 = 510
> ...
> (gdb)
> 1076 free_in_cluster = s->cluster_size - offset_into_cluster(s,
> offset);
> (gdb)
> 1078 if (!offset || free_in_cluster < size) {
> (gdb) p free_in_cluster
> $4 = 2
> 1079 int64_t new_cluster = alloc_clusters_noref(bs,
> s->cluster_size);
> (gdb)
> 1080 if (new_cluster < 0) {
> (gdb)
> 1084 if (new_cluster == 0) {
> (gdb)
> 1091 if (!offset || ROUND_UP(offset, s->cluster_size) !=
> new_cluster) {
> (gdb)
> 1095 free_in_cluster += s->cluster_size;
> (gdb)
> 1099 assert(offset);
>
> so we got a contiguous cluster, and our goal is to let the caller bleed
> the compressed cluster into to the tail of the current sector and into
> the head of the next cluster. Continuing:
>
> (gdb) fin
> Run till exit from #0 qcow2_alloc_bytes (address@hidden,
> address@hidden) at block/qcow2-refcount.c:1118
> [Thread 0x7fffe25ed700 (LWP 32301) exited]
> [Thread 0x7fffe3cfe700 (LWP 32300) exited]
> qcow2_alloc_compressed_cluster_offset (address@hidden,
> address@hidden, address@hidden)
> at block/qcow2-cluster.c:768
> 768 if (cluster_offset < 0) {
> Value returned is $5 = 3070
>
> (gdb) n
> 773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 774 (cluster_offset >> 9);
> (gdb)
> 773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 777 ((uint64_t)nb_csectors << s->csize_shift);
> (gdb) l
> 772
> 773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> 774 (cluster_offset >> 9);
> 775
> 776 cluster_offset |= QCOW_OFLAG_COMPRESSED |
> 777 ((uint64_t)nb_csectors << s->csize_shift);
> 778
> 779 /* update L2 table */
> 780
> 781 /* compressed clusters never have the copied flag */
> (gdb) p nb_csectors
> $6 = 1
> (gdb) p s->csize_shift
> $7 = 61
> (gdb) p/x cluster_offset
> $8 = 0xbfe
> (gdb) n
> 776 cluster_offset |= QCOW_OFLAG_COMPRESSED |
> (gdb)
> 783 BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
> (gdb) p/x cluster_offset
> $9 = 0x6000000000000bfe
>
> Where is s->csize_shift initialized? In qcow2_do_open():
>
> s->csize_shift = (62 - (s->cluster_bits - 8));
> s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
> s->cluster_offset_mask = (1LL << s->csize_shift) - 1;
>
> Revisiting the wording in the spec:
>
> Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
>
> Bit 0 - x: Host cluster offset. This is usually _not_ aligned to a
> cluster boundary!
>
> x+1 - 61: Compressed size of the images in sectors of 512 bytes
>
> which says bits 0-61 are the host cluster offset, and 62-61 is the
> number of sectors. But our code sets s->csize_shift to divide this
> differently, at 0-60 and 61-61. Which means your earlier claim that
> there are enough 'number sector' bits to allow for up to 2*cluster_size
> as the size of the compressed stream (rather than my claim of exactly
> cluster_size) is right, and other implementations CAN inflate a cluster
> (if we don't document otherwise), and that even if they DON'T inflate,
> they can STILL cause a read larger than a cluster size when the offset
> is near the tail of one sector (most likely at 512-byte clusters, but
> remotely possible at other cluster sizes as well).
>
--
—js