[Lzip-bug] Optionally exit with nonzero status if trailing garbage
From: Antonio Diaz Diaz
Subject: [Lzip-bug] Optionally exit with nonzero status if trailing garbage
Date: Tue, 04 Aug 2015 20:03:46 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.8.1.4) Gecko/20070601 SeaMonkey/1.1.2
Jakub Wilk wrote:
>> "Lzip will correctly decompress a file which is the concatenation
>> of two or more compressed files. The result is the concatenation of
>> the corresponding uncompressed files. Integrity testing of
>> concatenated compressed files is also supported."
>> Whatever follows a file that is not a valid header is classified as
>> "trailing garbage" and ignored.
> Sounds like a serious design flaw that could lead to data loss.
You are about right. IMHO, this is a (not so) serious design flaw of
gzip, improved somewhat by bzip2 and lzip, worsened by xz, but never
properly addressed. Except in the case of xz (see below), this "flaw" is
not related to any format, but only to what the decompressor should do
in a situation that may involve either a corrupt header or just trailing
garbage.
> The attached files differ only by one bit. The output for the
> corrupted file is truncated, yet there is no error or warning:
Just use "lzip -vvvv" to see the warning:
  "When decompressing or testing, further -v's (up to 4) increase the
  verbosity level, showing status, compression ratio, dictionary
  size, trailer contents (CRC, data size, member size), and up to 6
  bytes of trailing garbage (if any)."
BTW, lzip is the only one that shows the "trailing garbage", allowing
you to determine whether it is really garbage or not. In this case the
"garbage" is awfully similar to an lzip signature (4C 5A 49 50):
$ lzip -tvvvv corrupted.lz
corrupted.lz: dictionary size 4 KiB. 0.100:1, 80.000 bits/byte,
-900.00% saved. data CRC 7E3265A8, data size 4, member size
40. ok
corrupted.lz: first bytes of trailing garbage found = 4D 5A 49 50 01 0C
I see that the bit-flip in corrupted.lz affects one of the magic bytes
in the second member of the file.
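The comparison can also be done mechanically. The following is only an illustrative sketch, not part of lzip: the helper name bit_distance is hypothetical, and the input is the six bytes of trailing garbage reported above. It counts how many bits separate the first four garbage bytes from the lzip signature.

```python
# Illustrative helper (hypothetical name, not part of lzip).
LZIP_MAGIC = bytes.fromhex("4C5A4950")  # "LZIP", the signature quoted above

def bit_distance(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# The six bytes printed by "lzip -tvvvv corrupted.lz".
garbage = bytes.fromhex("4D5A4950010C")

# A distance of one bit suggests a corrupt header rather than real garbage.
print(bit_distance(garbage[:4], LZIP_MAGIC))  # 1
```

A small distance is only a hint; deciding whether the data is a damaged member or real garbage still takes human judgement.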
The probability of corruption happening in the magic bytes of the second
or successive members/streams is (except in the case of xz) about 4
times smaller than the probability of getting a false positive caused by
the corruption of the integrity information itself. It can be considered
to be under the noise level. This, along with the fact that human
judgement is needed to tell garbage from a corrupt header, is probably
why, AFAIK, nobody has ever cared about it enough to file a feature
request in bug-gzip or lzip-bug.
>> Xz has broken with this tradition
> Glad to hear that.
Don't be so glad about xz breaking the tradition. Xz did it because its
probability of truncating the output is the highest of all, both because
of its longer magic string and because of possible corruption in stream
padding. (Stream padding in xz is optional, but its size is unlimited.)
bzip2/gzip/lzip
+========+========+========+
| member | member | member |
+========+========+========+
xz
+========+=========+========+=========+========+=========+
| stream | padding | stream | padding | stream | padding |
+========+=========+========+=========+========+=========+
Bzip2 and lzip behave optimally in the most frequent case of files with
just one member/stream, where trailing garbage can't make the decoder
produce incorrect output, and there is no risk in ignoring it by
default. (This is, I think, the case for Debian packages.) In the four
examples below tar extracted the files correctly, but only bzip2 and
lzip exited with status 0 (the string "garbage" was appended to each
tarball):
$ tar -xf garbage_added.tar.bz2 ; echo $?
bzip2: (stdin): trailing garbage after EOF ignored
0
$ tar -xf garbage_added.tar.gz ; echo $?
gzip: stdin: decompression OK, trailing garbage ignored
tar: Child returned status 2
tar: Error is not recoverable: exiting now
2
$ tar -xf garbage_added.tar.lz ; echo $?
0
$ tar -xf garbage_added.tar.xz ; echo $?
xz: (stdin): Unexpected end of input
tar: Child returned status 1
tar: Error is not recoverable: exiting now
2
Note the contradictory messages in the gzip example: "decompression OK"
vs "Error is not recoverable". Xz missed the point entirely.
For more advanced (but less frequent) uses like multimember or
concatenated files I propose the following change:
1) Ignore trailing garbage by default, as bzip2 and lzip do now.
2) Add an option (say --trailing-error) that forces the decompressor to
exit with nonzero status if any remaining input is detected after the
last member.
The proposed option would catch the improbable case of corruption in the
magic bytes of the second or successive members, but there is nothing
the decompressor can do to catch the similarly improbable case of file
truncation just after the last byte of a member/stream.
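The proposed policy can be sketched in a few lines. This is only an illustration under stated assumptions: it uses gzip members through Python's zlib module as a stand-in (the Python standard library has no lzip decoder), the names decompress_members and trailing_error are hypothetical, and the exit status 2 is an arbitrary nonzero value.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # stand-in for the lzip signature in this sketch

def decompress_members(data, trailing_error=False):
    """Decode concatenated gzip members.

    Returns (output, exit_status, first_bytes_of_garbage).
    """
    out = bytearray()
    first = True
    while data:
        if not first and not data.startswith(GZIP_MAGIC):
            # Remaining input is not a valid header: "trailing garbage".
            # 1) ignored by default, 2) an error if the option is set.
            return bytes(out), (2 if trailing_error else 0), data[:6]
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # expect a gzip wrapper
        try:
            out += d.decompress(data)
        except zlib.error:
            return bytes(out), 2, b""  # corrupt member: always an error
        if not d.eof:
            return bytes(out), 2, b""  # input ended inside a member
        data = d.unused_data           # whatever follows this member
        first = False
    return bytes(out), 0, b""

a, b = gzip.compress(b"foo\n"), gzip.compress(b"bar\n")
corrupted = a + bytes([b[0] ^ 0x01]) + b[1:]  # flip one bit in the 2nd magic
print(decompress_members(corrupted)[1])                       # default
print(decompress_members(corrupted, trailing_error=True)[1])  # with option
```

Note that decompress_members(a) returns status 0 with or without the option, even if the input was originally a + b: truncation exactly at a member boundary leaves nothing to detect.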
I suggest that any replies to this message be sent to lzip-bug. I guess
discussing the behaviour of decompressors in corner cases like this is
off-topic in debian-devel.
Best regards,
Antonio.