Re: [Lzip-bug] Selection of CRC32 Polynomial for lzip
From: Antonio Diaz Diaz
Subject: Re: [Lzip-bug] Selection of CRC32 Polynomial for lzip
Date: Thu, 18 May 2017 01:45:25 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.9.1.19) Gecko/20110420 SeaMonkey/2.0.14
Hello Damir,
Damir wrote:
> Have you considered choosing a different polynomial for CRC32 calculation
> in the lzip file format?
Yes, but I found no compelling reason to change.
> Some recent CPUs (x86_64 SSE4.2, PowerPC ISA 2.07, ARM v8.1) offer
> hardware-accelerated calculation of CRC32 with a different polynomial
> (crc32c) than the one used in lzip (ethernet crc32).
Maybe hardware-accelerated calculation of the ethernet CRC32 also exists.
After all, it is the same polynomial used by gzip and zlib.
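
For reference, the two CRCs differ only in the generator polynomial. The
following is a minimal bit-at-a-time sketch (not lzip's actual code); the
constants are the standard reflected forms of the two polynomials:

/* Minimal sketch, not lzip's implementation: a reflected, bit-at-a-time CRC
   where the only difference between the two variants is the polynomial. */
#include <stdint.h>
#include <stddef.h>

#define POLY_CRC32  0xEDB88320u  /* ethernet/gzip/zlib/lzip polynomial, reflected */
#define POLY_CRC32C 0x82F63B78u  /* Castagnoli (crc32c) polynomial, reflected */

static uint32_t crc32_bitwise( const uint8_t *buf, size_t len, uint32_t poly )
  {
  uint32_t crc = 0xFFFFFFFFu;            /* initial value */
  for( size_t i = 0; i < len; ++i )
    {
    crc ^= buf[i];
    for( int b = 0; b < 8; ++b )
      crc = ( crc & 1 ) ? ( crc >> 1 ) ^ poly : crc >> 1;
    }
  return crc ^ 0xFFFFFFFFu;              /* final inversion */
  }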
> So, picking the crc32c poly instead has two benefits:
> 1) hardware accelerated integrity checking
Hardware acceleration of CRC calculation makes sense for storage devices
because the data is just moved; no other processing is done to it, so
calculating the CRC is the only computation involved.
But calculating the CRC is just a small part of the total decompression
time, so even if you accelerate it, the total speed gain is small
(probably smaller than 5%). For compression the speed gain is even smaller.
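
For what it is worth, hardware-accelerated CRC32C on x86_64 looks roughly
like the sketch below (assumes SSE4.2 and a compiler flag such as -msse4.2;
note that the CRC32 instruction implements only the Castagnoli polynomial,
not the one lzip uses):

/* Sketch only, assuming x86_64 with SSE4.2: hardware CRC32C via intrinsics.
   The instruction computes the Castagnoli polynomial, not the ethernet one. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <nmmintrin.h>

static uint32_t crc32c_hw( const uint8_t *buf, size_t len )
  {
  uint64_t crc = 0xFFFFFFFFu;
  size_t i = 0;
  for( ; i + 8 <= len; i += 8 )          /* 8 bytes per instruction */
    {
    uint64_t word;
    memcpy( &word, buf + i, 8 );
    crc = _mm_crc32_u64( crc, word );
    }
  for( ; i < len; ++i )                  /* remaining bytes */
    crc = _mm_crc32_u8( (uint32_t)crc, buf[i] );
  return (uint32_t)crc ^ 0xFFFFFFFFu;
  }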
> 2) better protection against undetected errors
You will need to prove this one.
CRC32C has a slightly larger Hamming distance than ethernet CRC32 for
"small" packet sizes (see pages 3-4 of [1]). But beyond some size, perhaps
not much larger than 128 KiB, both have the same HD of 2. For files
larger than that (uncompressed) size, there is little difference between
the two CRCs.
[1] http://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf
Even more important, we are talking about the interaction between
compression and integrity checking. The difference between a Hamming
distance of 2 and a Hamming distance of 3 is probably immaterial here.
Maybe you would like to read section 2.10 of [2]. I quote:
"Verification of data integrity in compressed files is different from
other cases (like Ethernet packets) because the data that can become
corrupted are the compressed data, but the data that are verified (the
dataword) are the decompressed data. Decompression can cause error
multiplication; even a single-bit error in the compressed data may
produce any random number of errors in the decompressed data, or even
modify the size of the decompressed data."
[2] http://www.nongnu.org/lzip/xz_inadequate.html
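
Error multiplication is easy to observe. The rough sketch below (an assumed
setup using zlib's raw deflate, which has no built-in check, rather than
lzip's format) flips a single bit in the compressed stream and counts how
many decompressed bytes change:

/* Rough sketch, assumed setup: demonstrate error multiplication with zlib's
   raw deflate. A single flipped bit in the compressed data may corrupt many
   decompressed bytes or change the decompressed size. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main( void )
  {
  static unsigned char src[4096], comp[8192], out[8192];
  for( int i = 0; i < 4096; ++i ) src[i] = (unsigned char)( i * 31 );

  z_stream ds; memset( &ds, 0, sizeof ds );     /* compress with raw deflate */
  deflateInit2( &ds, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 8,
                Z_DEFAULT_STRATEGY );
  ds.next_in = src; ds.avail_in = sizeof src;
  ds.next_out = comp; ds.avail_out = sizeof comp;
  deflate( &ds, Z_FINISH );
  const uLong csize = ds.total_out;
  deflateEnd( &ds );

  comp[csize/2] ^= 0x10;              /* single-bit error in compressed data */

  z_stream is; memset( &is, 0, sizeof is );
  inflateInit2( &is, -15 );           /* raw deflate: no integrity check */
  is.next_in = comp; is.avail_in = csize;
  is.next_out = out; is.avail_out = sizeof out;
  const int ret = inflate( &is, Z_FINISH );
  const uLong osize = is.total_out;
  inflateEnd( &is );

  unsigned long diffs = 0;
  for( uLong i = 0; i < osize && i < sizeof src; ++i )
    if( out[i] != src[i] ) ++diffs;
  printf( "inflate returned %d; %lu bytes decompressed, %lu differ\n",
          ret, (unsigned long)osize, diffs );
  return 0;
  }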
> The downside is the compatibility problem, but changing the version byte in
> the file header can help with that.
This is a very large downside, most probably to gain almost nothing.
IMO, one of the big problems of today's software development is that too
many people are willing to complicate the code without the slightest
proof that the proposed change is indeed an improvement.
Best regards,
Antonio.