From: Antonio Diaz Diaz
Subject: Re: Section "2.10.4 The 'Block Check' field" in your paper: Xz format inadequate for long-term archiving
Date: Tue, 28 Mar 2023 23:04:38 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.9.1.19) Gecko/20110420 SeaMonkey/2.0.14
Dear Wolfgang,
Thank you for your message. I'll try to clarify things and maybe improve the
wording of Section 2.10.4.
Wolfgang Liessmann wrote:
> in section "2.10.4 The 'Block Check' field" you present a formula
>   Inaccuracy = (compressed_size * Pudc + CS_size) / (compressed_size + CS_size)
> which is not explained (derived in a way that shows why it is necessary).
That formula calculates inaccuracy by plain counting of cases. Let's see:
Inaccuracy is defined in the Glossary
(http://www.nongnu.org/lzip/xz_inadequate.html#glossary) as:
Inaccuracy: Ratio of error detection mistakes. Is defined as
(false_negatives + false_positives) / total_cases.
For the single-byte error model (one corrupt byte per test), the inaccuracy
for each compressed size and check sequence (CS) size is calculated by the
following formula (all sizes in bytes):
Inaccuracy = (compressed_size * Pudc + CS_size) / (compressed_size + CS_size)
Where
'compressed_size * Pudc' counts the false negatives (cases where the corrupt
byte belongs to the payload but the check sequence does not detect the
corruption),
'CS_size', the size of the check sequence in bytes, counts the false
positives (cases where the corrupt byte belongs to the check sequence
itself, not to the compressed payload), and
'compressed_size + CS_size' is the total size of the file (total number of
cases).
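To make the counting concrete, here is a small sketch (not from the article;
the 4-bit check and the 1000-byte payload are arbitrary choices for
illustration) that enumerates every single-byte corruption of a tiny file and
counts the mistakes directly:

  # Deliberately weak 4-bit check (low nibble of the byte sum) so that
  # undetected corruptions actually occur and can be counted.
  import random

  def check(data):
      return sum(data) & 0x0F        # 4-bit check sequence, stored in one byte

  payload = bytes(random.randrange(256) for _ in range(1000))
  file = payload + bytes([check(payload)])

  false_neg = false_pos = total = 0
  for pos in range(len(file)):
      for value in range(256):
          if value == file[pos]:
              continue               # not a corruption
          total += 1
          corrupt = bytearray(file)
          corrupt[pos] = value
          detected = check(corrupt[:-1]) != corrupt[-1]
          if pos < len(payload):
              if not detected:
                  false_neg += 1     # payload corrupt, check still matches
          elif detected:
              false_pos += 1         # payload intact but flagged as corrupt

  print("measured inaccuracy:", (false_neg + false_pos) / total)
  # formula from the article, with Pudc = 15/255 for this toy check:
  print("formula:            ",
        (len(payload) * 15 / 255 + 1) / (len(payload) + 1))

Both lines print roughly 0.0598, so the counting and the formula agree.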
> In particular, the statement
>   "It should be noted that SHA-256 provides worse accuracy than CRC64 for
>   all possible block sizes."
> and the inaccuracy 3e-08 at 1e+09 bytes (= 1 GB) compressed size seem to
> contradict simple cryptographic standards at first glance.
This may be because "cryptographic standards" do not take into account
mismatches caused by the corruption of the hash value itself.
If we take Pudc (the probability of a single-byte error being undetected by
the check sequence) as 2^-n (See [Koopman], p. 5), where 'n' is the size of
the check sequence in bits, then the inaccuracy for the 3 types of check
sequences considered can be calculated for a compressed size 'x' as:
( x * 2^-32 + 4 ) / ( x + 4 ) for CRC32
( x * 2^-64 + 8 ) / ( x + 8 ) for CRC64
( x * 2^-256 + 32 ) / ( x + 32 ) for SHA-256
The largest block size that an xz file can contain is 2^63 bytes. So the
inaccuracies for the largest size are:
( 2^63 * 2^-32 + 4 ) / ( 2^63 + 4 ) ~= 2.3e-10 for CRC32
( 2^63 * 2^-64 + 8 ) / ( 2^63 + 8 ) ~= 9.2e-19 for CRC64
( 2^63 * 2^-256 + 32 ) / ( 2^63 + 32 ) ~= 3.5e-18 for SHA-256
When we apply the formula '(false_negatives + false_positives) /
total_cases' to a compressed size of 1 GB we get the following inaccuracies:
( 1e9 * 2^-32 + 4 ) / ( 1e9 + 4 ) ~= 4e-9 for CRC32
( 1e9 * 2^-64 + 8 ) / ( 1e9 + 8 ) ~= 8e-9 for CRC64
( 1e9 * 2^-256 + 32 ) / ( 1e9 + 32 ) ~= 3.2e-8 for SHA-256
These values match figure 3 in the article.
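For reference, here is a short sketch (again assuming Pudc = 2^-n, where 'n'
is the check sequence size in bits) that reproduces both lists of numbers
above:

  def inaccuracy(compressed_size, cs_size):
      pudc = 2.0 ** -(8 * cs_size)   # probability of an undetected corruption
      return (compressed_size * pudc + cs_size) / (compressed_size + cs_size)

  for size in (2 ** 63, 10 ** 9):    # largest xz block size, then 1 GB
      for name, cs in (("CRC32", 4), ("CRC64", 8), ("SHA-256", 32)):
          print("%-8s %7.1e bytes: %.1e" % (name, size, inaccuracy(size, cs)))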
> The rule of thumb (the mathematical approximation) is that the security,
> measured in bits of security, is half of the hash length (if there is no
> attack on the algorithm, and there is no known effective attack on
> SHA-256 as of now), which means that the probability of a collision of
> two SHA-256 hashes is less than [...]
With respect to the check sequence, the inaccuracy described in the article
has nothing to do with cryptography, collisions, security, or attacks. It
just relates to the size of the check sequence; the larger the check
sequence, the more probable that corruption affects it (causing a false
positive). Just to illustrate: if you use a check sequence as large as the
compressed payload, half of the bit flips occurring in the resulting file
will affect the payload and the other half will affect the check sequence
(on average). It is immaterial whether the check sequence is a CRC or a
cryptographic hash.
As a SHA-256 hash is 8 times larger than a 32-bit CRC, it is 8 times more
probable that a random bit corruption affects it, making the decompressor
believe that the payload is corrupt because it does not match the check
sequence. Similarly, it is 4 times more probable that a random bit
corruption affects a SHA-256 than a CRC64. Choosing a check sequence larger
than needed is just a bad idea.
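As a quick check of those ratios, one can compare the false-positive term of
the formula alone (the payload size below is just an example):

  payload = 10 ** 9                  # 1 GB payload, for illustration
  p_fp = {name: cs / (payload + cs)  # corruption lands in the check sequence
          for name, cs in (("CRC32", 4), ("CRC64", 8), ("SHA-256", 32))}
  print(p_fp["SHA-256"] / p_fp["CRC32"])   # ~8 times more likely than CRC32
  print(p_fp["SHA-256"] / p_fp["CRC64"])   # ~4 times more likely than CRC64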
> While at first glance lzip indeed seems to have a better design than
> xz, and the compression ratio is better comparing the maximum compression
> levels, it is not clear to me why a secure cryptographic hash function
> should perform that badly.
This is explained in [Koopman], p. 33:
"there can be safety tradeoffs with the addition of an error-detection
scheme. As with almost all fault tolerance mechanisms, there is a tradeoff
between availability and integrity. That is, techniques that increase
integrity tend to reduce availability and vice versa. Employing error
detection by adding a check sequence to a dataword increases integrity, but
decreases availability. The decrease in availability happens through
false-positive detections. These failures preclude the use of some data that
otherwise would not have been rejected had it not been for the addition of
error-detection coding.
False-positive detections can occur in two primary ways. The first occurs
because the check sequence adds more bits to the dataword and each bit that
is added is another bit that could fail. For checksums and CRCs, a failure
in the check sequence bits is indistinguishable from failures in the
dataword. Thus, a failure in the check sequence will cause the data to be
flagged as failed, which takes place because of the addition of the check
sequence. Thus, the addition of the check sequence bits can cause the lack
of availability of the dataword to which it is attached."
Best regards,
Antonio.