From: Antonio Diaz Diaz
Subject: Re: Section "2.10.4 The 'Block Check' field" in your paper: Xz format inadequate for long-term archiving
Date: Tue, 28 Mar 2023 23:04:38 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.9.1.19) Gecko/20110420 SeaMonkey/2.0.14
Dear Wolfgang,
Thank you for your message. I'll try to clarify things and maybe improve the
wording of Section 2.10.4.
Wolfgang Liessmann wrote:
> in section "2.10.4 The 'Block Check' field" you present a formula
>   Inaccuracy = (compressed_size * Pudc + CS_size) / (compressed_size + CS_size)
> which is not explained (derived in a way that shows why it is necessary).
That formula calculates inaccuracy by plain counting of cases. Let's see:
Inaccuracy is defined in the Glossary
(http://www.nongnu.org/lzip/xz_inadequate.html#glossary) as:
Inaccuracy: Ratio of error detection mistakes. Is defined as
(false_negatives + false_positives) / total_cases.
For the single-byte error model (one corrupt byte per test), the inaccuracy
for each compressed size and check sequence (CS) size is calculated by the
following formula (all sizes in bytes):
Inaccuracy = (compressed_size * Pudc + CS_size) / (compressed_size + CS_size)
Where
'compressed_size * Pudc' counts the false negatives (cases where the corrupt
byte belongs to the payload but the check sequence does not detect the
corruption),
'CS_size', the size of the check sequence in bytes, counts the false
positives (cases where the corrupt byte belongs to the check sequence
itself, not to the compressed payload), and
'compressed_size + CS_size' is the total size of the file (total number of
cases).
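To make the counting concrete, here is a small sketch (not from the article;
the 4-bit check and the 1000-byte payload are arbitrary choices for
illustration) that enumerates every single-byte corruption of a tiny file and
counts the mistakes directly:

  # Deliberately weak 4-bit check (low nibble of the byte sum) so that
  # undetected corruptions actually occur and can be counted.
  import random

  def check(data):
      return sum(data) & 0x0F        # 4-bit check sequence, stored in one byte

  payload = bytes(random.randrange(256) for _ in range(1000))
  file = payload + bytes([check(payload)])

  false_neg = false_pos = total = 0
  for pos in range(len(file)):
      for value in range(256):
          if value == file[pos]:
              continue               # not a corruption
          total += 1
          corrupt = bytearray(file)
          corrupt[pos] = value
          detected = check(corrupt[:-1]) != corrupt[-1]
          if pos < len(payload):
              if not detected:
                  false_neg += 1     # payload corrupt, check still matches
          elif detected:
              false_pos += 1         # payload intact but flagged as corrupt

  print("measured inaccuracy:", (false_neg + false_pos) / total)
  # formula from the article, with Pudc = 15/255 for this toy check:
  print("formula:            ",
        (len(payload) * 15 / 255 + 1) / (len(payload) + 1))

Both lines print roughly 0.0598, so the counting and the formula agree.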
> In particular, the statement
>   "It should be noted that SHA-256 provides worse accuracy than CRC64 for
>   all possible block sizes."
> and the inaccuracy 3e-08 at 1e+09 bytes (= 1 GB) compressed size seem to
> contradict simple cryptographic standards at first glance.
This may be because "cryptographic standards" do not take into account
mismatches caused by the corruption of the hash value itself.
If we take Pudc (the probability of a single-byte error being undetected by
the check sequence) as 2^-n (See [Koopman], p. 5), where 'n' is the size of
the check sequence in bits, then the inaccuracy for the 3 types of check
sequences considered can be calculated for a compressed size 'x' as:
( x * 2^-32 + 4 ) / ( x + 4 ) for CRC32
( x * 2^-64 + 8 ) / ( x + 8 ) for CRC64
( x * 2^-256 + 32 ) / ( x + 32 ) for SHA-256
The largest block size that an xz file can contain is 2^63 bytes. So the
inaccuracies for the largest size are:
( 2^63 * 2^-32 + 4 ) / ( 2^63 + 4 ) ~= 2.3e-10 for CRC32
( 2^63 * 2^-64 + 8 ) / ( 2^63 + 8 ) ~= 9.2e-19 for CRC64
( 2^63 * 2^-256 + 32 ) / ( 2^63 + 32 ) ~= 3.5e-18 for SHA-256
When we apply the formula '(false_negatives + false_positives) /
total_cases' to a compressed size of 1 GB we get the following inaccuracies:
( 1e9 * 2^-32 + 4 ) / ( 1e9 + 4 ) ~= 4e-9 for CRC32
( 1e9 * 2^-64 + 8 ) / ( 1e9 + 8 ) ~= 8e-9 for CRC64
( 1e9 * 2^-256 + 32 ) / ( 1e9 + 32 ) ~= 3.2e-8 for SHA-256
These values match figure 3 in the article.
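For reference, here is a short sketch (again assuming Pudc = 2^-n, where 'n'
is the check sequence size in bits) that reproduces both lists of numbers
above:

  def inaccuracy(compressed_size, cs_size):
      pudc = 2.0 ** -(8 * cs_size)   # probability of an undetected corruption
      return (compressed_size * pudc + cs_size) / (compressed_size + cs_size)

  for size in (2 ** 63, 10 ** 9):    # largest xz block size, then 1 GB
      for name, cs in (("CRC32", 4), ("CRC64", 8), ("SHA-256", 32)):
          print("%-8s %7.1e bytes: %.1e" % (name, size, inaccuracy(size, cs)))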
> The rule of thumb (the mathematical approximation) is that the security,
> measured in bits of security, is half of the hash length (if there is no
> attack on the algorithm, and there is no known effective attack on
> SHA-256 as of now), which means that the probability of a collision of
> two SHA-256 hashes is less than [...]
With respect to the check sequence, the inaccuracy described in the article
has nothing to do with cryptography, collisions, security, or attacks. It
just relates to the size of the check sequence; the larger the check
sequence, the more probable that corruption affects it (causing a false
positive). Just to illustrate: if you use a check sequence as large as the
compressed payload, half of the bit flips occurring in the resulting file
will affect the payload and the other half will affect the check sequence
(on average). It is immaterial whether the check sequence is a CRC or a
cryptographic hash.
As a SHA-256 hash is 8 times larger than a 32-bit CRC, it is 8 times more
probable that a random bit corruption affects it, making the decompressor
believe that the payload is corrupt because it does not match the check
sequence. Similarly, it is 4 times more probable that a random bit
corruption affects a SHA-256 than a CRC64. Choosing a check sequence larger
than needed is just a bad idea.
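As a quick check of those ratios, one can compare the false-positive term of
the formula alone (the payload size below is just an example):

  payload = 10 ** 9                  # 1 GB payload, for illustration
  p_fp = {name: cs / (payload + cs)  # corruption lands in the check sequence
          for name, cs in (("CRC32", 4), ("CRC64", 8), ("SHA-256", 32))}
  print(p_fp["SHA-256"] / p_fp["CRC32"])   # ~8 times more likely than CRC32
  print(p_fp["SHA-256"] / p_fp["CRC64"])   # ~4 times more likely than CRC64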
> While at first glance lzip indeed seems to have a better design than
> xz, and the compression ratio is better comparing the maximum compression
> levels, it is not clear to me why a secure cryptographic hash function
> should perform that badly.
This is explained in [Koopman], p. 33:
"there can be safety tradeoffs with the addition of an error-detection
scheme. As with almost all fault tolerance mechanisms, there is a tradeoff
between availability and integrity. That is, techniques that increase
integrity tend to reduce availability and vice versa. Employing error
detection by adding a check sequence to a dataword increases integrity, but
decreases availability. The decrease in availability happens through
false-positive detections. These failures preclude the use of some data that
otherwise would not have been rejected had it not been for the addition of
error-detection coding.
False-positive detections can occur in two primary ways. The first occurs
because the check sequence adds more bits to the dataword and each bit that
is added is another bit that could fail. For checksums and CRCs, a failure
in the check sequence bits is indistinguishable from failures in the
dataword. Thus, a failure in the check sequence will cause the data to be
flagged as failed, which takes place because of the addition of the check
sequence. Thus, the addition of the check sequence bits can cause the lack
of availability of the dataword to which it is attached."
Best regards,
Antonio.