From: Wolfgang Liessmann
Subject: Re: Section "2.10.4 The 'Block Check' field" in your paper: Xz format inadequate for long-term archiving
Date: Mon, 3 Apr 2023 03:26:35 +0200
Dear Antonio,
Thank you very much for your excellent explanation.
As I understand it, cryptography is concerned only with integrity (reducing
the number of false negatives), whereas for archiving purposes you aim for a
balance between integrity and availability (reducing the number of both false
negatives and false positives). Under that definition, inaccuracy increases
linearly with the size of the check sequence, so large checksums such as
SHA-256 "perform" badly in that sense.
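
A rough sketch of that linear relationship follows. It assumes a simplified
model that is not spelled out in the paper: a single corrupt byte lands
uniformly at random somewhere in the payload plus check sequence, and a hit on
the check sequence counts as a false positive (the data are intact but the
check fails). The function names and the 1 GB payload size are hypothetical.

```python
from fractions import Fraction

# Check sequence sizes in bytes.
CHECK_SIZES = {"CRC32": 4, "CRC64": 8, "SHA-256": 32}


def false_positive_probability(payload_bytes: int, check_bytes: int) -> Fraction:
    """Chance that one uniformly placed corrupt byte hits the check
    sequence rather than the payload (assumed model, see lead-in)."""
    return Fraction(check_bytes, payload_bytes + check_bytes)


def relative_false_positives(payload_bytes: int, baseline: str = "CRC32") -> dict:
    """False-positive probability of each check, relative to the baseline."""
    base = false_positive_probability(payload_bytes, CHECK_SIZES[baseline])
    return {
        name: float(false_positive_probability(payload_bytes, size) / base)
        for name, size in CHECK_SIZES.items()
    }


if __name__ == "__main__":
    # Hypothetical 1 GB payload; the ratios approach 1, 2 and 8.
    for name, ratio in relative_false_positives(10**9).items():
        print(f"{name}: {ratio:.3f}x the false positives of CRC32")
```

For payloads much larger than the check sequence, the ratios are close to the
ratios of the check sizes themselves, which matches the "CRC64 doubles" and
"SHA-256 produces 8 times more" figures quoted below.
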
I have now taken a closer look at your text and quoted the relevant passages
below.
Since lzip (-9) has a better compression ratio than other tools, including
gzip, bzip2, zstd (-19), and xz (-9), I wonder whether its compression
algorithm can be implemented for the ZFS filesystem.
This would be desirable for maximum compression, but it is not currently
implemented:
https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#compression
Again, thanks for your kind explanation.
I have placed a link to your answer at: https://stackoverflow.com/a/75852528
> "There can be safety tradeoffs with the addition of an error-detection
> scheme. As with almost all fault tolerance mechanisms, there is a tradeoff
> between availability and integrity. That is, techniques that increase
> integrity tend to reduce availability and vice versa. Employing error
> detection by adding a check sequence to a dataword increases integrity, but
> decreases availability. The decrease in availability happens through
> false-positive detections. These failures preclude the use of some data that
> otherwise would not have been rejected had it not been for the addition of
> error-detection coding". ([Koopman], p. 33).
>
> But the tradeoff between availability and integrity is different for data
> transmission than for data archiving. When transmitting data, usually the
> most important consideration is to avoid undetected errors (false negatives
> for corruption), because a retransmission can be requested if an error is
> detected. Archiving, on the other hand, usually implies that if a file is
> reported as corrupt, "retransmission" is not possible. Obtaining another copy
> of the file may be difficult or impossible. Therefore accuracy (freedom from
> mistakes) in the detection of errors becomes the most important consideration.
> There is a good reason why bzip2, gzip, lzip and most other compressed
> formats use a 32-bit check sequence; it provides for an optimal detection of
> errors. Larger check sequences may (or may not) reduce the number of false
> negatives at the cost of always increasing the number of false positives. But
> significantly reducing the number of false negatives may be impossible if the
> number of false negatives is already insignificant, as is the case in bzip2,
> gzip and lzip files. On the other hand, the number of false positives
> increases linearly with the size of the check sequence. CRC64 doubles the
> number of false positives of CRC32, and SHA-256 produces 8 times more false
> positives than CRC32, decreasing the accuracy of the error detection instead
> of increasing it.
>
> Increasing the probability of a false positive for corruption in the
> long-term storage of valuable data is a bad idea. This is why the lzip
> format, designed for long-term archiving, provides 3 factor integrity
> checking and the decompressor reports mismatches in each factor separately.
> This way if just one byte in one factor fails but the other two factors match
> the data, it probably means that the data are intact and the corruption just
> affects the mismatching check sequence. GNU gzip also reports mismatches in
> its 2 factors separately, but does not report the exact values, making it
> more difficult to tell real corruption from a false positive. Bzip2 reports
> separately its 2 levels of CRCs, allowing the detection of some false
> positives.
https://www.nongnu.org/lzip/xz_inadequate.html
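
The idea of reporting each integrity factor separately can be sketched as
follows. This is only an illustration, not lzip's actual code. It assumes that
the three factors are the CRC32 of the uncompressed data, the uncompressed
data size, and the member size, as I understand the lzip trailer to contain;
the quoted passage does not name them. The `Trailer` class and the function
names are hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Trailer:
    """Hypothetical container for the three stored check values."""
    crc32: int
    data_size: int
    member_size: int


def check_factors(stored: Trailer, data: bytes, member_size: int) -> list:
    """Return the names of the factors whose stored value disagrees
    with the value computed from the decompressed data."""
    computed = Trailer(zlib.crc32(data) & 0xFFFFFFFF, len(data), member_size)
    return [
        name
        for name in ("crc32", "data_size", "member_size")
        if getattr(stored, name) != getattr(computed, name)
    ]


def verdict(mismatches: list) -> str:
    """Interpret the mismatches along the lines of the quoted passage."""
    if not mismatches:
        return "ok"
    if len(mismatches) == 1:
        return f"only {mismatches[0]} differs; data probably intact"
    return "several factors differ; data probably corrupt"


if __name__ == "__main__":
    data = b"example payload"
    good = Trailer(zlib.crc32(data), len(data), 51)
    # Flip one bit in the stored CRC to simulate a damaged check sequence.
    damaged = Trailer(good.crc32 ^ 1, good.data_size, good.member_size)
    print(verdict(check_factors(good, data, 51)))
    print(verdict(check_factors(damaged, data, 51)))
```

Because each factor is compared on its own, a single mismatch can be told
apart from the case where several factors disagree, which is the distinction
the passage relies on to recognise a probable false positive.
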