bug-gzip

RFC: fixing the 32-bit size and time limits in gzip file format


From: Paul Eggert
Subject: RFC: fixing the 32-bit size and time limits in gzip file format
Date: Mon, 16 Aug 2010 02:25:47 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.11) Gecko/20100713 Thunderbird/3.0.6

The most frequently reported bug for GNU gzip is that gzip -l reports
sizes modulo 2**32 instead of full sizes.  This is because the gzip
format specifies a 4-byte (32-bit) size field.
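
As a rough illustration of the truncation, the following Python sketch
reads the 4-byte size trailer (called ISIZE in RFC 1952) from a small
gzip member and computes what would be stored for a 5 GiB input.  The
helper names isize_field and read_isize are made up for this example
and are not part of gzip.

```python
import gzip
import struct

def isize_field(size):
    """Value that fits in the 4-byte ISIZE trailer for an input of `size` bytes."""
    return size % 2**32

def read_isize(member):
    """Read ISIZE (little-endian, last 4 bytes) from a single-member gzip stream."""
    (isize,) = struct.unpack("<I", member[-4:])
    return isize

small = gzip.compress(b"hello")
print(read_isize(small))        # a 5-byte input is recorded exactly
print(isize_field(5 * 2**30))   # a 5 GiB input wraps around to 2**30 (1 GiB)
```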

A similar problem in gzip format is that it supports only nonzero
32-bit time stamps, which limits it to the range from 1970-01-01
00:00:01 through 2106-02-07 06:28:15 UTC.  OK, so this is not as
pressing a bug, but it wouldn't hurt to fix this while we're at it.
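
The end points of that range can be checked with a few lines of
Python.  This is only a sanity check of the arithmetic; the name
mtime_to_utc is a hypothetical helper, not anything in gzip.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def mtime_to_utc(mtime):
    """Convert a count of seconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=mtime)

print(mtime_to_utc(1))           # smallest nonzero 32-bit time stamp
print(mtime_to_utc(2**32 - 1))   # largest 32-bit time stamp
```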

I am thinking that we should fix both problems by putting full sizes
and time stamps into the header, as follows:

* If the file size is 2**32 or larger, gzip should emit an extra field
  that records the size divided by 2**32 (discarding fractions).  gzip
  -l should read this field when reporting the size.

* We want to do this in a way that is compatible with all the
  other gzip implementations out there, including old versions of GNU
  gzip.  So, we use the already-existing mechanism for extra fields,
  namely FLG.FEXTRA as per RFC 1952.  We use SI1='H', SI2='S' (this is
  short for High-order bits of the Size).  LEN is the length of the
  high-order bits field, and the field's value contains the high-order
  bits, represented as usual in little-endian order.  A missing HS
  field is treated as zero.

* Similarly, we use SI1='H', SI2='M' (High-order Modification time)
  for the high-order bits of the modification time, when a time stamp
  is less than 1 or greater than 2**32 - 1.  There are a few extra
  goodies here, though.  If the leading bit of the high-order time
  field is 1, then the entire time stamp (including the lower order
  bits) is treated as a negative number, using two's complement.
  Also, if the high-order bits are present but are all zero, the time
  stamp is considered to be zero rather than missing.
  
* This approach will allow us to represent sizes up to 2**65568, which
  should be enough for quite some time.  Similarly, representable times
  would range from 2**65567 seconds before 1970 to 2**65567 seconds
  after 1970, which would handle all file-system formats that I know of.

* This approach is backward-compatible with older versions of gzip,
  with any decompressor that conforms to Internet RFC 1952, and with
  all implementations of gzip decompressors that I know of.

* This approach does not address the issue of sub-second time stamp
  resolution, as I thought that would make the proposal too complicated.
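
The HS and HM subfields described above could be encoded and decoded
roughly as in the following Python sketch.  It is one possible reading
of the proposal, not an implementation from gzip, and all function
names are hypothetical.  It assumes that the shortest byte string that
holds the value is used, that a present HM subfield always means the
time stamp is present (even when the low 32 bits are zero), and that
only the subfield bytes are produced (the XLEN prefix of the FEXTRA
field is left out).

```python
import struct

def make_subfield(sid, data):
    """RFC 1952 subfield: SI1, SI2, 2-byte little-endian LEN, then the data."""
    return sid + struct.pack("<H", len(data)) + data

def parse_extra(extra):
    """Map each 2-byte subfield ID in an extra field to its data."""
    fields, pos = {}, 0
    while pos + 4 <= len(extra):
        sid = extra[pos:pos + 2]
        (length,) = struct.unpack_from("<H", extra, pos + 2)
        fields[sid] = extra[pos + 4:pos + 4 + length]
        pos += 4 + length
    return fields

def encode_size(size):
    """Return (low 32 bits, HS subfield bytes); no subfield below 2**32."""
    low, high = size & 0xFFFFFFFF, size >> 32
    if high == 0:
        return low, b""
    data = high.to_bytes((high.bit_length() + 7) // 8, "little")
    return low, make_subfield(b"HS", data)

def decode_size(low, extra):
    """A missing HS subfield is treated as zero high-order bits."""
    high = int.from_bytes(parse_extra(extra).get(b"HS", b""), "little")
    return (high << 32) | low

def encode_mtime(t):
    """Return (low 32 bits, HM subfield bytes); HM only outside 1..2**32-1."""
    if 1 <= t <= 2**32 - 1:
        return t, b""
    high = t >> 32  # arithmetic shift, so negative times keep their sign
    # One extra bit of room so the leading bit is a two's-complement sign bit.
    data = high.to_bytes((high.bit_length() + 8) // 8, "little", signed=True)
    return t & 0xFFFFFFFF, make_subfield(b"HM", data)

def decode_mtime(low, extra):
    """Return the full time stamp, or None when it is missing."""
    hm = parse_extra(extra).get(b"HM")
    if hm is None:
        return low if low != 0 else None
    high = int.from_bytes(hm, "little", signed=True)
    return (high << 32) | low

low, extra = encode_size(5 * 2**30)
print(low, extra, decode_size(low, extra))
low, extra = encode_mtime(-1)
print(low, extra, decode_mtime(low, extra))
```

With these assumptions, a time stamp of zero is written as a zero low
field plus an HM subfield holding a single zero byte, which is how the
sketch tells it apart from a missing time stamp.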

Comments are welcome; please CC: to <address@hidden>.



