bug#30719: Progressively compressing piped input

bug-gzip

From:	Mark Adler
Subject:	bug#30719: Progressively compressing piped input
Date:	Mon, 5 Mar 2018 14:54:21 -0800

deflate has an inherent latency: it accumulates enough data to emit each 
deflate block efficiently. You can deliberately flush (with zlib, not 
gzip), but if you do that too frequently, e.g. on each line, then you will get 
lousy compression or even expansion.
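
As a rough illustration of the flushing trade-off described above, here is a minimal sketch using Python's standard zlib bindings. The helper name gzip_stream, the sample ping-like lines, and the flush intervals are assumptions made up for this example; only zlib.compressobj, flush and Z_SYNC_FLUSH come from the library.

```python
import zlib

def gzip_stream(lines, flush_every):
    """Compress `lines` into one gzip stream, sync-flushing every
    `flush_every` lines (hypothetical helper, for illustration only)."""
    # wbits=31 asks zlib for gzip framing instead of a raw zlib stream.
    co = zlib.compressobj(9, zlib.DEFLATED, 31)
    out = bytearray()
    for count, line in enumerate(lines, 1):
        out += co.compress(line)
        if count % flush_every == 0:
            # Z_SYNC_FLUSH pushes out everything consumed so far, so a
            # reader can decode up to this point, at a cost of extra bytes.
            out += co.flush(zlib.Z_SYNC_FLUSH)
    out += co.flush()  # default Z_FINISH: writes the gzip trailer
    return bytes(out)

# Made-up repetitive log lines, similar in spirit to ping output.
lines = [
    b"64 bytes from example: icmp_seq=%d ttl=53 time=12.%d ms\n" % (i, i % 10)
    for i in range(1000)
]
raw = b"".join(lines)

per_line = gzip_stream(lines, 1)    # flush after every line
rare = gzip_stream(lines, 250)      # flush only four times

print("uncompressed:      ", len(raw))
print("flush every line:  ", len(per_line))
print("flush every 250:   ", len(rare))

# A flushed prefix is readable before the stream is finished.
co = zlib.compressobj(9, zlib.DEFLATED, 31)
partial = co.compress(lines[0]) + co.flush(zlib.Z_SYNC_FLUSH)
first_line = zlib.decompressobj(31).decompress(partial)
```

Both outputs decompress to the same data; the per-line variant is simply larger, because every flush ends the current deflate block and adds a small marker.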

I wrote something called gzlog 
(https://github.com/madler/zlib/blob/master/examples/gzlog.h), intended to 
solve this problem. It can take a small amount of input, e.g. a line, and 
update the output gzip file to be complete and valid after each line, yet also 
get good compression in the long run. It does this by writing the lines to the 
log.gz file effectively uncompressed (deflate has a “stored” block type), until 
it has accumulated, say, 1 MB of data. Then it goes back and compresses that 
uncompressed 1 MB, again always leaving the gzip file in a valid state. gzlog 
also maintains something like a journal, which allows gzlog to repair the gzip 
file if the last operation was interrupted, e.g. by a power failure.
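
The "store first, compress later" idea can be sketched in a few lines of Python. This is not how gzlog is implemented: gzlog rewrites stored blocks inside a single gzip stream and keeps a journal for recovery, whereas this sketch swaps in a simpler technique, appending each line as its own level-0 gzip member and later replacing the accumulated members with one compressed member. The class name StoredThenCompressedLog, its attributes, and the tiny threshold in the demo are all hypothetical, and the bookkeeping lives only in memory, so it does not survive a restart.

```python
import gzip
import os
import tempfile

class StoredThenCompressedLog:
    """Hypothetical helper; a simplified analogue of the gzlog idea."""

    def __init__(self, path, threshold=1 << 20):
        self.path = path
        self.threshold = threshold  # uncompressed bytes to collect first
        self.tail_start = 0         # file offset where stored members begin
        self.tail_bytes = 0         # uncompressed bytes in stored members
        open(path, "wb").close()    # start with an empty log file

    def append(self, line):
        # Level 0 makes deflate emit "stored" blocks: valid gzip output,
        # written immediately, with no compression.
        with open(self.path, "ab") as f:
            f.write(gzip.compress(line, compresslevel=0))
        self.tail_bytes += len(line)
        if self.tail_bytes >= self.threshold:
            self._compress_tail()

    def _compress_tail(self):
        # Keep the already compressed head as-is; recompress only the tail.
        with open(self.path, "rb") as f:
            head = f.read(self.tail_start)
            tail = f.read()
        packed = gzip.compress(gzip.decompress(tail), compresslevel=9)
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(head + packed)
            f.flush()
            os.fsync(f.fileno())
        # Swap the new file in, so readers see the old or the new version.
        os.replace(tmp, self.path)
        self.tail_start = len(head) + len(packed)
        self.tail_bytes = 0

# Small demo with a tiny threshold so compression triggers quickly.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "log.gz")
    log = StoredThenCompressedLog(demo_path, threshold=200)
    for i in range(50):
        log.append(b"line %d: PING ok\n" % i)
    with gzip.open(demo_path, "rb") as f:
        demo_contents = f.read()
```

After every append the file is a sequence of complete gzip members, so zcat or gzip.open can read it at any time. Copying the head on each compaction costs I/O, which is one reason a real implementation works in place and journals its steps instead.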

> On Mar 5, 2018, at 1:18 PM, Garreau, Alexandre <address@hidden> wrote:
> 
> Hi,
> 
> I have a script whose logged output is very repetitive text
> (mostly the output of ping and date). To minimize disk usage, I thought to
> pipe it to gzip -9. Then I realized that the log, unlike before,
> remained empty, and recalled the GNU policy of “reading all input and
> only then outputting” to maximize overall speed at the expense of the
> decreasingly expensive memory.
> 
> Yet I want to run that script all the time and be able to kill it
> uncleanly or just shut down without losing all its output (nor am I
> sure it is good practice anyway to keep everything in RAM until
> shutdown, considering I suppose gzip only keeps the compressed output in
> memory anyway, discarding the then useless input), while “tail -f”-ing the
> files it writes.
> 
> I guess piping the whole output is the way to go to achieve optimal
> compression, since otherwise just gzipping each line/command output
> wouldn’t compress as much (since anyway the repetition occurs among the
> lines, not inside them). Yet would there be a way to obtain this maximal
> compression while having gzip output each time I stop giving it
> input (as I do every 30 seconds or so), without having to save the
> uncompressed file or recompress the whole file several times?
> 
> I mean, it seems to me a good thing to wait until everything is compressed
> before outputting, rather than outputting as soon as possible, but isn’t
> there a way to trigger the output each time the input has been processed and
> there has been no more input for a certain amount of time (that is, ~30s)?
> 
> Am I looking at something like this:
> #!/bin/bash
> while ping -c1 gnu.org ; do
>    date --rfc-3339=seconds
>    sleep 30
> done | gzip -9 -f | tee sample.log | zcat
