Re: [Lzip-bug] Tarball indexing and plzip
From: Dennis Katsonis
Subject: Re: [Lzip-bug] Tarball indexing and plzip
Date: Mon, 11 Mar 2019 11:25:08 +1100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0
Hi Antonio,
On 3/11/19 2:53 AM, Antonio Diaz Diaz wrote:
> Hello Dennis,
>
> Dennis Katsonis wrote:
>> I was wondering whether it would be difficult or not, to add
>> functionality to plzip, or create a variant of it, which had tarball
>> indexing capabilities like pixz.
>
> I am in the process of implementing something like that, and more, but
> in tarlz, not in plzip: http://www.nongnu.org/lzip/tarlz.html
>
That's great to hear. I was thinking of doing something like that
myself, as it seemed an obvious improvement to the lzip tools. I'm not
sure how often people do only partial extractions, but I'm guessing it's
often enough to justify adding this capability.
>> Pixz allows a more random access to the compressed tarball. Listing is
>> very quick, and even extracting a file at the end of a large tarball is
>> quite fast, not too much slower than extracting it from an uncompressed,
>> indexed tarball. A major advantage when extracting select files from an
>> archived compressed tarball.
>
> Tarlz is not complete yet, but it can already list pretty quick if the
> archive is created with the right options[1]. Parallel extraction should
> be similarly quick once it is implemented.
>
> [1] http://www.nongnu.org/lzip/manual/tarlz_manual.html#Multi_002dthreaded-tar
>
> If the files in the archive are large, multi-threaded '--list' on a
> regular (seekable) tar.lz archive can be hundreds of times faster than
> sequential '--list' because, in addition to using several processors, it
> only needs to decompress part of each lzip member. See the following
> example listing the Silesia corpus on a dual core machine:
>
> tarlz -9 --no-solid -cf silesia.tar.lz silesia
> time lzip -cd silesia.tar.lz | tar -tf - (5.032s)
> time plzip -cd silesia.tar.lz | tar -tf - (3.256s)
> time tarlz -tf silesia.tar.lz (0.020s)
>
>
The difference shows when extracting a single file from near the end of
the archive. These tests were done on a ramdisk, to eliminate disk
activity as a factor. With indexing, only the compressed members that
contain the requested data need to be decompressed.
$ time pixz -x s/dickens < s.tpxz | tar x
pixz -x s/dickens < s.tpxz 0.44s user 0.02s system 103% cpu 0.436 total
$ time tarlz -xf silesia.tar.lz s/dickens
tarlz -xf silesia.tar.lz s/dickens 5.32s user 0.03s system 109% cpu 4.878 total
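
To illustrate why the index makes such a difference, here is a minimal
Python sketch of the idea. It is not how pixz or tarlz actually store
their data: zlib stands in for the real compressors, each file is
assumed to be its own independently compressed member, and the index
layout (name -> offset, length) and all function names are made up for
the example.

```python
import zlib


def build_archive(files):
    """Compress each file as an independent member.

    Returns (blob, index), where index maps each name to the
    (offset, length) of its compressed member inside blob.
    The index layout is hypothetical, chosen only for this sketch.
    """
    blob = bytearray()
    index = {}
    for name, data in files.items():
        member = zlib.compress(data)
        index[name] = (len(blob), len(member))
        blob += member
    return bytes(blob), index


def extract_indexed(blob, index, name):
    """With an index: seek to the one member needed and decompress it."""
    offset, length = index[name]
    return zlib.decompress(blob[offset:offset + length])


def extract_sequential(blob, position):
    """Without an index: decompress every member up to `position`."""
    rest = blob
    data = b""
    for _ in range(position + 1):
        d = zlib.decompressobj()
        data = d.decompress(rest)   # output of the current member
        rest = d.unused_data        # bytes belonging to later members
    return data
```

With the index, the cost of fetching the last file depends only on the
size of its own member. Without it, every earlier member has to be
decompressed first, which is roughly the gap seen in the timings above.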
> I expect that tarlz, or something based on the same principles, will
> obsolete conventionally compressed tar archives.
>
The difficulty would be supporting the different tar formats (ustar,
xustar, exustar, etc.). Perhaps the indexing could be abstracted into a
library, so that anyone could use it for their own compressed file
formats?
>
> Best regards,
> Antonio.
>