From: Jari Aalto
Subject: [Bug-tar] Feature suggestion: reordering files by extension improves compression
Date: 13 Dec 2006 11:51:01 +0200
Please consider ordering the files by extension inside tar as this
produces better compression ratios according to Paul Sladen.
Jari
http://www.paul.sladen.org/projects/compression/
[...]
Quick and dirty
A simpler ordering method, which clusters files based purely on
filename and extension, can be produced with a command similar to:
cat filelist.txt | rev | sort | rev > neworder.txt
This sorting process works by reversing each line in the file;
hello.text would become txet.olleh, allowing files with similar
file extensions or basenames to be ordered adjacently. The
filenames are then reversed again, producing the final file order.
This method appears to work well for language-packs containing
translated strings, resulting in a 14% improvement using bzip2
compression both before and afterwards, or 2% if using gzip (most
files are larger than the 32kB window size).
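As a rough illustration of the reverse, sort, reverse trick described
above, here is a minimal sketch. The function name order_by_suffix
and the sample filenames are made up for this example, and LC_ALL=C
is an added assumption to keep the sort order the same across
locales. It assumes the rev utility is installed and that no filename
contains a newline.

```shell
#!/bin/sh
# Hypothetical helper (not part of tar): read a newline-separated file
# list on stdin and print it ordered so that names sharing a suffix
# (extension, then basename tail) end up next to each other.
order_by_suffix() {
    # Reverse each line, sort the reversed names bytewise, reverse back.
    rev | LC_ALL=C sort | rev
}

# Small demonstration with made-up filenames.
printf 'a.txt\nb.c\nc.txt\nd.c\n' | order_by_suffix
# prints:
#   b.c
#   d.c
#   a.txt
#   c.txt
```

A list ordered this way could then be saved to neworder.txt and given
to GNU tar through its -T (--files-from) option, so that the archive
members are written in that order before compression.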
I came across a paper (without source code) which discusses
pre-ordering for efficient zdelta encoding as well as the tarfile
ordering: Compressing File Collections with a TSP-Based Approach
(PDF)[1]. For this paper, a relatively simple, greedy method is
chosen, yielding compression improvements of ~10-15% on webpages
of online news services.
[1] http://cis.poly.edu/tr/tr-cis-2004-02.pdf