
[Bug-tar] Speeding up GNU tar file I/O ?


From: Kaveh R. Ghazi
Subject: [Bug-tar] Speeding up GNU tar file I/O ?
Date: Fri, 20 Jan 2006 13:39:36 -0500 (EST)

Hi - I'd like to make a suggestion for GNU tar.

I use tar fairly often to pack and unpack large directories
containing snapshots of code that I distribute to various computers
for testing.  On certain platforms with slow filesystems (e.g.
solaris2.7) this takes a really long time relative to other platforms
(say linux-gnu).  So one day I ran the Solaris system call tracing
utility on GNU tar (truss -p <pid>) to see what was taking so long,
and saw this:

open64("snapshot-SVN20060120/libstdc++-v3/include/bits/ostream.tcc",
       O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
...
utime("snapshot-SVN20060120/libstdc++-v3/include/bits/ostream.tcc",
      0xFFBFF620) = 0
open64("snapshot-SVN20060120/libstdc++-v3/include/bits/boost_concept_check.h",
       O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
...
utime("snapshot-SVN20060120/libstdc++-v3/include/bits/boost_concept_check.h",
      0xFFBFF620) = 0
open64("snapshot-SVN20060120/libstdc++-v3/include/bits/sstream.tcc",
       O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
...
utime("snapshot-SVN20060120/libstdc++-v3/include/bits/sstream.tcc",
      0xFFBFF620) = 0
etc

(In the above, the "..." represents read/write/close calls.)  Notice
the system calls with really/long/path/names/being/repeated.  My
understanding is that this can slow down I/O dramatically on some
systems, because for each such call the kernel has to walk the
directory hierarchy and look up each directory component of the path
in turn.  Yes, modern kernels cache such lookups, but it's still
better not to have to do them in the first place.  And some of us have
to use older OSes which plotz.
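
To put a rough number on that, here is a back-of-the-envelope sketch
in Python (not tar code).  It assumes a deliberately simple model: one
lookup per path component per path-taking call, two such calls per
file (the open64 and the utime from the trace), and no kernel caching.
The function names are made up for illustration.

```python
def lookups_full_paths(names, calls_per_file=2):
    """Component lookups when every call is handed the full path."""
    return sum(len(name.split("/")) for name in names) * calls_per_file


def lookups_with_chdir(names, calls_per_file=2):
    """Component lookups when we chdir once per run of same-directory files.

    Assumption: each chdir resolves the whole directory name from the
    extraction root, and later calls only resolve the basename.
    """
    total = 0
    current_dir = None
    for name in names:
        parts = name.split("/")
        directory = "/".join(parts[:-1])
        if directory != current_dir:
            total += len(parts) - 1   # the chdir walks the directory part once
            current_dir = directory
        total += calls_per_file       # each call now sees a one-component name
    return total
```

For the three files in the trace above, this model gives 30 lookups
with full paths versus 10 with a single chdir.  Real kernels will
differ, so treat the ratio as illustrative only.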

Another problem this coding style causes is that it imposes a limit on
the directory depth of your tar archives.  I.e. MAXPATHLEN may differ
between the system where you created your tar archive and the one
where you unpack it, so tar can fail unnecessarily.

Many GNU utilities (e.g. GNU find, rm, mkdir -p) avoid this by calling
chdir("directoryname"), doing I/O calls on relative filenames like
"foo", and then calling chdir("..") when finished with that dir.  For
example in the GNU coreutils testsuite, the GNU rm utility is
exercised by creating a directory 400 levels deep and then removing it.
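
As a minimal sketch of that pattern (in Python rather than C, and not
taken from any of those utilities), the hypothetical helper below
changes into a directory, creates files there by basename with the
same open flags seen in the trace, and then returns to where it
started.  It uses fchdir on a saved descriptor instead of chdir("..")
to get back, which is my own choice for the sketch.

```python
import os


def write_files_in_dir(directory, files, mode=0o644):
    """Create each (basename, data) pair inside `directory` using short names."""
    saved = os.open(".", os.O_RDONLY)      # remember the starting directory
    try:
        os.chdir(directory)                # one long-path lookup
        for basename, data in files:
            # Only a one-component name is passed to open from here on.
            fd = os.open(basename, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)
            with os.fdopen(fd, "wb") as out:
                out.write(data)
    finally:
        os.fchdir(saved)                   # go back even if something failed
        os.close(saved)
```

The point is only that the long directory name is resolved once per
directory rather than once per system call.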

Perhaps tar archives are different in that there's no requirement for
files in the same directory to appear sequentially.  But you can do a
string check on the dirname of the previous and next files to see if
they match.  If they don't match then you have to chdir.  Otherwise
you can do file I/O on the basename of the next file.  I suspect this
would be an improvement in the vast majority of cases.
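
The dirname check could look something like the following Python
sketch.  It is only an illustration of the idea, not a patch: the
function name and the ("chdir", ...) / ("open", ...) step tuples are
invented here, and the chdir targets are written relative to the
extraction root, so a real implementation would still have to decide
how to get from one directory to the next.

```python
import posixpath


def plan_operations(member_names):
    """Turn archive member names into a list of (action, argument) steps."""
    steps = []
    current_dir = ""                       # "" stands for the extraction root
    for name in member_names:
        dirname, basename = posixpath.split(name)
        if dirname != current_dir:         # string check against previous file
            steps.append(("chdir", dirname))
            current_dir = dirname
        steps.append(("open", basename))   # file I/O uses the basename only
    return steps
```

For members "a/b/x.h", "a/b/y.h", "a/c/z.h" this plans a chdir to
"a/b", two opens, a chdir to "a/c", and one more open.  Archives whose
members jump between directories on every entry would gain nothing,
but would not be much worse off either.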

I looked in tar-1.15.1, which I believe is the latest release, but it
still seems to exhibit the long pathname behavior I outlined above.
Could the chdir mechanism I described be considered instead?

                Thanks,
                --Kaveh
--
Kaveh R. Ghazi                  address@hidden



