From: Phillip Susi
Subject: Re: [Bug-tar] High per file overhead?
Date: Sat, 25 Feb 2006 12:26:58 -0500
User-agent: Mail/News 1.5 (X11/20060213)
Joerg Schilling wrote:
Phillip Susi <address@hidden> wrote:

Can anyone explain this?

~$: du -bsh Maildir/
98M Maildir/
~$: tar cf Maildir.tar Maildir/
~$: du -bsh Maildir.tar
112M Maildir.tar
~$: find Maildir | cpio -o -H newc > Maildir.cpio
204433 blocks
~$: du -bsh Maildir.cpio
100M Maildir.cpio

Why does tar have 12M more overhead than cpio? This Maildir is the lkml since Jan 1, so it contains ~20,000 messages/files, but ~734 bytes per file seems like a bit much for overhead.

As cpio does not offer a -H newc format, let me assume that you are talking about the -c or -H crc format...
Yes, it does have a newc format, see the info page. It is also the format used by the linux kernel for initramfs images.
cpio is unblocked and thus has trouble resyncing after a part of the archive that appears to be corrupted.

du only counts the file content and a part of the metadata (not counting e.g. the "inode" - see: /usr/include/sys/fs/ufs_inode.h)
Right, but the timestamps, owner, and mode only take up a handful of bytes, which cpio also stores.
cpio -Hcrc writes 110 bytes header + the file path name + the file content.

tar in the historical format or POSIX.1-1988 writes 512 bytes header + the file content rounded up to the next 512 byte boundary.

Recent tar (POSIX.1-2001 aka "pax") writes at least 1 KB per file in addition.
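The per-file figures quoted above can be turned into a rough size model. This is only a sketch under stated assumptions: the function names are made up for illustration, the cpio side ignores any name terminator or alignment padding a real archive may add, and the extra pax header and end-of-archive blocks are left out.

```python
# Rough per-member size model using only the numbers quoted in the thread.
# Assumptions: no pax extended headers, no cpio alignment padding,
# no end-of-archive trailer. Function names are hypothetical.

def round_up(n, multiple):
    """Round n up to the next multiple (n itself if already aligned)."""
    return (n + multiple - 1) // multiple * multiple

def tar_member_size(content_len):
    """512-byte header plus content padded to a 512-byte boundary."""
    return 512 + round_up(content_len, 512)

def cpio_member_size(content_len, path_len):
    """110-byte header plus path name plus unpadded content."""
    return 110 + path_len + content_len

if __name__ == "__main__":
    content_len, path_len = 5000, 40   # illustrative values, not from the thread
    tar_overhead = tar_member_size(content_len) - content_len
    cpio_overhead = cpio_member_size(content_len, path_len) - content_len
    print("tar overhead: ", tar_overhead, "bytes")
    print("cpio overhead:", cpio_overhead, "bytes")
```

For a 5000-byte message with a 40-character path, this model gives 632 bytes of overhead for tar and 150 for cpio. If file sizes are spread evenly relative to the 512-byte boundary, tar padding would average about 256 bytes, so roughly 768 bytes per file, which is in the same range as the ~734 bytes observed for the Maildir.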
I see. And the purpose for this is to try and recover from bad sectors since a file will always start on a sector boundary, so only the file contained in the bad sector will be lost?
Conclusion: if you write more metadata, you have more overhead.

But in real world use this has no relevance:

star -cPM -time f=/dev/null -C /usr .
star: 107825 blocks + 6656 bytes (total of 1104134656 bytes = 1078256.50k).
star: Total time 136.532sec (7897 kBytes/sec)

star -cPM -Hasc -time f=/dev/null -C /usr .
star: 104818 blocks + 2560 bytes (total of 1073338880 bytes = 1048182.50k).
star: Total time 134.415sec (7798 kBytes/sec)

The additional overhead that results from the tar format is typically less than 3%. If you compress the result and use an archiver that takes care of best compressibility (as star does), even the small "advantage" of the cpio format will go away. If you compress the result, the remaining difference is less than 1%.
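The "less than 3%" figure can be checked against the two totals that star printed. A small sketch, assuming the first run is the tar-type archive and the -Hasc run is the cpio-type one (the variable and function names are illustrative only):

```python
# Totals copied from the star output quoted above.
tar_total_bytes = 1104134656    # star -cPM
cpio_total_bytes = 1073338880   # star -cPM -Hasc

def relative_overhead_percent(larger, smaller):
    """Extra size of `larger` over `smaller`, as a percentage of `smaller`."""
    return 100.0 * (larger - smaller) / smaller

if __name__ == "__main__":
    extra = tar_total_bytes - cpio_total_bytes
    pct = relative_overhead_percent(tar_total_bytes, cpio_total_bytes)
    print(f"{extra} extra bytes, {pct:.2f}% larger")
```

With these totals the difference is 30795776 bytes, or about 2.87%, which agrees with the stated bound for the /usr tree. The Maildir case is less favorable: 112M against 100M is about 12%, because the files are small and the fixed per-file cost dominates.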
I'd say archiving my Maildir is a rather real world use, so this is somewhat relevant. I did notice though, that once compressed, the difference in size is greatly diminished.
Jörg