From: Timothe Litt
Subject: tar is creating corrupt archives when soft links are present
Date: Thu, 1 Dec 2022 09:25:01 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0
These are serious problems - I've just spent 4 days
reconstructing disks from faulty backups exhibiting these issues.
I was lucky that it was ONLY 4 days; there were >30,000 files
affected. I currently have NO reliable backups.
I have come up with a fairly small reproducer. Key
observations:
Here is the subject data; /bin on a fairly old system. I'm
showing just the links to keep this small. Note: NO hard links.
# ls -l /bin | grep '^[lh]'
lrwxrwxrwx 1 root root 4 Nov 28 08:45 awk -> gawk
lrwxrwxrwx 1 root root 21 Nov 28 08:45 bash -> ../usr/local/bin/bash
lrwxrwxrwx 1 root root 4 Nov 28 08:45 csh -> tcsh
lrwxrwxrwx 1 root root 8 Nov 28 08:45 dnsdomainname -> hostname
lrwxrwxrwx 1 root root 8 Nov 28 08:45 domainname -> hostname
lrwxrwxrwx 1 root root 4 Nov 28 08:45 egrep -> grep
lrwxrwxrwx 1 root root 2 Nov 28 08:45 ex -> vi
lrwxrwxrwx 1 root root 4 Nov 28 08:45 fgrep -> grep
lrwxrwxrwx 1 root root 3 Nov 28 08:45 gtar -> tar
lrwxrwxrwx 1 root root 4 Nov 28 08:45 mailx -> mail
lrwxrwxrwx 1 root root 8 Nov 28 08:45 nisdomainname -> hostname
lrwxrwxrwx 1 root root 13 Nov 28 08:45 perl -> /usr/bin/perl
lrwxrwxrwx 1 root root 2 Nov 28 08:45 rvi -> vi
lrwxrwxrwx 1 root root 2 Nov 28 08:45 rview -> vi
lrwxrwxrwx 1 root root 4 Nov 28 08:45 sh -> bash
lrwxrwxrwx 1 root root 10 Nov 28 08:45 traceroute6 -> traceroute
lrwxrwxrwx 1 root root 10 Nov 28 08:45 tracert -> traceroute
lrwxrwxrwx 1 root root 2 Nov 28 08:45 view -> vi
lrwxrwxrwx 1 root root 8 Nov 28 08:45 ypdomainname -> hostname
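A cross-check on the "NO hard links" note, as a sketch only: as far as the coreutils documentation describes, ls marks symlinks with a leading 'l' but has no 'h' type character (that letter appears in tar's own listings), so a grep for '^[lh]' on ls output can match only symlinks. Hard-linked files have to be found through their link count instead. The scratch directory and file names below are made up for illustration; nothing here is run against the real /bin.

```shell
work=$(mktemp -d) || exit 1
cd "$work" || exit 1
echo data > gunzip
ln gunzip zcat          # hard link: the link count of both names becomes 2
ln -s gunzip gz-soft    # symlink: shown by ls -l with a leading 'l'
hard=$(find . -type f -links +1 | sort)   # regular files with more than one name
soft=$(find . -type l | sort)             # symbolic links
printf 'hard-linked:\n%s\nsymlinks:\n%s\n' "$hard" "$soft"
cd / && rm -rf "$work"
```

Running the same two find commands in /bin would show whether any regular files there share an inode.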
It shouldn't matter, but FWIW the filesystem is ext3.
Here's what happens with tar 1.34, which is the current release on ftp.gnu.org. I create an archive (the explicit xz is to isolate & test the same way with the older version).
Note that 'bin/*' is the key to global merging; ('cd /
&& tar -cf - bin') will not fail in the same way, as shown
later.
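The difference between the two invocations starts in the shell rather than in tar: with 'bin/*' the shell expands the pattern and tar receives every non-dot entry as a separate operand, in sorted order, while with 'bin' tar receives a single operand and walks the directory itself. A small illustration, using a made-up scratch directory:

```shell
work=$(mktemp -d) || exit 1
cd "$work" || exit 1
mkdir bin
touch bin/zcat bin/awk bin/gzip
set -- bin/*            # what tar sees for: tar -cf - bin/*
args_glob="$*"
set -- bin              # what tar sees for: tar -cf - bin
args_dir="$*"
echo "glob operands: $args_glob"
echo "dir operand:   $args_dir"
cd / && rm -rf "$work"
```

This may also account for part of the ordering difference between the two kinds of run, since a directory walk need not follow the shell's sorted order.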
Note that in the following example, all the links are converted
to hard links in addition to being misdirected. It's actually
more common for most of the soft links to remain soft links, but
all pointing to the first soft link target encountered. (Think
libc.so => vi...) I don't have a small reproducer for the
latter.
Also, note that the output order differs. My guess is that tar
is processing soft links as if they were hard links, and caching
in order to merge names linked to a common inode. The directory
is not changing; /bin is stable.
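For readers unfamiliar with the bookkeeping this guess refers to: archivers commonly detect hard links by remembering the inode of each file already stored and emitting a "link to" entry when the same inode turns up under another name. The following is only an illustration of that idea in shell, not GNU tar's actual code, and the scratch files are hypothetical. Under this scheme two symlinks would not be merged, because each symlink has an inode of its own.

```shell
work=$(mktemp -d) || exit 1
cd "$work" || exit 1
echo data > gunzip
ln gunzip zcat        # hard link: a second name for gunzip's inode
ln -s gunzip awk      # symlinks: each one gets an inode of its own
ln -s gunzip csh
# List "inode name" for every entry (-d: list the entry itself), then
# report each name whose inode was already seen under an earlier name.
merged=$(ls -i1d -- * | awk '
  ($1 in first) { print $2 " link to " first[$1]; next }
                { first[$1] = $2 }')
echo "$merged"
cd / && rm -rf "$work"
```

Only the genuine hard-link pair is reported here; the two symlinks stay separate even though they share a target.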
# /usr/local/bin/tar --version | head -n1
tar (GNU tar) 1.34
# ( cd / && /usr/local/bin/tar -cf - bin/* | xz --stdout >/root/test.1.34.tar.xz )
# tar -tvf /root/test.1.34.tar.xz | grep -- ' -> \| link to '
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/awk -> gawk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/bash link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/csh link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/dnsdomainname link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/domainname link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/egrep link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/ex link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/fgrep link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/gtar link to bin/awk
hrwxr-xr-x root/root 0 2006-10-01 16:22 bin/gzip link to bin/gunzip
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/mailx link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/nisdomainname link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/perl link to bin/awk
hrwxr-xr-x root/root 0 2007-01-18 06:59 bin/red link to bin/ed
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/rvi link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/rview link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/sh link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/traceroute6 link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/tracert link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/view link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/ypdomainname link to bin/awk
hrwxr-xr-x root/root 0 2006-10-01 16:22 bin/zcat link to bin/gunzip
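One way to spot this pattern mechanically in a listing, sketched with awk: flag any hard-link entry whose target is itself listed as a symlink. The sample lines are copied from the output above; the field positions assume this exact 'tar -tv' layout and member names without spaces.

```shell
# Sample lines copied from the listing above (tar 1.34, 'bin/*' case).
listing='lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/awk -> gawk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/bash link to bin/awk
hrwxrwxrwx root/root 0 2022-11-29 14:37 bin/csh link to bin/awk
hrwxr-xr-x root/root 0 2006-10-01 16:22 bin/gzip link to bin/gunzip'
# Field 6 is the member name; the last field of an 'h' line is its target.
suspect=$(printf '%s\n' "$listing" | awk '
  /^l/ { symlink[$6] = 1 }
  /^h/ { count[$NF]++ }
  END  { for (t in count) if (t in symlink) print t, count[t] }')
echo "$suspect"
```

For the sample it reports bin/awk with a count of 2, while the gzip entry is left alone because its target is not a symlink in the listing.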
Without the wildcard, tar 1.34 does not merge the links. But it
does convert some soft links to hard, which is not semantically
equivalent. (E.g., consider b soft linked to a. Update a; b gets
the new version. Convert to a hard link and update a. Now a is
the new version, and b is the old.)
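The parenthetical example can be reproduced in a scratch directory. This sketch assumes that "update a" means replacing the file by rename, as package managers and many editors do; an in-place overwrite would change what both kinds of link see. The names a and b follow the example, with b split into b_soft and b_hard for comparison.

```shell
work=$(mktemp -d) || exit 1
cd "$work" || exit 1
echo old > a
ln -s a b_soft        # soft link: looks up the name "a" each time it is opened
ln a b_hard           # hard link: a second name for a's current inode
# Update a by writing a new file and renaming it over the old one
# (the assumed update style).
echo new > a.tmp
mv -f a.tmp a
via_soft=$(cat b_soft)
via_hard=$(cat b_hard)
echo "soft: $via_soft, hard: $via_hard"
cd / && rm -rf "$work"
```

The soft link follows the name to the new contents, while the hard link keeps pointing at the old inode.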
I have flagged the hard links with !! hard so they stand out from
the clutter.
# ( cd / && /usr/local/bin/tar -cf - bin | xz --stdout >/root/test.1.34.tar.xz )
# tar -tvf /root/test.1.34.tar.xz | grep -- ' -> \| link to '
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/nisdomainname -> hostname
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/traceroute6 -> traceroute
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/dnsdomainname -> hostname
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/bash -> ../usr/local/bin/bash
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/gtar -> tar
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/awk -> gawk
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/sh -> bash
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/ex -> vi
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/fgrep -> grep
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/csh -> tcsh
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/rview -> vi
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/egrep -> grep
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/mailx -> mail
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/view -> vi
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/tracert -> traceroute
hrwxr-xr-x root/root 0 2006-10-01 16:22 bin/gzip link to bin/zcat !! hard
hrwxr-xr-x root/root 0 2007-01-18 06:59 bin/ed link to bin/red !! hard
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/rvi -> vi
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/ypdomainname -> hostname
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/domainname -> hostname
hrwxr-xr-x root/root 0 2006-10-01 16:22 bin/gunzip link to bin/zcat !! hard
lrwxrwxrwx root/root 0 2022-11-29 14:37 bin/perl -> /usr/bin/perl
tar 1.15.1 exhibits the link conversion, but not the merging. Here is a sample:
# /bin/tar --version
tar (GNU tar) 1.15.1
# ( cd /bin && /bin/tar -cf - * | xz --stdout >/root/test.1.15.1.tar.xz )
[root@overkill:~]# tar -tvf /root/test.1.15.1.tar.xz | grep -- ' -> \| link to '
lrwxrwxrwx root/root 0 2022-11-28 08:45 awk -> gawk
lrwxrwxrwx root/root 0 2022-11-28 08:45 bash -> ../usr/local/bin/bash
lrwxrwxrwx root/root 0 2022-11-28 08:45 csh -> tcsh
lrwxrwxrwx root/root 0 2022-11-28 08:45 dnsdomainname -> hostname
lrwxrwxrwx root/root 0 2022-11-28 08:45 domainname -> hostname
lrwxrwxrwx root/root 0 2022-11-28 08:45 egrep -> grep
lrwxrwxrwx root/root 0 2022-11-28 08:45 ex -> vi
lrwxrwxrwx root/root 0 2022-11-28 08:45 fgrep -> grep
lrwxrwxrwx root/root 0 2022-11-28 08:45 gtar -> tar
hrwxr-xr-x root/root 0 2006-10-01 16:22 gzip link to gunzip !! hard
lrwxrwxrwx root/root 0 2022-11-28 08:45 mailx -> mail
lrwxrwxrwx root/root 0 2022-11-28 08:45 nisdomainname -> hostname
lrwxrwxrwx root/root 0 2022-11-28 08:45 perl -> /usr/bin/perl
hrwxr-xr-x root/root 0 2007-01-18 06:59 red link to ed !! hard
lrwxrwxrwx root/root 0 2022-11-28 08:45 rvi -> vi
lrwxrwxrwx root/root 0 2022-11-28 08:45 rview -> vi
lrwxrwxrwx root/root 0 2022-11-28 08:45 sh -> bash
lrwxrwxrwx root/root 0 2022-11-28 08:45 traceroute6 -> traceroute
lrwxrwxrwx root/root 0 2022-11-28 08:45 tracert -> traceroute
lrwxrwxrwx root/root 0 2022-11-28 08:45 view -> vi
lrwxrwxrwx root/root 0 2022-11-28 08:45 ypdomainname -> hostname
hrwxr-xr-x root/root 0 2006-10-01 16:22 zcat link to gunzip !! hard
Without the wildcard, links remain distinct, but different files
are selected for bogus conversion to hard links.
# ( cd / && /bin/tar -cf - bin | xz --stdout >/root/test.1.15.1.tar.xz )
[root@overkill:~]# tar -tvf /root/test.1.15.1.tar.xz | grep -- ' -> \| link to '
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/nisdomainname -> hostname
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/traceroute6 -> traceroute
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/dnsdomainname -> hostname
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/bash -> ../usr/local/bin/bash
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/gtar -> tar
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/awk -> gawk
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/sh -> bash
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/ex -> vi
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/fgrep -> grep
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/csh -> tcsh
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/rview -> vi
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/egrep -> grep
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/mailx -> mail
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/view -> vi
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/tracert -> traceroute
hrwxr-xr-x root/root 0 2006-10-01 16:22 bin/gzip link to bin/zcat !! hard
hrwxr-xr-x root/root 0 2007-01-18 06:59 bin/ed link to bin/red !! hard
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/rvi -> vi
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/ypdomainname -> hostname
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/domainname -> hostname
hrwxr-xr-x root/root 0 2006-10-01 16:22 bin/gunzip link to bin/zcat !! hard
lrwxrwxrwx root/root 0 2022-11-28 08:45 bin/perl -> /usr/bin/perl
If you are wondering why many links have recent dates - that's an artifact of recovering the correct links after restoring from a corrupt archive.
Timothe Litt
ACM Distinguished Engineer
--------------------------
This communication may not represent the ACM or my employer's
views, if any, on the matters discussed.