Re: [Help-tar] --deterministic option?


From: Jakob Bohm
Subject: Re: [Help-tar] --deterministic option?
Date: Wed, 27 May 2015 16:01:42 +0200
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0

On 27/05/2015 12:44, Jérémy Bobbio wrote:
Hi!

We are working in Debian, and I know other free software projects
care, on providing our users with a way to reproduce bit-for-bit
identical binary packages from the source and build environment.
See <https://wiki.debian.org/ReproducibleBuilds/About> for some
rationale and further explanations.

In order to do this, we need to make our build processes as
deterministic as possible. As you can imagine, Tar is quite involved in
producing Debian packages. A straightforward call leads to multiple
issues:

 * Order of files in the archive will depend on the filesystem order.
 * User and group names are recorded. This can be seen as a privacy leak
   for the package builder.
 * Permissions depend on the builder's umask.
 * Last modification times of members created during the build
   depend on the build time.
 * Also, if gzip compression is used, a timestamp is recorded in
   the gzip header.

So, we are currently turning calls like:

    tar -zcf archive.tar.gz src

into:

    find src -print0 | LC_ALL=C sort -z |
        GZIP=-9n tar --null -T - --no-recursion \
                --owner=root --group=root --numeric-owner \
                --mode=go=rX,u+rw,a-s \
                --mtime=debian/changelog \
                -zcf archive.tar

It would be great to avoid at least some of the boilerplate. Finding a
generic solution for permissions and modification times might be too
much, but having a `--deterministic` flag for the rest of the issues
would be quite helpful already.

What do you think?

Agree in principle.  Note that the boilerplate you
show looks like it doesn't handle:

- Creation/Access times (if stored in tar headers).
- Random gzip version dependencies (also affects DAK
 producing different gzipped index files depending on
 the Debian release installed on/near master).
- statoverride integration for suid/sgid binaries and
 special dir flags (mostly in basefiles and /usr/local).
- Adding .gz extension to archive.tar (probably just
 a typo).

Which probably makes the real command line even longer.
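
As a rough sketch only (assuming GNU tar, GNU find/sort and gzip),
the longer command line might be wrapped up as below.  The helper name
build_archive and its three arguments are made up for illustration.  It
pipes tar into gzip -n rather than relying on the GZIP variable, and
picks the ustar format because ustar headers have no access or change
time fields.  It does nothing about statoverride or gzip version
differences.

```shell
#!/bin/sh
# Sketch only: assumes GNU tar, GNU find/sort and gzip are installed.
# build_archive is a hypothetical helper name, not part of tar.
# usage: build_archive DIR OUTPUT.tar.gz MTIME
build_archive() {
    # Sorted, NUL-separated member list, so filesystem order is irrelevant.
    find "$1" -print0 | LC_ALL=C sort -z |
        tar --null -T - --no-recursion \
            --format=ustar \
            --owner=root --group=root --numeric-owner \
            --mode=go=rX,u+rw,a-s \
            --mtime="$3" \
            -cf - |
        gzip -9n > "$2"    # -n: no file name or timestamp in the gzip header
}
```

Running it twice over the same tree and comparing the results is a
cheap check (the directory, file names and date are placeholders):

    build_archive src one.tar.gz '@1432684800'   # 2015-05-27 00:00:00 UTC
    build_archive src two.tar.gz '@1432684800'
    cmp one.tar.gz two.tar.gz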

Also, at least a few versions back, dpkg-source
produced the wrong file timestamps in .diff.gz
files, affecting the consistency of source file
timestamps.

Now for tar, I would suggest (as a future feature) three
new determinism options:


--nomode  : Short for --owner=root --group=root
           --numeric-owner --mode=go=rX,u+rw, except
           for suid/sgid entries.  Combine with
           --mode=a-s to make all files root:root with
           no suid/sgid bits.
            For more advanced permission systems (acls
           etc.) --nomode will in general archive each
           entry as if all non-modify permissions are
           the union of those granted to any users, while
           modify permissions are for owner only and any
           special attributes (sgid/suid/capabilities
           etc.) are preserved.

--sort    : Causes the entries in each processed
           directory to be output in Asciibetical order
           (thus each dir needs to be loaded into memory
           and sorted, using a locale-independent
           strcmp() variant, but no need to preload
           entire file listing).
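
The ordering I have in mind can be previewed with sort(1) under
LC_ALL=C, which compares bytes much as strcmp() does.  A small sketch
with made-up file names; the function name asciibetical is
hypothetical, and this only shows the comparison rule, not the
per-directory traversal:

```shell
#!/bin/sh
# Byte-wise ("Asciibetical") ordering of names read from stdin.
# Forcing LC_ALL=C makes the result independent of the user's locale,
# so uppercase ASCII letters always sort before lowercase ones.
asciibetical() {
    LC_ALL=C sort
}

# Example input: four names in arbitrary order, one per line.
printf '%s\n' b.txt A.txt a.txt B.txt | asciibetical
```

This prints A.txt, B.txt, a.txt, b.txt in that order, whatever the
caller's locale is set to.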

--onepass : (not for package builders): If a file
           changes while being archived, the archived
           file contents, file length and sparse holes
           will all be determined from a single read()
           pass over the file until end of file is
           reached.
             This is in contrast to the current two-pass
           logic where length and holes are found on a
           first pass, contents of non-holes on a second
           pass, thus --onepass provides guarantees to
           applications (such as databases) that a
           restored file will have the property that if
           something in the file indicates that
           something earlier in the file was updated to
           checkpoint X, then that will be true, just
           as if the backup had been done with cat.
             The kernel/filesystem is responsible for
           presenting a consistent view of each file to
           all processes/handles (a property already
           needed for ordinary interprocess use of a
           shared file).
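
Until something like --onepass exists, the "done with cat" behaviour
can be approximated by snapshotting a file with one sequential read
and archiving the copy.  A minimal sketch; snapshot_file is a
hypothetical helper name, and note that cat does not preserve sparse
holes, which --onepass as described above would:

```shell
#!/bin/sh
# snapshot_file SRC DEST
# Copy SRC to DEST with a single front-to-back pass of read() calls,
# so length and contents come from the same pass over the file.
snapshot_file() {
    cat < "$1" > "$2"
}
```

The snapshot can then be archived in place of the live file, e.g.
(file names are placeholders):

    snapshot_file db.dat staging/db.dat
    tar -cf backup.tar -C staging db.dat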

Enjoy

Jakob
-- 
Jakob Bohm, CIO, Partner, WiseMo A/S.  https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded 
