Re: [Gnu-arch-users] Binary Diff System in Arch
From: Tom Lord
Subject: Re: [Gnu-arch-users] Binary Diff System in Arch
Date: Wed, 4 Feb 2004 08:59:18 -0800 (PST)
> From: Sam Phillips <address@hidden>
> I'm starting to contemplate work on a project, and I want to use
> Arch for revision control. As part of the project I need to be able to
> store non-human readable data inside the tree.
> I remember a thread a while back (it may have been about a year ago
> now) where people were talking about having some kind of hooks for
> diff/patch so that you could use Xdelta -- or some other delta'ing tool
> -- in Arch. I was wondering if anyone has been working on this and what
> state this idea is in right now.
> If this needs works and is still worth doing I might take a stab at
> it soon.
You got lots of good replies which could be summarized:
~ nobody is known to be working on it
~ we've talked about the design a bit
~ the trick is to get diff and patch to "Do the Right Thing"
~ one way to do that is by wrapping diff and patch, another by
modifying them
~ people will need your changes in one form or another if they
are going to be able to read your archives
I'm going to follow up with two additional points:
~ in addition to changing diff and patch, you'll also need to
change diff3.
~ enclosed is an explanation of how things work and an outline of what
needs to be done
-t
* text vs. binary files
GNU diff currently uses a heuristic to decide that a file is a
binary file before it does any further work. _As_I_recall_, the
heuristic is to look at the first 1K (or is it 4K) of the file and,
if any 0-bytes are found, assume that the file is binary.
(GNU diff _may_ currently have -- or could be expected to have in
the future -- additional heuristics.)
When comparing binary files, rather than emitting useful diffs, diff
just says "binary files differ" or "binary files are the same" as
the case may be.
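
A rough sketch of the heuristic as recalled above, in Python. The
function name and the 4096-byte default are assumptions made for
illustration only; this is not taken from the GNU diff sources, which
may look at a different prefix size or apply additional checks.

```python
def looks_binary(data: bytes, prefix_len: int = 4096) -> bool:
    """Guess that data is binary if a 0-byte appears in its first prefix_len bytes.

    prefix_len is an assumed value; the text above is unsure whether
    the real prefix is 1K or 4K.
    """
    return b"\0" in data[:prefix_len]
```

Note that a 0-byte beyond the prefix goes unnoticed, so a file like that
would be treated as text under this sketch.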
* diff output vs. patch
Patch reads its input file and tries (tolerantly of extra junk)
to parse it as one of the output formats of diff. Of course,
a string like "binary files differ" means nothing to patch
and it will exit with an error given such input.
* tla binary file detection
tla trusts diff to detect binary files. It uses a crude
technique: looking for the string "binary files" at the
start of diff output. Strangely enough, this seems to work
regardless of how a user's locale is set.
If diff has reported that at least one of the two files which differ
is a binary file, then rather than storing the diff output,
tla falls back to the strategy of storing whole-text copies
of both files.
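
The detection and fallback described above can be sketched roughly
like this. It is not tla's actual code: the function names, the
case-insensitive comparison, and the tuple layout are all assumptions
made for illustration.

```python
def diff_reported_binary(diff_output: str) -> bool:
    """Crude check: does diff's output start with "binary files"?

    Comparing case-insensitively is an assumption of this sketch.
    """
    return diff_output.lower().startswith("binary files")


def changeset_entry(old: bytes, new: bytes, diff_output: str):
    """Store the diff if there is one, else whole copies of both files."""
    if diff_reported_binary(diff_output):
        return ("whole-files", old, new)
    return ("diff", diff_output)
```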
* critical properties of diff and patch
If diff _can_ diff two files, tla relies on two key properties
of diff and patch:
Suppose we have two text files A and B. We can write
the diff between them:
diff(A,B)
Patch provides two operations, forward and backward patching, both
of which arch relies upon. These operations have the properties:
patch (A, diff(A,B)) == B
reverse_patch (B, diff(A,B)) == A
Additionally, patch provides for "inexact patching". So that if we
have a file C, we get:
patch (C, diff(A,B)) --> C', modified "similarly" to B
relative to A, plus a .rej file
if conflicts occurred
reverse_patch (C, diff(A,B)) --> C', modified "similarly" to A
relative to B, plus a .rej file
if conflicts occurred
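
To make the two exact-patching equations concrete, here is a toy model
in Python built on the standard difflib module. It is only a sketch:
the hunk representation is invented for this example and is not the
output format of GNU diff, and inexact patching (the file C case, with
its .rej file) is not modeled at all.

```python
from difflib import SequenceMatcher


def diff(a, b):
    """Record every changed region with both its old and its new lines."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    hunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            hunks.append((i1, a[i1:i2], j1, b[j1:j2]))
    return hunks


def patch(lines, hunks):
    """Apply hunks exactly; refuse if the old lines are not where expected."""
    out, pos = [], 0
    for old_start, old, _new_start, new in hunks:
        out.extend(lines[pos:old_start])
        if lines[old_start:old_start + len(old)] != old:
            raise ValueError("hunk does not apply exactly")
        out.extend(new)
        pos = old_start + len(old)
    out.extend(lines[pos:])
    return out


def reverse_patch(lines, hunks):
    """Swap the old and new sides of each hunk, then patch as usual."""
    swapped = [(new_start, new, old_start, old)
               for old_start, old, new_start, new in hunks]
    return patch(lines, swapped)
```

Because each hunk keeps both sides of the change, the same diff can be
applied in either direction, which is the property arch depends on.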
The exit statuses of the diff and patch processes are critical:
diff exit statuses
0 -- files are the same
1 -- files differ
2 -- an error during processing
patch exit statuses
0 -- patch applied without conflicts
1 -- patch applied with conflicts
2 -- an error during processing
The output format of diff has a critical property: it is a
plain-text format suitable for displaying on a terminal or
sending in email. This is critical for the output of commands
such as "tla changes".
* critical properties of diff3
In some situations, tla is operating on three text files which could
be named MERGED-TO, ANCESTOR, and MERGED-FROM.
It wants to compute what is in essence:
patch (MERGED-TO, diff(ANCESTOR, MERGED-FROM))
but does that in a single step using diff3:
diff3 (MERGED-TO, ANCESTOR, MERGED-FROM)
There are two reasons for doing that. First, diff3 can sometimes
do a better job of avoiding conflicts. Second, if conflicts occur,
diff3 can generate a kind of in-file conflict markers that many
people like (comparable to those generated by CVS, for example).
diff3's exit status is critical and similar to patch's:
diff3 exit statuses
0 -- merged without conflicts
1 -- merged with conflicts
2 -- an error during processing
tla will never attempt to use diff3 if diff reports that at least
one of ANCESTOR and MERGED-FROM is a binary file.
* So, What to Do? (part 1)
When people say "modify [or wrap] diff and patch" the general
idea they have in mind is to make it so that diff _never_
reports "binary files differ". Instead, it should emit some
kind of "binary diff" (such as xdelta might produce). Similarly,
patch should know how to apply the binary diff.
All of the critical properties listed above for diff and patch
must be preserved:
xpatch (a.jpg, xdiff(a.jpg, b.jpg)) == b.jpg
xreverse_patch (b.jpg, xdiff(a.jpg, b.jpg)) == a.jpg
xpatch (c.jpg, xdiff(a.jpg, b.jpg))
--> c'.jpg, modified "similarly" to b.jpg with
a .rej file if necessary
xreverse_patch (c.jpg, xdiff(a.jpg, b.jpg))
--> c'.jpg, modified "similarly" to a.jpg with
a .rej file if necessary
exit statuses as before
xdiff output must be plain-text -- suitable for what-changed
output and email.
The first two equations -- exact patching -- should be fairly easy.
A good suggestion was made at one point that the output of
% xdiff a.jpg b.jpg
should include md5 checksums for both a.jpg and b.jpg, along with
the xdelta output. (I'm assuming that xdelta doesn't already
include those checksums in its output).
The format of that xdiff output should be such that it can not be
mistaken for a textual diff.
The xpatch program should recognize that format and compare the
checksum for the file being patched to that of a.jpg. If they
match, it should apply the xdelta, producing b.jpg.
And similarly: the xreverse_patch program should recognize that
format and compare the checksum for the file being patched to that
of b.jpg. If they match, it should apply the xdelta, producing
a.jpg.
What if the checksums don't match? Unlike textual patching,
binary patching can't reasonably "fudge it". So if the checksums
don't match, then xpatch and xreverse_patch should leave the
file being patched unmodified -- and store the entire xdiff output
as a .rej file.
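
Here is a minimal Python sketch of the checksum scheme just described.
Several things in it are assumptions: the header layout and marker line
are invented, and a padded XOR of the two files (base64-encoded so the
output stays plain text) stands in for real xdelta output, which is not
available in the standard library. The "rejected" value returned on a
checksum mismatch is what would be written to the .rej file.

```python
import base64
import hashlib

MAGIC = "xdiff-binary-delta 1"  # hypothetical marker; cannot look like a text diff


def _md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def _xor(x: bytes, y: bytes) -> bytes:
    # Pad the shorter input with 0-bytes, then XOR byte by byte.
    n = max(len(x), len(y))
    x, y = x.ljust(n, b"\0"), y.ljust(n, b"\0")
    return bytes(p ^ q for p, q in zip(x, y))


def xdiff(a: bytes, b: bytes) -> str:
    """Plain-text delta with checksums and lengths for both files."""
    payload = base64.b64encode(_xor(a, b)).decode("ascii")
    return "\n".join([
        MAGIC,
        f"old-md5 {_md5(a)} {len(a)}",
        f"new-md5 {_md5(b)} {len(b)}",
        payload,
    ]) + "\n"


def _apply(target: bytes, delta: str, reverse: bool):
    """Return (result, rejected); rejected is the whole delta on a mismatch."""
    lines = delta.splitlines()
    if len(lines) != 4 or lines[0] != MAGIC:
        raise ValueError("not an xdiff delta")
    _, old_sum, old_len = lines[1].split()
    _, new_sum, new_len = lines[2].split()
    if reverse:
        want_sum, out_len = new_sum, int(old_len)
    else:
        want_sum, out_len = old_sum, int(new_len)
    if _md5(target) != want_sum:
        return target, delta  # leave the file unmodified, reject everything
    return _xor(target, base64.b64decode(lines[3]))[:out_len], None


def xpatch(target: bytes, delta: str):
    return _apply(target, delta, reverse=False)


def xreverse_patch(target: bytes, delta: str):
    return _apply(target, delta, reverse=True)
```

A real implementation would also need to map these outcomes onto the
exit statuses listed earlier (0 for a clean apply, 1 when the delta is
rejected, 2 for errors).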
xdiff3 is easy, given xdiff. If ANCESTOR and MERGED-TO are exactly
the same, then the merged file output should just be a copy of
MERGED-FROM. Otherwise, the output should be a copy of MERGED-TO
and "xdiff ANCESTOR MERGED-FROM" stored as a .rej file. (Note --
tla will need a slight modification to how it handles the .rej
file in this case but nothing major.)
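
The xdiff3 rule above is short enough to sketch directly in Python. The
signature is an assumption: the xdiff function is passed in as a
parameter so that the sketch stands alone, and the third element of the
result mirrors the diff3 exit statuses listed earlier.

```python
from typing import Callable, Optional, Tuple


def xdiff3(
    merged_to: bytes,
    ancestor: bytes,
    merged_from: bytes,
    xdiff: Callable[[bytes, bytes], str],
) -> Tuple[bytes, Optional[str], int]:
    """Return (merged output, .rej contents or None, exit status)."""
    if merged_to == ancestor:
        # The tree copy is untouched since ANCESTOR: take MERGED-FROM whole.
        return merged_from, None, 0
    # Otherwise keep MERGED-TO and reject the entire binary delta.
    return merged_to, xdiff(ancestor, merged_from), 1
```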
* What's the Result?
If you configure arch to use xdiff, xpatch, and xdiff3:
~ your revisions that involve binary files will, in general,
_not_ be readable by people not using xdiff, xpatch, and
xreverse_patch. (But if xdiff, xpatch, and xreverse_patch are
good, either they'll be folded into diff/patch or we'll make sure
that they are distributed with tla -- so this incompatibility
will be only a temporary problem.)
~ archives created by people not using xdiff, xpatch, and xdiff3
will be readable by you.
~ your revisions that don't involve binary files will be readable
by everyone else.
~ if you _merge_ changes involving binary files, and there are
conflicts in the binary files, you'll get conflict markers but
you _won't_ get a full-text copy of the merged-from file.
In many situations that's no loss at all. You can fetch the
merged-from file by other means if you need it.
~ if you _merge_ changes involving binary files, but there are
no conflicts (the copy in your tree is the same as in the base
revision from which the merge changeset was computed) -- that
will work as expected: you'll wind up with an accurately modified
version of the binary file.
~ The contents of .rej files will change: sometimes they will
contain rejected xdeltas instead of rejected text hunks from diff.
* So, What to Do? (part 2)
The only drawback to all of this is the result that says:
~ if you _merge_ changes involving binary files, and there are
conflicts in the binary files, you'll get conflict markers but
you _won't_ get a full-text copy of the merged-from file.
Sometimes that's desirable, sometimes not. I think it should be
an option.
A reasonable way to make it an option is to allow users to set a
parameter in ./{arch} which will turn on the xdelta behavior and
otherwise stick with the current mechanism of using whole-text copies
of binary files. It's easy to modify tla to check for this
parameter and pass an option to diff.
The implication for xdiff is that it should take an option
(something like "--xdelta-if-binary") and, absent that option,
behave exactly like normal GNU diff.
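
As a sketch of how that option might gate the behavior, in Python: the
flag spelling follows the "something like" suggestion above, and the
function names and the returned mode strings are hypothetical.

```python
import argparse


def parse_xdiff_args(argv):
    """Parse a hypothetical xdiff command line."""
    parser = argparse.ArgumentParser(prog="xdiff")
    parser.add_argument("--xdelta-if-binary", action="store_true",
                        help="emit a binary delta instead of 'binary files differ'")
    parser.add_argument("old")
    parser.add_argument("new")
    return parser.parse_args(argv)


def choose_output(is_binary: bool, xdelta_if_binary: bool) -> str:
    """Pick what xdiff should emit for one pair of files."""
    if not is_binary:
        return "text-diff"            # ordinary diff output
    if xdelta_if_binary:
        return "xdelta"               # new behavior, only on request
    return "binary-files-differ"      # same as normal GNU diff
```

Without the flag, every path through choose_output matches what diff
does today, so archives stay readable by people who don't have xdiff.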
Additionally, even if a changeset includes xdelta patches, a
user may sometimes want (as a convenience) tla to produce
full-texts of merged-from files when conflicts occur. We
can add options for that, too -- though doing so is essentially
orthogonal to all of the other xdiff/xpatch/xdiff3 work.
* Summary:
You can get the functionality you're after by writing xdiff, xpatch,
and xdiff3. It will work reasonably without any modifications to
tla at all.
When that's done, we can make a few small changes to tla to make
the functionality slightly more convenient to use.