monotone-devel
From: Mikkel Fahnøe Jørgensen
Subject: [Monotone-devel] Re: suturing, resurrection, renaming and unique file identifiers
Date: Thu, 29 May 2008 10:04:27 +0200

2008/5/29 William Uther <address@hidden>:

OK, first I'll answer some of your complexity concerns, as you do have a point.
While writing my answer I discovered a possible new index merge algorithm
that could make this routing decision efficient, if it actually works.
Mind you, it just occurred to me; it is described at the bottom of this mail.
In addition, I have not studied general routing algorithms, and those
might also offer efficient solutions.

>  In general, the whole point of fileIDs is that it is a constant time
> operation to see if two of them are the same.


I can see that.
My point is that the concept has its limitations, so it is worthwhile
investigating what the underlying problem really is. Once you have
that model, it is easier to optimise and to objectively limit a solution
for performance reasons. I'm not arguing that my proposal can be
done efficiently in monotone's current data structures.

 If you have to search some
> history graph, then things are already significantly less efficient.
Clearly, but if the choice is between not working and working ...
The interesting point here is how much slower the cases that work
today will become in this new scenario.
I'm not going to do a detailed analysis here, but assume we do not
change node ids when all merges originate from the same file id.
Then we can do exactly what monotone does today.
So the cost is increased time, and especially code complexity, for the
other cases that are not handled well today.
It could turn out that performance is not an issue, in which case trying
to keep the same node ids is not worthwhile.

>
>  With file suturing, we are looking to implement a mechanism whereby we can
> handle multiple node-ids for a single conceptual file.

If we look at my proposal from a conceptual point of view, you can see
a collection of node id's as equivalent to a new node id from which
you can derive history.

 I was thinking some
> form of http://en.wikipedia.org/wiki/Disjoint-set_data_structure was what we
> needed.

If we consider a simple binary tree index for multiple node id's, then
the root node of such a tree corresponds to a new file id, and the
nodes correspond to past versions.
Thus, my proposal actually implements such a structure, but makes it
shared across all revisions. I'm not saying this is better than the
disjoint set solution, I'm just saying it probably is solving the same
problem.
The problem is, of course, reading such a structure into memory efficiently.
But again, once the conceptual model is agreed upon, there can be many
implementations.
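For reference, the disjoint-set structure William links to can be sketched in a few lines (names here are illustrative, not monotone code); the root of each set plays exactly the role of the "new node id" from which history can be derived:

```python
class DisjointSet:
    """Union-find over node ids; the root of each set acts as the single
    representative id for all the node ids sutured into it."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            # Path compression keeps later lookups near constant time.
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        # Suture: merge the two sets containing a and b.
        self.parent[self.find(a)] = self.find(b)

    def same_file(self, a, b):
        # Constant-time-ish check that two node ids name one conceptual file.
        return self.find(a) == self.find(b)
```

With this, checking whether two node ids belong to the same conceptual file is a pair of `find` calls rather than a history search.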

 A bloom filter might work, but the fact that it is a stochastic
> data structure worries me.  You'd either need to have a large number of
> redundant hashes to reduce the probability significantly, or you'd need to
> have some double-check after matching.

That is true. You do need double checking and you do need a fairly
large hash. My point was to have one bloom filter for the entire
revision, not a per file hash.
Then you can check any given file id against this filter efficiently,
and you can afford a larger hash.
Also, if the file id is based on current cryptographic hashes, the
actual hash functions can be implemented cheaply:
Say you pick 4 bits (you probably need more) and a 256-bit filter.
Then the first four bytes of a file-id hash can each index a bit in
this filter.
The double checking would be a forward routing search from the common
ancestor to the target revision.
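That scheme can be sketched as follows (sizes illustrative; note that if each of the first four bytes of the id indexes a bit directly, the filter holds 256 bits, so a larger filter would need slightly wider slices of the id):

```python
FILTER_BITS = 256   # one byte of the file id addresses one bit; use wider
NUM_PROBES = 4      # slices (e.g. 11 bits) if you want a larger filter

def bit_positions(file_id: bytes):
    # The file id is already a cryptographic hash (SHA-1 in monotone),
    # so its leading bytes serve directly as cheap, "independent" hash
    # functions; no extra hashing is needed.
    for i in range(NUM_PROBES):
        yield file_id[i] % FILTER_BITS

def bloom_add(filt: bytearray, file_id: bytes) -> None:
    for pos in bit_positions(file_id):
        filt[pos // 8] |= 1 << (pos % 8)

def bloom_maybe_contains(filt: bytearray, file_id: bytes) -> bool:
    # False means definitely absent; True still needs the double check
    # (the forward routing search mentioned above).
    return all(filt[pos // 8] & (1 << (pos % 8))
               for pos in bit_positions(file_id))
```

A revision's filter would be built once by `bloom_add`-ing every file id in its history, then queried per candidate id.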

If each revision also stores a slower but precise complete-history
index, for example patricia tries storing all file ids in past revision
history, this check can be done without such a forward search.
Patricia tries might actually be fast enough to drop bloom filters. In
either case, note that we must store the complete history, not just
current file ids, so the index must be compact (I have a reference to a
paper on compact, disk-efficient patricia tries if anyone is
interested).
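As a stand-in illustration (a plain, uncompressed trie; a real implementation would path-compress single-child chains, patricia-style, which is what buys the compactness and disk efficiency):

```python
def trie_insert(root: dict, file_id: bytes) -> None:
    # One dict level per byte of the id; a patricia trie would merge
    # runs of single-child nodes into one labelled edge.
    node = root
    for b in file_id:
        node = node.setdefault(b, {})
    node[None] = True  # terminal marker (ids are fixed length; kept for clarity)

def trie_contains(root: dict, file_id: bytes) -> bool:
    # Exact membership: no false positives, unlike the bloom filter.
    node = root
    for b in file_id:
        if b not in node:
            return False
        node = node[b]
    return None in node
```

Inserting every past file id of a revision's history gives exact membership answers, at the space cost the patricia encoding is meant to address.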

If such an index exactly maps past history to each current file, we
have solved the forward-from-ancestor search problem (this is not
unlike the suture multi-node proposal, I guess, since it is a related
kind of index).


>
>  I must say I'm a little unclear about your other suggestions.  I can see
> that they might work, but would seem to be horribly inefficient compared to
> what monotone currently has.

That could be the case, and I don't have all the answers. I don't think
the common cases would be slow.
The main concern is code complexity and avoiding exponential explosion
in border cases.

> I may be wrong about that, but I think you'd
> need to show your method efficient (less than a large constant times O(n)
> for matching n node-ids between two revisions, and less than O(nr) space
> where n is the number of node ids and r is the number of revisions being
> stored).

I don't think these requirements are entirely fair. You need to look
at the worst case, which could acceptably be worse if it is rare, but
exponential is not acceptable (as Darcs has shown).
The amortized average case is going to be low:
assume you change 10% of the files between revisions. Then 90% of the
file ids will immediately exist in the target revision. These can be
resolved through simple indexing, which is probably already done today.
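That two-phase fast path can be sketched as follows (hypothetical helper names; `slow_route` stands for whatever history search handles the leftover ids):

```python
def match_file_ids(source_ids, target_ids, slow_route):
    """Two-phase matching between a merge source and target: ids that are
    unchanged between revisions resolve by direct lookup; only the rest
    need the (more expensive) history routing search."""
    target = set(target_ids)
    matched, unresolved = {}, []
    for fid in source_ids:
        if fid in target:            # typically ~90% of files
            matched[fid] = fid
        else:
            unresolved.append(fid)   # sutured or otherwise re-identified files
    for fid in unresolved:
        matched[fid] = slow_route(fid)  # may return None: no valid path
    return matched
```

Only the `unresolved` list ever touches the routing machinery, which is why the amortized cost stays low.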

The remaining 10%: normally all of these will have a simple history
where we can optimise by not changing the file-id. The interesting case
is worst-case behaviour, where all files have been through some suture
forcing new file ids. And by extension, another interesting case is
where all files differ between source and target (a new version of a
vendor tree).
If we index all past node-ids in a revision history, we can cheaply
cut branches not leading to our target revision. We can also avoid
following multiple paths to the target revision, because we only need
one of the paths once we know both lead to the revision. I'm not
even sure we need to search the path at all with such an index.

The complexity of files whose node ids have to change (through suture)
is related to project activity, not the total number of revisions.
As an example, assume 10-30 revisions back to the common ancestor from
the merge source (one developer's activity) and perhaps 100-200
revisions forward to the merge target revision (all developers).
This is not exactly true, as we also have to eliminate paths not going
between source and target, and we can have multiple paths, which is
exponential if searched naively.

So the remaining problem is: for the special case of changed files
that go through non-trivial history, can we find an acceptably fast
routing algorithm?
An extra problem is: even if we do find paths efficiently, can we
efficiently decide that there is no valid path, for example when a new
file has been added?
Again, if a revision remembers all past node-ids for all files, these
special cases can be answered efficiently.

Having gone through this, the multi-node disjoint-set suture solution
is probably an efficient way to establish the routing path in my
proposal, but it is not the full solution:

The mapping between solutions goes like this:

Each file revision gets a new node-id, but in many common cases we do
not need to explicitly create a new id, which simplifies indexing.
For each file in a revision, we must be able to efficiently identify
the immediate predecessor file-ids, and also all past file-ids.
The index of past ids must be organized such that, given an arbitrary
file id, the revision index (over all files in the revision) can tell
us which current files have a historic or current association with that
id.

If a sutured file stores multiple node-ids, and if we compress most
file-ids to a single id, and if we see non-sutured files as having a
set containing a single compressed node-id, then we have exactly the
file-ids needed in the revision index above. But it still requires
efficient indexing in the revision, so we don't have to ask every
single file whether it knows about a node-id.
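Such a revision index can be sketched as a plain mapping (illustrative only; a real implementation would sit behind the compact on-disk structures discussed earlier):

```python
def build_revision_index(files):
    """files: current file-id -> iterable of that file's past ids (for a
    sutured file, the union of the ids from all its merged histories).
    Returns: any current or past id -> set of current file-ids that have
    a historic or current association with it."""
    index = {}
    for current_id, past_ids in files.items():
        for fid in {current_id, *past_ids}:
            index.setdefault(fid, set()).add(current_id)
    return index
```

A lookup `index.get(some_id, set())` then answers, without asking every file, which current files know about a given node-id.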

But as I said, this is not the full solution:
The sutured file needs to have a new node-id; it can't just inherit a
random immediate ancestor's id, since this would treat the other
histories unfairly.
Also, in another branch we can have node-ids that will not occur in any
suture file-id yet still have an association with one of these files.

So for any such file-ids we cannot find, we must see if any past
file-ids match. The search can be costly, since for each file id in the
merge source not found in the above index for the merge target, we
must go back in time for each file. And since a file can have multiple
past ids, we cannot trivially stop a back-trace at the common ancestor:
we are also looking at other histories until we find a match.

The trivial solution is not to give a suture a new id, but to reuse one
ancestor's id. But this is completely arbitrary and will give
undesirable results, so doing it is a performance trade-off.
As I understand it, this is the solution the monotone community is
currently considering.

Having identified this back-search as being problematic, what can be done?

Given a proper index, both revisions immediately know all their past
history node-ids.
We can intersect these two indices efficiently using set intersection
algorithms.
The resulting index is exactly those file id's in a common history.
Each entry of a revision index (before intersection) points to
one or more current file-ids, such that given a current or past file-id
x, the index gives all current file-ids depending on x.

The intersection will merge the current file-ids of both revisions in
the resulting revision index.
We now have a set of index entries that are relations between file-ids
in the two revisions.
The same current file-id can appear in multiple index entries, so we
must cluster these entries further.
Once this is done, we can associate current file-ids between the two
revisions, and we can also find common ancestors if the index is
constructed properly.
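Under these assumptions, the intersection-and-clustering step might look like this (illustrative Python, not monotone code; `index_a` and `index_b` are revision indices mapping any current or past file-id to the associated current file-ids on their own side):

```python
def associate_files(index_a, index_b):
    """Intersect two revision indices on shared historical file-ids, then
    cluster entries that mention the same current file-id.
    Returns a list of (current ids in A, current ids in B) clusters."""
    shared = index_a.keys() & index_b.keys()  # ids in a common history

    # Union-find over tagged current ids; the "A"/"B" tag keeps the
    # two revisions' id spaces apart.
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        if parent[v] != v:
            parent[v] = find(parent[v])  # path compression
        return parent[v]

    def union(a, b):
        parent[find(a)] = find(b)

    # Each shared historical id relates some current ids in A to some in B.
    for x in shared:
        members = [("A", i) for i in index_a[x]] + [("B", i) for i in index_b[x]]
        for m in members[1:]:
            union(members[0], m)

    # Collect the connected components as (A-side, B-side) id sets.
    clusters = {}
    for side, i in parent:
        a_ids, b_ids = clusters.setdefault(find((side, i)), (set(), set()))
        (a_ids if side == "A" else b_ids).add(i)
    return list(clusters.values())
```

Each resulting cluster is one group of current file-ids, across both revisions, that share some history and therefore need a common merge decision.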


Mikkel



