gnu-arch-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gnu-arch-users] Re: situations where cached revisions are not so go


From: Tom Lord
Subject: Re: [Gnu-arch-users] Re: situations where cached revisions are not so good
Date: Sat, 27 Sep 2003 20:03:28 -0700 (PDT)




    > From: Jason McCarty <address@hidden>


    > As to your design questions, Here's a simple proposal:

    >   Summary deltas will be stored in the same directory as their target
    >   revision (i.e. a summary of patch-32..57 would go in patch-57's
    >   directory). 

Just as a word-choice issue, let's call patch-32 the ORIG revision and
patch-57 the MOD revision.  The TARGET is a tree to which we might
apply this summary delta (usually (but not always) equal to ORIG,
presumably).


    >   I don't make use of cachedrevs, but they could be
    >   integrated if desired (for example if a revision is really huge, a
    >   cachedrev might be better than summaries).
    > 
    >   Implementation of arch_build_revision(ALMOST, TARGET, DESTDIR):
    >   1) If ALMOST is nil:
    >      2) Try to find a pristine or libraryrev ALMOST which is the
    >         greatest lower bound of TARGET, with the same version as TARGET.
    >      Else:
    >      3) Set ALMOST to TARGET's base-0.

I sort of follow you here.   `versions' (in the arch sense of logical
development lines) don't come into play directly.   We have to talk
about `ancestry'.

And, if you don't mind, let me refer to what you call "TARGET" as
"GOAL".   The GOAL is the revision we're ultimately trying to build.

The GOAL revision has an ancestry.  The ancestry is a list of earlier
revisions.   If GOAL was created by `commit', then the immediate
ancestor of GOAL is one patch-level down.   If GOAL was created by
`tag', then the immediate ancestor is whatever was tagged.  If GOAL
was created by `import', then it has no immediate ancestor.

We don't know, a priori, the entire ancestry of GOAL.   To find it
out, we have to walk it backwards, querying an archive each step of
the way.

So, we can't pick the closest ancester ALMOST without some server
round-trips here -- but that's OK, because we'll need to make those
very same round-trips no matter what.

Ok: ALMOST is now the closest ancestor of GOAL for which we have
either a pristine or revlib copy (or is the oldest ancestor).  (This
differs from the current build_rev in which ALMOST would be the
closest ancestor for which we have a pristine or a revlib copy or find
an archive cached copy -- the presumption being that in the common
case, archive-cached revisions are _local_; and in the other common
case, we don't have any pristines or revlib copies further back.
Miles is, in essense, griping about _his_ common case which is
different from those two.)

There is a slight problem with your approach.   We can't count on
access to the oldest ancestor.  It may very well be in some archive
that we don't have access to.   We pretty much _have_ to be willing to
set ALMOST to an archive-cached revision.

Another problem is that the oldest ancestor may be very far away
compared to the nearest archive-cached ancestor -- another reason we
can not ignore archive cached revisions.

(Oh, and, just as a point of amusing trivia:  unpacking a _locally_
archive-cached revision is notably faster than copying that revision
from a revision library.)



    >   4) If ALMOST == TARGET, copy ALMOST to DESTDIR and return (may have to
    >      call arch_build_revision if it's a continuation, or connect to
    >      archive to grab import).

Sure, if ALMOST is the GOAL and we've managed to find a copy of ALMOST
-- we're done.


    >   5) Connect to TARGET's archive.
    >   6) Find a DELTA to TARGET whose FROM endpoint is the least upper bound
    >      of ALMOST (this will be either a revision or a summary delta in
    >      TARGET's directory). Continuations count also; let FROM be the
    >      revision it's a continuation of. Use a continuation from another
    >      version if it's all that's there.
    >   7) If DELTA is a continuation from a different version than TARGET:
    >      8) Call arch_build_revision(0, FROM, DESTDIR).
    >      Else call arch_build_revision(ALMOST, FROM, DESTDIR)
    >   9) Retrieve and apply DELTA to DESTDIR.

Ok, assuming that I follow you:   the oversimplification in what
you're saying is that you can not ignore archive cached revisions.
You can't count on searching way far back in ancestry.   Searching
back in ancestry requires one round-trip per step.   If the summary
delta you find is too far back relative to the next nearest cachedrev,
this is a pessimizing algorithm.   etc.

It's a graph searching problem.   We have the ancestry graph which is
just a linear, directed graph.   Now you're adding these summary-delta
links to the ancestry graph so we have things like:


   GOAL --> GOAL-ancestor --> GOAL-ancestor^2 .... -> GOAL-ancestor^N ...
               \                               ^
                \                             /
                 `---------------------------'

Each node on the graph is a fully qualified revision name.   For a
given revision name, we can cheaply ask: "do I have a pristine?" and
"do I have a rev lib entry?"

For the cost of a server roundtrip, we can ask: "is it archive
cached?"  and "is there a summary delta here" and "what is the name of
the next ancestor?"

Initially, we know only "GOAL".

We can build GOAL by walking backwards along this graph until we find
a sufficient collection of bits and pieces to assemble GOAL.   With
summary-deltas, and a more complicated build_rev algorithm, now we
have multiple paths we might want to explore.

One difficulty of your algorithm, if I'm following it correctly, is
these "least upper bound" computations -- they assume more knowledge
of the ancestry graph than we have on hand.

It's not an insurmountable difficulty, but it's not a trivial problem
to solve.   Certainly we can figure out _part_ of the ancestry graph
from the revlib if we have one.   Possibly we can add additional foo
to archives so that we can retrieve large parts of the ancestry in a
single round-trip.

Your algorithm isn't _wrong_, by a long stretch -- it's just missing
consideration of archive-cached revisions and missing a bunch of
important details.  (So, keep going?   Maybe this is a good time at
which to start playing with the code.)




    > I think that transitive deltas would probably be most effective
    > regardless of the algorithm.

I think we can talk about what to prove and how to define "optimal"
later -- it's hard and hairy -- your intuitions sound plausible to me,
too, and its just that more details are needed to make them
persuasive.  No need to worry too much about proving theory, yet.


    > > * what strategies might a smart server want to employ to manage 
    > >   summary deltas?   Feedback from access patterns?  Any-one-on-demand?
    > >   Algorithmic placement based on the length of the ancestry chain?
    > >   or on the sizes of various changes and trees?

    > It would probably insert more summaries where revisions are accessed a
    > lot, and remove some where revisions are accessed rarely (although it
    > may actually be faster to build a summary on the spot in some cases if
    > someone wants a rare revision). I would suggest placing them in order to
    > minimize the time it takes to build revisions (dependent on size and
    > more strongly on chain length right now).

That's interesting.


    > No hints are necessary; summaries will be treated much like cachedrevs.

I don't think so... hopefully my comments on the algorithm help
explain why.

    > As you can see from my algorithm, the amount of extra code is not that
    > large. 

I don't see that yet.   Might be true -- but I don't see it yet.

-t





reply via email to

[Prev in Thread] Current Thread [Next in Thread]