gnu-arch-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gnu-arch-users] [BUG] FEATURE PLANS: "perfect" summary deltas


From: Tom Lord
Subject: Re: [Gnu-arch-users] [BUG] FEATURE PLANS: "perfect" summary deltas
Date: Fri, 9 Jul 2004 09:40:59 -0700 (PDT)


    > From: Matthew Dempsky <address@hidden>

    > Tom Lord <address@hidden> writes:

    > > [ summary delta description ]

    > Out of curiousity, I decided to test out how well this idea works with
    > a copy of Tom's tla--devo--1.2 tree.

Neat.  Thanks.

    > The test was conducted by adding cacherevs to every revision in
    > tla--devo--1.2 along with conducting a delta between each basis-spine
    > and spine-leaf pair and then tar'ing the result.

    > The cacherevs ranged in size from about 1.4 to 1.6 MB, however, none
    > of the summary patches exceeded 200 kB (well under 1/4th of even the
    > smallest cacherevs).  In fact, only 28 of the 115 patches weighed in
    > over 100 kB.  Therefor, the entire version fits in a single phase,
    > giving a very nice alternative to cacherevs if even a single revision
    > is available.

Good news.

    > As to the builder algorithm, why not simply use Dijkstra's shortest
    > path algorithm using patch size as edge weight and an edge anywhere a
    > patch or patch summary exists?  (Also a starting node representing an
    > empty directory, so zero-weight edges between it and revlib'd or
    > pristine revisions and cacherev-weight edges between it and
    > cacherevs'd revisions.)  It would be general enough to still meet
    > Aaron's hypothetical smart server, while simple enough to fulfill
    > Tom's no-more-complicated-than-necessary goal.

    > This actually seems pretty similar to the algorithms described, but it
    > seems to differ that a) I actually called it Dijkstra's shortest path,
    > and b) I'm not sure if the other described algorithms would allow
    > mixing regular patches and summary patches.

Well....

There are three main points to the "perfect" summary delta proposal:

1. We always construct graphs having the same topology.   We don't
   need Dijkstra's algorithm in its general form: we can specialize it 
   for this topology.    

   Why does this matter?  Because in this graph, given just a node,
   _if_we_don't_know_the_graph_toplogy_, requires querying the
   archive.  We want to minimize such roundrtrips and we want to
   minimize the bandwidth consumed by each edge-querying roundtrip.

   If we _do_ know the graph topology (and it has some handy
   properties) then we can minimize the edge queries (to about 4, in
   this case).   You don't need a fully general shortest path
   algorithm.

   aaron's alternative idea (the "separate delta directory") also aims
   to minimize roundtrips: just get a listing of that one directory
   and now you know a big chunk of the graph.  First: I think it would
   interact poorly with smart servers because a smart server may be
   willing to offer up a complete graph of deltas (so, while we only
   have one roundtrip, the bandwidth can start to look pretty
   interesting in large versions).  Second: I think it would interact
   poorly with smart servers because it requires smart servers to
   eagerly describe what deltas are available rather than seeing if,
   upon demand for a specific delta, it's handy to provide it.


2. You're right that delta and cacherev sizes give you weights for
   the edges of graphs.

   Alas, our least-common-denominator dumb-file-system server model
   doesn't have any way to determine the size of a delta or cacherev
   other than by fetching it and measuring it.

   On the other hand, at the time that you _store_ a delta or cacherev
   in an archive, you can measure its size _without_ fetching it
   (since you've got it right there).

   So, absent modifying the archive format to record sizes (see
   below), why not split the shortest path algorithm into two
   asynchronous parts: writing clients evaluate edge weights and use
   that to decide which variation of a fixed-topology graph to build;
   reading clients assume that all edges on the graph have weight 1
   because they trust that the writing clients would not have 
   created those edges if that were not a reasonable assumption.


Putting 1 and 2 together (and ignoring some of the later parts of the
thread between abentley and i where we talked about variations on the
idea): yes, Dijkstra's algorithm is there but more as a "ghost" than a
literal rendition;  it's a more specialized algorithm that gives the
same answer Dijkstra's would but for a limited class of graphs.   Part
of the general algorithm is being computed at write time;  part of it
at read time.


3. Do we or do we not muck with the archive format?

   Actually, I shouldn't say "archive format".

   Do we or do we not muck with the archive _abstraction_ because,
   with a few exceptions (like checksums and signatures) when we
   change the archive format, we're implying a change to the archive
   abstraction.

   Without some fancy footwork, some of Abentley's ideas would change
   the archive abstraction in some deep ways that impact things such
   as what smart servers can do.   I see a negative impact in the 
   particulars.

   The "perfect" summary delta doesn't change the archive abstraction
   at all.  If tla hadn't dropped support for "commit --base"
   revisions, you'd be able to trick the builder into using summary
   deltas without any changes at all to tla and the only changes we'd
   need to make would be to make the interface to summaries more
   convenient.

Of course, the discussion of the elegence (imo) of "perfect" summary
deltas is hampered by three things:

A. I dropped "commit --base" (aka "commit --tag") in tla and
   therefore, the current builder knows nothing about it.

   So it _looks_ like we're adding something that requires new
   work on builder algorithms but, at core, it really only requires
   finishing the port from shell-script larch to C tla in this
   one, otherwise obscure area.


B. The discussion is concurrent with Abentley's work on the
   backwards-builder and so there's plenty of opportunity
   for discussion to get hung up on silly details about the
   impact on that work, making the "perfect" summary delta 
   idea sound weirder than it really is.

C. Your experiment is great work and very comforting.

   Something that is absolutely not a priority but that might help
   shed some light at some point is an empirically supported
   characterization of "typical" changerates and change-natures and
   their determining variables, combined with analysis about how
   effective "perfect" summaries (or any alternative) is predicted to
   be.  (For now, this not being rocket science, and especially given
   checks such as yours, I trust my intuition filtered through
   feedback from others thinking about the same topic.)
        

-t





reply via email to

[Prev in Thread] Current Thread [Next in Thread]