From: Tom Lord
Subject: Re: [Gnu-arch-users] situations where cached revisions are not so good
Date: Wed, 24 Sep 2003 11:27:49 -0700 (PDT)


    > From: Miles Bader <address@hidden>

    > [archive caching revisions doesn't always have the 
    >  right performance characteristics.
    > 
    >  Say, how about skip-deltas?]

You got a bunch of good answers in this thread already -- this one
is slightly redundant but hopefully useful.    

Short-answer-that-works-today: don't balk at using mirrors here,
especially if you can afford a revision library.  That gives an almost
optimal solution at the cost of some client-side disk space.  An Emacs
revlib is likely to be huge, and if you don't have the latest disk
hardware client-side, we can talk about how to manage this until you
do.

Since other people are pretty on top of the short-answer already, let
me try to sum up some of the scattered design discussions we've had
over the past few months, in which context skip-deltas come into play.
This is long, but yes, really, your idea of skip-deltas is directly
analyzed in it.

     contents:

        * Problem Definition: build_rev and delta

        * Problem Analysis

        ** How delta May be Implemented
        ** How build_rev May be Implemented
        ** Referential Transparency, Caching, Recursion and Search

        * Solution-space Analysis
  
        ** Are Good Taste and Complex Heuristic Searching Compatible?

        ** Server-side Issues
        *** Namespace Extensions
        *** Server-side Computation vs. Server-side Storage
        *** Capability Negotiation and Heuristic Guides
        *** Round-trips Issues
        *** Transactional Issues

        ** Client-side Issues
        *** Uniform Interfaces 
        *** Scheduling Client-side Computation
        *** Resource Management and Locuses of Administration

        ** Issues That Transcend "client/server"

        ** Specific Hacks
        *** Server-side Skip-deltas
        *** On-demand Cached Revisions

        * Strategic Analysis: How to Pick Solutions

        * Postscript: What About Miles' skip-delta Idea?


* Problem Definition: build_rev and delta

  Two low-level operations are of interest here:

      build_rev (REV, DEST)

        Create a client-side tree, DEST, which is a copy of
        revision REV.


      delta (REVA, REVB, DEST)

        Create a client-side directory, DEST, which is a changeset
        describing the differences between REVA and REVB


  These are the operations whose performance characteristics (disk
  space, I/O bandwidth use on the bus, I/O bandwidth use on the net,
  network roundtrips count, OS cache impact, transactional bottlenecks
  and costs, and CPU use (at least)) are of interest.  We're
  interested in both client and server side costs.

  And our performance goals are mostly to be able to make a wide range
  of trade-offs.  The ideal performance is different for someone
  behind a slow network connection than someone behind a very fast
  one.  The server performance ideal is different for someone running
  Savannah than for someone running the source code repository for
  Solaris development at Sun.  We want the flexibility to address the
  needs of all four of those users well, and the grace to address them
  all in a seamless way that preserves interoperation among them.

  We place no a priori limits (other than good taste) on what can be
  done to work on the performance of build_rev and delta.  New
  client-side data structures, smart servers.... anything is fair game.

  We'll assume that the foundational database provided as input
  to build_rev and delta is, for all intents and purposes, an arch
  archive in the current format, with no cached revisions.   This
  doesn't have to be literally true -- the archive could be stored in
  an SQL database or Berkeley DB or whatever -- the changesets could
  be computed on demand -- whatever.   The essential assumption,
  though, is that the raw data is indexed as it is in today's archives
  and accessible with comparable costs.

  So, roughly speaking, add whatever you'd like -- make build_rev and
  delta go fast.   What's the right thing to add, and how do we do
  that with grace and good taste?


* Problem Analysis

** How delta May be Implemented

  There are two basic ways to compute `delta':  brute force and fancy.

  The brute force technique uses build_rev:

        delta (REVA, REVB, DEST)
        {
          build_rev (REVA, TMPA);
          build_rev (REVB, TMPB);
          mkpatch (TMPA, TMPB, DEST);
        }

  In some cases, such as if REVA and REVB are unrelated revisions,
  the brute force technique is the only one available.

  In other situations, fancier techniques apply.

  For example, let's suppose that REVA is an ancestor of REVB.
  In that case, there is a sequence of revisions, each defined 
  by a changeset:

        REVA
        REVA+1          has: changeset (REVA+1)
        REVA+2           "   changeset (REVA+2)
        ...                  ...
        REVB            has: changeset (REVB)

  We know that each of these changesets is an exact patch against the
  immediately preceding ancestor.   As such, they can be "composed"
  reliably.    So in this special case, we could compute delta:

        delta (REVA, REVB, DEST)
        {
          DEST = changeset (REVA+1);

          DEST += changeset (REVA+2);
          DEST += changeset (REVA+3);
          ...

          DEST += changeset (REVB);
        }

  Just as we know that in some cases changesets can be composed, 
  we also know that some inverses of changesets can be composed.
  We could also compute delta this way:

        delta (REVA, REVB, DEST)
        {
          delta (REVA, REVB+1, DEST);

          DEST -= changeset (REVB+1);
        }

  etc.

  In short, the "fancy technique" of computing delta means treating
  the revisions and their changesets as a graph and finding a
  "shortest path" on that graph, in a search where edge traversals
  correspond to things such as changeset composition or calls to the
  brute-force delta technique.
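
  To make that search concrete, here is a minimal sketch in Python
  (not tla code; `edges', `edge_cost', and the plan representation
  are assumptions of mine) of delta as a shortest-path problem over
  the revision graph:

        import heapq

        def fancy_delta(rev_a, rev_b, edges, edge_cost):
            """Search for a cheapest sequence of changeset
            applications turning rev_a into rev_b.

            edges maps a revision to (neighbor, cset, sign)
            triples: sign is '+' to apply cset, '-' to apply its
            inverse, mirroring the DEST += / DEST -= steps above.
            edge_cost estimates the expense of one application."""
            frontier = [(0, rev_a, [])]
            visited = set()
            while frontier:
                cost, rev, plan = heapq.heappop(frontier)
                if rev == rev_b:
                    return plan         # e.g. [('+', cset1), ('-', cset2)]
                if rev in visited:
                    continue
                visited.add(rev)
                for neighbor, cset, sign in edges.get(rev, []):
                    heapq.heappush(frontier,
                                   (cost + edge_cost(cset), neighbor,
                                    plan + [(sign, cset)]))
            return None                 # unrelated revisions: brute force only

  A real implementation would also treat caches and memos as extra
  edges in this graph; the sections below are about exactly that.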


** How build_rev May be Implemented

  Build_rev is largely the same problem as delta.   There's an 
  obvious brute force technique (unpack the initial import ancestor,
  apply all changesets since):

        build_rev (REV, DEST)
        {
          R = import_ancestor_of (REV);

          DEST = unpack_import_tree (R);

          apply_changeset (DEST, changeset (R+1));
          apply_changeset (DEST, changeset (R+2));
          ...
          apply_changeset (DEST, changeset (REV));
        }

  but more generally, we're right back to delta, with a recursive
  definition like this:

        build_rev (REV, DEST)
        {
          R = pick_any_revision_in_the_world;

          build_rev (R, DEST);

          delta (R, REV, TMP);

          apply_changeset (DEST, TMP);
        }

  
** Referential Transparency, Caching, Recursion and Search

  build_rev and delta have referential transparency:   Given the
  same revision-name arguments, they always produce equal outputs.

  In arch, the revision name arguments exist in a global namespace -- 
  in other words, the referential transparency crosses machine
  boundaries.   It doesn't matter, for example, whether the answer is
  computed client-side or server-side.

  Another useful consequence of referential transparency is that the work
  done by build_rev and delta can be memoized or cached.   Pristine
  trees are an example of caching the work done by build_rev.
  Revision libraries are an example of memoizing the work done by
  build_rev.   Archive-cached revisions are another example of
  memoizing work done by build_rev.

  Managing the performance of build_rev and delta then reduces to
  regarding the recursive form of those operations as an opportunity
  for search.   The "optimal" implementation searches the possible
  recursion paths, looking for the shortest path from the raw
  archives and any caches and memos to the desired result.

  Such a search is conceptually simple, but there are some
  difficulties in practice.  

  One difficulty is knowing the costs of things.  In a given
  environment, which is cheaper: fetching and unpacking an
  archive-cached revision?  Or copying from a revision library and
  applying 12 changesets?

  Another difficulty is uniformity of access.  Currently, three forms
  of cache/memo for build_rev each have their own distinct interface:
  pristine trees, archive-cached revisions, and revision libraries.   
  The current implementation of build_rev performs a very crude 
  heuristic search -- using each of those interfaces separately.
  Is a more uniform interface desirable and practical?   What form
  might it take?
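
  One possible shape for such an interface, sketched in Python (the
  `RevisionSource' class and its find() method are my invention, not
  an existing tla interface): every cache or memo answers the same
  query, and build_rev consults them all before recursing:

        import shutil

        class RevisionSource:
            """Hypothetical common interface: pristine trees,
            revision libraries, and archive-cached revisions would
            each implement find() over their own storage."""
            def find(self, rev):
                """Return a directory holding revision rev, or
                None if this source cannot supply one."""
                raise NotImplementedError

        def build_rev(rev, dest, sources, brute_force):
            """Copy revision rev into dest, preferring any cache
            or memo that already has the answer."""
            for source in sources:
                tree = source.find(rev)
                if tree is not None:
                    shutil.copytree(tree, dest)  # reuse the memoized tree
                    return
            brute_force(rev, dest)  # unpack import ancestor, apply changesets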



* Solution-space Analysis

** Are Good Taste and Complex Heuristic Searching Compatible?

  There's at least one thing we should be suspicious of from the
  get-go.

  Currently, tla has a delta which is brute-force and a build_rev
  which is a very simplistic heuristic search (about 250 lines of
  code, could be described in perhaps 10 lines of pseudo-code).

  This simple-minded implementation has a few consequences:

        1) it's easy to understand

        2) it does _not_ automagically give optimal performance

        3) it's easy to get very good performance from it, in any
           given situation, by managing mirrors, cached revisions, 
           and a revision library

        4) it's simple enough to be reimplemented from scratch
           very quickly, and such a reimplementation will have
           comparable performance characteristics

        5) it's quite maintainable and has small source and object 
           code size


  Of those points, only (2) seems to me to be any kind of drawback,
  and that drawback is compensated for by (1) and (3).

  If we were to make build_rev or delta much fancier, virtues (1),
  (3), (4), and (5) could easily be lost in the shuffle.

  At the very least, as we explore the solution space, I think we
  should keep in mind the goal that implementations like the current
  one should remain "first class citizens" --- quite acceptable
  options.  It's ok to tweak it a bit -- but the tweaks should be
  elegant.



** Server-side Issues

*** Namespace Extensions

  Earlier we observed that build_rev and delta display referential
  transparency.   They share that property with some other arch
  operations which aren't immediately under consideration here.

  Ultimately, the referential transparency properties of many arch
  operations follow from the existence of a global namespace of
  revisions and the referentially closed nature of arch operations
  (e.g., operations such as `get' are defined in a deterministic way
  in terms of things named in the global namespace -- their result
  does not, for example, depend in any important way on a per-user
  preference setting.  By being deterministic and being defined over
  other referentially transparent names, they themselves inherit
  referential transparency.)

  As we examine the solution space for build_rev and delta, we ought
  to keep in mind the idea of making controlled extensions to the
  global arch namespace.

  To try to make that a little bit more explicit, let me point out a
  case in which the principle is violated and show how that might be
  corrected:

  Archive-cached revisions are not part of the global arch namespace.
  There is no arch name which denotes a possibly existing
  archive-cached revision.   Such a name _might_ look like:

    address@hidden/tla--devo--1.1--patch-42:CACHED

  but we don't have such names.  Instead of having such names, the
  idea of an archive-cached revision is built-in as a primitive
  concept of "what is the nature of a revision".  It's not a name
  whose binding you can query -- it's a binary flag that's
  returned from the `revision_type' method which all archive
  implementations must provide.

  That could have, and _arguably_ should have, been different.
  The archive vtable might have had a general purpose namespace
  query operation:

        resource_exists? (NAME)

  which if passed a "...:CACHED" name such as above would answer yes
  or no.

  _If_ we wind up thinking that it's a good idea to extend the "set of
  things a server might provide" -- such as if we want skip-deltas or
  smart-servers -- then it is a good idea to at least consider
  generalizing the namespace in order to name these "things a server
  might provide" in a systematic way.

  Perhaps, for example, I can ask if my server can produce a cached
  `delta' result by asking:

        resource_exists? ( "delta(revA,revB)" )

  A very good reason to consider extending the namespace that way is
  the implications for _dumb_ servers.   The next section talks about
  that.


*** Server-side Computation vs. Server-side Storage

  It is a serious virtue of arch that a dumb server -- essentially
  just a remote filesystem of limited capability -- is an adequate
  server.  

  That doesn't mean that smart servers should be precluded by the
  design -- but it does suggest something about the _interface_ to a
  smart server.

  As noted, referential transparency of the results of many
  conceivable queries has many benefits.   I think it a reasonable
  principle to push as far as it can be pushed: even with a
  smart server, queries should be referentially transparent.

  A dumb server can answer referentially transparent queries in a
  useful way as well.   The dumb server doesn't compute the results,
  true -- but the results of a query can be stored in a dumb server by a
  client, for the benefit of other clients and later queries.

  This gets back to the idea of extending the namespace:

  To a smart-server, a query like:

        resource_exists? ( "delta(revA,revB)" )

  may mean "Say, are you willing to compute that delta for me?"

  To a dumb server, the same query may mean "Say, did anybody store a
  copy of that result there?"

  Archive-cached revisions are another example of the application of
  this idea.   On a dumb server, a revision is archive-cached if
  somebody stored it there.   On a smart server, it might be
  reasonable to compute any desired archive-cached revision on-demand.
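
  To make the duality concrete, here is a minimal Python sketch of
  the dumb-server reading of such a query (the `=resources' path
  layout is invented for illustration):

        import os

        def resource_exists(archive_root, name):
            """Dumb-server answer: a resource "exists" exactly when
            some client previously stored it at a well-known spot,
            e.g. name = "delta(revA,revB)".  A smart server could
            answer the same query by offering to compute the result
            instead."""
            return os.path.exists(os.path.join(archive_root,
                                               "=resources", name))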


*** Capability Negotiation and Heuristic Guides

  The suggested `resource_exists?' query shown above is a trivial
  example of capability negotiation.

  Suppose it turns out, as we make build_rev and delta into very fancy 
  algorithms, that clients will sometimes want to make many, many
  queries using `resource_exists?'.

  Suppose further that for a particular server, we know in advance
  that all such queries (say, if the resource being asked for is a 
  "delta(X,Y)" resource) will be answered "No.".   

  Then the client could save many round-trips to the server if it
  could ask up front "are you able to answer _any_ delta(X,Y)
  queries"?
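
  Reusing the extended namespace, that up-front question could itself
  be one resource_exists? probe against a well-known name (a sketch;
  the `capability(delta)' name is my invention):

        def supports_delta_queries(server):
            """Hypothetical capability probe: one round-trip now
            can save one resource_exists? probe per candidate delta
            later.  On a dumb server the probe simply finds nothing
            stored, which is the right answer."""
            return server.resource_exists("capability(delta)")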

  So, something else to keep in mind as we contemplate extending the
  "set of things a server might provide" is the potential need for
  capability negotiation.  Because:


*** Round-trips Issues

  Each new little thing you stick server-side, that delta and
  build_rev might use, has to be considered in light of the new server
  round-trips it will impose.

  It would be useless for implementations to minimize, say, the amount
  of patching needed or the amount of network bandwidth needed for
  build_rev if, at the same time, that optimization required 100s of
  server round-trips to compute.


*** Transactional Issues

  Dumb servers are remote file systems of limited capability.

  One consequence of that is that they are very limited in their
  transactional capabilities.

  The ".listing" files mechanism illustrates the problem.  It is a
  practical impossibility to guarantee that ".listing" files are kept
  up-to-date: an ill-timed ^C to a client can spoil them, for example.

  It is a happy coincidence that ".listing" files can be harmlessly
  (in terms of archive integrity) out of date and that therefore a 
  simple mechanism like the `archive-fixup' mechanism can repair them
  on demand.

  Not every conceivable addition to what might be stored on a dumb 
  server has that happily coincidental property.

  Therefore, any proposed addition to "the set of things a server
  might provide" has to be evaluated with respect to these questions:

        * Can that new thing be stored on a dumb server without
          compromising archive integrity?

        * If not, then: Is it still reasonable to run a dumb server?

        * If not, and it is still reasonable to run a dumb server,
          then: are we _sure_ this new feature is really needed?



** Client-side Issues
  
  With pristine trees and revision libraries we already do some
  client-side caching and memoizing.

  As we talk about further optimizing build_rev and delta, the
  inevitable question arises:  should we expand the client-side
  mechanisms for caching, memoizing, and otherwise optimizing these
  operations?  For example, should there be client-side caches of 
  the results of delta?

  If so, new questions arise:


*** Uniform Interfaces

  In considering the server-side, we contemplated expanding the arch
  namespace to include not only revisions -- but also other values
  which are computed from revisions by operations such as `delta'.

  The same issue arises client-side.  

  Currently, we have those two rather ad-hoc mechanisms (rev libs,
  pristines).   If we're going to introduce new structures (such as
  delta-caches) then perhaps it's time to think about a more uniform
  approach to accessing and querying these client-side resources:  an
  interface based on an extension of the namespace and independent of
  any particular client-side mechanism.


*** Scheduling Client-side Computation

  A minor annoyance or interesting opportunity, depending on how you
  look at it, is the cost of updating client-side data structures that
  cache interesting information.

  As a simple example, I use a `commit' hook which updates my revision
  library.   Before `commit' exits, the newly committed revision is
  added to my library.

  On balance I've found that hook to be worth it -- but it's not
  without cost.   Each commit incurs the cost of building a new link
  tree and applying a patch to that tree.    If my project trees were
  much larger, that cost might start to exceed its value.

  A reasonable solution in this area would be to update the library in
  the background -- to orphan a process so that `commit' could exit
  quickly, but then let the orphaned process update the rev lib
  behind my back.

  As we talk about complicating the structure of client-side caches, 
  we ought to also consider being systematic about scheduling and 
  managing "background" computations.
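
  A sketch of that orphaning trick in Python (assuming the hook may
  shell out to `tla library-add'; the double fork is the standard
  Unix detach idiom):

        import os

        def update_revlib_in_background(revision):
            """Let commit exit quickly; an orphaned grandchild
            adds the new revision to the library behind our back."""
            pid = os.fork()
            if pid != 0:
                os.waitpid(pid, 0)  # reap the short-lived child
                return              # parent: commit finishes immediately
            os.setsid()             # child: detach from the terminal
            if os.fork() != 0:
                os._exit(0)         # child exits; grandchild is orphaned
            os.execvp("tla", ["tla", "library-add", revision])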


*** Resource Management and Locuses of Administration

  Like most current arch-users, I suspect, I manage my own machine.
  I administer my own revision library, and so forth.

  It isn't hard to imagine environments where that would be rather
  inconvenient.   Instead of me managing my own revision library,
  perhaps that could be done by a system admin who provides a revision
  library which my entire team NFS-mounts, for example.

  Once again, as we talk about hairing up the client-side data
  structures, we ought to consider who administers those resources,
  and how, and what that implies about the various interfaces we
  define.


** Issues That Transcend "client/server"

  Let's step back and look at the big picture again for a minute:

  We have these two operations: build_rev and delta.

  We're talking, generally, about ways to make them fancier, largely
  by providing caches, memos, and servers that act as partial oracles
  for those (essentially recursive) operations.

  In some sense, as a user, I don't care whether those oracles are
  client-side, server-side, or something else entirely -- I only care
  that they speed things up.

  And as an implementor of build_rev and delta: I reach the same
  conclusion.   If this area starts to get fancy, I don't want to
  write code like:

        if (is_it_in_the_rev_lib?)
          ...
        else if (is_it_in_a_pristine?)
          ...
        else if (is_it_archive_cached?)
          ...
        ....
        else
          recurse_and_do_it_the_brute_force_way;

  I want to write code like:

        if (is_there_a_shortcut?)
          ...
        else
          recurse_and_do_it_the_brute_force_way;

  As suggested by the item above, "Resource Management and Locuses of
  Administration" -- perhaps there is a need here for a new
  abstraction: not an arch client;  not an archive server;  but an
  arch short-cuts server (which may be part of the client environment,
  part of the archive server environment, or some different third thing).
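
  In that spirit, the short-cuts server might reduce to one tiny
  interface (a Python sketch; `ShortcutOracle' and `compute' are
  names I am inventing here):

        class ShortcutOracle:
            """Something that, asked for build_rev(REV) or
            delta(REVA, REVB), either produces the answer cheaply
            or declines.  It may live client-side, beside the
            archive, or somewhere else entirely."""
            def shortcut(self, query):
                """Return an answer for query, or None to decline."""
                return None

        def compute(query, oracles, brute_force):
            for oracle in oracles:
                answer = oracle.shortcut(query)
                if answer is not None:   # is_there_a_shortcut? -- yes
                    return answer
            return brute_force(query)    # no: recurse the brute-force way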


** Specific Hacks

  A very short catalog of two ideas that come around from time to
  time.

*** Server-side Skip-deltas

  The `update' operation effectively computes a delta between two
  revisions, one a direct ancestor of the other.

  The `get' operation does the same.

  The `star-merge' operation sometimes computes such a delta, and even
  when it doesn't, the delta it does compute could be described as the
  composition of a direct-ancestor/descendent delta with one other
  delta (which has a revision name and is stored directly in the
  base-level archive).

  All three operations benefit from optimizations to the delta
  operation, and all three could specifically benefit from
  optimizations that focus on deltas between direct ancestors and
  their descendents.

  One such optimization is the idea of a `skip-delta' -- a rapidly
  computable or retrievable delta between an ancestor revision and
  some non-immediate descendent.

  Making some reasonable assumptions about the costs of client-side
  computation, the range of cases in which skip-deltas are, in fact,
  an optimization is theoretically narrow:

  Let us suppose that we are interested in:

        delta (revA, revB)

  where revA is an ancestor of revB.

  There exist changesets:

        revA+1 revA+2 ....

  between revA and revB, and `delta (revA, revB)' is equal to the
  composition of those changesets.

  We have two opportunities:  (1) If we have an oracle for `delta (revA,
  revB)', then we can avoid the cost of the composition.  (2) If the delta
  we want is smaller than the sum of the sizes of the revA+n
  changesets, then we can reduce the amount of data we need to read
  to get our answer.

  Offsetting those opportunities are build_rev caches of revA and
  revB.  If we have cheap access to those trees, then any savings from
  skip-deltas have to be compared to the cost of just computing the
  delta by directly comparing the two trees.  (Revision libraries are
  an example of cheap access to those two trees, in some
  circumstances.)

  Complicating skip-deltas is the question of "which skip-deltas do we
  make available?"

  A smart server, perhaps with a revision library, _can_ make _all_
  skip-deltas available with a server-side cost that is at most the
  cost of comparing two trees.

  A dumb server, on the other hand, is likely to have at most only a
  subset of skip-deltas.

  There are many possibilities in between.

  The question arises, then: should a client use this optimization
  only if the precise skip-delta it wants is available?  or should
  the client combine the skip-delta operation with client-side
  changeset composition so that it can find a "close" skip-delta and
  add or subtract a few extra changes as needed?

  There is (as you can sort-of see in the design of svn) a pleasing
  theoretical result that tells you how to make build_rev essentially
  a random-access function that can deliver any tree in history,
  equally fast (or equally slowly :-).   The full theory involves
  optimizing the set of pre-computed skip-deltas and full-text copies
  so that only a bounded number of changesets needs to be applied
  against a full-text to compute a given revision.
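
  For a linear history numbered from 0, the usual construction (the
  scheme underlying svn's skip-delta design; a Python sketch) stores
  the delta for revision n against revision n with its lowest set
  bit cleared, so reconstructing any revision applies at most
  popcount(n) -- O(log n) -- changesets on top of the rev-0 full
  text:

        def skip_delta_base(n):
            """The revision against which revision n's delta is
            stored: n with its lowest set bit cleared."""
            return n & (n - 1)

        def reconstruction_chain(n):
            """Revisions whose deltas must be applied, oldest
            first, on top of the full text at revision 0."""
            chain = []
            while n:
                chain.append(n)
                n = skip_delta_base(n)
            return list(reversed(chain))

        # reconstruction_chain(42) == [32, 40, 42]: three deltas,
        # not forty-two.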

  It's a nice theory but it has some practical drawbacks.  First,
  actual access patterns to history tend not to be random access --
  thus, this theory provides an optimization which is not actually
  needed.  Second, constructing the skip-deltas and full-texts is not
  a "for free" operation and, while it too is theoretically bound, it's
  costs are significant.   (In svn, you can see that both in the
  number of server-side cycles needed for a commit and, partly as a
  side effect of other implementation decisions, in the growth-rate of
  the Berkeley DB logs.)   Third: it's not easy to implement the
  theory -- it's a considerable amount of code and complexity to
  write, maintain, and depend upon.

  So, skip-deltas are a mixed bag: it's easy to go wrong with them.

  For the purposes of `update', `get', and `star-merge', a very
  easy yet effective approach might be to implement a delta-server,
  which keeps a server-side revision library and uses (perhaps cached)
  mkpatch to act as a delta oracle.   What needs to be thought through
  is the right set of interfaces to such an oracle, weighing the
  topics discussed above in the sections about server-side,
  client-side, and transcendent issues.
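
  A minimal sketch of that delta-server core (Python; the cache
  layout and the one-tree-per-revision library lookup are invented
  for illustration, and I assume `tla mkpatch' is available
  server-side):

        import os, subprocess

        def delta_oracle(rev_a, rev_b, revlib, cache_dir):
            """Serve delta(revA,revB) from a cache, computing it
            on demand with mkpatch against a server-side revision
            library."""
            dest = os.path.join(cache_dir,
                                "delta(%s,%s)" % (rev_a, rev_b))
            if not os.path.exists(dest):
                tree_a = os.path.join(revlib, rev_a)
                tree_b = os.path.join(revlib, rev_b)
                subprocess.check_call(["tla", "mkpatch",
                                       tree_a, tree_b, dest])
            return dest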


*** On-demand Cached Revisions

  It's amusing to mention this in the context of Miles' complaint
  that archive-cached revisions are the wrong thing in some
  circumstances, but:

  Just as we can imagine a server that acts as an oracle for delta,
  we can imagine one that acts as an oracle for build_rev.

  Again, the client/server/transcendent issues apply.  

  I'll also recall that, even when we start dropping such oracles into
  the mix:

    One difficulty is knowing the costs of things.  In a given
    environment, which is cheaper: fetching and unpacking an
    archive-cached revision?  Or copying from a revision library and
    applying 12 changesets?


* Strategic Analysis: How to Pick Solutions

  A lot of the preceding material boils down to arch having, at its
  root, a surprisingly clean mathematical foundation.  Schematically:

  You've got some abstract objects which are "revisions" -- each an
  import, continuation, or simple changeset revision.

  These have names within a global namespace.

  Nearly all of the basic "revision control" operations can be largely
  explained as referentially transparent operations on those named
  revisions.

  So, in some sense, computing the answers that various arch commands
  are supposed to give is comparable to computing the answer to an
  arithmetic expression or computing the value of pi -- the results
  are defined by a set of purely self-referential equations for which
  we have enormous amounts of liberty about how to get the answer.

  Pragmatically, making an effective "calculator program" for those
  equations involves providing "oracles" for various intermediate
  results and, thus, dealing with the issues of representing such
  oracles server-side, client-side, or elsewhere.  (The current
  oracles being revlibs, pristine trees, and archive-cached
  revisions.)

  Currently, we have a very simple implementation of the calculator.  
  Its virtue or drawback, depending on your aesthetics, is that while
  it isn't automagically optimized -- it seems to be flexible enough
  to optimize in most situations with just a little bit of work (e.g.,
  deciding when you want a mirror or when you want a revision
  library).

  How do we build on that?  How do we add new optimizations?

  One idea is to take an "implementation centric approach": look at
  the tla source;  look for easy hacks;  do those easy hacks as the
  mood strikes.

  Another idea is to take a "purely design-driven approach":  look at
  the mathematical structure;  consider the namespace issues;
  consider planning for smart servers;  do only those hacks which
  satisfy as many of those design constraints as can be thought of.

  Neither extreme is The Right Thing, of course.   The easy hacks on
  the implementation will sometimes point out previously
  overlooked theoretical advances.   The theory will sometimes point
  out why an easy hack is, nevertheless, a bad idea.

  My attitude is: think for a long time and then, when no definitive
  answer appears, make a best guess.



* Postscript: What About Miles' skip-delta Idea?

  The outline above points out the issues that need to be addressed to
  turn Miles' idea from vague handwaving to a patch I'll accept.

  Briefly:

        1) Which skip-deltas will be stored server side?

        2) How will they be created, named, and accessed?

        3) What does a client have to do to look for them?

        4) Are changeset-composition operations an essential part 
           of making skip-deltas useful?

        5) Given the theoretically narrow range of applicability 
           of optimizations involving skip-deltas, what's the 
           compelling case for this plan?

-t



