[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Gnu-arch-users] situations where cached revisions are not so good
From: |
Tom Lord |
Subject: |
Re: [Gnu-arch-users] situations where cached revisions are not so good |
Date: |
Wed, 24 Sep 2003 11:27:49 -0700 (PDT) |
> From: Miles Bader <address@hidden>
> [archive caching revisions doesn't always have the
> right performance characteristics.
>
> Say, how about skip-deltas?]
You got a bunch of good answers in this thread already -- this one
is slightly redundant but hopefully useful.
Short-answer-that-works-today: don't balk at using mirrors here,
especially if you can afford a revision library. That gives an almost
optimal solution at the cost of some client-side disk space. An Emacs
revlib is likely to be huge and, if you don't have the latest disk
hardware client-side we can talk about how to manage this until you
do.
Since other people are pretty on-top of the short-answer already, let
me try to sum up some of the scattered design discussions we've had
over the past few months, in which context skip-deltas come into play.
This is long, but yes, really, your idea of skip-deltas is directly
analyzed in it.
contents:
* Problem Definition: build_rev and delta
* Problem Analysis
** How delta May be Implemented
** How build_rev May be Implemented
** Referential Transparency, Caching, Recursion and Search
* Solution-space Analysis
** Are Good Taste and Complex Heuristic Searching Compatible?
** Server-side Issues
*** Namespace Extensions
*** Server-side Computation vs. Server-side Storage
*** Capability Negotiation and Heuristic Guides
*** Round-trips Issues
*** Transactional Issues
** Client-side Issues
*** Uniform Interfaces
*** Scheduling Client-side Computation
*** Resource Management and Locuses of Administration
** Issues That Transcend "client/server"
** Specific Hacks
*** Server-side Skip-deltas
*** On-demand Cached Revisions
* Strategic Analsysis: How to Pick Solutions
* Postscript: What About Mile's skip-delta Idea?
* Problem Definition
Two low-level operations are of interest here:
build_rev (REV, DEST)
Create a client-side tree, DEST, which is a copy of
revision REV.
delta (REVA, REVB, DEST)
Create a client-side directory, DEST, which is a changeset
describing the differences between REVA and REVB
These are the operations whose performance characteristics (disk
space, I/O bandwidth use on the bus, I/O bandwidth use on the net,
network roundtrips count, OS cache impact, transactional bottlenecks
and costs, and CPU use (at least)) are of interest. We're
interested in both client and server side costs.
And our performance goals are mostly to be able to make a wide range
of trade-offs. The ideal performance is different for someone
behind a slow network connection than someone behind a very fast
one. The server performance ideal is different for someone running
Savannah than for someone running the source code repository for
Solaris development at Sun. We want the flexability to address the
needs of all four of those users well, and the grace to address them
all in a seemless way that preserves interoperation among them.
We place no a priori limits (other than good taste) on what can be
done to work on the performance of build_rev and delta. New
client-side data structures, smart servers.... anything is game.
We'll assume that the foundational database provided as input
to build_rev and delta is, for all intents and purposes, an arch
archive in the current format, with no cached revisions. This
doesn't have to be literally true -- the archive could be stored in
an SQL database or Berkeley DB or whatever -- the changesets could
be computed on demand -- whatever. The essential assumption,
though, is that the raw data is indexed as it is in today's archives
and accessible with comperable costs.
So, roughly speaking, add whatever you'd like -- make build_rev and
delta go fast. What's the right thing to add, and how do we do
that with grace and good taste?
* Problem Analysis
** How delta May be Implemented
There are two basic ways to compute `delta': brute force and fancy.
The brute force technique uses build_rev:
delta (REVA, REVB, DEST)
{
build_rev (REVA, TMPA);
build_rev (REVB, TMPB);
mkpatch (TMPA, TMPB, DEST);
}
In some cases, such as if REVA and REVB are unrelated revisions,
the brute force technique is the only one available.
In other situations, fancier techniques apply.
For example, let's suppose that REVA is an ancestor of REVB.
In that case, there is a sequence of revisions, each defined
by a changeset:
REVA
REVA+1 has: changeset (REVA+1)
REVA+2 " changeset (REVA+2)
... ...
REVB has: changeset (REVB)
We know that each of these changesets is an exact patch against the
immediately preceeding ancestor. As such, they can be "composed"
reliably. So in this special case, we could compute delta:
delta (REVA, REVB, DEST)
{
DEST = changeset (REVA+1);
DEST += changeset (REVA+2);
DEST += changeset (REVA+3);
...
DEST += changeset (REVB);
}
Just as we know that in some cases changesets can be composed,
we also know that some inverses of changesets can be composed.
We could also compute delta this way:
delta (REVA, REVB, DEST)
{
delta (REVA, REVB+1, DEST);
DEST -= changeset (REVB+1);
}
etc.
In short, the "fancy technique" of computing delta means thinking of
having a graph of revisions and their changesets, finding a
"shortest path" on that graph in a search where edge traversals
correspond to things such changeset composition or calls to the
brute-force delta technique.
** How build_rev May be Implemented
Build_rev is largely the same problem as delta. There's an
obvious brute force technique (unpack the initial import ancestor,
apply all changesets since):
build_rev (REV, DEST)
{
R = import_ancestor_of (REV);
DEST = unpack_import_tree (R);
apply_changeset (DEST, changeset (R+1));
apply_changeset (DEST, changeset (R+2));
...
apply_changeset (DEST, changeset (REV));
}
but more generally, we're right back to delta_rev, with a recursive
definition like this:
build_rev (REV, DEST)
{
R = pick_any_revision_in_the_world;
build_rev (R, DEST);
delta (R, REV, TMP);
apply_changeset (DEST, TMP);
}
** Referential Transparency, Caching, Recursion and Search
build_rev and delta have referential transparency: Given the
same revision-name arguments, they always produce equal outputs.
In arch, the revision name arguments exist in a global namespace --
in other words, the referential transparency crosses machine
boundaries. It doesn't matter, for example, whether the answer is
computed client-side or server-side.
Another useful consequence of referential transparency is that the work
done by build_rev and delta can be memoized or cached. Pristine
trees are an example of caching the work done by build_rev.
Revision libraries are an exampe of memoizing the work done by
build_rev. Archive-cached revisions are another example of
memoizing work done by build_rev.
Managing the performance of build_rev and delta then reduces to
regarding the recursive form of those operations as an opportunity
for search. The "optimal" implementation searches the possible
recursion paths, looking for the shortest path from the raw
archives and any caches and memos to the desired result.
Such a search is conceptually simple, but there are some
difficulties in practice.
One difficulty is knowing the costs of things. In a given
environment, which is cheaper: fetching and unpacking an
archive-cached revision? Or copying from a revision library and
applying 12 changesets?
Another difficulty is uniformity of access. Currently, three forms
of cache/memo for build_rev each have their own distinct interface:
pristine trees, archive-cached revisions, and revision libraries.
The current implementation of build_rev performs a very crude
heuristic search -- using each of those interfaces separately.
Is a more uniform interface desirable and practical? What form
might it take?
* Solution Space Analysis
** Are Good Taste and Complex Heuristic Searching Compatible?
There's at least one thing we should be suspicious of from the
get-go.
Currently, tla has a delta_rev which is brute-force and a build_rev
which is a very simplistic heuristic search (about 250 lines of
code, could be described in perhaps 10 lines of pseudo-code).
This simple-minded implementation has a few consequences:
1) it's easy to understand
2) it does _not_ automagically give optimal performance
3) it's easy to get very good performance from it, in any
given situation, by managing mirrors, cached revisions,
and a revision library
4) it's simple enough to be reimplemented from scratch
very quickly, and such a reimplementation will have
comperable performance characteristics
5) it's quite maintainable and has small source and object
code size
Of those points, only (2) seems to me to be any kind of drawback,
and that drawback is compensated for by (1) and (3).
If we were to make build_rev or delta much fancier, virtues (1),
(3), (4), and (5) could easily be lost in the shuffle.
At the very least, as we explore the solution space, I think we
should keep in mind the goal that implementations like the current
one should remain "first class citizens" --- quite acceptable
options. It's ok to tweak it a bit -- but the tweaks should be
elegant.
** Server-side Issues
*** Namespace Extensions
Earlier we observed that build_rev and delta display referential
transparency. They share that property with some other arch
operations which aren't immediately under consideration here.
Ultimately, the referential transparency properties of many arch
archives follow from the existence of a global namespace of
revisions and the referentially closed nature of arch operations
(e.g., operations such as `get' are defined in a deterministic way
in terms of things named in the global namespace -- their result
does not, for example, depend in any important way on a per-user
preference setting. By being deterministic and being defined over
other referentially transparent names, they themselves inherit
referential transparency.)
As we examine the solution space for build_rev and delta, we ought
to keep in mind the idea of making controlled extensions to the
global arch namespace.
To try to make that a little bit more explicit, let me point out a
case in which the principle is violated and show how that might be
corrected:
Archive cached revisions are not part of the global arch namespace.
There is no arch name which denotes a possibly existing
archive-cached revision. Such a name _might_ look like:
address@hidden/tla--devo--1.1--patch-42:CACHED
but we don't have such names. Instead of having such names, the
idea of an archive cached revision is built-in as a primitive
concept of "what is the nature of a revision". It's not a name that
you can ask if the name is bound -- it's a binary flag that's
returned from the `revision_type' method which all archive
implementations must provide.
That could have, and _arguably_ should have, been different.
The archive vtable might have had a general purpose namespace
query operation:
resource_exists? (NAME)
which if passed a "...:CACHED" name such as above would answer yes
or no.
_If_ we wind up thinking that it's a good idea to extend the "set of
things a server might provide" -- such as if we want skip-deltas or
smart-servers -- then it is a good idea to at least consider
generalizing the namespace in order to name these "things a server
might provide" in a systematic way.
Perhaps, for example, I can ask if my server can produce a cached
`delta' result by asking:
resource_exists? ( "delta(revA,revB)" )
A very good reason to consider extending the namespace that way
is because of the implications for _dumb_ servers. The next
section talks about that.
*** Server-side Computation vs. Server-side Storage
It is a serious virtue of arch that a dumb server -- esstentially
just a remote filesystem of limited capability -- is an adequate
server.
That doesn't mean that smart servers should be precluded by the
design -- but it does suggest something about the _interface_ to a
smart server.
As noted, referential transparency of the results of many
conceivable queries has many benefits. I think it a reasonable
principle to push as far as it can be pushed: even with a
smart server, queries should be referentially transparent.
A dumb server can answer referentially transparent queries in a
useful way as well. The dumb server doesn't compute the results,
true -- but the results of query can be stored in a dumb server by a
client, for the benefit of other clients and later queries.
This gets back to the idea of extending the namespace:
To a smart-server, a query like:
resource_exists? ( "delta(revA,revB)" )
may mean "Say, are you willing to compute that delta for me?"
To a dumb server, the same query may mean "Say, did anybody store a
copy of that result there?"
Archive cached revisions are another example of the application of
this idea. On a dumb server, a revision is archive-cached if
somebody stored it there. On a smart server, it might be
reasonable to compute any desired achive-cached revision on-demand.
*** Capability Negotiation and Heuristic Guides
The suggested `resource_exists?' query shown above is a trivial
example of capability negotiation.
Suppose it turns out, as we make build_rev and delta into very fancy
algorithms, that clients will sometimes want to make many, many
queries using `resource_exists?'.
Suppose further that for a particular server, we know in advance
that all such queries (say, if the resource being asked for is a
"delta(X,Y)" resource) will be answered "No.".
Then the client could save many round-trips to the server if it
could ask up front "are you able to answer _any_ delta(X,Y)
queries"?
So, something else to keep in mind as we contemplate extending the
"set of things a server might provide" is the potential need for
capability negotation. Because:
*** Round-trips Issues
Each new little thing you stick server-side, that delta and
build_rev might use, has to be considered in light of the new server
round-trips it will impose.
It would be useless for implementations to minimize, say, the amount
of patching needed or the amount of network bandwidth needed for
build_rev if, at the same time, that optimization required 100s of
server round-trips to compute.
*** Transactional Issues
Dumb servers are remote file systems of limited capability.
One consequence of that is that they are very limited in their
transactional capabilities.
The ".listing" files mechanism illustrates the problem. It is a
practical impossibility to guarantee that ".listing" files are kept
up-to-date: an ill-timed ^C to a client can spoil them, for example.
It is a happy coincidence that ".listing" files can be harmlessly
(in terms of archive integrity) out of date and that therefore a
simple mechanism like the `archive-fixup' mechanism can repair them
on demand.
Not every conceivable addition to what might be stored on a dumb
server has that happily coincidental property.
Therefore, any proposed addition to "the set of things a server
might provide" has to be evaluated wrt to these questions:
* Can that new thing be stored on a dumb server without
compromising archive integrity?
* If not, then: Is it still reasonable to run a dumb server?
* If not and it is still reasonable to run a dumb server
then, are we _sure_ this new feature is really needed?
** Client-side Issues
With pristine trees and revision libraries we already do some
client-side caching and memoizing.
As we talk about further optimizing build_rev and delta, the
inevitable question arises: should we expand the client-side
mechanisms for caching, memoizing, and otherwise optimizing these
operations? For example, should there be client-side caches of
the results of delta?
If so, new questions arise:
*** Uniform Interfaces
In considering the server-side, we contemplated expanding the arch
namespace to include not only revisions -- but also other values
which are computed from revisions by operations such as `delta'.
The same issue arises client-side.
Currently, we have those two rather ad-hoc mechanisms (rev libs,
pristines). If we're going to introduce new structures (such as
delta-caches) then perhaps it's time to think about a more uniform
approach to accessing and querying these client-side resources: an
interface based on an extension of the namespace and independent of
any particular client-side mechanism.
*** Scheduling Client-side Computation
A minor annoyance or interesting opportunity, depending on how you
look at it, is the cost of updating client-side data structures that
cache interesting information.
As a simple example, I use a `commit' hook which updates my revision
library. Before `commit' exists, the newly committed revision is
added to my library.
On balance I've found that hook to be worth it -- but it's not
without cost. Each commit acquires the cost of building a new link
tree and applying a patch to that tree. If my project trees were
much larger, that cost might start to exceed it's value.
A reasonable solution in this area would be to update the library in
the background -- to orphan a process so that `commit' could exit
quickly, but then let the orphanned process update the rev lib
behind my back.
As we talk about complicating the structure of client-side caches,
we ought to also consider being systematic about scheduling and
managing "background" computations.
*** Resource Management and Locuses of Administration
Like most current arch-users, I suspect, I manage my own machine.
I administer my own revision library, and so forth.
It isn't hard to imagine environments where that would be rather
inconvenient. Instead of me managing my own revision library,
perhaps that could be done by a system admin who provides a revision
library which my entire team NFS-mounts, for example.
Once again, as we talk about hairing up the client-side data
structures, we ought to consider who administers those resources,
and how, and what that implies about the various interfaces we
define.
** Issues That Transcend "client/server"
Let's step back and look at the big picture again for a minute:
We have these two operations: build_rev and delta.
We're talking, generally, about ways to make them fancier, largely
by providing caches, memos, and servers that act as partial oracles
for those (essentially recursive) operations.
In some sense, as a user, I don't care whether those oracles are
client-side, server-side, or something else entirely -- I only care
that they speed things up.
And as an implementor of build_rev and delta: I reach the same
conclusion. If this area starts to get fancy, I don't want to
write code like:
if (is_it_in_the_rev_lib?)
...
else if (is_it_in_a_pristine?)
...
else if (is_is_archive_cached?)
...
....
else
recurse_and_do_it_the_brute_force_way;
I want to write code like:
if (is_there_a_shortcut?)
...
else
recurse_and_do_it_the_brute_force_way;
As suggested by the item, above, "Resource Management and Locuses of
Administration" -- perhaps there is a need here for a new
abstraction: not an arch client; not an archive server; but an
arch short-cuts server (which may be part of the client environment,
part of the archive server environment, or some different third thing).
** Specific Hacks
A very short catalog of two ideas that come around from time to
time.
*** Server-side Skip-deltas
The `update' operation effectively computes a delta betwen two
revisions, one a direct ancestor of the other.
The `get' operation does the same.
The `star-merge' operation sometimes computes such a delta, and even
when it doesn't, the delta it does compute could be described as the
composition of a direct-ancestor/descendent delta with one other
delta (which has a revision name and is stored directly in the
base-level archive).
All three operations benefit from optimizations to the delta
operation, and all three could specificially benefit from
optimizations that focus on deltas between direct ancestors and
their descendents.
One such optimization is the idea of a `skip-delta' -- a rapidly
computable or retrievable delta between an ancestor revision and
some non-immediate descendent.
Making some reasonable assumptions about the costs of client-side
computation, the range of cases in which skip-deltas are, in fact,
an optimization is theoretically narrow:
Let us suppose that we are interested in:
delta (revA, revB)
where revA is an ancestor of revB.
There exist changesets:
revA+1 revA+2 ....
between revA and revB and `delta (revA, revB)' is equal to the
compose of those changesets.
We have two opportunities: (1) If we have an oracle for `delta (revA,
revB)', then we can avoid the cost of the compose. (2) If the delta
we want is smaller than the sum of the sizes of the revA+n
changesets, then we can reduce the amount of data we need to read
to get our answer.
Offsetting those opportunities are build_rev caches of revA and
revB. If we have cheap access to those trees, then any savings from
skip-deltas has to be compared to the cost of just computing the
delta by directly comparing the two trees. (Revision libraries are
an example of cheap access to those two trees, in some
circumstances.)
Complicating skip-deltas is the question of "which skip-deltas do we
make available?"
A smart server, perhaps with a revision library, _can_ make _all_
skip-deltas available with a server-side cost that is at most the
cost of comparing two trees.
A dumb server, on the other hand, is likely to have at most only a
subset of skip-deltas.
There are many possibilities in between.
The question arises, then, should a client only use this
optimization if the precise skip-delta it wants is available? or
should the client combine the skip-delta operation with client-side
changeset composition so that it can find a "close" skip-delta and
add or subtract a few extra changes as needed?
There is (as you can sort-of see in the design of svn) a pleasing
theoretical result that tells you how to make build_rev essentially
a random-access function that can deliver any tree in history,
equally fast (or equally slowly :-). The full theory involves
optimizing the set of pre-computed skip-deltas and full-text copies
so that only a bounded number of changesets needs to be applied
against a full-text to compute a given revision.
It's a nice theory but it has some practical drawbacks. First,
actual access patterns to history tend not to be random access --
thus, this theory provides an optimization which is not actually
needed. Second, constructing the skip-deltas and full-texts is not
a "for free" operation and, while it too is theoretically bound, it's
costs are significant. (In svn, you can see that both in the
number of server-side cycles needed for a commit and, partly as a
side effect of other implementation decisions, in the growth-rate of
the Berkeley DB logs.) Third: it's not easy to implement the
theory -- it's a considerable amount of code and complexity to
write, maintain, and depend upon.
So, skip-deltas are a mixed bag: it's easy to go wrong with them.
For the purposes of `update', `get', and `star-merge', a very
easy yet effective approach might be to implement a delta-server,
which keeps a server-side revision library and uses (perhaps cached)
mkpatch to act as a delta oracle. What needs to be thought through
is the right interfaces to such an oracle, considering the topics
considered above in sections about server-side, client-side, and
transcendent issues.
*** On-demand Cached Revisions
It's amusing to mention this in the context of Mile's complaining
about archive-cached revisions being the wrong thing in some
circumstances, but:
Just as we can imagine a server that acts an oracle for delta,
we can imagine one that acts as an oracle for build_rev.
Again, the client/server/transcendent issues apply.
I'll also recall that, even when we start dropping such oracles into
the mix:
One difficulty is knowing the costs of things. In a given
environment, which is cheaper: fetching and unpacking an
archive-cached revision? Or copying from a revision library and
applying 12 changesets?
* Strategic Analsysis: How to Pick Solutions
A lot of the preceding material boils down to arch having, at its
root, a surprisingly clean mathematical foundation. Schematically:
You've got some abstract objects which are "revisions" -- each an
import, continuation, or simple changeset revision.
These have names within a global namespace.
Nearly all of the basic "revision control" operations can be largely
explained as referentially transparent operations on those named
revisions.
So, in some sense, computing the answers that various arch commands
are supposed to give is comperable to computing the answer to an
arithmetic expression or computing the value of pi -- the results
are defined by a set of purely self-referential equations for which
we have enourmous amounts of liberty about how to get the answer.
Pragmatically, making an effective "calculator program" for those
equations involves providing "oracles" for various intermediate
results and, thus, dealing with the issues of representing such
oracles server-side, client-side, or elsewhere. (The current
oracles being revlibs, pristine trees, and archive-cached
revisions.)
Currently, we have a very simple implementation of the calculator.
It's virtue or drawback, depending on your aesthetics, is that while
it isn't automagically optimized -- it seems to be flexible enough
to optimize in most situations with just a little bit of work (e.g.,
deciding when you want a mirror or when you want a revision
library).
How do we build on that? How do we add new optimizations?
One idea is to take an "implementation centric approach": look at
the tla source; look for easy hacks; do those easy hacks as the
mood strikes.
Another idea is to take a "purely design-driven approach": look at
the mathematical structure; consider the namespace issues;
consider planning for smart servers; do only those hacks which
satisfy as many of those design constraints as can be thought of.
Neither extreme is The Right Thing, of course. The easy hacks on
the implementation will sometimes be pointing out previously
overlooked theoretical advances. The theory will sometimes point
out why an easy hack is, nevertheless, a bad idea.
My attitude is: think for a long time and then, when no definitive
answer appears, make a best guess.
* Postscript: What About Miles' skip-delta Idea?
The outine above points out the issues that need to be addressed to
turn Miles' idea from vague handwaving to a patch I'll accept.
Briefly:
1) Which skip-deltas will be stored server side?
2) How will they be created, named, and accessed?
3) What does a client have to do to look for them?
4) Are changeset-composition operations an essential part
of making skip-deltas useful?
5) Given the theoretically narrow range of applicability
of optimizations involving skip-deltas, what's the
compelling case for this plan?
-t
- Re: [Gnu-arch-users] Re: situations where cached revisions are not so good, (continued)
- Re: [Gnu-arch-users] Re: situations where cached revisions are not so good, Alexander Deruwe, 2003/09/24
- [Gnu-arch-users] Re: situations where cached revisions are not so good, Miles Bader, 2003/09/24
- Re: [Gnu-arch-users] Re: situations where cached revisions are not so good, Robert Anderson, 2003/09/24
- [Gnu-arch-users] Re: situations where cached revisions are not so good, Miles Bader, 2003/09/24
- Re: [Gnu-arch-users] Re: situations where cached revisions are not so good, Tom Lord, 2003/09/24
- [Gnu-arch-users] Re: situations where cached revisions are not so good, Miles Bader, 2003/09/24
- [Gnu-arch-users] Re: situations where cached revisions are not so good, Tom Lord, 2003/09/24
- Re: [Gnu-arch-users] Re: situations where cached revisions are not so good, Jason McCarty, 2003/09/24
Re: [Gnu-arch-users] situations where cached revisions are not so good, Robert Collins, 2003/09/24
Re: [Gnu-arch-users] situations where cached revisions are not so good,
Tom Lord <=