[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Gnu-arch-users] arch lkml
From: |
Tom Lord |
Subject: |
Re: [Gnu-arch-users] arch lkml |
Date: |
Wed, 10 Dec 2003 14:20:55 -0800 (PST) |
Eric:
I'll reply with perhaps more than you wanted to know :-)
>>> The one very obvious potential issue I see with arch as it
>>> currently stands is that it does not use one of the more
>>> sophisticated storage formats for storing deltas. Which means
>>> as your archive size increases the work can increase. I think
>>> with a different backend format cacherev would not be
>>> necessary. But I may be wrong.
I don't think that you are "wrong", per se -- just that you're worried
about solving imaginary/outdated problems.
The classic (so-called:) "sophisticated" formats (e.g., SCCS) come to
us from a distant time (by industry timescales). There were designed
to solve a problem that arose in:
(a) a non-networked environment
(b) an environment in which disk-space was precious
and so consequently they:
(1) [re: (a)] assume computation close to the repository for
every operation
(2) [re: (b)] are aimed at balls-to-the-wall optimized compression
of revisions rather than just a comperable "big-O" (loosely
speaking) optimization
When those formats (or their more modern design-thinking copy-cats
like svn) are used in a more modern environment, that means:
(i) [re: (1)] they need to introduce smart servers and thus create
both a performance bottleneck and a single point of failure --
obstacles that can be overcome only with ridiculous amounts of
complexity (which of course creates new problems)
(ii) [re: (2)] they create systems which: flop over dead upon a
media failure; add absurd complexity (hence problems) to try
to optimize "common case" access patterns; aren't
easily extensible, and so forth.
Here is a better way to understand the arch format:
* separation of archive host and archive client
We recognize that the general case is that that these two hosts are
different and that when performance becomes really interesting,
it's going to be a matter of clients vastly outnumbering hosts.
Therefore, we want to push as much work as possible to the client
side.
* minimization of network bandwidth
We recognize that the important case is of a developer who
is active over a prolonged period of time in some project whose
mainline archive is hosted remotely. We also recognize that the
internet is basically slow.
Therefore, we want to minimize network bandwith primarily, and
latency secondarily, between our developer and his project.
It is damn hard to beat compressed changesets for that purpose
other than by increments.
To be sure, arch's push-mirror could do better on latency issues --
it should eventually work more like rsync and _maybe_ even ask for a
little help from server-side computation.
But in general, for our important-case developer, arch makes
network performance about as close to irrelevent as it can be.
* serendipitous archive-as-payloads
So we're going to have long-lived client sites that pull
updates or push updates to or from some archive host.
In fact, we'll have _many_ such clients for our most interesting
hosts.
One idea is that we should, archive-side, just cache the payloads
that we'll send to these clients -- so that when a client needs some
"thing" we just find it on disk and stream it out.
It turns out, then, that if instead of "caching" we simply "memoize"
those payloads ("caching" can discard stuff -- "memoizing" holds on
to it forever) that the memo we create contains, in a _reasonably_
compact form, all of the information we would have in an archive.
So: the memo _is_ the archive. Disregarding archive-cached
revisions for the moment, an arch archive _is_ just a memo of the
payloads that a client might ask for.
That's dirt-simple to implement, allows arch to use "dumb servers",
and is the same "big-O" space complexity as the more sophisticated
formats.
* rely on client-side caches, but keep it simple, stupid
Client-side, our interesting developer is going to have some
pretty interesting access patterns -- focused mainly on recent
revisions, semi-random-access within those.
Revision libraries provide that in, again, a dirt-simple way that
has just about optimal performance characteristics. The recent
hacks for --hard-links, library paths, sparse libraries and so
forth just make that sweeter and sweeter: our interesting developer
can these days (using the head revision at least -- you'll see it in
the next release) check out a tree just as fast as he can hard-link
to it. Whole tree comparisons take, in the expected case, the
time of `find' traversals of the two trees plus the time to diff the
files that actually changed.
In terms of _time_ complexity, this romps all over the old
"sophisticated" formats and their modern descendents.
In terms of _space_ complexity, it's awefully damn competitive with
the "sophisticated" formats, especially when you multiply the
difference between them by the price-per-byte of disk storage.
* cache-rev is for patzers
Archive cached revisions pretty much don't matter to our
interesting-case developer -- they mostly help the more casual
users of arch archives. They're a courtesy in that regard.
If you're working with some project, make a local mirror, pass
--no-cached to push-mirror, and make sure you set up your libraries
well. Arch would _arguably_ be better off not even bothering with
cached revs.
I exaggerate slightly: if your ongoing relationship to a project
is that you want to keep current with the head revision but don't
care about keeping the history of it locally (perhaps you are a
"tester"), then cached-revs (and, later, cached-summary-deltas) are
a win for you. But these things really aren't of central
importance to arch.
A big theme in the preceding is "dirt simple". An example of what
that implies can be seen in recent discussions about "archive
signing". There's a minor crisis going on the free software world
because project hosts keep getting cracked and people want to be sure
that their source archives are not corrupted -- hence an interest in
signing. Contrast recent discussions on the dev@ list for svn with
those on arch. Here in arch-land we're ~30% done implementing signing
after just a few days -- I could (an not officially, but ...) commit
to having it up and running by the end of the month. Now read the
dev@ for svn (and think through the performance and complexity
implications of their most rapid-route suggestion of a hook-based
answer). Meanwhile, we're not worrying about "how to possibly do it"
over here -- but "how to do it in any of the several possible ways
while not screwing up the security concerns and while keeping a
long-term view of the architecture."
> As I understand the literature recent work on version control has
> used what is a variation on the gzip format for storing multiple
> versions. The idea is you compress the first file like normal.
> But for the second and subsequent files you look back into your
> archive (which you are simply appending to) and use previous text
> for compression. This makes both appends and random access fast
> and in addition this happens to work for random binary files.
arch is not heavily optimized for binary files, that's true. I've
yet to see any evidence that this matters all that much for software
development and given today's disk economics.
> Thinking about the implications of this I wonder why no one has built
> a general purpose archiver like with a structure like that. I should
> give roughly the same compression ration as tar | gzip while still
> giving you the random access properties of zip.
> So it looks like the very practical thing to do is to build a general
> purpose, compressed, random access file archiver, which can be used
> for backups or file archiving whichever makes most sense. And then
> come back and look at improving arch.
_Possibly_. It's not obvious that it'll be worth it (nor that it
won't). But if you are going to go that route, google around for Josh
MacDonald first.
-t
- Re: [Gnu-arch-users] arch lkml, (continued)
- Re: [Gnu-arch-users] arch lkml, Thomas Zander, 2003/12/10
- Re: [Gnu-arch-users] arch lkml, Robert Collins, 2003/12/10
- Re: [Gnu-arch-users] arch lkml, Thomas Zander, 2003/12/10
- Re: [Gnu-arch-users] arch lkml, David Brown, 2003/12/10
- Re: [Gnu-arch-users] arch lkml, Robert Collins, 2003/12/11
- Re: [Gnu-arch-users] arch lkml,
Tom Lord <=
- Re: [Gnu-arch-users] arch lkml, Eric W. Biederman, 2003/12/12
- Re: [Gnu-arch-users] arch lkml, Christopher Dale, 2003/12/12
- Re: [Gnu-arch-users] arch lkml, Tom Lord, 2003/12/12
- Re: [Gnu-arch-users] arch lkml, Tom Lord, 2003/12/12
- Re: [Gnu-arch-users] arch lkml, Eric W. Biederman, 2003/12/13
- Re: [Gnu-arch-users] arch lkml, Robert Collins, 2003/12/13
- Re: [Gnu-arch-users] arch lkml, Tom Lord, 2003/12/13
- [Gnu-arch-users] Considering Semantics with a simple but stupid file archive format., Eric W. Biederman, 2003/12/13
- [Gnu-arch-users] Re: arch lkml, Miles Bader, 2003/12/10
- Re: [Gnu-arch-users] arch lkml, Aaron Bentley, 2003/12/11