Re: [Gnu-arch-users] Some issues


From:	Tom Lord
Subject:	Re: [Gnu-arch-users] Some issues
Date:	Mon, 14 Jun 2004 19:40:18 -0700 (PDT)
  > Some Issues with GNU arch

Thank you.  I'm not trying to be argumentative with you in what
follows.   In some important ways your issues are right on the mark
and, it's comforting to me to note, they're also What We're Working
On, Right Now.   In other ways, your issues are stuffed with horse
droppings though it's hard to explain clearly precisely why because,
superficially, the issues sound entirely reasonable.

  > This memo collects some issues which could be considered design
  > defects, hard-to-overcome implementation deficiencies in arch's
  > current implementation (called tla), and general design
  > directions. The memo is still work in progress and is, of course,
  > entirely subjective.

  >  Design Defects
  >  Implementation Issues
  >  Directions
  >  Some Ideas

  > Design Defects

  >   The changeset format is defined relative to GNU patch and GNU
  >   tar. These data formats are still somewhat in flux.

Those formats change, but it is highly unlikely that they will change
in upwardly incompatible ways.  Thus, they are a good foundation.  It
was precisely the longevity, reputation and deployment, and relative
stagnation of the GNU diff and GNU patch projects that made me
comfortable depending upon their stability.   Nobody can get away with
breaking GNU diff or GNU patch or GNU tar _too_ badly for Arch's
purposes because too many things besides arch depend on those tools.



  > The changeset format does not handle binaries efficiently, and
  > certain text files (e.g. XML files not created by a text editor
  > and formatted for readability).

Here, you are complaining, more or less, that water is wet and that
ducks fly.

A changeset in arch is similar to a patch file in traditional, 1980s,
unix: You must be able to apply a changeset (or a patch) to an object
that doesn't precisely match the ORIG from which the changeset (or
patch) was generated.  Patch(1) has its fuzzy rules for that and .rej
files.  Arch inherits and adds to those.  With binary files such as
jpeg, or XML that is effectively binary because no human formats it,
there is nothing useful that patch can do other than provide the
"whole texts".
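
To make the patch(1) analogy concrete, here is a minimal Python sketch
of applying a hunk to a target that has drifted from ORIG.  It is an
illustration only, not how patch or tla is implemented; `apply_hunk`,
`max_offset` and the list-of-lines representation are hypothetical
names chosen for the example.

```python
def apply_hunk(lines, start, old, new, max_offset=3):
    """Replace the block `old` with `new` near index `start`.

    Tries the expected position first, then nearby offsets, so the
    hunk still applies when the target is not exactly ORIG.
    Returns the patched list, or None if the hunk is rejected.
    """
    offsets = [0]
    for d in range(1, max_offset + 1):
        offsets += [d, -d]
    for off in offsets:
        pos = start + off
        if pos < 0:
            continue
        if lines[pos:pos + len(old)] == old:
            return lines[:pos] + new + lines[pos + len(old):]
    return None  # a real tool would record the hunk in a .rej file


target = ["a", "b", "c", "d"]
patched = apply_hunk(target, 0, ["c"], ["C"])  # hunk expected at line 0
```

A line-oriented search like this has nothing to hold on to in a jpeg,
which is why only the whole text can be stored for such files.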

Yes, yes, in _theory_ we can do a better style of diffs for XML files
than diff(1) would do, attending to the tree structure rather than
stream-structure of the XML, but here in Arch-maintainer land we're
still waiting for someone to implement a really good diff(1) and
patch(1) interface-alike for this theoretical space of XML diffs.  As
soon as that happens and someone cares enough to work on it, you can
see it implemented in arch in a matter of days or weeks.



  > In essence, an archive consists of concatenated changesets, which
  > are directly exposed in a file-based interface. This makes it very
  > complex to address issues with the changeset format itself, and
  > the archive interpretation might change when new versions of patch
  > and tar are installed.

I'm not sure what you are complaining about here.  Yes, GNU Arch does
depend on the file format produced by GNU tar.  So?  What's your
point?  It will have to depend on _some_ format.  GNU tar format has
been stable for a long time.  There are social and technical reasons
why we can be extremely confident that GNU tar will not become upward
incompatible.   Arch depends on these things and these things are
dependable.    Isn't that good?


  >  Arch does not implement a distributed system. For example, its
  >  archive replication does not transparently handle write
  >  operations.

Water is wet.  There is no such thing, in heaven or upon earth, as
handling write transactions to a distributed database in some way that
doesn't take into account the unique conditions that can arise in a
distributed database.

Analogously: One of _my_ big complaints in the world is that computing
the prime factors of an integer N is not (as far as we know) an O(1)
operation for arbitrary N.

     > There is no integrated mechanism to atomically commit related
     > changesets to two branches (even if these branches are
     > contained in the same archive).

I haven't double checked but I believe I've gotten around to posting
the "FEATURE PLAN" for two-phase commits which address precisely this
issue.

Mind you, the feature has been requested at most perhaps 5 times in
the history of arch but, I agree that it is useful and it is on the
way.



    > Categories, branches, and versions are not orthogonal at all and
    > add unnecessary complexity. Future features cannot differentiate
    > between them because they are used very inconsistently in
    > existing archives.

Bah.  The arch namespace is an _exact_ (character by character) match
for _none_ of the ways projects are commonly done but it's an
_isomorphic_ match for essentially all of the ways projects are
commonly done.  It's a humanly useful (in a distributed world)
coordinate system imposed on an otherwise under-structured world of
anonymous revisions.  It's just fine and, moreover, it's a damn sight
better than CVS's "even-odd version number" scheme, SVN's
repository-wide-revision-number, and Monotone's misapplication of
cryptographic checksums as names.   In my egotistical moments, I
think of myself as enacting the role of Dewey in laying down that
namespace:  everyone will always hate it and benefit greatly from it
at the same time.

That said, you're absolutely correct when you imply that, with the
current command set, the somewhat verbose namespace currently gets to
be a bit tedious.   I think that that's a minor issue, in the big
picture, and, anyway, many of the recently posted feature plans to the
gnu-arch-users mailing list will pick off some of these issues.



    > GNU arch does not support a centralized development model which
    > lacks a single, designated committer.

     > Branch creation is not versioned. Branches cannot be
     > deleted. This means that branches stay around forever, even after
     > development on them has finished. (This could be worked around in
     > the implementation by hiding branches, but it doesn't seem to be
     > the right thing to do.)

Doesn't seem?  It seems like exactly the right thing to me.  What's
hidden and what's not is just a presentation-layer issue, nothing
more.  It's analogous to `ls' (and the -a convention) vs. the kernel
filesystem interface.  Oops.... arch hasn't yet added "-a" to its
ls-like commands.  Well, we can do that (it would take more
effort than complaining about the issue on the GCC list, but not much
more, which is my larger point :-)....


     > The access methods for remote archives are subject to a lot of
     > round trips. Therefore, archive replication using tla itself is
     > very slow.

The access methods for remote archives optimize the specific kinds of
operation that day to day, minute to minute commands are likely to
need.  Archive replication _by_arch_itself_ is _not_ a minute to
minute need and it suffers in this optimization as a result.   But
archive replication by arch itself is _not_ the only way to replicate
archives -- it is simply one that is guaranteed to work if the server
supports _only_ an arch protocol and nothing else.   If your server
can support _additional_ protocols, such as rsync, then you have also
an essentially optimal replication protocol available.  Really, the
arch archive format interacts with rsync far better than, say, the
format of a CVS repository.

     > The archive format optimizes for access to early versions, not
     > most recent ones as one would expect. (Once the archive format is
     > no longer exposed directly, this becomes an implementation issue,
     > not a design issue.)

That's patently false though it's easy to see how you got that idea.
The archive format optimizes for _optimal_ random access to versions
although people don't currently realize this as much as they should
because to fully activate the optimizations some careful by-hand steps
are needed.  These steps will, over the coming few months, be
automated.  Your complaint is _mostly_ that arch is a touch too hard
to configure.



     > The caches which compensate the previously mentioned issues are
     > not expired by tla. (This includes revision libraries
     > and, apparently, pristine copies stored inside a checked-out
     > copy of a revision.)

On the one hand, the `rm' command expires these perfectly well and its
use as such enables archive hosts to set their own policies.

And on the other hand, there are policies that are optimal under some
common conditions and that's what we're currently working on making
the default.
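
As an illustration of a host-chosen policy built on nothing more than
file removal, here is a minimal Python sketch of age-based expiry.
The function name `expire_cache`, the use of modification time and the
`max_age_days` parameter are assumptions made for the example; this is
not a tla feature.

```python
import os
import shutil
import time


def expire_cache(root, max_age_days, now=None):
    """Delete top-level entries of `root` not modified within `max_age_days`.

    Returns the names that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.lstat(path).st_mtime < cutoff:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)   # a cached tree
            else:
                os.remove(path)       # a plain file or symlink
            removed.append(entry)
    return removed
```

A host could run something like this from cron against a revision
library directory; a different host could pick a different rule.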


     > Changesets are tar files. They cannot be posted easily to a
     > mailing list for approval and commit; metadata tends to get
     > lost.

The funny thing about that oft-cited (very minor) issue is that there
have been N+1 solutions offered and, aside from in idle discussions
like this, exactly 0 solutions seriously demanded.

It's obvious to all arch users beyond the absolute novices that within
about 18 hours of noticing a need for a changeset format that can be
posted nicely to a mailing list one can be designed (for the ages) and
thoroughly implemented.   We noticed this months ago and the response
since then has been a universal "Huh.  Well, nice to know that
_that's_ there if we ever need it."

Why so?  Because _posting_ changes would be redundant.  To a crude but
significant approximation, the arch archive mirroring process is
_isomorphic_ to posting to a mailing list or newsgroup.  The
differences just happen to optimize "posts" that are revision control
records.  You're _already_ sending out your changesets when you push
them to a publicly visible mirror of your local archive.  No need to
separately mail those changes -- just mail to let people know they're
there.


     > In practice, tla requires four inodes per file in a checked-out
     > project tree: one for the file, one for the file ID, and a
     > pristine copy of both. This gratuitous use of inodes can cause
     > problems.

While not an especially accurate account, that point does have a
legitimate basis in fact.

In practice, that and other disk space, inode count, and client-side
i/o bandwidth issues generate few and very particular complaints.
People using hw that by today's standards is not very powerful tend to
complain a little bit.  People using, say, an ~$2K contemporary
workstation do not experience any problems.  Therefore, the informal
policy of the project is to merge changes which improve the situation
for older hardware when those changes are clean and worth having for
other reasons, but otherwise just to focus on the current and future
situation.


     > A checked-out revision of a branch contains at least one inode
     > for each revision that was ever committed in the history of the
     > branch. Long-running branches also result in huge directories
     > with lots of entries.

Yes, it is shocking that given all that disk space we might actually
use it.

But in fact, the space overhead of an arch project tree can be bounded
by a constant factor of the size of the tree.  It's just that, for
years now, nobody has felt any pressing demand for a feature to do so
automatically.

As per usual: this is the kind of "issue" that can be fixed with
entire _hours_ worth of hacking (i.e., effectively instantaneously).
The interesting thing is that though there are a significant number
of serious arch users, none have felt a pressing need to spend those
few hours.


     > The inventory constructor, project tree checker, and changeset
     > creation code are not fully synchronized. For example, it is
     > possible to commit a changeset with an inconsistent inventory,
     > which is also inconsistent as a result.

Perhaps there is some bug of which you are aware that I am not?  If
so, that's the kind of thing we simply fix and forget.  But otherwise,
that does not sound like anything the arch I know and use does.

If a tree is being concurrently modified while a commit is taking
place, sure, the commit may record some but not all of those
concurrent changes.  It _might_ even record some changes that aren't
quite the same as what took place but which are consistent with the
observable information about what took place.  It might even leave
your working dir in a state that (harmlessly) doesn't record that the
commit took place.  But, modulo bugs, it will never commit an
"inconsistent" revision in any way.

     > Branch creation is very cheap (a few inodes in the archive), but
     > a long-running branch to which changes in a mainline branch are
     > periodically merged replicates all changes on mainline. This
     > means that branch maintenance costs are controlled by the amount
     > of development on the branch and the development on the mainline,
     > and branches are no longer very cheap in total. 

That's _roughly_ true.  In dollar amounts, these overheads can wind up
costing a GCC-scale project perhaps 10s or 100s of billions of
nano-pennies per year.  Each and every year.   Should we be alarmed at
the scale of those costs?

     > (This is an implementation issue because unlike other
     > systems, merge tracking does not depend on the way changesets
     > are combined in the archive. This is actually a very strong
     > point of GNU arch.)

Thank you for noticing.  It's also a very subtle point that people
sometimes naively complain about so you get points with me, here and
otherwise, for making obvious that you've spent some time studying
arch.

    > The author of this memo thinks that these issues are not
    > ordinary bugs. Workarounds may exist (faster machines and
    > networks, cached revisions, non-traditional file systems, hard
    > link farms, "simply don't do that, then"). However, these issues
    > have not been addressed for quite some time now and, though
    > still rather bothersome, some of them will probably remain
    > forever. (Other issues, such as some hairy error messages and
    > the lack of a file-specific revert command, are considered
    > transitional problems.)

Water is wet and Dogs are canine and that's that.  Most of your issues
aren't issues; they're complaints that gravity is attractive and most
cats have whiskers.


     > The GNU arch developers believe that it's easy for all developers
     > participating in a project to publish a repository.  However,
     > this requires write access to webspace without file name and
     > directory layout limitations. 

That's just so false I don't know where to begin.  But the intent of
what you are saying, which seems to be "you have to be able to publish
something on the internet if you are going to publish an arch archive"
is true (and tautological).


     > or to those who can only afford cheap Internet access. 

People like me, perhaps?

     > Genuine support for centralized development is required, but
     > GNU arch is unlikely to provide it.

I have no idea what you mean since you have not named a single feature
required for "centralized development" that Arch lacks.

     > The tendency to trade decreased code complexity for increased
     > running time and more disk space was fine when tla got started,
     > but today, it results in performance that does not compare
     > favorably with optimized competitors.

No fair and balanced, scientific experiments have been performed to
my knowledge.   Nor would a good experiment design be easy to create.

Some operations in some arch environments are just plain blindingly
fast.  Others are slower.  We've done pretty well but not perfectly in
optimizing the common cases and where there are gaps, there are plans
and work ongoing to fix them.  People with dinky, old laptops tend to
gripe the most, as a class, about arch performance.  People with
something you could buy today for ~$US2K complain, as a class, the
least and some of those folks report some very impressive results.
People who configure arch poorly (a process we're trying to simplify,
though it isn't that hard to begin with) are often a source of
complaints.  Yet many find that it can be configured pleasingly.



     > In addition, disk seek times have not improved at a significant
     > rate, and the huge amount of stat operations performed by tla
     > will remain a bottleneck even when developers move to larger
     > machines.

That's right and you're being Chicken Little, here.

Folks with contemporary desktops find that basic operations like
commit, carried out on an entire linux kernel tree, take O(10s).  Bear
in mind that these are typically multi-file commits, not like CVS
where single-file commits are often typical.

That's really not bad and it's only going to get better.  It's quite
competitive with comparable operations on other systems.  That's
because none of the work being done by these operations is extraneous.
None of it could be avoided without sacrificing functionality.   The
claim of arch is that this work _is_ what you want to be doing and
that here, in tla, is a reasonable implementation of that.

Other operations, like get, are even trickier to compare.  On a
naively configured arch set-up or one used poorly, `get' can be
absurdly expensive.  On a well set-up arch client, such as you'd have
if you used arch every day, for work, the implementation of `get' in
arch is, I think, the fastest and most flexible on the planet, by far.

(I work on several different projects and with many different branches
from many different people.   For me, in ordinary use, `get' is
usually better than `cp -R' for the same tree.)

     > The developers seem to underestimate the need for a robust user
     > interface with clear error messages and transaction semantics
     > (i.e. a command either fails and changes nothing, or it
     > completes successfully). 

Would you like some Colorless Green Time Flies with that?

Arch is extremely robust and transactional at handling archives.  It
is _not_ transactional in the strong sense that you propose at handling
working directories.   Yet _no_ system could handle working
directories as you propose.   The unix filesystem semantics, even on
local disks, simply do not support it.  You could fake it if all of
your other tools (e.g., shell, editor) cooperated, but not otherwise.
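
The contrast between archives and working directories can be sketched
in Python.  An archive entry can be staged out of sight and published
with a single rename, so a reader sees either nothing or the complete
revision.  This is a sketch of the general technique under that
assumption, not a description of tla's actual code; `commit_revision`,
the `staging-` prefix and the dict-of-bytes input are hypothetical.

```python
import os
import shutil
import tempfile


def commit_revision(archive_dir, name, files):
    """Publish `files` (file name -> bytes) as archive_dir/name, all or nothing."""
    final = os.path.join(archive_dir, name)
    # Stage inside the archive so the rename stays on one filesystem.
    staging = tempfile.mkdtemp(prefix="staging-", dir=archive_dir)
    try:
        for fname, data in files.items():
            with open(os.path.join(staging, fname), "wb") as fh:
                fh.write(data)
        # One rename makes the whole revision visible at once.
        os.rename(staging, final)
    except BaseException:
        # On any failure, leave the archive as it was.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final
```

A working directory has no such single switch point: the editor and
the shell write into it directly, so no one tool can roll it back.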

Next thing you know you'll complain that arch doesn't trisect angles
with a straight edge and a compass.


     > Even non-programmers use revision control systems, and most
     > programmers are primarily interested in their own project and
     > not necessarily in tla internals.

What you refer to as tla internals I would mostly refer to as "the essential
core functionality meant by `revision control' with all of the
arbitrary choices in providing that functionality made for you".

The _real_ internals are things like "Data structures don't share
string pointers, by default.  Instead, strings are copied with a fresh
allocation."  Nobody but us arch hackers care about those and nobody
but us needs to, other than perhaps to recognize that we've made
choices with an eye on maintainability or "habitability" of the code.


     > tla input and output formats are currently deliberately
     > incompatible with the rest of the GNU system. An internal file
     > name encoding is externally exposed, and this encoding is not
     > used anywhere else at the moment. The problem at hand is the
     > possibility that file names might contain column or record
     > separators. (The GNU way is to use the ASCII NUL character
     > optionally, to deal with file names with spaces or control
     > characters. tla does not support this.)

That is not "the problem at hand".  You have mischaracterized the
situation so outrageously as to admit no other reply than this one.
What you've said is simply going to confuse and mislead newcomers.



     > Redesign the changeset format, probably based on VCDIFF (RFC
     > 3284). Unlike unified diffs (which are currently used by tla),
     > VCDIFF deltas are one-way and not reversible when just the
     > delta itself is known. (this is not so much of a problem, tla
     > uses changesets only in forward direction most of the time).

You suffer from deep misunderstandings.  Non-reversibility of
changesets would be a crippling blow to arch.   I wouldn't know where
to begin fixing you, on this point, beyond that.
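
For readers unsure what reversibility means here, a minimal Python
sketch: a delta that records both the old and the new lines of each
changed region can be applied in either direction, which a copy-only
one-way delta cannot.  The names `make_delta` and `apply_delta` and
the tuple layout are hypothetical; this is not the arch changeset
format.

```python
import difflib


def make_delta(old, new):
    """Record each changed region with both its old and its new lines."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, j1, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(lines, delta, reverse=False):
    """Apply `delta` to a copy of `lines`; reverse=True undoes it."""
    result = list(lines)
    # Work from the end so earlier indexes stay valid.
    for i1, j1, old_block, new_block in reversed(delta):
        if reverse:
            pos, expect, replace = j1, new_block, old_block
        else:
            pos, expect, replace = i1, old_block, new_block
        if result[pos:pos + len(expect)] != expect:
            raise ValueError("delta does not apply at line %d" % pos)
        result[pos:pos + len(expect)] = replace
    return result


old = ["a", "b", "c"]
new = ["a", "x", "c", "d"]
delta = make_delta(old, new)
```

Here `apply_delta(old, delta)` yields `new`, and
`apply_delta(new, delta, reverse=True)` yields `old` again, with no
access to any file other than the one being patched.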



     > Additionally, VCDIFF deltas only support exact
     > application. Fuzzy application (i.e. the contents of new source
     > file for the delta is not identical to the source file
     > contents the delta was computed to) requires access to the
     > original source file of the data. This is a fundamental change
     > in some respects, but many merge operations in tla already
     > require the computation of source files.

A "fundamental change in some respects?"  I don't know what more to
say other than you aren't making much sense.

     > Provide a human-readable changeset format with complete
     > metadata. This format is intended for exchange of patches over
     > mailing lists and should include unified diffs.

That takes (we have empirical evidence) the motivated hacker (one with
a need for such a thing) just a few hours to create.   It would take a
few days of discussion on g-a-u to pick an "official" one.


     > Do not expose the archive format, but use a changeset server
     > which implements access control (and pipelining, to cut down
     > effects of network latency).

As you must know, upward compatible designs for precisely that have
been laid out.   It's really only a question of someone having a
strong enough itch.

Meanwhile, I'm not really so sure anyone will ever _really_ want that
functionality.  Set-up a few well-placed mirrors and revision
libraries (costing you those billions of nano-pennies) and you may
find that you want not for such optimizations.  And you'll be happy
because it means not having to run any arch-specific server.

What we're working on, these days, is making such configurations just
so incredibly stone-simple to set-up that everyone starts doing it
more or less by default.  Currently, the "default settings" for common
operations are, more or less, pessimal in all situations.

     > Project trees should not abuse the file system as a database. If
     > a database is required, use a real one (such as BDB or SQLite),
     > or CSV files containing multiple records, but not one file per
     > record.

Sorry, my friend for whom I have great respect, but, you're full of it
on that one too.

Arch does not in the slightest abuse the filesystem.  Rather, ext2 and
ext3, specifically, are historically anomalous in some respect with
regard to their performance on small files.  Arch's traditionally
respected style of filesystem usage tickles the space consumption and
i/o worst case of those filesystems (on small files) ---- which is
bad, abstractly, but in absolute terms, the badness isn't very bad.
It's quite usable, even on those friendly-old-dogs-of-filesystems.

The anticipated longevity of the design of arch is much, much larger
than the longevity of ext2 and ext3 implementations.  If Hans doesn't
blow ext2/3 away, someone else will in short order.  And the odds are,
surveying the whole history of unix filesystems, that having lots of
small files for this or that purpose will be restored to its
critically important place as an unambiguously practical programming
technique.   Geeze, I mean --- what the heck do you think the whole
"software tools" movement that boosted unix was about?   Just some
marketing play?   Not at all:  tiny files plus shell tools is the
underdog competitor against relational databases.   

     > Use a file cache (with LRU logic) instead of revision libraries.

I would say "in addition to".   Revision libraries basically _are_ a
file cache but one that is optimized for certain operations (`diffs',
`commit', `get', ...).  A finer-grained file-cache, which is what
you're talking about, would take more code and more complexity -- that
is a drawback.  But it _could_ optimize some operations that are
currently too heavy.
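
A minimal Python sketch of the finer-grained idea, assuming files are
keyed by (revision, path) and the cache is bounded by total bytes.
The class `FileCache` and its methods are hypothetical and kept in
memory for brevity; nothing like it exists in tla today.

```python
from collections import OrderedDict


class FileCache:
    """Least-recently-used cache of single files, bounded by total size."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self._entries = OrderedDict()  # (revision, path) -> bytes

    def get(self, revision, path):
        key = (revision, path)
        data = self._entries.get(key)
        if data is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return data

    def put(self, revision, path, data):
        key = (revision, path)
        if key in self._entries:
            self.used -= len(self._entries.pop(key))
        self._entries[key] = data
        self.used += len(data)
        # Evict least recently used entries until back under the limit.
        while self.used > self.max_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used -= len(evicted)
```

A caller that wants one file would try `get` first and fall back to
building it from the archive or a revision library on a miss.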

It is an utterly simple idea to implement within the existing arch
code.  The chief burden to you will be identifying places that _want_
access to a particular file but currently _ask_ for access to the
entire tree containing that file.  You'd tweak those to call some new
library routine which looks just for the file specifically and, other
than that, the job would be trivial.  With my guidance, you could do
it in < 3 weeks, part time.  It's facts like that -- the ease with
which you could implement even this esoteric thing you are looking for
-- that I think are one of the most important recommendations for arch
as opposed to CVS or any of several other systems.

-t