gnu-arch-users

Re: [Gnu-arch-users] archive storage format comments on the size


From: Tom Lord
Subject: Re: [Gnu-arch-users] archive storage format comments on the size
Date: Mon, 29 Sep 2003 14:15:50 -0700 (PDT)



    > From: Andrea Arcangeli <address@hidden>

    > I tried to convert two random cvs repositories to arch archives and the
    > sizes are like this:  [  (4.8..5.8)x larger for arch ]

Since cscvs is still in the waning days of "experimental", and since
you're fairly new to arch, it's possible that _part_ of the difference
is just something silly in this particular conversion.  (A message
shortly after yours even suggests a possibly relevant glitch in
cscvs.)

However, in general, yup: the base archive size of arch is a bit
larger than CVS.

Comparing to CVS on archive size is something to be done carefully,
for a few reasons.

*) Think from first principles, instead.

   The theoretical minimum of information in a whole tree commit is
   something like the contents of added lines and files, the addresses
   of added and deleted lines, the log message, and enough of an
   inventory index list to figure out the tree-deltas, plus enough 
   structuring goo to make those components indexable.

   The space inefficiencies in arch are that it adds: contents of
   deleted lines and files, context of diffs, an extra copy of
   the log file, and some overhead costs associated with using
   directory structure (including in tar files) as the "structuring
   goo" instead of something smaller.   That's not all that much,
   and each element represents a reasonable space/time trade-off.

   Gzip is pretty fast as it's used here;   tar is decent at
   compressing the use of directories as structuring goo;  as you
   noted, gzip is beneficial for network traffic:  these factors
   offset some of the space inefficiencies.

   Storing an archive on an ext* filesystem is a slight negative
   offset compared to, say, an ffs filesystem because of the poor
   space performance on small files.

   Space consumption, measured as a function of "labor performed on
   the project", is (taking your figures as an outlier) probably
   something like 2-6x greater than CVS. (A rough guess.)

*) Think in economic terms, both current and historical.

   Currently, a, say, 10x inefficiency compared to CVS is, frankly,
   not much money.  The very painful exception is people in the same
   situation as me: people who have some old machine and don't have
   money to buy a new disk -- thus live with N+1-year old storage
   economics.  I'm pretty active on a fairly large amount of source; I
   work with a decent number of other people's archives.  I mirror
   lots of stuff; keep a full revlib of my own stuff and partial
   revlib of one or two contributors.  When I started arch, the
   pessimistic end of my projection was that I would risk filling up
   my disk roughly around now -- in fact, I'm only 1/3 of the way
   there.

   CVS makes certain space/time trade-offs differently.   It uses a
   storage model designed -- what, more than 20 years ago?   Measured
   in dollars, the space consumption differences between CVS and arch
   have fallen, during that period, by a few orders of magnitude.
   A-buck-a-gig-and-falling is a very different world from that in
   which CVS was designed.   (Tagging and branching are examples of
   operations in which the space/time trade-offs of CVS become quite
   apparent -- not to mention the associated robustness issues.)

   This is why you frequently see comments in arch-world like "mirror
   stuff", "build revlibs", "build an ancillary index for that".
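
To put rough numbers on both points above, here is a minimal Python
sketch. It is an illustration only, not how tla builds changesets: it
uses the standard difflib module as a stand-in, treats "hunk headers
plus added lines" as an approximation of the theoretical minimum, and
treats the full unified diff (deleted lines and context included) as
an approximation of what arch keeps. The function names and the
default figures (10x, a buck a gig) are assumptions taken from the
discussion, not measurements.

```python
import difflib


def delta_sizes(old_lines, new_lines, context=3):
    """Return (minimal_bytes, full_bytes) for one file change.

    minimal_bytes: hunk headers (the "addresses") plus added lines.
    full_bytes:    the whole unified diff body, which also carries
                   deleted lines and surrounding context.
    """
    diff = list(difflib.unified_diff(old_lines, new_lines,
                                     "old", "new",
                                     n=context, lineterm=""))
    body = diff[2:]  # skip the "--- old" / "+++ new" file headers
    # +1 per line accounts for the newline that would be stored.
    full = sum(len(line.encode()) + 1 for line in body)
    minimal = sum(len(line.encode()) + 1 for line in body
                  if line.startswith("@@") or line.startswith("+"))
    return minimal, full


def extra_cost_dollars(cvs_gb, factor=10, dollars_per_gb=1.0):
    """Extra disk cost of storing `factor` times the CVS size."""
    return cvs_gb * (factor - 1) * dollars_per_gb


if __name__ == "__main__":
    old = [f"line {i}" for i in range(20)]
    new = list(old)
    new[10] = "changed line"
    minimal, full = delta_sizes(old, new)
    print("minimal:", minimal, "bytes; with context:", full, "bytes")
    print("extra cost for a 0.5 GB CVS repository:",
          extra_cost_dollars(0.5), "dollars")
```

For a one-line change the context and the deleted line dominate the
diff, which is the kind of constant-factor overhead described above;
at roughly a dollar per gigabyte, even a 10x factor on a modest
repository comes to a few dollars.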



    > at around >2000 patchsets archs slowsdown like a crawl in the commits,
    > it seems to spend all its time doing a flood of lstat on every patchset
    > file in the {arch} directory inside the working dir, basically it's O(N)
    > where N is the number of patchsets in the archive. cscvs spends all its
    > time in wait4 (waiting tla to return). Note, linux should be fairly
    > efficient in the stats, thanks to the dcache, indipendently by the fs
    > used.

You are, here, measuring cscvs performance -- not tla.   And you have
a slight misunderstanding of patch logs.

Mass-conversion by cscvs is fundamentally a crunch-n-grind batch
process that's not typical of how arch is used.   It's a "make
possible" case, not a common case.   If it became a common case, the
best solution might very well be for cscvs to write an archive
directly -- bypassing tla entirely -- not fussing with project trees
-- computing changesets directly from the CVS data.   Tla is nice this
way:  that'd be a tractable problem.

In terms of operations that arch-users typically perform during
day-to-day work: speed has been getting better and better.   For most
trees, it's pretty damn good at the moment.   For very large trees
using explicit tagging -- some of the latest optimizations haven't
caught up with that yet but probably will in the next 2-3 months.
I've made some suggestions recently of what hacks to contribute to
accelerate that process.

Patch-logs do not have to retain every entry from all of the history
of a tree.   They need only retain a subset for the "active region" of
history.   (In those rare cases where you need to start merging from
very old work, if your logs have been pruned since then, sync-tree is
your friend.)
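
A toy model of why pruning helps with the lstat flood described in
the quoted report: if some operation stats every patch-log file in the
tree, its cost grows with the number of retained entries, so keeping
only the "active region" shrinks it. The Python sketch below is an
assumption-laden illustration, not tla's code or its real on-disk
layout: the flat directory of files named patch-N and the helper names
are hypothetical stand-ins.

```python
import os
import tempfile


def make_fake_log(log_dir, n_entries):
    """Create n_entries stand-in patch-log files (hypothetical layout)."""
    for i in range(1, n_entries + 1):
        with open(os.path.join(log_dir, f"patch-{i}"), "w") as fh:
            fh.write(f"Summary: change {i}\n")


def count_lstats(log_dir):
    """lstat every file under log_dir and return how many calls were made."""
    calls = 0
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            os.lstat(os.path.join(root, name))
            calls += 1
    return calls


def prune_fake_log(log_dir, n_entries, keep_last):
    """Drop all but the newest keep_last entries (the "active region")."""
    for i in range(1, n_entries - keep_last + 1):
        os.remove(os.path.join(log_dir, f"patch-{i}"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        make_fake_log(log_dir, 2000)
        before = count_lstats(log_dir)
        prune_fake_log(log_dir, n_entries=2000, keep_last=200)
        after = count_lstats(log_dir)
        print("lstat calls before pruning:", before)
        print("lstat calls after pruning: ", after)
```

The point is only the scaling: the per-operation work tracks the
number of retained log entries, not the full length of the history.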

-t



