From: Nathaniel Smith
Subject: Re: [Monotone-devel] announcing preliminary "dumb server" support for monotone
Date: Wed, 12 Oct 2005 16:01:14 -0700
User-agent: Mutt/1.5.9i

On Thu, Oct 13, 2005 at 12:15:50AM +0200, Zbynek Winkler wrote:
> Nathaniel Smith wrote:
> >No, different problem entirely -- do_export is currently quadratic in
> >the history length, mostly because it uses a separate invocation of
> >monotone to request each manifest delta, and since monotone still does
> >unbounded delta chaining, it takes linear time to retrieve an
> >arbitrary manifest.  (This also applies to files, but files tend to
> >have much shallower histories than the tree as a whole, so it doesn't
> >matter there as much.)
> > 
> I am not that familiar with the monotone codebase, so please forgive 
> the question - but why does this cause do_export to be quadratic?  Does 
> it actually retrieve arbitrary manifests?  Shouldn't the request for a 
> manifest delta be constant-time if that is how it is stored in the db?

Two reasons:
  -- I don't think the code happens to make the optimization that when
     the requested delta is already stored in the db, it should just
     return it instead of recomputing it.  (This is actually not as
     lazy of us as it sounds; monotone currently makes the guarantee
     that data pulled from the database is never corrupted, and we
     only have hashes for full texts, so we reconstruct and verify the
     full versions rather than handing back a stored delta unchecked.)
  -- as you correctly notice, we store reverse deltas, but the dumb
     server wants forward deltas.  So we can't pull deltas straight
     out of the db anyway; they have to be computed.

> I've gone over merkle_dir.py and I believe it provides an append-only 
> data structure (a file called DATA) for arbitrary chunks identified by 
> an id (mostly the hash of the chunk). I see some logic in do_export that 

Right.

> checks if old_something is already in the merkle_dir and if not, the 
> whole thing is put in - otherwise a delta of old_something and 
> new_something is requested. Is this true?

Essentially.

> Doesn't this requested delta 
> correspond directly to the delta stored in the monotone db? [BTW: For 
> some reason I thought monotone does reverse delta...?] 

You're right, we do store reverse deltas; that's why it doesn't
correspond to what's stored in the db.

> What I do not 
> understand is how on earth can we have a 'thing' that we do not have the 
> corresponding 'old_thing' for?

If you look more closely, we don't actually check to see whether the
old_thing is already in the merkle_dir; we check to see whether the
old_thing exists at all.  (Note that monotone represents non-existent
revisions, manifests, and files as having the hash "", the empty
string.)  Basically, what this is doing is saying "if this is a
totally new object with no parent, then put in the whole thing, since
there's nothing to delta _against_, but otherwise put in a delta
against the parent".
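
In rough Python, the decision looks something like this (a sketch
only; the names here are made up for illustration, not do_export's or
MerkleDir's real interface):

  def export_item(merkle_dir, new_id, old_id, get_full, get_forward_delta):
      if old_id == "":
          # No parent at all (monotone's empty-hash convention), so
          # there is nothing to delta against: store the full text.
          merkle_dir.add(new_id, get_full(new_id))
      else:
          # The parent exists somewhere in history (whether or not it
          # is already in this particular merkle_dir), so store a
          # forward delta against it.
          merkle_dir.add(new_id, get_forward_delta(old_id, new_id))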

> And where does the linear search come 
> from when getting the deltas?

Say we have a history:
  A -> B -> C -> D -> E
Because we store reverse deltas, this is actually stored on disk as:
  A <- B <- C <- D <- E
Suppose we want the delta from A to B.  To get this, we need to
reconstruct A and B.  Doing this requires starting from E, then
using the E->D link to get to D, then the D->C link to get to C, etc.,
until we reach A and B.  (There are some optimizations here, but they
don't affect the algorithmic complexity.)  This takes time linear in
the length of that chain, and ATM that chain can get as long as the
total history graph.

Note that in getting A, we actually reconstructed every version in the
graph, not just A.  So getting every version takes O(n) time, just
like getting only one version takes O(n) time.  But because we make
separate invocations to monotone, it has no way to take advantage of
this; it has to perform each request from scratch.
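
To make the cost concrete, here's an illustrative sketch (not
monotone's actual code; 'reverse_deltas', 'apply_patch', and friends
are invented names).  Reconstructing one version means walking the
whole chain, and re-doing that walk for each requested delta is what
makes do_export quadratic:

  def reconstruct(target_id, tip_id, tip_text, reverse_deltas, apply_patch):
      # reverse_deltas maps a version id to (older_id, patch_to_older),
      # i.e. the delta direction actually stored in the db.
      text, current = tip_text, tip_id
      while current != target_id:
          older, patch = reverse_deltas[current]
          text = apply_patch(text, patch)
          current = older
      return text          # O(chain length) for a single version

  # Asking a fresh monotone invocation for each A->B delta repeats the
  # walk above from scratch every time: n requests * O(n) each = O(n^2).
  # A single pass that reused the text it just reconstructed (keeping B
  # around when moving on to the B->C delta) would be linear instead.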

[...]
> Then I am looking forward to the rosters! :-)

Me too :-).

[...]
> >It always exports the whole database.  This could be made smarter, I
> >guess, but it certainly didn't seem worth the effort for the first
> >pass.  One of many possible improvements for someone to make, if they
> >want :-).
> > 
> >
> :-) Anyway, does it have to export the database at all? Maybe it could 
> build the MerkleDir directly from the database...? But then maybe we 
> would be reimplementing netsync in python... ;)

I was actually considering the possibility of having some
merkle-related automate commands -- something like
  $ monotone automate merkle_hash certs 0035 net.venge.monotone*
which would give the merkle hash of the 0035-prefixed certs that would
be included in a netsync of net.venge.monotone*.  Calculating merkle
trees is a little expensive, so again you'd want some sort of
in-process caching plus "automate stdio", but this is doable and it
would make it easy to write funky syncers -- say, a client script and
a CGI that let you sync a monotone db directly against a viewmtn
install.
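
For illustration, a syncer script might shell out to such a command
roughly like this ('automate merkle_hash' is only the hypothetical
command sketched above, not something monotone actually provides; the
rest follows monotone's normal command-line conventions):

  import subprocess

  def merkle_hash(db, collection, prefix, pattern):
      # Hypothetical command; this only shows the shape such a call
      # would take from a syncer script.
      proc = subprocess.Popen(
          ["monotone", "--db", db, "automate", "merkle_hash",
           collection, prefix, pattern],
          stdout=subprocess.PIPE)
      out, _ = proc.communicate()
      return out.strip()

  # e.g. merkle_hash("my.db", "certs", "0035", "net.venge.monotone*")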

However, this doesn't actually solve any of the problems that
merkle_dir.py solves, since all its heavy lifting has to do with the
other end -- how to deal with a simple remote filesystem so that push
and pull can be done efficiently.  It isn't trivial to
integrate this with something like the above 'automate merkle_hash'
idea, because then you'd have to make sure you could efficiently
calculate hashes _in the same way monotone does_, and that takes a bit
of thought.

Oh, however however, you're actually quite right -- you could do
something very conceptually simple in just monotone-dumb, which is
implement a class that acts like a MerkleDir, but that is constructed
in memory directly off a monotone database.  Basically, you'd iterate
over the db like do_export does, but instead of actually fetching
stuff, you could just keep track of which ids exist, build HASHES
files in memory, and use them to sync.  Then you pull stuff out of the
db as necessary, when it turns out you want to send it to the remote
side.  That'd be neat.
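
A very rough sketch of what that class could look like (all names
here are invented -- this is not MerkleDir's real interface, and
'monotone' stands for some assumed wrapper around the database):

  import hashlib

  class DbBackedMerkleDir:
      """Acts enough like a MerkleDir to drive a sync, but keeps only
      ids and in-memory HASHES data; full texts stay in the monotone
      db until the remote side actually turns out to need them."""

      def __init__(self, monotone):
          self.monotone = monotone                # assumed db wrapper
          self.ids = set(monotone.all_ids())      # revisions, certs, ...

      def hashes_for_prefix(self, prefix):
          # Build the HASHES entry for one prefix on the fly instead
          # of reading a file from disk (the real HASHES format may
          # well differ from this stand-in).
          members = sorted(i for i in self.ids if i.startswith(prefix))
          digest = hashlib.sha1("".join(members).encode()).hexdigest()
          return digest, members

      def get_chunk(self, item_id):
          # Only now pull real data out of the db, because the remote
          # side turned out to be missing this id.
          return self.monotone.fetch(item_id)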

The only real problem I see is that at the moment monotone doesn't
expose any sort of "handle" on certs that can be used as a unique id
(even though internally it does keep per-cert hashes).  Every item in
the merkle_dir needs to have a unique id; so to generate this, ATM
dumb.py just hashes the text of the cert packet.  This works, but it
causes a problem for the above scheme, because if we discover that
we need the chunk whose id is 0123, and that refers to a cert, we
can't simply ask monotone for the cert 0123 -- we have to keep some
sort of reverse lookup table.  The same problem doesn't arise for
revisions, because dumb.py just uses the revision id, which we _can_
ask for on the fly.
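
Concretely, the workaround looks something like this (hypothetical
names; dumb.py's actual code may differ):

  import hashlib

  cert_by_chunk_id = {}

  def chunk_id_for_cert(cert_packet_text):
      # monotone exposes no cert id, so hash the packet text ourselves
      # and remember the mapping for later reverse lookups.
      cid = hashlib.sha1(cert_packet_text.encode()).hexdigest()
      cert_by_chunk_id[cid] = cert_packet_text
      return cid

  # Later, when the remote side asks for chunk 0123..., we can't ask
  # monotone for "cert 0123"; we have to answer from cert_by_chunk_id.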

I think this is a good argument for adding cert ids to monotone; it's
something that's been in the back of my mind as a good idea for a long
time anyway, and this is a concrete use case for them.

-- Nathaniel

-- 
When the flush of a new-born sun fell first on Eden's green and gold,
Our father Adam sat under the Tree and scratched with a stick in the mould;
And the first rude sketch that the world had seen was joy to his mighty heart,
Till the Devil whispered behind the leaves, "It's pretty, but is it Art?"
  -- The Conundrum of the Workshops, Rudyard Kipling



