[Monotone-devel] long RFC: "contexts"


From: graydon hoare
Subject: [Monotone-devel] long RFC: "contexts"
Date: Tue, 25 May 2004 17:01:43 -0400
User-agent: Mozilla Thunderbird 0.5 (X11/20040208)

hi,

I've been having a number of off-list discussions about monotone's ancestry graph, metadata, and ability to "behave like arch".

these discussions, and some of the recent hacking (esp. on netsync) have been suggesting to me that monotone might benefit from having a "changeset" or "context" added as a "first class" (named) object.

the idea -- in case you missed it in all the other VC systems! -- would be to add a textual object to monotone which describes (all at once) the contents of a number of certs and a certain amount of currently-synthetic information:

manifest: <manifest-sha1>
date: <contents-of-current-date-cert>
author: <contents-of-current-author-cert>
summary: "line of text"
parent: <first-parent-context-sha1> {
  manifest: <manifest-sha1>
  renames: [<filename>, <filename>] ...
  adds: [<filename>, <file-sha1>] ...
  dels: <filename> ...
  patches: [<filename> <file-sha1> <file-sha1>] ...
}
parent: <second-parent-context-sha1> {
  manifest: <manifest-sha1>
  renames: [<filename>, <filename>] ...
  adds: [<filename>, <file-sha1>] ...
  dels: <filename> ...
  patches: [<filename> <file-sha1> <file-sha1>] ...
}

remainder is changelog
^D
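
(illustration, not part of the proposal: a minimal Python sketch of how a blob in the layout above could be serialized and hashed into a "context ID", as the next paragraph describes. the dict keys, the function names render_edge, render_context and context_id, and the exact whitespace are all assumptions made for the example; the proposal does not fix a canonical byte format.)

```python
import hashlib


def render_edge(parent_id, edge):
    """Render one 'parent: <id> { ... }' stanza as a list of lines."""
    def groups(items, sep):
        # e.g. [("a", "b"), ("c", "d")] -> "[a, b] [c, d]" when sep is ", "
        return " ".join("[" + sep.join(item) + "]" for item in items)

    lines = ["parent: %s {" % parent_id,
             "  manifest: %s" % edge["manifest"]]
    if edge.get("renames"):
        lines.append("  renames: " + groups(edge["renames"], ", "))
    if edge.get("adds"):
        lines.append("  adds: " + groups(edge["adds"], ", "))
    if edge.get("dels"):
        lines.append("  dels: " + " ".join(edge["dels"]))
    if edge.get("patches"):
        lines.append("  patches: " + groups(edge["patches"], " "))
    lines.append("}")
    return lines


def render_context(ctx):
    """Serialize a context (hypothetical dict layout) to text."""
    lines = ["manifest: %s" % ctx["manifest"],
             "date: %s" % ctx["date"],
             "author: %s" % ctx["author"],
             'summary: "%s"' % ctx["summary"]]
    for parent_id, edge in ctx["parents"]:
        lines.extend(render_edge(parent_id, edge))
    lines.append("")                 # blank line before the changelog
    lines.append(ctx["changelog"])   # remainder is changelog
    return "\n".join(lines) + "\n"


def context_id(ctx):
    """SHA1 of the serialized text, as a hex string."""
    return hashlib.sha1(render_context(ctx).encode("utf-8")).hexdigest()
```

since the parent context IDs are part of the hashed text, any change to a field or to the ancestry yields a different context ID.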

we would then hash this blob of text to produce a "context ID", and you could attach certs to either context IDs or manifest IDs. the context *contains* its ancestry (as a "fact"), and the current concept of "approval" would be rephrased as certifying a context ID as a member of a branch.

there are a bunch of reasons for wanting to do this. I'll list them here; I'd like you to read them and think about them before leaping to an immediate knee-jerk reaction. I know it looks like an arch changeset; that's intentional. they have made some valid points. implementing this will involve a fair bit of reorganization on my side, but I think it's a good idea. such a change will:

 - kill, finally, any worries about cycles or accidental shared
   lineage in the ancestry graph. you might share storage, but you will
   essentially *never* (save a collision in SHA1) share context
   ancestry. there would be no more manifest ancestry.

 - kill, finally, any of the seeming paranoia that monotone can't or
   doesn't reason about "first class" changesets. so far I've been
   reasonably comfortable with the idea of managing content alone, but
   I get a lot of feedback suggesting the desire to see a written,
   tangible, formal object (with a name) called "a change". this would
   be it. the only remaining "missing" concept would be "file GUIDs",
   which I consider mostly meaningless anyways; imo if you have enough
   shared history to have a shared GUID, you probably have enough to
   work out the naming relationship by tracing through rename history.

 - give a name to a particular change. this makes it easier to talk
   about cherry-picking commands; easier to list in a sort of
   "what am I about to get during this update" command; and easier to
   write as arguments for approve, disapprove, and similar commands.

 - require a somewhat robust printer/parser which can serialize this
   information to a human-friendly and email-friendly form. this will
   make it easier to interoperate with the patch-and-email approach.
   nearly every hacker I've ever discussed VC with says this sort of
   email interoperability is a practical necessity.

 - make a clear future distinction between certs which are about
   a change (context certs) and certs which are about a particular
   tree state (manifest certs). this difference is evident for example
   in the difference between approval (context) and testresults
   (manifest), but it's not really as clear at the moment.

 - simplify future interoperability issues. if we import CVS archives
   at the moment, we will get some cycles. this is just a symptom, it
   will get worse if we try to read or write other VC formats. this
   change forces monotone to keep *a* level of history which is a
   strict DAG, which is how most systems organize their history
   anyways. it may even become possible (?) to read or write linear
   sub-DAGs as arch archives, if we're careful.

 - add a small extra dimension of integrity checking: the synthetic
   analysis of a pair of manifests should match the written contents
   of a context edge. though you could also see this as an extra
   dimension for integrity *failure*. good or bad. in any case, it'll
   do away with separate rename certs, which are a bit of a hack.

 - remove a small class of potential bug where you have, for example,
   two disagreeing "rename" certs on the same edge. this is currently
   possible in monotone, and I suspect is not handled nicely.

 - trade some space for speed:

   - I'd have an excuse to unpack and index the fields which I know
     the substructure of (author, date, ancestor, etc.) which would
     speed and simplify a lot of local operations.

   - things like netsync or log require analysis of two manifests to
     synthesize the change edge. netsync would speed up a fair bit if
     it could hazard a guess at the prerequisites for a change the
     instant it received the "change object", rather than waiting for
     constructability of the pre-image followed by set-wise analysis, as
     it currently does.
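
(illustration, not part of the proposal: a minimal Python sketch of the "synthetic analysis" of two manifests mentioned in the list above, and of the integrity check that compares it with a written context edge. it assumes a manifest can be modeled as a dict from filename to file SHA1; derive_edge, normalize and edge_matches are hypothetical names, not monotone functions.)

```python
def derive_edge(old_manifest, new_manifest):
    """Synthesize adds/dels/patches by set-wise comparison of two manifests."""
    old_paths, new_paths = set(old_manifest), set(new_manifest)
    adds = sorted((p, new_manifest[p]) for p in new_paths - old_paths)
    dels = sorted(old_paths - new_paths)
    patches = sorted((p, old_manifest[p], new_manifest[p])
                     for p in old_paths & new_paths
                     if old_manifest[p] != new_manifest[p])
    return {"adds": adds, "dels": dels, "patches": patches}


def normalize(edge):
    """Put a written edge into the same shape derive_edge() returns."""
    return {"adds": sorted(tuple(a) for a in edge.get("adds", [])),
            "dels": sorted(edge.get("dels", [])),
            "patches": sorted(tuple(p) for p in edge.get("patches", []))}


def edge_matches(written_edge, old_manifest, new_manifest):
    """True if the written edge agrees with what the manifests imply."""
    renames = written_edge.get("renames", [])
    # A rename of a file the parent never had is already a mismatch.
    if any(old_name not in old_manifest for old_name, _ in renames):
        return False
    # Apply the written renames first, so a renamed file is not
    # mistaken for a delete followed by an add.
    renamed = dict(old_manifest)
    for old_name, new_name in renames:
        renamed[new_name] = renamed.pop(old_name)
    return normalize(written_edge) == derive_edge(renamed, new_manifest)
```

with a written edge on hand, a receiver can read the prerequisites (the parent manifest and the pre-image file SHA1s in "patches") straight off the edge; derive_edge is the slower path that needs both manifests constructed first.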

 yet, despite the seeming "increase in space", it would ...

 - take no more space. all these items are generated each time we do
   a commit already, but as *separate* certs. the certs aren't free:
   generally there are about 300 extra bytes of cryptographic data
   along for the ride on each one. that makes a commit cost about
   1500 bytes in crypto; this data object would probably weigh no more
   than that, possibly even less.

now, the downsides are:

 - the user would see "divergence" slightly more often. for example, if
   njs and I both merge the same fork, we'd see two different context
   IDs, which (as far as "heads" is concerned) would be different
   nodes. but they would have the same manifest, so "heads" (or "merge")
   could be made smart enough to say "different context, same content"
   and not make you do any extra work.

 - there would be a certain distinction between "core" and "auxiliary"
   metadata: the stuff mentioned in the context will have a seeming
   primacy over additional, 3rd party certs hung on the side. the
   experience so far seems to suggest that nobody ever sticks 3rd party
   author, date, or rename certs on a manifest anyways, so I'm not sure
   how much would be lost there.

 - compressing history gets a bit harder. you either need to keep a full
   context graph on hand, or make an auxiliary cert or context which
   says "this set of contexts is included here". on the other hand, that
   sort of facility is potentially something arch interoperability would
   need anyways, and is something commonly requested, as a "trail" left
   by a cherry-picking command. so maybe it makes sense anyways.
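
(illustration, not part of the proposal: a minimal Python sketch of the "different context, same content" check from the first downside above. it assumes the heads of a branch are available as a dict from context ID to manifest ID; group_heads_by_manifest and needs_merge are hypothetical names.)

```python
from collections import defaultdict


def group_heads_by_manifest(heads):
    """Group head context IDs by the manifest ID they point at."""
    groups = defaultdict(list)
    for ctx_id, manifest_id in heads.items():
        groups[manifest_id].append(ctx_id)
    return {manifest_id: sorted(ids) for manifest_id, ids in groups.items()}


def needs_merge(heads):
    """Heads only need merge work if they disagree on content."""
    return len(set(heads.values())) > 1
```

two people merging the same fork would produce two context IDs but a single group here, so no extra merge work would be asked of the user.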

there might be more. I'd appreciate some public discussion now that I've sort of stewed on the issue for a couple weeks.

I had some more outrageous approaches in mind too -- overhaul the whole manifest format, switch to versioned directories, etc. -- but I find myself unable to imagine the complete extent of implications for those, and unable to justify them given the greatly increased scope of work. this approach at least seems, er, small enough and similar enough to what we already do, yet sufficient to cover the main points.

anyways, no matter what I will do any such work on a branch and provide some sort of sensible migration path from existing DBs, but it might at worst require re-issuing all existing manifest certs.

(aside: yes, technically this could be more lightweight. really all we
 *need* to do for most of the "hard" goals is to make a context which
 contains parent context IDs and manifest ID, and then you can do all
 the rest hanging certs on the context ID. but I thought I'd cheat and
 kill multiple "efficiency" birds with one stone. feel free to reject
 the latter concept and argue that all we need is a simple context
 object)

-graydon



