monotone-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Monotone-devel] Re: Support for binary files, scalability and Windo


From: Ori Berger
Subject: Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port
Date: Mon, 19 Jan 2004 01:05:44 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4; MultiZilla v1.5.0.2g) Gecko/20031007

graydon hoare wrote:

I don't know what you mean by "all sources for a file". could you be
more specific about where you mean to use a suffix array, and how it
helps with a particular bit of retrieval or storage? the way I see it, a
persistent suffix array makes the on-disk representation of a string
considerably larger, without improving the xdelta problem (which is:
find all common regions between these two strings)

What I meant by "all sources" is that there is no inherent a-priori
reason to encode a file with respect to one _given_ file. It is
often the case that the best reference is the previous version of
the same file. But when you merge two files into one, for example,
the "natural" delta is from both, rather than from a specific one.

If you have a persistent suffix tree (one tree for the _entire_
repository), when you want to put a new version in the database, you
just start walking the tree, and note what part of it you take from
what repository file. Renames, copies, splits, merges, etc. will all
be efficiently represented (in terms of "delta" storage) even
without any explicit "monotone rename" command.

Your elevator pitch includes "ancestry is just some interesting
metadata", and this is just taking it to the extreme ... A version is stored as a patchwork of parts of all previous files, with or without any ancestry knowledge.

I don't think that the theoretic possibilities justify any work at this time. Nevertheless, the idea is worth documenting, I think.
It's implementation is orthogonal to the block structure below, but
the underlying philosophy is the same.

Xdelta could be more-or-less as useful, if you store the adler32 of all reference blocks in an accessible location that you can consult while scanning a new file.

    that technique is really nothing more than extending xdelta to work
    over block-structured files, which is (as I'll get to) advantageous
    since it permits Very Large Files (~2 ** 40 or so) and also makes
    network sync work.

What you're describing is, as you said, extended xdelta. What I was describing is doing away with ancestry as a factor in storage and identification. If breaking to blocks depends on context (e.g., ancestry), then, patches that are applied in a different order (but result in the same outcome) are likely to generate different block structures. Personally, I much prefer stateless representations that depend only on the data and not its history - YMMV.

hmm, no, not changing the manifest format. the file remains identified by its SHA1. if you don't like SHA1, substitute some other function which makes an identifier given an input string, but if the filesystem can maintain the notion of a "file" using inodes, so can manifests.

I didn't understand that one. I was suggesting (or at least, thought I was suggesting) either (a) to include SHA1 for blocks inside the same manifest, or alternatively, (b) keep another per-file manifest that says how to construct it from its SHA1 blocks, in the same way that a manifest can be used to construct an atomic revision from files. Whoever receives an (a) or (b) manifest can then go hunting for either the blocks or the files, in whatever way needed. (b) would be completely transparent, and (a) would cause sha1sum to correctly check the complete files, but report the file _parts_ as missing.

Yep, it's not too bright; But I wasn't suggesting dropping SHA1 in any way, and I couldn't understand the relation to inodes. Could you please elaborate?

as far as I'm concerned, I'd keep doing what I'm doing now, reuse the block + delta storage system for *storing* manifests, too. they're data too.

Hmm, I wasn't aware manifests were treated specially in storage. I thought they were only special by their cert status.

hm, for the network I have a broader idea: use hash trees over the entire space of SHA1 to synchronize my collection + your collection into the union of both (on both ends). with some special accounting to manage singletons and tombstones, and a good spread factor, it's very efficient to synchronize hash trees, and I'd use the exact same scheme to sync the collection of blocks, the collection of manifests, the collection of files, the collection of keys, and the collection of certs.

Sounds good. As Nathaniel hinted in a recent post, hash tree synchronization might not play well with your block-extended xdelta; It needs to be stateless to play well, I think.

About NNTP, email, "dumb web" distribution - all you have to do is record, for each block whether or not it was sent to a specific destination.

no, I think I'd just remove these things altogether, let people sync between databases directly as the primary mode of operation.

Ouch. One of the things I like so much about monotone is that it is transport agnostic. Requiring a smart online server would kill that. I also think it's an increased barrier for many people.

It should still be possible to build an NNTP/email/dumb-web gateway though, even if you do move to an interactive hashtree-sync. It's the same logic I described earlier, of "for each block not yet sent to a gateway destination, send to gateway destination" and "add received blocks to database".

Ori.






reply via email to

[Prev in Thread] Current Thread [Next in Thread]