From: Ori Berger
Subject: Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port
Date: Sat, 17 Jan 2004 03:41:49 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4; MultiZilla v1.5.0.2g) Gecko/20031007

graydon hoare wrote:

> currently, files are stored, as you have probably guessed, as head
> versions + xdeltas. the xdeltas reach into the past. head versions are
> identified by SHA1, deltas are identified by (pre,post) pairs of SHA1s
> -- the versions on either side of the delta. xdelta is itself based
> somewhat on hashing; common blocks are found by indexing all the blocks
> in the first file under their adler32 code, and rolling an adler32
> window of the same block size over the second file one byte at a time.
> then it writes a copy/insert delta stream. when you make a change the
> forwards xdelta is queued to send to the network, and the reverse
> xdelta is stored in the database, moving the head version to your newly
> committed version.
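(For concreteness, a minimal sketch of that copy/insert idea in Python - not monotone's actual code; the block size and all the names are made up. Blocks of the old file are indexed under their adler32, and a window over the new file emits copy ops for confirmed matches and insert ops otherwise. For brevity the window checksum is recomputed at each position instead of being rolled incrementally, which a real implementation would do.)

import zlib

BLOCK = 64   # illustrative block size only

def make_delta(old, new):
    """Return a list of ('copy', offset, length) / ('insert', bytes) ops."""
    # Index every BLOCK-sized block of the old file under its adler32.
    index = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        index.setdefault(zlib.adler32(old[off:off + BLOCK]), []).append(off)

    ops, pending, i = [], bytearray(), 0
    while i + BLOCK <= len(new):
        window = new[i:i + BLOCK]
        match = None
        for off in index.get(zlib.adler32(window), []):
            if old[off:off + BLOCK] == window:   # confirm; adler32 can collide
                match = off
                break
        if match is None:
            pending.append(new[i])               # no match: queue a literal byte
            i += 1
        else:
            if pending:
                ops.append(('insert', bytes(pending)))
                pending = bytearray()
            ops.append(('copy', match, BLOCK))
            i += BLOCK
    pending.extend(new[i:])
    if pending:
        ops.append(('insert', bytes(pending)))
    return ops

def apply_delta(old, ops):
    out = bytearray()
    for op in ops:
        if op[0] == 'copy':
            out.extend(old[op[1]:op[1] + op[2]])
        else:
            out.extend(op[1])
    return bytes(out)

(apply_delta(old, make_delta(old, new)) == new for any pair of byte strings; the delta is small whenever the new file contains blocks of the old one, at any offset.)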

> this storage management system need not be the case. another feasible
> form would involve removing the file/delta distinction and storing all
> files as hash trees, where each hash code identifies an existing data
> fragment (up to a fixed block size) *or* a subtree.
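(And a rough sketch of what such hash-tree storage could look like - the fixed block size, the fanout, and the plain dict standing in for the database are all assumptions of mine:)

import hashlib

BLOCK = 8192    # illustrative fixed block size
FANOUT = 32     # child hashes per interior node; also just an example

def _put(store, payload):
    key = hashlib.sha1(payload).hexdigest()
    store[key] = payload          # content-addressed: identical payloads dedupe
    return key

def store_tree(store, data):
    """Store data as a hash tree; return the root hash."""
    # Leaves ('L'): raw fragments of up to BLOCK bytes.
    keys = [_put(store, b'L' + data[i:i + BLOCK])
            for i in range(0, len(data), BLOCK)] or [_put(store, b'L')]
    # Interior nodes ('T'): lists of child hashes, until a single root remains.
    while len(keys) > 1:
        keys = [_put(store, b'T' + '\n'.join(keys[i:i + FANOUT]).encode())
                for i in range(0, len(keys), FANOUT)]
    return keys[0]

def load_tree(store, key):
    payload = store[key]
    if payload[:1] == b'L':       # leaf: raw fragment
        return payload[1:]
    return b''.join(load_tree(store, k)
                    for k in payload[1:].decode().split('\n'))

(load_tree(store, store_tree(store, data)) == data, and two files that share a fragment - or a whole subtree - store it only once.)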

I had been thinking about this as well; in a message from September 16th, I mentioned I'd describe my ideas once I had them better organized, but I never found the time to get down to all the details. I'll try to join the brain dump instead.

First, it's possible to use a suffix tree to locate repeating parts. It can be implemented to run much, much faster than rsync's algorithm, and it is finely grained - an implementation and description can be found at <http://www.cl.cam.ac.uk/~cpk25/libstree/>. That implementation doesn't use persistent storage, but if it did, it could locate _all_ sources for a file more or less immediately. (The naive storage requirement is O(n) for a text of length n, but I suspect the constant factor might be prohibitive in practice.)

Suffix trees may also allow a "find likely ancestors" function to be implemented, with which proper attribution, and detection of file splits and merges, could be done - but I digress.

> I'd be just as happy to scrap the entire existing networking system and
> move to a synchronization-based one, custom protocol or otherwise, if I
> could work out a really efficient abstraction. synchronization is
> self-stabilizing, which I'm very attracted to. maybe hash trees are it.

Ok, here goes my braindump:

Find a way to partition files into blocks, in such a way that:
1. Block lengths are reasonable (definitions later)
2. The block boundaries are likely to be unchanged by small modifications - i.e. an insertion of one byte somewhere should have a small probability of changing one boundary, a smaller probability of changing two boundaries, ..., and a vanishingly small probability of changing all boundaries.

Now, given a file, break it into blocks; store each block independently, under its own hash; and store a manifest that says how to rebuild the file from its blocks. And that's it.

If all files were text files of 200 lines or so, applying this would work well by just taking each line as a block, with boundaries detected at newline (\n) characters. Inserting a character into a line would not affect any other line. The manifest would be a list of line identifiers. When a line is changed, we'd usually get a new manifest and exactly _one_ new line added to the database (all other lines, i.e. blocks, would be reused).
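(A tiny illustration of that special case - the names are mine - storing a text file line by line and keeping a manifest of line hashes; editing one line adds exactly one new block:)

import hashlib

def store_lines(store, text):
    """Store each line as a block keyed by its SHA1; return the manifest."""
    manifest = []
    for line in text.splitlines(keepends=True):
        h = hashlib.sha1(line.encode()).hexdigest()
        store[h] = line              # identical lines are stored only once
        manifest.append(h)
    return manifest

def rebuild(store, manifest):
    return ''.join(store[h] for h in manifest)

store = {}
v1 = store_lines(store, "alpha\nbeta\ngamma\n")
before = len(store)
v2 = store_lines(store, "alpha\nbeta (edited)\ngamma\n")
assert len(store) == before + 1      # only the edited line is new
assert rebuild(store, v2) == "alpha\nbeta (edited)\ngamma\n"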

To apply this to general files, we can use the following (a sketch in
code follows the list):
1. Block lengths are between 1/100 and 1/400 of the file length,
   but never shorter than (e.g.) 80 bytes, and never longer than
   (e.g.) 16 megabytes.
2. The block boundaries are determined by computing some running
   checksum that depends only on (e.g.) the last 80 characters, and
   declaring that a block has ended when it meets the length
   conditions set above, and also when the running checksum has some
   property that, probabilistically, appears ~200 times along the
   file. For example, let
   k = int(log_base_2(file length / 200));
   you could then require the running adler32 (or crc32 or whatever)
   of the most recent 80 bytes to have its k least significant bits
   all zero.
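Here is that chunker as a sketch (Python; the 80-byte window, the 1/100-1/400 length bounds, the ~200-block target and adler32 are the example parameters above - the function names and everything else are my own):

import hashlib

WINDOW = 80          # the running checksum depends only on the last 80 bytes
ADLER_MOD = 65521    # modulus used by adler32

def chunk(data):
    """Split data into blocks at content-defined boundaries (a sketch)."""
    n = len(data)
    if n == 0:
        return []
    # k = int(log2(n / 200)): the boundary property should fire ~200 times.
    k = max((max(n // 200, 1)).bit_length() - 1, 0)
    mask = (1 << k) - 1
    min_len = max(80, n // 400)                    # length bounds from point 1
    max_len = min(16 * 2**20, max(n // 100, min_len))

    blocks, start = [], 0
    a, b = 1, 0                                    # adler32 state of the window
    for i in range(n):
        byte = data[i]
        if i < WINDOW:                             # window not yet full: just append
            a = (a + byte) % ADLER_MOD
            b = (b + a) % ADLER_MOD
        else:                                      # roll oldest byte out, new byte in
            old = data[i - WINDOW]
            a = (a - old + byte) % ADLER_MOD
            b = (b - WINDOW * old + a - 1) % ADLER_MOD
        length = i - start + 1
        at_boundary = (((b << 16) | a) & mask) == 0   # k least significant bits zero
        if length >= max_len or (length >= min_len and at_boundary):
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < n:
        blocks.append(data[start:])                # whatever is left is the last block
    return blocks

def block_ids(data):
    """SHA1 identifiers of the blocks, i.e. the body of a manifest entry list."""
    return [hashlib.sha1(blk).hexdigest() for blk in chunk(data)]

(Because a boundary depends only on the preceding 80 bytes plus the length bounds, inserting a byte into a large file typically disturbs only the one or two blocks around the edit; every other block keeps its SHA1.)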

Assuming the running checksum is reasonably "random", small changes shouldn't affect more than one or two blocks.

The blocks could be identified by their SHA1, and the manifest could either be file-specific, or the standard manifest could be extended to include a part specification, e.g.

deadface dir1/entire
deadbeef dir1/entire*1*200
31415926 dir1/entire*2*200
.
.
.
b1e2e3f4 dir1/entire*200*200
f5a6c7e8 dir1/file2

(I put only 8 chars for the sha1, but ... you get the idea).

sha1sum would still be able to check the entire file; it will complain about the missing part-files, naturally.
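(To make that concrete, a sketch of rebuilding a file from such a manifest and checking it against the whole-file entry - the layout follows the example above; everything else, including the dict used as a block store, is assumption:)

import hashlib

def rebuild_file(manifest_lines, blocks, path):
    """Rebuild `path` from its part entries and verify the whole-file SHA1."""
    whole, parts = None, {}
    for line in manifest_lines:
        sha1, name = line.split(None, 1)
        if name == path:
            whole = sha1                            # the plain "dir1/entire" entry
        elif name.startswith(path + '*'):
            part_no = int(name.rsplit('*', 2)[1])   # "<path>*<part>*<total>"
            parts[part_no] = sha1
    data = b''.join(blocks[parts[i]] for i in sorted(parts))
    assert hashlib.sha1(data).hexdigest() == whole, "missing or corrupt block"
    return data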

This also interoperates nicely with the networking side - to fetch a version from a depot, you fetch the manifest, and then fetch all the SHA1 blocks listed in the manifest that you don't already have. As simple as that.

There's no need to have a staging queue for the depot. I suspect it will cost a little in latency compared to today's perfectly synchronized depot, as the client has to fetch the manifest, parse it, decide what it needs to fetch, and only then start transferring the actual data - but the block fetches themselves can be pipelined, so I don't think it will be crucial.
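(A sketch of that client-side pull - `fetch(sha1) -> bytes` stands in for whatever transport is used, and `have` for the local database; only the manifest layout mirrors the example above:)

import hashlib

def pull_version(fetch, manifest_text, have):
    """Fetch every part block named in a manifest that we don't already have."""
    wanted = []
    for line in manifest_text.splitlines():
        sha1, name = line.split(None, 1)
        # Part entries only; whole-file entries are rebuilt from their parts.
        if '*' in name and sha1 not in have and sha1 not in wanted:
            wanted.append(sha1)
    for sha1 in wanted:                    # these requests can be pipelined
        data = fetch(sha1)
        assert hashlib.sha1(data).hexdigest() == sha1   # blocks are self-verifying
        have[sha1] = data
    return wanted                          # what actually went over the wire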

About NNTP, email, and "dumb web" distribution - all you have to do is record, for each block, whether or not it was sent to a specific destination. If you want to push a version somewhere, you just push all the blocks that have not yet been pushed to that location. Need to resync? Reset the "already sent" marks.

You can push to a dumb ftp or web server just as well - each block is placed in a file whose name is the sha1 of the block, and so is the manifest. The client fetches the manifest, and then starts fetching (through http or ftp) all the required blocks. No need for a smart depot (although one could make things more efficient, by serving directly from the database or accepting directly into the database - the client doesn't even have to be aware).
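(And the push bookkeeping - `send(sha1, data)` stands in for scp/ftp/NNTP/whatever, `sent` is the per-destination "already sent" record; names are mine:)

def push(send, blocks, needed, sent):
    """Push every block of one version that this destination hasn't seen yet.

    blocks maps sha1 -> data; needed lists the sha1s of the version's
    manifest and parts; sent is the per-destination "already sent" set.
    """
    for sha1 in needed:
        if sha1 not in sent:
            send(sha1, blocks[sha1])    # e.g. drop a file named <sha1> on a dumb server
            sent.add(sha1)

# Need to resync?  Reset the marks:
# sent.clear()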

Problems I could think of, none too serious:

1. An adversary might construct files that don't break well, and change them between versions in such a way that a complete copy is stored, even though there aren't many changes.

That's not really a concern, I think - the adversary could just as well have replaced the entire file and foiled any kind of delta compression.

2. Files with large repeats and/or large spans of zeros would probably reduce the block boundary decision algorithm to a constant-length cutter, or something like that.

Not really a problem, but perhaps could be handled better as a special case.

3. "phase change" effect when file size crosses a 2^k size boundary.

In this case, the boundary decision algorithm uses a different value of k on either side of the crossing, so the two versions are much less likely to have blocks in common - resulting in another copy of the file being stored even though we could have reused many of the blocks.

It's not a _huge_ problem - this will only happen once for each 2^k crossing, because if we cross back, we'll have the old shorter-block version to match against, and if we cross forward again, we'll have the old longer-block version to match against.

Further, this could be handled as a special case. The block boundary decision algorithm only matters for storage efficiency, not integrity - if it were replaced by no blocking at all, or by breaking into exactly 8K blocks, everything would still work as expected (albeit with much less efficient storage).


Pro:

* Older versions just as easy to access
* Easier to build "dumb" servers, simpler queuing rules, etc.
* Compression across all versions of all files (not just
  all versions of the one file).
* Easy to purge any specific version (using reference counts; see the
  sketch after this list)
* Easy to overcome database/filesystem/memory limits, by proper
  choice of block sizes.
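(The reference-counted purge mentioned above, as a sketch - all names are mine; `versions` maps a version id to the block SHA1s its manifest references:)

def add_version(version_id, block_ids, versions, refcount):
    versions[version_id] = list(block_ids)
    for sha1 in block_ids:
        refcount[sha1] = refcount.get(sha1, 0) + 1

def purge_version(version_id, versions, blocks, refcount):
    """Drop one version; free any block no surviving version still references."""
    for sha1 in versions.pop(version_id):
        refcount[sha1] -= 1
        if refcount[sha1] == 0:         # no remaining manifest references it
            del refcount[sha1]
            del blocks[sha1]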

Con:

* Requires more storage and/or bandwidth. A one-byte change in a
  100MB file would cost ~500KB with this scheme, and ~100B with the
  current delta scheme. Network bandwidth is similarly affected. The
  cost depends on how many blocks are touched, rather than on the
  size of the changes themselves: a sequential 400KB change that is
  entirely within one block will cost 400KB as a delta and ~500KB as
  blocks, while a one-byte change in every one of the 200 blocks
  will cost ~2KB as a delta and the entire 100MB as blocks.
* The latest version, unless properly cached, will take longer to
  construct (we need to pull all the blocks, which will be scattered
  around the database). And proper caching costs space....

The boundary determination scheme is not unlike the Rabin-Karp search algorithm, or some of what rsync does, BTW.

End of braindump.

I was hoping to experiment and write this up more clearly, but I don't think I'll get to do that soon enough; hence the braindump.





