From: Ori Berger
Subject: Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port
Date: Sat, 17 Jan 2004 03:41:49 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4; MultiZilla v1.5.0.2g) Gecko/20031007
graydon hoare wrote:
> currently, files are stored, as you have probably guessed, as head
> versions + xdeltas. the xdeltas reach into the past. head versions are
> identified by SHA1, deltas are identified by (pre,post) pairs of SHA1s
> -- the versions on either side of the delta. xdelta is itself based
> somewhat on hashing; common blocks are found by indexing all the blocks
> in the first file under their adler32 code, and rolling an adler32
> window of the same block size over the second file one byte at a time.
> then it writes a copy/insert delta stream. when you make a change the
> forwards xdelta is queued to send to the network, and the reverse xdelta
> is stored in the database, moving the head version to your newly
> committed version.
>
> this storage management system need not be the case. another feasible
> form would involve removing the file/delta distinction and storing all
> files as hash trees, where each hash code identifies an existing data
> fragment (up to a fixed block size) *or* a subtree.
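(For concreteness, a minimal Python sketch of the block-matching step
described above - index the old file's blocks by adler32, slide a window
over the new file, and emit a copy/insert stream. The block size is a
made-up constant, and the checksum is recomputed per byte rather than
rolled incrementally; this is an illustration, not monotone's xdelta code.)

    import zlib

    BLOCK = 64  # indexing block size (illustrative value, not monotone's)

    def index_blocks(old):
        """Map adler32(block) -> offset for each fixed-size block of the old file."""
        return {zlib.adler32(old[i:i + BLOCK]): i
                for i in range(0, len(old) - BLOCK + 1, BLOCK)}

    def copy_insert_delta(old, new):
        """Emit a naive copy/insert stream by sliding a BLOCK-sized window over new."""
        table = index_blocks(old)
        out, i, pending = [], 0, bytearray()
        while i + BLOCK <= len(new):
            off = table.get(zlib.adler32(new[i:i + BLOCK]))
            # on a weak-hash hit, compare bytes to rule out collisions
            if off is not None and old[off:off + BLOCK] == new[i:i + BLOCK]:
                if pending:
                    out.append(('insert', bytes(pending)))
                    pending.clear()
                out.append(('copy', off, BLOCK))
                i += BLOCK
            else:
                pending.append(new[i])
                i += 1
        pending.extend(new[i:])          # trailing literal bytes
        if pending:
            out.append(('insert', bytes(pending)))
        return out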
I had been thinking about this as well; in a message from September
16th, I mentioned I'd describe my ideas once I got them more
organized, but I never had the time to get down to all the details.
I'll try to join the brain dump instead.
First, it's possible to use a suffix tree to locate repeating parts.
It can be implemented to run much, much faster than rsync's algorithm,
and it is finely grained - an implementation and description can be
found in <http://www.cl.cam.ac.uk/~cpk25/libstree/>. This
implementation doesn't use persistent storage, but if it did, it
would take more or less zero time to locate _all_ sources for a
file. (The naive storage requirements are O(n) for a text of length
n, but I suspect that the constant factor might be prohibitive in
practice).
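(libstree itself is a C library; just to make the idea concrete, here is a
rough Python sketch that finds the longest block shared by two files using
a naively built generalized suffix array - the same query a persistent
suffix tree would answer with far better time bounds. The O(n^2 log n)
construction is for illustration only.)

    def longest_common_block(a, b):
        """Longest common substring of byte strings a and b, via a naive
        generalized suffix array: sort all suffixes of a + SEP + b and scan
        adjacent suffixes that come from different inputs."""
        SEP = 0  # assumes byte 0 appears in neither input
        s = a + bytes([SEP]) + b
        n_a = len(a)
        suffixes = sorted(range(len(s)), key=lambda i: s[i:])
        best = b""
        for i, j in zip(suffixes, suffixes[1:]):
            if (i < n_a) == (j < n_a):
                continue  # both suffixes come from the same input
            k = 0
            while (i + k < len(s) and j + k < len(s)
                   and s[i + k] == s[j + k] and s[i + k] != SEP):
                k += 1
            if k > len(best):
                best = s[i:i + k]
        return best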
Suffix trees may allow a "find likely ancestors" function to be
implemented, with which proper attribution and detection of file
splitting and merging can be made - but I digress.
> I'd be just as happy to scrap the entire existing networking system and
> move to a synchronization-based one, custom protocol or otherwise, if I
> could work out a really efficient abstraction. synchronization is
> self-stabilizing, which I'm very attracted to. maybe hash trees are it.
Ok, here goes my braindump:
Find a way to partition files into blocks, in such a way that:
1. Block lengths are reasonable (definitions later)
2. The block boundaries are likely to be unchanged by small
modifications - i.e. an insertion of 1 byte somewhere should have a
small probability of changing one boundary, a smaller probability
of changing two boundaries, ..., and a vanishingly small probability
of changing all boundaries.
Now, given a file, break it into blocks; Store each block
independently, with its own hash; Store a manifest that says how to
rebuild a file from its blocks. And that's it.
If all files were text files of 200 lines or so, applying this
would work well by just taking each line as a block, with the
boundaries detected at newline (\n) characters. Inserting a character
in a line would not affect any other line. The manifest would be a
list of line identifiers. When a line is changed, we'd usually have
a new manifest, and exactly _one_ line added to the database (all
other lines=blocks will be reused).
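(A toy illustration of that line-as-block case, in Python; the helper
names and the in-memory dict standing in for the database are mine, not
monotone's:)

    import hashlib

    def store_lines(blocks, data):
        """Each line is stored once under its SHA1; the manifest is just the
        ordered list of line hashes."""
        manifest = []
        for line in data.splitlines(keepends=True):
            h = hashlib.sha1(line).hexdigest()
            blocks.setdefault(h, line)   # an unchanged line is reused for free
            manifest.append(h)
        return manifest

    def rebuild(blocks, manifest):
        return b"".join(blocks[h] for h in manifest)

    blocks = {}
    v1 = store_lines(blocks, b"alpha\nbravo\ncharlie\n")
    v2 = store_lines(blocks, b"alpha\nbravo!\ncharlie\n")
    assert rebuild(blocks, v2) == b"alpha\nbravo!\ncharlie\n"
    assert len(blocks) == 4   # three lines from v1, plus exactly one changed line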
To apply this to general files, we can use the following:
1. Block lengths are between 1/100 and 1/400 of the file length,
but never shorter than (e.g.) 80 bytes, and never longer than
(e.g.) 16 megabytes.
2. The block boundaries are determined by computing some running
checksum that depends only on (e.g.) the last 80 bytes, and
declaring that a block has ended when it meets the length
conditions set previously, and also when the running checksum
has some property that, probabilistically, appears ~200 times
along the file. For example, let
k = floor(log2(file length / 200))
You could require the running adler32 (or crc32 or whatever) of
the most recent 80 bytes to have its k least significant bits
all zero.
Assuming the running checksum is reasonably "random", small changes
shouldn't affect more than one or two blocks.
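(A minimal sketch of that boundary rule in Python. The window size, length
limits, and the computation of k follow the example numbers above;
zlib.adler32 is recomputed per position instead of being rolled
incrementally, which is slow but keeps the sketch short.)

    import hashlib, zlib

    WINDOW = 80          # the boundary decision looks only at the last 80 bytes
    MIN_BLOCK = 80       # never cut a block shorter than this
    MAX_BLOCK = 1 << 24  # ...or longer than 16MB

    def chunk(data, target_blocks=200):
        """Cut wherever the checksum of the last WINDOW bytes has its k low
        bits all zero, so boundaries follow content rather than offsets."""
        k = max(1, max(len(data) // target_blocks, 1).bit_length() - 1)
        mask = (1 << k) - 1
        blocks, start = [], 0
        for i in range(WINDOW, len(data) + 1):
            length = i - start
            at_cut = (zlib.adler32(data[i - WINDOW:i]) & mask) == 0
            if (length >= MIN_BLOCK and at_cut) or length >= MAX_BLOCK:
                blocks.append(data[start:i])
                start = i
        if start < len(data):
            blocks.append(data[start:])    # trailing partial block
        return blocks

    def manifest(blocks):
        """One SHA1 per block; blocks unchanged between versions hash identically."""
        return [hashlib.sha1(b).hexdigest() for b in blocks]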
The blocks could be identified by their SHA1, and the manifest could
either be file-specific, or the standard manifest extended to
include a part specification, e.g.
deadface dir1/entire
deadbeef dir1/entire:1:200
31415926 dir1/entire:2:200
.
.
.
b1e2e3f4 dir1/entire:200:200
f5a6c7e8 dir1/file2
(I put only 8 chars for the sha1, but ... you get the idea).
sha1sum would still be able to check the entire file; It will
complain about the missing parts, naturally.
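(A rough sketch of how a client might read such an extended manifest; the
path:part:total form is the one suggested above, and the parser assumes
plain paths contain no colons:)

    def parse_manifest(text):
        """Yield (sha1, path, part, total) per manifest line; whole-file
        entries get part = total = None."""
        for line in text.strip().splitlines():
            sha1, name = line.split(None, 1)
            if ":" in name:
                path, part, total = name.rsplit(":", 2)
                yield sha1, path, int(part), int(total)
            else:
                yield sha1, name, None, None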
This also interoperates nicely with network stuff - to fetch a
version from a depot, you fetch the manifest, and then fetch all the
sha1 blocks listed in the manifest that you don't already have. As
simple as that.
There's no need to have a staging queue for the depot. I suspect it
will cost a little in latency compared to a perfectly synchronized
depot of today, as the client would have to fetch the manifest,
parse it, decide what it needs to fetch, and only then start
transferring the actual data - but the blocks themselves can be
pipelined, so it's not going to be crucial, I think.
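(The fetch path in that scheme is small enough to sketch; depot_get(sha1)
stands for whatever transport is used, and store is the local
sha1 -> bytes database - both names are mine:)

    def fetch_version(depot_get, store, manifest_sha1):
        """Pull one version: fetch the manifest, then fetch only the blocks
        we don't already have."""
        manifest = depot_get(manifest_sha1)
        store[manifest_sha1] = manifest
        wanted = [line.split()[0]
                  for line in manifest.decode().splitlines() if line.strip()]
        for h in wanted:
            if h not in store:            # blocks already present cost nothing
                store[h] = depot_get(h)   # these fetches could be pipelined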
About NNTP, email, "dumb web" distribution - all you have to do is
record, for each block, whether or not it was sent to a specific
destination. If you want to push a version somewhere, you just push
all the blocks that have not yet been pushed to that location. Need
to resync? Reset the "already sent" marks. You can push to a dumb
ftp or web server just as well - each block is placed in a file
whose name is the sha1 of the block, and so is the manifest. The
client fetches the manifest, and then starts fetching (through http
or ftp) all required blocks. No need for a smart depot (although one
could make things more efficient, by serving directly from the
database or accepting directly into the database - the client doesn't
even have to be aware).
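(Pushing to such a dumb server is basically this; put_file(name, data)
stands for whatever upload the server offers, and sent is the
per-destination "already sent" set - clearing it forces a full resync:)

    def push_to_dumb_server(store, sent, put_file):
        """Upload every block (and manifest) not yet marked as sent, each
        stored under its sha1 as the file name."""
        for sha1, data in store.items():
            if sha1 not in sent:
                put_file(sha1, data)
                sent.add(sha1)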
Problems I could think of, none too serious:
1. An adversary might construct files that don't break well, and
change them between versions in such a way that a complete copy is
stored, even though there aren't many changes.
That's not really a concern, I think - the adversary could just as
well have replaced the entire file and foiled any kind of delta compression.
2. Files with large repeats and/or large spans of zeros would
probably reduce the block boundary decision algorithm to a
constant-length cutter or something like that.
Not really a problem, but perhaps could be handled better as a
special case.
3. "phase change" effect when file size crosses a 2^k size boundary.
In this case, the boundary decision algorithm would use a different
decision rule, and the two versions are much less likely to have
blocks in common - resulting in another copy of the file being
stored even though we could have reused many of the blocks.
It's not a _huge_ problem - this will only happen once for each 2^k
crossing, because if we cross back we'll have the old shorter block
version to match against, and if we cross forward again, we'll have
the old longer block version to match against.
Further, this can be solved as a special case. The block boundary
decision algorithm is only important for storage efficiency, not
integrity - if it's replaced by no blocking at all, or by breaking
into exactly 8K blocks, everything will still work as expected (albeit
with much less efficient storage).
Pro:
* Older versions just as easy to access
* Easier to build "dumb" servers, simpler queuing rules, etc.
* Compression across all versions of all files (not just
all versions of the one file).
* Easy to purge any specific version (using reference counts)
* Easy to overcome database/filesystem/memory limits, by proper
choice of block sizes.
Con:
* Requires more storage and/or bandwidth; A one
byte change in a 100MB file would cost ~500KB with this
scheme, and ~100B with the current delta scheme.
Network bandwidth is similarly affected. This depends
on how many blocks are changed, rather than the changes
themselves: a sequential 400KB change that is entirely within
one block will cost 400KB as delta, and 500KB as blocks.
A change of one byte in every one of the 200 blocks will cost
~2KB as delta, and the entire 100MB as blocks.
* Latest version, unless properly cached, will take longer to
construct (need to pull all blocks, which will be scattered
in the database). And proper caching costs space....
The boundary determination scheme is not unlike the Rabin-Karp
search algorithm, or some of what rsync is doing, BTW.
End of braindump.
I was hoping to experiment and write it down more clearly, but I
don't think I'll get to do that soon enough; Hence the braindump.