[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Monotone-devel] Re: Support for binary files, scalability and Windows p
From: |
graydon hoare |
Subject: |
[Monotone-devel] Re: Support for binary files, scalability and Windows port |
Date: |
Fri, 16 Jan 2004 18:44:30 -0500 |
User-agent: |
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031205 Thunderbird/0.4 |
Asger Kunuk Ottar Alstrup wrote:
OK. The 16 MB limit is a showstopper for us, and probably the hashing as
well. We are working with raw video files that are on the order of 100
MB big.
yeah. those are really not going to play well. not only hashing will be
a problem; it is also the case that monotone sometimes keeps 1, 2, in
rare cases 3 copies of a file in memory concurrently, in std::strings.
it is really made for source code control. if you need to use it for
controlling very large individual files, it will need quite a bit of
changing.
Do you have an impression of what the consequences of changing some
files to use automatically generated unique identifiers instead of
hashes, or partial hashes? In other words, how much of the code relies
on the fact that the id's are hashes of the complete file?
eh, I doubt anything intrinsically *relies* on it; the hash is pretty
much treated as an opaque function from data->identifier. you could in
theory change it to use any sort of hashing scheme: partial, smaller,
faster, bigger, stronger. there are many to choose from.
Regarding, rcs_import.cc that is too big for a 5 minute inspection, so
do you know in more detail what changes are required?
Maybe the best bet is simply to try, and see what breaks?
yup. I don't actually know for sure that it's broken; my estimate was
based on how much I could imagine any hypothetical breakage costing, not
a detailed analysis. it may well already parse, but if it doesn't that's
not going to take too long to fix. after that you probably just need to
make sure it is decoded (uudecode?) and reconstructed properly (i.e. not
using the ed-line-script thing textual RCS entries use).
Regarding big-file support: I was wondering whether that could be done
without changing sqlite: A big file can be defined as concatenations of
many smaller chunks. Even if you bump the limit to 2^32, it will still
fail for some users, so maybe it's better to come up with a scheme
without such a "low" limit?
I'm not sure how that fits into the architecture, though.
hmm. this comment, and your talk about network distribution, has given
me a lot to chew on. sorry it's taken so long to respond. these are
issues which I agonized over quite a bit near the beginning, and am
still not completely happy with. your idea is good. I will proceed to
talk out loud for a minute here.
currently, files are stored, as you have probably guessed, as head
versions + xdeltas. the xdeltas reach into the past. head versions are
identified by SHA1, deltas are identified by (pre,post) pairs of SHA1s
-- the versions on either side of the delta. xdelta is itself based
somewhat on hashing; common blocks are found by indexing all the blocks
in the first file under their adler32 code, and rolling an adler32
window of the same block size over the second file one byte at a time.
then it writes a copy/insert delta stream. when you make a change the
forwards xdelta is queued to send to the network, and the reverse xdelta
is stored in the database, moving the head version to your newly
commited version.
this storage management system need not be the case. another feasible
form would involve removing the file/delta distinction and storing all
files as hash trees, where each hash code identifies an existing data
fragment (up to a fixed block size) *or* a subtree.
this would in a sense "speed up" access to old files -- or at least make
access to all files roughly equal speed -- and as you mention is would
integrate nicely with a type of network distribution we don't currently
have any real abstraction for: "synchronizing". syncing hash trees works
pretty well. it would also easily permit astronomically huge files.
the difficulty with this sort of approach (well, everything related to
the rsync algorithm really) is getting the constants right.
you have pressure from one side to make the extents you identify large:
the fewer extents you write, the smaller your encoding, and the smaller
your summary index of extents is (== less time to search) when you're
doing a rolling adler32.
on the other hand, you have pressure from the other side to make the
extents you identify small: finer grained extents means that you
identify more commonality between blocks.
in theory you can speed up the adler32 (well, for whole-database
indexing we would probably move to a larger adler code) by maintaining a
medium-sized bloom filter of all the adler codes in your database. it's
important to realize how fast xdelta needs to be able to disqualify an
adler code: it checks *every byte offset* in second file. you commit a
64k file, it will want to check (64,000 - blocksize) adler codes.
How is monotone doing in this area? Full support requires reimplementing
rsync, but maybe monotone can reuse rsync or something?
you know xdelta is based on rsync, right? :))
yeah, I'm definitely curious to figure out a way to run monotone's
networking the other way around. currently it is "replay-based".
committing something queues it for transmission, depots replay
transmissions, etc. this has some advantages -- you can just concatenate
communication history for example -- but seemingly just as many
disadvantages. it is fragile, order-sensitive, and requires a bunch of
contortions to compensate for.
I'd be just as happy to scrap the entire existing networking system and
move to a synchronization-based one, custom protocol or otherwise, if I
could work out a really efficient abstraction. synchronization is
self-stabilizing, which I'm very attracted to. maybe hash trees are it.
note that any such solution has to synchronize not only file contents
but metadata collections. but maybe we can structure things in a way
which supports that. for example, suppose we sort all manifests by
attached date cert (or, heck, lexicographically) and cluster them into
hash trees with a pleasant branching factor. then when I want to sync
with you, I send you my "highest" summary hash-tree list (a few hundred
manifests) and we sync from manifests to certs and files, then into the
file content hash trees. it might well work.
It seems Zbynek Winkler more or less nailed that one today, so that is
good news.
not sure about that being nailed, but it's certainly making very
encouraging progress. I'd really like to see a native windows port
someday (both to shed cygwin, and to get a native GUI).
The next step would be to develop a TortoiseCVS/TortoiseSVN kind of
client, but that should not require any changes in monotone as such.
it'll probably require some refactoring of commands.cc, but I'm not at
all adverse to cleaning that up. it's a bit of a rat's nest right now.
-graydon
- [Monotone-devel] Support for binary files, scalability and Windows port, Asger Ottar Alstrup, 2004/01/12
- [Monotone-devel] Re: Support for binary files, scalability and Windows port, graydon hoare, 2004/01/12
- [Monotone-devel] Re: Support for binary files, scalability and Windows port, Asger Kunuk Ottar Alstrup, 2004/01/15
- Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port, Zbynek Winkler, 2004/01/15
- [Monotone-devel] Re: Support for binary files, scalability and Windows port,
graydon hoare <=
- Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port, Ori Berger, 2004/01/16
- Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port, graydon hoare, 2004/01/17
- Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port, Nathaniel Smith, 2004/01/17
- [Monotone-devel] Re: Support for binary files, scalability and Windows port, graydon hoare, 2004/01/19
- Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port, Zbynek Winkler, 2004/01/19
- Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port, Ori Berger, 2004/01/18
- Re: [Monotone-devel] Re: Support for binary files, scalability and Windows port, Zack Weinberg, 2004/01/18
- [Monotone-devel] Re: Support for binary files, scalability and Windows port, graydon hoare, 2004/01/19
- [Monotone-devel] RE: Support for binary files, scalability and Windows port, Asger Kunuk Alstrup, 2004/01/18
- [Monotone-devel] Re: Support for binary files, scalability and Windows port, Peter Simons, 2004/01/18