[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Monotone-devel] Re: database schema questions
From: |
graydon hoare |
Subject: |
[Monotone-devel] Re: database schema questions |
Date: |
Wed, 21 Apr 2004 10:50:39 -0400 |
User-agent: |
Mozilla Thunderbird 0.5 (X11/20040208) |
Derek Scherger wrote:
Does the files table contain the heads of various files, and the
files_delta table contain deltas from the head backwards to each
previous parent version? I think this is conceptually similar to how rcs
stores files (reverse deltas), and the reverse of how sccs does (or
did). Presumably manifests and manifest_deltas work similarly.
yes. though the storage organization is opaque to the rest of the
program; there is no requirement that the database store "heads" (in an
ancestry sense) in the file table and "non-heads" in the "file_deltas"
table. it happens to do so frequently, but it could also put all
versions in "files" and do no delta storage at all. deltas are just a
method for the storage system to compress things.
this is a very important and rarely-stated aspect of monotone's design:
file and manifest storage is *unrelated* to history. history is a
description of what happened, in terms of certificates which point to
SHA1 values. the method of constructing the data for a given SHA1 value
is *completely independent* of that history.
storage happens to be built more or less in parallel with history -- in
the current scheme -- so often shares a similar graph shape, but we
could use any other scheme we liked too. some people have suggested
"fragment buckets" -- essentially running xdelta over every N-byte block
in the entire history -- and I'd be willing to do that if I could come
up with an efficient implementation of the match algorithm. likewise
suffix trees or burrows-wheeler blocks or whatever. the storage layer is
decoupled[1].
It seems that branch certs are "sticky" in that they will get added to
successive manifests that descend from a parent in the branch. Is this
done purely with the branch setting in the MT/options file?
yes, or the --branch command-line option. they are normal certs, they
just happen to be added for you automatically.
Given the previous paragraph would it make some sense to store manifests
as a table with columns like manifest_id, file_id and file_name?
yes, but that would consume more bytes than the current form. in some
cases quite a few more bytes. it is important to note that we store
manifests as *gzipped* text. this means we get free statistical data
compression. look at a large manifest (say gcc: about 900k). most of the
bytes on most of the lines of that manifest are identical path prefixes:
...
gcc/testsuite/consistency.vlad/layout/i960-97r2-results/c-longlong-2-c-short.out++
gcc/testsuite/consistency.vlad/layout/i960-97r2-results/c-pointer-1-c-char.out++
gcc/testsuite/consistency.vlad/layout/i960-97r2-results/c-pointer-1-c-double.out++
...
that happens to compress very well; in this case it is nearly 10:1
compression. even with the subsequent base64 encoding, the result is
only 130k.
now, we could possibly achieve the same (or even better) compression
result "explicitly" at the storage level by storing directories as first
class citizens, and putting files as leaves in directories, and chaining
through the directory table to get to a file.. but that adds a lot of
complexity. I didn't want to bother.
This would allow any file to get all of the names it is known by and a list
of all the manifests it is part of very easily.
...
I do see that this represents some additional complexity in that files
and manifests are then stored and treated differently. But would there
be some advantage to making file names easily available? I have the
impression that getting a file name is somewhat involved in terms of
unpacking manifests and constructing the associated maps. I'm also
wondering whether a manifest merge under this representation would be
possible as a simple sql union of two other manifests.
there is certainly an argument to be made. early versions of monotone
did represent data this way. I removed it for the following reasons:
- there is a big space efficiency win with blobs of text (gzip)
- merging is more difficult than SQL union in all but the most trivial
cases, so you wind up having to construct a memory representation
anyways. diff_patch.cc::merge3 is one of the largest and most
complex functions in monotone (300 lines), and that's the simplified
version. merging trees is kinda gross.
notice the "union oriented" vision I initially had (struct
manifest_changes, in manifest.hh) turned out only to be useable as
a stepping stone in building the "real" representation of inter-tree
changes, which is struct patch_set, in patch_set.hh. as for
manifest.cc::apply_manifest_changes, it is actually dead code now.
- my representation of "a file" at the database level is really not
compatible with the notion people have of "a history line". storage
delta chains can be broken, or mis-ordered, or whatever. they don't
necessarily correspond to time or work. so suppose you want to read
off the "history" of a file; you cannot say "look at the file_deltas
table for its predecessor" because that might be unrelated to the
record of history. same is true or manifest_deltas; it might not
actually describe history. only the ancestry graph describes history.
so you need to go:
file->manifest->cert->manifest->file
each step of the way. storing things in a more explicit fashion
doesn't seem to win you much given algorithms that operate this way.
if you want to conduct an experiment and try rewriting part of the
schema in terms of explicit directories and manifests-as-tuples, I won't
stop you. it's possible it would be a space win and maybe simplify some
algorithms. I found the current arrangement beneficial though.
-graydon
[1]: small fib: the CVS importer and some of the old, dead queue-walking
code manipulate deltas directly. they do this for speed. you could fake
them on another storage system, but they'd slow down.