[Monotone-devel] Re: netsync status
From: graydon hoare
Subject: [Monotone-devel] Re: netsync status
Date: Tue, 24 Feb 2004 10:00:37 -0500
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6b) Gecko/20031205 Thunderbird/0.4

Asger Ottar Alstrup wrote:
> Do you have an indication of the overhead when synching? If X bytes are
> different in a repository of size Y, how much data is transferred? I
> understand that you can not give an accurate formula, but I hope it is a
> linear function in X only, and that the constant factor in front of X is
> less than 2.

it is not just a function of X, but it is close to that. I will explain
the protocol and you can see where the overhead is.
- the hashed index exchange happens only over manifest certs and keys.
after that the manifest certs imply an ancestry graph, the ancestry
graph implies a manifest data / delta structure, each manifest edge
implies a file data / delta structure. so once the keys and certs
are exchanged there is no more overhead, just streaming requests
and responses.
- the manifest cert synchronization will probably be the only source
of "increasing" overhead with time. it will be bounded by something
like (forgive the inaccuracy, I haven't done detailed analysis):
~ log_B(K) * N
where N is the number of certs eventually found missing (and thus
transmitted), K is the size of your cert set, and B is a tunable
branching factor serving as the log base, currently set to 16.
two further qualifications should be made: the number N should
really be smaller since each path of length L through the hashed
index has N*(B^-L) probability of sharing a prefix with another
path -- or some such factor -- and the overall *size* of each index
node, in bytes, varies with the load of the tree, though the formula
for that is proving a bit uglier than I feel like working out in
the first email of the morning.
- note that this only scales with the number of *manifest* certs, so
K is really about 4 * number-of-change-sets-in-branch,
which for most practical users will mean no more than 4 (or 5
if you're talking about the linux kernel) index exchanges per
missing element. each index node varies between about 50 and 400
bytes (depending on load) so in the worst realistically imaginable
case, with these branching settings, it could cost say 1.6k of sync
traffic to pick out a missing node in amongst a quarter million
changesets (which -- ballpark -- is about 75mb of certificates).
but you'd only see that if there were only a couple missing nodes;
if there are "lots", the number of shared prefixes will rise and the
efficiency will improve a bit.
- in practical terms: I just tested a netsync of a change to monotone
against an HTTP post of the same change (using old packets) and the
netsync used fewer bytes, despite including a synchronization of a
196-cert collection. the netsync encodings are all a bit tighter
than those used for packets.
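
to make the arithmetic above concrete, here is a minimal sketch of the rough bound log_B(K) * N and of the 1.6k ballpark. the function names are hypothetical and exist only for this illustration; nothing here is taken from the monotone source, and the shared-prefix and node-load corrections mentioned above are deliberately left out.

```python
import math

def index_levels(cert_count, branching=16):
    """Approximate depth of the hashed index: log_B(K)."""
    return math.log(cert_count, branching)

def exchange_bound(missing, cert_count, branching=16):
    """Rough upper bound on index exchanges: log_B(K) * N.

    Ignores the shared-prefix savings, so it overestimates
    when many certs are missing.
    """
    return index_levels(cert_count, branching) * missing

def traffic_bytes(exchanges, node_bytes):
    """Index traffic for a given number of node exchanges."""
    return exchanges * node_bytes

changesets = 250_000
certs = 4 * changesets  # assumption from the text: about 4 manifest certs per changeset
print(f"index levels for {certs} certs: {index_levels(certs):.2f}")
print("4 exchanges at 400 bytes each:", traffic_bytes(4, 400), "bytes")
```

with B = 16 the unrounded formula gives a little under 5 levels for a quarter million changesets, so the 1.6k figure (4 exchanges at the 400-byte upper end) is best read as a ballpark rather than an exact result.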

> What about security? Can you encrypt the data transferred?

not at the moment, but it's not beyond reason. it does authenticate the
peers connecting, with RSA signatures on nonces, and calls a lua hook to
evaluate their read/write access to the collection they're syncing. if
you want "transport" encryption, you could also just tunnel the
connection over SSH. should work fine.
-graydon