[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Monotone-devel] Roster handling in the code
From: |
Patrick Georgi |
Subject: |
Re: [Monotone-devel] Roster handling in the code |
Date: |
Sat, 25 Jul 2009 09:39:26 +0200 |
User-agent: |
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.1) Gecko/20090715 Thunderbird/3.0b3 |
Am 25.07.2009 04:07, schrieb Stephen Leake:
Patrick Georgi<address@hidden> writes:
The current code takes two rosters, compares them, and registers the
differences. The problem is that this approach scales by roster size
(which is directly dependent on the tree size). The problem is, that I'm
working with trees with 50000 files - comparing all of their entries
just to find the 5 that actually changed is quite a waste of time.
As a test, I disabled the roster deltification code, and got quite some
speedup (pushing 10 revisions with put_revision in approx. half the
time), so there's definitely room for improvement.
I think the main motivation for roster_deltas is database size. If we
can store a new revision as an old revision plus a small delta, that's
a significant size saving. As with anything that saves size, it costs
computation time. Which doesn't mean the current trade-off is optimal.
Disk space is certainly cheaper than it used to be. But so is CPU time
:).
Perhaps a user tuning option could be provided?
The problem in this case, that large trees affect both of them very
negatively. When I did the profiling, I disabled roster_delta generation
(#if 0/#endif around a couple of lines in database.cc) to see if I'm
really on the right track.
But this "optimization", while reasonable for small trees, really hurts
large trees, too. Every roster contains information about 50000 files,
and that sums up.
So both potential optimizations, the current one, and disabling
roster_deltas, are acceptable for small trees. But small trees are not
the case I'm looking at.
...
My current approach is to try to create the roster_delta while the new
roster is created out of the old (using
cset.apply_to(editable_roster_for_{non,}merge)). The most appropriate
place for this seems to be editable_roster_for_{non,}merge, as all data
is passed to it as some point of the operation.
I have wondered why there were two different "delta" mechanisms; one
that creates changesets, and one that creates roster deltas. I thought
at one time that one could be eliminated. But then I realized both are
needed. I'm not sure I can remember the details, though :(.
I think at the time where changesets are created and written to the
database, some data that's stored in rosters (and thus roster_deltas)
isn't around yet, for example node ids.
It could be that combining them would lead to a time savings.
The ability to generate one out of the other, given the right set of
source data, might be enough. As changesets are managed first, my
attempt is to efficiently generate roster_deltas out of changesets plus
the related rosters.
But I'm looking for a more direct approach: Is there a way to get figure
out the revision that matches a roster?
Rosters are fetched in the first place by revision id, so I'm not sure
what you are asking.
I was deep inside the code, and at some point, it was only shuffling
rosters, with no revision ids in sight. I just looked a couple of levels
further up, and everything needed seems to be there.
For example:
db.get_roster(left_rid, left_roster, left_marking_map);
db.get_roster(right_rid, right_roster, right_marking_map);
This is from the merge code. If you could point to the section of the
sync code you are worried about, it might help.
This gave me an idea: database::put_roster handles the deltification of
rosters. The revision ids are available there, so I could hand over the
changesets to the delta_rosters(...) call there. See below.
5 lookups instead of 50000.
Computing a roster delta should not involve retrieving the file text
for every file; doesn't it check if the file id (hash of the contents)
has changed first? Those are given in the roster, so comparing them
doesn't count as a "lookup".
Yes, but look how it's done. It's in roster_delta.cc,
make_roster_delta_t(...):
It creates a datastructure containing information about _all_ files in
the roster (ie. all 50000 in my case), and then loops over them and
notes the differences.
The cset already contains a list of all changes between the two
revisions, so the code could walk a changeset to figure out which roster
entries to handle. As most changesets are rather small, that should give
a nice speed up.
What are you defining as a "lookup"?
This iterator is used to look up the changes: parallel::iter<node_map>
i(from.all_nodes(), to.all_nodes());
Regards,
Patrick