From: Robert Collins
Subject: Re: [Gnu-arch-users] Re: situations where cached revisions are not so good
Date: Mon, 29 Sep 2003 07:13:15 +1000

On Mon, 2003-09-29 at 00:44, Tom Lord wrote:
> I don't think it's a mystery to anyone that the build_rev problem can
> be abstractly reduced to an instance of SPF.
> 
> The problem is that, in practical reality, it's an instance of SPF
> where:
> 
> a) the cost of examining the structure of the graph, including the
>    weights on edges, varies wildly over different parts of the graph

Yep. In one of my previous emails I proposed biasing the weight of a
node by the cost to access it. Local nodes would then be examined before
remote nodes.
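
A rough sketch of that bias (the names and constants are made up for
illustration, nothing in tla works exactly like this):

# Hypothetical access-cost multipliers; the only point is the ordering:
# things already on local disk get examined before anything remote.
ACCESS_COST = {
    "revlib": 1,           # revision library on local disk
    "local-archive": 5,    # archive on the same machine
    "remote-archive": 50,  # network round trips dominate
}

def biased_weight(base_weight, location):
    """Bias an edge's weight by how expensive its archive is to reach."""
    return base_weight * ACCESS_COST[location]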

> b) on some parts of the graph, the costs of examining the graph
>    are infinite

Yes. Handling those seems trivial: throw that path away (or simply
record it with an arbitrarily high cost).

> c) the costs of examining the graph are in some places high-enough
>    that if we do too much of it, the optimal SPF solution is useless
>    because we've spent too much time computing it

I'm not at all convinced that SPF will do more tree searching than
Jason's hand-crafted algorithm. If we consider the repository as an
adjacency list:
node-N (links to other nodes)
then at any point retrieving the possible links is hopefully a single
round trip. While the worst-case behaviour of SPF is to build a minimal
spanning tree, a simple limit should contain that obvious worst case:
until we decide whether reverse patching is actually useful, the cost of
examining any revision lower than anything in ALMOST is infinite (while
still allowing summary changesets from a revision in our local revlib
to work).
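
To make that concrete, here is a rough sketch of the bounded search;
get_edges, below_almost and weight are assumed helpers, not tla's
actual interfaces:

import heapq
import itertools

def find_build_path(start, goal, get_edges, below_almost, weight):
    """Dijkstra-style search over archive revisions.

    get_edges(node) is assumed to fetch a node's whole adjacency list in
    one round trip; weight(edge) is the access-cost-biased edge weight;
    below_almost(rev) says whether a revision sits below everything in
    ALMOST.
    """
    tie = itertools.count()            # keeps heap entries comparable
    queue = [(0, next(tie), start, [])]
    done = set()
    while queue:
        cost, _, node, path = heapq.heappop(queue)
        if node == goal:
            return path                # cheapest complete path so far
        if node in done:
            continue
        done.add(node)
        for edge in get_edges(node):   # one round trip per node
            # The limit described above: until reverse patching is known
            # to be useful, anything below ALMOST costs "infinity", so we
            # simply never enqueue it.
            if below_almost(edge.target):
                continue
            heapq.heappush(queue, (cost + weight(edge), next(tie),
                                   edge.target, path + [edge]))
    return None                        # goal unreachable within the limit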

> From those reasons, it's not trivially SPF, it's a heuristic search
> whose goal is to approximate SPF in the face of incomplete knowledge
> and with the catch that asking the wrong questions during the
> heuristic makes the heuristic worse than useless.  (Perhaps there is a
> meta-problem that reduces to a different instance of SPF, sure...)

Yes, I think there is: weighting the links with the retrieval time and
an estimated cost based on the type of link (full, summary-N, single,
tag).
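
As a sketch, with the per-type constants invented purely to show the
relative ordering (they would need normalising against measured
retrieval times):

# Per-link-type base costs.  A tag is effectively a pointer, a single
# changeset is small, a summary-N covers N revisions in one fetch, and a
# full import/cacherev is the largest single transfer.
LINK_TYPE_COST = {
    "tag": 1,
    "single": 5,
    "summary": 20,
    "full": 100,
}

def link_weight(link_type, retrieval_time_s):
    """Combine a measured (or estimated) retrieval time with the
    per-type base cost to get the edge weight used by the search."""
    return retrieval_time_s + LINK_TYPE_COST[link_type]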

> d) we haven't gotten into it much, yes, but _representing_ the graph
>    raises questions of the physical representation of archives,
>    maintaining that representation in a txnally-sound manner,
>    minimizing the costs of querying the graph, and extending the
>    "method table" that all archives must implement.
> 
> So, there's problems well beyond just the search itself.
> 
>     > For a cost metric I suggest something like:
>     > number of links * RTT *(MAGIC1) + total KB * observed|configured
>     > bandwidth *(MAGIC2) + projected local tree copies *(MAGIC3)
> 
>     > where magic 1 2 and 3 are to normalise the metrics.
> 
> KB?  observed|configure bandwidth?
> 
> Yes, with complete a priori knowledge of the graph the problem is,
> indeed, quite trivial.

Well, for any sensible comparison of full vs summary vs single
changesets we do need to know the sizes (kilobits) involved. Any rule
of thumb without that knowledge will still get it wrong.
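
For reference, reading the "total KB * bandwidth" term as transfer time
(size divided by bandwidth), the metric quoted above would look roughly
like this; the magic factors are just normalisation knobs, and every
argument here is data we would have to store or probe:

def path_cost(num_links, rtt_s, total_kb, bandwidth_kb_per_s,
              projected_tree_copies,
              magic1=1.0, magic2=1.0, magic3=1.0):
    """Cost of one candidate build path: round trips, plus the time to
    move the bytes over the observed or configured bandwidth, plus a
    penalty per local tree copy the plan would force us to make."""
    return (num_links * rtt_s * magic1
            + (total_kb / bandwidth_kb_per_s) * magic2
            + projected_tree_copies * magic3)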

> It's the lack of complete a priori knowledge and the costs of
> acquiring knowledge about the graph that make it non-trivial.  And
> it's the implications of representing those extra datums (kb,
> bandwidth) that make it a larger design problem than just copying a
> graph algorithm from a data structures text.

:)

>   > Rinse and repeat until, after finishing a node's adjacencies you
>   > have one or more successful paths.
> 
> One way to understand the thing Miles was kvetching about is to say
> that your termination criteria there is wrong.   Having found just one
> path, I then have to decide whether or not it is worth continuing to
> look for a second one that might be far better.

Well, I was anticipating, based on the existing discussion, that all the
edges from a node would be retrieved in one hit. All I'm saying with
that termination criterion is that we add all the edges to the queue,
rather than evaluating termination at each one. (Otherwise we might
bring down a full rev when the changeset we wanted was in the same
node.)
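
In other words, the per-node step looks something like this
(hypothetical helper names again):

import heapq

def expand_node(node, goal, heap, get_edges, weight):
    """Enqueue *all* of a node's edges before testing for termination.

    The adjacency list comes back in one hit anyway; checking the goal
    only after everything is queued avoids fetching a full rev when a
    cheaper changeset sits in the same node."""
    edges = get_edges(node)
    for edge in edges:
        # id(edge) just keeps heap entries comparable on cost ties
        heapq.heappush(heap, (weight(edge), id(edge), edge))
    return any(edge.target == goal for edge in edges)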

> One approach here is to say: 
> 
>       * It's A.I.!  Yay!

[sarcastic comment skipped]


> 
> Another is to say:
> 
>       * It's A.I.!  Boo!
> 
>         Keep the search butt simple, easy to understand,
>           and easy to influence by changing the configuration
>           of revlibs, mirrors, and cachedrevs.
> 
>           Make it easy for users to configure their environment
>           in such a way that the butt simple search does a great
>           job 90% of the time.

Yes, which is what I've aimed at. No smart server needed. Butt simple.
Mirrors, revlibs and cached revs have immediate impact.

> and then there's everything in-between, of course.
> 
> (Incidentally, storing the sizes of various tar bundles in archives is
> one idea that's come around before and seems plausible enough.  The
> design issues of (d) have to be dealt with and, more to the point, it
> would need to be shown that there's a really good reason to do it
> since it wouldn't, in and of itself, instantly give a better
> build_rev algorithm.)

I think those reasons have already been presented.

Cheers,
Rob

-- 
GPG key available at: <http://members.aardvark.net.au/lifeless/keys.txt>.
