gnu-arch-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gnu-arch-users] tla1.1 plans


From: Tom Lord
Subject: Re: [Gnu-arch-users] tla1.1 plans
Date: Mon, 15 Sep 2003 17:38:39 -0700 (PDT)


    > From: Miles Bader <address@hidden>

    > >     > How does this relate to support for using directory-context in
    > >     > =tagging-methods, i.e., so a regexp can match FOO in one
    > >     > directory, but not another?  That _is_ something I'd really like
    > >     > to see.

    > > I'm slightly nervous about that stuff, actually.

    > > Implmented naively, it's going to slow-down inventories noticably on
    > > trees that use it.   (And slowing down inventories impacts everything
    > > else, just about.)

    > > Implemented to be fast -- well, that's just plain hard.

    > Can you share some more thought on the implementation details of the 
problem?

    > Is the full pathname available at the point where regexp matching gets 
done?


Near enough.   But that's not the point.


    > My initial assumption was that you could just match each file twice 
against
    > the current regexps, once with the basename (as now), and once with the 
full
    > path.  Then current regexps would still work, but people could also stick 
in
    > `foo/bar/.*\.c'.  Some problems with this -- (1) there's no way to write a
    > specific `only in the root dir' regexp that doesn't also match basenames, 
and
    > (2), the `.' meta-character matches '/', which might sometimes cause
    > surprising results.

These regexp matches, at core, are kinda-sorta cheap.   If you have a
bit of a clue about them, they're an incredibly useful tool.   If you
lack a clue -- they're an engraved, personalized invitation to hell.

During inventory we're not looking for subexpression positions (the
"pmatch" parameter to regexec).  We can presume that in the common
case the regexps aren't using wildly irregular features such as
backreferences.  They do often use the irregular feature of '^' and
'$' in patterns -- but usually at the beginning and/or end of the
entire pattern, which is a special case that Rx optimizes for and
takes advantage of.

So, (at least when ^ and $ surround the pattern) each call to regexec
during an inventory really does live up to the formal languages theory
and do a single, DFA-based pass over the string being tested.
Optimized, that's 10-20 instructions per filename character.
Unoptomized -- haven't checked recently -- perhaps twice that.

That's all well and good but it's not the complete story.  The DFA
used in that scan is potentially huge.   And even if it's not huge, it
might be large.   So, the DFA isn't just built -- it's built-on-demand
in a cache, discarding infrequently used parts as needed to make room
for needed parts -- leading to repeated reconstruction of the DFA if a
cache-thrashing condition is achieved.

On top of the linear scan, there's also a bit of overhead for each
call to regexec, just setting this up.

So, now, what do we have in inventory?  Naively, we have a need for
N*K calls to regexec where N is the number of files and directories in
the tree and K is the average number of regexps that need to be
considered to determine the disposition of a particular file.   Much
of the proposed flexability in =TODO involves adding new regexps
(increasing K).   Overall, the proposal encourages making ever-bigger
and ever-harier regexps -- increase the odds of cache-thrashing.

Now, partly using the unique features of Rx, I could take all the
regexps in the extended =tagging-method file, rewrite them slightly,
and make One Big Honkin Regexp that does the entire thing at once.
The result would be Rx doing a single pass over each relative path --
I'd be willing to bet we can keep the requisit cache-size needs quite
reasonable other than a few outliers.   On simple cases, like a tla
tree and maybe even an Emacs tree -- the One Big Honkin Regexp
approach would probably be (slightly)  _faster_ than the inventory you
see today.   But there's two problems:   

(a) it'd be a bitch to code.

(b) having coded it, users would get real freakin comformtable using
    it.  Then they'd try to stretch until they ran face first into 
    cache thrashing.   Then I'd spend the next two years trying to 
    explain regexp theory to people so that they could appropriately
    adjust their expectations.  That is not my idea of fun.

(It's because of instances like this that I think the world needs a 
new language for writing regular expressions --  a new langauge that
doesn't contain so many surprises in terms of what optimized matchers
wind up having to do.)



    > > So there's a cost/benefit question here and what I'm wondering (and
    > > taking as a hypothesis) is that the untagged-source foo is ample here.

    > I must admit that I can't see at all how untagged-source (useful though it
    > is) has any connection with directory-specific regexp matching...

Liberalize the patterns that classify files as "source, by name" --
but then use tagging (hence untagged-source) to prune the set.

-t




reply via email to

[Prev in Thread] Current Thread [Next in Thread]