bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.


From: Aharon Robbins
Subject: Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.
Date: Wed, 10 Mar 2010 20:49:01 +0200

Hi Paolo,

> The main problems with the glibc DFA matcher are because it has to
> track subexpression boundaries.  So it won't be as fast as dfa.c in
> the general case.  Never.

I think this means we can't give up dfa.c, then.

> On the other hand it developed many optimizations for UTF-8 that right
> now make it faster than dfa.c in some common cases (while still being
> slower on others).  On glibc systems it is also the only matcher that
> can correctly handle bracket expressions (due to the shortsightedness
> of the POSIX committee, in my not-so-humble opinion).

Sigh.

> With respect to the former, the way to go is to add these
> optimizations to dfa.c, and to eliminate other obvious performance
> problems in grep and dfa.c's multibyte paths; I do have some patches
> for that, but like my RFC patch I'll wait for you and Jim to finish
> the sync.

That's OK.

> Due to the issue with brackets, however we'll have to live
> either with dfa.c being only 99% correct, or with it punting to
> glibc's regex matcher.

I think dfa needs to be made smart enough to know when it must
punt to regex.  

In fact, I think we should assume that dfa lives in a "regex is also
around" environment (true of grep and gawk, probably for gettext too),
and therefore the interface should be broadened to include a "don't
trust me, call regex yourself" return code.

> Finding a balance is hard also because on
> non-glibc platform we then lose speed _but not gain precision_.

I don't understand this.  All of gawk, grep, and gettext should include
a recent version of regex for use on non-GLIBC systems anyway. So
in practice, regex is always available.

> I do think that Norihirio's DFA patch goes a bit too far, but it has
> the merit of making the fastest possible dfa.c _as long as it doesn't
> punt to glibc regex_.  In some sense, it provides a baseline for
> comparison.  Does this make sense?

I didn't look at the patch, so I can't answer this question.

I think I have an obligation at this point to mention:

        http://swtch.com/~rsc/regexp/

In particular, there is code there for an "Efficient (non-backtracking)
NFA implementation with submatch tracking. Accepts UTF-8 and
wide-character Unicode input. Traditional egrep syntax only. Written by
Rob Pike."

Perhaps this can serve as the basis for a new unified matcher?

Thanks!

Arnold




reply via email to

[Prev in Thread] Current Thread [Next in Thread]