[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.
From: |
Aharon Robbins |
Subject: |
Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase. |
Date: |
Wed, 10 Mar 2010 20:49:01 +0200 |
Hi Paolo,
> The main problems with the glibc DFA matcher are because it has to
> track subexpression boundaries. So it won't be as fast as dfa.c in
> the general case. Never.
I think this means we can't give up dfa.c, then.
> On the other hand it developed many optimizations for UTF-8 that right
> now make it faster than dfa.c in some common cases (while still being
> slower on others). On glibc systems it is also the only matcher that
> can correctly handle bracket expressions (due to the shortsightedness
> of the POSIX committee, in my not-so-humble opinion).
Sigh.
> With respect to the former, the way to go is to add these
> optimizations to dfa.c, and to eliminate other obvious performance
> problems in grep and dfa.c's multibyte paths; I do have some patches
> for that, but like my RFC patch I'll wait for you and Jim to finish
> the sync.
That's OK.
> Due to the issue with brackets, however we'll have to live
> either with dfa.c being only 99% correct, or with it punting to
> glibc's regex matcher.
I think dfa needs to be made smart enough to know when it must
punt to regex.
In fact, I think we should assume that dfa lives in a "regex is also
around" environment (true of grep and gawk, probably for gettext too),
and therefore the interface should be broadened to include a "don't
trust me, call regex yourself" return code.
> Finding a balance is hard also because on
> non-glibc platform we then lose speed _but not gain precision_.
I don't understand this. All of gawk, grep, and gettext should include
a recent version of regex for use on non-GLIBC systems anyway. So
in practice, regex is always available.
> I do think that Norihirio's DFA patch goes a bit too far, but it has
> the merit of making the fastest possible dfa.c _as long as it doesn't
> punt to glibc regex_. In some sense, it provides a baseline for
> comparison. Does this make sense?
I didn't look at the patch, so I can't answer this question.
I think I have an obligation at this point to mention:
http://swtch.com/~rsc/regexp/
In particular, there is code there for an "Efficient (non-backtracking)
NFA implementation with submatch tracking. Accepts UTF-8 and
wide-character Unicode input. Traditional egrep syntax only. Written by
Rob Pike."
Perhaps this can serve as the basis for a new unified matcher?
Thanks!
Arnold