[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.
From: |
Paolo Bonzini |
Subject: |
Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase. |
Date: |
Wed, 10 Mar 2010 21:27:23 +0100 |
>> Due to the issue with brackets, however we'll have to live
>> either with dfa.c being only 99% correct, or with it punting to
>> glibc's regex matcher.
>
> I think dfa needs to be made smart enough to know when it must
> punt to regex.
It already knows. However, for [a-x], [= =] and [. .] currently it
prefers a nonexact result to a punting. It only punts for
backreferences.
> In fact, I think we should assume that dfa lives in a "regex is also
> around" environment (true of grep and gawk, probably for gettext too),
> and therefore the interface should be broadened to include a "don't
> trust me, call regex yourself" return code.
It's already there, again.
>> Finding a balance is hard also because on
>> non-glibc platform we then lose speed _but not gain precision_.
>
> I don't understand this. All of gawk, grep, and gettext should include
> a recent version of regex for use on non-GLIBC systems anyway. So
> in practice, regex is always available.
Regex gets bracket expressions right on glibc systems only. On
non-glibc systems only, regcomp/regexec might get them right, but
regex will have exactly the same fallback as current dfa.c.
> I think I have an obligation at this point to mention:
>
> http://swtch.com/~rsc/regexp/
>
> In particular, there is code there for an "Efficient (non-backtracking)
> NFA implementation with submatch tracking. Accepts UTF-8 and
> wide-character Unicode input. Traditional egrep syntax only. Written by
> Rob Pike."
>
> Perhaps this can serve as the basis for a new unified matcher?
I think this is very much related to the algorithms already in use by
regex. A unified matcher will always be slower than DFA.
Paolo