bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#19306: [PATCH 1/2] dfa: avoid execution for a pattern including an u


From: Jim Meyering
Subject: bug#19306: [PATCH 1/2] dfa: avoid execution for a pattern including an unsupported expression
Date: Sun, 26 Jul 2015 09:10:05 -0700

On Mon, Jul 20, 2015 at 8:14 AM, Norihiro Tanaka <address@hidden> wrote:
>
> On Sun, 19 Jul 2015 20:14:52 -0700
> Jim Meyering <address@hidden> wrote:
>
>> Thank you for the additional information and the test script.
>> I like most of this patch, but not the fact that it causes the
>> word-delim-multibyte test to fail. I do see that also applying your
>> following patch makes that test pass once again.  However, it does so
>> at the cost of forcing a new class of regexps (any that contain a use
>> of \b, \<  or \>) from DFA into the slower regex matcher.
>
> I think DFA forces regex for BEGWORD, LIMWORD, ENDWORD, instead of
> whether patching or not.  Could you remark code in dfassbuild() without
> patching?  It seem that DFA rejects their words from before.
>
>         case BEGWORD:
>         case ENDWORD:
>         case LIMWORD:
>         case NOTLIMWORD:
>           if (d->multibyte)
>             {
>               /* These constraints aren't supported in a multibyte locale.
>                  Ignore them in the superset DFA, and treat them as
>                  backreferences in the main DFA.  */
>               sup->tokens[j++] = EMPTY;
>               d->tokens[i] = BACKREF;  <<<<
>               break;
>             }
>
> DFA does not handle word context in multibyte correctly.  Perhaps, if we
> fix it, DFA will take a performance penalty.

You're right. That covers it.

This patch is good also for eliminating some technical debt.
Since your change eliminates the code below (effectively making
the same change from conjunct to disjunct), there is no reason
to make the following correction:

               /* Falling back to the glibc matcher in this case gives   \
                  better performance (up to 25% better on [a-z], for     \
                  example) and enables support for collating symbols and \
                  equivalence classes.  */                               \
-              if (d->states[s].has_mbcset && backref)                   \
+              if (d->states[s].has_mbcset || backref)                   \
                 {                                                       \
                   *backref = 1;                                         \
                   goto done;                                            \
                 }                                                       \

At first I was surprised to see that using "&&" there provoked
no test failure, but then saw that we test "backref" shortly thereafter.
While technically, using "&&" there is a bug, it seems the effect was
negligible.

I have made some changes, renaming functions, e.g., dfabackref -> dfa_supported,
since even before this change "backref" meant more than the presence
of a backreference.

Note that while your commit log entry described new functions, I have
modified the commit
log to say merely "new function" for each. Instead, I document the new
functions in the code,
where that documentation will be more useful, and maintained.

Please let me know if there is anything you'd like to change:

Attachment: 0001-dfa-avoid-execution-for-a-pattern-including-an-unsup.patch
Description: Text Data

Attachment: 0002-dfa-remove-word-delimiter-support-for-multibyte-loca.patch
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]