Re: dfa fix for multibyte char inside square brackets

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: dfa fix for multibyte char inside square brackets

From:	Paolo Bonzini
Subject:	Re: dfa fix for multibyte char inside square brackets
Date:	Fri, 05 Jul 2013 08:42:20 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130514 Thunderbird/17.0.6

Il 25/06/2013 07:54, Aharon Robbins ha scritto:
> Hi All.
> 
> A few weeks ago I reported a bug in dfa that was causing an assertion
> failure in gawk. It occurred in UTF locales.  Attached are two test
> programs, one that fails, and one that doesn't, and input.  The failure
> is seen in gawk, I'm not sure how to reproduce in grep.
> 
> Mike Haertel was kind enough to track down the problem and fix it for
> me. His email is below.  I have already applied his patch to the gawk
> code.  I applied it to the dfa in grep 2.14 and grep passed its "make check"
> but that is just FYI.
> 
> If you (plural) think it makes sense, please apply Mike's fix to
> the grep dfa also, so that we can stay in sync.
> 
> Thanks!
> 
> Arnold
> 
>> To: address@hidden
>> Subject: tentative fix for DFA bug, with explanation
>> Date: Sun, 23 Jun 2013 12:24:03 -0700
>> From: Mike Haertel <address@hidden>
>>
>> Since this is 7-bit email, let FOO stand for the chinese character
>> in the text regexp.  It turns out the following much simpler regexp:
>>      ([^.]*[FOO]){1,2}
>> is sufficient to cause the crash.
>>
>> Some background info: in the first step of its parsing, DFA transforms
>> regexp from human readable syntax into reverse-polish form.  For
>> example, if the regexp is a|b (where a and b are arbitrary
>> subexpressions), the RPN representation looks like:
>>
>>      <RPN representation for a>
>>      <RPN representation for b>
>>      OR
>>
>> and if the regexp is ab, the RPN representation is
>>
>>      <RPN representation for a>
>>      <RPN representation for b>
>>      CAT
>>
>> For regexps of the form a{m,n} repeat counts, it simply builds
>> repeated copies of the representation of a, with appropriate inserted
>> CAT and QMARK operators.  For the above example with a regexp of
>> the form a{1,2} it would build:
>>
>>      <RPN representation for a>
>>      <RPN representation for a>
>>      QMARK
>>      CAT
>>
>> When building repeated copies of RPN representations, additional
>> copies of the RPN representations are made by calling a function
>> copytoks() with arguments consisting of the start position and
>> length of the original copy.
>>
>> The problem is that the current code for copytoks() is simply
>> incorrect.  It operates by calling addtok() for each individual
>> token in the source range being copied.  But, in the particular
>> case that the token being added is MBCSET, addtok():
>>
>> (1) incorrectly assumes that the character set being added to be added
>>     is the one most (addtok has no argument to indicate which cset is
>>     being added, so it just uses the latest one)
>>
>> (2) attempts to do some token sequence expansion into more primitive
>>     operators so things like [FOO] are matched efficiently.
>>
>> Both of these assumptions are incorrect in the case that addtok()
>> is being called from copytoks(): (1) is simply not true, and
>> (2) is redundant--the expansion has already been done token sequence
>> being copied, so there is no need to do the expansion again.
>>
>> Based on my reading of the code, it looks like the correct function to
>> add exactly one token, without further expansion, is addtok_mb().
>>
>> So here is my proposed fix, which is that copytoks() should never call
>> addtok(), but instead directly call addtok_mb() (which is what addtok()
>> eventually calls).
>>
>> diff --git a/dfa.c b/dfa.c
>> index 54e0ae9..2195e28 100644
>> --- a/dfa.c
>> +++ b/dfa.c
>> @@ -1847,13 +1847,12 @@ copytoks (size_t tindex, size_t ntokens)
>>  {
>>    size_t i;
>>  
>> -  for (i = 0; i < ntokens; ++i)
>> -    {
>> -      addtok (dfa->tokens[tindex + i]);
>> -      /* Update index into multibyte csets.  */
>> -      if (MB_CUR_MAX > 1 && dfa->tokens[tindex + i] == MBCSET)
>> -        dfa->multibyte_prop[dfa->tindex - 1] = dfa->multibyte_prop[tindex + 
>> i];
>> -    }
>> +  if (MB_CUR_MAX > 1)
>> +    for (i = 0; i < ntokens; ++i)
>> +      addtok_mb(dfa->tokens[tindex + i], dfa->multibyte_prop[tindex + i]);
>> +  else
>> +    for (i = 0; i < ntokens; ++i)
>> +      addtok_mb(dfa->tokens[tindex + i], 3);
>>  }
>>  
>>  static void
>>  
>> With this fix, both tp1.awk and tp2.awk pass.  I haven't tested it on 
>> anything
>> else, since I have no idea how to run your validation suite (if any).
>>
>> So please try this out on your test suite and let me know how it works.

Looks good, thanks!

Paolobug-g

[Prev in Thread]

Current Thread

[Next in Thread]

Re: dfa fix for multibyte char inside square brackets, Paolo Bonzini <=

Prev by Date: [bug #37737] Contained subexpressions don't follow POSIX
Next by Date: [PATCH 0/2] Fix for bug reported by Arnold
Previous by thread: [bug #37737] Contained subexpressions don't follow POSIX
Next by thread: [PATCH 0/2] Fix for bug reported by Arnold
Index(es):
- Date
- Thread