[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: dfa fix for multibyte char inside square brackets
From: |
Paolo Bonzini |
Subject: |
Re: dfa fix for multibyte char inside square brackets |
Date: |
Fri, 05 Jul 2013 08:42:20 +0200 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130514 Thunderbird/17.0.6 |
Il 25/06/2013 07:54, Aharon Robbins ha scritto:
> Hi All.
>
> A few weeks ago I reported a bug in dfa that was causing an assertion
> failure in gawk. It occurred in UTF locales. Attached are two test
> programs, one that fails, and one that doesn't, and input. The failure
> is seen in gawk, I'm not sure how to reproduce in grep.
>
> Mike Haertel was kind enough to track down the problem and fix it for
> me. His email is below. I have already applied his patch to the gawk
> code. I applied it to the dfa in grep 2.14 and grep passed its "make check"
> but that is just FYI.
>
> If you (plural) think it makes sense, please apply Mike's fix to
> the grep dfa also, so that we can stay in sync.
>
> Thanks!
>
> Arnold
>
>> To: address@hidden
>> Subject: tentative fix for DFA bug, with explanation
>> Date: Sun, 23 Jun 2013 12:24:03 -0700
>> From: Mike Haertel <address@hidden>
>>
>> Since this is 7-bit email, let FOO stand for the chinese character
>> in the text regexp. It turns out the following much simpler regexp:
>> ([^.]*[FOO]){1,2}
>> is sufficient to cause the crash.
>>
>> Some background info: in the first step of its parsing, DFA transforms
>> regexp from human readable syntax into reverse-polish form. For
>> example, if the regexp is a|b (where a and b are arbitrary
>> subexpressions), the RPN representation looks like:
>>
>> <RPN representation for a>
>> <RPN representation for b>
>> OR
>>
>> and if the regexp is ab, the RPN representation is
>>
>> <RPN representation for a>
>> <RPN representation for b>
>> CAT
>>
>> For regexps of the form a{m,n} repeat counts, it simply builds
>> repeated copies of the representation of a, with appropriate inserted
>> CAT and QMARK operators. For the above example with a regexp of
>> the form a{1,2} it would build:
>>
>> <RPN representation for a>
>> <RPN representation for a>
>> QMARK
>> CAT
>>
>> When building repeated copies of RPN representations, additional
>> copies of the RPN representations are made by calling a function
>> copytoks() with arguments consisting of the start position and
>> length of the original copy.
>>
>> The problem is that the current code for copytoks() is simply
>> incorrect. It operates by calling addtok() for each individual
>> token in the source range being copied. But, in the particular
>> case that the token being added is MBCSET, addtok():
>>
>> (1) incorrectly assumes that the character set being added to be added
>> is the one most (addtok has no argument to indicate which cset is
>> being added, so it just uses the latest one)
>>
>> (2) attempts to do some token sequence expansion into more primitive
>> operators so things like [FOO] are matched efficiently.
>>
>> Both of these assumptions are incorrect in the case that addtok()
>> is being called from copytoks(): (1) is simply not true, and
>> (2) is redundant--the expansion has already been done token sequence
>> being copied, so there is no need to do the expansion again.
>>
>> Based on my reading of the code, it looks like the correct function to
>> add exactly one token, without further expansion, is addtok_mb().
>>
>> So here is my proposed fix, which is that copytoks() should never call
>> addtok(), but instead directly call addtok_mb() (which is what addtok()
>> eventually calls).
>>
>> diff --git a/dfa.c b/dfa.c
>> index 54e0ae9..2195e28 100644
>> --- a/dfa.c
>> +++ b/dfa.c
>> @@ -1847,13 +1847,12 @@ copytoks (size_t tindex, size_t ntokens)
>> {
>> size_t i;
>>
>> - for (i = 0; i < ntokens; ++i)
>> - {
>> - addtok (dfa->tokens[tindex + i]);
>> - /* Update index into multibyte csets. */
>> - if (MB_CUR_MAX > 1 && dfa->tokens[tindex + i] == MBCSET)
>> - dfa->multibyte_prop[dfa->tindex - 1] = dfa->multibyte_prop[tindex +
>> i];
>> - }
>> + if (MB_CUR_MAX > 1)
>> + for (i = 0; i < ntokens; ++i)
>> + addtok_mb(dfa->tokens[tindex + i], dfa->multibyte_prop[tindex + i]);
>> + else
>> + for (i = 0; i < ntokens; ++i)
>> + addtok_mb(dfa->tokens[tindex + i], 3);
>> }
>>
>> static void
>>
>> With this fix, both tp1.awk and tp2.awk pass. I haven't tested it on
>> anything
>> else, since I have no idea how to run your validation suite (if any).
>>
>> So please try this out on your test suite and let me know how it works.
Looks good, thanks!
Paolobug-g
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- Re: dfa fix for multibyte char inside square brackets,
Paolo Bonzini <=