Re: [bug-gawk] gawk bug

bug-gawk

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug-gawk] gawk bug

From:	Aharon Robbins
Subject:	Re: [bug-gawk] gawk bug
Date:	Tue, 25 Jun 2013 08:46:16 +0300
User-agent:	Heirloom mailx 12.5 6/20/10

Greetings.

> Date: Thu, 16 May 2013 15:50:45 +0800
> From: Steven Daniels <address@hidden>
> To: address@hidden
>
> I'm getting an Assertion failed when I try the following:
>
> > $ echo '很?pos=ad 宽广?pos=va ， 更?pos=ad 是?pos=vc1' | gawk '{match($0, /(([^ 
> > \?.]*\?pos=ad |([^ \?.]*\?pos=(jj|va) )[地]\?pos=dev ){0,2})/ , arr)}  { 
> > if(arr[0]) print arr[1], arr[4], $6} '
> > Assertion failed: (&musts[2] <= mp), function dfamust, file dfa.c, line 
> > 3951.
> > [1]    13263 done       echo '很?pos=ad 宽广?pos=va ， 更?pos=ad 是?pos=vc1' |
> >        13264 abort      gawk
>
>
> The point of failure seems to be "[地]", when brackets aren't used, the 
> command works as expected.
>
> $ echo '很?pos=ad 宽广?pos=va ， 更?pos=ad 是?pos=vc1' | gawk '{match($0, /(([^ 
> \?.]*\?pos=ad |([^ \?.]*\?pos=(jj|va) )地\?pos=dev ){0,2})/ , arr)}  { 
> if(arr[0]) print arr[1], arr[4], $6} '    
> # => 很?pos=ad
>
> $gawk --version   
> GNU Awk 4.0.2
>
> Thanks.
>
> -Steven Daniels  

Mike Haertel, the original author of the dfa code, was kind enough to
track this down and fix it for me.  His explanation and patch are below.

I have checked this fix into gawk-4.1-stable and master branches of the
git repo for gawk, and it will be in the next release, whenever that is.

You should be able to manually apply this patch to the 4.0.2 or 4.1.0
sources if you don't want to build out of the git repo.

I plan to add your input and test programs into the gawk test suite.

Much thanks!

Arnold

> To: address@hidden
> Subject: tentative fix for DFA bug, with explanation
> Date: Sun, 23 Jun 2013 12:24:03 -0700
> From: Mike Haertel <address@hidden>
>
> Since this is 7-bit email, let FOO stand for the chinese character
> in the text regexp.  It turns out the following much simpler regexp:
>       ([^.]*[FOO]){1,2}
> is sufficient to cause the crash.
>
> Some background info: in the first step of its parsing, DFA transforms
> regexp from human readable syntax into reverse-polish form.  For
> example, if the regexp is a|b (where a and b are arbitrary
> subexpressions), the RPN representation looks like:
>
>       <RPN representation for a>
>       <RPN representation for b>
>       OR
>
> and if the regexp is ab, the RPN representation is
>
>       <RPN representation for a>
>       <RPN representation for b>
>       CAT
>
> For regexps of the form a{m,n} repeat counts, it simply builds
> repeated copies of the representation of a, with appropriate inserted
> CAT and QMARK operators.  For the above example with a regexp of
> the form a{1,2} it would build:
>
>       <RPN representation for a>
>       <RPN representation for a>
>       QMARK
>       CAT
>
> When building repeated copies of RPN representations, additional
> copies of the RPN representations are made by calling a function
> copytoks() with arguments consisting of the start position and
> length of the original copy.
>
> The problem is that the current code for copytoks() is simply
> incorrect.  It operates by calling addtok() for each individual
> token in the source range being copied.  But, in the particular
> case that the token being added is MBCSET, addtok():
>
> (1) incorrectly assumes that the character set being added to be added
>     is the one most (addtok has no argument to indicate which cset is
>     being added, so it just uses the latest one)
>
> (2) attempts to do some token sequence expansion into more primitive
>     operators so things like [FOO] are matched efficiently.
>
> Both of these assumptions are incorrect in the case that addtok()
> is being called from copytoks(): (1) is simply not true, and
> (2) is redundant--the expansion has already been done token sequence
> being copied, so there is no need to do the expansion again.
>
> Based on my reading of the code, it looks like the correct function to
> add exactly one token, without further expansion, is addtok_mb().
>
> So here is my proposed fix, which is that copytoks() should never call
> addtok(), but instead directly call addtok_mb() (which is what addtok()
> eventually calls).
>
> diff --git a/dfa.c b/dfa.c
> index 54e0ae9..2195e28 100644
> --- a/dfa.c
> +++ b/dfa.c
> @@ -1847,13 +1847,12 @@ copytoks (size_t tindex, size_t ntokens)
>  {
>    size_t i;
>  
> -  for (i = 0; i < ntokens; ++i)
> -    {
> -      addtok (dfa->tokens[tindex + i]);
> -      /* Update index into multibyte csets.  */
> -      if (MB_CUR_MAX > 1 && dfa->tokens[tindex + i] == MBCSET)
> -        dfa->multibyte_prop[dfa->tindex - 1] = dfa->multibyte_prop[tindex + 
> i];
> -    }
> +  if (MB_CUR_MAX > 1)
> +    for (i = 0; i < ntokens; ++i)
> +      addtok_mb(dfa->tokens[tindex + i], dfa->multibyte_prop[tindex + i]);
> +  else
> +    for (i = 0; i < ntokens; ++i)
> +      addtok_mb(dfa->tokens[tindex + i], 3);
>  }
>  
>  static void
>  
> With this fix, both tp1.awk and tp2.awk pass.  I haven't tested it on anything
> else, since I have no idea how to run your validation suite (if any).
>
> So please try this out on your test suite and let me know how it works.

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [bug-gawk] gawk bug, Aharon Robbins <=

Prev by Date: Re: [bug-gawk] Inconsistent External Command Spawning
Next by Date: Re: [bug-gawk] Issue with read timeout on TCP socket
Previous by thread: [bug-gawk] Inconsistent External Command Spawning
Next by thread: Re: [bug-gawk] Issue with read timeout on TCP socket
Index(es):
- Date
- Thread