dfa - gawk matching problem on windows and suggested fix
From: Aharon Robbins
Subject: dfa - gawk matching problem on windows and suggested fix
Date: Sun, 02 Oct 2011 21:14:17 +0200
User-agent: Heirloom mailx 12.4 7/29/08
Hi Grep Guys.
A while back David Millis reported a rather strange problem with gawk 4.0.0
on Windows:
> Date: Sat, 10 Sep 2011 23:13:25 -0700 (PDT)
> From: David Millis <address@hidden>
> To: address@hidden
> Subject: [bug-gawk] 4.0.0 Regex Patterns Choke on Exotic Chars
>
> # A bug in GNU AWK 4.0.0's regex handling?
> # 3.1.6 (GnuWin32)/3.1.7 (Jgawk?, had |& intact) worked.
> # It cripples manipulation of mildly exotic chars.
> # In Windows anyway (Binary: http://www.klabaster.com/freeware.htm#dl).
> # I couldn't reproduce it in Debian with 4.0.0.
>
> BEGIN {
> # For this, escaping is no different from pasting the genuine char.
> badChar = "\x95";
> # This is a bullet (\x95, vim: ctrl-v+149) in the Win-1252 codepage.
> # It happens to be in the \x80-\x9f range
> # where Win-1252 diverges from strict Latin-1.
> # Most apps don't care, but this might be the issue...
> # Hmm, middledot (\xb7, vim: ctrl-v+183) shows the same behavior.
>
> print badChar; # Prints fine
> print gensub(/\x95/, "@", "", badChar); # Error
>
> # The char is acceptable as the gsub/gensub replacement arg.
> # But not as the pattern: be it /literal/ or "string".
> # Upon reaching the line, gsub/gensub throw "unbalanced )".
> # Or an "internal error" if used in a character class /[\x95]/.
>
> # Mundane escapes like \x22 for double-quote are fine.
> }
>
>
> I sent this to Eli Zaretskii, who replied:
> > This also happens in 3.1.8 (on Windows).
> >
> > Please send this bug report to address@hidden,
> > I have no idea what is wrong with this character,
> > and why only on Windows.
>
>
> David
Eli finally tracked this down. His report and fix follow. Can y'all
comment on this please? In particular, is there a different or better way to
fix this? Unless I hear differently from you, I plan to apply the
patch in the next day or two.
Thanks,
Arnold
> Date: Fri, 30 Sep 2011 16:33:35 +0300
> From: Eli Zaretskii <address@hidden>
> Subject: Re: [bug-gawk] 4.0.0 Regex Patterns Choke on Exotic Chars
> To: address@hidden
> Cc: address@hidden, address@hidden
>
> > Date: Mon, 12 Sep 2011 07:19:10 GMT
> > From: address@hidden
> > Cc: address@hidden, address@hidden
> >
> > Otherwise, it looks like a problem with compiling the regular expression.
> > Start with make_regexp and keep digging down. You may want to try
> > compiling without optimization; I've seen the regex code break optimizers
> > before.
>
> No, optimizations have nothing to do with this (I see the problem in a
> non-optimized build as well).
>
> This bug is caused by the most mundane and dull issue with mixing
> signed and unsigned. To tell the truth, I never expected to see such
> issues in GNU sources that have been in use for such a long time.
>
> Here's the thing. The fatal error comes from here:
>
>   regexp();
>
>   if (tok != END)
>     dfaerror(_("unbalanced )"));
>
> I.e., dfaparse expects all the string to be exhausted when `regexp'
> returns. In `regexp' we see:
>
> static void
> regexp (void)
> {
>   branch();
>   while (tok == OR)
>     {
>       tok = lex();
>       branch();
>       addtok(OR);
>     }
> }
>
> where `branch' does this:
>
> static void
> branch (void)
> {
>   closure();
>   while (tok != RPAREN && tok != OR && tok >= 0)
>     {
>       closure();
>       addtok(CAT);
>     }
> }
>
> Note that `branch' terminates the loop when `tok' is negative (and
> there are other subroutines of dfa.c that do the same). Now, `tok'
> is an enumerated data type that has a single negative value:
>
> typedef enum
> {
>   END = -1,
>
>   /* Ordinary character values are terminal symbols that match themselves.
>    */
>
>   EMPTY = NOTCHAR,  /* EMPTY is a terminal symbol that matches
>   ...
>
> NOTCHAR is 256. So obviously, `branch' assumes that `tok' will only
> be negative when its value is END. However, `lex' calls FETCH_WC and
> FETCH macros that on Windows return negative values for any character
> greater than 127. So the loop ends prematurely, and the rest is
> history.
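
To see why a negative character token cuts the parse short, here is a
small sketch in C of the loop condition quoted above. The token values
and the function name are hypothetical stand-ins chosen for
illustration (only END = -1 is taken from the enum above); this is not
the dfa.c code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token values; only TOK_END = -1 mirrors the enum above. */
enum { TOK_END = -1, TOK_OR = 257, TOK_RPAREN = 258 };

/* Count how many tokens a branch-style loop accepts before stopping.
   The stop condition is the same shape as the one in `branch':
   stop on RPAREN, on OR, or on any negative token.  */
size_t
tokens_consumed (const int *toks, size_t n)
{
  size_t i = 0;

  assert (toks != NULL);
  while (i < n && toks[i] != TOK_RPAREN && toks[i] != TOK_OR && toks[i] >= 0)
    i++;
  return i;
}
```

With the tokens 'a', 0x95, 'b', END the loop accepts all three
characters and stops at END. If 0x95 arrives as -107 instead, the loop
stops after 'a', and the caller then finds a token that is not END.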
>
> Why do we get negative values from FETCH_WC and FETCH? Because they
> assume that casting to an unsigned type converts a negative value to a
> positive one. But what happens in fact is sign extension, so instead
> of 0x95 we get 0xffffff95. Assigning this to a signed int (because
> `lex's return value has the same enumerated type mentioned above,
> which must be signed to accommodate -1) converts back to a
> negative value.
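
The conversion rule behind this can be shown with a minimal sketch in
C. The function names are made up for the example and do not appear in
dfa.c; the sketch only models a byte held in a signed char versus an
unsigned char, and makes no claim about exactly which expression in
dfa.c misbehaves.

```c
#include <assert.h>
#include <limits.h>

/* Store BYTE (0..255) in a signed char, then widen to unsigned int.
   Values above SCHAR_MAX are stored as negative numbers, so the
   widening conversion sign-extends them.  */
unsigned int
widen_signed (int byte)
{
  signed char sc;

  assert (byte >= 0 && byte <= UCHAR_MAX);
  /* Map 128..255 to -128..-1 explicitly, avoiding an
     implementation-defined out-of-range conversion.  */
  sc = (signed char) (byte > SCHAR_MAX ? byte - (UCHAR_MAX + 1) : byte);
  return (unsigned int) sc;
}

/* Store BYTE in an unsigned char first; widening keeps the value.  */
unsigned int
widen_unsigned (int byte)
{
  unsigned char uc;

  assert (byte >= 0 && byte <= UCHAR_MAX);
  uc = (unsigned char) byte;
  return (unsigned int) uc;
}
```

For 0x95, widen_signed returns the same value as (unsigned int) -107,
which is 0xffffff95 where unsigned int is 32 bits wide, while
widen_unsigned returns 0x95. Bytes up to 0x7f come out the same either
way, which matches the observation that \x22 is fine.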
>
> I can fix the problem with the following simple patch. I don't
> consider myself an expert on futzing with signed and unsigned values,
> so I'll leave it to the experts to figure out The Right Way if this
> one isn't. I did test the patch on GNU/Linux and verified that
> David's script works there after applying the patch below.
>
> 2011-09-30 Eli Zaretskii <address@hidden>
>
> * dfa.c (FETCH_WC, FETCH): Produce an unsigned value, rather than
> a sign-extended one. Fixes a bug on MS-Windows with compiling
> patterns that include characters with the 8th bit set.
> Reported by David Millis <address@hidden>.
>
> --- dfa.c.orig 2011-06-23 12:27:01.000000000 +0300
> +++ dfa.c 2011-09-30 16:06:25.609375000 +0300
> @@ -691,19 +691,22 @@ static unsigned char const *buf_end; /*
> else \
> { \
> wchar_t _wc; \
> + unsigned char uc; \
> cur_mb_len = mbrtowc(&_wc, lexptr, lexleft, &mbs); \
> if (cur_mb_len <= 0) \
> { \
> cur_mb_len = 1; \
> --lexleft; \
> - (wc) = (c) = (unsigned char) *lexptr++; \
> + uc = (unsigned char) *lexptr++; \
> + (wc) = (c) = uc; \
> } \
> else \
> { \
> lexptr += cur_mb_len; \
> lexleft -= cur_mb_len; \
> (wc) = _wc; \
> - (c) = wctob(wc); \
> + uc = (unsigned) wctob(wc); \
> + (c) = uc; \
> } \
> } \
> } while(0)
> @@ -718,6 +721,7 @@ static unsigned char const *buf_end; /*
> /* Note that characters become unsigned here. */
> # define FETCH(c, eoferr) \
> do { \
> + unsigned char uc; \
> if (! lexleft) \
> { \
> if ((eoferr) != 0) \
> @@ -725,7 +729,8 @@ static unsigned char const *buf_end; /*
> else \
> return lasttok = END; \
> } \
> - (c) = (unsigned char) *lexptr++; \
> + uc = (unsigned char) *lexptr++; \
> + (c) = uc; \
> --lexleft; \
> } while(0)
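
For comparison, the same idea (go through an unsigned char before
widening) can be written as a function rather than a macro. This is
only a sketch under assumptions: `struct lexer' and `fetch_byte' are
hypothetical names standing in for the lexptr/lexleft state, and the
-1 return plays the role of END. It is not proposed as a replacement
for the patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the lexer state (lexptr and lexleft).  */
struct lexer
{
  const char *ptr;   /* next byte to read */
  size_t left;       /* bytes remaining */
};

/* Return the next byte as a value in 0..UCHAR_MAX, or -1 when the
   input is exhausted.  Reading through an unsigned char guarantees
   that a byte with the high bit set never comes back negative.  */
int
fetch_byte (struct lexer *lx)
{
  unsigned char uc;

  assert (lx != NULL);
  if (lx->left == 0)
    return -1;
  uc = (unsigned char) *lx->ptr++;
  lx->left--;
  return uc;
}
```

Called on the two-byte input "\x95" followed by "a", this yields 0x95,
then 'a', then -1, so the only negative result is the end marker.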
>
>