bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
From: Aharon Robbins
Subject: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
Date: Tue, 24 Sep 2013 15:24:45 +0300
User-agent: Heirloom mailx 12.5 6/20/10
Hi Jim.
I should note that gawk uses its own regex, although it does rely
on glibc for isspace / iswspace etc...
Can you test gawk (using the master branch is fine) on Mac OS X?
Basically you'd want to enclose the pattern in /.../ on the command
line and use GAWK_NO_DFA=1 to force use of regex.
In any case, once you push the changes I'll pick them up.
Thanks,
Arnold
P.S. To test gawk, cut and paste:
git clone git://git.savannah.gnu.org/gawk.git
cd gawk
./bootstrap.sh && ./configure && make -j 10 # or whatever
make check # optional
printf '....' | ./gawk '/.../' # your tests here. :-)
Much thanks!
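
A minimal sketch of how the dfa and regex code paths might be compared in
one go. The helper names nbsp_line and run_both are made up for this
illustration and are not part of gawk. The input is spelled with octal
escapes on the assumption that the shell's printf may not accept the \x
form; GAWK_NO_DFA is cleared in a subshell for the dfa run so that a value
inherited from the environment does not skew the comparison.

```shell
#!/bin/sh
# Hypothetical helper: emit "a<NBSP>b" plus newline using octal escapes,
# which POSIX printf guarantees (the \x form is an extension).
nbsp_line() { printf 'a\302\240b\n'; }

# Sanity check: the input should be 5 bytes (a, 0xC2, 0xA0, b, newline).
nbsp_line | wc -c

# Hypothetical helper: count matching lines with dfa enabled, then with
# GAWK_NO_DFA=1 forcing the regex matcher.
# $1 = path to gawk, $2 = awk program such as '/a\sb/'
run_both() {
    echo "dfa:   $(nbsp_line | (unset GAWK_NO_DFA; "$1" "$2") | wc -l)"
    echo "regex: $(nbsp_line | GAWK_NO_DFA=1 "$1" "$2" | wc -l)"
}

# Assumes a freshly built ./gawk as in the steps above; override with GAWK=...
GAWK=${GAWK:-./gawk}
if [ -x "$GAWK" ]; then
    run_both "$GAWK" '/a\sb/'
    run_both "$GAWK" '/a[[:space:]]b/'
else
    echo "skipped: $GAWK not found or not executable"
fi
```

If the two counts differ for the same pattern, the two matchers disagree;
if both are 0, the pattern fails to match NBSP in either path.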
> From: Jim Meyering <address@hidden>
> Date: Mon, 23 Sep 2013 14:04:09 -0700
> Subject: Re: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
> To: Aharon Robbins <address@hidden>, address@hidden
>
> [using the right bug address, this time]
>
> On Mon, Sep 23, 2013 at 11:26 AM, Aharon Robbins <address@hidden> wrote:
> > Hi.
> >
> >> $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
> >> match
> >>
> >> Now, require a back-reference (forcing switch from grep's DFA matcher
> >> to use of the regex functions), and you see there is no match:
> >>
> >> $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
> >> $
> >
> > I see similar results with gawk, accounting for syntactic difference
> > and a different way to force the regex matcher.
> >
> > So far so good.
> >
> >> Uh oh. This is worse: \s is not multi-byte aware.
> >> The two-byte "NO-BREAK SPACE" character is not matched by \s.
> >>
> >> This fails:
> >> $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
> >> $
> >>
> >> This matches in spite of the fact that grep.texi says \s is
> >> equivalent to [[:space:]] :
> >> $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
> >> a b
> >>
> >> GNU grep fails:
> >> (but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
> >> $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1'
> >> $
> >
> > I cannot reproduce this with gawk. Setting GAWK_NO_DFA=1 in the
> > environment causes gawk to bypass dfa. For these it makes no
> > difference:
> >
> > $ printf 'a\xc2\xa0b\n' | ./gawk '/a\sb/'
> > $ printf 'a\xc2\xa0b\n' | GAWK_NO_DFA=1 ./gawk '/a\sb/'
> >
> > No result from either, and similar results for [[:space:]].
>
> Hi Arnold,
> [re-adding CC to the bug tracker]
>
> Thanks for testing.
> When I test on glibc, I confirm what you report: [[:space:]] fails to
> match NBSP. Makes me think either glibc's UTF8 attribute tables are
> wrong, or there's a bug in regex:
>
> $ printf 'a\xc2\xa0b\n'|LC_ALL=en_US.UTF-8 grep 'a[[:space:]]b'
> [Exit 1]
>
> Initially, I considered constructing a DFA that would match all UTF8
> white space characters (see the FIXME comment), and another that would
> match the complement of that set minus the set of invalid UTF8 bytes,
> but ended up preferring the simpler change.
>
> FTR, I tested this only on a system for which all tests passed (OS/X).
> Very surprised to find it doesn't work on a glibc-based system.