bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte

bug-grep

From:	Aharon Robbins
Subject:	bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
Date:	Tue, 24 Sep 2013 15:24:45 +0300
User-agent:	Heirloom mailx 12.5 6/20/10

Hi Jim.

I should note that gawk uses its own regex, although it does rely
on glibc for isspace / iswspace etc...

Can you test gawk (using the master branch is fine) on Mac OS X?
Basically you'd want to enclose the pattern in /.../ on the command
line and use GAWK_NO_DFA=1 to force use of regex.

In any case, once you push the changes I'll pick them up.

Thanks,

Arnold

P.S. To test gawk, cut and paste:

        git clone git://git.savannah.gnu.org/gawk.git
        cd gawk
        ./bootstrap.sh && ./configure && make -j 10 # or whatever
        make check      # optional

        printf '....' | ./gawk '/.../'  # your tests here. :-)
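
The two kinds of run described above (gawk choosing its matcher by
default, then GAWK_NO_DFA=1 to force regex) could be wrapped in a small
shell helper. This is only a sketch: try_both and GAWK_BIN are made-up
names, not part of gawk, and the pattern must not contain a '/'. The
usage line writes NBSP as the octal escapes \302\240 rather than
\xc2\xa0, since not every sh printf understands \x.

```shell
# Hypothetical helper, not part of gawk: run one pattern through gawk twice,
# once with default matcher selection and once with GAWK_NO_DFA=1.
# GAWK_BIN is an assumed variable name; it defaults to the freshly built ./gawk.
GAWK_BIN=${GAWK_BIN:-./gawk}

# Usage: some-command | try_both 'REGEX'
try_both() {
    pattern=$1
    input=$(cat)    # slurp stdin so the same data can be fed to gawk twice

    # Default run: gawk may use the dfa matcher here.
    out=$(printf '%s\n' "$input" | "$GAWK_BIN" "/$pattern/")
    if [ -n "$out" ]; then echo "default: match"; else echo "default: no match"; fi

    # Forced regex run, as suggested in the message above.
    out=$(printf '%s\n' "$input" | GAWK_NO_DFA=1 "$GAWK_BIN" "/$pattern/")
    if [ -n "$out" ]; then echo "GAWK_NO_DFA=1: match"; else echo "GAWK_NO_DFA=1: no match"; fi
}
```

For example: printf 'a\302\240b\n' | try_both 'a\sb'
If the two lines of output differ, the dfa and regex matchers disagree
on that input.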

Much thanks!

> From: Jim Meyering <address@hidden>
> Date: Mon, 23 Sep 2013 14:04:09 -0700
> Subject: Re: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
> To: Aharon Robbins <address@hidden>, address@hidden
>
> [using the right bug address, this time]
>
> On Mon, Sep 23, 2013 at 11:26 AM, Aharon Robbins <address@hidden> wrote:
> > Hi.
> >
> >>     $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
> >>     match
> >>
> >> Now, require a back-reference (forcing switch from grep's DFA matcher
> >> to use of the regex functions), and you see there is no match:
> >>
> >>     $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
> >>     $
> >
> > I see similar results with gawk, accounting for syntactic difference
> > and a different way to force the regex matcher.
> >
> > So far so good.
> >
> >> Uh oh.  This is worse: \s is not multi-byte aware.
> >> The two-byte "NO-BREAK SPACE" character is not matched by \s.
> >>
> >> This fails:
> >>     $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
> >>     $
> >>
> >> This matches in spite of the fact that grep.texi says \s is
> >>      equivalent to [[:space:]] :
> >>     $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
> >>     a b
> >>
> >> GNU grep fails:
> >> (but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
> >>     $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1' grep:
> >>     $
> >
> > I cannot reproduce this with gawk.  Setting GAWK_NO_DFA=1 in the
> > environment causes gawk to bypass dfa. For these it makes no
> > difference:
> >
> > $ printf 'a\xc2\xa0b\n' | ./gawk '/a\sb/'
> > $ printf 'a\xc2\xa0b\n' | GAWK_NO_DFA=1 ./gawk '/a\sb/'
> >
> > No result from either, and similar results for [[:space:]].
>
> Hi Arnold,
> [re-adding CC to the bug tracker]
>
> Thanks for testing.
> When I test on glibc, I confirm what you report: [[:space:]] fails to
> match NBSP.  Makes me think either glibc's UTF8 attribute tables are
> wrong, or there's a bug in regex:
>
>   $ printf 'a\xc2\xa0b\n'|LC_ALL=en_US.UTF-8 grep 'a[[:space:]]b'
>   [Exit 1]
>
> Initially, I considered constructing a DFA that would match all UTF8
> white space characters (see the FIXME comment), and another that would
> match the complement of that set minus the set of invalid UTF8 bytes,
> but ended up preferring the simpler change.
>
> FTR, I tested this only on a system for which all tests passed (OS/X).
>  Very surprised to find it doesn't work on a glibc-based system.
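
For anyone repeating the NBSP tests quoted above on another system, it
can help to confirm first that the input really contains the bytes
c2 a0, and only then to ask the installed grep whether [[:space:]]
matches them. A rough sketch follows; the function names are invented
for illustration, and the locale name en_US.UTF-8 is taken from the
quoted test and assumed to be installed. If it is not, grep quietly
falls back to the C locale and the answer means nothing.

```shell
# Emit the test line "a<NBSP>b"; NBSP (U+00A0) is \302\240 in octal, i.e. c2 a0.
nbsp_line() {
    printf 'a\302\240b\n'
}

# Dump stdin as a contiguous hex string, e.g. "61c2a0620a".
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# Exit status 0 if the installed grep's [[:space:]] matches NBSP in the
# given locale (default: en_US.UTF-8, assumed to exist), 1 if it does not.
space_class_matches_nbsp() {
    nbsp_line | LC_ALL=${1:-en_US.UTF-8} grep -q 'a[[:space:]]b'
}
```

Running "nbsp_line | hex_bytes" shows the bytes being tested, and
"space_class_matches_nbsp && echo match" repeats the quoted
[[:space:]] check against whatever grep is first in PATH.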
