bug-grep


From: Jim Meyering
Subject: bug#22655: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine)
Date: Sat, 20 Feb 2016 20:19:20 -0800

On Sun, Feb 14, 2016 at 12:02 PM, Ulya Fokanova <address@hidden> wrote:
> I've explored the following case:
>
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
>    6
>
> It's a bug (there should be no match).
>
> This is what grep does:
>
>  * tries to build DFA (as in dfa.c)
>  * fails to expand character range [1-4] because of multibyte
>    locale en_US.utf-8 and gives up building DFA (marks [1-4] as BACKREF
>    that suppresses all dfa.c-related code), note the difference with
>    [1234] case in which there's no need to expand multibyte range
>  * falls back to Regex (gnulib extension of regex.h)
>  * Regex doesn't support '-z' semantics (the closest configuration to
>    '-z' is RE_NEWLINE_ALT, which is already included in RE_SYNTAX_GREP
>    set), so '\n' is treated as newline and match erroneously succeeds
>
> I think this should be worked around in grep: before calling 're_search' it
> should split the input string by 'eolbyte'.
>
> The bug is also present with the PCRE engine:
>
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
>    6
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
>    6

Thank you for the analysis and the report.
I have fixed the regex-oriented problem with the attached
patch, but not yet the case using -P -z (PCRE + --null-data):

Attachment: 0001-grep-z-avoid-erroneous-match-with-regexp-anchor-and-.patch
Description: Text Data
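
For readers following the thread, the behaviour described in the quoted
report can be illustrated with a small sketch. It is not the attached
patch and does not use grep's code: Python's re module stands in for the
regex matcher, re.MULTILINE stands in for anchors that treat '\n' as a
line boundary, and the function name matching_records and its
newline_anchors parameter are hypothetical names chosen for this example.

```python
import re
from typing import List


def matching_records(data: bytes, pattern: bytes, newline_anchors: bool) -> List[bytes]:
    """Return the NUL-terminated records of `data` that match `pattern`.

    newline_anchors=True imitates the reported bug: '^' and '$' also match
    next to '\n' inside a record. newline_anchors=False imitates the
    intended -z behaviour, where only record boundaries count.
    Caveat: without re.MULTILINE, Python's '$' still matches just before a
    final '\n' in a record; this sketch does not try to handle that case.
    """
    flags = re.MULTILINE if newline_anchors else 0
    rx = re.compile(pattern, flags)
    # Split on the end-of-record byte (NUL for -z) before searching.
    records = data.split(b"\0")
    if records and records[-1] == b"":
        records.pop()  # drop the empty tail after the final terminator
    return [rec for rec in records if rx.search(rec)]


data = b"12\n34\0"
buggy = matching_records(data, rb"^[1-4]*$", newline_anchors=True)
intended = matching_records(data, rb"^[1-4]*$", newline_anchors=False)
print(buggy)     # the record matches because '$' matches before '\n'
print(intended)  # no record matches
```

With newline-sensitive anchors the single record "12\n34" is reported as
a match, which corresponds to the 6 bytes counted by wc -c above; with
anchors bound to record boundaries nothing matches, which is the result
the report expects.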

