bug-grep

bug#17376: [PATCH] grep: fix the different behaviour for a invalid seque


From: Paul Eggert
Subject: bug#17376: [PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA
Date: Wed, 30 Apr 2014 12:04:50 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

On 04/30/2014 08:02 AM, Norihiro Tanaka wrote:
There is different behaviour for an invalid sequence between KWset and DFA.

   encode() { echo "$1" | tr ABC '\357\274\241'; }
   encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
   encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
   encode sABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
   encode sABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"

We expect all of them to give the same result, but only the 4th returns 1 row.
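
For readers following along, here is a small sketch of what the `encode` helper above actually emits. It relies on one fact not spelled out in the report: the bytes 0xEF 0xBC 0xA1 (octal 357 274 241) are the UTF-8 encoding of U+FF21, so `encode ABC` is one valid character, while `encode A` is only its first byte and is therefore an encoding error on its own. The variable names `data_hex` and `pat_hex` are just for illustration.

```shell
# Same helper as in the report: map A, B, C to the bytes 0xEF, 0xBC, 0xA1.
encode() { echo "$1" | tr ABC '\357\274\241'; }

# Hex dump of the data line; the trailing 0a is the newline added by echo.
data_hex=$(encode ABC | od -An -tx1 | tr -d ' \n')

# Hex dump of the pattern fragment; "$(...)" strips the newline, as it does
# when the fragment is embedded in the grep pattern.
pat_hex=$(printf '%s' "$(encode A)" | od -An -tx1 | tr -d ' \n')

echo "data:    $data_hex"   # data:    efbca10a
echo "pattern: $pat_hex"    # pattern: ef
```

So the question in this thread is whether the lone byte `ef` in the pattern may match the first byte of the complete sequence `ef bc a1` in the data.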

Sorry, but I am not observing this behavior. With grep 2.18, none of the commands output anything. The same is true for the git master.

If the pattern or data have encoding errors, POSIX says grep can do whatever it likes. As I understand it, in grep 2.18 and the git master, an encoding-error byte in a pattern matches only the same encoding-error byte in the data. Does this bug report's patch change behavior, so that an encoding-error byte in a pattern can match part of a valid multibyte character in the data? If so, it's not clear to me why the proposed behavior change is helpful -- as a user, I'm not sure I'd want such a match to work. If not, then could you please explain a bit more what's going on?

More generally, I don't think users care about encoding-error bytes in patterns. If it helps simplify the code and/or improves performance, I'd favor changing 'grep' so that it simply rejects patterns containing encoding errors, and exits with status 2. We should probably wait until after the next release before doing anything that drastic, though.
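
As a rough illustration of the "reject and exit with status 2" idea, the check can be approximated outside grep with a shell wrapper. This is only a sketch under assumptions: `checked_grep` is a hypothetical name, not part of grep, it assumes the pattern is meant to be UTF-8, and it assumes an `iconv` that exits non-zero on invalid or incomplete input (as the glibc one does).

```shell
# Hypothetical wrapper (not part of grep): refuse patterns that are not
# valid UTF-8, mimicking the proposed "exit with status 2" behaviour.
checked_grep() {
  pattern=$1
  shift
  # Converting UTF-8 to UTF-8 acts as a validity check: iconv fails if the
  # input contains an invalid or truncated multibyte sequence.
  if ! printf '%s' "$pattern" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "checked_grep: pattern contains an encoding error" >&2
    return 2
  fi
  grep -e "$pattern" "$@"
}
```

With this wrapper, `checked_grep "$(printf '\357')" file` would return 2 without running grep at all, while a valid pattern is passed through unchanged.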




