From: | Paul Eggert |
Subject: | bug#17376: [PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA |
Date: | Wed, 30 Apr 2014 12:04:50 -0700 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 |
On 04/30/2014 08:02 AM, Norihiro Tanaka wrote:
There is different behaviour for an invalid sequence between KWset and DFA.

    encode() { echo "$1" | tr ABC '\357\274\241'; }
    encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
    encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
    encode sABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
    encode sABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"

We expect all of them to give the same result, but only the 4th returns 1 row.
Sorry, but I am not observing this behavior. With grep 2.18, none of the commands output anything. The same is true for the git master.
If the pattern or data have encoding errors, POSIX says grep can do whatever it likes. As I understand it, in grep 2.18 and the git master, an encoding-error byte in a pattern matches only the same encoding-error byte in the data. Does this bug report's patch change behavior, so that an encoding-error byte in a pattern can match part of a valid multibyte character in the data? If so, it's not clear to me why the proposed behavior change is helpful -- as a user, I'm not sure I'd want such a match to work. If not, then could you please explain a bit more what's going on?
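To make the terms concrete, here is a small sketch of what the encode helper from the report produces. It only dumps bytes with od and does not run grep, so it says nothing about which commands match. The three bytes written for "ABC" are the UTF-8 encoding of U+FF21, while "A" alone yields just the lead byte, which is an encoding error by itself.

```shell
# encode is copied from the bug report: A, B, C become the octal bytes 357, 274, 241.
encode() { echo "$1" | tr ABC '\357\274\241'; }

# "ABC" becomes one valid 3-byte UTF-8 character (U+FF21) plus the newline from echo.
encode ABC | od -An -to1    # bytes: 357 274 241 012

# "A" alone becomes only the lead byte 357: an incomplete sequence, i.e. an encoding error.
encode A | od -An -to1      # bytes: 357 012
```

So the question above is whether the lone 357 byte in the pattern should be allowed to match the first byte of the valid 357 274 241 character in the data.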
More generally, I don't think users care about encoding-error bytes in patterns. If it helps simplify the code and/or improves performance, I'd favor changing 'grep' so that it simply rejects patterns containing encoding errors, and exits with status 2. We should probably wait until after the next release before doing anything that drastic, though.
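As a rough illustration of the "reject and exit with status 2" idea, here is a minimal sketch done outside grep. The wrapper name checked_grep is hypothetical, and using iconv as the validity check is an assumption made for the example; it also assumes the pattern is meant to be UTF-8. This is not how grep itself would implement the check.

```shell
# Hypothetical wrapper, not part of grep: refuse patterns that are not valid UTF-8.
# Assumes an iconv utility is installed; iconv fails on invalid or incomplete input.
checked_grep() {
  pattern=$1
  shift
  if ! printf '%s' "$pattern" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    echo "checked_grep: pattern contains an encoding error" >&2
    return 2    # same status grep uses for errors
  fi
  grep -e "$pattern" "$@"
}
```

For example, `checked_grep "$(printf '\357')" file` would return 2 without running grep, while a pattern holding the complete 357 274 241 sequence would be passed through unchanged.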