bug#23234: unexpected results with charset handling in GNU grep 2.23
From: Eric Blake
Subject: bug#23234: unexpected results with charset handling in GNU grep 2.23
Date: Wed, 6 Apr 2016 15:04:26 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.1
On 04/06/2016 01:25 PM, Björn JACKE wrote:
> Let's take this example using grep 2.23:
>
> # echo -e "test\ntäst\ntest" | iconv -f utf8 -t latin1 | LC_ALL=C grep "st" ; echo $?
[As a side point, 'echo -e' is non-portable; better is to use printf.]
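A minimal sketch of the printf form, assuming a POSIX shell: the octal escape \344 writes byte 0xe4 (ä in Latin-1) directly, so the sample depends neither on 'echo -e' nor on iconv and the terminal's encoding. The helper name latin1_sample is made up for illustration.

```shell
# Hypothetical helper: emit the three sample lines already encoded in Latin-1.
# \344 is the octal escape for byte 0xe4 (a-umlaut in ISO-8859-1).
latin1_sample() {
    printf 'test\nt\344st\ntest\n'
}

# Show the raw bytes first, then run the search from the report.
latin1_sample | od -An -tx1
latin1_sample | LC_ALL=C grep 'st'
echo "exit status: $?"
```

What grep prints for the middle line depends on the grep version in use.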
Hmm. POSIX says that a file is binary if it does not end in newline, if
it contains embedded NUL, or if it contains an encoding error. But it
also says that LC_ALL=C is _required_ to treat all 256 byte values as
valid characters (ASCII is only required to treat 7-bit characters as
valid, and may reject 8-bit bytes, but LC_ALL=C is _not_ ASCII). This
indeed looks like a bug in current grep.git, as I can reproduce it:
$ git rev-parse HEAD
2ba6ab34da05d3aebc5e7e3dfaedb1cf3ddc5a73
$ printf "test\ntäst\ntest\n" | iconv -f utf8 -t latin1 |
LC_ALL=C src/grep "st"
test
Binary file (standard input) matches
It looks like grep is wrongly claiming that 0xe4 is not a valid
character in the single-byte C locale.
> I really hope this change will be reverted as soon as possible. I would rather
> prefer GNU grep to become posix compliant and not do any binary detection by
> default actually.
The change of treating encoding errors as binary files will NOT be
reverted, but here, you HAVE pointed out a bug where we are treating
something as binary that is NOT an encoding error (because by
definition, LC_ALL=C has no encoding errors - all 256 byte values are
characters). So this is indeed a bug to be fixed.
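Until that is fixed, one possible workaround (a sketch, not something proposed earlier in this thread) is GNU grep's -a (--text) option, which makes grep process its input as text whatever the binary-file heuristic decides. The sample below uses printf with the octal escape \344 for byte 0xe4 and counts the matching lines:

```shell
# Latin-1 sample: \344 is the octal escape for byte 0xe4.
# -a (--text) tells grep to treat the input as text;
# -c prints the number of matching lines instead of the lines themselves.
count=$(printf 'test\nt\344st\ntest\n' | LC_ALL=C grep -a -c 'st')
echo "matching lines: $count"
```

All three sample lines contain "st", so with -a every one of them should be counted.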
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org