bug-grep

bug#23234: unexpected results with charset handling in GNU grep 2.23


From: Eric Blake
Subject: bug#23234: unexpected results with charset handling in GNU grep 2.23
Date: Wed, 6 Apr 2016 15:04:26 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.1

On 04/06/2016 01:25 PM, Björn JACKE wrote:
> Let's take this example using grep 2.23:
> 
> # echo -e "test\ntäst\ntest" | iconv -f utf8 -t latin1 | LC_ALL=C grep "st" ; echo $?

[As a side point, 'echo -e' is non-portable; better is to use printf.]
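A minimal sketch of the printf form, under a few assumptions: the helper name latin1_sample is hypothetical, and the octal escape \344 (the Latin-1 byte 0xe4 for "ä") stands in for the iconv step so that the sample does not depend on the terminal's encoding. It is not part of the original report.

```shell
# Hypothetical helper: emit the three test lines with "ä" encoded as the
# single Latin-1 byte 0xe4. printf interprets \n and octal escapes in its
# format string, so no 'echo -e' is needed.
latin1_sample() {
    printf 'test\nt\344st\ntest\n'
}

# Dump the bytes in hex to confirm that 0xe4 is present in the second line.
latin1_sample | od -An -tx1
```

Piping latin1_sample into grep gives the same input as the iconv pipeline in the report.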

Hmm.  POSIX says that a file is binary if it does not end in newline, if
it contains embedded NUL, or if it contains an encoding error.  But it
also says that LC_ALL=C is _required_ to treat all 256 byte values as
valid characters (ASCII is only required to treat 7-bit characters as
valid, and may reject 8-bit bytes, but LC_ALL=C is _not_ ASCII).  This
indeed looks like a bug in current grep.git, as I can reproduce it:

$ git rev-parse HEAD
2ba6ab34da05d3aebc5e7e3dfaedb1cf3ddc5a73
$ printf "test\ntäst\ntest\n" | iconv -f utf8 -t latin1 |
   LC_ALL=C src/grep "st"
test
Binary file (standard input) matches

Looks like grep is wrongly claiming that 0xe4 is not a valid character
in the single-byte C locale.

> I really hope this change will be reverted as soon as possible. I would rather
> prefer GNU grep to become posix compliant and not do any binary detection by
> default actually.

The change of treating encoding errors as binary files will NOT be
reverted, but here, you HAVE pointed out a bug where we are treating
something as binary that is NOT an encoding error (because by
definition, LC_ALL=C has no encoding errors - all 256 byte values are
characters).  So this is indeed a bug to be fixed.
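A rough sketch of the expected result, assuming a grep that supports the -a (--text) option, which treats the input as text regardless of binary detection. The helper name count_st is hypothetical, and \344 again stands for the Latin-1 byte 0xe4. Using -a here is a workaround for affected versions, not the fix itself.

```shell
# Hypothetical helper: count the lines containing "st" in the Latin-1
# sample while forcing grep to treat the input as text (-a).
count_st() {
    printf 'test\nt\344st\ntest\n' | LC_ALL=C grep -a -c 'st'
}

# All three lines contain "st", so this should print 3.
count_st
```

Without -a, a grep in which this bug is fixed should also match all three lines under LC_ALL=C, since no byte is an encoding error there.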

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
