Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
From: Jim Meyering
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Fri, 01 Jun 2012 12:02:47 +0200

Strahinja Kustudic wrote:
> URL:
> <http://savannah.gnu.org/bugs/?36567>
>
> Summary: grep -i (case-insensitive) is broken with UTF8
> Project: grep
> Submitted by: kustodian
> Submitted on: Thu 31 May 2012 11:18:30 AM GMT
> Category: None
> Severity: 3 - Normal
> Item Group: None
> Status: None
> Privacy: Public
> Assigned to: None
> Open/Closed: Open
> Discussion Lock: Any
>
> Details:
>
> Since version 2.6.1 grep doesn't work correctly if you use a case-insensitive
> search with UTF8 encoding when there is a UTF8 character. Here is the
> example:
>
> # Without -i switch everything works correctly
> $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep 'AA'
> AA UTF8 char İ 12345
> AA 12345
>
>
> # With -i it breaks
> $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep -i 'AA'
> AA UTF8 char İ 12345AA 12345
>
>
> As you can see, it somehow deletes the newline character in the line which
> has a UTF8 'İ' character.
>
> Everything works correctly in versions 2.5.4 and below; it's broken from 2.6.1
> to the latest version (which is atm 2.6.12).
>
> This is a big concern, since it can break scripts which filtered UTF8 input

Thanks for the report.
This is the same bug that prompted the addition of the
tests/turkish-I test (still expected to fail):

    http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3413/focus=3417

Sorry no one has followed up since then.
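Here's another demonstrator:

    # U+0130 (capital I with dot above) is the two bytes C4 B0 in UTF-8
    i='\xC4\xB0'
    # Write one line of seven such characters plus a newline (15 bytes)
    printf "$i$i$i$i$i$i$i\n" > in
    # The line has at least four characters, so "out" should equal "in"
    LC_ALL=en_US.UTF-8 grep -i .... in > out
    cmp in out > /dev/null || echo FAIL
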
As I mentioned in the link above, this is a problem because of the way
grep's -i is implemented: it converts both the RE and the buffer-to-search
to lower case, and then performs the search. The problem arises with
turkish-I because the conversion changes the length of the buffer. In
the example test, the input is 15 bytes long (7 x 2-byte I-with-dot
+ newline), yet the lower-case version is just 8 bytes long (7 x
lower-cased i + NL). The code returns the match offset and length
relative to the shortened lower-case buffer (that lower-cased buffer is
internal to code duplicated in EGexecute/Fexecute), yet it uses those
offset,length numbers to manipulate the original buffer.
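
To make the byte arithmetic concrete, here is a minimal Python sketch
(not grep's code). simple_lower is a hypothetical stand-in for a
one-to-one lower-casing step such as C's towlower; Python's own
str.lower() maps U+0130 to "i" plus a combining dot, so it is not used
directly. The (offset 0, length 8) match is likewise an assumption
chosen to mirror the demonstrator above.

```python
def simple_lower(text: str) -> str:
    # Hypothetical one-to-one lower-casing: keep only the first code point
    # of Python's full mapping, so U+0130 becomes a plain ASCII "i".
    return "".join(ch.lower()[0] for ch in text)

# Seven I-with-dot characters (C4 B0 each) followed by a newline.
original = ("\u0130" * 7 + "\n").encode("utf-8")
lowered = simple_lower(original.decode("utf-8")).encode("utf-8")

print(len(original))  # 15
print(len(lowered))   # 8

# Assume the search reports the whole line in the lowered buffer.
match_offset, match_len = 0, len(lowered)

# Applying those numbers to the original buffer cuts the line short:
# only four characters survive and the trailing newline is lost.
wrong_slice = original[match_offset:match_offset + match_len]
print(wrong_slice.decode("utf-8"))
```

The truncated slice has no trailing newline, which matches the symptom
in the report, where two output lines run together.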

Without re-architecting too much, one solution is to change mbtolower to
return additional information: a malloc'd mapping vector M, of the same
length as its returned buffer, where M[i] is the length in bytes of the
original character that produced byte i of the result. With that, or
something similar, the caller could then map the currently-erroneous
offset,len numbers to equivalent numbers that apply to the original
buffer. This mapping could be allocated and defined only when the
lengths actually differ, so that the cost in general would be negligible.
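
A rough Python sketch of the "something similar" variant, under stated
assumptions: instead of per-byte character lengths, it records for each
byte of the lowered buffer the offset in the original buffer where the
source character starts, plus one sentinel entry for the end. The names
lower_with_map and to_original are hypothetical, and the lower-casing
is a simplified one-to-one stand-in for what mbtolower does in C.

```python
def lower_with_map(buf: bytes):
    """Lower-case a UTF-8 buffer and build an offset map back to it."""
    lowered = bytearray()
    offsets = []          # offsets[i]: start of the source char of lowered[i]
    src_pos = 0
    for ch in buf.decode("utf-8"):
        src_len = len(ch.encode("utf-8"))
        # Simplified one-to-one lower-casing (assumption, see lead-in).
        low = ch.lower()[0].encode("utf-8")
        lowered += low
        offsets.extend([src_pos] * len(low))
        src_pos += src_len
    offsets.append(src_pos)   # sentinel: end of the original buffer
    return bytes(lowered), offsets


def to_original(offsets, off, length):
    """Map (offset, length) in the lowered buffer to the original buffer."""
    start = offsets[off]
    end = offsets[off + length]
    return start, end - start


buf = ("\u0130" * 7 + "\n").encode("utf-8")
lowered, offsets = lower_with_map(buf)
# The whole lowered line (0, 8) maps back to the whole original line.
print(to_original(offsets, 0, len(lowered)))  # (0, 15)
```

Storing start offsets makes the translation two array lookups; a
per-byte length vector as described above would need a running sum
instead. Either way the map is only needed when the two buffers differ
in length.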