bug-grep

Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8


From: Johannes Meixner
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Tue, 12 Jun 2012 16:52:16 +0200 (CEST)
User-agent: Alpine 2.00 (LNX 1167 2008-08-23)


Hello,

On Jun 1 12:02 Jim Meyering wrote (excerpt):

   i='\xC4\xB0'
   printf "$i$i$i$i$i$i$i\n" > in
   LC_ALL=en_US.UTF-8 grep -i .... in > out
   cmp in out > /dev/null || echo FAIL

As I mentioned in the link above, this is a problem because of the way
grep's -i is implemented: it converts both the RE and the buffer-to-search
to lower case, and then performs the search.  The problem arises with
Turkish-I because the conversion changes the length of the buffer (in
the example test, the input is 15 bytes long -- 7 x 2-byte I-with-dot
+ newline, yet the lower case version has a length of just 8: 7 x
lower-cased i + NL), and the code returns the match offset and length
relative to the shortened lower-case buffer (that lower-cased buffer is
internal to code duplicated in EGexecute/Fexecute), yet it uses those
offset,length numbers to manipulate the original buffer.
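
The length arithmetic in the paragraph above can be reproduced with a
small sketch. Python is used here only for illustration (grep itself is
C). The helper `simple_lower` is hypothetical: it hard-codes the
one-to-one lowering of I-with-dot to `i` that the excerpt describes,
because Python's own `str.lower()` may map that character differently.

```python
# Illustration only: grep is written in C; this mirrors the byte arithmetic.
I_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE, UTF-8 bytes C4 B0


def simple_lower(text):
    # Assumption: one-to-one lowering of I-with-dot to ASCII 'i', as the
    # excerpt describes. Not a real locale-aware implementation.
    return "".join("i" if ch == I_DOT else ch.lower() for ch in text)


text = I_DOT * 7 + "\n"
original = text.encode("utf-8")
lowered = simple_lower(text).encode("utf-8")

# The line occupies bytes [0, line_len) of the lower-cased buffer.
line_len = lowered.index(b"\n") + 1

# Applying that length to the original buffer cuts the line short.
wrong_slice = original[:line_len]

print(len(original), len(lowered))   # 15 8
print(wrong_slice == original)       # False
```

Here the slice taken with the lower-cased length covers only four of
the seven characters and drops the newline, which is one way the
mismatch between the two buffers could show up.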

Without re-architecting too much, one solution is to change mbtolower to
return additional information: a malloc'd mapping vector M, of the same
length as its returned buffer, where M[i] is the length-in-bytes of the
character that formed byte i of the result.  With that, or something
similar, the caller could then map the currently-erroneous offset,len
numbers to equivalent numbers that apply to the original buffer.  This
mapping could be allocated/defined only when lengths actually differ,
so that the cost in general would be negligible.
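
A minimal sketch of the mapping idea, under stated assumptions. It uses
a slight variant of the vector described above ("something similar"):
the first byte of each lower-cased character records the byte length of
the original character, and any further bytes of that character record
0, so that prefix sums give original offsets directly. The names
`lower_with_map` and `to_original` are hypothetical, not grep
functions, and the I-with-dot lowering is hard-coded as in the excerpt.

```python
I_DOT = "\u0130"  # 2 bytes in UTF-8


def lower_with_map(text):
    """Return (lowered_bytes, m), with one entry of m per lowered byte."""
    lowered = bytearray()
    m = []
    for ch in text:
        # Assumption: I-with-dot lowers to ASCII 'i' (one-to-one mapping).
        low = "i" if ch == I_DOT else ch.lower()
        low_bytes = low.encode("utf-8")
        lowered += low_bytes
        # First byte carries the original character's byte length ...
        m.append(len(ch.encode("utf-8")))
        # ... remaining bytes of the same lowered character carry 0.
        m.extend([0] * (len(low_bytes) - 1))
    return bytes(lowered), m


def to_original(m, offset, length):
    """Map (offset, length) in the lowered buffer to the original buffer.

    Both ends are assumed to fall on character boundaries.
    """
    start = sum(m[:offset])
    end = sum(m[:offset + length])
    return start, end - start


text = I_DOT * 7 + "\n"
original = text.encode("utf-8")
lowered, m = lower_with_map(text)

# A match of "...." in the lowered buffer is at offset 0, length 4.
print(to_original(m, 0, 4))             # (0, 8)
print(to_original(m, 0, len(lowered)))  # (0, 15)
```

Computing prefix sums on every lookup is quadratic in the worst case; a
real implementation would presumably precompute them or walk the vector
once.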

I am not at all a localization expert and I may be misunderstanding
something, but it may not be safe to test only whether the lengths differ.

I fear that in some locale there may exist a multibyte character
string whose lower-cased counterpart has the same length, yet the
character positions in the two strings do not match.

I am thinking of something like a two-character string
"[3-byte-upper-case-character-1][2-byte-upper-case-character-2]"
whose lower-cased counterpart is
"[2-byte-lower-case-character-1][3-byte-lower-case-character-2]"

Something like "[AAA][BB]" versus "[aa][bbb]", where
[AAA] is a 3-byte upper-case character,
[aa] is its 2-byte lower-case counterpart,
[BB] is a 2-byte upper-case character, and
[bbb] is its 3-byte lower-case counterpart.

Do such strings, or similar ones, actually exist?
If so, could such strings still cause errors?
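
As a hedged illustration of the question: Unicode does appear to
contain such pairs. The sketch below uses Python's Unicode case
mapping, which is not necessarily what `towlower` returns in a given C
locale, so it shows only that the byte lengths can trade places, not
that grep is affected. The helper `char_starts` is a hypothetical name.

```python
# U+1E9E (capital sharp S) is 3 bytes in UTF-8; its lower case U+00DF is 2.
# U+023A (A with stroke) is 2 bytes in UTF-8; its lower case U+2C65 is 3.
upper = "\u1e9e\u023a"
lower = upper.lower()  # Python's Unicode mapping, not a C locale's towlower


def char_starts(text):
    """Return (byte offsets where each character starts, total byte length)."""
    starts, pos = [], 0
    for ch in text:
        starts.append(pos)
        pos += len(ch.encode("utf-8"))
    return starts, pos


upper_starts, upper_len = char_starts(upper)
lower_starts, lower_len = char_starts(lower)

print(upper_len == lower_len)        # True: a length-only check sees no change
print(upper_starts, lower_starts)    # [0, 3] [0, 2]
```

With equal total lengths but different character boundaries, an offset
of 2 in the lower-cased buffer would land in the middle of the first
character of the original buffer.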


Kind Regards
Johannes Meixner
--
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer


