From: Johannes Meixner
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Tue, 12 Jun 2012 16:52:16 +0200 (CEST)
User-agent: Alpine 2.00 (LNX 1167 2008-08-23)

Hello,

On Jun 1 12:02 Jim Meyering wrote (excerpt):
>   i='\xC4\xB0'
>   printf "$i$i$i$i$i$i$i\n" > in
>   LC_ALL=en_US.UTF-8 grep -i .... in > out
>   cmp in out > /dev/null || echo FAIL
>
> As I mentioned in the link above, this is a problem because of
> the way grep's -i is implemented: it converts both the RE and
> the buffer-to-search to lower case, and then performs the search.
>
> The problem arises with Turkish-I because the conversion changes
> the length of the buffer (in the example test, the input is 15 bytes
> long -- 7 x 2-byte I-with-dot + newline, yet the lower case version
> has a length of just 8: 7 x lower-cased i + NL), and the code
> returns the match offset and length relative to the shortened
> lower-case buffer (that lower-cased buffer is internal to code
> duplicated in EGexecute/Fexecute), yet it uses those offset,length
> numbers to manipulate the original buffer.
>
> Without re-architecting too much, one solution is to change
> mbtolower to return additional information: a malloc'd mapping
> vector M, of the same length as its returned buffer, where M[i]
> is the length-in-bytes of the character that formed byte i of
> the result. With that, or something similar, the caller could
> then map the currently-erroneous offset,len numbers to equivalent
> numbers that apply to the original buffer.
>
> This mapping could be allocated/defined only when lengths actually
> differ, so that the cost in general would be negligible.
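
The mapping-vector idea quoted above can be illustrated with a minimal Python sketch. This is not grep's code: `simple_lower`, `mbtolower_with_map` and `map_match` are hypothetical names, and the sketch assumes one particular variant of the vector (the first byte of each lower-cased character records the byte length of the original character, any further bytes record 0), so that sums over the vector give offsets in the original buffer.

```python
def simple_lower(ch):
    """Toy stand-in for a one-to-one lower-casing such as towlower()."""
    # str.lower() uses full case mapping, so U+0130 becomes "i" + U+0307.
    # Keeping only the first code point imitates a one-to-one mapping.
    return ch.lower()[0]


def mbtolower_with_map(buf):
    """Lower-case UTF-8 bytes and return (lowered bytes, length map)."""
    lowered = bytearray()
    length_map = []
    for ch in buf.decode("utf-8"):
        orig_len = len(ch.encode("utf-8"))
        low_bytes = simple_lower(ch).encode("utf-8")
        lowered += low_bytes
        # Assumed variant: first byte carries the original length, rest 0.
        length_map += [orig_len] + [0] * (len(low_bytes) - 1)
    return bytes(lowered), length_map


def map_match(length_map, off, length):
    """Translate (offset, length) in the lowered buffer to the original.

    Assumes the match starts and ends on character boundaries.
    """
    orig_off = sum(length_map[:off])
    orig_len = sum(length_map[off:off + length])
    return orig_off, orig_len


# The test input from the report: 7 x U+0130 plus a newline.
original = ("\u0130" * 7 + "\n").encode("utf-8")
lowered, length_map = mbtolower_with_map(original)
print(len(original), len(lowered))   # 15 8
print(map_match(length_map, 0, 4))   # (0, 8)
```

A match for `....` found at offset 0, length 4 in the lowered buffer corresponds to offset 0, length 8 in the original, which is the correction the caller would need before touching the original buffer.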

I am not at all a localization expert and perhaps I misunderstand
something, but it may not be safe to test only whether the lengths
differ.

I fear there may be a locale setting with a multibyte character
string whose lower-cased counterpart has the same length, but where
the character positions in the two strings nevertheless do not match.

I am thinking of something like a two-character string
"[3-byte-upper-case-character-1][2-byte-upper-case-character-2]"
whose lower-cased counterpart is
"[2-byte-lower-case-character-1][3-byte-lower-case-character-2]"

Something like "[AAA][BB]" versus "[aa][bbb]", where
[AAA] is a 3-byte upper-case character and
[aa] is its 2-byte lower-case counterpart, and
[BB] is a 2-byte upper-case character and
[bbb] is its 3-byte lower-case counterpart.

Do such or similar strings actually exist?
If yes, could such strings still cause errors?

Kind Regards
Johannes Meixner
--
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
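
On the question of whether such strings exist, one candidate pair can be checked with a short Python sketch. It relies on Python's own Unicode tables, so whether a given C library's towlower() maps these characters the same way in a given locale is an assumption. `char_offsets` is a hypothetical helper name.

```python
# U+2126 OHM SIGN is 3 bytes in UTF-8; its lower case U+03C9 is 2 bytes.
# U+023A LATIN CAPITAL LETTER A WITH STROKE is 2 bytes; its lower case
# U+2C65 is 3 bytes.
upper = "\u2126\u023A"
lower = upper.lower()


def char_offsets(s):
    """Byte offset at which each character starts in the UTF-8 encoding."""
    offsets, pos = [], 0
    for ch in s:
        offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    return offsets


same_total = len(upper.encode("utf-8")) == len(lower.encode("utf-8"))
upper_offsets = char_offsets(upper)
lower_offsets = char_offsets(lower)

print(same_total)      # True: both strings are 5 bytes long
print(upper_offsets)   # [0, 3]
print(lower_offsets)   # [0, 2]
```

Under these assumptions the total byte length is unchanged while the second character starts at a different offset, so a check that only compares buffer lengths would not notice the shift.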