Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8

bug-grep

From:	Johannes Meixner
Subject:	Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date:	Tue, 12 Jun 2012 16:52:16 +0200 (CEST)
User-agent:	Alpine 2.00 (LNX 1167 2008-08-23)


Hello,

On Jun 1 12:02 Jim Meyering wrote (excerpt):


   i='\xC4\xB0'
   printf "$i$i$i$i$i$i$i\n" > in
   LC_ALL=en_US.UTF-8 grep -i .... in > out
   cmp in out > /dev/null || echo FAIL

As I mentioned in the link above, this is a problem because of the way
grep's -i is implemented: it converts both the RE and the buffer-to-search
to lower case, and then performs the search.  The problem arises with
the Turkish dotted I because the conversion changes the length of the
buffer (in the example test, the input is 15 bytes long -- 7 x 2-byte
I-with-dot + newline, yet the lower case version has a length of just 8:
7 x lower-cased i + NL), and the code returns the match offset and length
relative to the shortened lower-case buffer (that lower-cased buffer is
internal to code duplicated in EGexecute/Fexecute), yet it uses those
offset and length values to manipulate the original buffer.
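
The length mismatch described above can be reproduced outside grep.
What follows is a minimal Python sketch, not grep's code.  simple_lower
is a hypothetical stand-in for a per-character towlower(), and its
mapping of U+0130 to plain "i" is an assumption taken from the
description in this thread (Python's own str.lower() returns "i"
followed by U+0307 instead).

```python
def simple_lower(ch: str) -> str:
    """Hypothetical one-to-one lower-casing, similar in spirit to towlower()."""
    if ch == "\u0130":          # assumption: I-with-dot lower-cases to plain 'i'
        return "i"
    low = ch.lower()
    # Keep the mapping strictly one character to one character.
    return low if len(low) == 1 else ch


# Same input as the test case: 7 x I-with-dot (2 bytes each) + newline.
original = ("\u0130" * 7 + "\n").encode("utf-8")
lowered = "".join(simple_lower(c) for c in original.decode("utf-8")).encode("utf-8")

print(len(original), len(lowered))  # 15 8

# The pattern "...." matches four characters.  In the lowered buffer that
# is 4 bytes, but the same four characters occupy 8 bytes in the original.
match_len_lowered = len(lowered.decode("utf-8")[:4].encode("utf-8"))
match_len_original = len(original.decode("utf-8")[:4].encode("utf-8"))

# Applying the lowered-buffer length to the original buffer covers only
# two of the four matched characters.
wrong_slice = original[:match_len_lowered].decode("utf-8")
```

Here wrong_slice holds only two I-with-dot characters, which is the kind
of truncated result the failing test exposes.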

Without re-architecting too much, one solution is to change mbtolower to
return additional information: a malloc'd mapping vector M, of the same
length as its returned buffer, where M[i] is the length in bytes of the
original character that produced byte i of the result.  With that, or
something similar, the caller could then map the currently-erroneous
offset and length values to equivalent values that apply to the original
buffer.  This mapping could be allocated/defined only when lengths
actually differ, so that the cost in general would be negligible.
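
The mapping-vector idea can be sketched in a few lines of Python.  This
is an illustration under stated assumptions, not the patch that went
into grep: lower_with_map, to_original and simple_lower are hypothetical
names, the input is assumed to be valid UTF-8, and character boundaries
in the lowered buffer are found by skipping UTF-8 continuation bytes.

```python
def simple_lower(ch: str) -> str:
    """Hypothetical one-to-one lower-casing, similar in spirit to towlower()."""
    if ch == "\u0130":          # assumption: I-with-dot lower-cases to plain 'i'
        return "i"
    low = ch.lower()
    return low if len(low) == 1 else ch


def lower_with_map(buf: bytes):
    """Return (lowered, m); m[i] is the byte length of the original
    character that produced byte i of the lowered buffer."""
    out = bytearray()
    m = []
    for ch in buf.decode("utf-8"):
        src_len = len(ch.encode("utf-8"))
        low = simple_lower(ch).encode("utf-8")
        out += low
        m.extend([src_len] * len(low))
    return bytes(out), m


def is_lead_byte(b: int) -> bool:
    """True for the first byte of a UTF-8 character (not 10xxxxxx)."""
    return (b & 0xC0) != 0x80


def to_original(lowered: bytes, m, off: int, length: int):
    """Translate an (offset, length) pair from the lowered buffer to the
    original buffer by summing m once per character."""
    orig_off = sum(m[i] for i in range(off) if is_lead_byte(lowered[i]))
    orig_len = sum(m[i] for i in range(off, off + length)
                   if is_lead_byte(lowered[i]))
    return orig_off, orig_len


buf = ("\u0130" * 7 + "\n").encode("utf-8")
lowered, m = lower_with_map(buf)
print(to_original(lowered, m, 0, 4))  # (0, 8)
```

Because the translation walks the vector character by character, it
does not depend on the total lengths of the two buffers being different.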


I am not a localization expert and may be misunderstanding something,
but it may not be safe to test only whether the lengths differ.

I fear that in some locale there could be a multibyte character string
whose lower-cased counterpart has the same total length, yet whose
character positions do not line up with those of the original string.

I am thinking of something like a two-character string
"[3-byte-upper-case-character-1][2-byte-upper-case-character-2]"
whose lower-cased counterpart is
"[2-byte-lower-case-character-1][3-byte-lower-case-character-2]"

Something like "[AAA][BB]" versus "[aa][bbb]", where
[AAA] is a 3-byte upper-case character,
[aa] is its 2-byte lower-case counterpart,
[BB] is a 2-byte upper-case character, and
[bbb] is its 3-byte lower-case counterpart.

Do such strings, or similar ones, actually exist?
If so, could they still cause errors?
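
One way to probe the question is to look for such a pair in Unicode's
own case mappings.  The short Python check below uses U+2C6F and U+023A,
which appear to fit the "[AAA][BB]" versus "[aa][bbb]" pattern in UTF-8.
It relies on Python's built-in Unicode tables; whether a particular C
library's towlower() applies the same mappings in a given locale is an
assumption that would need to be verified separately.

```python
def char_byte_lengths(s: str):
    """UTF-8 byte length of each character in s."""
    return [len(c.encode("utf-8")) for c in s]


# U+2C6F (3 bytes in UTF-8) followed by U+023A (2 bytes in UTF-8).
upper = "\u2C6F\u023A"
lower = upper.lower()

upper_bytes = upper.encode("utf-8")
lower_bytes = lower.encode("utf-8")

# Total byte lengths of the two strings.
print(len(upper_bytes), len(lower_bytes))

# Per-character byte lengths before and after lower-casing.
print(char_byte_lengths(upper), char_byte_lengths(lower))

# Byte offset at which the second character starts in each buffer.
second_start_upper = char_byte_lengths(upper)[0]
second_start_lower = char_byte_lengths(lower)[0]
```

With these two characters the total lengths agree while the second
character starts at a different byte offset in each buffer, so a test
that compares only total lengths would not notice the shift.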


Kind Regards
Johannes Meixner
--
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
