From: | Johannes Meixner |
Subject: | Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8 |
Date: | Fri, 15 Jun 2012 15:00:29 +0200 (CEST) |
User-agent: | Alpine 2.00 (LNX 1167 2008-08-23) |
Hello,

On Jun 14 07:44 Paul Eggert wrote (excerpt):
> On 06/14/2012 04:07 AM, Johannes Meixner wrote:
>> Is grep's -i implemented via plain convert to lower case
>> or is it actually implemented via "case folding"?
>
> I'm not sure which you mean by "plain convert" and by
> "case folding", but it should handle the Greek sigma
> case correctly. If there are bugs please let us know.
I meant the difference between "convert to lower case"
and "case folding" as described in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
But I am not at all an expert in this area, so I may
misunderstand things.


The Greek sigma case:

I think that the Greek sigma case is not handled correctly
in grep-2.7 which I use.

My steps to reproduce it:

1.
From http://www.utf8-chartable.de/unicode-utf8-table.pl
I have this table:

Unicode | UTF-8 (oct.) | name
----------------------------------------------------------
U+03A3  | 0316 0243    | GREEK CAPITAL LETTER SIGMA
U+03C2  | 0317 0202    | GREEK SMALL LETTER FINAL SIGMA
U+03C3  | 0317 0203    | GREEK SMALL LETTER SIGMA

2.
I set a Greek UTF-8 locale:

$ export LC_ALL=el_GR.utf8 ; export LANG=el_GR.utf8

3.
I create four UTF-8 files with those characters:

For the file names I use the following ASCII characters
to denote the content of the file:
'S' means a GREEK CAPITAL LETTER SIGMA
'f' means a GREEK SMALL LETTER FINAL SIGMA
's' means a GREEK SMALL LETTER SIGMA

$ echo -e '\0316\0243\0316\0243' >SS
$ echo -e '\0316\0243\0317\0202' >Sf
$ echo -e '\0317\0203\0317\0202' >sf
$ echo -e '\0317\0203\0317\0203' >ss

4.
Testing what grep versus grep -i finds:

$ grep -q -i -f SS ss && echo yes || echo no
yes
$ grep -q -i -f ss SS && echo yes || echo no
yes
$ grep -q -i -f Sf sf && echo yes || echo no
yes
$ grep -q -i -f sf Sf && echo yes || echo no
yes
$ grep -q -i -f SS sf && echo yes || echo no
no
$ grep -q -i -f sf SS && echo yes || echo no
no

My conclusion:

The last two "no" are an error as far as I understand
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
Therein, in the "Caseless Matching" sub-section, an example
is given that GREEK CAPITAL LETTER SIGMA should match
GREEK SMALL LETTER FINAL SIGMA.
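The expectation in the conclusion above can be cross-checked outside of grep. The following is only a minimal sketch in Python, not grep's code: `simple_lower` and `caseless_equal` are hypothetical helper names, and lowercasing one character at a time is an assumed stand-in for a simple lowercase conversion. Python's built-in `str.casefold()` applies Unicode case folding.

```python
# Characters decoded from the octal UTF-8 bytes in the table above.
CAPITAL_SIGMA = b"\316\243".decode("utf-8")  # U+03A3
FINAL_SIGMA = b"\317\202".decode("utf-8")    # U+03C2
SMALL_SIGMA = b"\317\203".decode("utf-8")    # U+03C3


def simple_lower(text):
    """Lowercase one character at a time.

    Hypothetical stand-in for a simple per-character lowercase
    conversion; this is an assumption, not taken from grep's source.
    """
    return "".join(ch.lower() for ch in text)


def caseless_equal(a, b):
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


SS = CAPITAL_SIGMA * 2           # content of file SS
sf = SMALL_SIGMA + FINAL_SIGMA   # content of file sf

# Per-character lowercasing keeps final sigma distinct from small sigma.
print(simple_lower(SS) == simple_lower(sf))  # False
# Case folding maps all three sigmas to small sigma.
print(caseless_equal(SS, sf))                # True
```

With per-character lowercasing the two strings differ, which is consistent with the two "no" results; with case folding they compare equal.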
I think this is caused by "the way grep's -i is implemented:
it converts both the RE and the buffer-to-search to lower case"
so that GREEK CAPITAL LETTER SIGMA gets converted to
GREEK SMALL LETTER SIGMA, which does not match
GREEK SMALL LETTER FINAL SIGMA.

If grep's -i were implemented by "case folding",
according to how I understand
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
then I assume something like the following should be done:

In both the RE and the buffer-to-search
GREEK CAPITAL LETTER SIGMA gets converted to
GREEK SMALL LETTER SIGMA
and
GREEK SMALL LETTER FINAL SIGMA gets converted to
GREEK SMALL LETTER SIGMA
so that in the end there is only GREEK SMALL LETTER SIGMA
in both the RE and the buffer-to-search, and that matches.


The German sharp s case:

Another example is the LATIN SMALL LETTER SHARP S (U+00DF)
which is described in the "Complications for Case Mapping"
sub-section in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf

The LATIN SMALL LETTER SHARP S (U+00DF / octal 0303 0237)
expands when uppercased to the sequence of two characters "SS".

In this mail I use the ASCII character 'f' to denote
a LATIN SMALL LETTER SHARP S (U+00DF / octal 0303 0237).

There is the German lowercase word 'heif' (English 'hot')
and when 'heif' is uppercased it becomes 'HEISS'.

Therefore for "grep -i" 'heif' and 'HEISS' should match.
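The uppercase expansion described above can be checked in a few lines. Again this is only a minimal sketch in Python, not grep's code: `simple_lower` is a hypothetical helper, and lowercasing one character at a time is an assumed stand-in for a simple lowercase conversion.

```python
# LATIN SMALL LETTER SHARP S decoded from its octal UTF-8 bytes.
SHARP_S = b"\303\237".decode("utf-8")  # U+00DF

# The word written as 'heif' in this mail.
heif = "hei" + SHARP_S


def simple_lower(text):
    """Lowercase one character at a time (hypothetical helper; an
    assumption, not taken from grep's source)."""
    return "".join(ch.lower() for ch in text)


# Uppercasing expands the sharp s to the two characters "SS".
print(heif.upper() == "HEISS")                      # True
# Lowercasing both sides leaves "ss" and sharp s distinct.
print(simple_lower("HEISS") == simple_lower(heif))  # False
# Python's built-in Unicode case folding makes both sides equal.
print(heif.casefold() == "HEISS".casefold())        # True
```

Note that Python's `str.casefold()` folds the sharp s to lowercase 'ss' rather than to 'SS'.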
$ export LC_ALL=de_DE.utf8 ; export LANG=de_DE.utf8
$ echo -e 'hei\0303\0237' >heif
$ echo 'HEISS' >HEISS
$ grep -q -i -f heif HEISS && echo yes || echo no
no
$ grep -q -i -f HEISS heif && echo yes || echo no
no

In this case it seems "case folding" can be implemented
as follows:

In both the RE and the buffer-to-search
LATIN SMALL LETTER SHARP S gets converted to 'SS'

To make it more complicated, since Unicode 5.1 there exists
LATIN CAPITAL LETTER SHARP S (U+1E9E / octal 0341 0272 0236)
so that 'HEISS' could also be written as

$ echo -e 'HEI\0341\0272\0236' >HEIF

I use 'F' to denote a LATIN CAPITAL LETTER SHARP S.

For "grep -i" 'heif' and 'HEISS' and 'HEIF' should match.

Therefore "case folding" for LATIN SHARP S in general
might be implemented as follows:

In both the RE and the buffer-to-search
LATIN SMALL LETTER SHARP S gets converted to 'SS'
and
LATIN CAPITAL LETTER SHARP S gets converted to 'SS'

I wonder if it is right or wrong to convert 'SS' to 'ss'
in an additional step, which would mean that LATIN SHARP S
would match 'ss'. I don't know if this is the right meaning
of caseless matching, because there are the German words
'Masse' (English 'mass') and 'Mafe' (English 'measures');
again I wrote 'f' to denote a LATIN SMALL LETTER SHARP S.

Both words have different meanings, so I think that
for "grep -i" 'Masse' and 'Mafe' should not match.

Therefore I think "case folding" should result in
something like:
'Masse' is case folded to 'masse'
and
'Mafe' is case folded to 'maSSe'
without an additional step which converts 'SS' to 'ss',
so that 'Masse' and 'Mafe' do not match for caseless matching.

In the end I think "case folding" means to have a list of
mappings for special characters that says how to convert them,
in both the RE and the buffer-to-search, into a fixed form
(i.e. a sequence of bytes) which is appropriate for
caseless matching using a binary comparison, see
"Caseless Matching" in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf


Kind Regards
Johannes Meixner
-- 
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer