[bug #36682] Ignore case handling of special unicode characters (case fo

bug-grep

From:	Johannes Meixner
Subject:	[bug #36682] Ignore case handling of special unicode characters (case folding)
Date:	Tue, 19 Jun 2012 10:35:54 +0000
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.1.9) Gecko/20100317 SUSE/3.5.9-0.1.1 Firefox/3.5.9

URL:
  <http://savannah.gnu.org/bugs/?36682>

                 Summary: Ignore case handling of special unicode characters (case folding)
                 Project: grep
            Submitted by: jsmeix
            Submitted on: Tue 19 Jun 2012 10:35:53 AM GMT
                Category: None
                Severity: 3 - Normal
              Item Group: None
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any

    _______________________________________________________

Details:

Basically this is related to what is mentioned in the
"POSIX and --ignore-case" and "Unicode and --ignore-case"
sections in grep's TODO file.

The matching behaviour itself is not a bug in grep
but a missing feature.

However, the current behaviour is not sufficiently
described in the grep documentation, and that is probably
a (minor) documentation bug in current grep.

Currently "grep -i" does not implement "case folding"
as described in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf

Here "case folding" means having a list of mappings
for special characters (like the Greek sigma or
the German sharp s) that says how to convert them, in both
the RE and the buffer to search, into a sequence of bytes
that is suitable for caseless matching using
a binary comparison.
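
As a rough illustration of this definition (this is not grep code), Python's
built-in str.casefold() applies Unicode full case folding. The helper names
fold_to_bytes and caseless_equal are made up for this sketch.

```python
def fold_to_bytes(text: str) -> bytes:
    """Case-fold the text, then encode it so it can be compared bytewise."""
    return text.casefold().encode("utf-8")


def caseless_equal(a: str, b: str) -> bool:
    """Caseless comparison: binary comparison of the folded byte sequences."""
    return fold_to_bytes(a) == fold_to_bytes(b)


# Show what each special character from the report folds to.
for ch in ("\u03a3", "\u03c2", "\u03c3", "\u00df", "\u1e9e"):
    print(f"U+{ord(ch):04X} folds to {ch.casefold()!r}")
```

Folding is applied to both sides before the comparison, which is why the
comparison itself can stay a plain byte comparison.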

Currently for "grep -i" such special characters
do not match all of their case variants.

For example:

Unicode | UTF-8 (oct.)   | name
----------------------------------------------------------
U+03A3  | 0316 0243      | GREEK CAPITAL LETTER SIGMA
U+03C2  | 0317 0202      | GREEK SMALL LETTER FINAL SIGMA
U+03C3  | 0317 0203      | GREEK SMALL LETTER SIGMA
U+00DF  | 0303 0237      | LATIN SMALL LETTER SHARP S
U+1E9E  | 0341 0272 0236 | LATIN CAPITAL LETTER SHARP S
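
The octal byte values in this table can be re-derived with a few lines of
Python. This is only a checking aid; the function name utf8_octal is an
assumption of this sketch.

```python
import unicodedata


def utf8_octal(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated octal numbers."""
    return " ".join(f"0{byte:03o}" for byte in ch.encode("utf-8"))


# Rebuild the rows of the table from the code points alone.
for code_point in (0x03A3, 0x03C2, 0x03C3, 0x00DF, 0x1E9E):
    ch = chr(code_point)
    print(f"U+{code_point:04X} | {utf8_octal(ch):<14} | {unicodedata.name(ch)}")
```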

Note that according to the Unicode specification
converting LATIN SMALL LETTER SHARP S
to upper case results in 'SS' (two ASCII 'S')
and not in LATIN CAPITAL LETTER SHARP S,
but converting LATIN CAPITAL LETTER SHARP S
to lower case results in LATIN SMALL LETTER SHARP S.

When converting to lower case,
GREEK CAPITAL LETTER SIGMA gets converted
to GREEK SMALL LETTER SIGMA, which does not match
GREEK SMALL LETTER FINAL SIGMA.
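
These conversions can be observed in any environment that implements the
Unicode case mappings. The sketch below uses Python's str.upper() and
str.lower() on single characters. It says nothing about how grep converts
case internally, and the constant names are chosen just for this example.

```python
SHARP_S_SMALL = "\u00df"
SHARP_S_CAPITAL = "\u1e9e"
SIGMA_CAPITAL = "\u03a3"
SIGMA_SMALL = "\u03c3"
SIGMA_FINAL = "\u03c2"

# Upper-casing the small sharp s yields two ASCII letters, not U+1E9E.
print(SHARP_S_SMALL.upper() == "SS")
print(SHARP_S_SMALL.upper() == SHARP_S_CAPITAL)

# Lower-casing the capital sharp s yields the small sharp s.
print(SHARP_S_CAPITAL.lower() == SHARP_S_SMALL)

# Lower-casing an isolated capital sigma yields the non-final small sigma,
# so a plain lower-case comparison never equals the final sigma.
print(SIGMA_CAPITAL.lower() == SIGMA_SMALL)
print(SIGMA_CAPITAL.lower() == SIGMA_FINAL)
```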

There is the German lowercase word
'hei[LATIN SMALL LETTER SHARP S]' (English 'hot').
When it is uppercased it becomes 'HEISS',
but 'HEISS' could also be written as
'HEI[LATIN SMALL LETTER SHARP S]'.

Current results:
----------------------------------------------------------
$ export LC_ALL=el_GR.utf8 ; export LANG=el_GR.utf8

$ echo -e '\0316\0243\0316\0243' >SS

$ echo -e '\0316\0243\0317\0202' >Sf

$ echo -e '\0317\0203\0317\0202' >sf

$ echo -e '\0317\0203\0317\0203' >ss

$ grep -q -i -f SS ss && echo yes || echo no
yes

$ grep -q -i -f ss SS && echo yes || echo no
yes

$ grep -q -i -f Sf sf && echo yes || echo no
yes

$ grep -q -i -f sf Sf && echo yes || echo no
yes

$ grep -q -i -f SS sf && echo yes || echo no
no

$ grep -q -i -f sf SS && echo yes || echo no
no

$ export LC_ALL=de_DE.utf8 ; export LANG=de_DE.utf8

$ echo -e 'hei\0303\0237' >heif

$ echo -e 'HEI\0341\0272\0236' >HEIF

$ echo 'HEISS' >HEISS

$ grep -q -i -f heif HEIF && echo yes || echo no
yes

$ grep -q -i -f HEIF heif && echo yes || echo no
yes

$ grep -q -i -f heif HEISS && echo yes || echo no
no

$ grep -q -i -f HEISS heif && echo yes || echo no
no

$ grep -q -i -f HEIF HEISS && echo yes || echo no
no

$ grep -q -i -f HEISS HEIF && echo yes || echo no
no

$ grep -V
grep (GNU grep) 2.12
...
----------------------------------------------------------

The "no" results would be "yes" if "case folding"
were used as described in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
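
To make the expected outcome concrete, here is a minimal Python sketch that
replays the six failing pairs with both sides case-folded first. It is a
stand-in for the idea, not for grep: the pattern is treated as a fixed string
rather than a regular expression, the FILES dictionary mirrors the test files
created above, and folded_match is a hypothetical helper.

```python
# In-memory stand-ins for the test files created in the transcript.
FILES = {
    "SS": "\u03a3\u03a3",
    "Sf": "\u03a3\u03c2",
    "sf": "\u03c3\u03c2",
    "ss": "\u03c3\u03c3",
    "heif": "hei\u00df",
    "HEIF": "HEI\u1e9e",
    "HEISS": "HEISS",
}

# (pattern file, searched file) pairs for which grep 2.12 printed "no".
PAIRS_REPORTED_NO = [
    ("SS", "sf"), ("sf", "SS"),
    ("heif", "HEISS"), ("HEISS", "heif"),
    ("HEIF", "HEISS"), ("HEISS", "HEIF"),
]


def folded_match(pattern_name: str, text_name: str) -> bool:
    """Fixed-string search after case folding both pattern and text."""
    pattern = FILES[pattern_name].casefold()
    text = FILES[text_name].casefold()
    return pattern in text


for pattern_name, text_name in PAIRS_REPORTED_NO:
    result = "yes" if folded_match(pattern_name, text_name) else "no"
    print(f"{pattern_name} in {text_name}: {result}")
```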

Missing documentation:

Currently "man grep" does not mention the issue.

The README file mentions only the "tr_TR.UTF-8" locale.
Probably the "Turkish I/i with/without dot" issue
is meant there.

I think that "man grep" should explicitly mention
the current shortcoming of "-i, --ignore-case".

I suggest enhancing "man grep" with something like this:
------------------------------------------------------------
  -i, --ignore-case
         Ignore case distinctions in both the PATTERN
         and the input files. (-i is specified by POSIX.)
         Currently grep does not implement "case folding"
         according to the Unicode Standard. For special
         Unicode characters caseless matching fails.
         For example GREEK SMALL LETTER FINAL SIGMA
         does not match GREEK CAPITAL LETTER SIGMA and
         LATIN SMALL LETTER SHARP S does not match 'SS'.
------------------------------------------------------------





    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?36682>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
