From: Johannes Meixner
Subject: [bug #36682] Ignore case handling of special unicode characters (case folding)
Date: Tue, 19 Jun 2012 10:35:54 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.1.9) Gecko/20100317 SUSE/3.5.9-0.1.1 Firefox/3.5.9
URL:
<http://savannah.gnu.org/bugs/?36682>
Summary: Ignore case handling of special unicode characters
(case folding)
Project: grep
Submitted by: jsmeix
Submitted on: Tue 19 Jun 2012 10:35:53 AM GMT
Category: None
Severity: 3 - Normal
Item Group: None
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
_______________________________________________________
Details:
Basically this is related to what is mentioned in the
"POSIX and --ignore-case" and "Unicode and --ignore-case"
sections in grep's TODO file.
The current behaviour is a missing feature
rather than a bug in grep.
However, I think that the current behaviour is not
sufficiently described in the grep documentation, and that
is probably a (minor) bug in current grep.
Currently "grep -i" does not implement "case folding"
according to what is described in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
Here "case folding" means having a list of mappings
for special characters (like the Greek sigma or
the German sharp s) that say how to convert them,
in both the RE and the buffer to be searched,
into a sequence of bytes which is appropriate for
caseless matching using a binary comparison.
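As a minimal sketch of this idea (not grep code): Python's built-in
str.casefold() applies Unicode full case folding, so folding both the
pattern and the text and then encoding the result as UTF-8 yields byte
sequences that can be compared binarily. The names fold_key and
caseless_equal are hypothetical and chosen only for illustration.

```python
def fold_key(text: str) -> bytes:
    """Return a byte sequence suitable for caseless binary comparison."""
    # str.casefold() applies Unicode full case folding: both sharp s
    # forms become "ss", and all three sigma forms become U+03C3.
    return text.casefold().encode("utf-8")


def caseless_equal(pattern: str, text: str) -> bool:
    """Compare two strings caselessly by comparing their folded bytes."""
    return fold_key(pattern) == fold_key(text)


if __name__ == "__main__":
    # CAPITAL SIGMA twice vs. SMALL SIGMA followed by FINAL SIGMA
    print(caseless_equal("\u03a3\u03a3", "\u03c3\u03c2"))
    # 'hei' + SMALL SHARP S vs. 'HEISS'
    print(caseless_equal("hei\u00df", "HEISS"))
```

Both calls report a match, because the folded forms are byte-identical.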
Currently, with "grep -i", such special characters
do not always match.
For example:
Unicode | UTF-8 (octal)  | Name
----------------------------------------------------------
U+03A3  | 0316 0243      | GREEK CAPITAL LETTER SIGMA
U+03C2  | 0317 0202      | GREEK SMALL LETTER FINAL SIGMA
U+03C3  | 0317 0203      | GREEK SMALL LETTER SIGMA
U+00DF  | 0303 0237      | LATIN SMALL LETTER SHARP S
U+1E9E  | 0341 0272 0236 | LATIN CAPITAL LETTER SHARP S
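The octal byte values can be re-derived from the code points. The
following is only a sketch using Python's standard library; the helper
name utf8_octal is hypothetical.

```python
import unicodedata

# Code points of the characters discussed in this report.
CODE_POINTS = [0x03A3, 0x03C2, 0x03C3, 0x00DF, 0x1E9E]


def utf8_octal(code_point: int) -> str:
    """Return the UTF-8 bytes of a code point as 4-digit octal numbers."""
    encoded = chr(code_point).encode("utf-8")
    return " ".join(f"{byte:04o}" for byte in encoded)


if __name__ == "__main__":
    for cp in CODE_POINTS:
        name = unicodedata.name(chr(cp))
        print(f"U+{cp:04X} | {utf8_octal(cp):<14} | {name}")
```

The octal strings are in the form that "echo -e" expects after a backslash.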
Note that according to the Unicode specification,
converting LATIN SMALL LETTER SHARP S
to upper case results in 'SS' (two ASCII 'S')
and not in LATIN CAPITAL LETTER SHARP S,
but converting LATIN CAPITAL LETTER SHARP S
to lower case results in LATIN SMALL LETTER SHARP S.
Converting to lower case turns
GREEK CAPITAL LETTER SIGMA
into GREEK SMALL LETTER SIGMA, which does not match
GREEK SMALL LETTER FINAL SIGMA.
There is the German lowercase word
'hei[LATIN SMALL LETTER SHARP S]' (English 'hot').
When it is uppercased it becomes 'HEISS',
but 'HEISS' could also be written as
'HEI[LATIN CAPITAL LETTER SHARP S]'.
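The asymmetry between upper-casing, lower-casing and case folding can be
observed with Python's str methods. This is only an illustration of the
Unicode mappings, not of grep's internals, and the helper name
conversions is hypothetical.

```python
SMALL_SHARP_S = "\u00df"
CAPITAL_SHARP_S = "\u1e9e"
CAPITAL_SIGMA = "\u03a3"
FINAL_SIGMA = "\u03c2"


def conversions(word: str) -> dict:
    """Return the upper-cased, lower-cased and case-folded forms of word."""
    return {
        "upper": word.upper(),
        "lower": word.lower(),
        "fold": word.casefold(),
    }


if __name__ == "__main__":
    # The sigmas are converted as single characters, because Python's
    # str.lower() applies the final-sigma rule inside longer words.
    words = [
        "hei" + SMALL_SHARP_S,
        "HEI" + CAPITAL_SHARP_S,
        "HEISS",
        CAPITAL_SIGMA,
        FINAL_SIGMA,
    ]
    for word in words:
        shown = {k: ascii(v) for k, v in conversions(word).items()}
        print(ascii(word), shown)
```

Only the "fold" column gives the same result for all three spellings of
the German word; neither upper-casing nor lower-casing alone does.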
Current results:
----------------------------------------------------------
$ export LC_ALL=el_GR.utf8 ; export LANG=el_GR.utf8
$ echo -e '\0316\0243\0316\0243' >SS
$ echo -e '\0316\0243\0317\0202' >Sf
$ echo -e '\0317\0203\0317\0202' >sf
$ echo -e '\0317\0203\0317\0203' >ss
$ grep -q -i -f SS ss && echo yes || echo no
yes
$ grep -q -i -f ss SS && echo yes || echo no
yes
$ grep -q -i -f Sf sf && echo yes || echo no
yes
$ grep -q -i -f sf Sf && echo yes || echo no
yes
$ grep -q -i -f SS sf && echo yes || echo no
no
$ grep -q -i -f sf SS && echo yes || echo no
no
$ export LC_ALL=de_DE.utf8 ; export LANG=de_DE.utf8
$ echo -e 'hei\0303\0237' >heif
$ echo -e 'HEI\0341\0272\0236' >HEIF
$ echo 'HEISS' >HEISS
$ grep -q -i -f heif HEIF && echo yes || echo no
yes
$ grep -q -i -f HEIF heif && echo yes || echo no
yes
$ grep -q -i -f heif HEISS && echo yes || echo no
no
$ grep -q -i -f HEISS heif && echo yes || echo no
no
$ grep -q -i -f HEIF HEISS && echo yes || echo no
no
$ grep -q -i -f HEISS HEIF && echo yes || echo no
no
$ grep -V
grep (GNU grep) 2.12
...
----------------------------------------------------------
The "no" results would be "yes" if "case folding"
were used as described in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
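To illustrate the expected outcome, the six pattern/input pairs that
give "no" can be replayed with case folding applied to both sides.
This is a sketch under assumptions: the patterns contain no RE
metacharacters, so a plain substring test stands in for grep's matching,
and the name line_matches is hypothetical.

```python
# (pattern, input line) pairs taken from the "no" results in this report.
CASES = [
    ("\u03a3\u03a3", "\u03c3\u03c2"),  # pattern file SS, input file sf
    ("\u03c3\u03c2", "\u03a3\u03a3"),  # pattern file sf, input file SS
    ("hei\u00df", "HEISS"),            # pattern file heif, input file HEISS
    ("HEISS", "hei\u00df"),            # pattern file HEISS, input file heif
    ("HEI\u1e9e", "HEISS"),            # pattern file HEIF, input file HEISS
    ("HEISS", "HEI\u1e9e"),            # pattern file HEISS, input file HEIF
]


def line_matches(pattern: str, line: str) -> bool:
    """Report whether the folded pattern occurs in the folded line."""
    return pattern.casefold() in line.casefold()


if __name__ == "__main__":
    for pattern, line in CASES:
        verdict = "yes" if line_matches(pattern, line) else "no"
        print(ascii(pattern), ascii(line), verdict)
```

With folding on both sides, every pair is reported as "yes".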
Missing documentation:
Currently "man grep" does not mention the issue.
The README file mentions only the "tr_TR.UTF-8" locale;
the "Turkish I/i with/without dot" issue
is probably meant there.
I think that "man grep" should explicitly mention
the current shortcoming of "-i, --ignore-case".
I suggest enhancing "man grep" with something like this:
------------------------------------------------------------
-i, --ignore-case
Ignore case distinctions in both the PATTERN
and the input files. (-i is specified by POSIX.)
Currently grep does not implement "case folding"
according to the Unicode Standard. For special
Unicode characters caseless matching can fail.
For example GREEK SMALL LETTER FINAL SIGMA
does not match GREEK CAPITAL LETTER SIGMA and
LATIN SMALL LETTER SHARP S does not match 'SS'.
------------------------------------------------------------
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?36682>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/