bug in join: case comparisons don't work in multibyte locales
From: Bruno Haible
Subject: bug in join: case comparisons don't work in multibyte locales
Date: Wed, 11 Mar 2009 01:40:50 +0100
User-agent: KMail/1.9.9
Hi Jim,
In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
case-insensitive comparison of the input lines does not work in multibyte
locales. And indeed, in a UTF-8 locale, I see this:
$ cat > in1 <<EOF
müsste
EOF
$ cat > in2 <<EOF
MÜSSTE
EOF
$ join -i in1 in2
[empty result]
The expected result is:
$ join -i in1 in2
müsste
Similarly, with a German word in lower and upper case:
$ cat > in1 <<EOF
Ruß
EOF
$ cat > in2 <<EOF
RUSS
EOF
$ join -i in1 in2
[empty result]
The expected result is:
$ join -i in1 in2
Ruß
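To illustrate why the first example fails, here is a minimal sketch of a comparison that folds case one byte at a time. The function name bytewise_casecmp is hypothetical and is not the code in join.c; it only assumes that the comparison is done per byte, which is all toupper() can do.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a unibyte case-insensitive comparison.
   It folds case one byte at a time with toupper(), so it cannot relate
   the two-byte UTF-8 sequences of "ü" (C3 BC) and "Ü" (C3 9C).  */
int
bytewise_casecmp (const char *s1, size_t n1, const char *s2, size_t n2)
{
  size_t n = n1 < n2 ? n1 : n2;
  for (size_t i = 0; i < n; i++)
    {
      int c1 = toupper ((unsigned char) s1[i]);
      int c2 = toupper ((unsigned char) s2[i]);
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
    }
  /* Common prefix is equal: the shorter area sorts first.  */
  return (n1 > n2) - (n1 < n2);
}
```

With the UTF-8 bytes of the two input lines, "m\xc3\xbcsste" and "M\xc3\x9cSSTE" (7 bytes each), this returns a nonzero value, because the bytes BC and 9C are not related by toupper(). The pure ASCII pair "muesste" / "MUESSTE" compares equal.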
Before going on, let me summarize the case comparison functions for strings
that we have available with gnulib:
                      | on NUL-terminated    | on memory areas or
                      | strings              | strings with embedded NULs
----------------------+----------------------+---------------------------
For ASCII strings     | c_strcasecmp,        |
only                  | STRCASEEQ            |
----------------------+----------------------+---------------------------
For unibyte locales   | strcasecmp           | memcasecmp
only                  |                      |
----------------------+----------------------+---------------------------
Support for multibyte | mbscasecmp           | mbmemcasecmp
locales               |                      |
    ------------------+----------------------+---------------------------
  + German, Greek etc.|                      | ulc_casecmp
----------------------+----------------------+---------------------------
Support for multibyte |                      | mbmemcasecoll
locales and locale    |                      |
collation order       |                      |
    ------------------+----------------------+---------------------------
  + German, Greek etc.|                      | ulc_casecoll
----------------------+----------------------+---------------------------
Find attached a draft patch for the 'join' program, that fixes the bug
mentioned above by use of the mbmemcasecmp or ulc_casecmp functions. It
is not ready to apply, because there are three big questions:
1) Which functions to use for case comparison in coreutils?
The difference between mbmemcasecmp and ulc_casecmp (or between
mbmemcasecoll and ulc_casecoll) is:
mbmemcasecmp treats only English and a few European languages correctly,
- Turkish i / I is halfway correct, but not fully,
whereas ulc_casecmp handles all known specialities of languages:
- Turkish i / I is fully correct,
- German ß is equivalent to ss,
- Croatian and Bosnian: Characters with 3 forms, such as DZ dz Dz, are
considered equivalent,
- Greek final sigma (lowercase) is considered equivalent to uppercase
sigma (there is no difference between final and non-final sigma in
upper case),
- Lithuanian soft-dot,
- etc.
I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
correct".
The reason is that mbmemcasecmp is based on the POSIX APIs, but these APIs
have some assumptions built-in that are not valid in some languages:
- It assumes that there is only uppercase and lowercase - not true for
DZ dz Dz.
- It assumes that uppercasing of 1 character leads to 1 character - not
true for German ß.
- It assumes that there is 1:1 mapping between uppercase and lowercase
forms - not true for Greek sigma.
- It assumes that the upper/lowercase mappings are position independent -
not true for Greek sigma and Lithuanian i.
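The second assumption can be made concrete with a small sketch. It is not the mbmemcasecmp implementation; simple_wcscasecmp is a hypothetical name for a comparison built only on the POSIX mapping towlower(), which maps exactly one wide character to one wide character.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical per-character comparison of NUL-terminated wide strings.
   Each character is mapped with towlower(), a 1:1 mapping, so a string
   can never change length during case folding.  */
int
simple_wcscasecmp (const wchar_t *s1, const wchar_t *s2)
{
  for (;; s1++, s2++)
    {
      wint_t c1 = towlower ((wint_t) *s1);
      wint_t c2 = towlower ((wint_t) *s2);
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
      if (c1 == L'\0')
        return 0;
    }
}
```

Under this scheme L"RUSS" and L"russ" compare equal, but L"Ru\u00df" (Ruß) and L"RUSS" do not: the third character ß has no single-character counterpart to match against S, and the strings have different lengths anyway.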
2) There is a problem with the case comparison in "sort -f": POSIX specifies
how this option should behave, using the old POSIX terminology
("all lowercase characters that have uppercase equivalents").
How to deal with that?
a) Use mbmemcasecmp for the option -f, and introduce a long option that
works with ulc_casecmp?
b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
and ulc_casecmp otherwise?
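Option b) could look roughly like the following sketch. The enum and function names are hypothetical and only illustrate the switch; the caller would then dispatch to mbmemcasecmp or ulc_casecmp, which are not called here so that the fragment stays self-contained.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names, for illustration only.  */
enum casecmp_mode
{
  CASECMP_POSIX,   /* caller would use mbmemcasecmp */
  CASECMP_FULL     /* caller would use ulc_casecmp */
};

/* Decide from the value of POSIXLY_CORRECT; NULL means "not set".
   Any value, even the empty string, counts as set.  */
enum casecmp_mode
casecmp_mode_from_env_value (const char *posixly_correct)
{
  return posixly_correct != NULL ? CASECMP_POSIX : CASECMP_FULL;
}

/* Convenience wrapper that reads the real environment.  */
enum casecmp_mode
current_casecmp_mode (void)
{
  return casecmp_mode_from_env_value (getenv ("POSIXLY_CORRECT"));
}
```

The decision is split from the getenv() call only so that it can be checked without touching the environment. A program would presumably call current_casecmp_mode() once at startup and keep the result.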
3) There is also a problem with the executable size: the ulc_casecmp (and
ulc_casecoll) functions are implemented using a couple of tables. I
have already squeezed them, while still guaranteeing O(1) time for each
access. Most of the tables are about 10 KB in size, the largest one about 45 KB.
But it adds up:
                                       join executable size (decimal)
  coreutils-7.1 unmodified                  35436
  with mbmemcasecmp                         36473
  with ulc_casecmp                         174336
  with ulc_casecmp and mbmemcasecmp        176521
    (switched at runtime)
When an executable grows from 35 KB to 175 KB, just for correct string
comparisons, some people will certainly complain. Especially embedded
developers, like the busybox guys, try to reduce total executable size.
And that's not only about 'join', it's ultimately about every coreutils
program that has an option to perform case-insensitive comparisons on
user's data.
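For reference, the increases implied by the size figures can be worked out with a few lines of C. The helper names are hypothetical; the byte counts are the ones listed for the join executable.

```c
#include <stdio.h>

/* Sizes of the join executable in bytes, as listed in the table.  */
enum
{
  SIZE_UNMODIFIED = 35436,
  SIZE_MBMEMCASECMP = 36473,
  SIZE_ULC_CASECMP = 174336,
  SIZE_BOTH = 176521
};

/* Growth of a variant relative to the unmodified build, in bytes.  */
long
size_increase (long base, long modified)
{
  return modified - base;
}

/* Print the growth of each variant over coreutils-7.1 unmodified.  */
void
print_size_report (void)
{
  printf ("mbmemcasecmp: +%ld bytes\n",
          size_increase (SIZE_UNMODIFIED, SIZE_MBMEMCASECMP));
  printf ("ulc_casecmp:  +%ld bytes\n",
          size_increase (SIZE_UNMODIFIED, SIZE_ULC_CASECMP));
  printf ("both:         +%ld bytes\n",
          size_increase (SIZE_UNMODIFIED, SIZE_BOTH));
}
```

So mbmemcasecmp costs about 1 KB (1037 bytes), whereas ulc_casecmp costs 138900 bytes, which makes the executable nearly five times as large.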
How to deal with that?
a) Add a configure option --disable-extra-i18n, that will refrain from
using the ulc_casecmp function?
b) Let coreutils build and install a shared library for these large
modules?
c) Should these Unicode string functions be packaged externally to
coreutils, so that coreutils can link to them as an external dependency
(like it does for libiconv, libintl, libacl, etc.)?
d) any other idea?
Bruno
coreutils-7.1-join-i18n-fix.diff
Description: Text Data