Re: bug in join: case comparisons don't work in multibyte locales
From: Jim Meyering
Subject: Re: bug in join: case comparisons don't work in multibyte locales
Date: Wed, 11 Mar 2009 09:09:19 +0100
Bruno Haible wrote:
> In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
> case insensitive comparison of the input lines does not work in multibyte
> locales. And indeed, in a UTF-8 locale, I see this:
...
> Find attached a draft patch for the 'join' program, that fixes the bug
> mentioned above by use of the mbmemcasecmp or ulc_casecmp functions. It
> is not ready to apply, because there are three big questions:
>
> 1) Which functions to use for case comparison in coreutils?
>
> The difference between mbmemcasecmp and ulc_casecmp (or between
> mbmemcasecoll and ulc_casecoll) is:
> mbmemcasecmp treats only English and a few European languages correctly,
> - Turkish i / I is halfway correct, but not fully,
> whereas ulc_casecmp handles all known specialities of languages:
> - Turkish i / I is fully correct,
> - German ß is equivalent to ss,
> - Croatian and Bosnian: Characters with 3 forms, such as DZ dz Dz, are
> considered equivalent,
> - Greek final sigma (lowercase) is considered equivalent to uppercase
> sigma, (There is no difference between final and non-final sigma in the
> upper case.)
> - Lithuanian soft-dot,
> - etc.
>
> I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
> correct".
>
> The reason is that mbmemcasecmp is based on the POSIX APIs, but these APIs
> have some assumptions built-in that are not valid in some languages:
> - It assumes that there is only uppercase and lowercase - not true for
> DZ dz Dz.
> - It assumes that uppercasing of 1 character leads to 1 character - not
> true for German ß.
> - It assumes that there is 1:1 mapping between uppercase and lowercase
> forms - not true for Greek sigma.
> - It assumes that the upper/lowercase mappings are position independent -
> not true for Greek sigma and Lithuanian i.
Hi Bruno,
Wow. Thanks for all that work.
I prefer the "correct" approach, especially since I believe that will
eventually align with POSIX, even if it doesn't match the current intent
(I don't know).
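
To make the limitations of the POSIX APIs listed above concrete, here is
a minimal sketch of a comparison that folds one character at a time with
towlower(). The function name naive_wcs_casecmp is hypothetical; it is
not taken from coreutils or gnulib, and it is not how mbmemcasecmp is
implemented. It only illustrates that a per-character mapping can never
make one character equal to two:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (an assumption for illustration only): compare
   two wide strings case-insensitively by folding each character with
   towlower().  Returns 0 if the strings are equal under that folding,
   a negative or positive value otherwise.  */
int
naive_wcs_casecmp (const wchar_t *a, const wchar_t *b)
{
  assert (a != NULL && b != NULL);
  for (;; a++, b++)
    {
      /* towlower() maps exactly one character to exactly one
         character, independent of its position in the string.  */
      wint_t ca = towlower ((wint_t) *a);
      wint_t cb = towlower ((wint_t) *b);
      if (ca != cb)
        return ca < cb ? -1 : 1;
      if (ca == 0)
        return 0;   /* both strings ended at the same position */
    }
}
```

With this kind of loop, L"Join" and L"jOIN" compare equal, but
L"STRASSE" and L"stra\u00dfe" cannot, because the loop pairs characters
one to one and ß would have to match two characters.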
> 2) There is a problem with the case comparison in "sort -f": POSIX specifies
> how this option should behave, in terms of the old POSIX terms
> ("all lowercase characters that have uppercase equivalents").
>
> How to deal with that?
> a) Use mbmemcasecmp for the option -f, and introduce a long option that
> works with ulc_casecmp?
> b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
> and ulc_casecmp otherwise?
How about a third approach?
Use ulc_casecmp unconditionally (assuming it's available), and resort
to adding POSIXLY_CORRECT if enough people complain *and* if somehow
POSIX cannot be changed to accommodate the correct behavior.
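
For reference, if the POSIXLY_CORRECT fallback ever became necessary, the
runtime switch could look roughly like the sketch below. Everything in it
is an assumption for illustration: the two comparison functions are
stand-ins that both call strcasecmp(), not the real mbmemcasecmp or
ulc_casecmp (whose actual signatures differ), and select_casecmp is a
hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <strings.h>

typedef int (*casecmp_fn) (const char *, const char *);

/* Stand-in for a POSIX-style comparison (assumption: the real code
   would call mbmemcasecmp here).  */
int
posix_casecmp_stub (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcasecmp (a, b);
}

/* Stand-in for the full Unicode comparison (assumption: the real code
   would call ulc_casecmp here).  */
int
full_casecmp_stub (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcasecmp (a, b);
}

/* Pick the comparison function from the value of POSIXLY_CORRECT.
   The variable counts as set whenever it exists, even if empty.  */
casecmp_fn
select_casecmp (const char *posixly_correct)
{
  return posixly_correct != NULL ? posix_casecmp_stub : full_casecmp_stub;
}

/* Convenience wrapper that consults the real environment.  */
casecmp_fn
select_casecmp_from_env (void)
{
  return select_casecmp (getenv ("POSIXLY_CORRECT"));
}
```

The selection would happen once at startup, so the comparison loop itself
pays only for an indirect call.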
> 3) There is also a problem with the executable size: the ulc_casecmp (and
> ulc_casecoll) functions are implemented using a couple of tables. I
> squeezed them already, while still guaranteeing O(1) time for each
> access. Most of the tables are about 10 KB large, the largest one ca. 45
> KB.
> But it sums up:
>
>                                        join executable size
>                                          (bytes, decimal)
>   coreutils-7.1 unmodified                     35436
>   with mbmemcasecmp                            36473
>   with ulc_casecmp                            174336
>   with ulc_casecmp and mbmemcasecmp           176521
>     (switched at runtime)
>
> When an executable grows from 35 KB to 175 KB, just for correct string
> comparisons, some people will certainly complain. Especially embedded
> developers, like the busybox guys, try to reduce total executable size.
> And that's not only about 'join', it's ultimately about every coreutils
> program that has an option to perform case-insensitive comparisons on
> user's data.
>
> How to deal with that?
> a) Add a configure option --disable-extra-i18n, that will refrain from
> using the ulc_casecmp function?
> b) Let coreutils build and install a shared library for these large
> modules?
> c) Should these Unicode string functions be packaged externally to
> coreutils, and coreutils can link to it as an external dependency
> (like it does for libiconv, libintl, libacl, etc.)?
c) would be great. The size issue is non-negligible, even if
it's just four programs. Besides, I'd like to keep coreutils
out of the shared-library-creation/installation business.
BTW, your patch looked impeccable.