Re: bug in join: case comparisons don't work in multibyte locales
From: Pádraig Brady
Subject: Re: bug in join: case comparisons don't work in multibyte locales
Date: Thu, 12 Mar 2009 12:07:37 +0000
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Bruno Haible wrote:
> Pádraig Brady wrote:
>> Note that as well as folding case, I think it might
>> be useful to fold other forms like:
>> Enclosed: \u24b6 -> A
>> Stylistic: \uff21 -> A
>
> These two transformations are already executed when you use ulc_casecmp
> with the UNINORM_NFKD argument.
Ah right, they're covered by compatibility equivalence:
http://www.unicode.org/reports/tr15/
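
To check that for myself, here is a minimal sketch using Python's standard unicodedata module rather than libunistring (that substitution is my assumption; it only exercises the NFKD step, not ulc_casecmp itself). The helper name fold_variants is hypothetical.

```python
import unicodedata


def fold_variants(text):
    """Apply NFKD, which replaces compatibility variants by their base form.

    Hypothetical helper for illustration; it performs no case folding.
    """
    return unicodedata.normalize("NFKD", text)


enclosed = "\u24b6"   # CIRCLED LATIN CAPITAL LETTER A
fullwidth = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A

# Both compatibility variants decompose to a plain "A".
print(fold_variants(enclosed) == "A")
print(fold_variants(fullwidth) == "A")

# A diacritic is only decomposed, not removed: the result is "A" followed
# by U+0300 COMBINING GRAVE ACCENT, so it still differs from a plain "A".
print(fold_variants("\u00c0") == "A\u0300")
```

So NFKD handles the enclosed and stylistic forms, but diacritics still need a separate step.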
>
>> Diacritics: À -> A
>
> Very good point. The case-insensitive comparisons are used in contexts
> where different people enter the same word / name / term. But in these
> contexts, additional transformations need to be done, depending on
> culture. I think Google's front end to the search engine does these
> transformations. They are:
> - for French, to remove accents and diacritics,
> - for German, to transform umlauts (ü -> ue),
> - for Danish, probably to transform å -> aa,
> - and certainly much more for other languages (what is it for Chinese?)
>
>> I.e. have a more general function like:
>> ulc_coll(fold={Case|Diacritics|Stylistic}, ...);
>
> _coll or _cmp ? _coll is used when people want to put lists of names in
> order. The use case where diacritics are ignored is to do lookups, not for
> sorting.
Sorry, you're right: _cmp.
> Also, as mentioned above, I think which parts should be folded is locale
> dependent. For French, it is ok to ignore diacritics when doing caseless
> matching; for German, it is not.
Well, that would work if the locale database stored this info (I don't think it does).
Otherwise it would be left as an option to the user, like:
sort --fold={case,variants,diacritics,all}
where "variants" corresponds to NFKD.
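
As a rough sketch of how such fold classes could combine in a caseless match, here is some Python using only the standard unicodedata module. Everything in it is an assumption for illustration: the names fold_key and folded_equal are hypothetical, str.casefold() stands in for the case mapping that ulc_casecmp would do, and "diacritics" is approximated by dropping combining marks, which ignores locale rules such as the German ü -> ue. It is not an existing sort option or gnulib API.

```python
import unicodedata

# Fold classes from the proposed option; "all" selects every one of them.
FOLDS = frozenset({"case", "variants", "diacritics"})


def fold_key(text, fold):
    """Return a comparison key for text with the requested folds applied."""
    fold = set(fold)
    if "all" in fold:
        fold = set(FOLDS)
    unknown = fold - FOLDS
    if unknown:
        raise ValueError("unknown fold class: %s" % ", ".join(sorted(unknown)))

    if "variants" in fold:
        # Compatibility decomposition: enclosed, fullwidth, etc. forms.
        text = unicodedata.normalize("NFKD", text)
    if "diacritics" in fold:
        # Assumption: decompose canonically, then drop combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if "case" in fold:
        text = text.casefold()
    # Recompose so canonically equivalent inputs give identical keys.
    return unicodedata.normalize("NFC", text)


def folded_equal(a, b, fold):
    """Compare for equality only (lookup use case), not for ordering."""
    return fold_key(a, fold) == fold_key(b, fold)


print(folded_equal("\u00c0", "a", {"case", "diacritics"}))
print(folded_equal("\u00c0", "a", {"case"}))
print(folded_equal("\uff21", "a", {"all"}))
```

Note this only answers "equal or not", which matches the lookup use case; it says nothing about collation order.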
cheers,
Pádraig.