Re: On language-dependent defaults for character-folding

From: Elias Mårtenson
Date: Fri, 19 Feb 2016 17:22:18 +0800

On 19 February 2016 at 16:20, Eli Zaretskii <address@hidden> wrote:

> > From: Lars Ingebrigtsen <address@hidden>
> > Date: Fri, 19 Feb 2016 16:11:41 +1100
> >
> > Here's my vote: I think character folding is a good idea, and that it
> > should be turned on by default if it respects the locale. If not, it
> > should be off by default.
>
> Thanks. But what does "respect the locale" mean, in practical terms?
> A large portion of the characters that have some decomposition, and
> thus will be folded when searching, belong to scripts that are not
> related to any language or other locale-specific attribute. What do
> you think should be done with them in the context of this feature?

The Unicode character decomposition was never meant to be used to provide a feature such as character folding in Emacs. However, Unicode doesn't really provide a good alternative either. The standard itself states that this belongs to the realm of localisation (IIRC, it even goes as far as mentioning Swedish as a counterexample).

I readily agree that using the decomposition is a clever way to get the functionality quite a long way, but in the cases where it breaks down, it does so quite spectacularly, and that's what I (and others) have been opposing.
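
To make the breakage concrete, here is a rough sketch in Python (not Emacs code, and not how Emacs implements it) of what folding by decomposition amounts to: decompose each character and drop the combining marks. The function name is made up for illustration. It does what one wants for French "résumé", but it also turns Swedish "får" into "far", which is a different word.

```python
import unicodedata

def fold_by_decomposition(text: str) -> str:
    """Fold by canonical decomposition (NFD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_by_decomposition("résumé"))  # resume
print(fold_by_decomposition("får"))     # far
```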

My suggestion would be to apply several levels of comparisons:

1. Check if the characters have locale-specific folding rules (for Swedish, this would be no more than 3-5 characters or so). If not:

2. Check the equivalence according to the Unicode collation charts: http://unicode.org/charts/collation/

3. (maybe) Use the decomposition trick

As for the per-locale exception tables mentioned in point 1, I don't know if such information is easily available. It may be possible to extract it from the glibc localedata files. But even if it isn't, creating one for a language should be trivial, since we only need a list of character groups that should _not_ be folded, which for most languages should be a very small list (in fact, for most(?) it's probably empty).
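
The three levels could be sketched roughly like this, again in Python purely for illustration. Everything here is an assumption: the names `NO_FOLD`, `COLLATION_EQUIV`, `fold_char` and `fold` are invented, the Swedish table is my guess at what such an exception list would contain, and the collation table is a one-entry placeholder where real data would have to be generated from the collation charts.

```python
import unicodedata

# Level 1 (hypothetical data): per-locale characters that must NOT be folded.
# For most locales this set would be empty.
NO_FOLD = {
    "sv": set("åäöÅÄÖ"),
    "en": set(),
}

# Level 2 (placeholder): equivalences that would be taken from the Unicode
# collation data. A single illustrative entry, not extracted from the charts.
COLLATION_EQUIV = {"ß": "ss"}

def fold_char(ch: str, locale: str) -> str:
    """Fold one character, trying the three levels in order."""
    if ch in NO_FOLD.get(locale, set()):
        return ch                      # 1. locale says: leave it alone
    if ch in COLLATION_EQUIV:
        return COLLATION_EQUIV[ch]     # 2. collation-based equivalence
    # 3. fall back to the decomposition trick
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold(text: str, locale: str) -> str:
    """Fold a string; compose first so the table lookups see single characters."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(fold_char(ch, locale) for ch in composed)

print(fold("får", "sv"))     # får
print(fold("får", "en"))     # far
print(fold("résumé", "sv"))  # resume
```

With a table like this, a Swedish user searching for "far" would no longer match "får", while accents that carry no meaning in Swedish are still folded.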

Regards,

Elias