Re: On language-dependent defaults for character-folding

On 19 February 2016 at 19:46, Eli Zaretskii <address@hidden> wrote:

> Of course you have to use the decomposition algorithms to ensure that the precomposed and decomposed
> variations of the same character compare equal.

Then you agree that _some_ form of character-folding should be turned
on by default?

Yes.

> This is, however, different from using the decomposition to decompose a character and then using the
> base character as the thing to match against. The latter is what Emacs is doing today, as far as I understand.

Please describe in more detail why you think what Emacs does today
is not what you think it should do. It's possible we have a
miscommunication here.

The main issue to me is that it matches things that should not be matched. A secondary (minor) issue is that some things that should be matched are not (see my example with U+2C65).

For example, if the buffer includes ñ (2 characters), should "C-s n"
find the n in it?

That depends on the locale of the user. However, from the point of view of a user, there should not be a visible difference between the precomposed and the decomposed variants: they are the exact same character. This is in line with Unicode recommendations (https://en.wikipedia.org/wiki/Unicode_equivalence).

Note: I know that it's possible that I am wrong about this and that Unicode actually _has_ said that the equivalence tables can be used for this purpose (i.e. decompose and only use the primary character). If that is the case, I'd be interested to see a reference to that, but I will still be of the opinion that doing so will result in broken behaviour for a certain class of users.

Thus, if I am Spanish, I will _not_ want any of those to match "n". If I'm Swedish I will likely want both of them to match "n".
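
To make the locale dependence concrete, here is a minimal Python sketch of the behaviour I have in mind. It is only an illustration: the table and the names DISTINCT_LETTERS, fold_for_locale and matches are made up for this example, a real implementation would need proper locale data rather than a hand-written table, and none of this reflects what Emacs does today.

```python
import unicodedata

# Hypothetical table: letters each locale treats as distinct from their base
# letter. Illustrative only, not taken from any Unicode data file.
DISTINCT_LETTERS = {
    "es": {"\u00f1"},                        # ñ
    "sv": {"\u00e5", "\u00e4", "\u00f6"},    # å ä ö
}

def fold_for_locale(text, locale):
    """Strip diacritics except on letters the locale considers distinct."""
    keep = DISTINCT_LETTERS.get(locale, set())
    out = []
    # Normalise to NFC first so precomposed and decomposed input behave alike.
    for ch in unicodedata.normalize("NFC", text):
        if ch.lower() in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

def matches(needle, haystack, locale):
    """Substring search after locale-aware folding of both sides."""
    return fold_for_locale(needle, locale) in fold_for_locale(haystack, locale)
```

With this, searching for "n" in "año" succeeds under "sv" and fails under "es", regardless of whether the ñ is stored as one code point or two.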

That equivalence is encoded in the decomposition data that is part of
UnicodeData.txt which Emacs uses for character-folding.

The equivalence tables explain that the precomposed character U+00F1 is equivalent to the specific sequence U+006E U+0303. That is all they say. They do not say that ñ is a variation of n. The decomposition is an instruction for how to construct a given character.

The decompositions are used in the normalisation forms to ensure that the two variants are treated equally (such as the two alternative representations of ñ that we have been discussing).
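
The distinction can be shown with a short Python sketch. It uses Python's standard unicodedata module, not anything from Emacs, and the helper names canonically_equal and strip_to_base are my own. The first helper is the normalisation-based comparison I am arguing for; the second is the "decompose and keep only the base character" approach that I understand character-folding to be using.

```python
import unicodedata

precomposed = "\u00f1"     # ñ as a single code point
decomposed = "n\u0303"     # n followed by COMBINING TILDE

def canonically_equal(a, b):
    """Compare two strings after normalising both to the same form (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def strip_to_base(text):
    """Decompose, then drop every combining mark, keeping base characters."""
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))

# The decomposition field recorded for U+00F1 in the Unicode character data.
decomposition_field = unicodedata.decomposition(precomposed)
```

Here canonically_equal treats the two spellings of ñ as the same character but keeps both distinct from a plain n, while strip_to_base turns both of them into n.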

> If you look at the latin collation chart for example
> (http://unicode.org/charts/collation/chart_Latin.html) you will see that the characters are grouped. These are
> the equivalences I'm referring to.

Yes. And if you look at the entries of the equivalent characters in
UnicodeData.txt, you will see there they have decompositions, which is
what Emacs uses for searching when character-folding is in effect.

Yes, and this is where the crux of our disagreement lies, I think. I previously referred to using the decompositions as a guide to character equivalence as a "trick". I stand by this, since this is not the purpose of the decompositions. The best thing that Unicode provides for that purpose (to my knowledge) is the set of collation charts that I mentioned previously (http://unicode.org/charts/collation/).

> Now, I note that on these charts, U+0061 LATIN SMALL LETTER A and U+2C65 LATIN SMALL LETTER A
> WITH STROKE compare as different characters, and the latter does not have a decomposition. Should this
> also be addressed?

Maybe so, but given the controversy even about what we do now, which
is a subset, I'd doubt extending what we do now is a wise move.

I was just asking to understand your position better.
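
For reference, the U+2C65 case is easy to check. The following is a small Python sketch using the standard unicodedata module (not Emacs code; the helper name strip_to_base is mine): because the character has no decomposition, a folding scheme based purely on decompositions leaves it untouched and it never matches "a".

```python
import unicodedata

a_with_stroke = "\u2c65"   # LATIN SMALL LETTER A WITH STROKE

def strip_to_base(text):
    """Decompose, then drop combining marks (decomposition-based folding)."""
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))

# An empty string means the character data records no decomposition.
has_decomposition = unicodedata.decomposition(a_with_stroke) != ""

# Folding therefore cannot relate this character to a plain "a".
folds_to_a = strip_to_base(a_with_stroke) == "a"
```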

> As for the locale-specific parts: using that will only DTRT if we
> assume that the majority of searches are done in buffers holding text
> in locale's language. Is that a good assumption?
>
> My opinion is that the default search behaviour should depend primarily on the locale of the entire Emacs
> session. I.e. the locale of the user starting the application. I'm not disagreeing that allowing a buffer-local locale
> override this behaviour is a good idea, but as a Swedish speaker I really see å, ä and a as completely
> separate things, even if the language of the buffer that I am editing happens to be English. The equivalence of
> these characters is the odd behaviour here, and the one that should be enabled explicitly.
>
> Also, if I happen to be editing a Spanish document (I don't speak Spanish) I would find equivalence of ñ and n
> to be incredibly useful, even though Óscar would grind his teeth at it. :-)

So you are in fact making two contradicting statements here.

Interesting. I have re-read what I wrote and I really don't see myself making two contradicting statements. Perhaps you think that I am both for and against folding at the same time. If that's the case, let me try to rephrase:

I like the idea of character folding. But if it's incorrectly implemented (by my standards, of course), I would rather not have it at all, since it will be highly annoying.

Indeed,
the locale in which Emacs started says almost nothing about the
documents being edited, nor even about the user's preferences: it is
easy to imagine a user whose "native" locale is X starting Emacs in
another locale.

Yes. I am fully aware of this. But so be it. Having applications work differently depending on the locale of the environment the application was started in is nothing new.

> We are talking
> about a multilingual Emacs, in an age of global communications, where
> you can have conversations with someone on the other side of the
> world, or read text that combines several languages in the same
> buffer. Do we really want to go back to the l10n days, when there was
> ever only one locale that was interesting -- the current one? I
> wonder.
>
> Actually, I think so. This is because the search equivalence is inherently a local thing.

Being a multi-lingual environment, Emacs has no real notion of the
locale.

Perhaps it should?

> It is, Unicode provides it. We just didn't import it yet.
>
> It does? I was looking for such tables, but didn't find them. Do you have a link?

Look for DUCET and its tailoring data. These should be a good
starting point:

http://www.unicode.org/Public/UCA/latest/
http://cldr.unicode.org/

Those are the decomposition charts, and don't actually say anything about equivalence outside of providing a canonical form for precomposed characters, as was discussed above.

Regards,

Elias

From:	Elias Mårtenson
Subject:	Re: On language-dependent defaults for character-folding
Date:	Fri, 19 Feb 2016 21:37:26 +0800