Re: Character folding in the pretest

On 5 February 2016 at 14:01, Werner LEMBERG <address@hidden> wrote:

>> This naturally leads to a possible user option: Having `optical'
>> matches or not, where `optical' means `base character plus
>> diacritic and/or slight modifications', e.g., o → ø → ö etc., etc.
>
> How do you even define "optical similarities"?

Basically the same as Eli has described: Base character plus
diacritics, probably plus some basic shapes with `diacritics' that
Unicode doesn't represent as composable: o → ø, l → ł, d → đ, etc.

Composability is somewhat arbitrary. The character composition has very little to do with "visual similarities". Just have a look at character compositions in Devanagari for example.
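
To make the arbitrariness concrete, here is a minimal sketch in Python (not Emacs code; the helper name strip_marks is made up for illustration). It decomposes each character with the standard unicodedata module and drops the combining marks. ö and å fold to their base letters, while ø, ł and đ come back unchanged because Unicode gives them no canonical decomposition.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop nonspacing combining marks (category Mn).

    Hypothetical helper for illustration only; this is not how Emacs
    implements character folding.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

if __name__ == "__main__":
    for ch in "öåøłđ":
        # Characters without a canonical decomposition are printed unchanged.
        print(ch, "->", strip_marks(ch))
```

So a folding rule built purely on decomposition treats ö and ø differently, even though both look like "o plus something".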

> Should l and I compare the same under this definition? They
> certainly look similar.

No, since the similarity is a font issue only. For this reason I
*never* use Arial-like fonts.

And that argument works equally well for a and å. They really have _nothing_ in common. The fact that there exists a Unicode decomposition for them is completely irrelevant to a Swedish speaker.

Also note that to a Swedish speaker (well, at least up until recently), W and V were variations of the same character. Yet I'm not advocating that Emacs should consider them similar unless the locale says they should be.

In fact, the Unicode TR on collation that Eli linked to mentions that as a specific example.

> What about p and q? They look like mirror images of each other.
> What about z and s? They even sound similar.

Nonsense. I've clearly mentioned `base character plus diacritic'.
Why do you intentionally skip that? Doing so reminds me of
Schopenhauer's first stratagem in `The Art of Being Right'...

I did not intentionally skip that. I would appreciate it if you didn't assume that I was out to simply prove you wrong, or that I am here to troll.

I was using that as an example in trying to highlight that to some people (like myself) ä just simply is not a character with a diacritic. It is in German, but not in Swedish.

I think this is hard to explain because in many European languages (such as English, German and French) you have characters which are variations or alternatives. For example, in French you have the letter Œ, which is a variation of "OE". Likewise in German, ß is a variation of SS and Ü is a variation of UE. As far as I know, I could write "Müller" as "Mueller".

However, this is not true for Swedish. I'll say it again (and I apologise for repeating myself; this kind of repetition makes me sound like the troll that you accused me of being), but in Swedish the difference between Å and A is just as great as the difference in English between the letters E and O. Writing my last name as "Martenson" looks just as bizarre as me writing your last name as "Merner". And yes, I picked M because it kinda looks like an upside-down W, and I'm doing that not because I'm really suggesting that that equivalence should be implemented, but because I want to illustrate just how silly it looks.

> To a Swedish speaker there are zero similarities between a, ä and å.

I'm a native German speaker, and there is *zero* similarity in the
sound between `a' and `ä', say.

I know. I speak a little German. In fact, Ä is pronounced exactly the same in German and Swedish. That said, as far as I can recall from my German lessons 25 years ago, German grammar does see Ä as a variation of A. At least they are sorted together in the dictionary.

The Swedish distinction is much greater. This discussion would have been much easier if the letters looked completely different. :-)

But it is quite common in English
texts, say, to omit the diaeresis dots, thus having a searching mode
that finds both `Hänsel und Gretel' and `Hansel and Gretel' at the
same time would be very valuable.

I never said it's not valuable. I never even suggested that this kind of comparison should not be possible.

In fact, I'm not even suggesting that this kind of comparison should not be the default, especially given the fact that locale-dependent comparators are not very well supported in Emacs at the moment.

What I did want to do was try to explain that even though there is a visual similarity between A, Ä and Å, to a Swedish speaker those similarities are no greater than those between q and k. And the three letters are definitely much more different from each other than W and V (which were, up until recently, sorted together under V in dictionaries and seen as simply a visual variation).

What you describe naturally leads to another user option: Don't handle
characters as `equal' (with a proper definition of `equal') that
aren't `equal' in the user's locale.

This is exactly my point. And you have managed to compress hundreds of my words into a single, distinct sentence. Thank you.
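
As a rough sketch of what such an option could mean (Python rather than Emacs Lisp, and everything here is an assumption for illustration: the DISTINCT_LETTERS table, the "sv" key and the function names are hypothetical, not an existing Emacs or Unicode API), folding could simply skip the letters that the user's locale treats as letters in their own right:

```python
import unicodedata

# Hypothetical per-locale table of letters that must NOT be folded.
# The Swedish entry is an assumption for illustration; a real table
# would have to come from locale or collation data.
DISTINCT_LETTERS = {
    "sv": set("åäöÅÄÖ"),
}

def fold(text, locale=None):
    """Strip combining marks, except on letters the locale keeps distinct."""
    keep = DISTINCT_LETTERS.get(locale, set())
    out = []
    # Normalize to NFC first so precomposed letters can be looked up.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if unicodedata.category(c) != "Mn"))
    return "".join(out)

def matches(a, b, locale=None):
    """Compare two strings under locale-aware folding."""
    return fold(a, locale) == fold(b, locale)

if __name__ == "__main__":
    print(matches("Hänsel", "Hansel"))        # no locale: folded
    print(matches("Hänsel", "Hansel", "sv"))  # Swedish: ä stays distinct
```

With no locale given, "Hänsel" and "Hansel" compare equal; with the hypothetical "sv" table they do not, while a letter outside the table (such as é) would still fold.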

From:	Elias Mårtenson
Subject:	Re: Character folding in the pretest
Date:	Fri, 5 Feb 2016 14:36:13 +0800