Re: On language-dependent defaults for character-folding

emacs-devel

From:	Eli Zaretskii
Subject:	Re: On language-dependent defaults for character-folding
Date:	Sat, 20 Feb 2016 11:21:17 +0200

> Date: Sat, 20 Feb 2016 13:22:57 +0800
> From: Elias Mårtenson <address@hidden>
> Cc: Lars Ingebrigtsen <address@hidden>, emacs-devel <address@hidden>
> 
>  The reference you are looking for is the Unicode Standard itself. It
>  says to use the normalization forms, see for example section 5.16
>  there.
> 
> I have read that section before, and I have now read it again. The section
> certainly talks about searching that ignores diacritics, but does not
> discuss a method to do so. There is also a reference to TR29, but it refers
> to grapheme clusters, which would be a very strange way to do character
> folding (Koreans would be very confused).
> 
>  Every character-folding search implementation decomposes characters
>  before matching them. So does Emacs. We didn't invent this, and we
>  certainly didn't use the decompositions where they weren't supposed to
>  be used. It's not a trick, it's what everyone else does to do the
>  job. See the ICU library, for example.
> 
> Every example you have given so far discusses the decomposition equivalence,
> i.e. the fact that the two variants of ñ are the same. Section 5.16
> discusses the _concept_ of allowing n and ñ to match similarly, but the
> mechanism to do so is locale-dependent. This is what Unicode says, and that
> is what I say. My position is simply that the default (if absolutely nothing
> else overrides it) should be chosen to take the locale of the user into
> account.
> 
>  > The decompositions are used in the normalisation forms to ensure that the
>  > two variants are treated equally (such as the two alternative
>  > representations of ñ that we have been discussing).
> 
>  Yes, and any character-folding search uses normalization forms as
>  well.
> 
> Yes, but that's not what normalisation forms were designed to do.

Your interpretation is wrong, because every implementation of
character-folding in search uses normalization forms.  So if you want
to maintain that whoever does that is abusing normalization forms, you
are not just up against Emacs, you are up against the ICU library and
others.  You are also up against http://www.unicode.org/notes/tn5/.

It is possible that you only see the "equivalence" parts of all these
sources.  But in that case, you are actually claiming that folding
characters should never be done at all!  "Folding" means mapping
_distinct_ character sequences to the same basic sequence.  You start
from a normalization form, then compare the results disregarding
certain secondary, tertiary, etc. differences.  The Emacs
implementation simply expresses this algorithm by using suitable
regular expressions, and it's currently only capable of either
ignoring all the non-base weights or none at all, but the principle is
preserved to the letter.
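
To make the regular-expression idea concrete, here is a minimal sketch in
Python. It is not the Emacs code: the names folding_regex and fold_search are
made up for illustration, the set of combining marks is assumed to be just the
Combining Diacritical Marks block (U+0300..U+036F), and the sketch strips
diacritics from the search string as well, which is a simplification. It
shows only the "ignore all non-base weights" mode.

```python
import re
import unicodedata

# Assumption: only U+0300..U+036F count as diacritics. A real
# implementation would consult the full Unicode character data.
COMBINING = "[\u0300-\u036f]*"


def folding_regex(query):
    """Build a regex that matches QUERY's base characters, each
    optionally followed by any number of combining marks."""
    base = [c for c in unicodedata.normalize("NFD", query)
            if unicodedata.combining(c) == 0]
    return re.compile("".join(re.escape(c) + COMBINING for c in base))


def fold_search(query, text):
    """Search the decomposed (NFD) form of TEXT; returns a match
    object or None. Offsets refer to the decomposed text."""
    return folding_regex(query).search(unicodedata.normalize("NFD", text))
```

With this sketch, fold_search("manana", "Hasta mañana") finds a match whether
the ñ in the text is precomposed or written as n followed by a combining
tilde, because the text is decomposed before matching.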

> Again (I really apologise for repeating myself, I'm starting to sound like a
> troll and that is truly not my intention), the purpose of normalisation
> forms is to ensure that the two variants of ñ compare the same. They are not
> designed to provide a mechanism to allow n to compare equal to ñ.

Under character-folding that ignores diacritics, ñ should indeed
compare equal to n.
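
The two notions being argued about can be put side by side in a short Python
sketch. The function names are hypothetical, and "diacritic" is assumed here
to mean any character with a non-zero canonical combining class. Canonical
equivalence only makes the two spellings of ñ equal to each other; folding
takes the same decomposition one step further and drops the combining marks.

```python
import unicodedata

PRECOMPOSED = "\u00f1"   # ñ as a single code point
DECOMPOSED = "n\u0303"   # n followed by COMBINING TILDE


def canonically_equal(a, b):
    """Equivalence only: compare the NFD forms as they are."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def fold_diacritics(s):
    """Folding: decompose, then drop every combining mark."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.combining(c) == 0)


def equal_ignoring_diacritics(a, b):
    """Compare two strings after folding away their diacritics."""
    return fold_diacritics(a) == fold_diacritics(b)
```

Here canonically_equal(PRECOMPOSED, DECOMPOSED) is true and
canonically_equal(PRECOMPOSED, "n") is false, while
equal_ignoring_diacritics(PRECOMPOSED, "n") is true. Whether that last result
is the right default for a given user is exactly the question of this thread;
the sketch takes no position on it.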

>  > Yes. I am fully aware of this. But so be it. Having applications work
>  > differently depending on the locale of the environment the application
>  > was started in is nothing new.
> 
>  It's not new. It's old. We should move on to more general
>  environments that support multiple languages. Emacs is such an
>  environment. The old l10n paradigms are fundamentally incompatible
>  with that.
> 
> Sure, but doesn't it make sense to fall back to the user's default if the
> buffer does not have an overriding locale?

I don't know what you mean by "buffer has an overriding locale".
Emacs buffers don't have a locale, and they cannot do that in
principle because we support multiple languages.  E.g., what could the
locale of the HELLO buffer created by "C-h H" be?

>  > Being a multi-lingual environment, Emacs has no real notion of the
>  > locale.
>  >
>  > Perhaps it should?
> 
>  That'd be a step backward, IMO.
> 
> As opposed to having no concept of locale at all?

Yes.  A multilingual environment cannot have a locale in principle.
It will cease being multilingual if it does.

>  Strange, I always thought the data was there. Perhaps you should ask
>  a question on the Unicode mailing list, then.
> 
> That's a good idea actually.

That's a relief.  I was beginning to suspect I don't have any good
ideas at all.
