Re: On language-dependent defaults for character-folding

emacs-devel

From:	Eli Zaretskii
Subject:	Re: On language-dependent defaults for character-folding
Date:	Fri, 19 Feb 2016 12:09:44 +0200

> Date: Fri, 19 Feb 2016 17:22:18 +0800
> From: Elias Mårtenson <address@hidden>
> Cc: Lars Ingebrigtsen <address@hidden>, emacs-devel <address@hidden>
> 
> The Unicode character decomposition was never meant to be used to provide a
> feature such as character folding in Emacs.

That's not true.  Canonical equivalence, which is encoded in canonical
decompositions, is a must for searching.  Otherwise, what looks the
same on display will not be found, and will look like a bug.  See the
example I gave with ñ and ñ (the latter one is 2 characters).

So using decomposition is not a trick, it simply uses the same data
that determines equivalence of character sequences.
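
To make the point concrete, here is a minimal sketch in Python (not
Emacs code) using the standard unicodedata module.  The helper names
canonically_equal and fold are made up for illustration.  It shows
that the two spellings of ñ differ as code point sequences but compare
equal after canonical decomposition, and that dropping the combining
marks from the same decomposition yields the folded form.

```python
import unicodedata

precomposed = "\u00f1"   # LATIN SMALL LETTER N WITH TILDE, one code point
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE, two code points


def canonically_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def fold(text):
    """Decompose canonically, then drop every combining mark."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


print(precomposed == decomposed)                   # raw comparison
print(canonically_equal(precomposed, decomposed))  # comparison after NFD
print(fold("ma\u00f1ana") == fold("man\u0303ana"))  # both fold to "manana"
```

A search that compares raw code points would miss one of the two
spellings; comparing after decomposition finds both.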

> My suggestion would be to apply several levels of comparisons:
> 
> 1. Check if the characters have locale-specific folding rules (for Swedish,
> this would be no more than 3-5 characters or so). If not:
> 2. Check the equivalence according to the Unicode collation charts:
> http://unicode.org/charts/collation/
> 3. (maybe) Use the decomposition trick

2 and 3 are the same as we do already, AFAICT.  (Collation charts
describe ordering, which is irrelevant for searching; other than that,
you will see that Emacs already implements the data shown in
http://unicode.org/charts/collation/.)

As for the locale-specific parts: using that will only DTRT if we
assume that the majority of searches are done in buffers holding text
in the locale's language.  Is that a good assumption?  We are talking
about a multilingual Emacs, in an age of global communications, where
you can have conversations with someone on the other side of the
world, or read text that combines several languages in the same
buffer.  Do we really want to go back to the l10n days, when there was
only ever one locale that was interesting -- the current one?  I
wonder.

> As for the per-locale exception tables mentioned in point 1, I don't know if 
> such information is easily available.

It is, Unicode provides it.  We just haven't imported it yet.

> It may be possible to extract it from the localedata files from Glibc. But
> even if it isn't, creating one for a language should be trivial since we
> only need a list of character groups that should _not_ be folded, which for
> most languages should be a very small list (in fact, for most(?) it's
> probably empty).

It's more complex than that, but patches are welcome, of course.
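
For reference, the simple version of the quoted proposal could be
sketched as below, in Python rather than Emacs Lisp or C.  Everything
here is an assumption made for illustration: the NO_FOLD table, its
Swedish entry, and the fold function are hypothetical and do not come
from Unicode data or from Emacs.  The sketch only covers the "list of
characters that should not be folded" idea and none of the additional
complexity.

```python
import unicodedata

# Hypothetical per-language exception table: characters that are kept
# as-is instead of being folded to their base letter.
NO_FOLD = {
    "sv": set("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"),  # å ä ö Å Ä Ö
}


def fold(text, lang=None):
    """Fold TEXT by decomposition, except for LANG's exceptions."""
    keep = NO_FOLD.get(lang, set())
    out = []
    # Compose first so exceptions match whichever spelling was typed.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        nfd = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in nfd if not unicodedata.combining(c)))
    return "".join(out)


print(fold("sm\u00f6rg\u00e5s"))        # no language: fully folded
print(fold("sm\u00f6rg\u00e5s", "sv"))  # Swedish: ö and å are kept
print(fold("caf\u00e9", "sv"))          # é is not an exception, so folded
```

Note that the language is an argument of the search, not a property of
the text, so a buffer that mixes several languages still gets a single
set of exceptions.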

Note that the prerequisite for anything more complicated and elaborate
than what we have now is to re-implement character-folding on the C
level, inside search.c functions.  The current implementation is at
its limits already.  I tried to convince the interested people to do
this in C to begin with, but couldn't, and the feature was important
enough to have even in its current implementation.
