emacs-devel

Re: On language-dependent defaults for character-folding


From: Eli Zaretskii
Subject: Re: On language-dependent defaults for character-folding
Date: Sat, 20 Feb 2016 12:34:41 +0200

> From: Lars Ingebrigtsen <address@hidden>
> Cc: Eli Zaretskii <address@hidden>,  emacs-devel <address@hidden>
> Date: Sat, 20 Feb 2016 17:31:48 +1100
> 
> It seems to me that we're considering using the Unicode decomposition
> rules for "variant detection" because it's what we have.

No, we use decompositions because that's how equivalent strings are to
be compared and mapped/folded.

> But this doesn't allow people to say `C-s l' to find ł or `C-s o' to
> find ø, and this would obviously be something that many people would
> find helpful.
> 
> So the Unicode decomposition rules only get us halfway there.

Yes, the current implementation is just a first step.

> On the other hand, they go too far for other users, who absolutely do
> not want `C-s o' to find ø, but would be really glad if `C-s hermes'
> would find "Hermés" (or is it "Hermès"?  I can't even type that in
> on this keyboard).

Which is why this is toggle-able.

> (defvar *character-variants*
>   '((?a ?á ?å ?ä ...)
>     (?o ?ø ?ö ?ó ...)
>     ...))
> 
> Everything that somebody says "that's kinda an a, right?" goes on there.

The above won't support finding decomposed sequences as in á (there
are 2 characters here, they are just displayed as one).  I hope it's
agreed that it is imperative for us to support finding such decomposed
sequences (and we already do, under the current character-folding
default).  There are also more complicated cases like ǖ and ǖ (3
characters), where there are several diacritics which can be in either
order, and we still have to match them, because they look identical on
display.  We currently don't support that, but we should do that in
the future, and the decomposition data supports that.
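
To make the decomposed-sequence point concrete, here is a minimal sketch in Python rather than Emacs Lisp, using only the standard unicodedata module.  The table `variants_of_a` and the helper `table_matches` are hypothetical stand-ins for the proposed list, not anything that exists in Emacs.

```python
import unicodedata


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


precomposed = "\u00e1"   # á stored as a single code point
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT

# Hypothetical table in the spirit of the proposed list: it holds single
# characters only, so it knows nothing about multi-character sequences.
variants_of_a = {"a", "\u00e1", "\u00e5", "\u00e4"}


def table_matches(text):
    """True if text is exactly one character listed in the table."""
    return text in variants_of_a


def equivalent(x, y):
    """True if x and y have the same canonical decomposition."""
    return nfd(x) == nfd(y)


# U+01D6 (u with diaeresis and macron) decomposes into three characters:
# the base letter plus two combining marks.
u_diaeresis_macron = nfd("\u01d6")
```

Here `table_matches(decomposed)` is false even though the two strings display the same, while `equivalent(precomposed, decomposed)` is true because both sides are compared after decomposition.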

It is, of course, possible to support this without normalization, by
having all those combinations in the database you proposed.  But why
should we bother creating and maintaining such a database (and
updating it whenever a new Unicode version is released), when one is
already available in data that we already read into Emacs?  So we
currently implement this by using the decomposition information in the
Unicode database.

Also, what would be the algorithm for searching using the data you
propose?  If you want to use regexps, then the data should already be
in the form of regexps, I think.  And I expect the regexp to look very
similar to what we currently construct in character-fold.el.
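
As a rough illustration of a regexp built from decomposition data, here is a Python sketch.  It is not the code in character-fold.el; the names `build_fold_table` and `fold_regexp`, the 0x250 scan limit, and the use of the U+0300..U+036F block as "any combining mark" are all simplifying assumptions.

```python
import re
import unicodedata

# Assumption: treat the Combining Diacritical Marks block (U+0300..U+036F)
# as the set of marks that may follow a base character.
MARKS = "[\u0300-\u036f]"


def build_fold_table(limit=0x250):
    """Map each base character to the precomposed characters below
    `limit` whose canonical decomposition is that base plus marks."""
    table = {}
    for cp in range(limit):
        ch = chr(cp)
        decomp = unicodedata.normalize("NFD", ch)
        if len(decomp) > 1 and all(unicodedata.combining(c) for c in decomp[1:]):
            table.setdefault(decomp[0], set()).add(ch)
    return table


FOLD_TABLE = build_fold_table()


def fold_regexp(query):
    """Build a regexp in which every character of `query` also matches
    its precomposed variants and any trailing combining marks."""
    parts = []
    for ch in query:
        alternatives = [ch] + sorted(FOLD_TABLE.get(ch, ()))
        group = "|".join(re.escape(c) for c in alternatives)
        parts.append("(?:" + group + ")" + MARKS + "*")
    return "".join(parts)
```

With this sketch, `re.search(fold_regexp("hermes"), "hermès")` finds a match whether the è is precomposed or decomposed, while `fold_regexp("o")` does not match ø, because ø has no canonical decomposition.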

So what are we really arguing about here?  Is it about a feature that
will allow exempting specific decompositions from the search?  If so,
I don't think it would be hard to do that with the current
implementation, using just the locale-exception data (which should be
much smaller).  If that will make everyone happier, we can do this
now, if we are sure we won't have another round of prolonged dispute
about that.
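
A sketch of how small such exception data could be, again in Python and purely illustrative: `LOCALE_EXCEPTIONS` and `matches_folded` are hypothetical names, and the single "sv" entry is an example, not data taken from Emacs or from any locale database.

```python
import unicodedata

# Hypothetical per-locale exception data: characters that should NOT be
# folded into their base letter when searching in that locale.
LOCALE_EXCEPTIONS = {
    "sv": {"\u00e5", "\u00e4", "\u00f6"},   # å, ä, ö
}


def matches_folded(query_char, text_char, locale=""):
    """True if `text_char` (one character, or a base plus combining
    marks) should match `query_char` under decomposition-based folding,
    honoring the exceptions registered for `locale`."""
    if query_char == text_char:
        return True
    # Compare exceptions in composed form, so that precomposed and
    # decomposed spellings are exempted alike.
    composed = unicodedata.normalize("NFC", text_char)
    if composed in LOCALE_EXCEPTIONS.get(locale, set()):
        return False
    decomp = unicodedata.normalize("NFD", text_char)
    return decomp[0] == query_char and all(
        unicodedata.combining(c) for c in decomp[1:]
    )
```

The folding itself still comes from the decomposition data; the per-locale table only lists what to leave out, which is why it can stay much smaller than a full database of variants.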

> And then we just look up the locale, create the mapping when we type
> `C-s', and there we are.  An awesome, very useful feature that would
> annoy nobody, and that should be on by default.

But it doesn't pass the simplest test above, so it really isn't good
enough.

Btw, this was already discussed in the past, before Artur sat down to
implement this stuff.  You may wish to re-read those discussions to
see the broader picture.


