Re: On language-dependent defaults for character-folding

emacs-devel

From:	Eli Zaretskii
Subject:	Re: On language-dependent defaults for character-folding
Date:	Tue, 23 Feb 2016 19:11:52 +0200

> From: Juri Linkov <address@hidden>
> Cc: address@hidden,  address@hidden,  address@hidden,  address@hidden
> Date: Tue, 23 Feb 2016 02:14:55 +0200
> 
> > But the most basic issue is that any significant development in these
> > directions requires re-implementing the feature on the C level, and using
> > char-tables for folding, like we do with case-mapping.  So until
> > someone steps forward for the job, all we can do is small corrections
> > to the existing implementation.
> 
> Do I understand correctly that essentially what is necessary to do on the
> C level is to extend char-tables with character insertions and deletions,
> so in addition to canonical equivalence mappings (like those used for the
> existing case-mappings) char-tables should also support matching of
> multi-character additions (like combining accents in the search
> string) and deletions (like combining accents from the search string
> missing in the search text)?

I'm not sure I understand why you think char-tables need to be
extended in support of folding search.  AFAIU, we need a way to
normalize each character, both in the search string and in the
buffer/string we search.  This normalization involves decomposition
followed by reordering the combining diacritics into a canonical
order.  Then we just match one against the other, almost as usual
("almost" because we need to backtrack in the buffer/string upon
mismatch).  (Of course, decomposition of buffer/string text needs to
be done on the fly, but this is an implementation detail unrelated to
this discussion.)

So we need a char-table that maps each character into its
decomposition sequence, which AFAIR is something the current
char-tables can support already.  Am I missing something?
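
To make the normalize-then-match idea concrete, here is a rough sketch
in Python (not Emacs code, and not a proposal for the C
implementation).  It uses Python's unicodedata module in place of a
char-table of decompositions.  The names fold and folding_search are
made up for illustration, and dropping the combining marks before
comparing is a simplifying assumption about what "folding" should
ignore, so no backtracking is needed here.

```python
import unicodedata


def fold(text):
    """Return (folded, index_map) for TEXT.

    folded    -- TEXT decomposed, with combining marks dropped
                 (dropping them is an assumption of this sketch)
    index_map -- for each character of folded, the index in TEXT
                 of the character it came from
    """
    folded = []
    index_map = []
    for i, ch in enumerate(text):
        # NFD performs canonical decomposition and canonical reordering.
        for d in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(d) == 0:
                folded.append(d)
                index_map.append(i)
    return "".join(folded), index_map


def folding_search(needle, haystack):
    """Return (start, end) of the first folded match in HAYSTACK, or None."""
    folded_needle, _ = fold(needle)
    folded_hay, index_map = fold(haystack)
    if not folded_needle:
        return None
    pos = folded_hay.find(folded_needle)
    if pos < 0:
        return None
    start = index_map[pos]
    end = index_map[pos + len(folded_needle) - 1] + 1
    # Include combining marks that follow the last matched base character.
    while end < len(haystack) and unicodedata.combining(haystack[end]) != 0:
        end += 1
    return start, end
```

With this sketch, searching for "resume" finds "résumé" whether the
text is precomposed or decomposed.  Searching for "o" does not find
"ø", because that character has no canonical decomposition.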

If you are interested in the details, I suggest reading
http://unicode.org/reports/tr10/ and in particular
http://unicode.org/reports/tr10/#Searching, which deals specifically
with searching.  http://www.unicode.org/notes/tn5/ is also useful
reading.

> > For example, the default state of character-folding might depend on
> > the locale's language -- we could turn it off by default for languages
> > whose users expressed dissatisfaction with the feature.  We could also
> > augment the regular expressions created for folding the search string
> > by filtering out variants that users of a particular language don't
> > want.  If people think these ideas will make more users happy, we can
> > work on that.
> 
> It seems two user variables are necessary for customization:
> 
> 1. inclusive folding groups that will include by default such pairs
>    as o - ø, l - ł added to the Unicode decomposition-based rules,
>    and allow the users to add more rules;
> 
> 2. exclusive folding groups to exclude locale/language-dependent rules from
>    the default mappings above, e.g. removing n - ñ for the "es" locale.

I think we should add those in item 1 unconditionally (i.e. include
them in the default mappings), and then exclude some of them under the
rules you describe in item 2.  Then the problem becomes easier, as we
only need to filter out some mappings, as determined by a single user
variable (whose default can come from the user locale).

The additional mappings can be picked up from the file decomps.txt in
the UCA database.
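
A minimal sketch of that arrangement, again in Python and purely for
illustration: the extra pairs are always part of the default mappings,
and a single per-language exclusion set filters some characters out.
The names EXTRA_FOLDS, LOCALE_EXCLUSIONS and fold_key are hypothetical,
and the two tables contain only the examples mentioned in this thread,
not data taken from decomps.txt.

```python
import unicodedata

# Hypothetical extra mappings, always included in the defaults (item 1).
EXTRA_FOLDS = {"\u00f8": "o", "\u0142": "l"}  # ø -> o, ł -> l

# Hypothetical per-language exclusions (item 2): characters that must
# not be folded for users of that language.
LOCALE_EXCLUSIONS = {"es": {"\u00f1"}}  # keep ñ distinct from n


def fold_key(text, lang):
    """Return the folded form of TEXT for language LANG."""
    excluded = LOCALE_EXCLUSIONS.get(lang, set())
    out = []
    # Compose first, so a decomposed "n" + tilde is seen as one character.
    for ch in unicodedata.normalize("NFC", text):
        if ch in excluded:
            out.append(ch)
        elif ch in EXTRA_FOLDS:
            out.append(EXTRA_FOLDS[ch])
        else:
            # Default rule: decompose and drop the combining marks.
            out.extend(d for d in unicodedata.normalize("NFD", ch)
                       if unicodedata.combining(d) == 0)
    return "".join(out)
```

Two strings then match under folding when their keys are equal: with
lang "en" the key of "año" is "ano", while with lang "es" it stays
"año", so a search for "ano" would not find it.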
