emacs-devel
Re: On language-dependent defaults for character-folding


From: Eli Zaretskii
Subject: Re: On language-dependent defaults for character-folding
Date: Thu, 25 Feb 2016 18:24:08 +0200

> From: Juri Linkov <address@hidden>
> Cc: address@hidden,  address@hidden,  address@hidden,  address@hidden
> Date: Thu, 25 Feb 2016 02:29:11 +0200
> 
> >> >> It seems two user variables are necessary for customization:
> >> >>
> >> >> 1. inclusive folding groups that will include by default such pairs
> >> >>    as o - ø, l - ł added to the Unicode decomposition-based rules,
> >> >>    and allow the users to add more rules;
> >> >>
> >> >> 2. exclusive folding groups to exclude locale/language-dependent rules
> >> >>    from the default mappings above, e.g. removing n - ñ for the "es" locale.
> >> >
> >> > I think we should add those in item 1 unconditionally (i.e. include
> >> > them in the default mappings), and then exclude some of them under the
> >> > rules you describe in item 2.  Then the problem becomes easier, as we
> >> > only need to filter out some mappings, as determined by a single user
> >> > variable (whose default can come from the user locale).
> >> 
> >> Better to have 4 variables (2 internal + 2 user customizable variables):
> >
> > Can you explain why it's better to have 4 variables rather than just
> > one?
> 
> If you mean that one customizable variable should contain all mappings from
> UnicodeData.txt and decomps.txt presented to the user for customization,
> such a list will be too huge to customize: there are 5721 decompositions
> in UnicodeData.txt, and 6674 decompositions in decomps.txt.

No, of course not.  That would be extremely inconvenient.

What I envisioned is a single variable that holds a list of folding
sub-features.  Examples include ignoring diacritics, matching
ligatures and their decompositions, "controversial" foldings that
users of specific languages might not want, etc.  The default value
will hold all of the sub-features; users who don't want some of them
will be able to remove them from the list, which will affect the
mapping at search time.  We could also have a setting that means "DTRT
for my locale", which will remove the sub-features inappropriate for
the locale's language.  Stuff like that.
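
To make the shape of such a variable concrete, here is a minimal
sketch, written in Python rather than Lisp purely for illustration.
The sub-feature names, the "locale" setting, and the per-language
exclusion table are all hypothetical; nothing here names an existing
Emacs variable or symbol.

```python
# Hypothetical sub-feature names; none of these exist in Emacs.
ALL_SUBFEATURES = ("diacritics", "ligatures", "n-tilde")

# Hypothetical per-language exclusions: sub-features that a language's
# users probably do not want (e.g. folding n - ñ for Spanish).
LOCALE_EXCLUSIONS = {"es": {"n-tilde"}}


def effective_subfeatures(user_value, locale_lang=None):
    """Return the list of sub-features to apply at search time.

    USER_VALUE is either the string "locale" (meaning "DTRT for my
    locale") or a collection of sub-feature names the user kept.
    """
    if user_value == "locale":
        excluded = LOCALE_EXCLUSIONS.get(locale_lang, set())
        return [f for f in ALL_SUBFEATURES if f not in excluded]
    # Keep the canonical order and silently drop unknown names.
    return [f for f in ALL_SUBFEATURES if f in user_value]
```

The default would be the full list; a user who removes an entry, or
who picks the locale-driven setting, gets a shorter list, and only
that list is consulted when the mapping is built.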

> So we could have at least one default internal variable containing all
> decompositions from UnicodeData.txt plus decompositions from decomps.txt
> minus locale-dependent mappings.

Internally, we need a translation table for mapping equivalent
characters.  This table should be recomputed (or selected among
several precomputed ones) according to the list of sub-features that
the user requested.
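
A rough sketch of that recomputation, again in Python and only as an
illustration: the feature names and the small sample range are
assumptions, and the table is derived from Python's unicodedata
module (i.e. UnicodeData.txt decompositions only), not from
decomps.txt.  Caching on the feature set stands in for "selecting
among several precomputed tables".

```python
import unicodedata
from functools import lru_cache

# Small sample of characters, just to keep the sketch cheap to run:
# part of Latin-1 Supplement, Latin Extended-A, and the "fi" ligature.
SAMPLE_CHARS = [chr(c) for c in range(0x00C0, 0x0180)] + ["\ufb01"]


@lru_cache(maxsize=None)
def build_table(features):
    """Map each sample character to its folded form.

    FEATURES is a frozenset of hypothetical sub-feature names.
    """
    table = {}
    for ch in SAMPLE_CHARS:
        folded = ch
        if ("ligatures" in features
                and unicodedata.decomposition(ch).startswith("<compat>")):
            # Expand compatibility decompositions such as U+FB01.
            folded = unicodedata.normalize("NFKD", folded)
        if "diacritics" in features:
            # Decompose canonically, then drop the combining marks.
            folded = "".join(
                c for c in unicodedata.normalize("NFD", folded)
                if not unicodedata.combining(c))
        if folded != ch:
            table[ch] = folded
    return table


def fold(text, features):
    """Fold TEXT using the table for the requested sub-features."""
    table = build_table(frozenset(features))
    return "".join(table.get(ch, ch) for ch in text)
```

Note that with this data source ø is left alone even when diacritics
are folded, because UnicodeData.txt gives it no decomposition.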

> >   http://unicode.org/Public/UCA/latest/decomps.txt
> >
> > (The last release of Unicode is v8.0.)
> 
> Thanks, comparing UnicodeData.txt with the latest decomps.txt shows
> 1600 differences (such as ł decomposed to l and ̵ and ø to o and ̸)
> we need to add manually (a whole set of differences is attached below):

I think we need to create another uni-*.el file which defines a
decomposition char-table populated from decomps.txt.
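
For illustration only, here is how such a table could be populated,
sketched in Python.  It assumes each data line has the shape
CODE;TAG;SEQUENCE with hexadecimal code points and "#" comments; that
layout should be checked against the real decomps.txt before relying
on it.  The two sample lines are made up to mirror the ø and ł cases
mentioned above and are not copied from the file.

```python
# Illustrative input; the line layout is an assumption (see above).
SAMPLE = """\
# code;tag;decomposition
00F8;;006F 0338
0142;;006C 0335
"""


def parse_decomps(text):
    """Return a dict mapping a character to its decomposition string."""
    table = {}
    for raw in text.splitlines():
        # Drop trailing comments and skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 3:
            continue  # not a data line in the assumed layout
        code, _tag, seq = fields[:3]
        table[chr(int(code, 16))] = "".join(
            chr(int(cp, 16)) for cp in seq.split())
    return table
```

The resulting mapping plays the role of the decomposition char-table;
in Emacs it would be generated at build time like the other uni-*.el
files.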


