emacs-devel

Re: On language-dependent defaults for character-folding


From: Eli Zaretskii
Subject: Re: On language-dependent defaults for character-folding
Date: Wed, 24 Feb 2016 20:39:50 +0200

> From: Juri Linkov <address@hidden>
> Cc: address@hidden,  address@hidden,  address@hidden,  address@hidden
> Date: Wed, 24 Feb 2016 02:16:23 +0200
> 
> > So we need a char-table that maps each character into its
> > decomposition sequence, which AFAIR is something the current
> > char-tables can support already.  Am I missing something?
> 
> Searching for a base character and matching a sequence of characters
> (e.g. a base character and combining accents) might be already possible
> by the current char-tables indexed by a base character.  But I see
> no way to specify in a char-table that, e.g., a character should be
> skipped in the search buffer.  Maybe this need could be avoided in
> an asymmetric search with combining characters in the search buffer,
> but it is still required for ignorable characters.

Whether ignorables can be supported by the current char-tables depends
on the data we store in such a table.  Each entry could be a vector of
objects that provide both the codepoint and its weight; then it's easy
to implement skipping characters by throwing away those whose weight
is above the threshold specified by the caller.
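
To make the idea concrete, here is a rough sketch in Python rather than
Emacs Lisp, purely as an illustration.  The dictionary stands in for a
char-table, and the names FOLD_TABLE and fold, as well as the weight
values, are invented for this example; they are not existing Emacs
data or API.

```python
from typing import Dict, List, Tuple

Entry = Tuple[int, int]  # (codepoint, weight)

# Hypothetical stand-in for a char-table: each character maps to a
# sequence of (codepoint, weight) pairs.  The weights are assumptions
# made up for this sketch: 0 = base, 1 = combining mark, 2 = ignorable.
FOLD_TABLE: Dict[str, List[Entry]] = {
    "\u00e9": [(0x65, 0), (0x301, 1)],  # e-acute -> e + combining acute
    "\u0301": [(0x301, 1)],             # a bare combining acute accent
    "\u00ad": [(0xAD, 2)],              # soft hyphen, treated as ignorable
}


def fold(text: str, threshold: int) -> str:
    """Expand TEXT through the table, dropping entries above THRESHOLD."""
    out = []
    for ch in text:
        # Characters without an entry map to themselves with weight 0.
        entries = FOLD_TABLE.get(ch, [(ord(ch), 0)])
        out.extend(chr(cp) for cp, weight in entries if weight <= threshold)
    return "".join(out)
```

With a threshold of 0 the caller sees only base characters, so both
accents and ignorables are skipped; a threshold of 1 keeps the accents
but still skips the ignorables.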

> >> It seems two user variables are necessary for customization:
> >>
> >> 1. inclusive folding groups that will include by default such pairs
> >>    as o - ø, l - ł added to the Unicode decomposition-based rules,
> >>    and allow the users to add more rules;
> >>
> >> 2. exclusive folding groups to exclude locale/language-dependent rules from
> >>    the default mappings above, e.g. removing n - ñ for the "es" locale.
> >
> > I think we should add those in item 1 unconditionally (i.e. include
> > them in the default mappings), and then exclude some of them under the
> > rules you describe in item 2.  Then the problem becomes easier, as we
> > only need to filter out some mappings, as determined by a single user
> > variable (whose default can come from the user locale).
> 
> Better to have 4 variables (2 internal + 2 user customizable variables):

Can you explain why it's better to have 4 variables rather than just
one?
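
For reference, the single-variable scheme I have in mind could look
roughly like the following Python sketch (again only an illustration,
not Emacs code).  DEFAULT_FOLDS, LOCALE_EXCLUSIONS and effective_folds
are hypothetical names, and the table contents are limited to the
pairs mentioned in this thread.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

Pair = Tuple[str, str]  # (base character, folded variant)

# Default mappings: decomposition-based pairs plus the extra ones
# (o - ø, l - ł), all included unconditionally.
DEFAULT_FOLDS: Dict[str, FrozenSet[str]] = {
    "o": frozenset({"ø"}),
    "l": frozenset({"ł"}),
    "n": frozenset({"ñ"}),
}

# Hypothetical per-language defaults for the single exclusion variable.
LOCALE_EXCLUSIONS: Dict[str, Set[Pair]] = {
    "es": {("n", "ñ")},
}


def effective_folds(exclude: Optional[Set[Pair]] = None,
                    lang: Optional[str] = None) -> Dict[str, FrozenSet[str]]:
    """Return the default mappings minus the excluded pairs.

    If EXCLUDE is not given, its default comes from the language LANG.
    """
    if exclude is None:
        exclude = LOCALE_EXCLUSIONS.get(lang or "", set())
    result = {}
    for base, variants in DEFAULT_FOLDS.items():
        kept = frozenset(v for v in variants if (base, v) not in exclude)
        if kept:
            result[base] = kept
    return result
```

Here the only thing a user customizes is the exclusion set; everything
else is a fixed default that is merely filtered.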

> It would be good to find all differences between UnicodeData.txt and
> decomps.txt.  Is this the latest version?
> http://unicode.org/Public/UCA/6.3.0/decomps.txt

No, the latest is always here:

  http://unicode.org/Public/UCA/latest/decomps.txt

(The latest release of Unicode is v8.0.)


