emacs-devel

Re: On language-dependent defaults for character-folding


From: Juri Linkov
Subject: Re: On language-dependent defaults for character-folding
Date: Wed, 24 Feb 2016 02:16:23 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.91 (x86_64-pc-linux-gnu)

>> > But the most basic issue is that any significant development in these
>> > directions require to re-implement the feature on the C level, and use
>> > char-tables for folding, like we do with case-mapping.  So until
>> > someone steps forward for the job, all we can do is small corrections
>> > to the existing implementation.
>>
>> Do I understand correctly that essentially what is necessary to do on the
>> C level is to extend char-tables with character insertions and deletions,
>> so in addition to canonical equivalence mappings (like are used for the
>> existing case-mappings) char-tables should also support matching of
>> multi-character additions (like combining accents in the search
>> string) and deletions (like combining accents from the search string
>> missing in the search text)?
>
> I'm not sure I understand why you think char-tables need to be
> extended in support of folding search.  AFAIU, we need a way to
> normalize each character, both in the search string and in the
> buffer/string we search.  This normalization involves decomposition
> followed by reordering the combining diacritics into a canonical
> order.  Then we just match one against the other, almost as usual
> ("almost" because we need to backtrack in the buffer/string upon
> mismatch).  (Of course, decomposition of buffer/string text needs to
> be done on the fly, but this is an implementation detail unrelated to
> this discussion.)
>
> So we need a char-table that maps each character into its
> decomposition sequence, which AFAIR is something the current
> char-tables can support already.  Am I missing something?

Searching for a base character and matching a sequence of characters
(e.g. a base character plus combining accents) might already be possible
with the current char-tables indexed by the base character.  But I see
no way to specify in a char-table that, e.g., a character should be
skipped in the searched buffer.  Maybe this need could be avoided in an
asymmetric search that allows combining characters in the searched
buffer, but it is still required for ignorable characters.
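
To make the distinction concrete, here is a rough sketch in Python
(not Emacs Lisp or C, and not a proposal for the actual implementation).
It treats folding as a per-character mapping to a sequence, where
combining marks and ignorable characters map to the empty sequence, and
keeps an index back into the original text so a match can be reported
at the right position.  The IGNORABLE set and the function names are
made up for illustration; a real list of ignorables would come from the
UCA data.

```python
import unicodedata

# Hypothetical set of ignorable characters, for illustration only.
IGNORABLE = {"\u00ad", "\u200b"}  # soft hyphen, zero width space


def fold(text):
    """Return (folded_text, origin), where origin[k] is the index in
    TEXT of the character that produced folded_text[k]."""
    folded = []
    origin = []
    for i, ch in enumerate(text):
        if ch in IGNORABLE:
            continue  # "deletion": the character maps to nothing
        for d in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(d):
                continue  # drop combining marks after decomposition
            folded.append(d)
            origin.append(i)
    return "".join(folded), origin


def fold_search(needle, haystack):
    """Return the index in HAYSTACK where the folded NEEDLE first
    matches, or -1 if there is no match."""
    folded_needle, _ = fold(needle)
    folded_haystack, origin = fold(haystack)
    if not folded_needle:
        return 0
    pos = folded_haystack.find(folded_needle)
    return origin[pos] if pos >= 0 else -1
```

This folds both sides symmetrically and throws the accents away, so it
is cruder than the matching described in tr10; it only shows that
skipping a character amounts to mapping it to an empty sequence.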

> If you are interested in the details, I suggest reading
> http://unicode.org/reports/tr10/ and in particular
> http://unicode.org/reports/tr10/#Searching, which deals specifically
> with searching.  http://www.unicode.org/notes/tn5/ is also a useful
> reading.

Thanks, looks like a complete specification with comprehensive answers
to most questions.

>> > For example, the default state of character-folding might depend on
>> > the locale's language -- we could turn it off by default for languages
>> > whose users expressed dissatisfaction with the feature.  We could also
>> > augment the regular expressions created for folding the search string
>> > by filtering out variants that users of a particular language don't
>> > want.  If people think these ideas will make more users happy, we can
>> > work on that.
>>
>> It seems two user variables are necessary for customization:
>>
>> 1. inclusive folding groups that will include by default such pairs
>>    as o - ø, l - ł added to the Unicode decomposition-based rules,
>>    and allow the users to add more rules;
>>
>> 2. exclusive folding groups to exclude locale/language-dependent rules from
>>    the default mappings above, e.g. removing n - ñ for the "es" locale.
>
> I think we should add those in item 1 unconditionally (i.e. include
> them in the default mappings), and then exclude some of them under the
> rules you describe in item 2.  Then the problem becomes easier, as we
> only need to filter out some mappings, as determined by a single user
> variable (whose default can come from the user locale).

It would be better to have 4 variables (2 internal + 2 user-customizable):

1.1. (internal) default mappings with additional data from decomps.txt

1.2. user mappings to add to the default list

2.1. (internal) locale-dependent mappings to remove from the default list

2.2. user mappings to remove from the default list
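
How these four lists could combine can be sketched with plain sets
(Python here, only to show the set arithmetic).  All names and the
sample data are hypothetical, and the order of operations is an
assumption: locale removals are applied to the defaults first, so that
a user addition can bring back a pair that the locale removed, and user
removals are applied last.

```python
# 1.1 (internal) default mappings; sample pairs only.
DEFAULT_MAPPINGS = {("o", "ø"), ("l", "ł"), ("n", "ñ")}

# 2.1 (internal) locale-dependent removals; sample data only.
LOCALE_REMOVALS = {"es": {("n", "ñ")}}


def effective_mappings(locale, user_additions=(), user_removals=()):
    """Combine the four lists into the set of folding pairs to use."""
    # Start from the defaults minus what the locale excludes.
    result = DEFAULT_MAPPINGS - LOCALE_REMOVALS.get(locale, set())
    # 1.2 user additions are applied after the locale filter (assumption).
    result |= set(user_additions)
    # 2.2 user removals always win.
    result -= set(user_removals)
    return result
```

For example, effective_mappings("es") leaves out n - ñ, while
effective_mappings("es", user_additions=[("n", "ñ")]) puts it back.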

> The additional mappings can be picked up from the file decomps.txt in
> the UCA database.

It would be good to find all differences between UnicodeData.txt and
decomps.txt.  Is this the latest version?
http://unicode.org/Public/UCA/6.3.0/decomps.txt


