Re: On language-dependent defaults for character-folding

emacs-devel


From:	Eli Zaretskii
Subject:	Re: On language-dependent defaults for character-folding
Date:	Sat, 13 Feb 2016 10:49:30 +0200

> From: Juri Linkov <address@hidden>
> Cc: Óscar Fuentes <address@hidden>,  address@hidden
> Date: Sat, 13 Feb 2016 01:57:33 +0200
> 
> Can't we somehow use the same char-folding as is implemented in
> ICU String Search Service (this is also used for search in Chromium):
> http://userguide.icu-project.org/collation/icu-string-search-service
> that supports matching of accented letters, conjoined letters,
> and ignorable punctuation.
> 
> As is described in http://userguide.icu-project.org/collation/concepts
> there are several levels of character matching:
> 
> 1. Primary Level: differences between base characters
> 
> 2. Secondary Level: Accents in the characters
> 
> 3. Tertiary Level: Upper and lower case differences in characters
> 
> 4. Quaternary Level: Punctuation is ignored (where e.g. snake-cased
>    “black_bird” matches camel-cased “blackBird”)
> 
> 5. Identical Level
> 
> Maybe our customization could provide options to choose
> between all these levels?

That's the final goal, yes.  The current implementation is just the
initial step, and it basically does just item #1.  (The list above is
about collation, not about searching, so the wording does not really
fit the searching use case.  Also, the ICU documentation just
reiterates what Unicode TR#10, http://unicode.org/reports/tr10/,
specifies.)
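
To make "item #1" concrete, here is a rough sketch of a comparison that
looks only at base characters.  It is written in Python with the
standard unicodedata module, not in Emacs Lisp or C, and it is not the
Emacs implementation: the names primary_key and primary_match are
made up for illustration, and stripping combining marks after
decomposition only approximates what TR#10 calls the primary level.

```python
import unicodedata


def primary_key(text: str) -> str:
    """Reduce TEXT to an approximate 'base characters only' key.

    Assumption: compatibility decomposition plus removal of combining
    marks plus case folding stands in for a primary-level comparison.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns 0 for base characters and a nonzero
    # combining class for accents and similar marks.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def primary_match(needle: str, haystack: str) -> bool:
    """Return True if NEEDLE occurs in HAYSTACK when both are reduced to keys."""
    return primary_key(needle) in primary_key(haystack)
```

Because the key can differ in length from the original text, this kind
of sketch can only answer "does it match", not "where in the buffer",
which is one reason a real implementation needs more care.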

The implementation should really be on the C level, like the
case-folding support.  The current implementation isn't, and therefore
has several disadvantages, some of which were already pointed out
(e.g., the regexp it uses gets exposed in some situations and
surprises users).  For these and other reasons, I think we should
replace the current implementation with one that's in search_buffer,
driven by tables generated from the Unicode database.  I also think
we will be unable to move to the higher levels mentioned above
without first moving the implementation into search_buffer.
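
As an illustration of what "driven by tables" could mean, here is a
small Python sketch, again not Emacs code.  It builds a one-to-one
folding table from the Unicode data that ships with Python and then
searches by comparing folded characters directly, with no regexp
involved.  The names build_fold_table and folded_search, and the
0x3000 code point limit, are assumptions chosen for the example; the
real search_buffer code would look quite different.

```python
import unicodedata


def build_fold_table(limit: int = 0x3000) -> dict:
    """Map each precomposed character below LIMIT to its base character.

    Only characters whose canonical decomposition is one base character
    followed by combining marks get an entry.
    """
    table = {}
    for cp in range(limit):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        if (len(decomposed) > 1
                and not unicodedata.combining(decomposed[0])
                and all(unicodedata.combining(c) for c in decomposed[1:])):
            table[ch] = decomposed[0]
    return table


def folded_search(needle: str, haystack: str, table: dict) -> int:
    """Return the index in HAYSTACK of the first folded match, or -1."""
    n = [table.get(c, c) for c in needle]
    h = [table.get(c, c) for c in haystack]
    # The mapping is one character to one character, so indexes into
    # the folded list are also valid indexes into the original text.
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i
    return -1


FOLD_TABLE = build_fold_table()
```

This sketch does not handle text that already contains separate
combining marks, nor case differences; it only shows the table lookup
replacing a generated regexp.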

Volunteers are welcome to work on that.  Doing this will eventually
require using the data in DUCET (Default Unicode Collation Element
Table) and CLDR (Common Locale Data Repository), I think, to support
both the language-independent and language-dependent folding.  But
this is only needed for the next levels; the current level, which
basically only looks at the base character, doesn't need fancy
databases apart from what we already have.

At the time, no one stepped forward to do this on the C level, and the
current implementation was considered good enough for a first
step.
