bug#13041: 24.2; diacritic-fold-search


From: martin rudalics
Subject: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 11:28:05 +0100

>> `ignore-diacritics' is misleading.  The variable would have
>> to be called `observe-decompositions' or something like that.
>
>
> 1. "Observe decompositions" doesn't mean anything to me.  The verb should
> probably be more active - what does it mean to observe the char decompositions
> here?
>
> BTW, if we use "decomposition" in the name and description then we should
> probably also use "char" - this is not about decomposing strings in some way
> (whatever that might mean); it involves decomposing Unicode characters.

`ignore-diacritics' is misleading because when we, for example,
sort/match ligatures we already do more than ignore diacritics.  A
variable using the term `observe-decompositions' would express what the
underlying algorithm does - observe the decomposition properties
provided by `get-char-code-property'.
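
To make the point concrete outside Emacs, here is a minimal sketch that
uses Python's `unicodedata' module as a stand-in for
`get-char-code-property'.  The helper name `decomposition_of' is
hypothetical, and this is not the code under discussion; it only shows
what the decomposition property looks like for a diacritic, a ligature
and "ß".

```python
import unicodedata

def decomposition_of(char):
    """Return the raw Unicode decomposition property of CHAR as a string."""
    return unicodedata.decomposition(char)

# U+00F6 (o with diaeresis): canonical decomposition, base letter + combining mark.
print(decomposition_of("\u00f6"))
# U+FB03 (ffi ligature): compatibility decomposition into three letters.
print(decomposition_of("\ufb03"))
# U+00DF (sharp s): no decomposition at all, so "ss" needs separate handling.
print(repr(decomposition_of("\u00df")))
```

Only the first case involves a diacritic, which is why a name built
on "ignore diacritics" covers just part of what happens.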

Bear in mind that a "correct" solution for searching and sorting would
have to be based on a correct implementation of a collation table (see
bug#12008) plus some options that make searching more convenient (aka
"asymmetric searching" http://www.unicode.org/reports/tr10/#Searching).
In that sense, Juri's approach for searching and my function can be
considered only as poor man's variants of what should eventually be
done.

For example my Austrian locale sorts

  o < ö < p

while IIUC Swedish has

  o < p ... < z < ö

which IIUC can't be done via the decomposition table.  I don't know
whether this implies that searching for "o" in Swedish means to _not_
list results for "ö" either.
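
A rough sketch of why the decomposition table alone yields the Austrian
order, again in Python rather than Emacs Lisp.  The key function
`decomposed_key' is hypothetical: it strips combining marks after
compatibility decomposition and uses the original string only to break
ties.

```python
import unicodedata

def decomposed_key(s):
    """Sort key: S without combining marks, then S itself as a tie-breaker."""
    decomposed = unicodedata.normalize("NFKD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, s)

words = ["z", "\u00f6", "p", "o"]  # \u00f6 is o with diaeresis
# The umlauted letter always lands directly after its base letter.
print(sorted(words, key=decomposed_key))
```

With such a key "ö" can only ever sort next to "o".  Placing it after
"z", as Swedish does, needs language-specific tailoring that the
decomposition data does not carry.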

> 2. But my confusion over the name/description is in fact wrt function
> `decomposed-string-lessp': I guess it's not 100% clear to me what it does.
>
> Your doc string said "STRING1 is decomposition-less than STRING2", which
> confuses me.  And it is a bit ambiguous wrt "-less":
>
>  a. decomposition-less as in comparing the strings only after
>     removing (some parts of) their decompositions (i.e., "-less"
>     as in "sans")?
>
> or
>
>  b. -lessp as in `string<': a comparison ordering relation?

I didn't think much about the wording.  But I can't, in general, talk
about comparing characters because in the ligature case (or the "ß" vs
"ss" case) I do compare substrings.

> In the version of `decomposed-string-lessp' that I sent, I changed the doc
> string to this: "decomposed STRING1 is less than decomposed STRING2".  But that
> is no doubt incorrect (less correct than yours, if perhaps clearer).  In
> particular, it says nothing about how we compare the two decompositions.
>
> In practical (use) terms, this is typically about ignoring diacritics, keeping
> only the "base" characters.  Something about that should at least be mentioned
> in the doc, so that users know they can use this for that.

Yes.

> But IIUC this is not just about diacritics; it sometimes might not be about
> diacritics at all; and diacritics present are sometimes not ignored.  E.g., the
> ligature ffi gets treated the same as the 3 chars f f i.  There are no
> diacritics present in that case.

That's why I want to just talk about decompositions for the moment.

> IIUC, we convert the two strings to their Unicode decompositions and then use
> the Unicode char compatibility specs to compare the decompositions.  IOW, we
> treat equivalent chars, as defined by Unicode, as the same.

Character sequences, IIUC.
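
A small sketch of the distinction, assuming Python's NFKD normalization
as a stand-in for the decomposition step (the helper name
`compat_decompose' is hypothetical): one character on one side ends up
being compared with a sequence of three on the other.

```python
import unicodedata

def compat_decompose(s):
    """Return S in compatibility-decomposed form (NFKD)."""
    return unicodedata.normalize("NFKD", s)

ligature = "\ufb03"  # the single-character ffi ligature
# One character before decomposition, three afterwards.
print(len(ligature), len(compat_decompose(ligature)))
# The decomposed ligature equals the plain three-letter sequence.
print(compat_decompose(ligature) == "ffi")
```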

> Perhaps the name/description should speak in terms of Unicode char compatibility
> or equivalence.  Perhaps a name like `string-less-compat-p'?  Or
> `Unicode-equivalent-p'?  Or `string-equivalent-p'?
>
> How would you characterize what the function does?  No doubt Eli can help here.
> It is important to try to get the function name and description right from the
> outset, if we can.  If the Unicode standard has some terminology that applies
> here then perhaps we can/should leverage that.

I'm not sure whether we can ever fully support Unicode here - the
weights you find in http://www.unicode.org/Public/UCA/6.2.0/allkeys.txt
look hardly digestible to me (and to my machine, presumably).

> Beyond the name and an accurate description, the doc should, as I say, at least
> mention that you can use this to ignore diacritics (such as accents), as that
> will be a common use case.

Sure.

martin
