bug-gnu-emacs

bug#13041: 24.2; diacritic-fold-search


From: Drew Adams
Subject: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 5 Dec 2012 07:38:10 -0800

> `ignore-diacritics' is misleading.  The variable would have 
> to be called `observe-decompositions' or something the like.


1. "Observe decompositions" doesn't mean anything to me.  The verb should
probably be more active - what does it mean to observe the char decompositions
here?

BTW, if we use "decomposition" in the name and description then we should
probably also use "char" - this is not about decomposing strings in some way
(whatever that might mean); it involves decomposing Unicode characters.


2. But my confusion over the name/description is in fact wrt function
`decomposed-string-lessp': I guess it's not 100% clear to me what it does.

Your doc string said "STRING1 is decomposition-less than STRING2", which
confuses me.  And it is a bit ambiguous wrt "-less":

 a. decomposition-less as in comparing the strings only after
    removing (some parts of) their decompositions (i.e., "-less"
    as in "sans")?

or

 b. -lessp as in `string<': a comparison ordering relation?

In the version of `decomposed-string-lessp' that I sent, I changed the doc
string to this: "decomposed STRING1 is less than decomposed STRING2".  But that
is no doubt incorrect (less correct than yours, if perhaps clearer).  In
particular, it says nothing about how we compare the two decompositions.

In practical (use) terms, this is typically about ignoring diacritics, keeping
only the "base" characters.  Something about that should at least be mentioned
in the doc, so that users know they can use this for that.

But IIUC this is not just about diacritics; it sometimes might not be about
diacritics at all; and diacritics present are sometimes not ignored.  E.g., the
ligature ﬃ (U+FB03) gets treated the same as the 3 chars f f i.  There are no
diacritics present in that case.
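
To make the distinction concrete, here is a small sketch.  It uses Python's
standard `unicodedata' module rather than Emacs Lisp, purely as an
illustration of the Unicode data involved; the helper name
`show_decomposition' is made up for this example.  It shows that the ligature
has a compatibility decomposition with no combining marks, whereas an accented
letter has a canonical decomposition into base char plus combining mark.

```python
import unicodedata

def show_decomposition(ch):
    """Return (tag, chars) for the Unicode decomposition mapping of CH.

    TAG is a compatibility tag such as "<compat>", or None when the
    decomposition is canonical.  (Hypothetical helper, for illustration.)
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith("<"):
        tag = fields.pop(0)
    # The remaining fields are hexadecimal code points.
    return tag, "".join(chr(int(f, 16)) for f in fields)

# LATIN SMALL LIGATURE FFI: compatibility decomposition, no diacritics.
ligature = show_decomposition("\ufb03")

# LATIN SMALL LETTER E WITH ACUTE: canonical decomposition into
# base char "e" plus COMBINING ACUTE ACCENT (U+0301).
accented = show_decomposition("\u00e9")
```

So "ignoring diacritics" describes only the second kind of case.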

IIUC, we convert the two strings to their Unicode decompositions and then use
the Unicode char compatibility specs to compare the decompositions.  IOW, we
treat equivalent chars, as defined by Unicode, as the same.
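
A minimal sketch of that reading, again in Python and only under my
assumptions about the intended behavior (it is not the Emacs code under
discussion).  The names `fold', `decomposed_less' and `decomposed_equal', and
the `strip_marks' flag, are hypothetical.  The flag corresponds to the two
readings above: whether the combining marks are dropped after decomposition or
kept in the comparison.

```python
import unicodedata

def fold(s, strip_marks=True):
    """Return the compatibility decomposition (NFKD) of S.

    With STRIP_MARKS, also drop combining marks, i.e. keep only the
    "base" characters.  (Assumed behavior, for illustration only.)
    """
    decomposed = unicodedata.normalize("NFKD", s)
    if strip_marks:
        decomposed = "".join(
            c for c in decomposed if not unicodedata.combining(c)
        )
    return decomposed

def decomposed_less(s1, s2, strip_marks=True):
    """Ordering relation (as in `string<') on the folded strings."""
    return fold(s1, strip_marks) < fold(s2, strip_marks)

def decomposed_equal(s1, s2, strip_marks=True):
    """Equivalence relation on the folded strings."""
    return fold(s1, strip_marks) == fold(s2, strip_marks)

# A plain code-point comparison sorts "\u00e9clair" after "f";
# after folding it compares as "eclair" and sorts before "f".
plain = "\u00e9clair" < "f"
folded = decomposed_less("\u00e9clair", "f")

# The ligature case involves no diacritics at all.
same_word = decomposed_equal("o\ufb03ce", "office")
```

Note that with `strip_marks' off, "\u00e9" still differs from plain "e" (the
combining accent remains), which is the sense in which diacritics are
sometimes not ignored.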

Perhaps the name/description should speak in terms of Unicode char compatibility
or equivalence.  Perhaps a name like `string-less-compat-p'?  Or
`Unicode-equivalent-p'?  Or `string-equivalent-p'?

How would you characterize what the function does?  No doubt Eli can help here.
It is important to try to get the function name and description right from the
outset, if we can.  If the Unicode standard has some terminology that applies
here then perhaps we can/should leverage that.

Beyond the name and an accurate description, the doc should, as I say, at least
mention that you can use this to ignore diacritics (such as accents), as that
will be a common use case.

