bug-gnu-emacs

bug#18051: [Emacs-diffs] trunk r117726: Add string collation.


From: Michael Albinus
Subject: bug#18051: [Emacs-diffs] trunk r117726: Add string collation.
Date: Wed, 27 Aug 2014 13:24:48 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Eli Zaretskii <eliz@gnu.org> writes:

> Here are a few more thoughts about related issues:
>
> 1. Why does str_collate return a ptrdiff_t value?  AFAIK, wcscoll
>    etc. return int data type, and of rather small values.

Hm, yes. Both wcscoll and w32_compare_strings return int, so I've
changed that for str_collate accordingly.

> 2. Should we signal an error if the input strings are not pure-ASCII
>    or multibyte?  Unibyte strings will at best cause incorrect
>    results.

Maybe we should convert the strings to multibyte via string_to_multibyte()?
If a string is already multibyte, the conversion does no harm.

>    And what about strings with invalid codepoints,
>    e.g. those outside of the Unicode range, which can happen inside
>    Lisp strings?

> 3. What about errors in wcscoll?  The current code ignores them;
>    however, the value returned by wcscoll in case of an error is not
>    documented, so it could be random.  Should we signal an error if
>    errno gets set by wcscoll?

wcscoll sets errno to EINVAL when a codepoint is out of range. I've added
a check for this case, which signals an error:

(string-collate-equalp (string 1) (string ?\U0020FFFF))
  => error: Non-Unicode character: 0x20ffff
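
To illustrate the errno check described above, here is a minimal C sketch.
It is not the code from the Emacs change: collate_checked and its "ok"
out-parameter are hypothetical names chosen for this example. It only
shows the general pattern of clearing errno before calling wcscoll and
inspecting it afterwards, since wcscoll has no reserved error return
value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not the Emacs str_collate: compare two wide
   strings with wcscoll and report failure through *OK instead of
   trusting the return value.  */
static int
collate_checked (const wchar_t *a, const wchar_t *b, bool *ok)
{
  assert (a != NULL && b != NULL && ok != NULL);

  /* wcscoll has no dedicated error return value, so clear errno
     first and inspect it after the call.  */
  errno = 0;
  int res = wcscoll (a, b);

  *ok = (errno == 0);
  /* On failure the result is meaningless; return 0 and let the
     caller look at *OK.  */
  return *ok ? res : 0;
}
```

A caller would signal an error whenever *ok comes back false, rather than
using the comparison result.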

> 4. How to control the optional features of the collating sequence?  I
>    mean, for example, the fact that punctuation characters are ignored
>    in the .UTF-8 locales on glibc hosts (or so it seems).  At least on
>    Windows, a somewhat higher degree of control is available, but it
>    must be specified separately of the locale ID.  E.g., the
>    comparison function accepts flags to ignore punctuation and
>    symbols, width differences, diacritics, etc. Should we have another
>    variable, perhaps w32-specific, to request these features?
>    Alternatively, we could use .UTF-8 on Windows to communicate that,
>    although that sounds like a kludge.

On Posix systems, I'm not aware of a way to configure such optional
features via glibc. The most granular selection is what you do with
LC_COLLATE.

If we want to offer more granular settings, we would need to use a library
like libicu (http://icu-project.org/). That could be done, but it should be
optional.
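
As a rough illustration of selecting the collation rules through
LC_COLLATE alone, the following C sketch switches the category
temporarily, compares, and restores the previous setting. The helper name
collate_in_locale is an assumption made for this example and is not part
of Emacs. Note that setlocale changes process-wide state, so this pattern
is not thread-safe.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: compare A and B under the collation rules of
   LOCALE_NAME, then restore the previous LC_COLLATE setting.
   Returns false if the locale cannot be selected on this host.  */
static bool
collate_in_locale (const char *locale_name, const wchar_t *a,
                   const wchar_t *b, int *result)
{
  assert (locale_name != NULL && a != NULL && b != NULL && result != NULL);

  /* Query the current setting.  The returned string may be
     overwritten by later setlocale calls, so keep a private copy.  */
  const char *current = setlocale (LC_COLLATE, NULL);
  if (current == NULL)
    return false;
  char *saved = malloc (strlen (current) + 1);
  if (saved == NULL)
    return false;
  strcpy (saved, current);

  bool ok = setlocale (LC_COLLATE, locale_name) != NULL;
  if (ok)
    {
      *result = wcscoll (a, b);
      /* Put the previous collation locale back.  */
      setlocale (LC_COLLATE, saved);
    }

  free (saved);
  return ok;
}
```

Anything finer than the choice of locale, such as ignoring punctuation or
diacritics, is outside what this interface can express.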

> 5. The locale names on Windows are different from Posix: Windows uses
>    3-letter abbreviations of the country and the language,
>    e.g. "fra_FRA" instead of the Posix "fr_FR".  Do we want the locale
>    string values used for let-binding the above-mentioned variable to
>    be portable across systems?  Then we'd need some conversion
>    database on MS-Windows.

Here I'm a bit undecided. We could leave it to the users to find the
proper locale name, but this is inconvenient. OTOH it would be a lot of
work to install a mapping system, and we would need to maintain it. What
if a new "en_SC" (Scotland) locale were added? We would need to track
such changes in Emacs forever ...
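
To make the maintenance concern concrete, here is a minimal C sketch of
what such a mapping system might look like. Everything in it is
hypothetical: the table, the struct, and the function name
posix_to_w32_locale are invented for this example. The only entry is the
"fr_FR" / "fra_FRA" pair mentioned above; a real table would need one
entry per supported locale, and codeset suffixes such as ".UTF-8" are not
handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping entry from a Posix locale name to the
   corresponding MS-Windows name.  */
struct locale_alias
{
  const char *posix;
  const char *w32;
};

/* Illustrative table holding only the pair quoted in this thread.  */
static const struct locale_alias locale_aliases[] = {
  { "fr_FR", "fra_FRA" },
};

/* Return the MS-Windows name for POSIX_NAME, or POSIX_NAME itself
   when the table has no entry for it.  */
static const char *
posix_to_w32_locale (const char *posix_name)
{
  assert (posix_name != NULL);

  size_t n = sizeof locale_aliases / sizeof locale_aliases[0];
  for (size_t i = 0; i < n; i++)
    if (strcmp (locale_aliases[i].posix, posix_name) == 0)
      return locale_aliases[i].w32;

  /* Unknown names are passed through unchanged.  */
  return posix_name;
}
```

With this design, a locale that is missing from the table (like the
imagined "en_SC") falls through unchanged, which is exactly the case that
would require a table update.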

> 6. I think we will want case-insensitive version of this function.

That's also on my todo list. But I'm a little undecided whether we
should add it to the string-collate-* functions, or whether there should
be further functions.

Maybe we could use sort-fold-case as an indication for this? Or is that
too specific?
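
For discussion, one possible way to get a case-insensitive comparison at
the C level is to fold both strings with towlower before handing them to
wcscoll. This is only a sketch under that assumption, not a proposal for
the actual implementation: the names fold_copy and collate_ignore_case
are hypothetical, and simple per-character folding is known to be
imprecise for some scripts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Return a freshly allocated lower-cased copy of S, or NULL if
   memory is exhausted.  The caller frees the result.  */
static wchar_t *
fold_copy (const wchar_t *s)
{
  size_t len = wcslen (s);
  wchar_t *copy = malloc ((len + 1) * sizeof *copy);
  if (copy != NULL)
    /* Copy through the terminating null, which towlower leaves
       unchanged.  */
    for (size_t i = 0; i <= len; i++)
      copy[i] = (wchar_t) towlower ((wint_t) s[i]);
  return copy;
}

/* Hypothetical helper: collate A and B after case folding.  Stores
   the comparison in *RESULT and returns false on allocation
   failure.  */
static bool
collate_ignore_case (const wchar_t *a, const wchar_t *b, int *result)
{
  assert (a != NULL && b != NULL && result != NULL);

  wchar_t *fa = fold_copy (a);
  wchar_t *fb = fold_copy (b);
  bool ok = fa != NULL && fb != NULL;
  if (ok)
    *result = wcscoll (fa, fb);

  free (fa);
  free (fb);
  return ok;
}
```

Whether such folding is triggered by an extra argument, a separate
function, or a variable like sort-fold-case is the open question above.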

Best regards, Michael.




