Re: strcoll on Solaris
From: Bruno Haible
Subject: Re: strcoll on Solaris
Date: Sat, 5 Jun 2010 05:15:55 +0200
User-agent: KMail/1.9.9
Eric Blake wrote:
> And in the meantime, now that we've confirmed that it is a Solaris
> strcoll() bug, it would be nice to code a gnulib workaround.
As expected, this code reproduces it:
=======================================================================
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main ()
{
  /* U+120FE U+12156 */
  static char input1[] = "\360\222\203\276\360\222\205\226";
  /* U+1202D U+122EB U+122EB U+1202D */
  static char input2[] =
    "\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255";
  int ret;

  setlocale (LC_ALL, "");

  errno = 0;
  ret = strcoll (input1, input2);
  printf ("ret = %d, errno = %d%s\n", ret, errno,
          errno == EILSEQ ? " = EILSEQ" : "");

  /* Reset errno so the second report reflects only the second call.  */
  errno = 0;
  ret = strcoll (input2, input1);
  printf ("ret = %d, errno = %d%s\n", ret, errno,
          errno == EILSEQ ? " = EILSEQ" : "");

  return 0;
}
=======================================================================
When run in an en_US.UTF-8 locale, this yields
$ LC_ALL=en_US.UTF-8 ./a.out
ret = 3, errno = 88 = EILSEQ
ret = -3, errno = 88 = EILSEQ
The next program prints the ranges of Unicode characters for which strcoll
returns an error.
=========================================================================
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main ()
{
  static char input2[] = "a";
  int uc;
  int interval_start;

  setlocale (LC_ALL, "");

  interval_start = -1;
  for (uc = 0x10000; uc < 0x110000; uc++)
    {
      /* Encode uc as a 4-byte UTF-8 sequence.  */
      char input1[5];
      input1[0] = 0xf0 | (uc >> 18);
      input1[1] = 0x80 | ((uc >> 12) & 0x3f);
      input1[2] = 0x80 | ((uc >> 6) & 0x3f);
      input1[3] = 0x80 | (uc & 0x3f);
      input1[4] = 0;

      errno = 0;
      strcoll (input1, input2);
      if (errno)
        {
          /* Failed */
          if (interval_start < 0)
            interval_start = uc;
        }
      else
        {
          /* Succeeded */
          if (interval_start >= 0)
            {
              printf ("U+%04X..U+%04X\n", interval_start, uc - 1);
              interval_start = -1;
            }
        }
    }
  /* Close an interval that is still open at the end of the loop.  */
  if (interval_start >= 0)
    {
      printf ("U+%04X..U+%04X\n", interval_start, uc - 1);
      interval_start = -1;
    }

  return 0;
}
=========================================================================
It prints:
U+10200..U+102FF
U+10500..U+107FF
U+10900..U+1CFFF
U+1D200..U+1D2FF
U+1D800..U+1FFFF
U+2A700..U+2F7FF
U+2FB00..U+DFFFF
U+E0200..U+EFFFF
In other words, the Solaris strcoll recognizes non-BMP characters only
from the following ranges:
range              | introduced in
-------------------+--------------
U+10000..U+101FF   | Unicode 4.0
U+10300..U+104FF   | Unicode 3.1
U+10800..U+108FF   | Unicode 4.0
U+1D000..U+1D1FF   | Unicode 3.1
U+1D300..U+1D3FF   | Unicode 4.0
U+1D400..U+1D7FF   | Unicode 3.1
U+20000..U+2A6FF   | Unicode 3.1
U+2F800..U+2FAFF   | Unicode 3.1
U+E0000..U+E01FF   | Unicode 3.1
U+F0000..U+10FFFF  | Unicode 3.1
----------------------------------
So, you can see that
  - it's using Unicode 4.0, not a newer version;
  - it rejects unassigned Unicode characters, which goes against the
    recommendation of the Unicode Consortium.
    See http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf sections 3.1
    and 3.2, and http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf
    section 5.3.
IMO, this is a "quality of implementation" issue. But POSIX specifies that
strcoll can fail anyway [1]. So I think all callers of strcoll() should be
able to deal with an EINVAL error. Adding handling for other errors, such
as EILSEQ, is not hard then.
I therefore don't think gnulib should replace strcoll() because of this.
But it may be worth introducing an xstrcoll() function that handles strings
with invalid bytes or characters by sorting them at the end of the continuum
and comparing them with strcmp. (Similar to what lib/unistr/u-strcoll.h does,
except that I just noticed that this implementation is not entirely right.)
[1] http://www.opengroup.org/onlinepubs/9699919799/functions/strcoll.html
2010-06-05 Bruno Haible <address@hidden>
* doc/posix-functions/strcoll.texi: Mention Solaris limitation.
Reported by River Tarnell <address@hidden> via Eric Blake.
--- doc/posix-functions/strcoll.texi.orig Sat Jun 5 05:13:08 2010
+++ doc/posix-functions/strcoll.texi Sat Jun 5 05:13:03 2010
@@ -12,4 +12,8 @@
Portability problems not fixed by Gnulib:
@itemize
+@item
+This function fails, setting @code{errno} to @code{EILSEQ}, on Solaris 10, in
+UTF-8 locales, when at least one of the strings contains a Unicode character
+in a block that was not assigned in Unicode 4.0.
@end itemize