Re: strcoll on Solaris

bug-gnulib

From:	Bruno Haible
Subject:	Re: strcoll on Solaris
Date:	Sat, 5 Jun 2010 05:15:55 +0200
User-agent:	KMail/1.9.9

Eric Blake wrote:
> And in the meantime, now that we've confirmed that it is a Solaris
> strcoll() bug, it would be nice to code a gnulib workaround.

As expected, this code reproduces it:

=======================================================================
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main ()
{
  static char input1[] =
    "\360\222\203\276\360\222\205\226"; /* U+120FE U+12156 */
  static char input2[] =
    "\360\222\200\255\360\222\213\253\360\222\213\253\360\222\200\255";
    /* U+1202D U+122EB U+122EB U+1202D */
  int ret;

  setlocale (LC_ALL, "");

  errno = 0;
  ret = strcoll (input1, input2);
  printf ("ret = %d, errno = %d%s\n", ret, errno,
          errno == EILSEQ ? " = EILSEQ" : "");
  /* Reset errno so that the second report is not a leftover from the
     first call.  */
  errno = 0;
  ret = strcoll (input2, input1);
  printf ("ret = %d, errno = %d%s\n", ret, errno,
          errno == EILSEQ ? " = EILSEQ" : "");
  return 0;
}
=======================================================================

When run in an en_US.UTF-8 locale, this yields
  $ LC_ALL=en_US.UTF-8 ./a.out 
  ret = 3, errno = 88 = EILSEQ
  ret = -3, errno = 88 = EILSEQ

The next program prints the ranges of Unicode characters for which strcoll
returns an error.

=========================================================================
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main ()
{
  static char input2[] = "a";
  int uc;
  int interval_start;

  setlocale (LC_ALL, "");

  interval_start = -1;
  for (uc = 0x10000; uc < 0x110000; uc++)
    {
      char input1[5];

      input1[0] = 0xf0 | (uc >> 18);
      input1[1] = 0x80 | ((uc >> 12)  & 0x3f);
      input1[2] = 0x80 | ((uc >> 6)  & 0x3f);
      input1[3] = 0x80 | (uc & 0x3f);
      input1[4] = 0;

      errno = 0;
      strcoll (input1, input2);
      if (errno)
        {
          /* Failed */
          if (interval_start < 0)
            interval_start = uc;
        }
      else
        {
          /* Succeeded */
          if (interval_start >= 0)
            {
              printf("U+%04X..U+%04X\n", interval_start, uc-1);
              interval_start = -1;
            }
        }
    }
  if (interval_start >= 0)
    {
      printf("U+%04X..U+%04X\n", interval_start, uc-1);
      interval_start = -1;
    }
  return 0;
}
=========================================================================

It prints:

U+10200..U+102FF
U+10500..U+107FF
U+10900..U+1CFFF
U+1D200..U+1D2FF
U+1D800..U+1FFFF
U+2A700..U+2F7FF
U+2FB00..U+DFFFF
U+E0200..U+EFFFF

In other words, the Solaris strcoll recognizes non-BMP characters only
from the following ranges:

     range        | introduced in
------------------+--------------
U+10000..U+101FF  |  Unicode 4.0
U+10300..U+104FF  |  Unicode 3.1
U+10800..U+108FF  |  Unicode 4.0
U+1D000..U+1D1FF  |  Unicode 3.1
U+1D300..U+1D3FF  |  Unicode 4.0
U+1D400..U+1D7FF  |  Unicode 3.1
U+20000..U+2A6FF  |  Unicode 3.1
U+2F800..U+2FAFF  |  Unicode 3.1
U+E0000..U+E01FF  |  Unicode 3.1
U+F0000..U+10FFFF |  Unicode 3.1
---------------------------------

So, you can see that
  - it's using Unicode 4.0, not a newer version.
  - it rejects unassigned Unicode characters, which is against the
    recommendation of the Unicode consortium.
    See http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf sections 3.1 and 3.2,
    and http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf section 5.3.
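
For reference, the accepted ranges from the table above can be written down
as a small lookup. This is only a sketch: the function name
observed_strcoll_accepts is made up for illustration, the data is nothing more
than the complement of the failure ranges printed on this one Solaris
installation, and BMP characters were not tested, so the function simply
returns 0 for them.

```c
#include <stddef.h>

/* Non-BMP ranges for which strcoll succeeded in the test above, i.e. the
   complement of the printed failure ranges.  Observed data, not a
   documented property of Solaris.  */
static const struct { unsigned int first; unsigned int last; }
accepted_ranges[] =
  {
    { 0x10000, 0x101FF },
    { 0x10300, 0x104FF },
    { 0x10800, 0x108FF },
    { 0x1D000, 0x1D1FF },
    { 0x1D300, 0x1D3FF },
    { 0x1D400, 0x1D7FF },
    { 0x20000, 0x2A6FF },
    { 0x2F800, 0x2FAFF },
    { 0xE0000, 0xE01FF },
    { 0xF0000, 0x10FFFF }
  };

/* Hypothetical helper: return 1 if UC lies in one of the observed ranges,
   0 otherwise (including for all BMP characters, which were not tested).  */
int
observed_strcoll_accepts (unsigned int uc)
{
  size_t i;

  for (i = 0; i < sizeof accepted_ranges / sizeof accepted_ranges[0]; i++)
    if (uc >= accepted_ranges[i].first && uc <= accepted_ranges[i].last)
      return 1;
  return 0;
}
```

With this table, both U+120FE and U+12156 from the first test program fall
outside every accepted range, which matches the EILSEQ result shown earlier.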

IMO, this is a "quality of implementation" issue. But POSIX specifies that
strcoll can fail anyway [1]. So I think all callers of strcoll() should be
able to deal with an EINVAL error. Adding handling for other errors, such
as EILSEQ, is not hard then.
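
A minimal sketch of what such a caller could look like follows. The helper
name checked_strcoll is hypothetical, not an existing gnulib or libc function.
Since strcoll has no return value reserved for errors, the only way to detect
a failure is to clear errno before the call and inspect it afterwards.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: compare A and B with strcoll and store the
   comparison result in *RESULT.  Return 0 if strcoll reported no error,
   otherwise the errno value it set (e.g. EINVAL or EILSEQ).  The caller's
   errno is left unchanged.  */
int
checked_strcoll (const char *a, const char *b, int *result)
{
  int saved_errno = errno;
  int err;

  errno = 0;
  *result = strcoll (a, b);
  err = errno;
  errno = saved_errno;
  return err;
}
```

A caller can then decide what to do with a nonzero return value, for example
print a diagnostic with strerror, or fall back to a bytewise comparison.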

I therefore don't think gnulib should replace strcoll() because of this.
But it may be worth introducing an xstrcoll() function that handles strings
with invalid bytes or characters by sorting them at the end of the continuum
and comparing them with strcmp. (Similar to what lib/unistr/u-strcoll.h does,
except that I just noticed that this implementation is not entirely right.)
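
To make the idea concrete, here is a rough sketch. It is not the code in
lib/unistr/u-strcoll.h and not a proposed gnulib API: the names is_collatable
and fallback_strcoll are made up, and the sketch assumes that comparing a
string with itself triggers the same errno as comparing it with any other
string, which I have not verified on Solaris.

```c
#include <errno.h>
#include <string.h>

/* Return 1 if strcoll reports no error for S in the current locale.
   Assumption: strcoll (s, s) fails in the same cases as strcoll (s, t).  */
static int
is_collatable (const char *s)
{
  errno = 0;
  (void) strcoll (s, s);
  return errno == 0;
}

/* Hypothetical wrapper: strings that strcoll cannot handle sort after all
   strings it can handle, and are compared among themselves with strcmp.
   The caller's errno is left unchanged.  */
int
fallback_strcoll (const char *a, const char *b)
{
  int saved_errno = errno;
  int a_ok = is_collatable (a);
  int b_ok = is_collatable (b);
  int result;

  if (a_ok && b_ok)
    {
      errno = 0;
      result = strcoll (a, b);
      if (errno != 0)
        /* Unexpected failure for this particular pair: use byte order.  */
        result = strcmp (a, b);
    }
  else if (a_ok)
    result = -1;   /* only B is problematic, so A sorts first */
  else if (b_ok)
    result = 1;    /* only A is problematic, so B sorts first */
  else
    result = strcmp (a, b);

  errno = saved_errno;
  return result;
}
```

The extra strcoll calls in is_collatable cost time on every comparison; a
real implementation would probably want to avoid them. The strcmp fallback
for a pair that fails although each string passes on its own could also
break the consistency of the ordering, so take this as an illustration only.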

[1] http://www.opengroup.org/onlinepubs/9699919799/functions/strcoll.html


2010-06-05  Bruno Haible  <address@hidden>

        * doc/posix-functions/strcoll.texi: Mention Solaris limitation.
        Reported by River Tarnell <address@hidden> via Eric Blake.

--- doc/posix-functions/strcoll.texi.orig       Sat Jun  5 05:13:08 2010
+++ doc/posix-functions/strcoll.texi    Sat Jun  5 05:13:03 2010
@@ -12,4 +12,8 @@
 
 Portability problems not fixed by Gnulib:
 @itemize
+@item
+This function fails, setting @code{errno} to @code{EILSEQ}, on Solaris 10, in
+UTF-8 locales, when at least one of the strings contains a Unicode character
+in a block that was not assigned in Unicode 4.0.
 @end itemize
