bug-gnulib

16-bit wchar_t on Windows and Cygwin


From: Bruno Haible
Subject: 16-bit wchar_t on Windows and Cygwin
Date: Mon, 31 Jan 2011 03:04:42 +0100
User-agent: KMail/1.9.9

Hi,

It has been known for a long time that on native Windows, the encoding of
wchar_t[] strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it is
the same for Cygwin >= 1.7. [2]

Other platforms have either a 32-bit wchar_t (such as glibc, Solaris, *BSD,
and many others), or have a 16-bit wchar_t that, in UTF-8 locales, uses the
UCS-2 encoding (namely AIX).[3]
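
To make the difference between UTF-16 and UCS-2 concrete, here is a minimal
sketch (not one of the attached programs) of how a code point maps to 16-bit
units. The helper name encode_utf16 is hypothetical. A BMP character takes one
unit; a character outside the BMP needs a surrogate pair, which UCS-2 cannot
express at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if CP is
   not a valid scalar value (a surrogate, or beyond U+10FFFF).  */
static size_t
encode_utf16 (uint32_t cp, uint16_t out[2])
{
  if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
    return 0;
  if (cp < 0x10000)
    {
      /* Inside the BMP: one unit, identical to UCS-2.  */
      out[0] = (uint16_t) cp;
      return 1;
    }
  /* Outside the BMP: split the 20 bits of (cp - 0x10000) into a high
     and a low surrogate.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
  return 2;
}
```

For U+21234 (the non-BMP character used in the 'wc' examples in this mail),
this yields the two units 0xD844 0xDE34, so a single wchar_t can never hold
the whole character on these platforms.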

What consequences does this have?

  1) All code that uses the functions from <wctype.h> (wide character
     classification and mapping) or wcwidth() malfunctions on strings that
     contain Unicode characters outside the BMP, i.e. outside the range
     U+0000..U+FFFF.

  2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
     On Cygwin >= 1.7 mbrtowc() and wcrtomb() are implemented in an intelligent
     but somewhat surprising way: wcrtomb() may return 0, that is, produce no
     output bytes when it consumes a wchar_t.
     On native Windows, I could not test it (I could not enable any UTF-8 or
     GB18030 locale on Windows XP), but due to the behaviour of the functions
     MultiByteToWideChar and WideCharToMultiByte [4] it looks like the
     implementations of mbrtowc() and wcrtomb() will not be able to cope
     with characters outside the BMP.
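
The "returns 0" behaviour in 2) can be illustrated with a stateful converter.
This is only a sketch of the general technique, not Cygwin's actual code; the
names u16_state and unit_to_utf8 are hypothetical. A high surrogate is stored
in the conversion state and produces no bytes; the following low surrogate
then produces all four UTF-8 bytes at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a pending high surrogate, or 0.  */
struct u16_state { uint16_t pending_high; };

/* Convert one UTF-16 unit to UTF-8.  Returns the number of bytes written
   to OUT (0 to 4), or (size_t) -1 for an invalid sequence.  */
static size_t
unit_to_utf8 (struct u16_state *st, uint16_t unit, unsigned char out[4])
{
  uint32_t cp;

  if (st->pending_high != 0)
    {
      uint16_t high = st->pending_high;
      st->pending_high = 0;
      if (unit < 0xDC00 || unit > 0xDFFF)
        return (size_t) -1;     /* high surrogate without a low one */
      cp = 0x10000 + ((uint32_t) (high - 0xD800) << 10) + (unit - 0xDC00);
    }
  else if (unit >= 0xD800 && unit <= 0xDBFF)
    {
      st->pending_high = unit;  /* remember it, produce no output yet */
      return 0;
    }
  else if (unit >= 0xDC00 && unit <= 0xDFFF)
    return (size_t) -1;         /* low surrogate without a high one */
  else
    cp = unit;

  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}
```

A caller that assumes "one wchar_t in, at least one byte out" breaks on the
first call for a surrogate pair, which is why such an interface is surprising
to existing code.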

Examples of such code are:

- In gnulib, the files

  file              uses

  exclude.c         towlower
  fnmatch.c         towlower
  mbchar.h          isw*
  mbmemcasecoll.c   towlower
  mbscasestr.c      towlower
  mbswidth.c        iswcntrl, wcwidth
  quotearg.c        iswprint
  regcomp.c         towlower
  regex_internal.h  iswalnum, iswlower
  regex_internal.c  towupper
  strftime.c        towlower, towupper
  strtol.c          iswalpha, iswspace, towupper

- In coreutils, the program 'wc':

  The correct behaviour is:

  $ echo 'a b' | wc -w -m
        2       4

  Now with a U+2002 space:
  $ printf 'a\xe2\x80\x82b\n' | wc -w -m
        2       4

  Now with a Chinese character from the BMP:
  $ printf 'a\xe3\x91\x96b\n' | wc -w -m
        1       4
  $ printf 'a \xe3\x91\x96 b\n' | wc -w -m
        3       6

  Now with a Chinese character outside the BMP:
  $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
        1       4
  $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
        3       6

  On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):

  $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
        1       5
  $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
        2       7

  So both the number of characters and the number of words are counted
  wrong as soon as non-BMP characters occur.
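
One plausible explanation for these numbers can be sketched in a few lines.
This is a simplified model of a wc-like counting loop, not the coreutils
source; is_space and is_print are hypothetical stand-ins for iswspace() and
iswprint(), and the model assumes that a lone surrogate is classified as
neither a space nor printable.

```c
#include <stddef.h>
#include <stdint.h>

struct counts { size_t chars; size_t words; };

/* Simplified stand-ins for iswspace() and iswprint() (assumptions, not
   the real tables).  A lone surrogate is neither space nor printable.  */
static int
is_space (uint32_t c)
{
  return c == ' ' || c == '\t' || c == '\n' || c == 0x2002;
}

static int
is_print (uint32_t c)
{
  return c > ' ' && c != 0x7F && c != 0x2002
         && !(c >= 0xD800 && c <= 0xDFFF);
}

/* Feed one "character" to the counter.  */
static void
feed (struct counts *r, int *in_word, uint32_t c)
{
  r->chars++;
  if (is_space (c))
    {
      r->words += *in_word;
      *in_word = 0;
    }
  else if (is_print (c))
    *in_word = 1;
}

/* Treat every 16-bit unit as a character (the 16-bit wchar_t view).  */
static struct counts
count_units (const uint16_t *s, size_t n)
{
  struct counts r = { 0, 0 };
  int in_word = 0;
  for (size_t i = 0; i < n; i++)
    feed (&r, &in_word, s[i]);
  r.words += in_word;
  return r;
}

/* Combine surrogate pairs first, then count code points.  */
static struct counts
count_code_points (const uint16_t *s, size_t n)
{
  struct counts r = { 0, 0 };
  int in_word = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint32_t c = s[i];
      if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n
          && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
        {
          c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
          i++;
        }
      feed (&r, &in_word, c);
    }
  r.words += in_word;
  return r;
}
```

With the units 'a', ' ', 0xD844, 0xDE34, ' ', 'b', '\n' as input, count_units
gives 7 characters and 2 words, matching the Cygwin output above, while
count_code_points gives the expected 6 characters and 3 words.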


What can we do about it?

Adding lots of conditional code to the above listed gnulib, coreutils, gettext
etc. source files? That would be an endless amount of work.

I'm more in favour of overriding wchar_t and all functions that depend on it -
like we did successfully for the socket functions.

In practice, this would mean that on Windows (both native Windows and
Cygwin >= 1.7) the use of a 'wchar_t' module will
  - override wchar_t to be 32 bits, like in glibc,
  - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
    corresponding system functions are unusable, the replacements will use the
    modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).
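
With a 32-bit replacement type, one element always holds one whole character,
so classification and width functions can work per element again. Strings
coming from the system would still be UTF-16 and would need converting at the
boundary. The following is only a rough sketch of such a conversion; the names
wide32_t and utf16_to_wide32 are hypothetical and not part of gnulib, and
replacing unpaired surrogates with U+FFFD is an assumption about error
handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for a 32-bit replacement of wchar_t.  */
typedef uint32_t wide32_t;

/* Convert N UTF-16 units from SRC into code points in DST, which must
   have room for N elements.  Returns the number of elements written.
   Unpaired surrogates become U+FFFD (an assumption of this sketch).  */
static size_t
utf16_to_wide32 (const uint16_t *src, size_t n, wide32_t *dst)
{
  size_t j = 0;
  for (size_t i = 0; i < n; i++)
    {
      wide32_t c = src[i];
      if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n
          && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
        {
          /* Valid surrogate pair: combine into one code point.  */
          c = 0x10000 + ((c - 0xD800) << 10) + (src[i + 1] - 0xDC00);
          i++;
        }
      else if (c >= 0xD800 && c <= 0xDFFF)
        c = 0xFFFD;
      dst[j++] = c;
    }
  return j;
}
```

After this conversion, each element could be passed to a per-character
function, for example one built on the libunistring modules mentioned above.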

It also means that we will have separate modules for 'iswalnum', ...,
'towupper', which are currently all in the module 'wctype'.

How does that sound? Other thoughts?

Bruno


[1] http://msdn.microsoft.com/en-us/library/dd319072%28v=vs.85%29.aspx
[2] http://cygwin.com/ml/cygwin/2011-01/msg00410.html
[3] Found by running the attached program multibyte-utf16-unix.c
[4] See the attached program multibyte-utf16-win32.c

Attachment: multibyte-utf16-unix.c
Description: Text Data

Attachment: multibyte-utf16-win32.c
Description: Text Data

