Subject: Windows and non-BMP characters
From: Bruno Haible
Date: Sun, 13 Feb 2011 22:26:36 +0100
User-agent: KMail/1.9.9
Two weeks ago, when discussing the support of non-BMP characters on Windows,
I was under the impression that it would be useful to use the wchar_t layer
on both Cygwin >= 1.7 _and_ native Windows.
Now I've come to the conclusion that it's pointless on native Windows. The
reason is that
1) native Windows provides no locales (in the ISO C90 sense) that support
non-BMP characters,
2) the use of the 'char *' data type for strings is based on such locales,
3) the programs for which gnulib is meant to be used are based on 'char *'
strings and ISO C99 APIs.
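To make point 2) concrete, here is a minimal sketch (mine, not part of the
original argument) of how a 'char *' based program sees its text: the ISO C
multibyte functions interpret the bytes according to the locale installed
with setlocale(), so a character outside every available locale encoding is
simply not representable to such a program.

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  int
  main (int argc, char *argv[])
  {
    /* The program's 'char *' strings are interpreted according to the
       encoding of the locale chosen here.  On native Windows, msvcrt
       provides no locale whose encoding covers non-BMP characters.  */
    setlocale (LC_ALL, "");

    if (argc < 2)
      {
        fprintf (stderr, "usage: %s STRING\n", argv[0]);
        return EXIT_FAILURE;
      }

    /* Decode the first multibyte character of the argument.  */
    mbstate_t state;
    memset (&state, 0, sizeof state);
    wchar_t wc;
    size_t n = mbrtowc (&wc, argv[1], strlen (argv[1]), &state);
    if (n == (size_t) -1 || n == (size_t) -2)
      {
        fprintf (stderr, "not a valid character in the current locale\n");
        return EXIT_FAILURE;
      }
    printf ("first character: U+%04lX, %lu byte(s)\n",
            (unsigned long) wc, (unsigned long) n);
    return EXIT_SUCCESS;
  }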
In detail:
The documentation of the setlocale function in msvcrt [1] mentions
"The set of available languages, country/region codes, and code
pages includes all those supported by the Win32 NLS API except
code pages that require more than two bytes per character, such
as UTF-7 and UTF-8. If you provide a code page like UTF-7 or
UTF-8, setlocale will fail, returning NULL."
This coincides with my experiments on Windows XP:
- For code pages with MB_CUR_MAX <= 2, Windows msvcrt
supports the corresponding locales, e.g.
Japanese_Japan.932
Chinese_Taiwan.950
Chinese_China.936
See [2] for a more complete list.
- The only widely used encodings with MB_CUR_MAX > 2 are UTF-8 and GB18030.
Attempts to use setlocale with a code page of 54936 (= GB18030)
or 65001 (= UTF-8) fail, although the functions
MultiByteToWideChar and WideCharToMultiByte do support code
page 65001 [3][4] (see the sketch below).
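As a concrete illustration of these two observations (a sketch of mine for
native Windows, not one of the original tests; the exact locale name
strings passed to setlocale are assumptions): setlocale() rejects the
UTF-8 and GB18030 code pages, while MultiByteToWideChar() converts UTF-8
fine, even outside the BMP.

  #include <locale.h>
  #include <stdio.h>
  #include <windows.h>

  int
  main (void)
  {
    /* msvcrt rejects code pages that need more than 2 bytes per
       character, so these calls are expected to return NULL.  */
    if (setlocale (LC_ALL, "English_United States.65001") == NULL)
      printf ("setlocale: code page 65001 (UTF-8) rejected\n");
    if (setlocale (LC_ALL, "Chinese_China.54936") == NULL)
      printf ("setlocale: code page 54936 (GB18030) rejected\n");

    /* The Win32 conversion API does handle UTF-8, including non-BMP
       characters: U+1D11E MUSICAL SYMBOL G CLEF, encoded F0 9D 84 9E.  */
    const char utf8[] = "\xF0\x9D\x84\x9E";
    wchar_t wide[4];
    int n = MultiByteToWideChar (CP_UTF8, 0, utf8, -1, wide, 4);
    printf ("MultiByteToWideChar: %d UTF-16 units (surrogate pair + NUL)\n",
            n);
    return 0;
  }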
In contrast, Windows also supports locales at the Win32 level
[5][6]. But this page [7] says:
"New Windows applications should use Unicode to avoid the
inconsistencies of varied code pages ..."
and
"Your application can convert between Windows code pages and OEM code
pages using the standard C runtime library functions. However, use
of these functions presents a risk of data loss because the characters
that can be represented by each code page do not match exactly."
Similarly, [8] says:
"the system ACP might not cover all code points in the user's
selected logon language identifier. For compatibility with this
edition, your application should avoid calls that depend on GetACP
either implicitly or explicitly, as this function can cause some
locales to display text as question marks. Instead, the application
should use the Unicode API functions directly ..."
See also [9].
In summary, Microsoft added support for UTF-8 and GB18030 to the
Win32 API, but there are no (and will likely be no) locales at the
setlocale() level that support UTF-8 or GB18030. They are basically
saying "stop using ANSI or OEM code pages because they are too
limited", with the implicit consequence "use API where strings are
'wchar_t *', and stop using API where strings are 'char *'".
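To spell out that implicit consequence with a minimal sketch of my own (not
taken from the MSDN pages): going the 'wchar_t *' route means calling the
wide-character "W" variants of the Win32 API with UTF-16 strings, in which
a non-BMP character is passed as a surrogate pair and no ANSI/OEM code page
is involved.

  #include <stdio.h>
  #include <windows.h>

  int
  main (void)
  {
    /* A file name containing U+1F600, written as its UTF-16 surrogate
       pair D83D DE00.  */
    static const wchar_t name[] =
      { 0xD83D, 0xDE00, L'.', L't', L'x', L't', L'\0' };

    /* The "W" API takes UTF-16 directly; the ANSI code page returned
       by GetACP() plays no role here.  */
    HANDLE h = CreateFileW (name, GENERIC_WRITE, 0, NULL,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
      {
        fprintf (stderr, "CreateFileW failed, error %lu\n",
                 (unsigned long) GetLastError ());
        return 1;
      }
    CloseHandle (h);
    return 0;
  }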
It's obvious that Unix programs will continue to use 'char *'.
So there's no point for gnulib to try to support non-BMP characters
on native Windows.
Bruno
[1] http://msdn.microsoft.com/en-us/library/x99tb11d.aspx
[2] http://docs.moodle.org/en/Table_of_locales
[3] http://msdn.microsoft.com/en-us/library/dd319072%28v=VS.85%29.aspx
[4] http://msdn.microsoft.com/en-us/library/dd374130%28v=VS.85%29.aspx
[5] http://msdn.microsoft.com/en-us/library/dd318661%28v=VS.85%29.aspx
[6] http://msdn.microsoft.com/en-us/library/dd318716%28v=VS.85%29.aspx
[7] http://msdn.microsoft.com/en-us/library/dd317752%28v=VS.85%29.aspx
[8] http://msdn.microsoft.com/en-us/library/dd318070%28v=VS.85%29.aspx
[9] http://blogs.msdn.com/b/michkap/archive/2007/07/11/3823291.aspx
--
In memoriam Alexander Samoylovich
<http://en.wikipedia.org/wiki/Alexander_Samoylovich>