setlocale() issue on Windows with UTF-8
From: Lasse Collin
Subject: setlocale() issue on Windows with UTF-8
Date: Sat, 21 Dec 2024 10:31:02 +0200
Hello!
This is about binaries built for native Windows, not Cygwin.
Since Windows 10 version 1903, it is possible to use an application
manifest to change the active code page to UTF-8 for individual
executables. This way the char-based file system APIs use UTF-8 and
argv[] in main() is in UTF-8.
Below is a comparison of native setlocale(LC_ALL, "") and
libintl_setlocale(LC_ALL, ""). This is without any environment
variables LC_*, LANG, or LANGUAGE. Windows is in the Finnish locale.
         setlocale()             libintl_setlocale()
Legacy:  Finnish_Finland.1252    fi_FI
UTF-8:   Finnish_Finland.utf8    fi_FI (should be fi_FI.UTF-8)
The main problem is LC_CTYPE. When it's fi_FI instead of something
ending in .utf8 or .UTF-8, translated messages are garbled, and
mbrtowc() and related functions don't use UTF-8.
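
To make the mbrtowc() point concrete, here is a minimal sketch of a
helper that counts characters the way the current LC_CTYPE decodes
them. The name count_mb_chars() is made up for this illustration; it is
not part of the attached test programs.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in s as mbrtowc() decodes
   them under the current LC_CTYPE. Returns -1 if s contains a sequence
   that is invalid or incomplete in that encoding. */
static long
count_mb_chars(const char *s)
{
    mbstate_t state;
    memset(&state, 0, sizeof(state));

    size_t left = strlen(s);
    long count = 0;

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1; /* invalid or truncated multibyte sequence */

        if (n == 0)
            break; /* null character; not expected because left excludes it */

        s += n;
        left -= n;
        ++count;
    }

    return count;
}
```

With the two bytes "\xC3\xA4" (U+00E4 in UTF-8), a UTF-8 LC_CTYPE
should decode one character, while a single-byte code page such as 1252
treats them as two. That mismatch is what shows up as garbled output.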
LC_COLLATE needs the .UTF-8 suffix too; strcoll() doesn't use the
character set from LC_CTYPE. I suppose it's best to use the suffix in
all locale categories.
According to Microsoft's setlocale() docs, the .UTF-8 suffix is case
insensitive. The dash is optional.
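
For checking results in a test program, the suffix rule above could be
sketched roughly like this. locale_name_is_utf8() is a hypothetical
helper, and it only looks at the text after the last dot, so it does
not handle a POSIX-style "@modifier" after the codeset.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return true if a locale name such as
   "Finnish_Finland.utf8" or "fi_FI.UTF-8" ends in a UTF-8 codeset
   suffix. The match is case insensitive and the dash is optional. */
static bool
locale_name_is_utf8(const char *name)
{
    /* The codeset, if present, follows the last dot. */
    const char *suffix = strrchr(name, '.');
    if (suffix == NULL)
        return false;

    ++suffix;

    /* Match "utf" ignoring case. A short suffix fails at its
       terminating null before anything past it is read. */
    static const char utf[] = "utf";
    for (size_t i = 0; i < 3; ++i) {
        if (tolower((unsigned char)suffix[i]) != utf[i])
            return false;
    }

    suffix += 3;

    /* Skip the optional dash, then require exactly "8". */
    if (*suffix == '-')
        ++suffix;

    return suffix[0] == '8' && suffix[1] == '\0';
}
```

For the names in the table, this accepts "Finnish_Finland.utf8" and
rejects "Finnish_Finland.1252" and plain "fi_FI".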
I attached code that demonstrates the issue. I built it under MSYS2 in
the UCRT64 environment, and then ran the test programs in both Command
Prompt and MSYS2 shell.
Note that there is no point in testing in MSYS2's MINGW64 environment whose
toolchain builds against msvcrt.dll. The old msvcrt.dll doesn't support
UTF-8 locales.
Looking at Gettext and Gnulib sources, the problem might be in
localename-unsafe.c. It seems that gl_locale_name_default() should
return something ending in ".UTF-8" when GetACP() == CP_UTF8.
gl_locale_name_default() calls gl_locale_name_from_win32_LCID(), so I
tried changing gl_locale_name_from_win32_LCID() to conditionally append
".UTF-8". It fixed the issue at least in a basic use case. I don't know
the code well at all though, and thus I don't know if this is a good
way to fix the issue.
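
The idea of the change is roughly the following. This is only a sketch
and not the actual Gnulib code: locale_name_with_codeset() and
SKETCH_CP_UTF8 are made-up names, and the active code page is passed in
as a parameter (on Windows it would come from GetACP()) so that the
logic can be tried on any platform.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same value as CP_UTF8 in the Windows headers. */
#define SKETCH_CP_UTF8 65001u

/* Hypothetical helper: copy name to buf and append ".UTF-8" when the
   active code page acp is UTF-8. Returns 0 on success and -1 if the
   result does not fit in buf. */
static int
locale_name_with_codeset(char *buf, size_t size, const char *name,
                         unsigned acp)
{
    const char *suffix = (acp == SKETCH_CP_UTF8) ? ".UTF-8" : "";
    int n = snprintf(buf, size, "%s%s", name, suffix);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With name "fi_FI" this produces "fi_FI.UTF-8" when acp is 65001 and
leaves the name as "fi_FI" for a legacy code page such as 1252.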
An imperfect but still decent workaround with existing Gettext releases
seems to be using the native setlocale() instead of libintl_setlocale().
Translations work, and gettext() still uses the language specified via
environment variables (LANGUAGE, LC_ALL, LC_MESSAGES, and LANG). I know
that libintl_setlocale() increments _nl_msg_cat_cntr but skipping that
step shouldn't create problems with apps that only call setlocale() once
at startup.
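
In an application, the workaround might look something like the sketch
below. It assumes that <libintl.h> redirects setlocale() to
libintl_setlocale() with a macro and that the build defines ENABLE_NLS
when Gettext is in use. init_locale_native() is a hypothetical name.

```c
#include <locale.h>

#ifdef ENABLE_NLS
#   include <libintl.h>
#endif

/* Assumption: <libintl.h> defines setlocale as a macro that expands to
   libintl_setlocale. Removing the macro makes the call below reach the
   C runtime's own setlocale(). The #undef is harmless if there is no
   such macro. */
#undef setlocale

/* Hypothetical helper: initialize the locale once at startup using the
   native setlocale(). Returns the locale string or NULL on failure. */
static const char *
init_locale_native(void)
{
    return setlocale(LC_ALL, "");
}
```

This would be called once early in main(), before the first gettext()
call. As noted above, it skips the _nl_msg_cat_cntr increment, so it is
only meant for programs that don't change the locale later.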
Links to Microsoft's documentation:
https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests#activecodepage
https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170#utf-8-support
--
Lasse Collin
libintl_setlocale-utf8-w32.tar.xz
Description: application/xz