Re: setlocale() issue on Windows with UTF-8
From: Bruno Haible
Subject: Re: setlocale() issue on Windows with UTF-8
Date: Tue, 24 Dec 2024 18:51:32 +0100
Hi,
Lasse Collin wrote:
> This is about binaries built for native Windows, not Cygwin.
>
> Since Windows 10 version 1903, it is possible to use an application
> manifest to change the active code page to UTF-8 for individual
> executables. This way the char-based file system APIs use UTF-8 and
> argv[] in main() is in UTF-8.
>
> Below is a comparison of native setlocale(LC_ALL, "") and
> libintl_setlocale(LC_ALL, ""). This is without any environment
> variables LC_*, LANG, or LANGUAGE. Windows is in the Finnish locale.
>
>            setlocale()            libintl_setlocale()
>   Legacy:  Finnish_Finland.1252   fi_FI
>   UTF-8:   Finnish_Finland.utf8   fi_FI (should be fi_FI.UTF-8)
>
> The main problem is LC_CTYPE. When it's fi_FI instead of something
> ending in .utf8 or .UTF-8, translated messages are garbled, and
> mbrtowc() and related functions don't use UTF-8.
>
> LC_COLLATE needs the .UTF-8 suffix too; strcoll() doesn't use the
> character set from LC_CTYPE. I suppose it's best to use the suffix in
> all locale categories.
>
> According to Microsoft's setlocale() docs, the .UTF-8 suffix is case
> insensitive. The dash is optional.
>
> I attached code that demonstrates the issue. I built it under MSYS2 in
> the UCRT64 environment, and then ran the test programs in both Command
> Prompt and MSYS2 shell.
>
> Note that there is no point to test in MSYS2's MINGW64 environment whose
> toolchain builds against msvcrt.dll. The old msvcrt.dll doesn't support
> UTF-8 locales.
> ...
> Links to Microsoft's documentation:
>
> https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests#activecodepage
>
> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
>
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170#utf-8-support
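The quoted remark that the .UTF-8 suffix is case insensitive and that the dash is optional can be illustrated with a small check. This is only a sketch: the function name sketch_has_utf8_suffix is hypothetical and appears in neither gnulib nor the Microsoft CRT, and it assumes the codeset is whatever follows the last '.' in the locale name.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not part of gnulib or the CRT).
   Return 1 if LOCALE_NAME ends in a ".utf8" or ".utf-8" codeset suffix,
   compared case-insensitively; return 0 otherwise.  */
static int
sketch_has_utf8_suffix (const char *locale_name)
{
  const char *expected = "utf8";
  const char *p = strrchr (locale_name, '.');

  if (p == NULL)
    return 0;                   /* no codeset part at all */
  p++;
  for (; *expected != '\0'; expected++)
    {
      /* The dash between "utf" and "8" is optional.  */
      if (*expected == '8' && *p == '-')
        p++;
      if (tolower ((unsigned char) *p) != *expected)
        return 0;
      p++;
    }
  return *p == '\0';            /* nothing may follow the suffix */
}
```

With this sketch, "Finnish_Finland.utf8" and "fi_FI.UTF-8" are both accepted, while "Finnish_Finland.1252" and "fi_FI" are not.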
Thanks for the good explanations and ready-to-try sample code!
I was not familiar with "application manifests" and this new
UTF-8 environment (in fact, with the two UTF-8 environments [1])
at all, nor with a UCRT-linked mingw, so this helped me a lot
in understanding things and preparing unit tests.
The issue is now fixed through a series of patches in gnulib and
gettext.
The output of your test program, when neither an LC_* nor the LANG
environment variable is set, has changed:
* Linked with legacy.o, the output was
-------------------------------------------------
LC_ALL: C
GetACP: 1252 (legacy)
mbrtowc: 1 (legacy)
utf8str: äößš567 (garbled but it's *not* a bug)
argv[1]: (try providing a non-ASCII argument)
-------------------------------------------------
and now is
-------------------------------------------------
LC_ALL: en_US
GetACP: 1252 (legacy)
mbrtowc: 1 (legacy)
utf8str: äößš567 (garbled but it's *not* a bug)
argv[1]: (try providing a non-ASCII argument)
-------------------------------------------------
* Linked with utf8.o, the output was
-------------------------------------------------
LC_ALL: C
GetACP: 65001 (UTF-8)
mbrtowc: 1 (legacy)
utf8str: äößš567
argv[1]: (try providing a non-ASCII argument)
-------------------------------------------------
and now is
-------------------------------------------------
LC_ALL: en_US.UTF-8
GetACP: 65001 (UTF-8)
mbrtowc: 2 (UTF-8)
utf8str: äößš567
argv[1]: (try providing a non-ASCII argument)
-------------------------------------------------
I hope it works fine in your environment as well.
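For readers without the attached test program, the "mbrtowc:" line in the outputs above can be approximated as follows. This is a sketch under assumptions, not the original test code: the names sketch_locale_decodes_utf8 and sketch_describe are hypothetical, and the probe simply asks mbrtowc() to decode the two UTF-8 bytes of "ä" in whatever locale is currently active.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical probe: return 1 if mbrtowc() in the current LC_CTYPE
   locale decodes the UTF-8 bytes C3 A4 as the single character U+00E4,
   consuming both bytes; return 0 otherwise (for example when the locale
   is single-byte and only one byte is consumed, or on a decoding error).  */
static int
sketch_locale_decodes_utf8 (void)
{
  static const char a_umlaut[] = "\xC3\xA4";   /* "ä" encoded in UTF-8 */
  mbstate_t state;
  wchar_t wc = 0;
  size_t n;

  memset (&state, 0, sizeof state);            /* initial conversion state */
  n = mbrtowc (&wc, a_umlaut, 2, &state);
  return n == 2 && wc == (wchar_t) 0xE4;
}

/* Hypothetical label, mirroring the wording used in the outputs above.  */
static const char *
sketch_describe (int decodes_utf8)
{
  return decodes_utf8 ? "UTF-8" : "legacy";
}
```

A caller would first run setlocale (LC_ALL, "") and then print sketch_describe (sketch_locale_decodes_utf8 ()). Comparing the decoded character, and not only the byte count, avoids mistaking a double-byte legacy code page for UTF-8.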
> Looking at Gettext and Gnulib sources, the problem might be in
> localename-unsafe.c. It seems that gl_locale_name_default() should
> return something ending in ".UTF-8" when GetACP() == CP_UTF8.
> gl_locale_name_default() calls gl_locale_name_from_win32_LCID(), so I
> tried changing gl_locale_name_from_win32_LCID() to conditionally append
> ".UTF-8". It fixed the issue at least in a basic use case. I don't know
> the code well at all though, and thus I don't know if this is a good
> way to fix the issue.
After adding the unit tests, I could confirm that localename-unsafe.c was
one of the files that needed changes, but not the only one.
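The idea from the report, appending ".UTF-8" when GetACP() == CP_UTF8, can be sketched in portable C as below. This is an illustration only and not the change that went into gnulib: the function sketch_locale_name_with_codeset and the macro SKETCH_CP_UTF8 are hypothetical, the code page is passed in as a parameter so that the sketch does not depend on <windows.h>, and placing the codeset before an '@' modifier is an assumption based on the usual ll_CC.CODESET@modifier layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Numeric value of CP_UTF8, as returned by GetACP() on Windows.  */
#define SKETCH_CP_UTF8 65001u

/* Hypothetical helper (not the actual gnulib fix).
   Copy NAME into BUF.  If CODEPAGE is the UTF-8 code page and NAME has
   no codeset part yet, insert ".UTF-8" before any '@' modifier.
   Return 0 on success, -1 if BUF is too small.  */
static int
sketch_locale_name_with_codeset (const char *name, unsigned int codepage,
                                 char *buf, size_t bufsize)
{
  const char *at;
  size_t base_len;
  int needs_suffix;
  int n;

  assert (name != NULL && buf != NULL);

  at = strchr (name, '@');
  base_len = (at != NULL ? (size_t) (at - name) : strlen (name));
  needs_suffix = (codepage == SKETCH_CP_UTF8 && strchr (name, '.') == NULL);

  n = snprintf (buf, bufsize, "%.*s%s%s",
                (int) base_len, name,
                needs_suffix ? ".UTF-8" : "",
                at != NULL ? at : "");
  return (n >= 0 && (size_t) n < bufsize) ? 0 : -1;
}
```

On Windows, the second argument would be the result of GetACP(). With the legacy code page 1252, "fi_FI" is returned unchanged; with 65001 it becomes "fi_FI.UTF-8".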
Thanks for the report!
Bruno
[1] https://lists.gnu.org/archive/html/bug-gnulib/2024-12/msg00159.html