From wchar_t to char32_t
From: Bruno Haible
Subject: From wchar_t to char32_t
Date: Mon, 19 Jun 2023 20:05:15 +0200
For many years, processing multibyte strings required the mbrtowc family of
functions and the 'wchar_t' type.
The major limitation of this API is that on Windows platforms (Cygwin as well
as native Windows) and in 32-bit mode on AIX, a 'wchar_t' is limited to 16 bits,
and this causes all sorts of bugs with characters outside the Unicode BMP.
Before 2010, we thought that this would only impact rarely used Chinese
characters. But nowadays, emoji are in Unicode, outside the BMP, and
are frequently used on the web. So, supporting characters outside the BMP
has become more important.
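To illustrate why 16 bits are not enough: a code point above U+FFFF cannot be stored in a single 16-bit unit and has to be split into a UTF-16 surrogate pair. The following is a minimal sketch of that arithmetic; the function name to_surrogates is hypothetical and not part of any library.

```c
#include <assert.h>
#include <uchar.h>

/* Hypothetical helper: split a code point outside the BMP into its two
   UTF-16 code units (high surrogate first, then low surrogate).  */
void
to_surrogates (char32_t cp, char16_t pair[2])
{
  /* Only code points U+10000..U+10FFFF need a surrogate pair.  */
  assert (cp >= 0x10000 && cp <= 0x10FFFF);
  char32_t v = cp - 0x10000;                     /* 20-bit value */
  pair[0] = (char16_t) (0xD800 + (v >> 10));     /* upper 10 bits */
  pair[1] = (char16_t) (0xDC00 + (v & 0x3FF));   /* lower 10 bits */
}
```

For example, U+1F600 becomes the two units 0xD83D 0xDE00. Code that treats each 16-bit 'wchar_t' as a complete character sees two meaningless halves instead of one character.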
In 2011, ISO C added the 'char32_t' type as a "32-bit wide character" type.
By now, many OSes provide this type and the corresponding mbrtoc32 function.
Elements of this type are actual Unicode code points. The ISO C 11 standard
only hinted at this, but ISO C 23 actually requires it. All platforms that
have the mbrtoc32 function fulfil this requirement, and Gnulib's substitute
(module 'mbrtoc32') does so as well.
In particular, on glibc systems: since glibc 2.24, mbrtoc32 is identical to
mbrtowc. And the Gnulib convenience functions for char32_t characters
just delegate to the corresponding glibc functions for wchar_t wide characters.
So, we are now in a position to support characters outside the BMP across GNU
programs, in a portable and maintainable way.
I added some documentation a month ago:
https://www.gnu.org/software/gnulib/manual/html_node/Strings-and-Characters.html
The migration from wchar_t to char32_t can be done by writing 'char32_t'
instead of 'wchar_t', and replacing function names according to this table:
  wchar_t             char32_t
  -------             --------
7.31.2
  *wprintf            --           rarely used
  *wscanf             --           rarely used
7.31.3
  fgetwc              --           rarely used, see "The wchar_t mess"
  fputwc              --           rarely used
7.31.4.1
  wcsto{f,d,ld}       --           rarely used
  wcsto{l,ll,ul,ull}  --           rarely used
7.31.4.2
  wcscpy              u32_strcpy
  wcsncpy             u32_strncpy
  wmemcpy             u32_cpy
  wmemmove            u32_move
7.31.4.3
  wcscat              u32_strcat
  wcsncat             u32_strncat
7.31.4.4
  wcscmp              u32_strcmp
  wcscoll             u32_strcoll
  wcsncmp             u32_strncmp
  wcsxfrm             --           rarely used
  wmemcmp             u32_cmp
7.31.4.5/6
  wcschr              u32_strchr
  wcscspn             u32_strcspn
  wcspbrk             u32_strpbrk
  wcsrchr             u32_strrchr
  wcsspn              u32_strspn
  wcsstr              u32_strstr
  wcstok              u32_strtok
  wmemchr             u32_chr
7.31.4.7
  wcslen              u32_strlen
  wmemset             u32_set
7.31.5
  wcsftime            --           rarely used
7.31.6.1
  btowc               btoc32
  wctob               c32tob
7.31.6.2
  mbsinit             mbsinit
7.31.6.3
  mbrlen              --           rarely used, use mbrtoc32 instead
  mbrtowc             mbrtoc32
  wcrtomb             c32rtomb
7.31.6.4
  mbsrtowcs           mbsrtoc32s
  wcsrtombs           c32srtombs
7.32.2.1
  iswalnum            c32isalnum
  iswalpha            c32isalpha
  iswblank            c32isblank
  iswcntrl            c32iscntrl
  iswdigit            c32isdigit
  iswgraph            c32isgraph
  iswlower            c32islower
  iswprint            c32isprint
  iswpunct            c32ispunct
  iswspace            c32isspace
  iswupper            c32isupper
  iswxdigit           c32isxdigit
7.32.2.2
  iswctype            --           rarely used
  wctype              --           rarely used
7.32.3.1
  towlower            c32tolower
  towupper            c32toupper
7.32.3.2
  towctrans           --           rarely used
  wctrans             --           rarely used
POSIX
  wcwidth             c32width
  wcswidth            c32swidth
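As an illustration of the 'mbrtowc' -> 'mbrtoc32' row, here is a minimal before/after sketch of a loop that counts the characters of a multibyte string. The function names count_chars_wc and count_chars_c32 are hypothetical. Only <wchar.h> and <uchar.h> functions are used, so that the sketch does not depend on Gnulib; a real migration would typically also replace calls such as iswalpha with c32isalpha, as listed in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Before: decode with mbrtowc into a wchar_t.
   Returns the number of characters, or (size_t) -1 on invalid or
   incomplete input.  */
size_t
count_chars_wc (const char *s)
{
  assert (s != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t n = strlen (s);
  size_t count = 0;
  while (n > 0)
    {
      wchar_t wc;
      size_t ret = mbrtowc (&wc, s, n, &state);
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;
      if (ret == 0)
        ret = 1;                /* a null character occupies 1 byte */
      s += ret;
      n -= ret;
      count++;
    }
  return count;
}

/* After: the same loop with char32_t and mbrtoc32.  */
size_t
count_chars_c32 (const char *s)
{
  assert (s != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t n = strlen (s);
  size_t count = 0;
  while (n > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, n, &state);
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;
      if (ret == (size_t) -3)
        ret = 0;                /* character produced, no bytes consumed */
      else if (ret == 0)
        ret = 1;                /* a null character occupies 1 byte */
      s += ret;
      n -= ret;
      count++;
    }
  return count;
}
```

The only structural difference is the extra (size_t) -3 return value that ISO C permits for mbrtoc32. The results depend on the current locale, so a program would normally call setlocale (LC_ALL, "") before using either function.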
Paul has already started this migration, in diffutils:
https://git.savannah.gnu.org/gitweb/?p=diffutils.git;a=commitdiff;h=a2e301b52cc5bdb44540aa66860dc59fa1fa5a89
In Gnulib, the following areas will need migration:
  * lib/mbchar.h
    lib/mbiter.h
    lib/mbuiter.h
    Draft patch attached.
  * lib/dfa.c
    lib/localeinfo.h
    lib/localeinfo.c
    Needs to be done carefully, so as not to break gawk.
  * lib/regcomp.c
    lib/regexec.c
    lib/regex_internal.h
    lib/regex_internal.c
    Needs to be done in a way that is acceptable to glibc upstream.
  * lib/fnmatch.c
    Likewise.
  * lib/exclude.c
  * lib/nstrftime.c
  * lib/quotearg.c
Bruno
mbchar-migration.diff