new module mbrtoc32-regular
From: Bruno Haible
Subject: new module mbrtoc32-regular
Date: Tue, 11 Jul 2023 00:45:41 +0200
In the thread "From wchar_t to char32_t" we discussed, in particular, the
mbrtoc32 function.
mbrtoc32, compared to mbrtowc, has two new features:
(a) it overcomes wchar_t limitations, especially the fact that on Windows,
wchar_t is only 16 bits wide.
(b) it allows a multibyte sequence to be mapped to a sequence of char32_t
characters, whereas mbrtowc maps a multibyte sequence to a single
wchar_t (or returns an error).
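To make the comparison concrete, here is a minimal sketch of how a caller might loop over a multibyte string with mbrtoc32 and interpret each return value. The helper name count_chars32 is hypothetical and not part of Gnulib; the sketch assumes a C11 <uchar.h> and, for brevity, does not drain (size_t)-3 results that would follow the last input byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the char32_t values that mbrtoc32 produces
   for the multibyte string S in the current locale.
   Returns (size_t)-1 if S contains an invalid or incomplete sequence.  */
size_t
count_chars32 (const char *s)
{
  assert (s != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);     /* start in the initial state */
  size_t remaining = strlen (s);
  size_t count = 0;

  while (remaining > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, remaining, &state);

      if (ret == (size_t)-1 || ret == (size_t)-2)
        return (size_t)-1;              /* invalid or incomplete input */
      if (ret == (size_t)-3)
        {
          /* Feature (b): one more char32_t from the previous multibyte
             sequence.  No input bytes are consumed by this call.  */
          count++;
          continue;
        }
      if (ret == 0)
        break;                          /* null character; defensive only */

      /* Normal case: RET bytes were consumed and one char32_t stored.  */
      s += ret;
      remaining -= ret;
      count++;
    }
  return count;
}
```

With mbrtowc, the (size_t)-3 branch does not exist at all; that branch is what every caller has to carry once feature (b) is taken seriously.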
With (a), we can satisfy
Goal (A): Support non-BMP characters (such as Emojis) better on Windows,
including Cygwin.
With (b), we could theoretically satisfy
Goal (B): Support locales with BIG5-HKSCS encoding better.
However, (B) is a NON-GOAL.
1) Hardly anyone uses the BIG5-HKSCS encoding.
2) As we have found out, through the diffutils exercise and the 'dfa'
module, supporting goal (B) means that
* Applications need to distinguish places where it's OK to handle
the several Unicode characters separately, such as in mbswidth,
from places where the multibyte character has to be kept as a unit,
and thus a wchar_t needs to be replaced not with a single char32_t
but with a sequence of char32_t.
* Accordingly, there is a need for two different modules 'mbchar':
one that produces a single Unicode character at a time, and one
that produces a sequence of Unicode characters.
* Likewise for the modules 'mbiter' and 'mbuiter'.
This is basically the sort of complexity that we did NOT want to add
for supporting Windows with mbrtowc.
3) It's also a testability problem. Code that is not tested is buggy,
in general. There is no glibc version so far that implements
mbrtoc32 correctly for the BIG5-HKSCS encoding; see
<https://sourceware.org/bugzilla/show_bug.cgi?id=30611>.
In order to test application code, we would have to write an alternate
mbrtoc32 function which, for example, maps the 'ä' character to
U+0061 U+0308.
But this would be even more complexity, for the sake of a hypothetical
scenario.
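For illustration only, such an alternate function could look roughly like the sketch below. Everything in it is hypothetical (the names mock_mbrtoc32 and struct mock_state, and the choice to handle only ASCII plus the UTF-8 bytes of 'ä'); it uses its own state type instead of mbstate_t, and unlike a real mbrtoc32 it does not remember partial input after returning (size_t)-2.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical test double, not part of Gnulib or glibc.  It handles only
   ASCII bytes and the UTF-8 encoding of 'ä' (0xC3 0xA4), which it maps to
   the two-character sequence U+0061 U+0308.  */
struct mock_state
{
  char32_t pending;             /* 0 means "initial state" */
};

size_t
mock_mbrtoc32 (char32_t *pc, const char *s, size_t n, struct mock_state *ps)
{
  if (ps->pending != 0)
    {
      /* Deliver the stored second character without consuming input.  */
      *pc = ps->pending;
      ps->pending = 0;
      return (size_t)-3;
    }
  if (n == 0)
    return (size_t)-2;

  unsigned char b0 = (unsigned char) s[0];
  if (b0 < 0x80)
    {
      *pc = b0;
      return b0 == 0 ? 0 : 1;
    }
  if (b0 == 0xC3)
    {
      if (n < 2)
        return (size_t)-2;      /* incomplete; caller must retry with more */
      if ((unsigned char) s[1] == 0xA4)
        {
          *pc = 0x0061;         /* first character: 'a' */
          ps->pending = 0x0308; /* second: COMBINING DIAERESIS */
          return 2;
        }
    }
  return (size_t)-1;            /* anything else is unsupported here */
}
```

Between the call that returns 2 and the call that returns (size_t)-3, the state is not initial. That intermediate state is exactly what application code would have to be tested against.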
Paul seems to agree that this is a non-goal:
- https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00021.html
"We don't have time to support every oddball coding system that POSIX
allows."
- https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00026.html
"And since it'll likely be a hassle to port the rest of the code to
purely-theoretical platforms where nbytes == (size_t) -3, I suggest
instead simply adding a comment that nbytes cannot be (size_t) -3 there."
- https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00032.html
"you and I have already spent more time on theoretical platforms than
they're likely worth"
Adding a comment would be a possibility. But we can do better by formalizing
the notion that we do NOT want (b).
DEFINITION: We call an mbrtoc32 function _regular_ if
- It never returns (size_t)-3.
- When it returns < (size_t)-2, the mbstate_t is in the initial state.
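As a rough sketch of what the definition asks for, a wrapper around the system's mbrtoc32 could enforce both conditions as shown below. The name regular_mbrtoc32_sketch is hypothetical, and this is not the code in lib/mbrtoc32.c; it only assumes the standard mbrtoc32 and mbsinit functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical wrapper illustrating the two conditions of the definition.
   PS must be non-null in this sketch.  */
size_t
regular_mbrtoc32_sketch (char32_t *pc, const char *s, size_t n,
                         mbstate_t *ps)
{
  size_t ret = mbrtoc32 (pc, s, n, ps);

  /* Condition 1: a regular function never returns (size_t)-3.  The sketch
     treats it as a programming error instead of trying to recover.  */
  assert (ret != (size_t)-3);

  /* Condition 2: after a char32_t has been returned, the state must be
     the initial state.  If the system left it otherwise, reset it.  */
  if (ret < (size_t)-2 && !mbsinit (ps))
    memset (ps, 0, sizeof *ps);

  return ret;
}
```

A caller of such a function can treat every successful call as "one multibyte character in, one char32_t out", the same contract it had with mbrtowc.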
Here I'm adding a Gnulib module that provides a _regular_ mbrtoc32 function.
With a unit test. (Once we have formalized the notion, we can test it through
a unit test.)
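The kind of check such a unit test can make is sketched below. This is an assumption about the general shape, not the contents of tests/test-mbrtoc32-regular.c; the function name decodes_regularly is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical checker: decode S with mbrtoc32 in the current locale and
   report whether every successful call satisfied the definition of a
   regular mbrtoc32.  */
bool
decodes_regularly (const char *s)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t remaining = strlen (s);

  while (remaining > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, remaining, &state);

      if (ret == (size_t)-3)
        return false;           /* violates the first condition */
      if (ret == (size_t)-1 || ret == (size_t)-2)
        return true;            /* bad input; the definition does not
                                   constrain this case, so stop here */
      if (!mbsinit (&state))
        return false;           /* violates the second condition */
      if (ret == 0)
        break;                  /* null character; defensive only */

      s += ret;
      remaining -= ret;
    }
  return true;
}
```

A test would call this on strings covering every multibyte character of interest in each locale under test and fail if it ever returns false.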
2023-07-10  Bruno Haible  <bruno@clisp.org>

	mbrtoc32-regular: Add tests.
	* tests/test-mbrtoc32-regular.c: New file.
	* modules/mbrtoc32-regular-tests: New file.

	mbrtoc32-regular: New module.
	* modules/mbrtoc32-regular: New file.
	* lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
	and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
	* doc/posix-functions/mbrtoc32.texi: Mention the new module.
Attachment: 0001-mbrtoc32-regular-New-module.patch
Attachment: 0002-mbrtoc32-regular-Add-tests.patch