bug-gnulib

new module mbrtoc32-regular


From: Bruno Haible
Subject: new module mbrtoc32-regular
Date: Tue, 11 Jul 2023 00:45:41 +0200

In the thread "From wchar_t to char32_t" we discussed, in particular, the
mbrtoc32 function.

mbrtoc32, compared to mbrtowc, has two new features:
  (a) it overcomes wchar_t limitations, especially the fact that on Windows,
      wchar_t is only 16 bits wide.
  (b) it allows a multibyte sequence to be mapped to a sequence of char32_t
      characters, whereas mbrtowc maps a multibyte sequence to a single
      wchar_t (or returns an error).

With (a), we can satisfy
  Goal (A): Support non-BMP characters (such as Emojis) better on Windows,
            including Cygwin.

With (b), we could theoretically satisfy
  Goal (B): Support locales with BIG5-HKSCS encoding better.

However, (B) is a NON-GOAL.

1) Hardly anyone uses the BIG5-HKSCS encoding.

2) As we have found out, through the diffutils exercise and the 'dfa'
   module, supporting goal (B) means that

   * Applications need to distinguish places where it's OK to handle
     the several Unicode characters separately, such as in mbswidth,
     from places where the multibyte character has to be kept as a unit,
     and thus a wchar_t needs to be replaced not with a single char32_t
     but with a sequence of char32_t.

   * Accordingly, there is a need for two different modules 'mbchar':
     one that produces a single Unicode character at a time, and one
     that produces a sequence of Unicode characters.

   * Likewise for the modules 'mbiter' and 'mbuiter'.

   This is basically the sort of complexity that we did NOT want to add
   for supporting Windows with mbrtowc.

3) It's also a testability problem. Code that is not tested is buggy,
   in general. So far, no glibc version implements mbrtoc32 correctly
   for the BIG5-HKSCS encoding; see
   <https://sourceware.org/bugzilla/show_bug.cgi?id=30611>.
   In order to test application code, we would have to write an alternate
   mbrtoc32 function which, for example, maps the 'ä' character to
   U+0061 U+0308.
   But this would be even more complexity, for the sake of a hypothetical
   scenario.

Paul seems to agree that this is a non-goal:
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00021.html
    "We don't have time to support every oddball coding system that POSIX
     allows."
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00026.html
    "And since it'll likely be a hassle to port the rest of the code to
     purely-theoretical platforms where nbytes == (size_t) -3, I suggest
     instead simply adding a comment that nbytes cannot be (size_t) -3 there."
  - https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00032.html
    "you and I have already spent more time on theoretical platforms than
     they're likely worth"

Adding a comment would be a possibility. But we can do better by formalizing
the notion that we do NOT want (b).

DEFINITION: We call an mbrtoc32 function _regular_ if
  - It never returns (size_t)-3.
  - When it returns < (size_t)-2, the mbstate_t is in the initial state.

Here I'm adding a Gnulib module that provides a _regular_ mbrtoc32 function.
With a unit test. (Once we have formalized the notion, we can test it through
a unit test.)
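
To illustrate the definition, here is a minimal sketch of how the two
conditions could be checked at run time. It is not the code from
tests/test-mbrtoc32-regular.c; the helper name check_regular and its return
convention are made up for this example, and it assumes a non-stateful
encoding in which the null character occupies one byte.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper, not part of Gnulib.
   Walks the N bytes at S with mbrtoc32 in the current locale.
   Returns 1 if every call behaved "regularly",
           0 if a violation of the definition was observed,
          -1 if S is invalid or incomplete in the current locale.  */
int
check_regular (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);   /* initial conversion state */

  while (n > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, n, &state);

      /* Condition 1: (size_t)-3 must never be returned.  */
      if (ret == (size_t) -3)
        return 0;

      /* Invalid or incomplete input: nothing to check here.  */
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return -1;

      /* Condition 2: after returning a character, the state must be
         the initial state.  */
      if (!mbsinit (&state))
        return 0;

      /* Assumption: a null character consumes exactly one byte.  */
      if (ret == 0)
        ret = 1;

      s += ret;
      n -= ret;
    }
  return 1;
}
```

A caller would first select the locale to examine with setlocale and then
call, for example, check_regular ("abc", 3). A result of 1 only says that no
violation was seen for that particular input; it does not prove that the
function is regular for all inputs.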


2023-07-10  Bruno Haible  <bruno@clisp.org>

        mbrtoc32-regular: Add tests.
        * tests/test-mbrtoc32-regular.c: New file.
        * modules/mbrtoc32-regular-tests: New file.

        mbrtoc32-regular: New module.
        * modules/mbrtoc32-regular: New file.
        * lib/mbrtoc32.c (mbrtoc32): If the module 'mbrtoc32-regular' is present
        and the system's mbrtoc32 returned a char32_t, clear the mbstate_t.
        * doc/posix-functions/mbrtoc32.texi: Mention the new module.

Attachment: 0001-mbrtoc32-regular-New-module.patch
Description: Text Data

Attachment: 0002-mbrtoc32-regular-Add-tests.patch
Description: Text Data

