Re: mbcel module for Gnulib?, incomplete multibyte sequences
From: Bruno Haible
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Mon, 17 Jul 2023 00:18:30 +0200

Paul Eggert wrote:
> Although I'm sure mbiter can be improved I
> don't see how it could catch up to mbcel so long as it continues to
> solve a harder problem than mbcel solves.
I don't know exactly what you mean by "harder problem".
The other significant difference that I see is the handling of incomplete
multibyte sequences. When there are 2 or 3 bytes (of, say, UTF-8) that
constitute an incomplete multibyte character at the end of the string,
  - mbiter returns the entire sequence as a single multibyte character
    without a char32_t code,
  - mbcel returns each byte, one by one, as a character without a
    char32_t code.
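
To make the difference concrete, here is a minimal sketch that counts how many "characters" each policy reports for a byte string. It is not the code of mbiter or mbcel and does not use any Gnulib API: the names utf8_expected_len, count_units, WHOLE_SEQUENCE and BYTE_BY_BYTE are hypothetical, the input is assumed to be UTF-8, and the decoder is simplified (it does not reject overlong or surrogate encodings, and it only special-cases an incomplete sequence at the very end of the string, which is the case described above).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policies, named only for this illustration.  */
enum policy { WHOLE_SEQUENCE, BYTE_BY_BYTE };

/* Total length that a UTF-8 sequence starting with lead byte B should
   have, or 0 if B cannot start a sequence (e.g. a continuation byte).  */
static size_t
utf8_expected_len (unsigned char b)
{
  if (b < 0x80)
    return 1;
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return 0;
}

/* Count the units ("characters") reported for the N bytes at S.
   With WHOLE_SEQUENCE, an incomplete sequence at the end of the string
   is one unit; with BYTE_BY_BYTE, each of its bytes is its own unit.  */
static size_t
count_units (const unsigned char *s, size_t n, enum policy p)
{
  size_t units = 0;
  size_t i = 0;

  assert (s != NULL || n == 0);
  while (i < n)
    {
      size_t want = utf8_expected_len (s[i]);
      size_t have = 1;

      if (want == 0)
        {
          /* Stray byte that cannot start a sequence: one unit.  */
          i++;
          units++;
          continue;
        }
      /* Collect the continuation bytes (10xxxxxx) that are present.  */
      while (have < want && i + have < n && (s[i + have] & 0xC0) == 0x80)
        have++;

      if (have == want)
        i += have;                /* complete character */
      else if (i + have == n && p == WHOLE_SEQUENCE)
        i += have;                /* incomplete tail, taken as one unit */
      else
        i += 1;                   /* consume a single byte */
      units++;
    }
  return units;
}
```

For the 3-byte input "a\xE2\x82" (an 'a' followed by the first two bytes of U+20AC), count_units reports 2 units under WHOLE_SEQUENCE and 3 units under BYTE_BY_BYTE. For the complete sequence "\xE2\x82\xAC" both policies report 1 unit.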
The mbcel behaviour does not match best practices:
  - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
    sequence in the same way that it interprets a character that is outside
    the adopted subset".
  - Markus Kuhn's example ([2] section 3) has a section where
    "All bytes of an incomplete sequence should be signalled as a single
    malformed sequence, i.e., you should see only a single replacement
    character in each of the next 10 tests."
This is noticeable in the result of
  $ src/diff -y -t -W 150 UTF-8-test.txt UTF-8-test2.txt
with the attached input files, in an xterm. In the attached screenshot, the
column widths pre-computed by 'diff' don't match the number of U+FFFD
glyphs that the terminal emulator displays.
This may be acceptable as a corner case for 'diff'. But for a module offered
by Gnulib, we should IMO continue to follow the best practice here.
Bruno
[1] https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Attachment: UTF-8-test.txt (text document)
Attachment: UTF-8-test2.txt (text document)
