Re: mbcel module for Gnulib?, incomplete multibyte sequences
From: Paul Eggert
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Mon, 17 Jul 2023 16:09:06 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0
On 2023-07-16 15:18, Bruno Haible wrote:
> Paul Eggert wrote:
>> Although I'm sure mbiter can be improved I
>> don't see how it could catch up to mbcel so long as it continues to
>> solve a harder problem than mbcel solves.
>
> I don't know exactly what you mean by "harder problem".
I meant that it solves a harder porting problem because it worries about
more issues, e.g., it worries about mbrtoc32 returning (size_t) -3, or
returning (size_t) -1 in the C locale. I guess it also worries about
column counting (something I hadn't thought about but your email raised
the issue). There are probably other things that it worries about, that
mbcel does not. The more things mbiter needs to worry about the slower
it gets.
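To make the return-value cases concrete, here is a minimal sketch in C11 of the distinctions a fully general caller of mbrtoc32 has to draw. It is not code from mbiter or mbcel; the names step_kind, classify_mbrtoc32_result and decode_one are made up for illustration, and only mbrtoc32 itself is a standard function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical classification of one mbrtoc32 step.  */
enum step_kind
{
  STEP_NUL,               /* decoded the null character */
  STEP_CHAR,              /* decoded one character, consumed 1..n bytes */
  STEP_ENCODING_ERROR,    /* (size_t) -1: invalid byte sequence */
  STEP_INCOMPLETE,        /* (size_t) -2: valid prefix, more bytes needed */
  STEP_NO_INPUT_CONSUMED  /* (size_t) -3: character produced from saved state */
};

/* Map a raw mbrtoc32 return value to one of the cases above.  */
static enum step_kind
classify_mbrtoc32_result (size_t ret)
{
  if (ret == (size_t) -1)
    return STEP_ENCODING_ERROR;
  if (ret == (size_t) -2)
    return STEP_INCOMPLETE;
  if (ret == (size_t) -3)
    return STEP_NO_INPUT_CONSUMED;
  return ret == 0 ? STEP_NUL : STEP_CHAR;
}

/* Decode one character from the N bytes at P in the current locale,
   starting from the initial conversion state.  */
static enum step_kind
decode_one (char const *p, size_t n, char32_t *out)
{
  assert (p != NULL && out != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  return classify_mbrtoc32_result (mbrtoc32 (out, p, n, &state));
}
```

Every extra case is another branch in the inner loop. A caller that can assume (size_t) -3 never occurs, and that the C locale never yields (size_t) -1, has fewer branches to take per character.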
> The other significant difference that I see is the handling of multibyte
> sequences. When there are 2 or 3 bytes (of, say, UTF-8) that constitute an
> incomplete multibyte character at the end of the string,
This isn't a problem for programs like grep and diff, where there's
always a newline at the end of the input buffer.
>   - mbcel returns each byte, one by one, as a character without a
>     char32_t code.
(A nit: it's not a character; it's an encoding error.)
>   - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
>     sequence in the same way that it interprets a character that is outside
>     the adopted subset".
If I understand this requirement correctly mbcel satisfies it, as mbcel
treats those two things in the same way, namely, as sequences of
encoding error bytes.
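The byte-at-a-time treatment can be sketched as follows. This is not mbcel's code: it is a hand-written UTF-8 scanner, independent of the locale, with hypothetical names (scan_result, scan_one, count_error_bytes). It assumes that any byte that does not begin a complete, well-formed sequence is reported as a one-byte encoding error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result: one decoded character or one encoding-error byte.  */
struct scan_result
{
  uint32_t code;  /* meaningful only when !is_error */
  size_t len;     /* bytes consumed; always 1 for an encoding error */
  bool is_error;
};

/* Scan one unit from the N > 0 bytes at P, treated as UTF-8.  */
static struct scan_result
scan_one (unsigned char const *p, size_t n)
{
  assert (n > 0);
  struct scan_result err = { 0, 1, true };
  unsigned char b0 = p[0];
  if (b0 < 0x80)
    return (struct scan_result) { b0, 1, false };

  size_t len;
  uint32_t code, min;
  if (0xC2 <= b0 && b0 <= 0xDF)
    len = 2, code = b0 & 0x1F, min = 0x80;
  else if ((b0 & 0xF0) == 0xE0)
    len = 3, code = b0 & 0x0F, min = 0x800;
  else if (0xF0 <= b0 && b0 <= 0xF4)
    len = 4, code = b0 & 0x07, min = 0x10000;
  else
    return err;  /* stray continuation byte or invalid lead byte */

  if (n < len)
    return err;  /* incomplete sequence at end of buffer */
  for (size_t i = 1; i < len; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return err;  /* sequence cut short by a non-continuation byte */
      code = code << 6 | (p[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if (code < min || 0x10FFFF < code || (0xD800 <= code && code <= 0xDFFF))
    return err;
  return (struct scan_result) { code, len, false };
}

/* Count how many bytes of the N bytes at P are encoding-error bytes.  */
static size_t
count_error_bytes (unsigned char const *p, size_t n)
{
  size_t errors = 0;
  while (n > 0)
    {
      struct scan_result r = scan_one (p, n);
      errors += r.is_error;
      p += r.len;
      n -= r.len;
    }
  return errors;
}
```

With this sketch, the truncated input "\xE2\x82" (the first two bytes of U+20AC) yields two encoding-error units, whereas the complete "\xE2\x82\xAC" yields one character and no errors.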
>   - Markus Kuhn's example ([2] section 3) has a section where
>     "All bytes of an incomplete sequence should be signalled as a single
>     malformed sequence, i.e., you should see only a single replacement
>     character in each of the next 10 tests."
Kuhn is talking about programs that display characters to users and that
need some way to signal encoding errors. But diff is not such a program:
it doesn't need to display a signal for an incomplete sequence, because
it's not responsible for display.
Even for the class of programs that Kuhn is talking about it's not clear
that the practice he recommends is a good one. It's certainly not
typical practice in the GNU/Linux world. It's not true of the first five
applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox,
less, and gnome-terminal.
Even if Kuhn's suggestion were good for display programs, programs like
diff should not treat differing encoding error byte sequences as if they
were equivalent. If two files A and B contain different encoding errors
I expect most users would prefer "diff A B" to report the differences.
I take the point that diff's column counting disagrees with Kuhn's
suggestion. However, there's no standard for columnar display of
encoding errors. Some programs display each encoding error byte as a
single-column character. Some do it as a two-column character. Emacs by
default uses four columns. xterm, the program that you mention, is
glitchy: sometimes it displays a UTF-8-like sequence as a single-column
U+FFFD REPLACEMENT CHARACTER but sometimes it doesn't, and on my
platform, when I cat Kuhn's test to standard output, two of the four
tests in the last screenful fail to line up their columns. There's not
even a standard column width for U+FFFD itself: Kuhn recommends 1 in
<https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>, but 2 is more common in
my experience.
In short, in practice there's no way for diff to tell how encoding
errors are displayed. diff currently guesses 1 column per encoding error
byte, as that's an easy guess. It's not clear that complicating this
guess would be a net win for diff users. Which means mbcel is good
enough for diff.
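As a rough illustration of why the guess matters, here is a deliberately simplified sketch. It is not diff's code; guess_columns is a hypothetical name, and for brevity it assumes every byte below 0x80 occupies one column and every byte at or above 0x80 is an encoding-error byte.

```c
#include <assert.h>
#include <stddef.h>

/* Estimate the display width of the N bytes at LINE, counting
   COLUMNS_PER_ERROR_BYTE columns for each encoding-error byte.
   Simplifying assumptions: bytes < 0x80 take one column each,
   and every byte >= 0x80 is an encoding-error byte.  */
static size_t
guess_columns (unsigned char const *line, size_t n,
               size_t columns_per_error_byte)
{
  assert (line != NULL || n == 0);
  size_t columns = 0;
  for (size_t i = 0; i < n; i++)
    columns += line[i] < 0x80 ? 1 : columns_per_error_byte;
  return columns;
}
```

For the four bytes "ab\xE2\x82" this gives 4 columns under the 1-column guess, 6 under a 2-column display, and 10 under a 4-column display such as Emacs's default, so no single guess lines up with every display program.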
(Composing this email prompted me to document this issue better in the
diffutils manual, so I installed the attached patch there.)
> This may be acceptable as a corner case for 'diff'. But for a module offered
> by Gnulib, we should IMO continue to follow the best practice here.
Although Kuhn's suggestion may be best practice for some applications,
it's not best for applications like diff, and it would be helpful if
Gnulib could support these applications.
0001-doc-document-tab-behavior-better.patch
Description: Text Data