Re: mbcel module for Gnulib?, incomplete multibyte sequences
From: Bruno Haible
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Thu, 20 Jul 2023 17:28:23 +0200
Hi Paul,
> >> Although I'm sure mbiter can be improved I
> >> don't see how it could catch up to mbcel so long as it continues to
> >> solve a harder problem than mbcel solves.
> >
> > I don't know exactly what you mean by "harder problem".
>
> I meant that it solves a harder porting problem because it worries about
> more issues, e.g., it worries about mbrtoc32 returning (size_t) -3,
This makes for only a small performance difference. It can be measured
by comparing the benchmarks of
  mbiterf-bench-tests mbuiterf-bench-tests
versus
  mbiterf-bench-tests mbuiterf-bench-tests mbrtoc32-regular
In the latter case, the lines marked with
  #if !GNULIB_MBRTOC32_REGULAR
are optimized away.
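
To make the (size_t) -3 case concrete, here is a minimal sketch of a decoding loop over the standard mbrtoc32 function from <uchar.h>. It is not Gnulib's code: the name count_chars and the error policy (one error per byte on (size_t) -1, one error for a truncated tail on (size_t) -2) are assumptions chosen for illustration. The branch marked "irregular" is the kind of code that a guarantee of regular mbrtoc32 behavior would make unnecessary.

```c
#include <assert.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in S[0..N-1] using mbrtoc32.
   If ERRORS is non-null, store the number of encoding errors there.  */
size_t
count_chars (const char *s, size_t n, size_t *errors)
{
  assert (s != NULL);
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t count = 0;
  size_t nerrors = 0;

  while (n > 0)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, n, &state);

      if (ret == (size_t) -3)
        {
          /* Irregular case: another char32_t was produced from bytes
             that were already consumed.  No input is consumed here.
             (This sketch does not drain such output once N reaches 0.)  */
          count++;
        }
      else if (ret == (size_t) -1)
        {
          /* Encoding error: treat one byte as an error and restart.  */
          nerrors++;
          memset (&state, 0, sizeof state);
          s++;
          n--;
        }
      else if (ret == (size_t) -2)
        {
          /* Incomplete character at the end of the buffer.  */
          nerrors++;
          break;
        }
      else
        {
          if (ret == 0)
            ret = 1;            /* A null character occupies one byte.  */
          count++;
          s += ret;
          n -= ret;
        }
    }

  if (errors != NULL)
    *errors = nerrors;
  return count;
}
```

A caller that wants UTF-8 decoding would first have to select a UTF-8 locale with setlocale; without that call the loop runs in the "C" locale.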
> or returning (size_t) -1 in the C locale.
Indeed, this shows up as a difference between mbiterf and mbcel in the
test cases c, f:
         mbiterf    mbcel   mbuiterf
    c      1.145    0.670      1.179
    f     13.028    5.714     14.654
But since the glibc people are already working on resolving this issue,
I won't spend time optimizing it one way or the other.
> > The other significant difference that I see is the handling of multibyte
> > sequences. When there are 2 or 3 bytes (of, say, UTF-8) that constitute an
> > incomplete multibyte character at the end of the string,
>
> This isn't a problem for programs like grep and diff, where there's
> always a newline at the end of the input buffer.
I disagree: Any program can run into it when the input is
<some valid UTF-8 characters><an incomplete UTF-8 character><newline>
My screenshot from the 'src/diff -y -t' output in an xterm also shows
that there is an issue.
> > - mbcel returns each byte, one by one, as a character without a
> > char32_t code.
>
> (A nit: it's not a character; it's an encoding error.)
Sure. Some programs then treat that error as if a U+FFFD character
had been read.
> > - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
> > sequence in the same way that it interprets a character that is outside
> > the adopted subset".
>
> If I understand this requirement correctly mbcel satisfies it, as mbcel
> treats those two things in the same way, namely, as sequences of
> encoding error bytes.
No, I don't think mbcel satisfies it, since mbcel interprets the
"malformed sequence" not like "a character" but like multiple characters.
> > - Markus Kuhn's example ([2] section 3) has a section where
> > "All bytes of an incomplete sequence should be signalled as a single
> > malformed sequence, i.e., you should see only a single replacement
> > character in each of the next 10 tests."
>
> Kuhn is talking about programs that display characters to users and that
> need some way to signal encoding errors. But diff is not such a program:
> it doesn't need to display a signal for an incomplete sequence, because
> it's not responsible for display.
Kuhn's writeup is generally about UTF-8 decoding. In the year 2000, when it
was written, the most important decoders were in display engines of terminal
emulators. Nowadays, we have UTF-8 decoders in many, many programs.
The Unicode Standard has several pages about this topic:
Unicode 15.0 section 3.9
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf pages 124..129
It is also referenced by section 5.22
https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf page 255.
The relevant text starts on page 127:
"U+FFFD Substitution of Maximal Subparts
An increasing number of implementations are adopting the handling of
ill-formed subsequences as specified in the W3C standard for encoding
to achieve consistent U+FFFD replacements. ...
Although the Unicode Standard does not require this practice for conformance
..."
See also Table 3-11 on page 128.
So, clearly, this is not a *requirement* for a conforming UTF-8 decoder.
But the Unicode Standard's authors would not describe it at such length
if it weren't good practice.
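
The difference between the two policies can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from mbiterf or mbcel: the names next_unit and count_error_units are hypothetical, and the decoder is a hand-written UTF-8 scanner that follows the well-formed byte ranges of the Unicode Standard rather than calling mbrtoc32. With per_byte false, an ill-formed run is counted once per maximal subpart; with per_byte true, every byte of it is counted as its own error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return the length of the next unit in S[0..N-1]
   (N > 0).  *VALID is true for a well-formed character; otherwise the
   returned length is that of the maximal ill-formed subpart.  */
static size_t
next_unit (const unsigned char *s, size_t n, bool *valid)
{
  assert (s != NULL && n > 0);
  unsigned char b0 = s[0];
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
  size_t need;                          /* number of continuation bytes */

  if (b0 < 0x80)
    {
      *valid = true;
      return 1;
    }
  else if (b0 >= 0xC2 && b0 <= 0xDF)
    need = 1;
  else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
      need = 2;
      if (b0 == 0xE0)
        lo = 0xA0;              /* exclude overlong forms */
      else if (b0 == 0xED)
        hi = 0x9F;              /* exclude surrogates */
    }
  else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
      need = 3;
      if (b0 == 0xF0)
        lo = 0x90;              /* exclude overlong forms */
      else if (b0 == 0xF4)
        hi = 0x8F;              /* exclude values above U+10FFFF */
    }
  else
    {
      *valid = false;           /* 80..C1 and F5..FF never start a character */
      return 1;
    }

  size_t len = 1;
  while (len <= need && len < n && s[len] >= lo && s[len] <= hi)
    {
      lo = 0x80;                /* later bytes use the plain range */
      hi = 0xBF;
      len++;
    }
  *valid = (len == need + 1);
  return len;
}

/* Count the encoding errors in S[0..N-1] under one of the two policies.  */
size_t
count_error_units (const unsigned char *s, size_t n, bool per_byte)
{
  size_t errors = 0;
  for (size_t i = 0; i < n; )
    {
      bool valid;
      size_t len = next_unit (s + i, n - i, &valid);
      if (!valid)
        errors++;
      /* Per-byte policy: skip only one byte after an error.  */
      i += (valid || !per_byte) ? len : 1;
    }
  return errors;
}
```

For the input "a", 0xE2, 0x82, newline (a truncated three-byte sequence followed by a newline), the maximal-subpart policy reports one error and the per-byte policy reports two.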
> It's certainly not
> typical practice in the GNU/Linux world. It's not true of the first five
> applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox,
> less, and gnome-terminal.
Well, then I'll have to write a couple of QoI (quality of implementation)
reports...
(xterm does it right, but you are right that nowadays gnome-terminal and other
vte-based terminal emulators are the majority.)
> Even if Kuhn's suggestion were good for display programs, programs like
> diff should not treat differing encoding error byte sequences as if they
> were equivalent. If two files A and B contain different encoding errors
> I expect most users would prefer "diff A B" to report the differences.
Sure. If you were to use the 'mbiterf' module instead of mbcel, the
mb_equal macro from mbchar.h would do the right thing. Yes, an mb_equal
call is a bit more complicated than the same_ch_err definition that
you have in diffutils/src/io.c. That's the unavoidable consequence of
treating a sequence of 2 or 3 bytes as *one* error.
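
To illustrate why the comparison gets more complicated, here is a minimal sketch of an equality test over decoded units in which an error may span several bytes. The struct unit type and the functions make_char, make_error and unit_equal are hypothetical names for this example; they are not Gnulib's mbchar_t or its mb_equal macro. The idea is only that valid characters compare by code point, while errors compare by their exact bytes, so two different encoding errors are never reported as equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNIT_MAX_BYTES 4

/* Hypothetical decoded unit: a valid character or a multi-byte error.  */
struct unit
{
  bool valid;                            /* true if CODE is meaningful */
  uint32_t code;                         /* code point, if VALID */
  size_t len;                            /* number of bytes, if !VALID */
  unsigned char bytes[UNIT_MAX_BYTES];   /* the raw bytes, if !VALID */
};

/* Build a unit for a valid character.  */
struct unit
make_char (uint32_t code)
{
  struct unit u = { .valid = true, .code = code };
  return u;
}

/* Build a unit for an encoding error consisting of LEN bytes.  */
struct unit
make_error (const char *bytes, size_t len)
{
  assert (bytes != NULL && len >= 1 && len <= UNIT_MAX_BYTES);
  struct unit u = { .valid = false, .len = len };
  memcpy (u.bytes, bytes, len);
  return u;
}

/* Two units are equal if both are the same character, or both are
   errors made of exactly the same bytes.  */
bool
unit_equal (struct unit a, struct unit b)
{
  if (a.valid && b.valid)
    return a.code == b.code;
  if (a.valid || b.valid)
    return false;
  return a.len == b.len && memcmp (a.bytes, b.bytes, a.len) == 0;
}
```

With one-byte errors the last branch reduces to a single byte comparison; the length check and memcmp are the extra cost of treating 2 or 3 bytes as one error.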
> There's not
> even a standard column width for U+FFFD itself: Kuhn recommends 1 in
> <https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>, but 2 is more common in
> my experience.
Indeed, width considerations of strings with control characters are
hairy. And U+FFFD counts as a control character, according to
wcwidth(0xFFFD) == -1
(on glibc systems).
> > This may be acceptable as a corner case for 'diff'. But for a module offered
> > by Gnulib, we should IMO continue to follow the best practice here.
>
> Although Kuhn's suggestion may be best practice for some applications,
> it's not best for applications like diff, and it would be helpful if
> Gnulib could support these applications.
According to what I read in the Unicode Standard (above), it's a best practice
for all kinds of applications.
I'm not asking to rewrite the new code that you added in 'diff'. But for
other programs, from 'ls' to 'sed', I continue to think it would be a good
idea to follow that best practice.
Bruno