bug-gnulib

Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Paul Eggert
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Mon, 17 Jul 2023 16:09:06 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 2023-07-16 15:18, Bruno Haible wrote:
> Paul Eggert wrote:
>> Although I'm sure mbiter can be improved I
>> don't see how it could catch up to mbcel so long as it continues to
>> solve a harder problem than mbcel solves.
>
> I don't know exactly what you mean by "harder problem".

I meant that it solves a harder porting problem because it worries about more issues, e.g., it worries about mbrtoc32 returning (size_t) -3, or returning (size_t) -1 in the C locale. I guess it also worries about column counting (something I hadn't thought about but your email raised the issue). There are probably other things that it worries about, that mbcel does not. The more things mbiter needs to worry about the slower it gets.
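As a rough illustration of the extra cases mentioned above, the following sketch classifies the return values that ISO C defines for mbrtoc32. The enum and the function name are hypothetical and are not part of mbiter or mbcel; the point is only that each additional case is one more branch a general-purpose iterator has to consider.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of an mbrtoc32 return value.  */
enum step_kind
{
  STEP_NUL,               /* 0: the null character was decoded */
  STEP_CHAR,              /* 1..n: that many bytes formed a character */
  STEP_ENCODING_ERROR,    /* (size_t) -1: invalid byte sequence */
  STEP_INCOMPLETE,        /* (size_t) -2: valid so far, but incomplete */
  STEP_NO_INPUT_CONSUMED  /* (size_t) -3: a character was stored without
                             consuming any further input */
};

enum step_kind
classify_mbrtoc32_result (size_t ret)
{
  if (ret == (size_t) -1)
    return STEP_ENCODING_ERROR;
  if (ret == (size_t) -2)
    return STEP_INCOMPLETE;
  if (ret == (size_t) -3)
    return STEP_NO_INPUT_CONSUMED;
  if (ret == 0)
    return STEP_NUL;
  return STEP_CHAR;
}
```

A scanner that can assume (size_t) -3 never occurs can drop one of these branches from its inner loop.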


> The other significant difference that I see is the handling of multibyte
> sequences. When there are 2 or 3 bytes (of, say, UTF-8) that constitute an
> incomplete multibyte character at the end of the string,

This isn't a problem for programs like grep and diff, where there's always a newline at the end of the input buffer.


>   - mbcel returns each byte, one by one, as a character without a
>     char32_t code.

(A nit: it's not a character; it's an encoding error.)


>   - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
>     sequence in the same way that it interprets a character that is outside
>     the adopted subset".

If I understand this requirement correctly mbcel satisfies it, as mbcel treats those two things in the same way, namely, as sequences of encoding error bytes.
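To make the byte-by-byte treatment concrete, here is a minimal sketch that assumes UTF-8 input and uses a hand-written decoder. The struct and function names are hypothetical; this is not mbcel's code or API, only an illustration of reporting each byte of a malformed or incomplete sequence as its own encoding error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; it is not mbcel's own type.  */
struct scanned
{
  uint32_t code;   /* code point; meaningful only when is_error == 0 */
  int is_error;    /* nonzero: the element is one encoding-error byte */
  size_t len;      /* bytes consumed; always 1 for an error */
};

/* Scan one element from the N > 0 bytes at P, assuming UTF-8.  */
struct scanned
scan_utf8 (const unsigned char *p, size_t n)
{
  const struct scanned error = { 0, 1, 1 };
  unsigned char b = p[0];
  size_t need;
  uint32_t c, min;

  if (b < 0x80)
    return (struct scanned) { b, 0, 1 };
  else if (0xC2 <= b && b <= 0xDF)
    { need = 2; c = b & 0x1F; min = 0x80; }
  else if (0xE0 <= b && b <= 0xEF)
    { need = 3; c = b & 0x0F; min = 0x800; }
  else if (0xF0 <= b && b <= 0xF4)
    { need = 4; c = b & 0x07; min = 0x10000; }
  else
    return error;               /* stray continuation or invalid lead byte */

  if (n < need)
    return error;               /* incomplete sequence at end of buffer */
  for (size_t i = 1; i < need; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return error;           /* sequence cut short by another byte */
      c = c << 6 | (p[i] & 0x3F);
    }
  if (c < min || (0xD800 <= c && c <= 0xDFFF) || 0x10FFFF < c)
    return error;               /* overlong, surrogate, or out of range */
  return (struct scanned) { c, 0, need };
}

/* Count the encoding-error bytes in the N bytes at BUF.  */
size_t
count_encoding_errors (const unsigned char *buf, size_t n)
{
  size_t errors = 0;
  for (size_t i = 0; i < n; )
    {
      struct scanned s = scan_utf8 (buf + i, n - i);
      errors += s.is_error;
      i += s.len;
    }
  return errors;
}
```

With this sketch, the three bytes 'a', 0xE2, 0x82 scan as one character followed by two encoding-error bytes, whereas 0xE2 0x82 0xAC scans as the single character U+20AC.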


>   - Markus Kuhn's example ([2] section 3) has a section where
>       "All bytes of an incomplete sequence should be signalled as a single
>        malformed sequence, i.e., you should see only a single replacement
>        character in each of the next 10 tests."

Kuhn is talking about programs that display characters to users and that need some way to signal encoding errors. But diff is not such a program: it doesn't need to display a signal for an incomplete sequence, because it's not responsible for display.

Even for the class of programs that Kuhn is talking about it's not clear that the practice he recommends is a good one. It's certainly not typical practice in the GNU/Linux world. It's not true of the first five applications that I tested on Ubuntu 23.04: Emacs, Chrome, Firefox, less, and gnome-terminal.

Even if Kuhn's suggestion were good for display programs, programs like diff should not treat differing encoding error byte sequences as if they were equivalent. If two files A and B contain different encoding errors I expect most users would prefer "diff A B" to report the differences.
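One way to picture this is a comparison in which every encoding-error byte keeps its own value. The sketch below uses a hypothetical element type and function name, not taken from diffutils or mbcel, and the choice to sort errors after characters is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical element: a character, or one encoding-error byte.  */
struct elem
{
  uint32_t code;      /* code point when err == 0 */
  unsigned char err;  /* nonzero: the value of the encoding-error byte */
};

/* Compare A and B.  Characters compare by code point, error bytes by
   byte value, and (by assumption) errors sort after all characters.  */
int
elem_cmp (struct elem a, struct elem b)
{
  if (a.err || b.err)
    {
      if (! (a.err && b.err))
        return a.err ? 1 : -1;
      return (a.err > b.err) - (a.err < b.err);
    }
  return (a.code > b.code) - (a.code < b.code);
}
```

Under this scheme two files whose only difference is 0xE2 versus 0xE3 in an invalid position still compare as different, whereas collapsing both to a single replacement character would hide that difference.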

I take the point that diff's column counting disagrees with Kuhn's suggestion. However, there's no standard for columnar display of encoding errors. Some programs display each encoding error byte as a single-column character. Some do it as a two-column character. Emacs by default uses four columns. xterm, the program that you mention, is glitchy: sometimes it displays a UTF-8-like sequence as a single-column U+FFFD REPLACEMENT CHARACTER but sometimes it doesn't, and on my platform, when I cat Kuhn's test to standard output, two of the four tests in the last screenful fail to line up their columns. There's not even a standard column width for U+FFFD itself: Kuhn recommends 1 in <https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>, but 2 is more common in my experience.

In short, in practice there's no way for diff to tell how encoding errors are displayed. diff currently guesses 1 column per encoding error byte, as that's an easy guess. It's not clear that complicating this guess would be a net win for diff users. Which means mbcel is good enough for diff.
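The guess can be sketched as follows. The element type, the function name, and the error_width parameter are hypothetical and exist only to show how the total depends on the assumed per-byte width; diff itself simply assumes 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element: a character with a known display width,
   or one encoding-error byte (whose width field is ignored).  */
struct elem
{
  int is_error;
  int width;
};

/* Guess the columns occupied by the N elements at E, charging
   ERROR_WIDTH columns for each encoding-error byte.  */
size_t
guess_columns (const struct elem *e, size_t n, int error_width)
{
  size_t cols = 0;
  for (size_t i = 0; i < n; i++)
    cols += e[i].is_error ? error_width : e[i].width;
  return cols;
}
```

For one single-column character followed by two error bytes, error_width 1 gives 3 columns, while a four-column convention like Emacs's default gives 9. No single value matches every display program.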

(Composing this email prompted me to document this issue better in the diffutils manual, so I installed the attached patch there.)


> This may be acceptable as a corner case for 'diff'. But for a module offered
> by Gnulib, we should IMO continue to follow the best practice here.

Although Kuhn's suggestion may be best practice for some applications, it's not best for applications like diff, and it would be helpful if Gnulib could support these applications.

Attachment: 0001-doc-document-tab-behavior-better.patch
Description: Text Data

