diffutils-devel
Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Paul Eggert
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Tue, 25 Jul 2023 20:54:38 -0700

On 2023-07-24 17:34, Bruno Haible wrote:
> Paul Eggert wrote:
>>> It gets this info from mbrtoc32, which on most platforms gets this info
>>> from mbrtowc. This multibyte scanner knows when the bytes it has seen
>>> so far constitute
>>>     - a complete character, or
>>>     - an invalid character, or
>>>     - an incomplete character (i.e. if additional bytes may lead to a
>>>       complete character).
>>
>> Ah, I had thought that the idea was to treat all the bytes of a byte
>> sequence from 10646-1[1] R.2 Table 1 as a single invalid "character"
>> (i.e., not a real character) if the byte sequence is not valid UTF-8.
>
> An arbitrary sequence of invalid bytes (which therefore could be
> arbitrarily long) is not meant here. This would not produce good results
> for the user, and would not be implementable in O(1) space.

Arbitrarily long sequences would indeed be a problem, but R.2 Table 1 doesn't do that. The length limit is 6 bytes. (Not that this matters much to us, since glibc isn't taking this approach.)


> xterm is probably the only terminal emulator that renders the entire section 3
> of [2] as Markus Kuhn proposed.

And even xterm doesn't follow Kuhn's section 5.

I've pretty much given up on having 'diff' accurately count columns of display for encoding errors. It's just not practical.


> Per [1] p. 127 paragraph 3, decoders can decompose it to
>    [F4 90] [80] [80]
> or to
>    [F4] [90] [80] [80]
>
> Since mbrtowc() returns (size_t)(-1) for this sequence, without telling
> how long the invalid sequence was, decoders/scanners that are based on
> mbrtowc() (or mbrtoc32()) will decompose it like this:
>    [F4] [90] [80] [80]

Actually, these decoders/scanners can decompose it either way. The way you suggest is easier and I expect everybody does it that way. But a decoder/scanner could do it the other way, by calling mbrtoc32 with n=1, then with n=2, and so forth, and seeing when the return value stops being (size_t) -2 and starts being (size_t) -1. A similar approach would work for decoders/scanners that use mbcel, mbiter, etc.

Not that I'm suggesting this. diff can just do things the easy way that I expect everybody else uses.

> [1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
> [2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
