Re: mbcel module for Gnulib?, incomplete multibyte sequences

diffutils-devel

From:	Paul Eggert
Subject:	Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date:	Tue, 25 Jul 2023 20:54:38 -0700

On 2023-07-24 17:34, Bruno Haible wrote:

Paul Eggert wrote:

It gets this info from mbrtoc32, which on most platforms gets this info
from mbrtowc. This multibyte scanner knows when the bytes it has seen
so far constitute
    - a complete character, or
    - an invalid character, or
    - an incomplete character (i.e. if additional bytes may lead to a
      complete character).


Ah, I had thought that the idea was to treat all the bytes of a byte
sequence from 10646-1[1] R.2 Table 1 as a single invalid "character"
(i.e., not a real character) if the byte sequence is not valid UTF-8.


An arbitrary sequence of invalid bytes (which therefore could be
arbitrarily long) is not meant here. This would not produce good results
for the user, and would not be implementable in O(1) space.

Arbitrarily long sequences would indeed be a problem, but R.2 Table 1
doesn't do that. The length limit is 6 bytes. (Not that this matters
much to us, since glibc isn't taking this approach.)

xterm is probably the only terminal emulator that renders the entire section 3
of [2] as Markus Kuhn proposed.


And even xterm doesn't follow Kuhn's section 5.

I've pretty much given up on having 'diff' accurately count columns of
display for encoding errors. It's just not practical.

Per [1] p. 127 paragraph 3, decoders can decompose it to
   [F4 90] [80] [80]
or to
   [F4] [90] [80] [80]

Since mbrtowc() returns (size_t)(-1) for this sequence, without telling
how long the invalid sequence was, decoders/scanners that are based on
mbrtowc() (or mbrtoc32()) will decompose it like this:
   [F4] [90] [80] [80]
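
The byte-at-a-time decomposition above can be sketched roughly as
follows. This is only an illustration, not code from diffutils or
Gnulib: the name count_units is hypothetical, and treating an
incomplete character at the end of the buffer one byte at a time is an
assumption made for the example. The result for non-ASCII input
depends on the current locale and on the platform's mbrtoc32.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the "units" in the N bytes at S, where a
   unit is either one complete character or one byte that cannot be
   decoded.  */
size_t
count_units (const char *s, size_t n)
{
  size_t units = 0;
  mbstate_t state;
  memset (&state, 0, sizeof state);

  while (n > 0)
    {
      char32_t ch;
      size_t r = mbrtoc32 (&ch, s, n, &state);
      size_t step;

      if (r > n)
        {
          /* (size_t) -1 (encoding error) or (size_t) -2 (incomplete
             character); both exceed N.  mbrtoc32 does not say how long
             the bad sequence is, so consume exactly one byte and
             restart from the initial state.  */
          step = 1;
          memset (&state, 0, sizeof state);
        }
      else if (r == 0)
        step = 1;   /* Null character; assumed to occupy one byte.  */
      else
        step = r;   /* Complete character of R bytes.  */

      s += step;
      n -= step;
      units++;
    }

  return units;
}
```

For plain ASCII input such as "abc" this returns 3 in any locale.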

Actually, these decoders/scanners can decompose it either way. The way
you suggest is easier and I expect everybody does it that way. But a
decoder/scanner could do it the other way, by calling mbrtoc32 with n=1,
then with n=2, and so forth, and seeing when the return value stops
being (size_t) -2 and starts being (size_t) -1. A similar approach would
work for decoders/scanners that use mbcel, mbiter, and the like.

Not that I'm suggesting this. diff can just do things the easy way that
I expect everybody else uses.
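
For concreteness, the probing approach (calling mbrtoc32 with n=1, then
n=2, and so forth) might look something like the sketch below. The
function name shortest_invalid_prefix is hypothetical, and each probe
starts from a fresh conversion state so that it rescans from the first
byte. Where the return value switches from (size_t) -2 to (size_t) -1
depends on the platform's mbrtoc32 and the current locale, so no
particular result is claimed for a sequence like F4 90 80 80.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: return the length of the shortest prefix of the
   N bytes at S that mbrtoc32 reports as an encoding error.  Return 0 if
   S starts with a complete character, or if every prefix of up to N
   bytes is merely incomplete.  */
size_t
shortest_invalid_prefix (const char *s, size_t n)
{
  for (size_t len = 1; len <= n; len++)
    {
      char32_t ch;
      mbstate_t state;
      memset (&state, 0, sizeof state);   /* Fresh state for each probe.  */

      size_t r = mbrtoc32 (&ch, s, len, &state);
      if (r == (size_t) -1)
        return len;   /* First prefix length reported as invalid.  */
      if (r != (size_t) -2)
        return 0;     /* A complete character: nothing invalid here.  */
      /* (size_t) -2: still incomplete, so try one more byte.  */
    }

  return 0;
}
```

A caller could treat the returned number of bytes as a single invalid
unit, instead of always advancing by one byte after an error.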

> [1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
> [2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
