diffutils-devel

Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Paul Eggert
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Fri, 21 Jul 2023 18:14:02 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 2023-07-21 17:33, Bruno Haible wrote:
> It gets this info from mbrtoc32, which on most platforms gets this info
> from mbrtowc. This multibyte scanner knows when the bytes it has seen
> so far constitute
>    - a complete character, or
>    - an invalid character, or
>    - an incomplete character (i.e. if additional bytes may lead to a
>      complete character).
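
For reference, these three cases line up with the return values that C11 specifies for mbrtoc32 (and that POSIX specifies for mbrtowc): (size_t) -1 for an encoding error, (size_t) -2 for an incomplete sequence, and a byte count otherwise. Here is a minimal sketch of that mapping; the enum and function names are hypothetical, chosen only for illustration, and this is not code from Gnulib.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only.  */
enum mb_status { MB_COMPLETE, MB_INVALID, MB_INCOMPLETE };

/* Map a return value of mbrtoc32 or mbrtowc to one of the three cases.
   mbrtoc32 may also return (size_t) -3, meaning a previously stored
   char32_t is delivered without consuming input; this sketch does not
   distinguish that case and reports it as MB_COMPLETE.  */
static enum mb_status
classify_result (size_t ret)
{
  if (ret == (size_t) -1)
    return MB_INVALID;      /* encoding error */
  if (ret == (size_t) -2)
    return MB_INCOMPLETE;   /* more bytes may complete the character */
  return MB_COMPLETE;       /* 0 for a null character, else bytes consumed */
}
```

A caller would pass the result of something like mbrtoc32 (&ch, p, n, &state) straight to classify_result.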

Ah, I had thought the idea was to treat all the bytes of a byte sequence matching 10646-1[1] R.2 Table 1 as a single invalid "character" (i.e., not a real character) whenever that byte sequence is not valid UTF-8. That's what Kuhn seems to be suggesting in [2].

But what you're describing is something different, something that could be implemented by calling mbrtoc32.

For example, take the byte sequence F4 90 80 80. I had thought you were saying it would be treated as a single byte sequence [F4 90 80 80], because that's in R.2 Table 1. As I understand it now, it would instead be treated as [F4 90] [80] [80]: [F4 90] is invalid rather than incomplete, because no additional bytes can lead to a complete character (F4 90 80 80 would encode U+110000, which is beyond the U+10FFFF limit of UTF-8).

Is this right?
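
To make the question concrete, here is a minimal sketch of the segmentation I'm describing: grow a prefix one byte at a time until it is either complete or invalid, and count the byte that makes it invalid as part of the unit. This is a hand-rolled strict UTF-8 classifier written for illustration, not Gnulib's or glibc's code, and an actual mbrtoc32 may report error boundaries differently. The names classify_prefix, first_unit and split_units are hypothetical.

```c
#include <stddef.h>

enum status { COMPLETE, INVALID, INCOMPLETE };

/* Classify the nonempty prefix S[0..LEN-1] under strict UTF-8 rules.  */
static enum status
classify_prefix (const unsigned char *s, size_t len)
{
  unsigned char b0 = s[0];
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
  size_t need;                          /* total length the lead byte implies */

  if (b0 < 0x80)
    need = 1;
  else if (0xC2 <= b0 && b0 <= 0xDF)
    need = 2;
  else if (0xE0 <= b0 && b0 <= 0xEF)
    {
      need = 3;
      if (b0 == 0xE0) lo = 0xA0;        /* reject overlong forms */
      if (b0 == 0xED) hi = 0x9F;        /* reject surrogates */
    }
  else if (0xF0 <= b0 && b0 <= 0xF4)
    {
      need = 4;
      if (b0 == 0xF0) lo = 0x90;        /* reject overlong forms */
      if (b0 == 0xF4) hi = 0x8F;        /* reject values above U+10FFFF */
    }
  else
    return INVALID;                     /* 80..C1 and F5..FF never lead */

  for (size_t i = 1; i < len && i < need; i++)
    {
      unsigned char min = i == 1 ? lo : 0x80;
      unsigned char max = i == 1 ? hi : 0xBF;
      if (s[i] < min || max < s[i])
        return INVALID;
    }
  return len < need ? INCOMPLETE : COMPLETE;
}

/* Return the length of the first unit of S[0..N-1], N > 0, and store
   its status in *ST.  The unit ends as soon as the prefix is complete
   or invalid, or when the input runs out.  */
static size_t
first_unit (const unsigned char *s, size_t n, enum status *st)
{
  size_t len = 1;
  while ((*st = classify_prefix (s, len)) == INCOMPLETE && len < n)
    len++;
  return len;
}

/* Split S[0..N-1] into units, storing each unit's length in LENS
   (which must have room for N entries).  Return the number of units.  */
static size_t
split_units (const unsigned char *s, size_t n, size_t *lens)
{
  size_t count = 0;
  while (n > 0)
    {
      enum status st;
      size_t len = first_unit (s, n, &st);
      lens[count++] = len;
      s += len;
      n -= len;
    }
  return count;
}
```

Under these assumptions, split_units applied to F4 90 80 80 yields three units of lengths 2, 1 and 1, i.e. [F4 90] [80] [80], whereas F4 8F BF BF (U+10FFFF) comes out as a single complete 4-byte unit.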

[1]: https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
[2]: https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt


