diffutils-devel

Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Paul Eggert
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Sat, 22 Jul 2023 09:12:02 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 2023-07-21 17:33, Bruno Haible wrote:
> It gets this info from mbrtoc32, which on most platforms gets this info
> from mbrtowc. This multibyte scanner knows when the bytes it has seen
> so far constitute
>    - a complete character, or
>    - an invalid character, or
>    - an incomplete character (i.e. if additional bytes may lead to a
>      complete character).

I thought about this a bit more, and it can be a tricky business. For example, in UTF-8 the byte sequence E0 80 is not an incomplete character (in the sense that additional bytes may lead to a complete character), because every byte you append to E0 80 causes glibc mbrtoc32 to return (size_t) -1. Yet glibc mbrtoc32 returns (size_t) -2 for E0 80.
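For anyone who wants to try this on their own platform, here is a minimal sketch in standard C. The helper names (use_utf8_locale, probe, any_extension_ok) and the locale names it tries are my own assumptions for illustration and are not part of Gnulib or diffutils. probe returns the raw mbrtoc32 result for a byte sequence starting from the initial shift state, and any_extension_ok appends each possible byte to a prefix and reports whether any extension avoids (size_t) -1.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names
   (an assumption; the names available vary by system).
   Return nonzero on success.  */
static int
use_utf8_locale (void)
{
  static char const *const names[] = { "C.UTF-8", "en_US.UTF-8", "C.utf8" };
  for (size_t i = 0; i < sizeof names / sizeof *names; i++)
    if (setlocale (LC_CTYPE, names[i]))
      return 1;
  return 0;
}

/* Hypothetical helper: decode the first N bytes at P, starting from
   the initial shift state, and return mbrtoc32's raw result.  */
static size_t
probe (char const *p, size_t n)
{
  mbstate_t st;
  char32_t c;
  memset (&st, 0, sizeof st);
  return mbrtoc32 (&c, p, n, &st);
}

/* Hypothetical helper: return nonzero if appending some single byte
   to the N-byte prefix at P yields anything other than (size_t) -1,
   i.e. a complete character or a still-incomplete one.  */
static int
any_extension_ok (char const *p, size_t n)
{
  char buf[16];
  if (n + 1 > sizeof buf)
    return 0;
  memcpy (buf, p, n);
  for (int b = 0; b < 256; b++)
    {
      buf[n] = (char) b;
      if (probe (buf, n + 1) != (size_t) -1)
        return 1;
    }
  return 0;
}
```

After a successful use_utf8_locale (), comparing probe ("\xE0\x80", 2) against any_extension_ok ("\xE0\x80", 2) shows whether your C library has the mismatch described above. Results for E0 80 depend on the implementation, so I would not rely on them in portable code.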

Even if we ignore this issue and define an "invalid character" to be a byte sequence S such that mbrtoc32 returns (size_t) -1 for S but returns (size_t) -2 for each of S's prefixes, I don't see how mbiter would make it easier for diff to behave as Markus Kuhn suggested. When given an "invalid character" S, mbiter would simply tell diff that the next byte is the first byte of S, and it would be up to diff to figure out how many bytes are in S - which is the same thing diff would have to do if it used mbcel. This is because, as I understand it, mb_len (mbi_cur (ITER)) merely returns a byte count of 1 in this situation.
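To make that definition concrete, here is a rough sketch of the work diff (or a converter) would have to do itself to find the extent of such an S. It uses only standard C and does not call mbiter or mbcel; the names use_utf8_locale, probe and invalid_seq_len are hypothetical, chosen for this illustration.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names
   (an assumption; the names available vary by system).  */
static int
use_utf8_locale (void)
{
  static char const *const names[] = { "C.UTF-8", "en_US.UTF-8", "C.utf8" };
  for (size_t i = 0; i < sizeof names / sizeof *names; i++)
    if (setlocale (LC_CTYPE, names[i]))
      return 1;
  return 0;
}

/* Hypothetical helper: mbrtoc32's raw result for the first N bytes
   at P, starting from the initial shift state.  */
static size_t
probe (char const *p, size_t n)
{
  mbstate_t st;
  char32_t c;
  memset (&st, 0, sizeof st);
  return mbrtoc32 (&c, p, n, &st);
}

/* Return the length of the "invalid character" S at the start of the
   AVAIL bytes at P, using the definition above: mbrtoc32 yields
   (size_t) -1 for S and (size_t) -2 for each proper prefix of S.
   Return 0 if the bytes start with a valid character, or if all
   AVAIL bytes form a still-incomplete prefix.  */
static size_t
invalid_seq_len (char const *p, size_t avail)
{
  for (size_t n = 1; n <= avail; n++)
    {
      size_t r = probe (p, n);
      if (r == (size_t) -1)
        return n;               /* shortest prefix that is an error */
      if (r != (size_t) -2)
        return 0;               /* a complete, valid character */
    }
  return 0;                     /* incomplete through AVAIL bytes */
}
```

This rescans each prefix from the initial state, so it is quadratic in the sequence length. That is harmless for sequences of a few bytes, but a real scanner would presumably keep the mbstate_t and feed one byte at a time.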

For GNU diff this is not much of an issue, since it's not responsible for display and in practice the display is so mucked up by encoding errors that there's no real point for diff to worry about this. However, for converters it might be an issue. And for converters I don't see why mbiter would help any more than mbcel does.

There might be an opportunity here to change mbiter so that mb_len (mbi_cur (ITER)) returns a byte count greater than 1 in this situation, which might help converters (though it will bloat mbiter). In that case, perhaps we could use mbiter for converters where this extra feature is useful, and use mbcel for scanners where mbiter's complexity is not needed.


