Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Bruno Haible
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Mon, 17 Jul 2023 00:18:30 +0200

Paul Eggert wrote:
> Although I'm sure mbiter can be improved I 
> don't see how it could catch up to mbcel so long as it continues to 
> solve a harder problem than mbcel solves.

I don't know exactly what you mean by "harder problem".

The other significant difference that I see is the handling of incomplete
multibyte sequences. When there are 2 or 3 bytes (of, say, UTF-8) that
constitute an incomplete multibyte character at the end of the string,

  - mbiter returns the entire sequence as a single multibyte character
    without a char32_t code,
  - mbcel returns each byte, one by one, as a character without a
    char32_t code.
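
A minimal sketch of the two policies, for illustration only: it is not the
mbiter or mbcel code. The names utf8_expected_len and count_units and the
group_incomplete flag are hypothetical, and the decoder is simplified (it
does not reject overlong forms or surrogates, and it groups a truncated
sequence wherever it occurs, not only at the end of the string).

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes a UTF-8 sequence should have,
   judged from its lead byte; 0 if the byte cannot start a sequence.  */
static size_t
utf8_expected_len (unsigned char c)
{
  if (c < 0x80)
    return 1;
  if (c >= 0xC2 && c <= 0xDF)
    return 2;
  if (c >= 0xE0 && c <= 0xEF)
    return 3;
  if (c >= 0xF0 && c <= 0xF4)
    return 4;
  return 0;
}

/* Hypothetical helper: count the "characters" reported for the N bytes
   at S.  If GROUP_INCOMPLETE is nonzero, a truncated sequence counts as
   one unit; otherwise each of its bytes counts as a unit of its own.  */
static size_t
count_units (const unsigned char *s, size_t n, int group_incomplete)
{
  size_t units = 0;
  size_t i = 0;

  while (i < n)
    {
      size_t want = utf8_expected_len (s[i]);
      size_t have = 1;

      if (want == 0)
        {
          /* Stray continuation byte or invalid lead byte: one unit.  */
          i++;
          units++;
          continue;
        }

      /* Collect the continuation bytes (10xxxxxx) that are present.  */
      while (have < want && i + have < n && (s[i + have] & 0xC0) == 0x80)
        have++;

      if (have == want || group_incomplete)
        i += have;      /* complete character, or grouped incomplete one */
      else
        i += 1;         /* byte-by-byte: consume only the lead byte */
      units++;
    }
  return units;
}
```

For the two bytes "\xE2\x82" (a truncated U+20AC), the grouping policy
reports 1 unit and the byte-by-byte policy reports 2.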

The mbcel behaviour does not match best practices:

  - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
    sequence in the same way that it interprets a character that is outside
    the adopted subset".
  - Markus Kuhn's UTF-8 test file ([2] section 3) has a subsection that says
      "All bytes of an incomplete sequence should be signalled as a single
       malformed sequence, i.e., you should see only a single replacement
       character in each of the next 10 tests."

This is noticeable in the result of
  $ src/diff -y -t -W 150 UTF-8-test.txt UTF-8-test2.txt
with the attached input files, in an xterm. In the attached screenshot, the
column widths pre-computed by 'diff' don't match the number of U+FFFD
glyphs displayed by the terminal emulator.

This may be acceptable as a corner case for 'diff'. But for a module offered
by Gnulib, we should IMO continue to follow the best practice here.

Bruno

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

Attachment: UTF-8-test.txt
Description: Text document

Attachment: UTF-8-test2.txt
Description: Text document

PNG image

