Re: multibyte support (round 4) - tr


From: Assaf Gordon
Subject: Re: multibyte support (round 4) - tr
Date: Wed, 31 Jan 2018 16:33:42 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0

Hello,

On 2018-01-30 09:10 AM, Sebastian Kisela wrote:
>> The patch is getting too big to attach, so it is available here:
>>  [...]
>> (perhaps a non-master branch on the savannah git would be better?)

> Yes, that would be nice, if that is not too problematic.

I'm inclined to do so as well.
Any objections from others to creating a non-master branch
dedicated to multibyte efforts on the official GNU git repository?

> I tried the `tr` part of the patch and the tests passed.

Thank you for testing and reporting back.


> I am not sure I understand it correctly, but the wchar_t type is
> used widely in it. From what I have understood so far, that is risky
> on cygwin (or the others..)

That is very true, and therefore the implementation is partial at best.

Especially given recent discussion here:
https://lists.gnu.org/archive/html/coreutils/2018-01/msg00035.html

Complete multibyte support in 'tr' will require better implementation
(possibly something using 'mbbuffer' like the other programs in the patch).

> Most of the characters ever translated will probably not take more
> than 2 bytes (which is the most important case, in my opinion).
> Do I understand correctly that wider characters are not considered
> so far?
>
> Example of a problematic use case (Georgian letter AEN):
>
>   printf '\xe1\x83\xbd' | src/tr '[:lower:]' '[:upper:]'

Please note a subtle but important issue:

The cygwin/wchar_t/utf-16 limitation is not about how many bytes
the encoded multibyte character occupies, but whether its decoded
unicode codepoint is larger than 65535 (which then does not fit in a
16-bit wchar_t).

In your example, the UTF-8 encoding of "GEORGIAN LETTER AEN"
is indeed 3 bytes: 0xE1 0x83 0xBD.
But it encodes a unicode codepoint of U+10FD (or decimal 4349)
which fits without a problem in 16-bits.
Cygwin should be able to handle that character without a problem.

The problem in cygwin would happen for characters whose unicode
codepoint is above 65535 (also known as characters outside the "Basic Multilingual Plane").

For example, the character "SMILING FACE WITH SUNGLASSES" is encoded
in UTF-8 as 4 bytes: 0xF0 0x9F 0x98 0x8E.
This encodes the unicode value U+1F60E (128526 decimal) which does not
fit in 16-bits.
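
To make the distinction concrete, here is a minimal C sketch that
decodes both byte sequences above and checks whether the resulting
codepoint fits in 16 bits. The helper names utf8_decode and
fits_in_16_bits are hypothetical (they are not part of the patch),
and the decoder assumes well-formed UTF-8 input and does no validation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): decode one well-formed
   UTF-8 sequence of LEN bytes (1..4) into its Unicode codepoint.
   Assumes valid input; no error checking beyond the length.  */
static uint32_t
utf8_decode (const unsigned char *s, size_t len)
{
  /* Mask selecting the payload bits of the lead byte, by length.  */
  static const unsigned char lead_mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };

  assert (len >= 1 && len <= 4);
  uint32_t cp = s[0] & lead_mask[len];
  for (size_t i = 1; i < len; i++)
    cp = (cp << 6) | (s[i] & 0x3F);  /* 6 payload bits per continuation byte */
  return cp;
}

/* Nonzero if CP can be stored in a single 16-bit wchar_t.  */
static int
fits_in_16_bits (uint32_t cp)
{
  return cp <= 0xFFFF;
}
```

With these, the 3-byte sequence E1 83 BD decodes to 0x10FD, which fits,
while the 4-byte sequence F0 9F 98 8E decodes to 0x1F60E, which does not.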

In cygwin such input would be returned as two 16-bit wchar_t's
(i.e. you need to call mbrtowc(3) twice): the first is 0xD83D and the
second is 0xDE0E.
Then the application (e.g. 'tr') would need to merge these two UTF-16
surrogates into one unicode value.
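
The merge step itself is simple arithmetic. A minimal sketch, assuming
the two 16-bit units have already been obtained and form a valid
high/low pair (the function names are hypothetical and not taken from
the patch):

```c
#include <assert.h>
#include <stdint.h>

/* High (lead) surrogates occupy 0xD800..0xDBFF.  */
static int
is_high_surrogate (uint16_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

/* Low (trail) surrogates occupy 0xDC00..0xDFFF.  */
static int
is_low_surrogate (uint16_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Hypothetical helper: merge a valid UTF-16 surrogate pair into the
   single Unicode codepoint it represents.  Each surrogate carries
   10 bits of the value (codepoint - 0x10000).  */
static uint32_t
combine_surrogates (uint16_t hi, uint16_t lo)
{
  assert (is_high_surrogate (hi) && is_low_surrogate (lo));
  return 0x10000
         + ((uint32_t) (hi - 0xD800) << 10)
         + (uint32_t) (lo - 0xDC00);
}
```

For the pair above, combine_surrogates (0xD83D, 0xDE0E) yields 0x1F60E.
A real implementation would also have to handle an unpaired surrogate
or a pair split across input buffers, which this sketch ignores.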

---

Regarding the issue of which characters are most important - I think we
should aim to support all characters, not just the basic multilingual
plane. Especially with the proliferation of emoji and new fancy
characters - those may well be used quite often in the future.

regards,
 - assaf











