coreutils

Re: Multibyte support (round 2)


From: Assaf Gordon
Subject: Re: Multibyte support (round 2)
Date: Mon, 29 Aug 2016 20:02:24 -0400

Hello Eric,

On 08/29/2016 01:13 PM, Eric Blake wrote:
> On 08/27/2016 12:05 AM, Assaf Gordon wrote:
>> Regarding wchar_t == UCS:
> But not in Cygwin, where wchar_t is 2 bytes, and where Cygwin already
> supports surrogate pairs in wchar_t to represent Unicode characters
> beyond 0xffff

Thank you for mentioning this.
On 32-bit AIX, wchar_t is also 2 bytes, but I'm not sure whether it is UCS2 or just BMP.

I can think of a few options:

1. Process entire lines, keep them in memory as multibyte strings in the
current locale, then use gnulib's unicode-normalization functions that
take an entire string (e.g. u8_normalize).
(This was the initial implementation, in
http://lists.gnu.org/archive/html/coreutils/2016-07/msg00018.html ).

2. Detect such systems (where wchar_t==UCS2 or BMP) at runtime or at
configuration time, and then either:
2.1: issue a warning if the input is beyond BMP (meaning partial Unicode
normalization support on such systems)
2.2: add additional code to convert surrogate pairs (UTF-16) into UCS4

3. Decide not to support Unicode normalization on such systems (beyond what
'just works' with BMP characters).
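
To make option 2 more concrete, here is a minimal sketch of what the
detection and the surrogate-pair conversion might look like. It is not
taken from any existing patch: the names WCHAR_IS_NARROW and utf16_decode
are hypothetical, and it assumes that a narrow wchar_t holds UTF-16 code
units (as reported for Cygwin), which may not be true on every platform.
Unpaired surrogates are mapped to U+FFFD here, which is only one possible
policy.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Compile-time detection (hypothetical macro name): wchar_t is "narrow"
   when it cannot hold every Unicode code point up to U+10FFFF.  */
#if WCHAR_MAX < 0x10FFFF
# define WCHAR_IS_NARROW 1
#else
# define WCHAR_IS_NARROW 0
#endif

static int
is_high_surrogate (uint32_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

static int
is_low_surrogate (uint32_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Decode one code point from the N 16-bit units at S into *OUT.
   Return the number of units consumed (1 or 2), or 0 if N is 0.
   An unpaired surrogate is replaced with U+FFFD (an assumption of
   this sketch, not a requirement).  */
static size_t
utf16_decode (const uint16_t *s, size_t n, uint32_t *out)
{
  if (n == 0)
    return 0;

  uint32_t hi = s[0];
  if (is_high_surrogate (hi) && n >= 2 && is_low_surrogate (s[1]))
    {
      /* Combine the two 10-bit payloads and add the 0x10000 offset.  */
      *out = 0x10000 + ((hi - 0xD800) << 10) + (s[1] - 0xDC00);
      return 2;
    }

  /* BMP character, or a lone surrogate that cannot be combined.  */
  *out = (is_high_surrogate (hi) || is_low_surrogate (hi)) ? 0xFFFD : hi;
  return 1;
}
```

With something along these lines, code guarded by WCHAR_IS_NARROW could
either warn when it sees a surrogate (option 2.1) or feed the combined
UCS4 value to the normalization code (option 2.2).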


Comments welcomed,
- assaf



