bug-gnulib

Re: why is MB_LEN_MAX so large (16) on glibc


From: Pádraig Brady
Subject: Re: why is MB_LEN_MAX so large (16) on glibc
Date: Thu, 14 May 2015 10:06:31 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 14/05/15 01:30, Bruno Haible wrote:
> [CCing bug-gnulib to share the understanding about i18n issues]
> 
> Pádraig Brady wrote on 13.05.2015:
>> MB_LEN_MAX was changed from 6 to 16 with:
>> https://sourceware.org/git/?p=glibc.git;a=commit;f=include/limits.h;h=d64b6ad075
>> Do you know why the value 16 is used exactly?
> 
> This was motivated either by the desire to be completely future-proof
> for the next 30 years (and you don't know what kinds of encodings will
> be invented).
> 
> Or because for a couple of months Ulrich Drepper & François Pinard were
> considering to add locales with stateful encodings such as ISO-2022-JP-2.
> This later turned out to be not worth the effort (as the user experience
> with filenames and shell in such locales was found to be terrible).

Excellent. Info like that is nigh on impossible to search for.

>> BTW I see MB_LEN_MAX is 4 on musl libc.
> 
> The value of 4 is sufficient to accommodate all stateless encodings in
> use, including UTF-8 (which was restricted from max. 6 to 4 bytes by
> an ISO standard) and GB18030. But it's not necessarily future-proof.

Right. A good summary of the UTF8 6 -> 4 bytes thing is at:
https://stijndewitt.wordpress.com/2014/08/09/max-bytes-in-a-utf-8-char/
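
To make the 4-byte bound concrete, here is a minimal sketch of the length
calculation, assuming the RFC 3629 form of UTF-8 (code points up to U+10FFFF,
surrogates excluded). The helper name utf8_encoded_len is hypothetical and is
not part of glibc or gnulib.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode code point `cp`
   in UTF-8 as restricted by RFC 3629, or 0 if `cp` cannot be encoded. */
int utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;                     /* ASCII */
    if (cp < 0x800)
        return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                     /* UTF-16 surrogates are not valid */
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;                     /* highest valid code point */
    return 0;                         /* would need the old 5- or 6-byte forms */
}
```

Under this definition no valid code point needs more than 4 bytes, which is
the value musl uses for MB_LEN_MAX.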

I see that MB_CUR_MAX is 6 for UTF-8 on glibc.
I wonder could that be reduced to 4?

I see that one has to be more careful with the _compile time_
constant MB_LEN_MAX, though it would be tempting to at least reduce it to 8,
requiring a recompile for the unlikely case of supporting legacy
stateful encodings.
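
The compile-time versus run-time distinction can be sketched as follows.
This is only an illustration: the names mb_slot and cur_max_for_locale are
hypothetical, and the value returned depends on which locales are installed.

```c
#include <limits.h>
#include <locale.h>
#include <stdlib.h>

/* MB_LEN_MAX is a constant expression, so it can size a struct member. */
struct mb_slot {
    char bytes[MB_LEN_MAX];
};

/* Hypothetical helper: switch LC_CTYPE to the named locale and report
   MB_CUR_MAX for it, or 0 if the locale is not available.
   MB_CUR_MAX is evaluated at run time and depends on the current locale. */
size_t cur_max_for_locale(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return 0;
    return MB_CUR_MAX;
}
```

Whatever locale is selected, the result should never exceed MB_LEN_MAX, which
is why lowering the constant would require recompiling anything that uses it
to size buffers.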

>> I was worried that it implied that wctomb() might convert a wide char to 
>> _multiple_ encoded chars
>> for some character/encoding combinations?
> 
> No, neither POSIX nor glibc supports locales with encodings where
> a wide char would correspond to multiple characters or where a
> character would correspond to multiple wide chars.

This was my key question answered.

> In particular,
> this prevented EUC-JISX0213 from being used as a locale encoding in
> glibc [1], thus accelerating the move to UTF-8.

Interesting, though EUC-JISX0213 might now be supported
with newer Unicode standards that include the appropriate chars?

>> For example iso-2022-kr can have up to 7 bytes per encoded char,
>> so maybe wctomb() might output two of those for some wide chars,
>> and the extra two bytes were added for alignment?
> 
> Yes, this was part of the considerations regarding stateful encodings.
> 
>> Specifically why I'm wondering about this is to size the
>> output buffer for wctomb() appropriately.
>> Note the Linux man page for wctomb() says to use MB_CUR_MAX,
>> while the FreeBSD man page says to use MB_LEN_MAX.
> 
> That's simply because MB_CUR_MAX is not a compile-time constant,
> and therefore for a long time the declaration of a local variable
>   char buf[MB_CUR_MAX];
> required GCC or C++, and the FreeBSD people are not keen adopters
> of GCC extensions.
> 
>> I also asked this at:
>> http://stackoverflow.com/q/30222107/4421
> 
> Bruno
> 
> [1] https://sourceware.org/git/?p=glibc.git;a=blob;f=iconvdata/euc-jisx0213.c
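
For the buffer-sizing question, a minimal sketch that avoids a variable-length
array by using the compile-time bound MB_LEN_MAX. The wrapper name
encode_wchar is hypothetical; only wctomb() itself is standard.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: convert one wide character to its multibyte form
   in the current locale. `out` must have room for MB_LEN_MAX bytes.
   Returns the number of bytes written, or -1 if `wc` is not representable. */
int encode_wchar(wchar_t wc, char out[MB_LEN_MAX])
{
    /* Reset any shift state first; this only matters for stateful encodings. */
    (void) wctomb(NULL, L'\0');
    return wctomb(out, wc);
}
```

Since a wide char never maps to more than one character, MB_LEN_MAX bytes are
enough for a single call in any locale, and MB_CUR_MAX bytes are enough for
the current one.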

thanks!

Pádraig.



