
Re: why is MB_LEN_MAX so large (16) on glibc


From: Bruno Haible
Subject: Re: why is MB_LEN_MAX so large (16) on glibc
Date: Thu, 14 May 2015 02:30:14 +0200
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )

[CCing bug-gnulib to share the understanding about i18n issues]

Pádraig Brady wrote on 13.05.2015:
> MB_LEN_MAX was changed from 6 to 16 with:
> https://sourceware.org/git/?p=glibc.git;a=commit;f=include/limits.h;h=d64b6ad075
> Do you know why the value 16 is used exactly?

This was motivated by one of two things. Either by the desire to be
completely future-proof for the next 30 years (you don't know what
kinds of encodings will be invented).

Or by the fact that, for a couple of months, Ulrich Drepper & François
Pinard were considering adding locales with stateful encodings such as
ISO-2022-JP-2. This later turned out not to be worth the effort (the
user experience with filenames and the shell in such locales was found
to be terrible).

> BTW I see MB_LEN_MAX is 4 on musl libc.

The value of 4 is sufficient to accommodate all stateless encodings in
use, including UTF-8 (which was restricted from max. 6 to 4 bytes by
an ISO standard) and GB18030. But it's not necessarily future-proof.
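
To illustrate the UTF-8 part of this: once code points are limited to
U+10FFFF, no encoded character needs more than 4 bytes. A minimal sketch,
assuming a hypothetical helper name utf8_encoded_length (it is not a libc
function) and a code point passed as a plain integer:

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: returns the number of bytes
   that UTF-8 needs for the code point cp, or -1 if cp is not encodable
   (a surrogate, or beyond U+10FFFF). */
int utf8_encoded_length(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return -1;              /* surrogates are not valid in UTF-8 */
    if (cp < 0x80)
        return 1;               /* ASCII */
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;               /* the largest case after the restriction */
    return -1;                  /* would have needed 5 or 6 bytes in the old scheme */
}
```

The largest accepted value, 0x10FFFF, falls in the 4-byte branch, which is
why a limit of 4 covers UTF-8.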

> I was worried that it implied that wctomb() might convert a wide char to
> _multiple_ encoded chars for some character/encoding combinations?

No, neither POSIX nor glibc supports locales with encodings where
a wide char would correspond to multiple characters or where a
character would correspond to multiple wide chars. In particular,
this prevented EUC-JISX0213 from being used as a locale encoding in
glibc [1], thus accelerating the move to UTF-8.

> For example iso-2022-kr can have up to 7 bytes per encoded char,
> so maybe wctomb() might output two of those for some wide chars,
> and the extra two bytes were added for alignment?

Yes, this was part of the considerations regarding stateful encodings.

> Specifically why I'm wondering about this is to size the
> output buffer for wctomb() appropriately.
> Note the linux man page for wctomb() says to use MB_CUR_MAX,
> while the freebsd man page says to use MB_LEN_MAX

That's simply because MB_CUR_MAX is not a compile-time constant,
and therefore for a long time the declaration of a local variable
  char buf[MB_CUR_MAX];
required GCC or C++, and the FreeBSD people are not keen adopters
of GCC extensions.
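
For sizing the wctomb() output buffer portably, a minimal sketch along
these lines should do. The helper name encode_wchar is hypothetical; the
only assumption is the standard guarantee that MB_CUR_MAX never exceeds
MB_LEN_MAX, so a buffer of MB_LEN_MAX bytes is large enough in any locale
and needs no variable-length array:

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX: compile-time constant */
#include <stdlib.h>   /* wctomb, MB_CUR_MAX: run-time value */
#include <wchar.h>

/* Hypothetical helper, for illustration only: encodes wc in the current
   locale's encoding into out, which must have room for MB_LEN_MAX bytes.
   Returns the number of bytes written, or -1 if wc cannot be encoded. */
int encode_wchar(wchar_t wc, char out[MB_LEN_MAX])
{
    /* MB_CUR_MAX depends on the current locale but is bounded by MB_LEN_MAX. */
    assert(MB_CUR_MAX <= (size_t) MB_LEN_MAX);

    /* Reset wctomb's internal shift state (only matters for stateful encodings). */
    (void) wctomb(NULL, 0);

    return wctomb(out, wc);
}
```

A caller would declare char buf[MB_LEN_MAX]; and pass it in. Only the
first "return value" bytes of buf are meaningful afterwards, and the
result is not NUL-terminated.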

> I also asked this at:
> http://stackoverflow.com/q/30222107/4421

Bruno

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=iconvdata/euc-jisx0213.c



