bug-bash


From: Rich Felker
Subject: Re: builtin printf behaves incorrectly with "c and 'c character-value arguments
Date: Mon, 5 Nov 2007 22:53:27 -0500
User-agent: Mutt/1.4.2.2i

On Mon, Nov 05, 2007 at 10:23:43PM -0500, Chet Ramey wrote:
> Rich Felker wrote:
> 
> > I'm not sure what you mean. For a Latin-1 locale there is no
> > difference, but if the locale is a different legacy locale, the
> > wchar_t value (Unicode scalar value on systems with __STDC_ISO_10646__
> > defined) needs to be returned. If you're doubtful about the intent of
> > the standard, why not file a request for interpretation?
> 
> I'm not doubtful about the standard's intent.  When the user has not
> chosen to use a locale that contains multibyte characters, not only
> should bash not second-guess the user by returning a multibyte
> character, functions such as mbrtowc or mblen/mbrlen will not return
> "multibyte" values (e.g., mbrlen will return `1' and mbrtowc will return
> `-61' -- converted to 195, since it's unsigned -- as its wchar value
> while converting 1 character in your example).

This 195 _is_ its value as a multibyte character in a locale with
ISO-8859-1 as its codeset. In such a locale, it's also the value of
the byte (interpreted as unsigned). So here it doesn't matter which
you use; either is equally correct.
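
To make the -61 versus 195 point concrete, here is a minimal sketch. The helper name byte_value is made up for illustration and is not anything in bash.

```c
/* Hypothetical helper, for illustration only (not from bash):
 * return the value of a byte interpreted as unsigned. On platforms
 * where plain char is signed, the byte 0xC3 reads as -61; the cast
 * to unsigned char maps it to 195. */
int byte_value(char c)
{
    return (unsigned char)c;
}
```

In an ISO-8859-1 locale this number is also the wchar_t value mbrtowc would be expected to produce for that byte, which is why the two approaches agree there.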

Where something different happens is if your locale has a different
codeset. For instance, in KOI8-R, there is a character "²" which is
placed on a different byte (9D) than in ISO-8859 encodings (B2). But
regardless of your locale,

$ printf %d\\n \'²

should print 178 (hex B2, the character's Unicode scalar value),
provided that your system implementation uses the
same values for wchar_t regardless of locale. These semantics are
useful because they actually tell you something about the identity of
the character. But most importantly, it's just illogical for the
function to behave differently based on whether MB_CUR_MAX is 1 or
something greater than 1, rather than being based on the actual locale
encoding. "²" is a "²" in a KOI8-R locale just as much as it is a "²"
in a UTF-8 locale. Bash's printf should not treat the KOI8-R locale
badly just because all characters happen to fit into one byte. The
mbrtowc function will give the correct result for all locales, whether
or not they have characters that take multiple bytes to represent;
special-casing locales that don't just gives illogical (and
non-conformant!) behavior.
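
As a rough sketch of what I mean (this is not bash's actual code, and the function name first_char_value is invented here), the conversion can go through mbrtowc unconditionally, with no check on MB_CUR_MAX:

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only: decode the first
 * character of s with mbrtowc() under the current LC_CTYPE locale.
 * Returns the wchar_t value, 0 for an empty string, or -1 if the
 * bytes are not a valid, complete character in that locale. */
long first_char_value(const char *s)
{
    mbstate_t st;
    wchar_t wc;
    size_t n;

    if (*s == '\0')
        return 0;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    n = mbrtowc(&wc, s, strlen(s), &st);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;                  /* invalid or incomplete sequence */

    return (long)wc;
}
```

The caller is assumed to have already called setlocale(LC_CTYPE, "") or similar. The same code path then serves single-byte and multibyte locales alike. On a system where wchar_t holds Unicode values and a KOI8-R locale is installed and selected, I would expect the byte 9D to come back as 178, though I have not shown that run here.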

Rich


P.S. For my own usage I'd be plenty happy as long as the bug is fixed
in UTF-8 based locales since that's all I ever intend to use. But I
maintain that the current behavior is incorrect and nonconformant in
other locales as well. If you want a compromise, why not make the
correct behavior be dependent on strict posix mode?



