bug-gnulib

Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8


From: Max Horn
Subject: Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
Date: Fri, 6 Jul 2012 13:36:53 +0200

[Resending this from another email address, as there seems to have been a
problem the first time around.]

Hi again,

On 23.06.2012, at 18:36, Paul Eggert wrote:

> On 06/23/2012 07:54 AM, Paolo Bonzini wrote:
>> I'm waiting for feedback from the Gnulib guys.
> 
> Can you please summarize the issue, the proposed fixes,
> and the pros and cons of each?  The discussion has been
> spread out for so long that I've forgotten half of it.

Indeed... and now I have been silent for almost two weeks, too *sigh*.

> No need for anything fancy; URLs are fine.  Thanks.

Sure! Let me start by quoting my email from June 6:

> 1) On Mac OS X, nl_langinfo(CODESET) returns "US-ASCII" if LC_ALL is set to C
> (see also 
> <http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c>)
> 
> 2) On Mac OS X, localcharset.c uses a hard-coded charset.alias table, which 
> does *not* contain a mapping for US-ASCII, but does contain a catch-all 
> entry mapping every unknown charset to UTF-8. Hence, US-ASCII is mapped to 
> UTF-8.
> 
> 3) Finally, on Mac OS X, MB_CUR_MAX is defined as follows:
>  #define      MB_CUR_MAX      (___mb_cur_max())
> and that evaluates to 1 when LC_ALL is set to C, and more generally, when 
> nl_langinfo(CODESET) returns "US-ASCII".
> 

> At this point, GNU sed operates under the assumption that encoding is UTF-8, 
> but uses a MB_CUR_MAX value of 1, which then of course breaks when non-ASCII 
> multibyte chars pop up.
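
For anyone who wants to check points 1 and 3 on their own machine, here is a
minimal sketch. The helper name describe_locale is made up for this email and
is not part of gnulib or sed; it only uses the standard setlocale(),
nl_langinfo() and MB_CUR_MAX.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: switch to the named locale and write what the C
   library reports for it into buf.  Returns MB_CUR_MAX for that locale,
   or 0 if the locale could not be set.  */
size_t
describe_locale (const char *name, char *buf, size_t buflen)
{
  if (setlocale (LC_ALL, name) == NULL)
    return 0;
  snprintf (buf, buflen, "codeset=%s MB_CUR_MAX=%zu",
            nl_langinfo (CODESET), (size_t) MB_CUR_MAX);
  return MB_CUR_MAX;
}
```

Calling describe_locale ("C", buf, sizeof buf) and printing buf shows the two
values that disagree with the UTF-8 alias. Per the points above, Mac OS X
should report the codeset as US-ASCII; other systems may use a different name
for the codeset of the C locale.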

Result: certain commands fail. For example, compare Apple/BSD sed with GNU sed
(which uses gnulib):

> $ echo "Rémi Leblond" | LC_ALL=C /usr/bin/sed -ne "s/.*/'&'/p"
> 'Rémi Leblond'
> 
> But GNU sed does not give the same result:
> $ echo "Rémi Leblond" | LC_ALL=C gsed -ne "s/.*/'&'/p"
> 'R'émi Leblond


My proposed fix: Stop trying to second-guess the OS. Just add a table entry for 
"US-ASCII" to the hardcoded one in localcharset.c. This should work fine on 
Mac OS X 10.4 and newer.
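
To illustrate why a single table entry is enough, here is a simplified sketch
of a first-match alias lookup with a catch-all entry. It is not the actual
code or table from localcharset.c: the struct, the two tables and the function
name are invented for illustration, and mapping "US-ASCII" to "ASCII" is my
assumption about what the new entry would look like.

```c
#include <stddef.h>
#include <string.h>

/* One alias row: the name reported by the OS and the canonical name it
   maps to.  A "*" key acts as the catch-all and must come last.  */
struct alias { const char *from; const char *to; };

/* Hypothetical tables, NOT the real ones from localcharset.c.  */
const struct alias table_without_fix[] = {
  { "ISO8859-1", "ISO-8859-1" },
  { "*",         "UTF-8"      },
};
const struct alias table_with_fix[] = {
  { "ISO8859-1", "ISO-8859-1" },
  { "US-ASCII",  "ASCII"      },   /* the proposed extra entry */
  { "*",         "UTF-8"      },
};

/* Return the canonical name for codeset: the first row whose key matches
   exactly or is the catch-all wins; with no match, codeset is returned
   unchanged.  */
const char *
canonical_codeset (const struct alias *table, size_t n, const char *codeset)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (table[i].from, codeset) == 0
        || strcmp (table[i].from, "*") == 0)
      return table[i].to;
  return codeset;
}
```

With the first table, "US-ASCII" falls through to the catch-all and comes back
as "UTF-8"; with the second, it is matched before the catch-all is reached and
comes back as "ASCII".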

This will *not* break internationalization for most users, as was claimed 
previously, because Apple's Terminal.app by default sets LANG to reflect the 
active locale of the user. The exception is if a user explicitly tells 
Terminal.app not to do that, or manually sets LC_ALL; or if a script does so 
(such as certain parts of git, which hence are broken when being used in 
conjunction with GNU sed on Mac OS X -- ouch).


An alternative patch was suggested by Paul, which I confirmed to also work. 
Personally, I find my solution more logical, but Paul certainly knows tons more 
about these things than I do, and I'll happily defer to him. In the end I don't 
care how this issue affecting people in real life situations is resolved, as 
long as it *is* resolved. Here is his proposed patch:

--- a/lib/localcharset.c
+++ b/lib/localcharset.c
@@ -542,5 +542,12 @@ locale_charset (void)
 if (codeset[0] == '\0')
   codeset = "ASCII";

+#ifdef DARWIN7
+  /* MacOS X sets MB_CUR_MAX to 1 when LC_ALL=C, and "UTF-8"
+     (the default codeset) does not work when MB_CUR_MAX is 1.  */
+  if (strcmp (codeset, "UTF-8") == 0 && MB_CUR_MAX <= 1)
+    codeset = "ASCII";
+#endif
+
 return codeset;
}
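
For comparison, the decision made at the end of locale_charset() with this
patch can be written as a small standalone function, which makes it easy to
try outside of gnulib. This is only a sketch of the logic shown in the diff:
the function name is made up, and MB_CUR_MAX is passed in as a parameter so
the check does not depend on the current locale.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical standalone version of the fallback in the patch above:
   an empty codeset becomes "ASCII", and so does "UTF-8" when the C
   library says characters are at most one byte long.  */
const char *
adjust_codeset (const char *codeset, size_t mb_cur_max)
{
  if (codeset[0] == '\0')
    return "ASCII";
  if (strcmp (codeset, "UTF-8") == 0 && mb_cur_max <= 1)
    return "ASCII";
  return codeset;
}
```

So adjust_codeset ("UTF-8", 1) yields "ASCII", while "UTF-8" with a larger
MB_CUR_MAX is left alone.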



Cheers,
Max

