Subject: Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
From: Max Horn
Date: Fri, 6 Jul 2012 13:36:53 +0200
[Resending this message from another email address, as there seems to have
been a problem the first time around.]
Hi again,
On 23.06.2012, at 18:36, Paul Eggert wrote:
> On 06/23/2012 07:54 AM, Paolo Bonzini wrote:
>> I'm waiting for feedback from the Gnulib guys.
>
> Can you please summarize the issue, the proposed fixes,
> and the pros and cons of each? The discussion has been
> spread out for so long that I've forgotten half of it.
Indeed... and now I have been silent for almost two weeks, too *sigh*.
> No need for anything fancy; URLs are fine. Thanks.
Sure! Let me start by quoting my email from June 6:
> 1) On Mac OS X, nl_langinfo(CODESET) returns "US-ASCII" if LC_ALL is set to C
> (see also
> <http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c>)
>
> 2) On Mac OS X, localcharset.c uses a hard-coded charset.alias table, which
> does *not* contain a mapping for US-ASCII, but does contain a catch-all
> mapping that maps every unknown charset to UTF-8. Hence, US-ASCII is mapped to
> UTF-8.
>
> 3) Finally, on Mac OS X, MB_CUR_MAX is defined as follows:
> #define MB_CUR_MAX (___mb_cur_max())
> and that evaluates to 1 when LC_ALL is set to C, and more generally, when
> nl_langinfo(CODESET) returns "US-ASCII".
>
> At this point, GNU sed operates under the assumption that encoding is UTF-8,
> but uses a MB_CUR_MAX value of 1, which then of course breaks when non-ASCII
> multibyte chars pop up.
Result: certain commands fail. For example, compare Apple/BSD sed with GNU sed
(which uses gnulib):
> $ echo "Rémi Leblond" | LC_ALL=C /usr/bin/sed -ne "s/.*/'&'/p"
> 'Rémi Leblond'
>
> GNU sed, however, stops matching after the first character:
> $ echo "Rémi Leblond" | LC_ALL=C gsed -ne "s/.*/'&'/p"
> 'R'émi Leblond
My proposed fix: stop trying to second-guess the OS. Just add an entry for
"US-ASCII" to the hard-coded table in localcharset.c. This should work fine on
Mac OS X 10.4 and newer.
Contrary to what was claimed previously, this will *not* break
internationalization for most users, because Apple's Terminal.app by default
sets LANG to reflect the active locale of the user. The exceptions are when a
user explicitly tells Terminal.app not to do that or manually sets LC_ALL, or
when a script does so (such as certain parts of git, which are therefore broken
when used in conjunction with GNU sed on Mac OS X -- ouch).
An alternative patch was suggested by Paul, which I confirmed to also work.
Personally, I find my solution more logical, but Paul certainly knows tons more
about these things than I do, and I'll happily defer to him. In the end I don't
care how this issue affecting people in real life situations is resolved, as
long as it *is* resolved. Here is his proposed patch:
--- a/lib/localcharset.c
+++ b/lib/localcharset.c
@@ -542,5 +542,12 @@ locale_charset (void)
   if (codeset[0] == '\0')
     codeset = "ASCII";
 
+#ifdef DARWIN7
+  /* MacOS X sets MB_CUR_MAX to 1 when LC_ALL=C, and "UTF-8"
+     (the default codeset) does not work when MB_CUR_MAX is 1. */
+  if (strcmp (codeset, "UTF-8") == 0 && MB_CUR_MAX <= 1)
+    codeset = "ASCII";
+#endif
+
   return codeset;
 }
Cheers,
Max