Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
From: Max Horn
Subject: Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
Date: Wed, 6 Jun 2012 13:43:55 +0200
Dear all,
I subscribed to this list solely to respond to this thread, and hopefully help
iron out issues caused by installing GNU sed on Mac OS X (regressions in
behavior compared to the system's sed). My interest stems from the fact that I
am the maintainer of the GNU sed package in the Fink package management system
(for reference, see <http://www.finkproject.org/> and
<http://pdb.finkproject.org/pdb/package.php/sed>).
To recall, a simplified way to reproduce the issue involves setting LC_ALL=C,
and then feeding some UTF-8 data into sed. Using the sed shipped with Mac OS X,
this works fine:
$ echo "Rémi Leblond" | LC_ALL=C /usr/bin/sed -ne "s/.*/'&'/p"
'Rémi Leblond'
But using GNU sed does not:
$ echo "Rémi Leblond" | LC_ALL=C gsed -ne "s/.*/'&'/p"
'R'émi Leblond
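For context, the closing quote in the GNU sed output lands right before the first non-ASCII character. In UTF-8 that character, "é", is a two-byte sequence, which the following small sketch makes visible. Note that byte_count is a hypothetical helper introduced only for this illustration; it is not part of sed or of the original report.

```shell
# byte_count is a hypothetical helper (an assumption for illustration).
# It prints the number of bytes in its argument, independent of the locale.
byte_count() {
    printf '%s' "$1" | LC_ALL=C wc -c | tr -d '[:space:]'
}

# "é" written as explicit UTF-8 bytes (0xC3 0xA9), so the script does not
# depend on the encoding of the file it is stored in.
e_acute=$(printf '\303\251')

byte_count "R"; echo          # an ASCII letter is one byte
byte_count "$e_acute"; echo   # "é" in UTF-8 is two bytes
```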
Let me sum up the reasons for this:
1) On Mac OS X, nl_langinfo(CODESET) returns "US-ASCII" if LC_ALL is set to C
(see also
<http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c>)
2) On Mac OS X, localcharset.c uses a hard-coded charset.alias table, which
does *not* contain a mapping for US-ASCII, but does contain a catch-all entry
that maps every unknown charset to UTF-8. Hence, US-ASCII is mapped to UTF-8.
3) Finally, on Mac OS X, MB_CUR_MAX is defined as follows:
#define MB_CUR_MAX (___mb_cur_max())
and that evaluates to 1 when LC_ALL is set to C, and more generally, when
nl_langinfo(CODESET) returns "US-ASCII".
At this point, GNU sed operates under the assumption that the encoding is
UTF-8, but uses an MB_CUR_MAX value of 1, which then of course breaks when
non-ASCII multibyte characters pop up.
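The lookup described in point 2 can be sketched roughly as follows. This is only a model of the behavior, not the code in localcharset.c: the function name resolve_charset_current is hypothetical, and the ISO8859-1 entry is an illustrative assumption. Only the catch-all and the missing US-ASCII entry are taken from the description above.

```shell
# resolve_charset_current is a hypothetical stand-in for the alias lookup
# against the hard-coded table; it is not the real implementation.
resolve_charset_current() {
    case "$1" in
        ISO8859-1) echo "ISO-8859-1" ;;  # illustrative entry (assumption)
        *)         echo "UTF-8" ;;       # catch-all: any unknown name becomes UTF-8
    esac
}

# There is no entry for US-ASCII, so it falls through to the catch-all.
resolve_charset_current "US-ASCII"   # prints UTF-8
```

To see which codeset name the system itself reports, `LC_ALL=C locale charmap` should print it on systems that provide the POSIX locale utility.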
One way to fix this was already proposed: namely, to add a mapping from
"US-ASCII" to "ASCII" into the built-in conversion table. This was rejected by
Bruno Haible in
<http://lists.gnu.org/archive/html/bug-gnulib/2012-01/msg00342.html> with the
argument:
> Nah. "Let's break gettext() based internationalization of all GNU programs
> for most MacOS X users" won't get my approval.
Well, I strongly disagree with this statement. First off, to me, breaking i18n
would actually be preferable over breaking shell scripts that run fine
elsewhere or with Apple's sed. But as a matter of fact, I don't think that i18n
would be broken at all (or if so, then only for a small minority of power users
who know how to deal with it). As far as I can tell, this breakage claim is
based on an incorrect claim farther up in Bruno's email, to quote:
> [...] Therefore the normal situation on MacOS X is this:
> $ env | grep LC_
> $ locale
> LANG=
> LC_COLLATE="C"
> LC_CTYPE="C"
> LC_MESSAGES="C"
> LC_MONETARY="C"
> LC_NUMERIC="C"
> LC_TIME="C"
> LC_ALL=
That's not correct (though it might have been several years ago):
Terminal.app on Mac OS X has an option "Set LANG environment variable", which
is enabled by default. As a result, on my system (set to the German locale by
default), for example, I get this:
$ locale
LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL=
Switching to French, I get the corresponding values, too (at least in
freshly opened terminals).
There is another claim that appears to be incorrect to me (but perhaps I simply
misunderstand it):
> There are several systems with locale encoding UTF-8 in the all user
> locales: Plan 9, BeOS, Haiku, MacOS X, Cygwin 1.7, and there will be more,
> because it's a natural choice nowadays. [...]
While Mac OS X defaults to UTF-8, it also supports other encodings in user
locales. For example, the following German locales are supported:
de_DE
de_DE.ISO8859-1
de_DE.ISO8859-15
de_DE.UTF-8
Indeed, on the same Terminal.app preference page on which one can toggle the
"Set LANG environment variable" setting, one can also choose another encoding.
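A list like the one above can be obtained from `locale -a`, which prints the locale names known to the system. The sketch below filters that output for one language; locales_for_language is a hypothetical helper name, and the output depends entirely on which locales are installed.

```shell
# locales_for_language is a hypothetical helper (an assumption for illustration).
# It reads locale names on standard input, one per line, and prints those that
# equal the given ll_CC name or start with it followed by a dot.
locales_for_language() {
    while IFS= read -r name; do
        case "$name" in
            "$1"|"$1".*) printf '%s\n' "$name" ;;
        esac
    done
}

# Show the German (Germany) locales, if any are installed.
locale -a 2>/dev/null | locales_for_language de_DE
```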
In summary, I don't see any harm in adding the "US-ASCII" => "ASCII" mapping
to the hardcoded charset.alias table. On the contrary, it resolves all issues
with this code known to me; and always defaulting to UTF-8 doesn't seem
sensible either, given that we cannot safely rely on the active encoding being
UTF-8. So, not trying to outguess the OS seems to me to be the preferable
route here...
And it seems to me that with this change, localization / internationalization
in GNU apps would still work fine, at least under normal circumstances. Only
"power users" who choose to disable the "Set LANG environment variable" option
will have to deal with the consequences; but they should be able to set their
LANG / LC_ALL / etc. env variables according to their needs.
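For such users, a few lines in a shell startup file would be enough. The following is a minimal sketch, assuming a POSIX shell; ensure_lang is a hypothetical helper name, and de_DE.UTF-8 is only an example value.

```shell
# ensure_lang is a hypothetical helper (an assumption for illustration).
# It leaves an existing LC_ALL or LANG untouched and otherwise exports the
# given default, so locale-aware programs see a UTF-8 locale.
ensure_lang() {
    if [ -z "${LC_ALL:-}" ] && [ -z "${LANG:-}" ]; then
        LANG=$1
        export LANG
    fi
}

ensure_lang "de_DE.UTF-8"   # example value; use the locale you actually want
```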
But perhaps I am totally missing out on something -- in that case, I hope you
can teach me about that, and perhaps we can come up with another way to improve
the overall experience for the most important party involved here: The people
using sed :-)
Cheers,
Max