bug-gnu-utils

GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8


From: Stephen J. Butler
Subject: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
Date: Fri, 1 Jun 2012 03:27:30 -0500

SETUP:
$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.7.4
BuildVersion:   11E53

$ ~/gnu/bin/sed --version
GNU sed version 4.2.1

PROBLEM: With UTF-8 input, but LANG and LC_ALL set to C, sed regular
expressions break on multibyte sequences. For example (constructed
from part of a git command):

$ echo "Rémi Leblond" | LANG=C LC_ALL=C ~/gnu/bin/sed -ne \
    's/.*/GIT_AUTHOR_NAME='\''&'\''/p'

EXPECTED: GIT_AUTHOR_NAME='Rémi Leblond'
ACTUAL: GIT_AUTHOR_NAME='R'émi Leblond

DISCUSSION: The problem starts in sed/lib/localcharset.c, in
locale_charset(), line 334:

# if HAVE_LANGINFO_CODESET

  /* Most systems support nl_langinfo (CODESET) nowadays.  */
  codeset = nl_langinfo (CODESET);

Since we set LC_ALL to C, we trigger this code in Libc:

http://www.opensource.apple.com/source/Libc/Libc-763.13/locale/nl_langinfo-fbsd.c,
line 54:

        case CODESET:
                ret = "";
                if ((s = querylocale(LC_CTYPE_MASK, loc)) != NULL) {
                        if ((cs = strchr(s, '.')) != NULL)
                                ret = cs + 1;
                        else if (strcmp(s, "C") == 0 ||
                                 strcmp(s, "POSIX") == 0)
                                ret = "US-ASCII";
                        else if (strcmp(s, "UTF-8") == 0)
                                ret = "UTF-8";
                }
                break;

As you can see, querylocale() will return "C", and
nl_langinfo(CODESET) will return "US-ASCII". The other thing to
realize is that on OS X MB_CUR_MAX is a macro for ___mb_cur_max(),
which returns 1 when LC_ALL is C.

Back to sed/lib/localcharset.c, we end up at locale_charset(), line 483:

  /* Resolve alias. */
  for (aliases = get_charset_aliases ();
       *aliases != '\0';
       aliases += strlen (aliases) + 1, aliases += strlen (aliases) + 1)
    if (strcmp (codeset, aliases) == 0
        || (aliases[0] == '*' && aliases[1] == '\0'))
      {
        codeset = aliases + strlen (aliases) + 1;
        break;
      }

This tries to alias our charset, "US-ASCII", to something sed
understands. get_charset_aliases() is at line 112 in the same file. On
OS X 10.7, DARWIN7 is defined (always for OS X 10.3 or newer), so we
end up at line 223:

      /* To avoid the trouble of installing a file that is shared by many
         GNU packages -- many packaging systems have problems with this --,
         simply inline the aliases here.  */
      cp = "ISO8859-1" "\0" "ISO-8859-1" "\0"
           "ISO8859-2" "\0" "ISO-8859-2" "\0"
           "ISO8859-4" "\0" "ISO-8859-4" "\0"
           "ISO8859-5" "\0" "ISO-8859-5" "\0"
           "ISO8859-7" "\0" "ISO-8859-7" "\0"
           "ISO8859-9" "\0" "ISO-8859-9" "\0"
           "ISO8859-13" "\0" "ISO-8859-13" "\0"
           "ISO8859-15" "\0" "ISO-8859-15" "\0"
           "KOI8-R" "\0" "KOI8-R" "\0"
           "KOI8-U" "\0" "KOI8-U" "\0"
           "CP866" "\0" "CP866" "\0"
           "CP949" "\0" "CP949" "\0"
           "CP1131" "\0" "CP1131" "\0"
           "CP1251" "\0" "CP1251" "\0"
           "eucCN" "\0" "GB2312" "\0"
           "GB2312" "\0" "GB2312" "\0"
           "eucJP" "\0" "EUC-JP" "\0"
           "eucKR" "\0" "EUC-KR" "\0"
           "Big5" "\0" "BIG5" "\0"
           "Big5HKSCS" "\0" "BIG5-HKSCS" "\0"
           "GBK" "\0" "GBK" "\0"
           "GB18030" "\0" "GB18030" "\0"
           "SJIS" "\0" "SHIFT_JIS" "\0"
           "ARMSCII-8" "\0" "ARMSCII-8" "\0"
           "PT154" "\0" "PT154" "\0"
         /*"ISCII-DEV" "\0" "?" "\0"*/
           "*" "\0" "UTF-8" "\0";

And here is the root problem. This table has no entry for US-ASCII, so
the lookup falls through to the default entry, "*", which maps
everything to "UTF-8". That is what locale_charset() ends up returning,
and sed then sets a UTF-8 flag that is used in many places.

But this is dangerous, because now the UTF-8 flag is set while
MB_CUR_MAX is 1, and various parts of sed interpret "Rémi Leblond" as
an invalid character sequence for a UTF-8 character set. This is why
/.*/ in the regular expression only matches the "R" before bailing on
the "é".

POSIX says that the "C" locale should treat text data as binary input,
but in this situation sed is trying to treat it as a multibyte
encoding.

FIX: the DARWIN7 table in get_charset_aliases() should not contain a
default that maps everything not defined to "UTF-8". Or at the very
least, it should include an entry for "US-ASCII" that maps to "ASCII",
as a charset.aliases file might.


