Re: Unexpected (?) segfault after unset LANG
From: Chet Ramey
Subject: Re: Unexpected (?) segfault after unset LANG
Date: Sun, 11 Feb 2018 17:24:33 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.5.2
On 2/9/18 11:53 AM, mike b wrote:
> Indeed, with build from devel it doesn't segfault anymore. Just out of pure
> curiosity, which commit introduced a fix for that? Aaand, there's one more
> thing that puzzles me a bit:
>
> # echo $BASH_VERSION
> 5.0.12(1)-alpha
> # echo ${LANG:-bleh}
> bleh
LANG has the lowest priority of the various locale environment variables.
Without LANG set, the locale is either determined by the LC_ variables,
which may or may not be set, or the system's default locale at program
start (probably "C").
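As a rough illustration of that precedence, here is a minimal sketch that
reports which variable would decide LC_CTYPE. It follows the POSIX order
(LC_ALL, then the category variable, then LANG) and treats an empty value
as unset. The function name ctype_source is made up for this example; it
only inspects the environment and does not call setlocale().

```shell
# Report which environment variable decides the LC_CTYPE category.
# Precedence: LC_ALL, then LC_CTYPE, then LANG, then the system default.
# An empty value is treated the same as an unset variable.
ctype_source() {
    if [ -n "${LC_ALL:-}" ]; then
        printf 'LC_ALL=%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf 'LC_CTYPE=%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf 'LANG=%s\n' "$LANG"
    else
        # Nothing is set: the implementation's default applies.
        printf 'default\n'
    fi
}

ctype_source
```

With LANG unset and no LC_ variables, this prints "default", which is the
situation behind the "bleh" output above.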
> # LANG=UTF-8
If you don't have a "UTF-8" locale on your system (Mac OS X happens to, but
Linux does not), this fails and leaves the locale unchanged from whatever
the default happens to be. Strictly speaking, "UTF-8" is not a valid locale
specification; it's an encoding (codeset).
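One way to check this on a given system is to look the name up in the
output of `locale -a`. The sketch below assumes the locale utility is
installed; has_locale is a hypothetical helper name. Note that `locale -a`
may spell the codeset part differently from the name you would type, so a
miss here is a hint rather than proof.

```shell
# Succeed only if the exact name appears in the output of `locale -a`.
# If the locale utility is missing, the lookup simply fails.
has_locale() {
    locale -a 2>/dev/null | grep -Fqx -- "$1"
}

if has_locale UTF-8; then
    echo "UTF-8: listed by locale -a"
else
    echo "UTF-8: not listed by locale -a"
fi
```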
> # printf '%s\n' $'\u013b'
Bash takes `013b', converts it to a numeric unicode value (315), and tries
to convert it to a character. Since your system probably defines
__STDC_ISO_10646__, as most Linux systems seem to, that value can be
directly used as a wchar_t and converted to a multibyte character using
wctomb().
> Ļ
wctomb() returns the multibyte character sequence you see here.
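For reference, in a UTF-8 locale that sequence is the UTF-8 encoding of
the code point. The following is a small sketch of the encoding
arithmetic in plain shell, not bash's actual code path (which goes through
wctomb()); utf8_bytes is a name invented for this example, and it does no
validation of surrogates or out-of-range values.

```shell
# Print the UTF-8 byte sequence (in hex) for a Unicode code point.
# The argument may be decimal or 0x-prefixed hex.
utf8_bytes() {
    cp=$(($1))
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        printf '%02X\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02X %02X\n' \
            $((0xC0 | ($cp >> 6))) \
            $((0x80 | ($cp & 0x3F)))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02X %02X %02X\n' \
            $((0xE0 | ($cp >> 12))) \
            $((0x80 | (($cp >> 6) & 0x3F))) \
            $((0x80 | ($cp & 0x3F)))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02X %02X %02X %02X\n' \
            $((0xF0 | ($cp >> 18))) \
            $((0x80 | (($cp >> 12) & 0x3F))) \
            $((0x80 | (($cp >> 6) & 0x3F))) \
            $((0x80 | ($cp & 0x3F)))
    fi
}

utf8_bytes 0x013b   # prints: C4 BB
```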
> # unset LANG
> # : # \o/ no segfaults
Assuming the absence of LC_ALL or any other LC_ variables, this explicitly
sets the locale to the system default (""). Bash does a little more work
than it probably needs to, and explicitly sets all the different parts of
the locale (LC_CTYPE, LC_MESSAGES, etc.) to "" ("C").
> # printf '%s\n' $'\u013b'
> \u013B
The same path through wctomb(), but this time wctomb() returns -1/EILSEQ
(illegal byte sequence), and bash attempts to convert it using iconv().
That fails, so bash chooses to handle the multiple errors by returning
a C99-style escape sequence. That's why the lower-case `b' gets
converted to `B'.
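The case change falls out of formatting the numeric value with an
upper-case hex conversion. A minimal sketch of that last step, assuming a
four-digit \u form for values up to 0xFFFF; c99_escape is a hypothetical
name, and the eight-digit \U branch follows C99's notation for universal
character names rather than anything stated here about bash.

```shell
# Format a code point as a C99-style universal character name.
# The original spelling of the hex digits is lost: the value is
# converted to a number first, then printed with %X (upper case).
c99_escape() {
    cp=$(($1))
    if [ "$cp" -le 65535 ]; then
        printf '\\u%04X\n' "$cp"
    else
        # Assumption: C99's eight-digit form for larger values.
        printf '\\U%08X\n' "$cp"
    fi
}

c99_escape 0x013b   # prints: \u013B
```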
> # LANG=UTF-8
Since this isn't a valid locale, nothing changes.
> # printf '%s\n' $'\u013b'
> \u013B # why it returns just the code?
wctomb() returns -1/EILSEQ.
> When LANG is set to UTF-8, printf returns actual character which coresponds
> to given code after first call, however, after LANG is toggled, printf
> keeps returning just the code. I guess my question here is: why that
> happens? I mean, I would expect it to decode it whenever LANG is set back
> to UTF-8 in this case. Am I missing something here?
The fact that setting LANG=UTF-8 is actually a no-op. I think the real
difference is between whatever the default value for LC_CTYPE is at program
startup, and bash setting it to "" (system default: "C" or "POSIX") when
LANG is unset.
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU chet@case.edu http://tiswww.cwru.edu/~chet/