bug-gnulib

Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8


From: Max Horn
Subject: Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
Date: Mon, 18 Jun 2012 16:56:27 +0200

Hi again,

It would be really, really nice to get this issue resolved, one way or
another :-). As mentioned, in the current state of things, GNU sed (via
gnulib) does not work correctly on Mac OS X when e.g. LANG=C is set, which
leads to real-world errors for users, e.g. when running git while GNU sed is
installed.

On the other hand, I have so far seen no reply to my attempts to refute the
counterarguments against these patches. So, should I just submit git patches
for one (or both) of them for inclusion? Or does anybody still have
reservations about this?


Cheers,
Max

On 11.06.2012 at 00:31, Max Horn wrote:

> Hi again,
> 
> 
> On 07.06.2012 at 14:07, Bruno Haible wrote:
> 
> [...]
> 
>> 
>>> But this is dangerous, because now UTF-8 is set but MB_CUR_MAX is 1
>>> and various parts of sed interpret "Rémi Leblond" as an invalid
>>> character sequence for a UTF-8 character set.
>> 
>> Indeed, I can see how this inconsistency leads to bugs like the described
>> ones.
>> 
>> The fix could be to have two different locale_charset() functions,
>> one that returns "US-ASCII" and another one that returns "UTF-8".
>> The first one to be used when MB_CUR_MAX and mbrtowc() are used as
>> well, the second one to be used by gettext(). But the separation
>> line between the two cases is not yet clear to me. Any insights?
> 
> Hum, that sounds quite complicated -- could you explain what this would gain 
> over the idea of simply mapping "US-ASCII" to "ASCII", or over the patch Paul 
> suggested:
> 
>> --- a/lib/localcharset.c
>> +++ b/lib/localcharset.c
>> @@ -542,5 +542,12 @@ locale_charset (void)
>>  if (codeset[0] == '\0')
>>    codeset = "ASCII";
>> 
>> +#ifdef DARWIN7
>> +  /* MacOS X sets MB_CUR_MAX to 1 when LC_ALL=C, and "UTF-8"
>> +     (the default codeset) does not work when MB_CUR_MAX is 1.  */
>> +  if (strcmp (codeset, "UTF-8") == 0 && MB_CUR_MAX <= 1)
>> +    codeset = "ASCII";
>> +#endif
>> +
>>  return codeset;
>> }
> 
> 
> Cheers,
> Max
> 
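For illustration, the check in the quoted patch can be isolated into a small standalone function. This is only a sketch, not gnulib code: the names effective_codeset() and current_effective_codeset() are hypothetical, and nl_langinfo (CODESET) stands in for whatever locale_charset() does internally on a given platform. It merely shows the rule under discussion: a reported "UTF-8" is treated as "ASCII" whenever MB_CUR_MAX is 1.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of gnulib): apply the rule from the
   quoted patch to a codeset name and an MB_CUR_MAX value.  */
static const char *
effective_codeset (const char *codeset, size_t mb_cur_max)
{
  assert (codeset != NULL);

  /* Same fallback as the existing code: empty name means ASCII.  */
  if (codeset[0] == '\0')
    return "ASCII";

  /* "UTF-8" is inconsistent with a single-byte locale, so fall
     back to ASCII in that case.  */
  if (strcmp (codeset, "UTF-8") == 0 && mb_cur_max <= 1)
    return "ASCII";

  return codeset;
}

/* Hypothetical convenience wrapper: query the current locale via
   POSIX nl_langinfo() and apply the rule above.  The caller is
   expected to have called setlocale() beforehand.  */
static const char *
current_effective_codeset (void)
{
  return effective_codeset (nl_langinfo (CODESET), (size_t) MB_CUR_MAX);
}
```

Calling setlocale (LC_ALL, "") and then current_effective_codeset() with LANG=C set would show which name this rule yields on a given system. What nl_langinfo() itself reports there is platform-dependent and not asserted here.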



