Re: supporting obscure languages
From: Albert Cahalan
Subject: Re: supporting obscure languages
Date: Sat, 28 Nov 2009 08:02:40 -0500

On Sat, Nov 28, 2009 at 5:34 AM, Bruno Haible <address@hidden> wrote:
> Albert Cahalan wrote:
>> I don't need month name, time display rules, telephone formats...
>>
>> All I care about: LC_MESSAGES for "zam", LC_CTYPE not lobotomized
>
> Then your workaround of doing
> LANGUAGE=zam LC_ALL=fr_FR.UTF-8
> is just fine.

Don't you think that is terribly gross? (French with
different words!)

Don't you think it's doubly gross to have a program
calling setenv() to control a library via environment
variables intended for users instead of a proper API?

>> BTW, we'd like fallback to similar translations in case something
>> is missing. When zh_TW.mo lacks something, zh_CN.mo should be the
>> next place to look.
>
> That's a built-in feature in GNU gettext: just set the LANGUAGE variable to
> zh_TW:zh_CN
> and you're done.

I guess we'll probably do that. Still, setenv as an API
is really disturbing. I greatly prefer to treat the environment
as read-only.

The library doesn't even get immediate notice that there
has been a change unless you have evil hooks into the
setenv and getenv functions. You'd have to either do a
slow getenv each time, or cache the value and hope the
program doesn't try to change things later.

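
The caching trade-off described above can be sketched as follows. This is an illustration only, not how GNU gettext is implemented: lib_language_cached(), lib_language_fresh(), lib_cache_is_stale() and app_set_language() are hypothetical names invented for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library state: LANGUAGE as it looked on first use. */
static char cached_language[128];
static int cache_valid = 0;

/* Fast path: reads the environment once, then never looks again. */
const char *lib_language_cached(void)
{
    if (!cache_valid) {
        const char *v = getenv("LANGUAGE");
        snprintf(cached_language, sizeof cached_language, "%s", v ? v : "");
        cache_valid = 1;
    }
    return cached_language;
}

/* Slow path: asks the environment on every call. */
const char *lib_language_fresh(void)
{
    const char *v = getenv("LANGUAGE");
    return v ? v : "";
}

/* Nonzero when the cached value no longer matches the environment. */
int lib_cache_is_stale(void)
{
    return strcmp(lib_language_cached(), lib_language_fresh()) != 0;
}

/* The "API" under discussion: the program edits its own environment. */
void app_set_language(const char *chain)
{
    assert(chain != NULL);
    int rc = setenv("LANGUAGE", chain, 1);
    assert(rc == 0);
    (void)rc;  /* silences the unused warning when NDEBUG is defined */
}
```

If the program calls app_set_language("zh_TW:zh_CN"), lets the library read it once, and later calls app_set_language("zam"), the cached path still reports the old chain. Nothing tells the library that the environment changed.
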
>> setlocale(LC_ALL, loc); // loc="" or loc="zam"
>> ctype_utf8(); // setlocale(LC_CTYPE,x) for many x until iswprint works
>
> Yes, you have no guarantee that a particular locale is installed on the user's
> system. You have to try some. setlocale(LC_ALL, "") is a good first guess.
That guess is just "C" on my system.
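
A minimal sketch of the "try some" loop behind the ctype_utf8() placeholder quoted above. The function body and the candidate names passed to it are assumptions made for illustration; none of those locales is guaranteed to be installed on a given system.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wctype.h>

/*
 * Try each name as LC_CTYPE until one yields a UTF-8 codeset in which
 * U+00F7 (DIVISION SIGN) is printable. Returns the name that worked,
 * or NULL after resetting LC_CTYPE to "C" if none did.
 */
const char *ctype_utf8(const char *const *names, size_t count)
{
    assert(names != NULL);
    for (size_t i = 0; i < count; i++) {
        if (setlocale(LC_CTYPE, names[i]) == NULL)
            continue;  /* locale not installed or not recognized */
        if (strcmp(nl_langinfo(CODESET), "UTF-8") == 0 &&
            iswprint((wint_t)0xf7))
            return names[i];
    }
    setlocale(LC_CTYPE, "C");
    return NULL;
}
```

A caller might pass { "", "C.UTF-8", "en_US.UTF-8", "fr_FR.UTF-8" }. The list is a guess about what might be installed, and the function can still come back with NULL.
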
>> My current hack: LANGUAGE=zam LC_ALL=fr_FR.UTF-8
>>
>> Yep, I'm telling gettext that this is French. That's disgusting.
>
> No, you are telling the system to use an UTF-8 encoding for strings,
> French rules for time, sorting, numbers etc, and Zapotec for messages.
> If it fits well with your program, all fine.

Eh, the Zapotec dialect of French. It does work, as long as
the user happens to have fr_FR.UTF-8 installed.

That's trouble. I'm depending on some random unrelated locale
just to get normal UTF-8 behavior.

>> There are quite a few design bugs here, none of which would cause
>> huge problems all by itself. Together, they are a disaster.
>>
>> a. The implementation-specific "" locale is "C". (it need not be)
>
> No, when you call setlocale(LC_ALL,"") it uses the locale that the
> user has set, not "C".

I mean the case where the user has set nothing at all. The "" doesn't
get filled in by any environment variable. You make it all the
way to the lowest-priority environment variable ("LANG") and
still have "". At that point, the implementation-defined locale
is chosen... and it is "C".

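
The lookup order behind setlocale(category, "") can be modeled as a pure function. This sketch follows the POSIX precedence (LC_ALL, then the category's own variable, then LANG) and treats empty values as unset. effective_locale() and falls_back_to_c() are hypothetical helpers, and the implementation default is passed in as a parameter because the standard leaves it to the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An environment variable counts only if it exists and is non-empty. */
static int is_set(const char *v)
{
    return v != NULL && v[0] != '\0';
}

/*
 * Name that setlocale(category, "") would resolve to, given the values
 * of LC_ALL, the category variable (e.g. LC_MESSAGES) and LANG.
 */
const char *effective_locale(const char *lc_all, const char *lc_category,
                             const char *lang, const char *impl_default)
{
    assert(impl_default != NULL);
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_category))
        return lc_category;
    if (is_set(lang))
        return lang;
    return impl_default;  /* nothing set: the implementation decides */
}

/* Nonzero when the lookup lands on "C" with a "C" default. */
int falls_back_to_c(const char *lc_all, const char *lc_category,
                    const char *lang)
{
    return strcmp(effective_locale(lc_all, lc_category, lang, "C"), "C") == 0;
}
```

In a real program the first three arguments would come from getenv("LC_ALL"), getenv("LC_MESSAGES") and getenv("LANG"). With all three unset and a default of "C", the result is "C", which is the case described here.
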
>> b. The "C" locale is not UTF-8. (this need not be the case)
>
> The "C" locale was defined at a time when there was no UTF-8. This
> choice accommodates for output devices that cannot display arbitrary
> Unicode characters (think of ssh into an older Unix system).

I can sort of understand this. I own a real VT510 terminal.

It's not a working protection though. Linux distributions often
set a UTF-8 locale, then fail to translate or otherwise protect
logins on the serial tty devices. This happens to be why procps
replaces UTF-8 characters containing the 0x9b byte. (but of
course that is potentially hostile data, not translations, and
Red Hat patches out the protection anyway)

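
For readers wondering what such a filter looks like: on an 8-bit terminal the byte 0x9b is the one-byte CSI control, equivalent to ESC [, and it can also occur as a continuation byte inside a UTF-8 sequence. The sketch below is not the procps code. mask_csi_chars() is a hypothetical function, and it assumes the input is mostly well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequence length implied by a UTF-8 lead byte; 1 for anything else. */
static size_t utf8_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;  /* stray continuation byte or invalid lead byte */
}

/*
 * Copy in to out, replacing every character whose encoding contains
 * the byte 0x9b with a single '?'. The output is never longer than
 * the input, so out needs strlen(in) + 1 bytes.
 */
void mask_csi_chars(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)in;
    while (*p) {
        size_t want = utf8_len(*p);
        size_t have = 0;
        int bad = 0;
        /* Walk this character's bytes, stopping early at the NUL. */
        while (have < want && p[have] != '\0') {
            if (p[have] == 0x9b)
                bad = 1;
            have++;
        }
        if (bad) {
            *out++ = '?';
        } else {
            memcpy(out, p, have);
            out += have;
        }
        p += have;
    }
    *out = '\0';
}
```

U+00DB, for example, is encoded as 0xC3 0x9B and would be replaced, while U+00F7 (0xC3 0xB7) passes through untouched.
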
Having "C" not be i18n-friendly (that is, not serving up UTF-8
messages and full Unicode in wchar_t) wouldn't be a big deal,
except that the locale so easily ends up being "C"
(when unspecified, when a locale is broken or unknown, etc.).

>> c. The "C" locale makes iswprint((wchar_t)0xf7) be false. (very bad)
>
> I agree with you that wide characters are a mess in ISO C, because the
> meaning of (wchar_t)0xf7 depends on locales: in some locale it may be
> a DIVISION SIGN, in another one a CYRILLIC SMALL LETTER YI, in another
> one a LATIN SMALL LETTER S WITH ACUTE, and in another one it's invalid.
Locales with non-Unicode wchar_t are far worse than locales
with non-UTF-8 char. Lots of software breaks, and nobody will
fix it. There comes a time to deprecate dysfunctional locales.
>> d. The "C" locale ignores LC_MESSAGES, even if not "C".
>
> What do you expect the system to do when you set LC_ALL to "C" and
> then LC_MESSAGES to "zh_CN"? All characters are US-ASCII but messages
> should be in Chinese? In earlier versions of glibc, the Chinese strings
> were converted to "?????? ??? ?????? 32 ?????" before being displayed.
> This was not really helpful; so now the translations are ignored
> entirely in this case.
Just be binary-clean. Remember why UTF-8 was invented.
If glibc were binary clean, messages would normally just work.
They would certainly work for typical GUI stuff using Pango,
and would even work in many terminal situations.
>> e. The locale reverts to "C" if some portion is missing/unknown.
>
> What's wrong with having a fallback if some portion is missing?

Nothing. The problem is how this interacts with the other stuff.
If the fallback were something like "C.UTF-8", or the "C" locale
weren't so severely limited, there would be no problem.

It's only the combination of all these design issues that results in
a problem. Individually, no one design issue is really a problem.

>> The result is that none of these work:
>>
>> a. setlocale(LC_ALL,"zam");
>> b. setlocale(LC_MESSAGES,"zam");
>> c. setlocale(LC_MESSAGES,"zam"); setlocale(LC_CTYPE,"UTF-8");
>
> None of these work because you don't have a "zam" locale installed in
> the first place. setlocale is about designating locales to use.

I have a piece of a locale installed. (my "zam.mo" file)
To use that, I mainly just need a binary-clean library.
Getting iswprint() and towupper() would be nice too, but
it's not a huge problem for me to write my own.

Basically: use what is there, and assume something close
to "C.UTF-8" for anything missing/broken. Maybe you could
find choices that are more generic than "C", like 24-hour time
and PA4 paper size. Maybe round-trip the case for U+1E9E,
avoiding expansion troubles. You could call it "default.UTF-8".

The details aren't terribly critical; the main thing is to let a
random loose UTF-8 *.mo file work without hacks or fuss,
along with the wchar_t functions working beyond ASCII.

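
As a rough idea of what "write my own" towupper() with a round-tripping U+1E9E could look like, here is a deliberately tiny sketch. It assumes wchar_t holds Unicode code points and covers only ASCII, the Latin-1 letters, and the U+00DF/U+1E9E pair. The pairing of U+00DF with U+1E9E follows the suggestion above, not the default Unicode uppercase mapping. All three function names are hypothetical.

```c
#include <assert.h>
#include <wchar.h>

/* Locale-independent uppercase for a small, fixed set of code points. */
wchar_t simple_towupper(wchar_t c)
{
    if (c >= 0x61 && c <= 0x7A) return c - 0x20;               /* a-z */
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7) return c - 0x20;  /* skip U+00F7 */
    if (c == 0xFF) return 0x178;   /* y with diaeresis */
    if (c == 0xDF) return 0x1E9E;  /* sharp s: one character, no "SS" */
    return c;
}

/* Inverse of simple_towupper() over the same set. */
wchar_t simple_towlower(wchar_t c)
{
    if (c >= 0x41 && c <= 0x5A) return c + 0x20;               /* A-Z */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 0x20;  /* skip U+00D7 */
    if (c == 0x178) return 0xFF;
    if (c == 0x1E9E) return 0xDF;
    return c;
}

/* Debug helper: aborts if lower -> upper -> lower changes the character. */
void check_round_trip(wchar_t lower)
{
    assert(simple_towlower(simple_towupper(lower)) == lower);
}
```

Because every mapping is one character to one character, a string never changes length when its case changes, which is the "expansion trouble" being avoided. Anything outside the covered ranges is returned unchanged.
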
>> There just doesn't seem to be any reasonable way to kick gettext into
>> UTF-8 mode and feed it a *.mo file.
>
> You found the way and showed it to us.
Trying random unrelated locales and calling putenv() is
pretty far from reasonable IMHO.