
Re: GNU gettext 0.22 broke non-Unicode msgids


From: Robert Clausecker
Subject: Re: GNU gettext 0.22 broke non-Unicode msgids
Date: Tue, 28 Nov 2023 04:31:27 +0100

Hi Bruno,

Thank you for your response.

On Tue, Nov 28, 2023 at 03:34:51AM +0100, Bruno Haible wrote:
> > Thank you for your input.  It is unfortunate that you are hostile to people
> > who wish to have their translation strings in foreign languages that cannot
> > be represented in ASCII, but so be it.
> 
> Oh, I'm not hostile to such people :-) Simply, when this documentation was
> written, in 2001, it was a sensible restriction because
>   - It makes sense to go through English as the pivot language (rather than,
>     say, Latin or Esperanto), i.e. provide all msgids in English. Even if
>     the author of the program is German or Chinese. This is because English
>     is the language for which one can find the most translators.
>   - English can be mostly written with US-ASCII only.
>   - There is much less charset-conversion complexity in the runtime and in the
>     tools if we can assume that the msgid is US-ASCII only.
> 
> It would have been possible to design the gettext() function in such a way
> that it does charset conversion on the msgid when no translation is found.
> But it wasn't designed this way.

There are lots of reasons not to use English for msgids:

 - the developer base may not be mainly English-speaking and may have trouble
   coming up with grammatically correct English msgids
 - a program that was originally written with messages in another language is
   internationalised by insertion of gettext() in the appropriate places
 - English has very sparse grammar and is almost isolating.  It often happens
   that an English msgid is translated in a contextually incorrect manner, as
   the target language distinguishes some grammatical features that English
   does not.

> At some point in the future, we may assume a UTF-8-only world, where the
> msgid and the gettext() result are both in UTF-8 in all cases. Then it will be
> possible to use German or Chinese or whatever as language of the msgids. But
> this would still have the drawback of making it harder to find translators.

I don't mind.

> > We have carefully designed our
> > translations to ensure that a translated string is always present, so this
> > advisory does not apply to us.
> 
> How can you ensure this? The translation is looked up from a .mo file in a
> location that encodes the locale. And it is impossible to enumerate all
> locales, because
>   - new ones are being added every year (from Igbo to Aztec),
>   - the user is free to create their own locales, through POSIX 'localedef'.

We ensure this by installing message catalogues for all present languages.  If
new languages are created, we add them to the install.  Users who define custom
locales are on their own.  In that case, we may fall back to selecting a
different message catalogue just to get the encoding right.
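
For reference, the lookup in question works roughly as follows (a sketch; the
domain name is illustrative and the search order is simplified):

    #include <libintl.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
            setlocale(LC_ALL, "");  /* e.g. LANG=de_DE.ISO8859-1 */
            bindtextdomain("schilytools", "/usr/share/locale");
            textdomain("schilytools");

            /*
             * gettext() now looks for a catalogue at roughly
             * /usr/share/locale/<locale>/LC_MESSAGES/schilytools.mo,
             * trying de_DE.ISO8859-1, de_DE and de in turn.  If no
             * catalogue matches, the msgid is returned unchanged,
             * which is where the encoding problem arises.
             */
            puts(gettext("Hello"));
            return (0);
    }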

> > Fixing the display of this name is largely why a message catalogue is used
> > in the first place.
> 
> A message catalog is not enough for this purpose, precisely because of the
> charset encoding of the msgid.

It was good enough until your incompatible change broke it.

> > The other reason is to serve as a way to test if NLS support was
> > implemented correctly in the tools.  We plan to roll out localised messages
> > throughout all of the tools in the future.  The msgid strings will likely
> > be in ISO-8859-1 for the future, too, as that is the character set used
> > throughout the project.
> 
> In this case, to make the output work right in UTF-8 locales, you will need to
> create your own variant of the gettext() function that performs an iconv()
> conversion if gettext() has returned the original msgid untranslated.

We may consider such a change, thank you for the recommendation.
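
For the record, here is a minimal sketch of what such a wrapper could look
like, assuming ISO-8859-1 msgids; the name gettext_conv() is made up and the
error handling is simplified:

    #include <iconv.h>
    #include <langinfo.h>
    #include <libintl.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Like gettext(), but if no translation was found (i.e. gettext()
     * returned the msgid pointer unchanged), convert the ISO-8859-1
     * msgid to the codeset of the current locale using iconv().  The
     * fallback path returns malloc()ed storage; a real implementation
     * would want to cache or otherwise manage it.
     */
    static char *
    gettext_conv(const char *msgid)
    {
            char *msgstr = gettext(msgid);
            iconv_t cd;
            char *buf, *in, *out;
            size_t inleft, outleft;

            if (msgstr != (char *)msgid)
                    return (msgstr);        /* translation found */

            cd = iconv_open(nl_langinfo(CODESET), "ISO-8859-1");
            if (cd == (iconv_t)-1)
                    return (msgstr);

            inleft = strlen(msgid);
            outleft = 4 * inleft + 1;       /* generous for any codeset */
            if ((buf = malloc(outleft)) == NULL) {
                    iconv_close(cd);
                    return (msgstr);
            }
            in = (char *)msgid;
            out = buf;
            if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
                    free(buf);
                    iconv_close(cd);
                    return (msgstr);
            }
            *out = '\0';
            iconv_close(cd);
            return (buf);
    }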

> > > Another problem of the practice of using literal ISO-8859-1 strings in
> > > source code is that it does not work well for developers in a UTF-8
> > > locale. For example:
> > > $ grep 'g Schilling"' `find . -name '*.c'`
> > > /bin/grep: ./translit/translit.c: binary file matches
> > > /bin/grep: ./p/p.c: binary file matches
> > > ./cdrecord/cdrecord.c:        _("Joerg Schilling"));
> > > ./scgcheck/scgcheck.c:        _("Joerg Schilling"));
> > > ./scgcheck/scgcheck.c:        _("Joerg Schilling"));
> > > /bin/grep: ./sformat/fmt.c: binary file matches
> > > /bin/grep: ./mdigest/mdigest.c: binary file matches
> > > ...
> > > It requires an extra option to 'grep', namely '-a', in this situation.
> > 
> > Yes, this project assumes you work in the ISO-8859-1 locale.  We do not
> > plan to change that.  Likewise, projects with source code forms in UTF-8
> > locales are hard to work with for people in other locales.  We recognise
> > this limitation but would like to stay with our choice.
> 
> Your choice. But new co-developers will tell you that they have a hard time
> working in an ISO-8859-1 codebase, when their locale is UTF-8.

So far we have had no problem finding new co-developers.  Thank you for your
concern.

> > > The world migrated from ISO-8859-* locales to UTF-8 locales from ca.
> > > 2000 to 2013. For at least 10 years, more than 99% of the Linux/Unix
> > > users are either in the "C" locale or in a UTF-8 locale.
> > 
> > Note that this depends on country.  E.g. CJK users are not entirely happy
> > with Unicode as it butchers Han characters in various unpleasant ways.
> 
> Your information is outdated.
> 1. It was only the Japanese community which was upset about Han unification.
>    (Chinese and Korean people were happy with it.)
> 2. Their issues were addressed in Unicode, through the addition of variation
>    selectors (and probably also specialized fonts and/or tailoring in the
>    rendering engines). The Japanese complaints have since died down.

The problems persist, as typefaces often implement variation selectors
incorrectly and/or feel obliged to select incorrect variants by default so as
to preserve a distinction between default and compatibility encodings.

What seems to be happening is rather that CJK communities have largely given up
on the fight and accepted that CJK will continue to suck in the future.

> > We too have no plans to change from ISO-8859-1 as Unicode
> > is incompatible with old systems supported by Schilytools.
> 
> I made the change in gettext 0.22 because ISO-8859-1 is incompatible with
> modern systems such as musl libc.

So because musl refuses to support non-Unicode encodings (a reasonable design
choice for them), you have to break them for everybody else.  Makes sense, I
guess.

You could have avoided the issue by e.g. keeping msgids untranscoded.  I don't
see any case where transcoding the msgid is the right call, as these need to
match exactly what the source code contains.  And the source code is not
transcoded by anything.
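
To illustrate with a hypothetical msgid: once the catalogue key has been
transcoded, the byte-wise comparison done during lookup can never succeed:

    source passes:   "J\xf6rg"       ISO-8859-1 bytes: 4a f6 72 67
    catalogue key:   "J\xc3\xb6rg"   transcoded UTF-8: 4a c3 b6 72 67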

> > Thank you for notifying me of this upcoming breaking change.  It should be
> > communicated more clearly, as cross-encoding use of gettext was one of its
> > major selling points over previous solutions, which required separate
> > message catalogues for each encoding.  We will evaluate what our options
> > are.  We would like to not migrate to Unicode.
> 
> "We would like to not migrate to Unicode."
> 
> You are aware that the mail you sent me was labelled as
>   "Content-Type: text/plain; charset=utf-8" ?
> 
> You are aware that when you copy&paste text in an X11 GUI, it happens through
> a property named UTF8_STRING, with UTF-8 encoded contents? And that this
> happens flawlessly? And before this UTF8_STRING existed, another property was
> used, with CompoundText encoded contents. And with this property, copy&paste
> sometimes did not work as expected, because CompoundText in various places
> meant different things and/or there were differences between the behaviours
> of different charset converters.

Of course I am aware.  However, schilytools supports very old systems and we'd
like to continue supporting them.  For example, the tools (except for one C++
tool) all build on Ultrix 4.7.  This is an operating system with no conception
of Unicode.  Other projects as well as my correspondence with you are not bound
by such demanding requirements, so there's no problem using Unicode there.
Likewise, you'll find that I am not opposed to Unicode in any way.  I have even
previously contributed high-performance Unicode transcoding functions to
simdutf that are now used by mainstream software.  However, I am of the strong
opinion that the option to use different character sets and encodings must be
preserved indefinitely.  Moving to Unicode-only is a terrible idea.

> The fewer charset conversions need to be done, the more reliable the programs
> become, and the more maintainable the code can become.

I agree.  This is why wide strings exist.
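
That is, decode once at the program's boundary and process text uniformly
afterwards, whatever the locale's encoding happens to be.  A minimal sketch:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    int
    main(void)
    {
            char buf[256];
            wchar_t wbuf[256];
            size_t n;

            /* works the same in ISO-8859-1 and UTF-8 locales */
            setlocale(LC_ALL, "");
            if (fgets(buf, sizeof (buf), stdin) == NULL)
                    return (1);

            /* one conversion at the boundary ... */
            n = mbstowcs(wbuf, buf, sizeof (wbuf) / sizeof (wbuf[0]));
            if (n == (size_t)-1)
                    return (1);

            /* ... then all processing counts characters, not bytes */
            printf("%zu characters read\n", n);
            return (0);
    }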

> > > > ()  ascii ribbon campaign - for an 8-bit clean world 
> > > 
> > > This is hopelessly outdated as well. The goal of an 8-bit clean world was
> > > relevant from 1987 to 2000. Since 2000, the goal is to support
> > > multilingual text through Unicode (for which the 8-bit clean world is a
> > > prerequisite).
> > 
> > Given that you seem to be taking a step back from an encoding-agnostic world
> > to one that is Unicode-only, maybe it's indeed time to update it.
> 
> Yes, i18n through the "let's be encoding agnostic" approach was predominant
> until ca. 2000 or 2001, because people feared that Unicode would not last and
> would be replaced by something else within a few years.
> But then, people (me included :-)) started popularizing the "i18n through
> Unicode" approach, and it is successful because
>   - it is simpler in the code — no charset conversion in many places,
>   - the Unicode consortium does a good job at responding to complaints and
>     new feature requests (from Han unification mitigation, to Emojis).

Unicode is also a complicated mess that is hard to implement correctly.
Even a task as simple as "find the next character boundary" requires complex
tables and a good understanding of the standard.  And without the
multi-megabyte ICU package (or a suitable replacement), you are basically lost
trying to do any non-trivial operations on Unicode strings.  The attack
surface is huge, too, with multiple high-profile Unicode-rendering crashes in
recent years.  Other encodings will keep being relevant for as long as people
write in non-ASCII scripts and need to represent them on systems that cannot
or need not deal with the full complexity of Unicode.
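
To make that concrete: stepping over one code point in UTF-8 needs no tables,
but code points are not user-perceived characters, and grouping them into
grapheme clusters (UAX #29) is where the property tables come in.  A sketch:

    /*
     * Advance past one UTF-8 code point.  This much is table-free:
     * continuation bytes have the form 10xxxxxx.  Grouping code points
     * into user-perceived characters (grapheme clusters) is the hard
     * part and needs the Unicode property tables mentioned above.
     */
    static const char *
    next_codepoint(const char *s)
    {
            if (*s != '\0') {
                    s++;
                    while ((*s & 0xc0) == 0x80)     /* skip continuation bytes */
                            s++;
            }
            return (s);
    }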

> Bruno

Yours,
Robert Clausecker

-- 
()  ascii ribbon campaign - for an 8-bit clean world 
/\  - against html email  - against proprietary attachments


