Re: GB18030
From: Bruno Haible
Subject: Re: GB18030
Date: Wed, 17 May 2023 15:35:10 +0200
Paul Eggert wrote:
> I'd remove "or GB18030"; it's not that popular due to the fact that it's
> not ASCII-safe like UTF-8 is, its popularity is limited primarily to one
> country (admittedly a large one, but still), and even in China it's much
> less popular than UTF-8, judging at least from what's published on the
> Web. w3techs says that less than 0.003% of the world's websites use
> GB18030, which is less than even the 0.11% of websites that use its
> predecessor GB2312 (and is waaaay less than the 97.9% of websites that
> use UTF-8). Although there's no way to peer into Unix installations in
> China, I'd be surprised if GB18030 is all that popular on the world
> stage, even counting its use within China.
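For readers unfamiliar with the "ASCII-safe" point in the quote above, here is a minimal sketch in Python. It assumes that "ASCII-safe" means that bytes in the range 0x00..0x7F never occur inside the encoding of a non-ASCII character. The helper names ascii_range_bytes and is_ascii_safe_for are made up for this illustration; only str.encode and the standard "utf-8" and "gb18030" codecs are real Python facilities.

```python
def ascii_range_bytes(ch: str, encoding: str) -> bytes:
    """Return the bytes below 0x80 that appear in the encoding of `ch`."""
    return bytes(b for b in ch.encode(encoding) if b < 0x80)


def is_ascii_safe_for(text: str, encoding: str) -> bool:
    """True if no non-ASCII character of `text` encodes to an ASCII-range byte.

    This only checks the characters actually present in `text`; it is not
    a proof about the encoding as a whole.
    """
    return all(
        not ascii_range_bytes(ch, encoding)
        for ch in text
        if ord(ch) >= 0x80
    )


if __name__ == "__main__":
    # U+0080 lies outside the two-byte GBK repertoire, so GB18030 has to
    # use a four-byte sequence for it; the other characters are CJK samples.
    sample = "\u0080\u4e02\u4e2d\u6587"
    for enc in ("utf-8", "gb18030"):
        print(enc, is_ascii_safe_for(sample, enc))
```

The practical consequence is that code which scans a byte string for '/', '\\' or a digit can misfire inside a multibyte character in a GB18030 locale, whereas in UTF-8 every byte of a non-ASCII character has its high bit set.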
I don't think the percentage of GB18030 encoded texts on the web and the
percentage of computers using GB18030 as a locale encoding are necessarily
related. Publishing a text on the web is typically done through a program
that has an "Export as HTML" action, and you can assume that this action
will most often convert to UTF-8.
I was under the impression that GB18030 was used in China, because
- Wikipedia [1] says "Since 1 May 2006, support for the mandatory subset
is officially required for all software products sold in the PRC."
- I heard some years ago that "If you want to sell software to a
governmental institution in China, it must support GB18030".
- Some software companies have/had integrated this requirement in their
QA processes.
- There was a new revision of GB18030 in 2022 [1].
But now I did a survey of which locale gets used when a user installs an
enterprise Linux distro with "Simplified Chinese" as the installation
language, or installs a Chinese Linux distro outright. The result is:
* All of these distros put the user in a UTF-8 locale.
* None of them has even an option(!) to put the user in a GB18030 locale.
In detail:
* For enterprise Linux distros, I chose
+ Alma Linux 9.0 (RHEL 9.0 clone).
- If, at installation time, I pick "Simplified Chinese (China)",
after the installation, the environment variables are:
LANG=zh_CN.UTF-8
- If, at installation time, I pick "Traditional Chinese (Taiwan)",
after the installation, the environment variables are:
LANG=zh_TW.UTF-8
+ openSUSE 15.4 (which should be close to SLES 15.4).
After installation, one can change the "primary language" [2]
at YaST > System > Language.
- Choosing "Chinese Simplified" offers a checkbox "Use UTF-8 Encoding"
that is on by default. When I turn it off, after the installation, the
environment variables are:
LANG=en_US.UTF-8, LC_CTYPE=zh_CN.
The effective encoding is EUC-CN (or GBK?), not GB18030.
- Choosing "Chinese Traditional" offers a checkbox "Use UTF-8 Encoding"
that is on by default. When I turn it off, after the installation, the
environment variables are:
LANG=en_US.UTF-8, LC_CTYPE=zh_TW
The effective encoding is CP950, not Big5 or Big5-2003.
In both cases this combination of LANG and LC_CTYPE is not supported by
glibc (because of the LC_COLLATE and other locale categories).
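To make the mixed LANG / LC_CTYPE settings above easier to read, here is a small Python sketch of the usual POSIX lookup order for a locale category: LC_ALL first, then the category's own variable, then LANG, with empty values treated as unset. The function names effective_locale and explicit_codeset are hypothetical helpers for this illustration, not part of any library. Note that when a locale name carries no ".codeset" suffix, as with LC_CTYPE=zh_CN, the encoding is whatever the system defines for that locale, which this sketch cannot determine.

```python
from typing import Mapping, Optional


def effective_locale(env: Mapping[str, str], category: str = "LC_CTYPE") -> str:
    """Locale name in effect for `category`, following LC_ALL > LC_xxx > LANG."""
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:  # unset and empty are treated alike
            return value
    return "C"  # fallback assumed here when nothing is set


def explicit_codeset(locale_name: str) -> Optional[str]:
    """Codeset spelled out in language[_territory][.codeset][@modifier], if any."""
    without_modifier = locale_name.split("@", 1)[0]
    _, dot, codeset = without_modifier.partition(".")
    return codeset if dot else None


if __name__ == "__main__":
    # The openSUSE result with "Use UTF-8 Encoding" turned off.
    env = {"LANG": "en_US.UTF-8", "LC_CTYPE": "zh_CN"}
    for cat in ("LC_CTYPE", "LC_COLLATE"):
        name = effective_locale(env, cat)
        print(cat, name, explicit_codeset(name))
```

With these settings, LC_CTYPE resolves to zh_CN while LC_COLLATE and the other categories fall back to en_US.UTF-8, so the categories disagree about the encoding.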
* On distrowatch.com, I found two Chinese Linux distros.
+ deepin
Here, after installation, the environment variables are:
LANG=zh_CN.UTF-8, LANGUAGE=zh_CN
+ Ubuntu Kylin
Here, after installation, the environment variables are:
LANG=zh_CN.UTF-8, LANGUAGE=zh_CN:
* Then, there is also IBM AIX. What Chinese locales does it have for zh_CN?
On AIX 7.2, a GB18030 locale is officially supported [3]. But running
"locale -a"
on an AIX 7.2 machine reveals that it has
zh_Hans_CN.UTF-8
zh_Hans_SG.UTF-8
zh_Hant_HK.UTF-8
zh_Hant_TW.UTF-8
but no locale with GB18030 encoding.
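The same check can be repeated on other systems. The following Python sketch filters a list of locale names, such as the output of "locale -a", for a given codeset. The function names are hypothetical, and the comparison ignores case and punctuation on the assumption that systems spell codesets differently (for example "UTF-8" versus "utf8").

```python
import re
import subprocess
from typing import Iterable, List


def _normalize(codeset: str) -> str:
    """Lowercase and drop punctuation, so 'UTF-8' and 'utf8' compare equal."""
    return re.sub(r"[^a-z0-9]", "", codeset.lower())


def locales_with_codeset(locale_names: Iterable[str], codeset: str) -> List[str]:
    """Keep only the locale names whose explicit codeset matches `codeset`."""
    wanted = _normalize(codeset)
    matches = []
    for name in locale_names:
        without_modifier = name.split("@", 1)[0]
        _, dot, found = without_modifier.partition(".")
        if dot and _normalize(found) == wanted:
            matches.append(name)
    return matches


def installed_locales() -> List[str]:
    """Run `locale -a` and return its output lines (requires that command)."""
    result = subprocess.run(
        ["locale", "-a"], capture_output=True, text=True, errors="replace", check=True
    )
    return result.stdout.split()


if __name__ == "__main__":
    # The four locales reported on the AIX 7.2 machine above.
    aix = ["zh_Hans_CN.UTF-8", "zh_Hans_SG.UTF-8", "zh_Hant_HK.UTF-8", "zh_Hant_TW.UTF-8"]
    print(locales_with_codeset(aix, "GB18030"))
```

On a live system, locales_with_codeset(installed_locales(), "GB18030") performs the same test against whatever is actually installed.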
So, it seems that GB18030 is no longer important, even on the Chinese market.
> Besides, we're better off not taking a stand in the GB18030 vs Big5 vs
> other-national-encoding controversies.
I don't see any controversies there. GB18030 was intended for users who
choose "Simplified Chinese", and Big5 was intended for users who choose
"Traditional Chinese". Big5 and its standardized variant Big5-2003 were
apparently obsoleted by Microsoft's CP950. And now, as a locale encoding,
UTF-8 is predominantly used for both groups of users.
There may be a controversy regarding whether "Simplified Chinese" or
"Traditional Chinese" is the proper choice in territories like Hong Kong
or Macao. But I have no concrete data on this, and it is irrelevant here.
Bruno
[1] https://en.wikipedia.org/wiki/GB_18030
[2] https://doc.opensuse.org/documentation/leap/startup/html/book-startup/cha-yast-lang.html
[3] https://www.ibm.com/docs/en/aix/7.2?topic=globalization-supported-languages-locales