bug-gnu-libiconv
Re: [bug-gnu-libiconv] gb18030 for 0x215d7


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] gb18030 for 0x215d7
Date: Sun, 28 May 2023 00:41:23 +0200

> > A bijective 1-1 conversion table does not provide the best user
> > experience in this situation.

> I don't insist on 1-1 conversion any more since the one in PUA should
> retire some day.

Good. Now we are in the same boat.

> Figured out a little history:
> GB18030/2000 (up to Ext-A): 0xFE6C -> U+E831 (PUA)
> Character adopted by Unicode U+215D7 (Ext-B)
> GB18030/2005 adopted Ext-B: 0x9536B937 -> U+215D7

Yep. Based on the second printing of GB18030-2000 (the first printing had
multiple mistakes), I had noted:

 *   p. 81        0xFE51..0xFE53      U+E816..U+E818
 *   p. 81        0xFE59              U+E81E
 *   p. 81        0xFE61              U+E826
 *   p. 81        0xFE66..0xFE67      U+E82B..U+E82C
 *   p. 81        0xFE6C..0xFE6D      U+E831..U+E832
 *   p. 81        0xFE76              U+E83B
 *   p. 81        0xFE7E              U+E843
 *   p. 81        0xFE90..0xFE91      U+E854..U+E855
 *   p. 81        0xFEA0              U+E864

So, it mapped 0xFE6C -> U+E831.

U+215D7 was added in Unicode 3.1 (2001).
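
For reference, the four-byte code 0x9536B937 is not an ad-hoc assignment: GB18030 maps the supplementary planes U+10000..U+10FFFF linearly onto four-byte codes starting at 0x90308130. A minimal Python sketch of that arithmetic follows; the function name gb18030_four_byte is made up for illustration and is not part of libiconv.

```python
def gb18030_four_byte(code_point: int) -> bytes:
    """Return the four-byte GB18030 code of a supplementary-plane code point.

    Illustrative helper only (hypothetical name, not a libiconv function).
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are mapped linearly")
    index = code_point - 0x10000
    # Mixed-radix digits, least significant first:
    #   byte 4 has 10 values (0x30..0x39),
    #   byte 3 has 126 values (0x81..0xFE),
    #   byte 2 has 10 values (0x30..0x39),
    #   byte 1 takes the rest, starting at 0x90.
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


print(gb18030_four_byte(0x215D7).hex().upper())  # 9536B937
```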

The mapping table for GB18030-2005 in GNU libiconv is designed to ease the
transition from the PUA code point U+E831 to the code point U+215D7:

0xFE6C     -> U+215D7
0x9536B937 -> U+215D7

In the other direction, libiconv currently does this:
U+215D7 -> 0xFE6C
U+E831  -> 0xFE6C

Or would it be better to do this?
U+215D7 -> 0x9536B937
U+E831  -> 0x9536B937
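
To make the two options concrete, here is a small Python sketch that models only the four table entries discussed here as dictionaries and runs each code point through encode-then-decode. The table and function names are hypothetical; the real tables in libiconv are generated C code and look nothing like this.

```python
# Decoder entries: both byte sequences decode to U+215D7.
DECODE = {
    bytes.fromhex("FE6C"): 0x215D7,
    bytes.fromhex("9536B937"): 0x215D7,
}

# Encoder entries, current behaviour: both code points go to the two-byte code.
ENCODE_CURRENT = {
    0x215D7: bytes.fromhex("FE6C"),
    0xE831: bytes.fromhex("FE6C"),
}

# Encoder entries, alternative: both code points go to the four-byte code.
ENCODE_ALTERNATIVE = {
    0x215D7: bytes.fromhex("9536B937"),
    0xE831: bytes.fromhex("9536B937"),
}


def round_trip(code_point, encode_table):
    """Encode a code point, then decode the resulting bytes again."""
    return DECODE[encode_table[code_point]]


for name, table in (("current", ENCODE_CURRENT),
                    ("alternative", ENCODE_ALTERNATIVE)):
    for cp in (0x215D7, 0xE831):
        encoded = table[cp].hex().upper()
        print(f"{name}: U+{cp:04X} -> 0x{encoded} -> U+{round_trip(cp, table):04X}")
```

In both variants U+E831 comes back as U+215D7, so neither is a 1-1 table. The two options differ only in which byte sequence ends up in the output.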

For comparison [1]:
* glibc (up to version 2.35 at least), which implements GB18030-2005, maps
  U+215D7 -> 0xFE6C
  U+E831  -> 0xFE6C
* JDK 5 maps
  U+215D7 -> 0x9536B937
  U+E831  -> 0xFE6C

> The real question here:
> U+215D7 -> GB18030: 0xFE6C or 0x9536B937?
> I think 0x9536B937 is the better choice, because Ext-B characters in
> GB18030 are all coded in 4 bytes.

That would still be quite arbitrary.

I think that before 2010, when one could not assume that GB18030 fonts
had a glyph for 0x9536B937, it was probably best to map
  U+215D7 -> 0xFE6C
  U+E831  -> 0xFE6C

However, the GB18030-2022 standard has since been released, and it
effectively retires the PUA mappings (making them optional in the fonts). [2]
It maps
  0xFE6C     <-> U+E831
  0x9536B937 <-> U+215D7
So, the underlying assumptions are that
  - the fonts now all have glyphs for 0x9536B937,
  - uses of these characters in files should have migrated (or should migrate?)
    to 0x9536B937.

Thus libiconv's GB18030-2005 converter should now rather map
  U+215D7 -> 0x9536B937
  U+E831  -> 0x9536B937
for a seamless transition.
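
As a rough illustration of what "seamless transition" means for existing files, the following Python sketch re-encodes a GB18030 byte string through two sets of tables: the proposed GB18030-2005 behaviour and the bijective GB18030-2022 mapping. Everything here is a simplification under stated assumptions: the tables contain only the entries discussed in this mail, the splitter does no validation, and the names split_gb18030 and recode are invented for this example.

```python
def split_gb18030(data: bytes):
    """Split a GB18030 byte string into characters of 1, 2 or 4 bytes.

    Simplified: assumes well-formed input and performs no validation.
    """
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            size = 1                      # ASCII
        elif i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39:
            size = 4                      # second byte is a digit: four-byte code
        else:
            size = 2                      # two-byte code
        yield data[i:i + size]
        i += size


def recode(data, decode_table, encode_table):
    """Decode and re-encode, passing through characters not in the tables."""
    out = bytearray()
    for char in split_gb18030(data):
        if char in decode_table:
            out += encode_table[decode_table[char]]
        else:
            out += char
    return bytes(out)


TWO_BYTE = bytes.fromhex("FE6C")
FOUR_BYTE = bytes.fromhex("9536B937")

# Proposed GB18030-2005 converter: many-to-one, always emits the four-byte code.
PROPOSED_DECODE = {TWO_BYTE: 0x215D7, FOUR_BYTE: 0x215D7}
PROPOSED_ENCODE = {0x215D7: FOUR_BYTE, 0xE831: FOUR_BYTE}

# GB18030-2022: bijective, the PUA mapping stays separate.
GB18030_2022_DECODE = {TWO_BYTE: 0xE831, FOUR_BYTE: 0x215D7}
GB18030_2022_ENCODE = {cp: b for b, cp in GB18030_2022_DECODE.items()}

legacy = b"A" + TWO_BYTE + b"B"
print(recode(legacy, PROPOSED_DECODE, PROPOSED_ENCODE).hex())          # 419536b93742
print(recode(legacy, GB18030_2022_DECODE, GB18030_2022_ENCODE).hex())  # 41fe6c42
```

With the proposed tables, a legacy 0xFE6C in a file is migrated to 0x9536B937 simply by converting to Unicode and back, whereas the bijective 2022 mapping leaves it as it is.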

Bruno

[1] https://www.haible.de/bruno/charsets/conversion-tables/GB18030.html
[2] https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132
