[bug-libunistring] GNU libunistring's Korean canonical composition bug report.

Hello, my name is DaeHyun Sung (성대현, 成大鉉).

I am Korean and a GNOME Foundation member in Korea.

My mother tongue is Korean.

I found a bug in the canonical decomposition of Korean Hangul syllables in GNU libunistring.

I first noticed the bug in GNOME Characters, and then traced it to GNU libunistring, which GNOME Characters depends on.

The relevant source file is:

libunistring/lib/uninorm/canonical-decomposition.c

/* Hangul syllable. See Unicode standard, chapter 3, section
   "Hangul Syllable Decomposition", See also the clarification at
   <http://www.unicode.org/versions/Unicode5.1.0/>, section
   "Clarification of Hangul Jamo Handling". */
#if 1 /* Return the pairwise decomposition, not the full decomposition. */
decomposition[0] = 0xAC00 + uc - t; /* = 0xAC00 + (l * 21 + v) * 28; */
decomposition[1] = 0x11A7 + t;
return 2;
#else
unsigned int v, l;

uc = uc / 28;
v = uc % 21;
l = uc / 21;

decomposition[0] = 0x1100 + l;
decomposition[1] = 0x1161 + v;
decomposition[2] = 0x11A7 + t;
return 3;
#endif

I read the source comment that refers to 'the clarification at <http://www.unicode.org/versions/Unicode5.1.0/>, section "Clarification of Hangul Jamo Handling"'.

That comment is misleading for people who do not know Korean well.

The bug is that the canonical decomposition does not fully decompose Hangul syllables.

Expected: U+D4DB → <U+1111, U+1171, U+11B6> = full canonical decomposition. Correct.

Result: U+D4DB → <U+D4CC, U+11B6> = only the intermediate step. Incorrect.
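
The expected result above follows from the index arithmetic in the Unicode Standard, section 3.12. The following is a minimal sketch of that arithmetic, not the libunistring implementation; the function name hangul_decompose_full and the macro names are hypothetical.

```c
#include <stdint.h>

/* Constants from the Unicode Standard, section 3.12. */
#define S_BASE  0xAC00u
#define L_BASE  0x1100u
#define V_BASE  0x1161u
#define T_BASE  0x11A7u
#define V_COUNT 21u
#define T_COUNT 28u
#define S_COUNT (19u * V_COUNT * T_COUNT)   /* 11172 precomposed syllables */

/* Hypothetical helper (not part of libunistring): writes the full
   canonical decomposition of the Hangul syllable uc into out[] and
   returns the number of code points written (2 or 3), or 0 if uc is
   not a precomposed Hangul syllable. */
int hangul_decompose_full(uint32_t uc, uint32_t out[3])
{
    if (uc < S_BASE || uc >= S_BASE + S_COUNT)
        return 0;

    uint32_t s = uc - S_BASE;
    uint32_t t = s % T_COUNT;              /* jongseong index, 0 = none */
    uint32_t v = (s / T_COUNT) % V_COUNT;  /* jungseong index */
    uint32_t l = s / (V_COUNT * T_COUNT);  /* choseong index */

    out[0] = L_BASE + l;
    out[1] = V_BASE + v;
    if (t == 0)
        return 2;                          /* LV syllable: two jamo */
    out[2] = T_BASE + t;
    return 3;                              /* LVT syllable: three jamo */
}
```

For U+D4DB this yields <U+1111, U+1171, U+11B6>, and for U+AC00 it yields <U+1100, U+1161>.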

See the Unicode Standard Version 10.0, Core Specification, Chapter 3, Section 3.12, Conjoining Jamo Behavior:

Hangul Decomposition.
The Hangul Decomposition Algorithm as specified above directly decomposes precomposed Hangul syllable characters into a sequence of either two or three Hangul jamo characters.
The Hangul Decomposition Algorithm could also be expressed equivalently as a recursion of binary decompositions, as is the case for other non-Hangul characters.
All LVT syllables would decompose into an LV syllable plus a T jamo.
The LV syllables themselves would in turn decompose into an L jamo plus a V jamo.
This approach can be used to produce somewhat more compact code than what is illustrated in this sample method.

The code in the #if 1 branch performs only one binary step and does not recurse, so a single call cannot fully decompose a Hangul syllable.

Any caller that relies on this code must apply it recursively to obtain the full decomposition.
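
To illustrate what "apply it recursively" means, here is a rough sketch. The function hangul_pairwise_step is a hypothetical stand-in that mimics the pairwise behaviour described above (LVT → LV + T, LV → L + V); it is not the libunistring function itself, and hangul_decompose_recursive is likewise an assumed name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one pairwise step: LVT -> LV + T, or
   LV -> L + V.  Returns 2 on success, 0 if uc is not a precomposed
   Hangul syllable (U+AC00..U+D7A3). */
int hangul_pairwise_step(uint32_t uc, uint32_t pair[2])
{
    if (uc < 0xAC00 || uc >= 0xD7A4)
        return 0;

    uint32_t s = uc - 0xAC00;
    uint32_t t = s % 28;
    if (t != 0) {
        pair[0] = uc - t;          /* the LV syllable */
        pair[1] = 0x11A7 + t;      /* the T jamo */
    } else {
        s /= 28;
        pair[0] = 0x1100 + s / 21; /* the L jamo */
        pair[1] = 0x1161 + s % 21; /* the V jamo */
    }
    return 2;
}

/* Applies the pairwise step until nothing decomposes any further.
   out must have room for at least 3 code points.  Returns the number
   of code points written. */
size_t hangul_decompose_recursive(uint32_t uc, uint32_t *out)
{
    uint32_t pair[2];

    if (hangul_pairwise_step(uc, pair) == 0) {
        out[0] = uc;               /* not a syllable: keep as is */
        return 1;
    }
    /* Only the first element can still be a syllable; the second is
       always a V or T jamo. */
    size_t n = hangul_decompose_recursive(pair[0], out);
    out[n] = pair[1];
    return n + 1;
}
```

With this sketch, one step on U+D4DB gives <U+D4CC, U+11B6>, while the recursive version gives <U+1111, U+1171, U+11B6>.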

So I suggest removing the #if 1 branch and using the #else branch instead.

The #if 1 branch does not produce the full decomposition of a Korean Hangul syllable.

Explanation of canonical decomposition for Hangul, the Korean alphabet

Hangul elements are commonly referred to as jamo (자모/字母), meaning “alphabet”.

Korean has special terms for the jamo that are used to construct a Hangul syllable, depending on where in the syllable they appear:

- Choseong(초성/初聲) for the initial sound, usually a consonant

- Jungseong(중성/中聲) for the middle sound, usually a vowel

- Jongseong(종성/終聲) for the final sound, usually a consonant

Hangul syllables are the characters that are used to express contemporary Korean texts in writing.
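
As a small illustration of how the three jamo positions combine into one syllable, the sketch below computes the precomposed code point from a choseong, a jungseong, and an optional jongseong, using the arithmetic from the Unicode Standard, section 3.12. The function name hangul_compose is hypothetical, and it covers only the modern jamo that map to precomposed syllables.

```c
#include <stdint.h>

/* Hypothetical helper: l, v, t are conjoining jamo code points.
   Pass t = 0 when the syllable has no jongseong.  Returns the
   precomposed syllable, or 0 if any argument is out of range. */
uint32_t hangul_compose(uint32_t l, uint32_t v, uint32_t t)
{
    if (l < 0x1100 || l > 0x1112)                 /* 19 choseong */
        return 0;
    if (v < 0x1161 || v > 0x1175)                 /* 21 jungseong */
        return 0;
    if (t != 0 && (t < 0x11A8 || t > 0x11C2))     /* 27 jongseong */
        return 0;

    uint32_t l_index = l - 0x1100;
    uint32_t v_index = v - 0x1161;
    uint32_t t_index = (t == 0) ? 0 : t - 0x11A7; /* 0 means "no final" */

    return 0xAC00 + (l_index * 21 + v_index) * 28 + t_index;
}
```

For example, U+1112, U+1161 and U+11AB combine to U+D55C (한), and U+1100 with U+1161 alone combine to U+AC00 (가).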

ex1) Decomposition of hangul syllable

Unicode codepoint: U+AC00

Hangul(한글) ‘가’

jamo(자모/字母): ㄱ plus ㅏ

choseong(초성/初聲): ㄱ (codepoint: U+1100)

jungseong(중성/中聲): ㅏ(codepoint: U+1161)

Selected Hangul syllable ‘가’(U+AC00)

Current result

Canonical decomposition:

ㄱ U+1100 HANGUL CHOSEONG KIYEOK

ㅏ U+1161 HANGUL JUNGSEONG A

Expected result

Canonical decomposition:

ㄱ U+1100 HANGUL CHOSEONG KIYEOK

ㅏ U+1161 HANGUL JUNGSEONG A

Hangul Choseong:ᄀ

Hangul Jungseong:ᅡ

ex2) Decomposition of hangul syllable

Unicode code point: U+AC01

Hangul(한글) ‘각’

jamo(자모/字母): ‘ᄀ’ plus ‘ᅡ’ plus ‘ᆨ’

choseong(초성/初聲):ㄱ (codepoint: U+1100)

jungseong(중성/中聲):ㅏ(codepoint: U+1161)

jongseong(종성/終聲):ᆨ (codepoint: U+11A8)

Selected Hangul syllable ‘각’(U+AC01)

Current result

Canonical decomposition:

가 U+AC00 HANGUL SYLLABLE GA (only an intermediate step)

ᆨ U+11A8 HANGUL JONGSEONG KIYEOK

Expected result

Canonical decomposition (full):

ㄱ U+1100 HANGUL CHOSEONG KIYEOK

ㅏ U+1161 HANGUL JUNGSEONG A

ᆨ U+11A8 HANGUL JONGSEONG KIYEOK

Hangul Choseong:ᄀ

Hangul Jungseong:ᅡ

Hangul Jongseong:ᆨ
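
The difference between the current and the expected result in ex2 can be stated as a simple check: a full decomposition contains no precomposed Hangul syllable. The sketch below expresses that check; the function name hangul_is_fully_decomposed is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: returns 1 if no element of seq[0..n-1] is a
   precomposed Hangul syllable (U+AC00..U+D7A3), otherwise 0. */
int hangul_is_fully_decomposed(const uint32_t *seq, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (seq[i] >= 0xAC00 && seq[i] <= 0xD7A3)
            return 0;   /* an intermediate LV syllable is still present */
    }
    return 1;
}
```

Applied to ex2, the current result <U+AC00, U+11A8> fails the check because U+AC00 is itself a syllable, while the expected result <U+1100, U+1161, U+11A8> passes.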

---

I have attached the following diff files to this mail:

canonical-decomposition.c.diff -> libunistring/lib/uninorm/canonical-decomposition.c

test-canonical-decomposition.c.diff -> libunistring/tests/uninorm/test-canonical-decomposition.c

I also checked the Hangul decomposition in GNOME and KDE.

GNOME gucharmap, my suggestion: https://bugzilla.gnome.org/show_bug.cgi?id=777829

GNOME gucharmap's source code for Korean Hangul decomposition: https://github.com/GNOME/gucharmap/blob/master/gucharmap/gucharmap-unicode-info.c

else if (wc >= 0xac00 && wc <= 0xd7af)
  {
    /* compute hangul syllable name as per UAX #15 */
    gint SIndex = wc - SBase;
    gint LIndex, VIndex, TIndex;

    if (SIndex < 0 || SIndex >= SCount)
      return "";

    LIndex = SIndex / NCount;
    VIndex = (SIndex % NCount) / TCount;
    TIndex = SIndex % TCount;

    g_snprintf (buf, sizeof (buf), "HANGUL SYLLABLE %s%s%s",
                JAMO_L_TABLE[LIndex], JAMO_V_TABLE[VIndex], JAMO_T_TABLE[TIndex]);
    return buf;
  }
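
The gucharmap excerpt above relies on constants and tables that are defined elsewhere in that file. For readers who want to try the same L/V/T index computation in isolation, here is a self-contained sketch in plain C without GLib. The function name hangul_syllable_name and the table names are hypothetical; the short jamo names are the ones the Unicode Standard uses for Hangul syllable names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Jamo short names used in Hangul syllable names (Unicode Standard). */
static const char *const JAMO_L[19] = {
    "G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
    "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
};
static const char *const JAMO_V[21] = {
    "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
    "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"
};
static const char *const JAMO_T[28] = {
    "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
    "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
    "K", "T", "P", "H"
};

/* Hypothetical helper: writes the character name of the Hangul
   syllable wc into buf and returns buf, or returns NULL if wc is not
   a precomposed Hangul syllable (U+AC00..U+D7A3). */
const char *hangul_syllable_name(uint32_t wc, char *buf, size_t size)
{
    if (wc < 0xAC00 || wc > 0xD7A3)
        return NULL;

    uint32_t s = wc - 0xAC00;
    uint32_t l = s / (21 * 28);         /* choseong index */
    uint32_t v = (s % (21 * 28)) / 28;  /* jungseong index */
    uint32_t t = s % 28;                /* jongseong index, 0 = none */

    snprintf(buf, size, "HANGUL SYLLABLE %s%s%s",
             JAMO_L[l], JAMO_V[v], JAMO_T[t]);
    return buf;
}
```

The same three indices drive both the name and the full decomposition, which is why the L, V and T parts are all available in one pass.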

KDE kwidgetsaddons, kcharselect: https://git.reviewboard.kde.org/r/129943/diff/1#index_header

Please see the following documentation:

The Unicode® Standard Version 10.0 – Core Specification
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
3.12 Conjoining Jamo Behavior

Unicode® Standard Annex #15 - UNICODE NORMALIZATION FORMS
http://unicode.org/reports/tr15/

Unicode Normalization forms #14.1.4. Hangul Decomposition and Composition http://unicode.org/reports/tr15/#Hangul_Composition

Hangul Jamo (Range: U+1100-U+11FF) http://www.unicode.org/charts/PDF/U1100.pdf

Hangul Syllables (Range: U+AC00-U+D7AF) http://www.unicode.org/charts/PDF/UAC00.pdf

Please check this mail as soon as possible.

Thanks!

Sincerely,

DaeHyun Sung(성대현,成大鉉)

Korean Open Source Developer, Contributor, Translator.

GNOME Foundation Member & KDE Contributor in Korea.

Interested in GNOME, KDE, Web, etc

From:	DaeHyun Sung
Subject:	[bug-libunistring] GNU libunistring's Korean canonical composition bug report.
Date:	Sun, 19 Nov 2017 03:50:47 +0900