Hello, my name is DaeHyun Sung (성대현, 成大鉉).
I am Korean and a GNOME Foundation member in Korea.
My mother tongue is Korean.
I found a bug in the canonical decomposition of Korean Hangul syllables in GNU libunistring.
I first noticed the wrong decomposition in GNOME Characters and then traced it to GNU libunistring.
GNOME Characters depends on GNU libunistring.
The relevant code is in libunistring/lib/uninorm/canonical-decomposition.c:
      /* Hangul syllable.  See Unicode standard, chapter 3, section
         "Hangul Syllable Decomposition",  See also the clarification at
         <http://www.unicode.org/versions/Unicode5.1.0/>, section
         "Clarification of Hangul Jamo Handling".  */
#if 1 /* Return the pairwise decomposition, not the full decomposition.  */
          decomposition[0] = 0xAC00 + uc - t; /* = 0xAC00 + (l * 21 + v) * 28; */
          decomposition[1] = 0x11A7 + t;
          return 2;
#else
          unsigned int v, l;

          uc = uc / 28;
          v = uc % 21;
          l = uc / 21;

          decomposition[0] = 0x1100 + l;
          decomposition[1] = 0x1161 + v;
          decomposition[2] = 0x11A7 + t;
          return 3;
#endif
The comment on the #if 1 branch can mislead readers who are not familiar with Korean.
The bug is that the canonical decomposition does not fully decompose Hangul syllables.
Expected: U+D4DB → <U+1111, U+1171, U+11B6> = full canonical decomposition. Correct.
Result: U+D4DB → <U+D4CC, U+11B6> = only the intermediate step. Incorrect.
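To make the expected result concrete, here is a minimal sketch of the full decomposition arithmetic. The function name hangul_full_decomposition and the macro names are my own and are not part of libunistring; the constants are the ones given in the Unicode Standard, Section 3.12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard, Section 3.12. */
#define S_BASE  0xAC00u
#define L_BASE  0x1100u
#define V_BASE  0x1161u
#define T_BASE  0x11A7u
#define V_COUNT 21u
#define T_COUNT 28u
#define S_COUNT (19u * V_COUNT * T_COUNT)  /* 11172 syllables */

/* Hypothetical helper (not a libunistring function): writes the full
   canonical decomposition of a precomposed Hangul syllable into out[]
   and returns the number of jamo written (2 or 3), or 0 if uc is not
   a precomposed Hangul syllable. */
size_t hangul_full_decomposition(uint32_t uc, uint32_t out[3])
{
  assert(out != NULL);
  if (uc < S_BASE || uc - S_BASE >= S_COUNT)
    return 0;

  uint32_t s = uc - S_BASE;
  uint32_t t = s % T_COUNT;   /* trailing consonant index, 0 = none */
  uint32_t lv = s / T_COUNT;  /* combined leading/vowel index */

  out[0] = L_BASE + lv / V_COUNT;  /* choseong */
  out[1] = V_BASE + lv % V_COUNT;  /* jungseong */
  if (t == 0)
    return 2;
  out[2] = T_BASE + t;             /* jongseong */
  return 3;
}
```

For U+D4DB this gives s = 10459, t = 15 and lv = 373, so the three jamo are U+1111, U+1171 and U+11B6.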
See the Unicode Standard, Version 10.0, Core Specification, Chapter 3, Section 3.12 "Conjoining Jamo Behavior".
The part on Hangul decomposition says:
The Hangul Decomposition Algorithm as specified above directly decomposes precomposed Hangul syllable characters into a sequence of either two or three Hangul jamo characters.
The Hangul Decomposition Algorithm could also be expressed equivalently as a recursion of binary decompositions, as is the case for other non-Hangul characters.
All LVT syllables would decompose into an LV syllable plus a T jamo.
The LV syllables themselves would in turn decompose into an L jamo plus a V jamo.
This approach can be used to produce somewhat more compact code than what is illustrated in this sample method.
The code in the #if 1 branch performs only one pairwise step and does not recurse, so it cannot fully decompose Hangul syllables.
A caller that uses that code has to apply it recursively to obtain the full decomposition.
Therefore, I suggest removing the #if 1 branch and using the #else branch instead, because only the #else branch produces the full decomposition of Korean Hangul syllables.
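The recursion described in the standard can be sketched as follows. This is only an illustration under my own assumptions: hangul_pairwise is a hypothetical stand-in for the pairwise behaviour (LVT → LV + T, LV → L + V), and hangul_decompose_recursively is a hypothetical wrapper; neither is a libunistring function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one pairwise step:
   LVT -> LV + T, and LV -> L + V.
   Returns 2 on success, 0 if uc is not a precomposed Hangul syllable. */
int hangul_pairwise(uint32_t uc, uint32_t pair[2])
{
  assert(pair != NULL);
  if (uc < 0xAC00u || uc > 0xD7A3u)
    return 0;

  uint32_t s = uc - 0xAC00u;
  uint32_t t = s % 28u;
  if (t != 0)
    {
      pair[0] = uc - t;          /* the LV syllable */
      pair[1] = 0x11A7u + t;     /* the T jamo */
    }
  else
    {
      uint32_t lv = s / 28u;
      pair[0] = 0x1100u + lv / 21u;  /* the L jamo */
      pair[1] = 0x1161u + lv % 21u;  /* the V jamo */
    }
  return 2;
}

/* Applies the pairwise step again to the first element, so that an LVT
   syllable ends up as three jamo. Returns the number of jamo (2 or 3),
   or 0 if uc is not a precomposed Hangul syllable. */
size_t hangul_decompose_recursively(uint32_t uc, uint32_t out[3])
{
  uint32_t pair[2], inner[2];

  assert(out != NULL);
  if (hangul_pairwise(uc, pair) == 0)
    return 0;
  if (hangul_pairwise(pair[0], inner) == 2)
    {
      out[0] = inner[0];
      out[1] = inner[1];
      out[2] = pair[1];
      return 3;
    }
  out[0] = pair[0];
  out[1] = pair[1];
  return 2;
}
```

A single call to the pairwise step stops at <U+D4CC, U+11B6> for U+D4DB. Only the second step on U+D4CC reaches <U+1111, U+1171, U+11B6>.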
Explanation of Korean Hangul canonical decomposition
Hangul elements are commonly referred to as jamo (자모/字母), meaning "alphabet".
Korean has special terms for the jamo that are used to construct a Hangul syllable, depending on where in the syllable they appear:
- Choseong (초성/初聲) for the initial sound, usually a consonant
- Jungseong (중성/中聲) for the middle sound, usually a vowel
- Jongseong (종성/終聲) for the final sound, usually a consonant
Hangul syllables are the characters used to write contemporary Korean text.
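As a rough sketch of the three positions, the following check tests whether a sequence consists of a choseong, a jungseong and an optional jongseong. The function names are my own. The ranges cover only the modern jamo that appear in the decomposition of precomposed syllables (19 choseong, 21 jungseong, 27 jongseong).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Modern conjoining jamo ranges used by precomposed Hangul syllables. */
static int is_choseong(uint32_t c)  { return c >= 0x1100u && c <= 0x1112u; }
static int is_jungseong(uint32_t c) { return c >= 0x1161u && c <= 0x1175u; }
static int is_jongseong(uint32_t c) { return c >= 0x11A8u && c <= 0x11C2u; }

/* Hypothetical helper: returns 1 if seq[0..n-1] is a choseong followed by
   a jungseong and an optional jongseong, 0 otherwise. */
int is_full_jamo_sequence(const uint32_t *seq, size_t n)
{
  assert(seq != NULL);
  if (n != 2 && n != 3)
    return 0;
  if (!is_choseong(seq[0]) || !is_jungseong(seq[1]))
    return 0;
  return n == 2 || is_jongseong(seq[2]);
}
```

With this check, <U+1100, U+1161, U+11A8> passes, while <U+AC00, U+11A8> fails because U+AC00 is a syllable and not a choseong.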
ex1) Decomposition of a Hangul syllable
Unicode code point: U+AC00
Hangul (한글): '가'
jamo (자모/字母): 'ㄱ' plus 'ㅏ'
choseong (초성/初聲): ㄱ (code point: U+1100)
jungseong (중성/中聲): ㅏ (code point: U+1161)
Selected Hangul syllable: '가' (U+AC00)
Current result
Canonical decomposition:
ㄱ U+1100 HANGUL CHOSEONG KIYEOK
ㅏ U+1161 HANGUL JUNGSEONG A
Expected result
Canonical decomposition:
ㄱ U+1100 HANGUL CHOSEONG KIYEOK
ㅏ U+1161 HANGUL JUNGSEONG A
Hangul Choseong:ᄀ
Hangul Jungseong:ᅡ
ex2) Decomposition of a Hangul syllable
Unicode code point: U+AC01
Hangul (한글): '각'
jamo (자모/字母): 'ᄀ' plus 'ᅡ' plus 'ᆨ'
choseong (초성/初聲): ㄱ (code point: U+1100)
jungseong (중성/中聲): ㅏ (code point: U+1161)
jongseong (종성/終聲): ᆨ (code point: U+11A8)
Selected Hangul syllable: '각' (U+AC01)
Current result
Canonical decomposition:
가 U+AC00 HANGUL SYLLABLE GA (this is only the intermediate step)
ᆨ U+11A8 HANGUL JONGSEONG KIYEOK
Expected result
Canonical decomposition (full):
ㄱ U+1100 HANGUL CHOSEONG KIYEOK
ㅏ U+1161 HANGUL JUNGSEONG A
ᆨ U+11A8 HANGUL JONGSEONG KIYEOK
Hangul Choseong:ᄀ
Hangul Jungseong:ᅡ
Hangul Jongseong:ᆨ
---
I have attached diff files to this mail:
canonical-decomposition.c.diff -> libunistring/lib/uninorm/canonical-decomposition.c
test-canonical-decomposition.c.diff -> libunistring/tests/uninorm/test-canonical-decomposition.c
I also checked the Hangul decomposition code used in GNOME and KDE:
else if (wc >= 0xac00 && wc <= 0xd7af)
{
/* compute hangul syllable name as per UAX #15 */
gint SIndex = wc - SBase;
gint LIndex, VIndex, TIndex;
if (SIndex < 0 || SIndex >= SCount)
return "";
LIndex = SIndex / NCount;
VIndex = (SIndex % NCount) / TCount;
TIndex = SIndex % TCount;
g_snprintf (buf, sizeof (buf), "HANGUL SYLLABLE %s%s%s", JAMO_L_TABLE[LIndex], JAMO_V_TABLE[VIndex], JAMO_T_TABLE[TIndex]);
return buf;
}
Please also check the documentation:
Unicode® Standard Annex #15 - UNICODE NORMALIZATION FORMS
Please check this mail as soon as possible.
Thanks!
Sincerely,
DaeHyun Sung(성대현,成大鉉)
--
Korean Open Source Developer, Contributor, Translator.
GNOME Foundation Member & KDE Contributor in Korea.
Interested in GNOME, KDE, Web, etc