bug-libunistring

[bug-libunistring] GNU libunistring's Korean canonical composition bug r


From: DaeHyun Sung
Subject: [bug-libunistring] GNU libunistring's Korean canonical composition bug report.
Date: Sun, 19 Nov 2017 03:50:47 +0900

Hello, my name is DaeHyun Sung (성대현, 成大鉉).

I am Korean and a GNOME Foundation member in Korea.
My mother tongue is Korean.

I found a bug in the canonical decomposition of Korean Hangul syllables in GNU libunistring.

I first noticed the problem in GNOME Characters, which depends on GNU libunistring, and traced it to libunistring itself.

The relevant code is in libunistring/lib/uninorm/canonical-decomposition.c:


          /* Hangul syllable.  See Unicode standard, chapter 3, section
             "Hangul Syllable Decomposition",  See also the clarification at
             <http://www.unicode.org/versions/Unicode5.1.0/>, section
             "Clarification of Hangul Jamo Handling".  */
#if 1 /* Return the pairwise decomposition, not the full decomposition.  */
          decomposition[0] = 0xAC00 + uc - t; /* = 0xAC00 + (l * 21 + v) * 28; */
          decomposition[1] = 0x11A7 + t;
          return 2;
#else
          unsigned int v, l;

          uc = uc / 28;
          v = uc % 21;
          l = uc / 21;

          decomposition[0] = 0x1100 + l;
          decomposition[1] = 0x1161 + v;
          decomposition[2] = 0x11A7 + t;
          return 3;
#endif
 
 

I read the source comment that says 'See also the clarification at <http://www.unicode.org/versions/Unicode5.1.0/>, section "Clarification of Hangul Jamo Handling"'.
That description is misleading for people who do not know Korean well.


The bug: the canonical decomposition does not fully decompose Hangul syllables.
Expected: U+D4DB → <U+1111, U+1171, U+11B6> (the full canonical decomposition; correct)
Result:   U+D4DB → <U+D4CC, U+11B6> (only the intermediate step; incorrect)


See the Unicode Standard, Version 10.0, Core Specification, Chapter 3, Section 3.12 "Conjoining Jamo Behavior", under "Hangul Decomposition":
The Hangul Decomposition Algorithm as specified above directly decomposes precomposed Hangul syllable characters into a sequence of either two or three Hangul jamo characters.
The Hangul Decomposition Algorithm could also be expressed equivalently as a recursion of binary decompositions, as is the case for other non-Hangul characters.
All LVT syllables would decompose into an LV syllable plus a T jamo.
The LV syllables themselves would in turn decompose into an L jamo plus a V jamo.
This approach can be used to produce somewhat more compact code than what is illustrated in this sample method.
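
To make the "recursion of binary decompositions" described in the quoted passage concrete, here is a minimal sketch in C. The function names hangul_pairwise and hangul_full are hypothetical helpers written for this mail; they are not part of the libunistring API. The constants (0xAC00, 0x1100, 0x1161, 0x11A7, 28, and 588 = 21 * 28) are the usual Hangul constants from Section 3.12.

```c
/* Hypothetical helper, not libunistring API: one binary decomposition step.
   Returns 2 and fills out[] if cp is a precomposed Hangul syllable,
   otherwise returns 0 and leaves out[] untouched.  */
int hangul_pairwise(unsigned int cp, unsigned int out[2])
{
  unsigned int s, t;

  if (cp < 0xAC00 || cp > 0xD7A3)
    return 0;
  s = cp - 0xAC00;
  t = s % 28;
  if (t != 0)
    {
      /* LVT syllable -> LV syllable + T jamo */
      out[0] = cp - t;
      out[1] = 0x11A7 + t;
    }
  else
    {
      /* LV syllable -> L jamo + V jamo */
      out[0] = 0x1100 + s / 588;
      out[1] = 0x1161 + (s % 588) / 28;
    }
  return 2;
}

/* Hypothetical helper: apply the pairwise step recursively to the first
   element until it no longer decomposes.  Writes at most 3 code points
   to out[] and returns how many were written.  */
int hangul_full(unsigned int cp, unsigned int out[3])
{
  unsigned int pair[2];
  int n;

  if (hangul_pairwise(cp, pair) == 0)
    {
      out[0] = cp;   /* not a Hangul syllable: keep as is */
      return 1;
    }
  n = hangul_full(pair[0], out);
  out[n] = pair[1];
  return n + 1;
}
```

With this sketch, a single call to hangul_pairwise(0xD4DB, ...) stops at <U+D4CC, U+11B6>, while hangul_full(0xD4DB, ...) continues to <U+1111, U+1171, U+11B6>.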

The #if 1 branch performs only one binary step and is not applied recursively, so by itself it cannot fully decompose Hangul syllables.
Anyone who uses that branch has to apply it recursively to get the full decomposition.
I therefore suggest removing the #if 1 branch and using the #else branch instead.


Explanation of the canonical decomposition of Hangul, the Korean alphabet
Hangul elements are commonly referred to as jamo(자모/字母), meaning “alphabet”.

Korean has special terms for the jamo used to construct a Hangul syllable, depending on where in the syllable they appear:
- Choseong(초성/初聲) for the initial sound, usually a consonant
- Jungseong(중성/中聲) for the middle sound, usually a vowel
- Jongseong(종성/終聲) for the final sound, usually a consonant

Hangul syllables are the characters that are used to express contemporary Korean texts in writing.

ex1) Decomposition of a Hangul syllable
Unicode code point: U+AC00
Hangul(한글): ‘가’
jamo(자모/字母): ‘ᄀ’ plus ‘ᅡ’
choseong(초성/初聲): ᄀ (code point: U+1100)
jungseong(중성/中聲): ᅡ (code point: U+1161)

Selected Hangul syllable: ‘가’ (U+AC00)
Current result
Canonical decomposition:
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
ᅡ U+1161 HANGUL JUNGSEONG A

Expected result
Canonical decomposition:
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
ᅡ U+1161 HANGUL JUNGSEONG A

Hangul Choseong:ᄀ
Hangul Jungseong:ᅡ

ex2) Decomposition of a Hangul syllable
Unicode code point: U+AC01
Hangul(한글): ‘각’
jamo(자모/字母): ‘ᄀ’ plus ‘ᅡ’ plus ‘ᆨ’
choseong(초성/初聲): ᄀ (code point: U+1100)
jungseong(중성/中聲): ᅡ (code point: U+1161)
jongseong(종성/終聲): ᆨ (code point: U+11A8)


Selected Hangul syllable: ‘각’ (U+AC01)
Current result
Canonical decomposition:
가 U+AC00 HANGUL SYLLABLE GA (only an intermediate step)
ᆨ U+11A8 HANGUL JONGSEONG KIYEOK

Expected result
Canonical decomposition (full):
ᄀ U+1100 HANGUL CHOSEONG KIYEOK
ᅡ U+1161 HANGUL JUNGSEONG A
ᆨ U+11A8 HANGUL JONGSEONG KIYEOK

Hangul Choseong:ᄀ
Hangul Jungseong:ᅡ
Hangul Jongseong:ᆨ
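
The two examples above can be reproduced with the direct index arithmetic from Section 3.12. The following is only a sketch: the constant names follow the Unicode Standard (SBase, LBase, VBase, TBase, and the counts), while the function name hangul_decompose_full is a hypothetical helper made up for this mail and is not a libunistring function.

```c
/* Constants named as in the Unicode Standard, Section 3.12.  */
#define SBase  0xAC00u
#define LBase  0x1100u
#define VBase  0x1161u
#define TBase  0x11A7u
#define LCount 19u
#define VCount 21u
#define TCount 28u
#define NCount (VCount * TCount)   /* 588 */
#define SCount (LCount * NCount)   /* 11172 */

/* Hypothetical helper: full decomposition of a precomposed Hangul
   syllable into 2 (L, V) or 3 (L, V, T) jamo.  Returns the number of
   code points written to out[], or 0 if s is not a Hangul syllable.  */
int hangul_decompose_full(unsigned int s, unsigned int out[3])
{
  unsigned int SIndex, LIndex, VIndex, TIndex;

  if (s < SBase || s >= SBase + SCount)
    return 0;

  SIndex = s - SBase;
  LIndex = SIndex / NCount;             /* choseong index  */
  VIndex = (SIndex % NCount) / TCount;  /* jungseong index */
  TIndex = SIndex % TCount;             /* jongseong index, 0 = none */

  out[0] = LBase + LIndex;
  out[1] = VBase + VIndex;
  if (TIndex == 0)
    return 2;
  out[2] = TBase + TIndex;
  return 3;
}
```

For U+AC00 this gives the two jamo U+1100 and U+1161, and for U+AC01 it gives the three jamo U+1100, U+1161 and U+11A8, matching the expected results in ex1 and ex2.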

---


I have attached two diff files to this mail.

canonical-decomposition.c.diff -> libunistring/lib/uninorm/canonical-decomposition.c
test-canonical-decomposition.c.diff -> libunistring/tests/uninorm/test-canonical-decomposition.c
 
I also checked the Hangul decomposition in GNOME and KDE:
GNOME gucharmap, my suggestion: https://bugzilla.gnome.org/show_bug.cgi?id=777829 
GNOME gucharmap's Korean Hangul decomposition source code https://github.com/GNOME/gucharmap/blob/master/gucharmap/gucharmap-unicode-info.c

else if (wc >= 0xac00 && wc <= 0xd7af)
{
    /* compute hangul syllable name as per UAX #15 */
    gint SIndex = wc - SBase;
    gint LIndex, VIndex, TIndex;
    if (SIndex < 0 || SIndex >= SCount)
        return "";
    LIndex = SIndex / NCount;
    VIndex = (SIndex % NCount) / TCount;
    TIndex = SIndex % TCount;
    g_snprintf (buf, sizeof (buf), "HANGUL SYLLABLE %s%s%s",
                JAMO_L_TABLE[LIndex], JAMO_V_TABLE[VIndex], JAMO_T_TABLE[TIndex]);
    return buf;
}


KDE kwidgetsaddons, kcharselect: https://git.reviewboard.kde.org/r/129943/diff/1#index_header


Relevant documentation:
The Unicode® Standard Version 10.0 – Core Specification
http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
3.12 Conjoining Jamo Behavior

Unicode® Standard Annex #15 - UNICODE NORMALIZATION FORMS
http://unicode.org/reports/tr15/

Unicode Normalization Forms, #14.1.4. Hangul Decomposition and Composition: http://unicode.org/reports/tr15/#Hangul_Composition
Hangul Jamo (Range: U+1100-U+11FF): http://www.unicode.org/charts/PDF/U1100.pdf
Hangul Syllables (Range: U+AC00-U+D7AF): http://www.unicode.org/charts/PDF/UAC00.pdf

Please check this mail as soon as possible.

Thanks!


Sincerely,
DaeHyun Sung(성대현,成大鉉)

--
Korean Open Source Developer, Contributor, Translator.
GNOME Foundation Member & KDE Contributor in Korea.
Interested in GNOME, KDE, Web, etc

Attachment: canonical-decomposition.c.diff
Description: Text document

Attachment: test-canonical-decomposition.c.diff
Description: Text document

