Re: [bug-gawk] Thai UTF-8 length bug
From: Eli Zaretskii
Subject: Re: [bug-gawk] Thai UTF-8 length bug
Date: Tue, 21 Jun 2016 18:25:28 +0300
> From: PePa <address@hidden>
> Date: Tue, 21 Jun 2016 13:25:47 +0700
>
> Couldn't find any report about this. I read that gawk as of 3.1.5 is
> supposed to report length in characters now. That is not true for Thai
> characters (Ubuntu 16.04 gawk 4.1.3):
>
> LC_ALL=th_TH.UTF-8 gawk 'BEGIN {print length("ค้ม")}'
> 3
> (should be 2)
I think you are confusing characters with grapheme clusters. The
above string has 3 codepoints: U+0E04 (KHO KHWAI), U+0E49 (MAI THO,
a combining tone mark), and U+0E21 (MO MA). On display (assuming the
display supports complex script shaping), we should see 2 grapheme
clusters, because the first two characters combine to form a single
grapheme cluster.
But Gawk doesn't count grapheme clusters; it counts characters
(codepoints), so 3 is the expected result here.
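
To make the distinction concrete, here is a small Python sketch (not anything Gawk does internally) that counts the same string three ways. The function names are made up for illustration, and the cluster count is only an approximation: it skips codepoints whose Unicode general category is a combining mark, which is enough for this string but is not the full grapheme cluster segmentation defined by Unicode.

```python
import unicodedata

# The string from the report, spelled out as codepoints:
# U+0E04 KHO KHWAI, U+0E49 MAI THO (combining tone mark), U+0E21 MO MA.
s = "\u0e04\u0e49\u0e21"

def count_bytes(text):
    """Number of bytes in the UTF-8 encoding."""
    return len(text.encode("utf-8"))

def count_codepoints(text):
    """Number of codepoints; this is the notion of length discussed above."""
    return len(text)

def count_clusters_approx(text):
    """Rough cluster count: ignore combining marks (categories Mn, Mc, Me).

    Assumption: good enough for simple base + tone mark sequences only;
    it does not implement the full Unicode segmentation rules.
    """
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))

if __name__ == "__main__":
    print(count_bytes(s), count_codepoints(s), count_clusters_approx(s))
```

For this string the codepoint count is 3, matching what Gawk printed, while the approximate cluster count is 2, which is the number the original poster expected.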