bug-gnu-emacs
bug#12803: 24.3.50; accented Thai Unicode characters are turned into dec


From: Kenichi Handa
Subject: bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
Date: Mon, 05 Nov 2012 23:41:58 +0900

In article <DF4C7EEF-CE55-4363-A91A-0577DD28AEED@freenet.de>, Peter Dyballa 
<peter_dyballa@freenet.de> writes:

> I wanted to get the unique Thai characters from such an eMail subject:

>       FW:grcthai สร้างรายได้แบบไร้ขีดจำกัด กับการทำงานแบบไร้ขอบเขต..

> So I marked the Thai text and invoked replace-regexp with "\(.\)" -> ”\1 " to 
> later do replace-string " " -> "C-qC-j" and then [g]sort -u the result. I had 
> in buffer *Shell Command Output* decomposed Thai Unicode characters…

> But actually it is already the function replace-regexp which produces the 
> decomposed characters (originally 41 characters, after replace-regexp not 82 
> but 89 according to column-number-mode).

There is no such thing as an "accented Thai Unicode character".

Your example was not originally 41 characters; it was
originally 41 columns wide on display.

For Thai, Unicode doesn't assign a single character code to,
for instance, "ร้".  It's a two-character sequence that is
composed on display into one grapheme cluster occupying one
column.
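
As a rough illustration outside Emacs, here is a minimal Python
sketch (not the Emacs implementation; the variable names are only
for this example).  It assumes that Python's `.` in a regexp, like
the one in Emacs, matches one code point rather than one grapheme
cluster, and shows the cluster being split by a "\(.\)" -> "\1 "
style replacement:

```python
import re
import unicodedata

# The cluster discussed above, written with escapes so the two
# code points are visible: RO RUA followed by the tone mark MAI THO.
cluster = "\u0e23\u0e49"

# One entry per code point, not per display column.
code_points = [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in cluster]

# NFC cannot merge the pair, because no precomposed character exists.
is_unchanged_by_nfc = unicodedata.normalize("NFC", cluster) == cluster

# Appending a space after every code point separates the tone mark
# from its base character, which is what looks like "decomposition".
spaced = re.sub(r"(.)", r"\1 ", cluster)

print(len(cluster), code_points, is_unchanged_by_nfc, repr(spaced))
```

The string was already two characters long before the replacement;
the replacement only makes that visible.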

A stranger-looking example is "จำ".  It's also a
two-character sequence: the first character is จ and the
second is ำ.  Unicode doesn't have a character "จ with
small-circle-above".
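
The same kind of check, again only as a hedged Python sketch with
illustrative variable names, lists the two code points of this
sequence and their Unicode general categories:

```python
import unicodedata

# CHO CHAN followed by the vowel sign SARA AM, written with escapes.
cluster = "\u0e08\u0e33"

# Character names and general categories, one entry per code point.
names = [unicodedata.name(c) for c in cluster]
categories = [unicodedata.category(c) for c in cluster]

print(len(cluster), names, categories)
```

Here the second character is classified as a letter ("Lo"), not as
a nonspacing mark, so a test that only looks for combining marks
would not keep this pair together.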

---
Kenichi Handa
handa@gnu.org




