Re: X11 Compound Text vs ISO 2022
From: James Cloos
Subject: Re: X11 Compound Text vs ISO 2022
Date: Wed, 14 Jul 2010 17:07:18 -0400
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/24.0.50 (gnu/linux)
I ran the following script in batch mode to generate a list of how Emacs
converts each UCS code point to compound-text and
compound-text-with-extensions:
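;;; emacs -q --batch --script to-n-from-ctext.el
;; Round-trip every code point U+0000..U+10FFFF and print the decoded
;; string with %S, which also shows any text properties it carries.
(setq num 0)
(while (< num 1114112)                  ; #x110000, one past U+10FFFF
  (princ (format "%04X\t%S\n" num
                 (decode-coding-string
                  (encode-coding-string (format "%c" num)
                                        'compound-text-with-extensions)
                  'compound-text)))
  (setq num (+ 1 num)))
;;;;;;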
(Change 'compound-text-with-extensions to 'ctext to see how converting
to ctext works.)
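A minimal sketch of how per-charset counts like the ones below could be tallied directly in Emacs Lisp. It assumes that the decoder marks decoded text with a `charset' text property, which is what the %S output of the script suggests. The helper name ctext-tally-charsets is hypothetical and not part of Emacs. Code points whose decoded text carries no such property are counted under `none'.

```elisp
;; Sketch only: `ctext-tally-charsets' is a hypothetical helper.
;; Assumption: decoding leaves a `charset' text property on the result.

(defun ctext-tally-charsets (first last coding)
  "Return an alist (CHARSET . COUNT) for code points FIRST..LAST.
Each code point is encoded with CODING, decoded as compound-text (as in
the script above), and tallied by the `charset' text property on the
first decoded character, or under `none' if there is no such property."
  (let ((counts (make-hash-table :test 'eq))
        (num first)
        result)
    (while (<= num last)
      (let* ((decoded (decode-coding-string
                       (encode-coding-string (string num) coding)
                       'compound-text))
             (charset (or (and (> (length decoded) 0)
                               (get-text-property 0 'charset decoded))
                          'none)))
        (puthash charset (1+ (gethash charset counts 0)) counts))
      (setq num (1+ num)))
    ;; Convert the hash table into an alist for easy printing.
    (maphash (lambda (charset count)
               (push (cons charset count) result))
             counts)
    result))
```

Calling (ctext-tally-charsets 0 #x10FFFF 'compound-text-with-extensions) walks the whole code space; each entry of the returned alist can then be printed as a "charset<TAB>count" line.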
The results of converting to ctext are:
,----< tab-separated charset count >
| ipa 6
| lao 94
| tibetan 193
| chinese-big5-1 415
| chinese-big5-2 29
| chinese-cns11643-1 2257
| chinese-cns11643-2 6594
| chinese-cns11643-3 5705
| chinese-cns11643-4 7217
| chinese-cns11643-5 8599
| chinese-cns11643-6 6384
| chinese-cns11643-7 6539
| arabic-digit 9
| chinese-gb2312 7299
| latin-iso8859-1 96
| latin-iso8859-13 3
| latin-iso8859-14 27
| latin-iso8859-15 2
| latin-iso8859-16 4
| latin-iso8859-2 57
| latin-iso8859-3 22
| latin-iso8859-4 35
| cyrillic-iso8859-5 93
| arabic-iso8859-6 48
| greek-iso8859-7 77
| hebrew-iso8859-8 30
| katakana-jisx0201 63
| japanese-jisx0208 316
| japanese-jisx0212 124
| japanese-jisx0213-1 507
| japanese-jisx0213-2 250
| korean-ksc5601 2907
| thai-tis620 96
| mule-unicode-0100-24ff 7851
| mule-unicode-2500-33ff 3005
| mule-unicode-e000-ffff 7219
| vietnamese-viscii-lower 46
| vietnamese-viscii-upper 46
`----
As you can see, that is of no value. It also fails to convert the vast
majority of non-BMP characters.
Converting to ctext-with-extensions gives somewhat better results:
,----< tab-separated charset count >
| latin-iso8859-1 96
| latin-iso8859-2 57
| latin-iso8859-3 22
| latin-iso8859-4 35
| cyrillic-iso8859-5 93
| arabic-iso8859-6 48
| greek-iso8859-7 77
| hebrew-iso8859-8 30
| thai-tis620 96
| latin-iso8859-13 3
| latin-iso8859-14 27
| latin-iso8859-15 2
| latin-iso8859-16 4
| katakana-jisx0201 63
| chinese-gb2312 7299
| japanese-jisx0208 316
| japanese-jisx0212 124
| korean-ksc5601 2907
| chinese-cns11643-1 2044
| chinese-cns11643-2 3307
| chinese-cns11643-3 1714
| chinese-cns11643-4 755
| chinese-cns11643-5 89
| chinese-cns11643-6 39
| chinese-cns11643-7 31
| utf-8 1093949
| japanese-jisx0213-1 507
| japanese-jisx0213-2 250
`----
As you can see, 8859-9 and 8859-10 are not generated, but that is
because all of their characters can be found in 8859-1 through -8,
so it is not a problem.
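The coverage claim can be checked with a short sketch like the following. It assumes the Emacs charsets iso-8859-1 through iso-8859-10 with an 8-bit code space, and the helper name ctext-uncovered-chars is hypothetical. An empty list (nil) for a charset means every one of its upper-half characters is encodable in at least one of 8859-1 through -8.

```elisp
;; Sketch only: `ctext-uncovered-chars' is a hypothetical helper.
;; Assumption: charsets `iso-8859-1' ... `iso-8859-10' use code space 0..255.

(defun ctext-uncovered-chars (charset others)
  "Return the characters at #xA0..#xFF of CHARSET found in none of OTHERS."
  (let (missing)
    (dotimes (i 96)
      (let ((ch (decode-char charset (+ #xA0 i)))
            (found nil))
        (when ch                        ; skip unassigned code points
          (dolist (other others)
            (when (encode-char ch other)
              (setq found t)))
          (unless found
            (push ch missing)))))
    (nreverse missing)))

;; Report, per charset, the characters missing from 8859-1 through -8.
(let ((base '(iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4
              iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8)))
  (dolist (cs '(iso-8859-9 iso-8859-10))
    (princ (format "%s\t%S\n" cs (ctext-uncovered-chars cs base)))))
```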
But japanese-jisx0213-1 and japanese-jisx0213-2 need to go; they are
simply unknown to other COMPOUND_TEXT users.
It is clear that the current definition of compound-text is wrong;
I'd replace it with the current compound-text-with-extensions and keep
the latter name as an alias for backwards compatibility.
Then, we need to determine how to prevent Emacs from considering the
jisx0213-? charsets when converting to ctext.
We should perhaps also prefer utf-8 over the gb, cns, ksc, and jisx
charsets when converting "narrow" characters (and ambiguous characters
when in a "narrow" or "non-cjk" locale). Handa-san already did some
comparable work for font selection; what he did there is also needed here.
-JimC
--
James Cloos <address@hidden> OpenPGP: 1024D/ED7DAEA6