emacs-devel

Re: X11 Compound Text vs ISO 2022


From: James Cloos
Subject: Re: X11 Compound Text vs ISO 2022
Date: Wed, 14 Jul 2010 17:07:18 -0400
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/24.0.50 (gnu/linux)

I used this as a --batch file to generate a list of how Emacs converts
each UCS code point to compound-text and compound-text-with-extensions:

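;;; emacs -q --batch --script to-n-from-ctext.el
;; For every code point: encode it, decode the result as compound-text,
;; and print the code point next to the decoded string (%S also shows
;; the string's text properties).
(setq num 0)
(while (< num #x110000)                 ; #x110000 = 1114112
  (princ (format "%04X\t%S\n" num (decode-coding-string
    (encode-coding-string (format "%c" num)
    'compound-text-with-extensions) 'compound-text)))
  (setq num (1+ num)))
;;;;;;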

(Change 'compound-text-with-extensions to 'ctext to see how converting
to ctext works.)
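
Per-charset counts like the ones below can be tallied directly in Lisp
rather than from the printed output.  What follows is only a sketch: the
function name my-ctext-charset-tally and its FROM/TO arguments are
hypothetical, and it assumes the decoded string carries a `charset' text
property on its first character, as the %S output of the script suggests.
It is not necessarily how the tables in this message were produced.

```elisp
;; Hypothetical helper, not part of the original script.
(defun my-ctext-charset-tally (coding-system &optional from to)
  "Return an alist (CHARSET . COUNT) for code points FROM (inclusive) to TO (exclusive).
Each code point is encoded with CODING-SYSTEM and decoded as compound-text.
Only the `charset' text property at position 0 of the result is counted;
code points whose result has no such property are skipped."
  (let ((counts (make-hash-table :test 'eq))
        (num (or from 0))
        (end (or to #x110000))
        result)
    (while (< num end)
      (let* ((decoded (decode-coding-string
                       (encode-coding-string (format "%c" num) coding-system)
                       'compound-text))
             ;; Assumption: the decoder annotates the text with `charset'.
             (charset (and (> (length decoded) 0)
                           (get-text-property 0 'charset decoded))))
        (when charset
          (puthash charset (1+ (gethash charset counts 0)) counts)))
      (setq num (1+ num)))
    (maphash (lambda (charset count) (push (cons charset count) result))
             counts)
    result))

;; Small sample range to keep the run short; omit the last two
;; arguments to walk all of 0..#x10FFFF.
(dolist (entry (my-ctext-charset-tally 'compound-text-with-extensions
                                       #xA0 #x250))
  (princ (format "%s\t%d\n" (car entry) (cdr entry))))
```

Passing 'ctext instead of 'compound-text-with-extensions gives the tally
for the other coding system.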

The results of converting to ctext are:

,----< tab-separated charset count >
| ipa   6
| lao   94
| tibetan       193
| chinese-big5-1        415
| chinese-big5-2        29
| chinese-cns11643-1    2257
| chinese-cns11643-2    6594
| chinese-cns11643-3    5705
| chinese-cns11643-4    7217
| chinese-cns11643-5    8599
| chinese-cns11643-6    6384
| chinese-cns11643-7    6539
| arabic-digit  9
| chinese-gb2312        7299
| latin-iso8859-1       96
| latin-iso8859-13      3
| latin-iso8859-14      27
| latin-iso8859-15      2
| latin-iso8859-16      4
| latin-iso8859-2       57
| latin-iso8859-3       22
| latin-iso8859-4       35
| cyrillic-iso8859-5    93
| arabic-iso8859-6      48
| greek-iso8859-7       77
| hebrew-iso8859-8      30
| katakana-jisx0201     63
| japanese-jisx0208     316
| japanese-jisx0212     124
| japanese-jisx0213-1   507
| japanese-jisx0213-2   250
| korean-ksc5601        2907
| thai-tis620   96
| mule-unicode-0100-24ff        7851
| mule-unicode-2500-33ff        3005
| mule-unicode-e000-ffff        7219
| vietnamese-viscii-lower       46
| vietnamese-viscii-upper       46
`----

As you can see, that is of no value.  It also fails to convert the vast
majority of non-BMP characters.

Converting to ctext-with-extensions gives somewhat better results:

,----< tab-separated charset count >
| latin-iso8859-1       96
| latin-iso8859-2       57
| latin-iso8859-3       22
| latin-iso8859-4       35
| cyrillic-iso8859-5    93
| arabic-iso8859-6      48
| greek-iso8859-7       77
| hebrew-iso8859-8      30
| thai-tis620   96
| latin-iso8859-13      3
| latin-iso8859-14      27
| latin-iso8859-15      2
| latin-iso8859-16      4
| katakana-jisx0201     63
| chinese-gb2312        7299
| japanese-jisx0208     316
| japanese-jisx0212     124
| korean-ksc5601        2907
| chinese-cns11643-1    2044
| chinese-cns11643-2    3307
| chinese-cns11643-3    1714
| chinese-cns11643-4    755
| chinese-cns11643-5    89
| chinese-cns11643-6    39
| chinese-cns11643-7    31
| utf-8 1093949
| japanese-jisx0213-1   507
| japanese-jisx0213-2   250
`----

As you can see, 8859-9 and 8859-10 are not generated, but that is
because all of their characters can be found in 8859-1 through -8,
so it is not a problem.
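
The claim about 8859-9 and 8859-10 can be checked with decode-char and
encode-char.  This is a rough sketch: the helper name my-uncovered-chars
is hypothetical, and it assumes the 8-bit charsets are available under
the names iso-8859-1 ... iso-8859-10 in the running Emacs.

```elisp
;; Hypothetical helper, not from the original message.
(defun my-uncovered-chars (charset others)
  "Return the characters of CHARSET that none of OTHERS can encode.
CHARSET and OTHERS name 8-bit charsets such as `iso-8859-9'; only the
code points #xA0..#xFF are examined."
  (let ((code #xA0)
        missing)
    (while (<= code #xFF)
      (let ((ch (decode-char charset code))
            (found nil))
        (when ch                        ; nil means an unassigned code point
          (dolist (other others)
            (when (encode-char ch other)
              (setq found t)))
          (unless found
            (push ch missing))))
      (setq code (1+ code)))
    (nreverse missing)))

(let ((one-through-eight '(iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4
                           iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8)))
  (dolist (charset '(iso-8859-9 iso-8859-10))
    (princ (format "%s: %S\n" charset
                   (my-uncovered-chars charset one-through-eight)))))
```

An empty list for a charset means every one of its upper-half characters
is reachable through at least one of 8859-1 through -8.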

But japanese-jisx0213-1 and japanese-jisx0213-2 need to go; they are
simply unknown by other COMPOUND_TEXT users.

It is clear that the current definition of compound-text is wrong;
I'd replace it with the current compound-text-with-extensions and keep
the latter name as an alias for backwards compatibility.

Then, we need to determine how to prevent Emacs from considering the
jisx0213-? charsets when converting to ctext.

And, perhaps, to prefer utf-8 over the gb, cns, ksc, and jisx charsets
when converting "narrow" characters (and ambiguous characters when in a
"narrow" or "non-cjk" locale).  Handa-san already did some comparable
work for font selection; what he did there is also needed here.

-JimC
-- 
James Cloos <address@hidden>         OpenPGP: 1024D/ED7DAEA6


