Re: clean mess encrusted on to Chinese pastes cut from outside emacs
From: Kenichi Handa
Subject: Re: clean mess encrusted on to Chinese pastes cut from outside emacs
Date: Thu, 5 Jul 2001 19:47:35 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.0.104 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)
Eli Zaretskii <eliz@is.elta.co.il> writes:
Eli> What is your selection-coding-system set to, and what was the
>> >> chinese-big5
>>
Eli> Why not compound-text?
>> Hmm, just tried that and got [cat -v output:]
>> ^[%/2M-^IBIG5-0^BM-^IBIG5-0^BM-^AM-%| 07 5 01:51:28 CST
>> where this is the ^ japanese yen symbol
>> ^[%/2M-^IBIG5-0^BM-^IBIG5-0^BM-%|^[(B 07 5 01:51:28 CST
>> instead of the chinese ^ "four" [si4] symbol
This is the same problem that a Debian developer reported to
me quite recently. It seems that the Big-5 user community
has started to use the above special encoding, following the
"Non-Standard Character Set Encodings" section of the
Compound Text spec. I'll attach that section at the tail.
Emacs 21.1 still can't handle this kind of extension to
Compound Text. To handle it, we need the following.
(1) Make encoding and decoding functions for such an
encoding (e.g. ctext-big5-encode-region and
ctext-big5-decode-region). For instance,
ctext-big5-decode-region should decode the regions
matching ESC % / 2 M L BIG5-0 ... specially, and decode
the other regions with the normal `ctext'.
(2) Make a new coding system, say ctext-big5, as below:
    (make-coding-system
     'ctext-big5
     0 ?B "Compound text with BIG5 extension of XFree86."
     nil
     '((post-read-conversion . ctext-big5-post-read-conversion)
       (pre-write-conversion . ctext-big5-pre-write-conversion)))
Here, ctext-big5-post-read-conversion and
ctext-big5-pre-write-conversion call
ctext-big5-decode-region and ctext-big5-encode-region
respectively with proper arguments.
(3) Modify x_encode_text of xfns.c so that it uses
code_convert_string to encode the data instead of
calling encode_coding directly. This is to handle
pre-write-conversion correctly.
(4) Modify selection_data_to_lisp_data of xselect.c so that
it calls code_convert_string to decode the Lisp string
instead of calling decode_coding directly. This is to
handle post-read-conversion correctly.
Actually, (3) and (4) are bug fixes.
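The decoding half of step (1) can be sketched outside Emacs as follows. This is only an illustration in Python, not the proposed Emacs Lisp implementation: the function name decode_ctext_big5 is hypothetical, Latin-1 decoding stands in for the normal `ctext' handling of the other regions, and the sample bytes are an assumption modelled on the cat -v output quoted above (with the M octet taken to be 0x80).

```python
ESC_EXT_2 = b"\x1b%/2"   # ESC % / 2: extended segment, 2 octets per character
STX = b"\x02"            # separates the encoding name from the text


def decode_ctext_big5(data: bytes) -> str:
    """Decode BIG5-0 extended segments specially, everything else plainly.

    Assumption: Latin-1 is used as a stand-in for real Compound Text
    decoding of the regions outside the extended segments.
    """
    out = []
    pos = 0
    while pos < len(data):
        start = data.find(ESC_EXT_2, pos)
        if start < 0:
            out.append(data[pos:].decode("latin-1"))
            break
        out.append(data[pos:start].decode("latin-1"))
        header = data[start + 4:start + 6]
        if len(header) < 2 or header[0] < 128 or header[1] < 128:
            # Malformed M/L octets: pass the introducer through unchanged.
            out.append(ESC_EXT_2.decode("latin-1"))
            pos = start + 4
            continue
        # Length covers the encoding name, the STX separator and the text.
        length = (header[0] - 128) * 128 + (header[1] - 128)
        end = start + 6 + length
        name, sep, text = data[start + 6:end].partition(STX)
        if sep and name == b"BIG5-0":
            out.append(text.decode("big5"))
        else:
            # Unknown encoding name: keep the raw segment visible.
            out.append(data[start:end].decode("latin-1"))
        pos = end
    return "".join(out)


sample = b"\x1b%/2\x80\x89BIG5-0\x02\xa5\x7c 07 5 01:51:28 CST"
print(decode_ctext_big5(sample))
```

With the sample above, the two octets 0xA5 0x7C inside the segment are handed to a Big5 decoder instead of being read as Latin-1, which is where the yen sign in the broken paste comes from.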
---
Ken'ichi HANDA
handa@etl.go.jp
6. Non-Standard Character Set Encodings
Character set encodings that are not in the list of approved
standard encodings can be included using ``extended
segments''. An extended segment begins with one of the
following sequences:
01/11 02/05 02/15 03/00 M L variable number of octets per character
01/11 02/05 02/15 03/01 M L 1 octet per character
01/11 02/05 02/15 03/02 M L 2 octets per character
01/11 02/05 02/15 03/03 M L 3 octets per character
01/11 02/05 02/15 03/04 M L 4 octets per character
[This uses the ``other coding system'' of ISO 2022, using
private Final characters.]
The ``M'' and ``L'' octets represent a 14-bit unsigned value
giving the number of octets that appear in the remainder of
the segment. The number is computed as ((M - 128) * 128) +
(L - 128). The most significant bits of M and L are always set
to one. The remainder of the segment consists of two parts,
the name of the character set encoding and the actual text.
The name of the encoding comes first and is separated from
the text by the octet 00/02 (STX, START OF TEXT). Note that
the length defined by M and L includes the encoding name and
separator.
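As a worked illustration of the layout just described, here is a minimal Python sketch that splits one extended segment into its parts. The names ExtendedSegment and parse_extended_segment are hypothetical, and the sample bytes are an assumption (a BIG5-0 segment carrying one two-octet character).

```python
from typing import NamedTuple


class ExtendedSegment(NamedTuple):
    octets_per_char: int  # 0 means "variable number of octets"
    name: str             # encoding name, e.g. "BIG5-0"
    text: bytes           # raw octets after the STX separator
    end: int              # offset of the first octet after the segment


def parse_extended_segment(data: bytes, pos: int = 0) -> ExtendedSegment:
    intro = data[pos:pos + 4]
    # 01/11 02/05 02/15 is ESC % /; the fourth octet is 03/00 .. 03/04.
    if len(intro) < 4 or intro[:3] != b"\x1b%/" or not 0x30 <= intro[3] <= 0x34:
        raise ValueError("not an extended segment introducer")
    if pos + 6 > len(data):
        raise ValueError("truncated before the M and L octets")
    m, l = data[pos + 4], data[pos + 5]
    if m < 0x80 or l < 0x80:
        raise ValueError("M and L must have their most significant bit set")
    # 14-bit length: counts the name, the separator and the text.
    length = (m - 128) * 128 + (l - 128)
    body = data[pos + 6:pos + 6 + length]
    if len(body) != length:
        raise ValueError("segment shorter than its declared length")
    name, sep, text = body.partition(b"\x02")  # 00/02 is STX
    if not sep:
        raise ValueError("missing STX separator")
    return ExtendedSegment(intro[3] - 0x30, name.decode("latin-1"),
                           text, pos + 6 + length)


seg = parse_extended_segment(b"\x1b%/2\x80\x89BIG5-0\x02\xa5\x7c rest")
print(seg)
```

In this sample M = 0x80 and L = 0x89, so the length is (0 * 128) + 9: six octets of name, one separator, and two octets of text.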
[The encoding of the length is chosen to avoid having zero
octets in Compound Text when possible, because embedded NUL
values are problematic in many C language routines. The use
of zero octets cannot be ruled out entirely however, since
some octets in the actual text of the extended segment may
have to be zero.]
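The encoding direction can be sketched the same way. The following Python function, whose name build_extended_segment is hypothetical, packs the 14-bit length into M and L with the high bit set, so neither length octet can be zero; only the text itself may still contain zero octets.

```python
def build_extended_segment(name: str, text: bytes,
                           octets_per_char: int = 0) -> bytes:
    """Wrap already-encoded text in a Compound Text extended segment."""
    if not 0 <= octets_per_char <= 4:
        raise ValueError("octets_per_char must be 0 (variable) to 4")
    name_bytes = name.encode("latin-1")
    # The length covers the name, the STX separator and the text.
    length = len(name_bytes) + 1 + len(text)
    if length > 0x3FFF:
        raise ValueError("segment too long for a 14-bit length")
    m = 0x80 | (length >> 7)     # upper 7 bits, most significant bit set
    l = 0x80 | (length & 0x7F)   # lower 7 bits, most significant bit set
    header = b"\x1b%/" + bytes([0x30 + octets_per_char, m, l])
    return header + name_bytes + b"\x02" + text


segment = build_extended_segment("BIG5-0", "\u56db".encode("big5"), 2)
print(segment)
```

Text longer than the 14-bit limit would have to be split across several segments; that is left out of this sketch.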
The name of the encoding should be registered with the X
Consortium to avoid conflicts and should when appropriate
match the CharSet Registry and Encoding registration used in
the X Logical Font Description. The name itself should be
encoded using ISO 8859-1 (Latin 1), should not use question
mark (03/15) or asterisk (02/10), and should use hyphen
(02/13) only in accordance with the X Logical Font
Description.
Extended segments are not to be used for any character set
encoding that can be constructed from a GL/GR pair of
approved standard encodings. For example, it is incorrect to
use an extended segment for any of the ISO 8859 family of
encodings.
It should be noted that the contents of an extended segment
are arbitrary; for example, they may contain octets in the
C0 and C1 ranges, including 00/00, and octets comprising a
given character may differ in their most significant bit.
[ISO-registered ``other coding systems'' are not used in
Compound Text; extended segments are the only mechanism for
non-2022 encodings.]