bug-gnu-emacs
Re: clean mess encrusted on to Chinese pastes cut from outside emacs


From: Kenichi Handa
Subject: Re: clean mess encrusted on to Chinese pastes cut from outside emacs
Date: Thu, 5 Jul 2001 19:47:35 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.0.104 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

Eli Zaretskii <eliz@is.elta.co.il> writes:
Eli>  What is your selection-coding-system set to, and what was the
>>  >> chinese-big5
>>  
Eli>  Why not compound-text?
>>  Hmm, just tried that and got  [cat -v output:]
>>  ^[%/2M-^IBIG5-0^BM-^IBIG5-0^BM-^AM-%| 07  5 01:51:28 CST
>>  where this is the               ^ japanese yen symbol
>>  ^[%/2M-^IBIG5-0^BM-^IBIG5-0^BM-%|^[(B 07  5 01:51:28 CST
>>  instead of the chinese         ^ "four" [si4] symbol

This is the same problem that a Debian developer reported to
me quite recently.  It seems that the Big-5 user community
has started to use the above special encoding, following the
"Non-Standard Character Set Encodings" section of the
Compound Text spec.  I'll attach that section at the tail.

Emacs 21.1 still can't handle this kind of extension to
Compound Text.  To handle it, we need the following.

(1) Make encoding and decoding functions for such an
    encoding (e.g. ctext-big5-encode-region and
    ctext-big5-decode-region).  For instance,
    ctext-big5-decode-region should specially decode each
    region matching ESC % / 2 M L BIG5-0 ... and decode the
    other regions with the normal `ctext'.

(2) Make a new coding system, say ctext-big5 as below:

(make-coding-system
 'ctext-big5
 0 ?B "Compound text with BIG5 extension of XFree86."
 nil
 '((post-read-conversion . ctext-big5-post-read-conversion)
   (pre-write-conversion . ctext-big5-pre-write-conversion)))

Here, ctext-big5-pre-write-conversion and
ctext-big5-post-read-conversion call
ctext-big5-encode-region and ctext-big5-decode-region
respectively, with the proper arguments.

(3) Modify x_encode_text of xfns.c so that it uses
    code_convert_string to encode the data instead of
    calling encode_coding directly.  This is to handle
    pre-write-conversion correctly.

(4) Modify selection_data_to_lisp_data of xselect.c so that
    it calls code_convert_string to decode the Lisp string
    instead of calling decode_coding directly.  This is to
    handle post-read-conversion correctly.

Actually, (3) and (4) are bug fixes.
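
As a rough illustration of the region matching that step (1) needs, here is a minimal C sketch that locates a segment of the form ESC % / 2 M L name STX text in a byte buffer. It is not Emacs code: the names `struct ext_segment` and `find_ext_segment` are hypothetical, and the byte-by-byte scan is a simplifying assumption (a real decoder would walk the Compound Text stream segment by segment, so that it never matches inside another segment's data).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: byte offsets into the scanned buffer. */
struct ext_segment {
    size_t start;       /* offset of the introducing ESC              */
    size_t text_start;  /* offset of the first text octet after STX   */
    size_t end;         /* offset one past the last octet of the text */
};

/* Find the first "ESC % / 2 M L name STX text" segment at or after
   FROM whose encoding name equals NAME.  Return 1 and fill *OUT if
   one is found, 0 otherwise.  Truncated or malformed candidates are
   skipped rather than reported. */
int find_ext_segment(const unsigned char *buf, size_t n, size_t from,
                     const char *name, struct ext_segment *out)
{
    size_t name_len = strlen(name);

    for (size_t i = from; i + 6 <= n; i++) {
        if (buf[i] != 0x1B || buf[i + 1] != '%'
            || buf[i + 2] != '/' || buf[i + 3] != '2')
            continue;

        unsigned char m = buf[i + 4], l = buf[i + 5];
        if (!(m & 0x80) || !(l & 0x80))
            continue;               /* high bits must be set */

        /* Length covers the name, the STX separator and the text. */
        size_t len = (size_t)(m - 128) * 128 + (size_t)(l - 128);
        size_t body = i + 6;
        if (len > n - body || len < name_len + 1)
            continue;
        if (memcmp(buf + body, name, name_len) != 0
            || buf[body + name_len] != 0x02)
            continue;

        out->start = i;
        out->text_start = body + name_len + 1;
        out->end = body + len;
        return 1;
    }
    return 0;
}
```

Called with the name "BIG5-0", the octets between `text_start` and `end` are the ones that would be handed to a Big5 decoder, while everything outside `start`..`end` would go through the normal ctext decoding.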

---
Ken'ichi HANDA
handa@etl.go.jp


6.  Non-Standard Character Set Encodings

Character set encodings that are not in the list of approved
standard encodings can be included using ``extended
segments''.  An extended segment begins with one of the
following sequences:

     01/11 02/05 02/15 03/00 M L   variable number of octets per character
     01/11 02/05 02/15 03/01 M L   1 octet per character
     01/11 02/05 02/15 03/02 M L   2 octets per character
     01/11 02/05 02/15 03/03 M L   3 octets per character
     01/11 02/05 02/15 03/04 M L   4 octets per character

[This uses the ``other coding system'' of  ISO  2022,  using
private Final characters.]
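
The five introducers above differ only in their fourth octet (03/00 through 03/04, i.e. ASCII '0' through '4'). A small C sketch of recognizing them follows; the function name `ctext_ext_octets_per_char` and its return convention are assumptions made for illustration, not part of the spec or of any X library.

```c
#include <stddef.h>

/* Classify an extended-segment introducer at P (N octets available).
   Returns 0 for "variable number of octets per character", 1..4 for
   a fixed number of octets per character, and -1 if P does not start
   with ESC % / followed by one of '0'..'4'. */
int ctext_ext_octets_per_char(const unsigned char *p, size_t n)
{
    if (n < 4)
        return -1;
    if (p[0] != 0x1B        /* 01/11  ESC */
        || p[1] != 0x25     /* 02/05  %   */
        || p[2] != 0x2F)    /* 02/15  /   */
        return -1;
    if (p[3] < 0x30 || p[3] > 0x34)   /* 03/00 .. 03/04 */
        return -1;
    return p[3] - 0x30;
}
```

For the ESC % / 2 introducer discussed in this thread, the function returns 2.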

The ``M'' and ``L'' octets represent a 14-bit unsigned value
giving  the number of octets that appear in the remainder of
the segment.  The number is computed as ((M - 128) * 128)  +
(L  - 128).  The most significant bit of M and L is always set
to one.  The remainder of the segment consists of two parts,
the  name of the character set encoding and the actual text.
The name of the encoding comes first and is  separated  from
the text by the octet 00/02 (STX, START OF TEXT).  Note that
the length defined by M and L includes the encoding name and
separator.
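
The length arithmetic can be sketched in C as below. The function names are hypothetical. As a worked case, M = 0x80 and L = 0x89 give ((128 - 128) * 128) + (137 - 128) = 9, which is the 6 octets of "BIG5-0", 1 octet of STX, and 2 octets of text; the octet 0x89 is what `cat -v` prints as `M-^I`, which seems consistent with the output quoted earlier in this thread.

```c
/* Decode the 14-bit length carried by the M and L octets.
   Returns -1 if either octet lacks its most significant bit. */
int ctext_ext_length(unsigned char m, unsigned char l)
{
    if (!(m & 0x80) || !(l & 0x80))
        return -1;
    return (m - 128) * 128 + (l - 128);
}

/* Encode LEN (name + STX + text) into *M and *L.
   Returns 0 on success, -1 if LEN does not fit in 14 bits. */
int ctext_ext_encode_length(unsigned len, unsigned char *m, unsigned char *l)
{
    if (len > 0x3FFF)
        return -1;
    *m = (unsigned char)(0x80 | (len >> 7));     /* upper 7 bits */
    *l = (unsigned char)(0x80 | (len & 0x7F));   /* lower 7 bits */
    return 0;
}
```

The largest length that can be expressed this way is 16383 octets (M = L = 0xFF), so longer text has to be split over several extended segments.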

[The encoding of the length is chosen to avoid  having  zero
octets  in Compound Text when possible, because embedded NUL
values are problematic in many C language routines.  The use
of  zero  octets cannot be ruled out entirely however, since
some octets in the actual text of the extended  segment  may
have to be zero.]
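
To make the point about zero octets concrete, here is a hedged C sketch that assembles a complete 2-octets-per-character extended segment. The function name `build_ext_segment` is hypothetical and the introducer is hard-wired to ESC % / 2 for brevity. Because M and L always have their high bit set, the six header octets never contain a NUL; only the caller-supplied text can.

```c
#include <stddef.h>
#include <string.h>

/* Write "ESC % / 2 M L name STX text" into OUT (capacity CAP).
   Returns the number of octets written, or 0 if the segment would
   exceed the 14-bit length limit or does not fit in OUT. */
size_t build_ext_segment(unsigned char *out, size_t cap, const char *name,
                         const unsigned char *text, size_t text_len)
{
    size_t name_len = strlen(name);

    if (name_len > 0x3FFF || text_len > 0x3FFF)
        return 0;
    size_t len = name_len + 1 + text_len;   /* name + STX + text */
    if (len > 0x3FFF || cap < 6 + len)
        return 0;

    out[0] = 0x1B;                          /* ESC */
    out[1] = 0x25;                          /* %   */
    out[2] = 0x2F;                          /* /   */
    out[3] = 0x32;                          /* 2: two octets per char */
    out[4] = (unsigned char)(0x80 | (len >> 7));    /* M */
    out[5] = (unsigned char)(0x80 | (len & 0x7F));  /* L */
    memcpy(out + 6, name, name_len);
    out[6 + name_len] = 0x02;               /* STX */
    if (text_len > 0)
        memcpy(out + 7 + name_len, text, text_len);
    return 6 + len;
}
```

Building a segment named "BIG5-0" around the two octets 0xA5 0x7C (shown as `M-%|` by `cat -v`) yields 15 octets, with M = 0x80 and L = 0x89.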

The name of the encoding should be registered with the X
Consortium to avoid conflicts and should when appropriate
match the CharSet Registry and Encoding registration used in
the X Logical Font Description.  The name itself should be
encoded using ISO 8859-1 (Latin 1), should not use question
mark (03/15) or asterisk (02/10), and should use hyphen
(02/13) only in accordance with the X Logical Font
Description.

Extended segments are not to be used for any  character  set
encoding  that  can  be  constructed  from  a  GL/GR pair of
approved standard encodings. For example, it is incorrect to
use  an  extended  segment for any of the ISO 8859 family of
encodings.

It should be noted that the contents of an extended  segment
are  arbitrary;  for example, they may contain octets in the
C0 and C1 ranges, including 00/00, and octets  comprising  a
given character may differ in their most significant bit.

[ISO-registered ``other coding systems''  are  not  used  in
Compound  Text; extended segments are the only mechanism for
non-2022 encodings.]



