emacs-devel

Re: eight-bit char handling in emacs-unicode


From: Kenichi Handa
Subject: Re: eight-bit char handling in emacs-unicode
Date: Fri, 21 Nov 2003 09:41:47 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.3 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

In article <jwvptfp139w.fsf-monnier+emacs/address@hidden>, Stefan Monnier 
<address@hidden> writes:
>>  I see.  Apart from the design itself, I agree that it's difficult to
>>  introduce a new type.  But when I discussed the Character type
>>  object with Richard a few years ago, he was not that negative,
>>  provided that it gives a clear improvement.

> Sounds about right to me: we have one free tag that we could use for chars

Yes, and as that is the last free tag, I still hesitate to
consume it for the Character object.

>>  Then, we can't use string-make-unibyte for the current case
>>  because, in emacs-unicode, (concat '(?a 192)) returns a
>>  multibyte string whose second element is A-grave, not an
>>  eight-bit-char.  Am I missing something?

> Well, obviously we need to make it accept this case (i.e. accept both the
> latin-1 192 and the eight-bit-char 192).

Now I see your intention.  But aren't the semantics of
such a function very weird?

>>>  To do what your string-make-unibyte does you should use
>>>  `encode-coding-string' where the coding system is passed explicitly.

>>  Those are conceptually different things (I remember the
>>  similar discussion we had a while ago).

>>  encode-coding-string does:
>>  char-sequence --CCS-set--> (CCS/codepoint-pair)-sequence
>>    --CES-->  encoded-byte-sequence

>>  string-make-unibyte does:
>>  char-sequence --CCS--> code-point-sequence
>>    --concat-->  code-point-sequence

>>  These two yield the same result only when the CCS supports
>>  all chars in "char-sequence" and the CES is stateless
>>  (e.g. iso-latin-1).

> You lost me here (I'm a poor soul who doesn't know much outside of the
> latin-1 world).

CCS: Coded Character Set
CES: Character Encoding Scheme
coding-system of Emacs: a set of CCSs plus a CES.
   iso-latin-1: CCSs are ascii and latin-iso8859-1, 
                CES is 8-bit version of ISO-2022
   iso-2022-jp: CCSs are ascii, japanese-jisx0208, ...
                CES is 7-bit version of ISO-2022
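
As an illustration outside Emacs, the comparison quoted above can be
sketched with Python codecs standing in for Emacs coding systems.  The
helper code_points_concat is a hypothetical stand-in for the CCS-only
path (it is not an Emacs function), and Latin-1 is assumed as the CCS.
The two paths agree for the stateless latin-1 case and differ for the
stateful iso-2022-jp case:

```python
# Analogy outside Emacs: Python codecs stand in for Emacs coding systems.

def code_points_concat(text: str) -> bytes:
    """Hypothetical CCS-only path: map each character to its Latin-1
    code point and concatenate the results."""
    # For Latin-1 the CCS code point equals the Unicode code point.
    # bytes() raises ValueError for any character the CCS cannot represent.
    return bytes(ord(ch) for ch in text)

latin = "a\u00c0"                 # 'a' followed by A-grave
# Stateless CES: the encoding is just the code points laid end to end.
assert latin.encode("latin-1") == code_points_concat(latin) == b"a\xc0"

# Stateful CES: ISO-2022-JP wraps JIS X 0208 code points in escape sequences.
kanji = "\u65e5"                  # JIS X 0208 code point 0x467C
encoded = kanji.encode("iso2022_jp")
code_point_bytes = b"\x46\x7c"
assert encoded == b"\x1b$B" + code_point_bytes + b"\x1b(B"
assert encoded != code_point_bytes
```

The escape sequences in the second case are what the CES step adds on
top of the bare code points.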

> I thought that string-make-unibyte only behaves meaningfully for
> "normal 8bit coding-systems" such as latin-1.

Yes, but that doesn't mean it is conceptually the same as
encode-coding-string.  The result of string-make-unibyte
should still be regarded as a sequence of characters, but the
result of encode-coding-string is a sequence of bytes.
This is where the ambiguity of a unibyte string lies.

The number 192 can be regarded as:
(1) just a number, a byte
(2) a code point of some character set
(3) a character code

A unibyte string can contain (1) and (2) without
distinguishing them, but a multibyte string can contain (1)
and (3) while distinguishing them.
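
The three readings of 192 can be sketched in Python, again only as an
analogy.  The surrogateescape error handler is Python's own way of
carrying a raw byte inside a text string; it is used here as a rough
counterpart of an eight-bit-char, not as a description of how Emacs
represents one.

```python
raw = bytes([192])                       # (1) just a number, a byte

# (2) a code point of some character set: the character depends on the set.
as_latin1 = raw.decode("iso8859-1")      # A-grave, U+00C0
as_cyrillic = raw.decode("iso8859-5")    # Cyrillic capital ER, U+0420

# (3) a character code: 192 names one specific character.
as_char = chr(192)

# A text string can hold a raw byte and a character side by side and
# still tell them apart (assumption: surrogateescape plays the role of
# the eight-bit-char here).
as_raw_in_str = raw.decode("ascii", errors="surrogateescape")
mixed = as_raw_in_str + as_char

assert as_latin1 == as_char == "\u00c0"
assert as_cyrillic == "\u0420"
assert as_raw_in_str != as_char
# Encoding restores the raw byte untouched and encodes the character.
assert mixed.encode("utf-8", errors="surrogateescape") == b"\xc0\xc3\x80"
```

A bytes object, like a unibyte string, gives no hint whether its 192
is reading (1) or reading (2).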

---
Ken'ichi HANDA
address@hidden



