Re: Emacs 23 character code space


From: Kenichi Handa
Subject: Re: Emacs 23 character code space
Date: Thu, 27 Nov 2008 10:29:50 +0900
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/23.0.60 (i686-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)

In article <address@hidden>, Eli Zaretskii <address@hidden> writes:

> > For instance, to get a glyph-code of an X font, we decode a
> > character by the charset with which the font encodes glyph
> > codes.

> But that's not really "decoding", is it?  By "decoding" we usually
> mean conversion _to_ the Emacs internal representation, whereas in
> your example, we convert _from_ the internal representation to some
> other.

Oops, sorry, I myself confused decoding and encoding.  Yes,
the above is encoding.  And I made the same mistake in my
follow-up mail.

> To avoid confusion, I suggest talking about "conversion" of Emacs
> characters to code points of a charset.  Do you agree?

As we have the functions encode-char and decode-char, I think it
is better to keep using the words "encoding" and "decoding"
for both kinds of conversions; i.e. character <->
(charset . code-point), and string/buffer <-> byte-sequence.
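
For example (a small sketch; the return values are what I'd expect
from the standard charset definitions in Emacs 23):

  (encode-char ?あ 'japanese-jisx0208)    ; => 9250 (#x2422)
  (decode-char 'japanese-jisx0208 #x2422) ; => 12354 (#x3042, ?あ)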

> > From: Kenichi Handa <address@hidden>
[...]
> > I'll explain it a little bit more.  To decode a character
> > sequence to a byte sequence, Emacs actually does two kinds
> > of decoding as below:

As I wrote above, I made a mistake here.  So I'll paraphrase
it below.

To convert between a character sequence and a byte sequence,
Emacs actually performs two conversion steps, as shown below.


characters --(1)-> (charset code-point) pairs --(3)-> bytes
           <-(2)--                            <-(4)--     

For the encoding of (1), Emacs uses information from the coding
system to decide which charset to use, and then uses information
from the selected charset to get a code point.  For the decoding
of (2), Emacs uses information from the charset to get character
codes.

For the encoding of (3) and the decoding of (4), Emacs uses
only information from the coding system.
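
From Lisp, the two steps are not visible separately;
encode-coding-string performs (1) and (3) in one call, and
decode-coding-string performs (4) and (2).  A sketch (the byte
values assume the standard utf-8 definition):

  (encode-coding-string "aあx" 'utf-8)
  ;; => a unibyte string of the bytes #x61 #xE3 #x81 #x82 #x78
  (decode-coding-string "a\343\201\202x" 'utf-8)
  ;; => the multibyte string "aあx"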

> Can you give a couple of examples, for some popular charsets, and how
> we decode bytes into characters thru these pairs of charsets and code
> points?

Ok.

Ex.1  utf-8

(1) and (2) are straightforward because the charset is
`unicode' and the Emacs character code and the code point in
`unicode' are the same.  (3) encodes each (unicode
CODE-POINT) pair into a utf-8 byte sequence, and (4) does the
reverse conversion.

 "a\x3042x" -(1)-> (unicode #x61) (unicode #x3042) (unicode #x78)
            -(3)-> "#x61 #xE3 #x81 #x82 #x78"

Ex.2 iso-8859-2

(1) encodes each character to a code point of the charset
iso-8859-2 using the information of that charset, and (2) does
the reverse conversion.  (3) and (4) are straightforward
because the code-point sequence and the byte sequence are
the same.
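
For instance (if I remember the Latin-2 table correctly, ą is at
code point #xB1, so the encoded byte has the same value):

  (encode-char ?ą 'iso-8859-2)            ; => #xB1
  (encode-coding-string "ą" 'iso-8859-2)  ; => the single byte #xB1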

Ex.3 iso-2022-jp (japanese)

(1) first decides which charset (among those supported by
iso-2022-jp) to use for each character, and then encodes the
character to the corresponding (charset code-point) pair.  (2)
does the decoding using information from the charset only.  (3)
generates a byte sequence from each code point (one byte for
a charset of dimension 1, two bytes for a charset of
dimension 2), and also inserts a proper designation byte
sequence at each charset boundary.
 "a\x3042x" -(1)-> (ascii #x61) (japanese-jisx0208 #x2422) (aciii #x78)
            -(3)-> "#x61 ESC $ B #x24 #x22 ESC ( B #x78"

Ex.4 gb2312 (chinese)

 "a\x3042x" -(1)-> (ascii #x61) (chinese-gb2312 #x2422) (aciii #x78)
            -(3)-> "#x61 #xA4 #xA2 #x78"

> Thanks.  What confuses me is that, roughly, there's a charset in Emacs
> 23 for every coding-system, and they both have almost identical names.

But there are coding systems that have multiple charsets.
For instance, the big5 coding system supports both the ascii and
big5 charsets, and iso-2022-7bit supports very many charsets.
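
You can see the list with coding-system-charset-list (a sketch;
the exact values depend on how these coding systems are defined
in your Emacs):

  (coding-system-charset-list 'big5)
  ;; => (ascii big5), I'd expect
  (coding-system-charset-list 'iso-2022-7bit)
  ;; => I think this returns the symbol `iso-2022', meaning all
  ;;    ISO-2022 charsets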

> For example, the code point of a-umlaut in the iso-8859-1 charset is
> exactly identical to the byte value produced by encoding that
> character with iso-8859-1 coding-system.  So I wonder why we need
> both in Emacs.  Why can't we, for example, decode bytes directly into
> Emacs characters?

Getting a code point from a byte sequence and getting a
character code from a code point are, in general, different
operations (the above iso-8859-1 case is a rather rare example
where they coincide).  I hope the examples above make it clear
why.
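
To see the difference concretely (a sketch; the values assume the
standard charset definitions):

  ;; iso-8859-1: byte value, code point, and character code coincide
  (decode-char 'iso-8859-1 #xE4)          ; => 228 (#xE4, ä)
  ;; japanese-jisx0208: all three differ
  (decode-char 'japanese-jisx0208 #x2422) ; => 12354 (#x3042, あ)
  ;; and the bytes for that code point depend on the coding system:
  ;; #x24 #x22 inside escape sequences for iso-2022-jp, but
  ;; #xA4 #xA2 for euc-jp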

---
Kenichi Handa
address@hidden



