emacs-devel

Re: utf-8.el


From: Kenichi Handa
Subject: Re: utf-8.el
Date: Wed, 19 Jan 2005 15:15:05 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.3.50 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

In article <address@hidden>, Stefan Monnier <address@hidden> writes:

>>  subst-tables are not preloaded.  They are automatically
>>  loaded in utf-8-post-read-conversion but it runs after
>>  ccl-decode-mule-utf-8 is executed.  And the arg hash-table
>>  becomes non-nil only when subst-tables are loaded.

> Oh, so the elisp code indeed does the same thing.  And that means it's only
> really used at most once per Emacs session (since after it's executed, the
> hash-table will be active directly in ccl-decode-mule-utf-8).  Right?

Right, except in the case where a user turns
utf-translate-cjk-mode off once.

>>>  I also don't understand the following part of
>>>  the code:

>>>  (if (= l 2)
>>>  (put-text-property (point) (min (point-max) (+ l (point)))
>>>  'display (format "\\%03o" ch))
>>>  (compose-region (point) (+ l (point)) ?�))

>>>  what does it mean for l (the number of bytes) to be equal to 2?

>>  The docstring of ccl-untranslated-to-ucs is not clear.  In
>>  "Set r1 to the byte length", the byte length means how many
>>  of r0, r1, r2, r3 (each of them contains a byte) contribute
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>  to a unicode character (or an invalid byte).

The "^^^^" part is not accurate.  It should read: "The first
few of them that contribute to a unicode character or an
invalid byte contain eight-bit characters (thus are byte
values)."

> So it's the number of bytes used in the buffer's internal representation
> (i.e. emacs-mule), not the number of bytes used in the utf-8 representation?

No, it's the number of characters.  r0..r3 are the same as
utf-8-ccl-regs[0]..[3] set by utf-8-untranslated-to-ucs.

>>  If l is 2, that means an invalid byte was converted to
>>  two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
>>  eight-bit-control/graphic.

> And that's because any other utf-8 char maps to either a 3-byte sequence
> (in a mule-unicode-NNNN-MMMM charset) or if it maps to a 2-byte sequence
> (like latin-1) it won't pass through this code anyway?

Yes.

>>  In that case, it is better to
>>  display that sequence by octal instead of showing ?�.

> Yes, I understand this part.  I just have a hard time following the
> reasoning that gets us to the point where we know that (= l 2) implies that
> it's a single eight-bit-control or eight-bit-graphic char.

Not accurate.  As I wrote above, (= l 2) implies it's an
originally invalid byte represented by a 2-byte sequence of
an eight-bit-graphic char and an eight-bit-control char.
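
To make the octal display concrete, here is a minimal sketch of
what the format call in the quoted code produces.  The byte value
below is hypothetical, chosen only for illustration; it does not
come from the code under discussion.

```elisp
;; Hypothetical byte value, for illustration only.
(let ((ch #xA9))
  ;; "\\%03o" is a literal backslash followed by the value as
  ;; three zero-padded octal digits.
  (format "\\%03o" ch))
;; => "\\251"  (four characters: a backslash, then 251)
```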

>>>  -      ;; Can't do eval-when-compile to insert a multibyte constant
>>>  -      ;; version of the string in the loop, since it's always loaded as
>>>  -      ;; unibyte from a byte-compiled file.
>>>  -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>>>  +      (let ((range "^\xc0-\xc3\xe1-\xf7")

>>  This change is not good because range is set to a unibyte
>>  string and regexp search converts it to a multibyte
>>  string by `make-multibyte-string'.  Here what we need is a
>>  multibyte string that contains eight-bit-graphic/control
>>  chars.

> I know that's what the comment says, but my tests lead me to believe that
> the comment is not correct and that the string's multibyteness is
> correctly preserved.

Ah!  I'd forgotten that the "\x" notation in a string forces
the string to be read as multibyte in the latest Emacs.  That
wasn't the case in 21.3.

So, yes, now your change is ok.
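
One way to check this on a given Emacs is to ask whether the
reader produced a multibyte string for the literal from the
patch.  This is only a sketch; the answer depends on the Emacs
version in use, so no expected value is shown.

```elisp
;; Returns non-nil if the reader made this literal a multibyte
;; string, nil if it was read as unibyte.  The result differs
;; between Emacs versions.
(multibyte-string-p "^\xc0-\xc3\xe1-\xf7")
```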

---
Ken'ichi HANDA
address@hidden



