Re: utf-8.el
From: Kenichi Handa
Subject: Re: utf-8.el
Date: Wed, 19 Jan 2005 15:15:05 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.3.50 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

In article <address@hidden>, Stefan Monnier <address@hidden> writes:
>> subst-tables are not preloaded. They are automatically
>> loaded in utf-8-post-read-conversion but it runs after
>> ccl-decode-mule-utf-8 is executed. And the arg hash-table
>> becomes non-nil only when subst-tables are loaded.
> Oh, so the elisp code indeed does the same thing. And that means it's only
> really used at most once per Emacs session (since after it's executed, the
> hash-table will be active directly in ccl-decode-mule-utf-8). Right?
Right, except for the case where a user has turned
utf-translate-cjk-mode off once.
>>> I also don't understand the following part of
>>> the code:
>>>   (if (= l 2)
>>>       (put-text-property (point) (min (point-max) (+ l (point)))
>>>                          'display (format "\\%03o" ch))
>>>     (compose-region (point) (+ l (point)) ?�))
>>> what does it mean for l (the number of bytes) to be equal to 2?
>> The docstring of ccl-untranslated-to-ucs is not clear. In
>> "Set r1 to the byte length", the byte length means how many
>> of r0, r1, r2, r3 (each of them contains a byte) contribute
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> to a unicode character (or an invalid byte).
The "^^^^" part is not accurate. It should read: "The first
few of them that contribute to a Unicode character or an
invalid byte contain eight-bit characters (and thus are byte
values)."
> So it's the number of bytes used in the buffer's internal representation
> (i.e. emacs-mule), not the number of bytes used in the utf-8 representation?
No, it's the number of characters. r0..r3 are the same as
utf-8-ccl-regs[0]..[3] set by utf-8-untranslated-to-ucs.
>> If l is 2, that means an invalid byte was converted to
>> two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
>> eight-bit-control/graphic.
> And that's because any other utf-8 char maps to either a 3-byte sequence
> (in a mule-unicode-NNNN-MMMM charset) or if it maps to a 2-byte sequence
> (like latin-1) it won't pass through this code anyway?
Yes.
>> In that case, it is better to
>> display that sequence by octal instead of showing ?�.
> Yes, I understand this part. I just have a hard time following the
> reasoning that gets us to the point where we know that (= l 2) implies that
> it's a single eight-bit-control or eight-bit-graphic char.
Not accurate. As I wrote above, (= l 2) implies it's an
originally invalid byte represented by a two-char sequence of
an eight-bit-graphic char (#xC2 or #xC3) and an
eight-bit-control/graphic char.
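
As a small illustration of the octal form chosen in the (= l 2)
branch, here is a minimal sketch, assuming the invalid byte value is
already in hand. The helper name `my-octal-display-string' is
hypothetical and not part of utf-8.el; only the format string is
taken from the code quoted above.

```elisp
;; Hypothetical helper for illustration; not a function from utf-8.el.
;; Only the format string "\\%03o" comes from the quoted code.
(defun my-octal-display-string (byte)
  "Return BYTE as a backslash followed by three octal digits."
  (format "\\%03o" byte))

(my-octal-display-string #xE9)   ; a four-character string: \351
(my-octal-display-string #x80)   ; a four-character string: \200
```

The quoted code attaches such a string as a `display' property, so
the buffer text itself is left unchanged and only its appearance
differs.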
>>> - ;; Can't do eval-when-compile to insert a multibyte constant
>>> - ;; version of the string in the loop, since it's always loaded as
>>> - ;; unibyte from a byte-compiled file.
>>> - (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
>>> + (let ((range "^\xc0-\xc3\xe1-\xf7")
>> This change is not good because range is set to a unibyte
>> string and regexp search converts it to a multibyte
>> string by `make-multibyte-string'. Here what we need is a
>> multibyte string that contains eight-bit-graphic/control
>> chars.
> I know that's what the comment says, but my tests lead me to believe that
> the comment is not correct and that the string's multibyteness is
> correctly preserved.
Ah! I had forgotten that the "\x" notation in a string forces
the string to be read as multibyte in the latest Emacs. That
was not the case in 21.3.
So, yes, now your change is ok.
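
One way to check this on a given Emacs is to ask whether the literal
comes out of the reader as multibyte. This is only a minimal sketch:
the variable name `range' is the one from the patch above, and the
result is deliberately not stated here because it depends on the
Emacs version and on whether the code is byte-compiled.

```elisp
;; Reports whether the reader produced a multibyte string for this literal.
(let ((range "^\xc0-\xc3\xe1-\xf7"))
  (message "range is %s"
           (if (multibyte-string-p range) "multibyte" "unibyte")))
```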
---
Ken'ichi HANDA
address@hidden