emacs-devel
Re: utf-8.el


From: Kenichi Handa
Subject: Re: utf-8.el
Date: Wed, 19 Jan 2005 11:51:14 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.3.50 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

In article <address@hidden>, Stefan Monnier <address@hidden> writes:

> Does anyone see a problem with the simple patch below?

See the comment below.

> Also, could anyone confirm that the docstring of mule-utf-8 is correct in
> saying that invalid utf-8 sequences are not always correctly preserved?
> Why is that?  Can't we fix it?

I remember that I fixed ccl-mule-utf-8-encode-untrans to preserve
invalid utf-8 sequences as far as possible.  So the current
version probably preserves even invalid sequences correctly.

I've just run the code below for a fairly long time and saw no errors.

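(defun temp ()
  "Loop forever, checking that random high-byte strings survive utf-8."
  (let ((count 0))
    (while t
      (setq count (1+ count))
      (message "%d" count)
      ;; Build a unibyte string of 6 to 11 random bytes in 128..255.
      (let* ((len (+ 6 (random 6)))
             (str (make-string len 0)))
        (dotimes (i len)
          (aset str i (+ 128 (random 128))))
        ;; Decoding and then encoding must give back the same bytes.
        (or (equal str
                   (encode-coding-string
                    (decode-coding-string str 'utf-8) 'utf-8))
            ;; Keep the failing string in the global `error-string'
            ;; so that it can be inspected afterwards.
            (error "%s caused error" (setq error-string str)))))))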
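
As a complement to the random test above, here is a minimal sketch of a
deterministic spot check on a few hand-picked invalid sequences.  The
function name and the sample list are hypothetical and not part of
utf-8.el; the result may depend on the Emacs version in use.

```elisp
;; Sketch only: `my-utf-8-round-trip-failures' and the sample list are
;; hypothetical, not part of utf-8.el.
(defun my-utf-8-round-trip-failures (strings)
  "Return the members of STRINGS that a utf-8 decode/encode round trip changes."
  (let (failures)
    (dolist (str strings)
      (unless (equal str
                     (encode-coding-string
                      (decode-coding-string str 'utf-8) 'utf-8))
        (push str failures)))
    (nreverse failures)))

;; Hand-picked invalid input: an overlong form, two bytes that never
;; appear in utf-8, and a lone continuation byte.
(my-utf-8-round-trip-failures '("\xc0\x80" "\xff\xfe" "\x80"))
```

An empty result means that every sample was preserved; any string that
comes back is one that the round trip changed.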

> Also could anyone explain to me why `utf-8-compose' needs to lookup the
> hashtable (get 'utf-subst-table-for-decode 'translation-hash-table), since
> it looks to me like ccl-decode-mule-utf-8 already takes care of decoding
> chars that are in this table.

The subst-tables are not preloaded.  They are loaded
automatically in utf-8-post-read-conversion, but that runs
after ccl-decode-mule-utf-8 has been executed.  And the arg
hash-table becomes non-nil only once the subst-tables are loaded.

> I also don't understand the following part of
> the code:

>         (if (= l 2)
>             (put-text-property (point) (min (point-max) (+ l (point)))
>                                'display (format "\\%03o" ch))
>           (compose-region (point) (+ l (point)) ?�))

> what does it mean for l (the number of bytes) to be equal to 2?

The docstring of ccl-untranslated-to-ucs is not clear.  In
"Set r1 to the byte length", the byte length means how many
of r0, r1, r2, r3 (each of them contains a byte) contribute
to a unicode character (or an invalid byte).

If l is 2, that means an invalid byte was converted to a
two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
eight-bit-control/graphic.  In that case, it is better to
display that sequence in octal instead of showing ?�.
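
To illustrate what the octal display looks like, here is a small sketch
of the string that the `format' call in the quoted code builds.  It
assumes that ch holds the value of the invalid byte; the helper name is
hypothetical and not part of utf-8.el.

```elisp
;; Sketch only: `my-octal-display-string' is a hypothetical helper.
(defun my-octal-display-string (byte)
  "Return BYTE rendered as a backslash followed by three octal digits."
  (format "\\%03o" byte))

(my-octal-display-string #xC0)   ; => "\\300"
(my-octal-display-string #xFF)   ; => "\\377"
```

So an invalid byte #xC0 would be shown to the user as the four
characters \300.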

> --- orig/lisp/international/utf-8.el
> +++ mod/lisp/international/utf-8.el
> @@ -2,7 +2,7 @@
 
>  ;; Copyright (C) 2001, 2004 Electrotechnical Laboratory, JAPAN.
>  ;; Licensed to the Free Software Foundation.
> -;; Copyright (C) 2001, 2002 Free Software Foundation, Inc.
> +;; Copyright (C) 2001, 2002, 2005  Free Software Foundation, Inc.
 
>  ;; Author: TAKAHASHI Naoto  <address@hidden>
>  ;; Maintainer: FSF
> @@ -259,7 +259,7 @@
>                                (funcall decode-char-no-trans (car x))
>                                (funcall decode-char-no-trans (cdr x))))
>                    ranges "")))
> -  ;; These forces loading and settting tables for
> +  ;; This forces loading and setting tables for
>    ;; utf-translate-cjk-mode.
>    (setq utf-translate-cjk-lang-env nil
>       ucs-mule-cjk-to-unicode (make-hash-table :test 'eq)
> @@ -951,10 +951,7 @@
>    (save-excursion
>      (save-restriction
>        (narrow-to-region (point) (+ (point) length))
> -      ;; Can't do eval-when-compile to insert a multibyte constant
> -      ;; version of the string in the loop, since it's always loaded as
> -      ;; unibyte from a byte-compiled file.
> -      (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
> +      (let ((range "^\xc0-\xc3\xe1-\xf7")

This change is not good, because range is then set to a unibyte
string, and regexp search converts it to a multibyte string by
`make-multibyte-string'.  What we need here is a multibyte
string that contains eight-bit-graphic/control chars.  In any
case, it is better to change string-as-multibyte to
string-to-multibyte.
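
A minimal sketch of the difference between the literal and the converted
string follows.  The variable names are illustrative only.  The internal
codes of the resulting eight-bit chars differ between Emacs versions, so
the sketch checks only the representation and the length.

```elisp
;; Sketch only; the variable names are illustrative.
(let* ((unibyte-range "^\xc0-\xc3\xe1-\xf7")
       (multibyte-range (string-to-multibyte unibyte-range)))
  (list
   ;; The literal contains only hex escapes below 256, so it is read
   ;; as a unibyte string.
   (multibyte-string-p unibyte-range)
   ;; After conversion the string is multibyte.
   (multibyte-string-p multibyte-range)
   ;; The conversion keeps one character per original byte.
   (= (length unibyte-range) (length multibyte-range))))
```

The point is that the conversion to multibyte is made explicit here,
rather than being left to the regexp search.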

>           (buffer-multibyte enable-multibyte-characters)
>           hash-table ch)
>       (set-buffer-multibyte t)
> @@ -1036,8 +1033,7 @@
>      mule-unicode-0100-24ff
>      mule-unicode-2500-33ff
>      mule-unicode-e000-ffff
> -    ,@(if utf-translate-cjk-mode
> -       utf-translate-cjk-charsets))
> +    ,@utf-translate-cjk-charsets)

This change is ok.

>     (mime-charset . utf-8)
>     (coding-category . coding-category-utf-8)
>     (valid-codes (0 . 255))
> @@ -1054,23 +1050,23 @@
>  ;; I think this needs special private charsets defined for the
>  ;; untranslated sequences, if it's going to work well.
 
> -;;; (defun utf-8-compose-function (pos to pattern &optional string)
> -;;;   (let* ((prop (get-char-property pos 'composition string))
> -;;;   (l (and prop (- (cadr prop) (car prop)))))
> -;;;     (cond ((and l (> l (- to pos)))
> -;;;     (delete-region pos to))
> -;;;    ((and (> (char-after pos) 224)
> -;;;          (< (char-after pos) 256)
> -;;;          (save-restriction
> -;;;            (narrow-to-region pos to)
> -;;;            (utf-8-compose)))
> -;;;     t))))
> -
> -;;; (dotimes (i 96)
> -;;;   (aset composition-function-table
> -;;;  (+ 128 i)
> -;;;  `((,(string-as-multibyte "[\200-\237\240-\377]")
> -;;;     . utf-8-compose-function))))
> +;; (defun utf-8-compose-function (pos to pattern &optional string)
> +;;   (let* ((prop (get-char-property pos 'composition string))
> +;;    (l (and prop (- (cadr prop) (car prop)))))
> +;;     (cond ((and l (> l (- to pos)))
> +;;      (delete-region pos to))
> +;;     ((and (> (char-after pos) 224)
> +;;           (< (char-after pos) 256)
> +;;           (save-restriction
> +;;             (narrow-to-region pos to)
> +;;             (utf-8-compose)))
> +;;      t))))
> +
> +;; (dotimes (i 96)
> +;;   (aset composition-function-table
> +;;   (+ 128 i)
> +;;   `((,(string-as-multibyte "[\200-\237\240-\377]")
> +;;      . utf-8-compose-function))))
 
>  ;; arch-tag: b08735b7-753b-4ae6-b754-0f3efe4515c5
>  ;;; utf-8.el ends here

This change is ok if that is the correct coding style for
comments.

---
Ken'ichi HANDA
address@hidden



