Re: UTF-8 decoding error for characters U+10000 and above (hopefully fix
From: Joe Wells
Subject: Re: UTF-8 decoding error for characters U+10000 and above (hopefully fixed already)
Date: Mon, 13 Feb 2006 02:56:09 +0000
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (gnu/linux)
Kenichi Handa <address@hidden> writes:
> In article <address@hidden>, Joe Wells <address@hidden> writes:
>
>> I'm using the Gentoo ebuild app-editors/emacs-22.0.50_pre20050225
>> which is based on a CVS snapshot from last year.
>
>> Try evaluating this:
>
>> (let ((unicode-char-hex-string
>>        (format "%x"
>>                (encode-char
>>                 (aref (decode-coding-string
>>                        ;; UTF-8 for U+1D161 (MUSICAL SYMBOL SIXTEENTH NOTE):
>>                        "\355\205\241"
>>                        'utf-8) 0)
>>                 'ucs))))
>>   (if (equal "d161" unicode-char-hex-string)
>>       (error "Oh no! Emacs dropped 17th bit when decoding the character!")))
>
> That version of Emacs supports only BMP as written in the
> documentation of utf-8 coding system.
Yes, but it should handle the character in the same way as any other
character outside of its range. There is this comment in utf-8.el:
;; We compose the untranslatable sequences into a single character,
;; and move point to the next character.
;; This is infelicitous for editing, because there's currently no
;; mechanism for treating compositions as atomic, but is OK for
;; display. They are composed to U+FFFD with help-echo which
;; indicates the unicodes they represent. ...
In my case, this seemed not to be working. Instead, it seemed it was
translating the sequence to the wrong character.
However, I have since discovered the real problem. I was editing the
file /usr/lib/X11/locale/en_US.UTF-8/Compose and it has a line that
reads like this:
----------------------------------------------------------------------
<Multi_key> <U1d15f> <U1d16f> : "텡" U1D161 # MUSICAL SYMBOL SIXTEENTH NOTE
----------------------------------------------------------------------
However, although the line claims that the character in the quotes is
U+1D161, the character there is actually U+D161, encoded in UTF-8 as
ED 85 A1. The correct UTF-8 encoding of U+1D161 would be
F0 9D 85 A1.
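The two byte sequences above can be checked by hand. The following is
only a minimal sketch in Python (not Emacs Lisp, and not how Emacs
decodes UTF-8); the function name decode_utf8_scalar is hypothetical,
and it deliberately skips the overlong-form and surrogate checks that a
real decoder needs:

```python
def decode_utf8_scalar(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence and return its code point.

    Illustrative only: no overlong-form or surrogate validation.
    """
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx: 1 byte, 7 payload bits
        length, value = 1, first
    elif first >> 5 == 0b110:        # 110xxxxx: 2 bytes
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: 3 bytes, reaches at most U+FFFF
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: 4 bytes, needed above U+FFFF
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in data[1:]:
        if byte >> 6 != 0b10:        # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


# The bytes found in the Compose file: a 3-byte sequence.
print(f"{decode_utf8_scalar(bytes([0xED, 0x85, 0xA1])):X}")        # D161
# The bytes that should have been there: a 4-byte sequence.
print(f"{decode_utf8_scalar(bytes([0xF0, 0x9D, 0x85, 0xA1])):X}")  # 1D161
```

A 3-byte sequence carries only 16 payload bits, so ED 85 A1 cannot
represent a 17-bit code point such as U+1D161, whatever decoder reads
it.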
Sorry for the false alarm! The bug is in the xorg-x11 distribution
on my machine. I was wrong to believe this file was correct.
--
Joe
> u -- utf-8 (alias of mule-utf-8)
>
> UTF-8 encoding for Emacs-supported Unicode characters.
> It supports Unicode characters of these ranges:
> U+0000..U+33FF, U+E000..U+FFFF.
> They correspond to these Emacs character sets:
> ascii, latin-iso8859-1, mule-unicode-0100-24ff,
> mule-unicode-2500-33ff, mule-unicode-e000-ffff
> [...]
>
> ---
> Kenichi Handa
> address@hidden