[Lynx-dev] Dumps Unicode file in broken encoding.
From: Atsuhito Kohda
Subject: [Lynx-dev] Dumps Unicode file in broken encoding.
Date: Mon, 29 Sep 2008 19:41:28 +0900 (JST)
Hi all,
I received the following bug report in the Debian BTS (Bug#498985).
As I have no knowledge of this area, I'd like to forward the report
to this list.
On Mon, 15 Sep 2008 16:10:38 +0900, Charles Plessy wrote:
> I have severe problems when converting HTML messages with Lynx while
> using Mutt, and it seems to me that the reason is that the output
> encoding is broken. Here is a simple example:
>
> aqwa『~』$ cat test.html
> <ul>
> <li>é</li>
> <li>à</li>
> </ul>
>
> aqwa『~』$ hexdump -C test.html
> 00000000 3c 75 6c 3e 0a 3c 6c 69 3e c3 a9 3c 2f 6c 69 3e |<ul>.<li>..</li>|
> 00000010 0a 3c 6c 69 3e c3 a0 3c 2f 6c 69 3e 0a 3c 2f 75 |.<li>..</li>.</u|
> 00000020 6c 3e 0a |l>.|
> 00000023
>
> aqwa『~』$ lynx.cur --dump test.html
> * é
> *
>
>
> aqwa『~』$ lynx.cur --dump test.html > test.txt
>
> aqwa『~』$ hexdump -C test.txt
> 00000000 20 20 20 20 20 2a 20 c3 a9 0a 20 20 20 20 20 2a |     * ...     *|
> 00000010 20 c3 0a 0a | ...|
> 00000014
>
> Here are the expected files in UTF-8 and ISO 8859-1 (Latin-1) encodings:
>
> aqwa『~』$ cat test.unicode.txt
> * é
> * à
>
>
> aqwa『~』$ hexdump -C test.unicode.txt
> 00000000 20 20 20 20 20 2a 20 c3 a9 0a 20 20 20 20 20 2a |     * ...     *|
> 00000010 20 c3 a0 0a 0a | ....|
> 00000015
>
> aqwa『~』$ cat test.iso.txt
> *
> *
>
>
> aqwa『~』$ hexdump -C test.iso.txt
> 00000000 20 20 20 20 20 2a 20 e9 0a 20 20 20 20 20 2a 20 |     * ..     * |
> 00000010 e0 0a 0a |...|
> 00000013
>
> So apparently, « à » is C3 A0 in UTF-8 and E0 in ISO 8859-1, but Lynx dumps
> it as a lone C3, dropping the A0 continuation byte. The result is not valid
> UTF-8, which causes encoding misdetection and many downstream problems.
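
For anyone who wants to check the bytes without Lynx at hand, here is a
minimal Python sketch. It is my own illustration, not part of the original
report. It assumes only the byte values shown in the hexdumps above, and the
helper name first_invalid_utf8_offset is hypothetical.

```python
# Bytes copied from the hexdumps quoted in the report.
DUMPED = bytes.fromhex(      # test.txt, as produced by lynx --dump
    "20 20 20 20 20 2a 20 c3 a9 0a 20 20 20 20 20 2a 20 c3 0a 0a"
)
EXPECTED = bytes.fromhex(    # test.unicode.txt, the expected UTF-8 output
    "20 20 20 20 20 2a 20 c3 a9 0a 20 20 20 20 20 2a 20 c3 a0 0a 0a"
)


def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    # "à" is two bytes in UTF-8 and one byte in ISO 8859-1.
    print("à".encode("utf-8").hex(), "à".encode("latin-1").hex())
    # The expected file decodes cleanly; the dumped one does not.
    print(first_invalid_utf8_offset(EXPECTED))
    print(first_invalid_utf8_offset(DUMPED))
```

The offset reported for the dumped file is 17 (0x11), which is the lone C3
on the second list item: the lead byte is there but its A0 continuation
byte is missing.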
Thanks in advance.
Regards, 2008-9-29(Mon)
--
Debian Developer - much more I18N of Debian
Atsuhito Kohda <kohda AT debian.org>
Department of Math., Univ. of Tokushima