Re: [Lynx-dev] Dumps Unicode file in broken encoding.
From: Thomas Dickey
Subject: Re: [Lynx-dev] Dumps Unicode file in broken encoding.
Date: Mon, 29 Sep 2008 07:02:14 -0400 (EDT)
On Mon, 29 Sep 2008, Atsuhito Kohda wrote:
Hi all,
I got the following bug report in the Debian BTS (Bug#498985).
As I have no knowledge of this, I'd like to forward the report
to this list.
This seems to be what I discussed with Plessy week-before-last:
I explained the expected behavior.
lynx seems to be matching that.
He replied that it did not do that before.
I verified that it worked as expected in 2.8.5.
Without a charset declared in the document, or an override via the
command line or the lynx configuration, the file will be treated as
ISO-8859-1. He seems to be expecting lynx to treat it as UTF-8.
(Without some more details, I don't know where to look.)
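To illustrate the difference between the two interpretations, here is a minimal Python sketch. It only shows how the same two bytes from the reporter's test.html read under each charset; it does not model lynx's internals. The helper with_charset is hypothetical, and whether the meta declaration it adds changes lynx's behavior for this particular report has not been verified here.

```python
# Bytes of the first list item's text in test.html, taken from the hexdump.
raw = bytes.fromhex("c3a9")

# Read as ISO-8859-1, every byte is one character, so two characters result.
as_latin1 = raw.decode("iso-8859-1")

# Read as UTF-8, the same two bytes form a single character.
as_utf8 = raw.decode("utf-8")

print(repr(as_latin1), len(as_latin1))
print(repr(as_utf8), len(as_utf8))

# One standard way to put "some charset in the document" is a meta
# declaration. This helper is hypothetical and only prepends that line.
META = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'


def with_charset(fragment: bytes) -> bytes:
    """Return the HTML fragment with a UTF-8 charset declaration prepended."""
    return META.encode("ascii") + b"\n" + fragment
```

The first decode gives two characters (U+00C3 and U+00A9) and the second gives one (U+00E9), which is the gap between what is documented and what the reporter expects.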
On Mon, 15 Sep 2008 16:10:38 +0900, Charles Plessy wrote:
I have severe problems when converting HTML messages with Lynx while
using Mutt, and it seems to me that the reason is that the output
encoding is broken. Here is a simple example:
aqwa???~???$ cat test.html
<ul>
<li>é</li>
<li>à</li>
</ul>
aqwa???~???$ hexdump -C test.html
00000000  3c 75 6c 3e 0a 3c 6c 69  3e c3 a9 3c 2f 6c 69 3e  |<ul>.<li>..</li>|
00000010  0a 3c 6c 69 3e c3 a0 3c  2f 6c 69 3e 0a 3c 2f 75  |.<li>..</li>.</u|
00000020  6c 3e 0a                                          |l>.|
00000023
aqwa???~???$ lynx.cur --dump test.html
     * é
     *
aqwa???~???$ lynx.cur --dump test.html > test.txt
aqwa???~???$ hexdump -C test.txt
00000000  20 20 20 20 20 2a 20 c3  a9 0a 20 20 20 20 20 2a  |     * ...     *|
00000010  20 c3 0a 0a                                       | ...|
00000014
Here are the expected files in Latin-1 (ISO 8859-1) and Unicode (UTF-8) encodings:
aqwa???~???$ cat test.unicode.txt
     * é
     * à
aqwa???~???$ hexdump -C test.unicode.txt
00000000  20 20 20 20 20 2a 20 c3  a9 0a 20 20 20 20 20 2a  |     * ...     *|
00000010  20 c3 a0 0a 0a                                    | ....|
00000015
aqwa???~???$ cat test.iso.txt
     *
     *
aqwa???~???$ hexdump -C test.iso.txt
00000000  20 20 20 20 20 2a 20 e9  0a 20 20 20 20 20 2a 20  |     * ..     * |
00000010  e0 0a 0a                                          |...|
00000013
So apparently, à is C3 A0 in UTF-8, E0 in ISO 8859-1, but Lynx dumps it
as C3. This causes encoding misdetection, and many downstream problems.
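The byte-level claim can be checked mechanically. The following is a minimal Python sketch, using only the bytes shown in the three hexdumps above; the function name first_invalid_utf8_offset is made up for this illustration and is not part of Lynx or Mutt.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Bytes copied from the hexdumps of test.txt, test.unicode.txt, test.iso.txt.
actual = bytes.fromhex(
    "20 20 20 20 20 2a 20 c3 a9 0a 20 20 20 20 20 2a 20 c3 0a 0a"
)
expected_utf8 = bytes.fromhex(
    "20 20 20 20 20 2a 20 c3 a9 0a 20 20 20 20 20 2a 20 c3 a0 0a 0a"
)
expected_latin1 = bytes.fromhex(
    "20 20 20 20 20 2a 20 e9 0a 20 20 20 20 20 2a 20 e0 0a 0a"
)

# The expected UTF-8 file decodes cleanly; the actual dump does not,
# because the second C3 lead byte is followed by a newline.
print(first_invalid_utf8_offset(expected_utf8))
print(first_invalid_utf8_offset(actual))

# The two expected files carry the same text in different encodings.
same_text = expected_latin1.decode("iso-8859-1") == expected_utf8.decode("utf-8")
print(same_text)
```

The actual dump is neither of the two expected files: its first item is valid UTF-8, while the second stops after the lead byte, so a consumer such as Mutt cannot decode the whole file consistently under either charset.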
Thanks in advance.
Regards, 2008-9-29(Mon)
--
Debian Developer - much more I18N of Debian
Atsuhito Kohda <kohda AT debian.org>
Department of Math., Univ. of Tokushima
--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net