[Lynx-dev] bugreport: dumping utf8 html to utf8 text malforms \c5\a0 cha
From: Pavel Smerk
Subject: [Lynx-dev] bugreport: dumping utf8 html to utf8 text malforms \c5\a0 character before a new line
Date: Tue, 6 Oct 2009 23:52:57 +0200
User-agent: Mutt/1.4.2.2i
Hello all,
having the following HTML code
<html><head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
</head><body>
ŠŠ
</body></html>
in the file in.html and running the following command
lynx -dump -display_charset=utf-8 -assume_charset=utf-8 -nomargins in.html > out.txt
one gets back the following five bytes in the file out.txt
C5 A0 C5 0A 0A
where the second C5 is only the first byte of the correct two-byte UTF-8
sequence C5 A0. Maybe the A0 byte is removed by some trimming of end-of-line
spaces. That would be rather surprising, though, since a lone A0 is not a
valid UTF-8 character, and in this case both the input and the output are
UTF-8. And, of course, a lone C5 is not a valid UTF-8 character either,
which means that the output is not even a valid UTF-8 file.
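
The byte-level observation above can be checked with a short Python sketch.
This is not Lynx code: the helper trim_trailing_latin1_space is hypothetical
and only models the guessed cause, namely a byte-wise trim that treats 0xA0
(the Latin-1 no-break space) as trailing whitespace. Whether Lynx actually
does this is an assumption.

```python
# Bytes reported in out.txt: C5 A0 C5 0A 0A
reported = bytes.fromhex("C5 A0 C5 0A 0A")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def trim_trailing_latin1_space(line: bytes) -> bytes:
    """Hypothetical byte-wise trim treating 0x20 and 0xA0 as spaces.

    Models the guessed cause only; it is not taken from the Lynx source.
    """
    return line.rstrip(b"\x20\xa0")


expected_line = "ŠŠ".encode("utf-8")  # two characters, four bytes
trimmed = trim_trailing_latin1_space(expected_line)

print(expected_line.hex(" "))   # c5 a0 c5 a0
print(trimmed.hex(" "))         # c5 a0 c5
print(is_valid_utf8(reported))  # False
```

Under that assumption, the trimmed line followed by two newlines is exactly
the five bytes seen in out.txt, and the result no longer decodes as UTF-8.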
Nevertheless, thank you for the great piece of software. :-)
With regards,
Pavel Smerk