Re: [Lynx-dev] Lynx bug report: mangled UTF-8
From: Tom Christiansen
Subject: Re: [Lynx-dev] Lynx bug report: mangled UTF-8
Date: Wed, 06 Oct 2010 08:01:10 -0600
On Tuesday, 5 October 2010, Thomas Dickey wrote at 5:00pm EDT:
>> I've verified this bug using the following version of Lynx, whose
>> release is notably dated just yesterday:
>>
>> $ ./lynx -version
>> Lynx Version 2.8.8dev.6 (04 Oct 2010)
>> libwww-FM 2.14, ncurses 5.7.20081102
>> Built on darwin10.4.0 Oct 5 2010 10:23:40
> Your bug might still be present, but right away I notice that it's not
> built with wide-character library of ncurses (and is not likely to work
> as well):
> Lynx Version 2.8.8dev.4 (21 Jun 2010)
> libwww-FM 2.14, SSL-MM 1.4.1, OpenSSL 0.9.8o, ncurses 5.7.20101002(wide)
> Built on linux-gnu Jun 21 2010 17:17:20
Thanks for the suggestion. I've rebuilt with ncursesw, and my version
now reads:
Lynx Version 2.8.8dev.6 (04 Oct 2010)
libwww-FM 2.14, ncurses 5.7.20081102(wide)
Built on darwin10.4.0 Oct 5 2010 17:49:29
For more detailed library info, Darwin doesn't have ldd(1), so one
instead uses:
$ otool -L ./lynx
./lynx:
/opt/local/lib/libidn.11.dylib (compatibility version 17.0.0, current version 17.45.0)
/opt/local/lib/libncursesw.5.dylib (compatibility version 5.0.0, current version 5.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 125.2.0)
But the problem still occurs, at least on Darwin; I haven't tried
OpenBSD.
My test case is to run
$ ~/lynx2-8-8/lynx -width=80 -display_charset=utf-8 \
-assume_local_charset=utf-8 -dump test.html > test.utf8
Looking at test.utf8 in less(1) with LESSCHARSET set to UTF-8 reveals
this paragraph:
Values in parentheses are for the high resolution bin: 2.71–2.59 Å for
the SeMet GMP-bound, 2.03–2.00 Å for the native GMP-bound, 2.38–2.30 <C3>
for the native apo form.
where less places <C3> in reverse video to indicate a character that's
either non-printable or out of repertoire; here, invalid UTF-8.
What's happening is that the UTF-8 encoding of code point U+00C5, LATIN
CAPITAL LETTER A WITH RING ABOVE, is the two-byte pair, 0xC3 and 0x85.
But considered in ISO 8859-1, an isolated 0x85 is NEXT LINE (NEL), which
is considered white space. The 0x85 byte is erroneously replaced by a
newline, and a lone 0xC3 byte is left dangling, which is illegal as UTF-8.
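The byte-level failure can be reproduced in a few lines. What follows is a
minimal sketch in Python rather than Perl. The function name mangle_nel is
hypothetical, and it only imitates the suspected behavior (treating a 0x85
byte as NEL and emitting a newline); it is not taken from the Lynx source.

```python
def mangle_nel(data: bytes) -> bytes:
    """Imitate the suspected bug: replace every 0x85 byte with a newline.

    Hypothetical helper for illustration only; not Lynx code.
    """
    return data.replace(b"\x85", b"\n")


# U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) is 0xC3 0x85 in UTF-8.
a_ring = "\u00c5".encode("utf-8")

# A fragment of the affected paragraph, with U+2013 EN DASH and U+00C5.
original = "2.38\u20132.30 \u00c5 for".encode("utf-8")
mangled = mangle_nel(original)

decodes = True
bad_offset = None
bad_byte = None
try:
    mangled.decode("utf-8")
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that could not be decoded.
    decodes = False
    bad_offset = err.start
    bad_byte = mangled[bad_offset]
    print(f"invalid UTF-8: stray byte 0x{bad_byte:02X} at offset {bad_offset}")
```

The en dash survives because none of its three bytes is 0x85; only the
second byte of the A-ring is hit, which strands the 0xC3 lead byte.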
A more programmatic approach to locating the problem can be had via perl(1),
feeding the input via stdin:
$ perl -CS -Mwarnings=FATAL,all -lne 'print if /nonesuch/' < test.utf8
utf8 "\xC3" does not map to Unicode at -e line 1, <> line 633.
Exit 255
or this way with an explicitly named file:
$ perl -CSD -Mwarnings=FATAL,all -lne 'print if /nonesuch/' test.utf8
utf8 "\xC3" does not map to Unicode at -e line 1, <> line 633.
Exit 25
The "Exit 255" and "Exit 25" lines are printed by tcsh because I have
printexitvalue set.
One can also set up an explicit "warning handler" if one wishes more
control over the error message and behavior, perhaps for generating a
unit-test used in regression testing.
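As one possible shape for such a regression check, here is a sketch in Python
(not the Perl warning handler mentioned above). The function name
find_invalid_utf8 and the sample data are assumptions made for illustration.
It reports the line number, the byte offset within the line, and the value of
the first undecodable byte on each bad line.

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, offset_in_line, byte_value) for each bad line.

    Only the first undecodable byte on each line is reported. Splitting on
    0x0A is safe because that byte never occurs inside a valid multi-byte
    UTF-8 sequence.
    """
    problems = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            problems.append((lineno, err.start, line[err.start]))
    return problems


# Hypothetical sample imitating the mangled dump: the second line ends in a
# lone 0xC3 because the 0x85 that should follow it became a newline.
sample = (
    b"Values in parentheses are for the high resolution bin:\n"
    + "2.38\u20132.30 ".encode("utf-8") + b"\xc3\n"
    + b"for the native apo form.\n"
)

problems = find_invalid_utf8(sample)
for lineno, offset, byte in problems:
    print(f"line {lineno}, byte offset {offset}: stray 0x{byte:02X}")
```

A test harness could read the dump in binary mode, pass the bytes to this
function, and fail whenever the returned list is non-empty.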
Thank you all for all your help and suggestions.
--tom