[Lynx-dev] Lynx bug report: mangled UTF-8
From: Tom Christiansen
Subject: [Lynx-dev] Lynx bug report: mangled UTF-8
Date: Tue, 05 Oct 2010 11:22:45 -0600
I've verified this bug using the following version of Lynx, whose
release is notably dated just yesterday:
    $ ./lynx -version
    Lynx Version 2.8.8dev.6 (04 Oct 2010)
    libwww-FM 2.14, ncurses 5.7.20081102
    Built on darwin10.4.0 Oct 5 2010 10:23:40
This bug also occurs in all prior versions of Lynx I was able to test.
==== BEGIN LYNX BUG
When considering line wrapping, Lynx misconstrues all text as ISO
8859-1, even when producing UTF-8. All code points whose multibyte
UTF-8 encoding includes bytes which are white space in 8859-1
[see attached program] erroneously become candidates for line wrapping.
Multibyte sequences containing either byte 0x85 (NEL) or byte 0xA0
(NO-BREAK SPACE) may have that byte replaced by \n, a substitution that
not only irrecoverably mangles the intended text but also generates
illegal UTF-8 sequences.
==== END LYNX BUG
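
To make the failure concrete, here is a minimal sketch of the byte-level
effect described above. It does not reproduce Lynx's actual wrapping code:
naive_wrap() and is_valid_utf8() are hypothetical helpers written only for
this illustration, and the wrap is simplified to replacing every 0x85 or
0xA0 byte with a newline. The sample character, U+00E0 LATIN SMALL LETTER A
WITH GRAVE, encodes in UTF-8 as the two bytes C3 A0.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw[encode decode];

# Hypothetical stand-in for the faulty wrap: treat each byte as if it
# were ISO 8859-1 and turn the two high-bit "white space" bytes into \n.
sub naive_wrap {
    my ($bytes) = @_;
    (my $wrapped = $bytes) =~ s/[\x85\xA0]/\n/g;
    return $wrapped;
}

# True if the byte string decodes as strict UTF-8, false otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode("UTF-8", $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# U+00E0 encodes as the bytes C3 A0; the trailing A0 looks like
# NO-BREAK SPACE if the bytes are misread as ISO 8859-1.
my $original = encode("UTF-8", "\x{E0}");
my $mangled  = naive_wrap($original);

printf("original: %v02X valid=%d\n", $original, is_valid_utf8($original));
printf("mangled:  %v02X valid=%d\n", $mangled,  is_valid_utf8($mangled));
```

The mangled string is a lead byte followed by a line feed, which no strict
UTF-8 decoder will accept, so the original character cannot be recovered
from the output.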
I was unable to locate any mention of this bug, whether in the CHANGES or
PROBLEMS file, or via Googling. I am also unaware of any bugs database for
Lynx, or I would have submitted this there. I trust my simple description
should suffice to locate the offending code, but if not, sample input
file(s) manifesting the problem are available upon request.
Hope this helps. Send mail if you have any advice or need more details.
thanks,
--tom
As of Unicode 5.2, 1,776 named code points are vulnerable to this
Lynx bug. These can be enumerated by running the following program.
#!/usr/bin/env perl
#
# spacenc - find code points with multibyte UTF-8 encodings containing
#           bytes that would be spaces if misunderstood to be ISO 8859-1
#
# Tom Christiansen <address@hidden>
# Tue Oct 5 10:51:18 MDT 2010
#
# NB: works best with Unicode version >= 5.2, hence Perl version >= 5.12

use strict;
use warnings FATAL => qw[all];
use diagnostics;
use charnames qw[ ];
use Encode qw[encode decode];

# omit code points < 128, as those don't multibyte-encode
for my $cp (0x00_0080 .. 0x10_FFFF) {

    # gaggy UTF-16 surrogates are illegal UTF-8 code points
    next if $cp >= 0x00_D800 && $cp <= 0x00_DFFF;

    # see "Unicode non-character %s is illegal for interchange" in perldiag(1)
    $_ = do { no warnings "utf8"; chr($cp) };

    # won't find string names for any of these, so don't bother printing
    next if m{ \p{Unassigned} }x;
    next if m{ \p{PrivateUse} }x;
    next if m{ \p{Han} }x;
    next if m{ \p{InHangulSyllables} }x;

    # cast individual utf8 bytes into latin1 code points
    my $as_latin = decode("latin1", encode("utf8", $_));

    if ($as_latin =~ m{ \s }x) {
        printf("U+%05X in UTF8 is %v02X", $cp, $as_latin);
        printf(" %s\n", charnames::viacode($cp) || "<unnamed code point>");
    }
}

END {
    close(STDOUT) || die "can't close stdout: $!";
}