Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections
From: Thomas Dickey
Subject: Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections
Date: Fri, 28 Jul 2023 04:26:36 -0400

On Thu, Jul 27, 2023 at 10:25:13PM +0200, Hiltjo Posthuma wrote:
> Hi,
>
> I use lynx to convert HTML to plain-text, but noticed an issue where part of
> the output is missing with UTF-8 in CDATA sections.
>
> Below is a small test-case to reproduce it:
>
> <p>Works correctly:</p>
> <p>a’b</p>
>
> <p>Doesn't work correctly:</p>
> <p><![CDATA[a’b]]></p>

agreed - I see

    Works correctly:
    a’b
    Doesn't work correctly:
    a?b

> This byte sequence for the UTF-8 codepoint is: printf '\342\200\231'
>
>
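
For reference, the three octal bytes quoted above can be checked by hand. The following is only a minimal Python sketch, not lynx code: the helper name decode_utf8_3byte is hypothetical and is used purely for illustration. It decodes the sequence into a single code point and shows that the first byte taken alone is not a valid character, which appears consistent with the symptom described in the report.

```python
# The three octal bytes from the report: \342 \200 \231
raw = bytes([0o342, 0o200, 0o231])


def decode_utf8_3byte(b: bytes) -> int:
    """Decode exactly one 3-byte UTF-8 sequence into its code point.

    Hypothetical helper for illustration; it does not mirror SGML.c.
    """
    if len(b) != 3:
        raise ValueError("expected exactly three bytes")
    if (b[0] & 0xF0) != 0xE0:
        raise ValueError("first byte is not a 3-byte lead byte")
    if any((x & 0xC0) != 0x80 for x in b[1:]):
        raise ValueError("continuation byte expected")
    # 4 payload bits from the lead byte, 6 from each continuation byte
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


codepoint = decode_utf8_3byte(raw)

# Decoding only the lead byte cannot yield the intended character;
# Python substitutes U+FFFD when asked to replace invalid input.
first_byte_only = raw[:1].decode("utf-8", errors="replace")

print(f"U+{codepoint:04X}")
```

The full sequence is the right single quotation mark (U+2019), so all three bytes have to reach the output together for the character to survive.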
> I use the following command to convert HTML to text:
>
> lynx -stdin -dump \
> -underline_links -image_links \
> -display_charset="utf-8" -assume_charset="utf-8"
>
>
> My system information:
> I tested on the latest lynx-cur: lynx2.9.0dev.12
>
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
>
>
> What I found:
>
> I think it only prints the first byte instead of printing the processed
> codepoint (clong). I noticed in the file WWW/Library/Implementation/SGML.c
> there is a similar case for comments for example for "S_comment_put_c:".
>
> Below is a patch. I'm not sure it covers all lynx options though. I hope it
> does:
thanks - will review, etc
--
Thomas E. Dickey <dickey@invisible-island.net>
https://invisible-island.net