Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections

From:	Thomas Dickey
Subject:	Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections
Date:	Fri, 28 Jul 2023 04:26:36 -0400

On Thu, Jul 27, 2023 at 10:25:13PM +0200, Hiltjo Posthuma wrote:
> Hi,
> 
> I use lynx to convert HTML to plain-text, but noticed an issue where part of
> the output is missing with UTF-8 in CDATA sections.
> 
> Below is a small test-case to reproduce it:
> 
> <p>Works correctly:</p>
> <p>a’b</p>
> 
> <p>Doesn't work correctly:</p>
> <p><![CDATA[a’b]]></p>

agreed - I see

   Works correctly:

   a’b

   Doesn't work correctly:

   a?b

> The UTF-8 byte sequence for this codepoint (U+2019) is: printf '\342\200\231'
> 
> 
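For reference, the three octal bytes in the printf command can be decoded by hand. The following is a minimal Python sketch, not taken from the thread or from Lynx; the names RAW and decode_three_byte are made up for illustration.

```python
import unicodedata

# The three octal-escaped bytes from the printf command quoted above.
RAW = b"\342\200\231"  # same bytes as b"\xe2\x80\x99"


def decode_three_byte(seq: bytes) -> int:
    """Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)."""
    b0, b1, b2 = seq
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # Keep the 4 payload bits of the lead byte and 6 of each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


CODEPOINT = decode_three_byte(RAW)
print(f"U+{CODEPOINT:04X}", unicodedata.name(chr(CODEPOINT)))
```

This should print U+2019 RIGHT SINGLE QUOTATION MARK, which is the character between "a" and "b" in the test case.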
> I use the following command to convert HTML to text:
> 
>       lynx -stdin -dump \
>               -underline_links -image_links \
>               -display_charset="utf-8" -assume_charset="utf-8"
> 
> 
> My system information:
> I tested on the latest lynx-cur: lynx2.9.0dev.12
> 
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
> 
> 
> What I found:
> 
> I think it only prints the first byte instead of printing the processed
> codepoint (clong).  In WWW/Library/Implementation/SGML.c there is a similar
> case for comments, for example under "S_comment_put_c:".
> 
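The suspected difference between emitting only part of a multi-byte sequence and emitting the accumulated codepoint can be illustrated with a toy byte-at-a-time consumer. This is a hedged Python sketch, not Lynx's C code: put_chars and use_codepoint are hypothetical names, and rendering the undecoded character as "?" is a simplification chosen to match the observed output.

```python
def put_chars(data: bytes, use_codepoint: bool) -> str:
    """Toy UTF-8 consumer (hypothetical; not the SGML.c state machine)."""
    out = []
    pending = 0  # continuation bytes still expected
    code = 0     # codepoint accumulated so far
    for byte in data:
        if pending and byte & 0xC0 == 0x80:
            code = (code << 6) | (byte & 0x3F)
            pending -= 1
            if pending == 0:
                ok = use_codepoint and code <= 0x10FFFF
                # Assumption: without the full codepoint, the character
                # cannot be rendered and shows up as "?".
                out.append(chr(code) if ok else "?")
            continue
        if pending:
            # Sequence was cut short by a non-continuation byte.
            out.append("?")
            pending = 0
        if byte < 0x80:
            out.append(chr(byte))
        elif byte & 0xE0 == 0xC0:
            pending, code = 1, byte & 0x1F
        elif byte & 0xF0 == 0xE0:
            pending, code = 2, byte & 0x0F
        elif byte & 0xF8 == 0xF0:
            pending, code = 3, byte & 0x07
        else:
            out.append("?")  # stray continuation or invalid lead byte
    if pending:
        out.append("?")  # input ended in the middle of a sequence
    return "".join(out)


sample = "a\u2019b".encode("utf-8")
print(put_chars(sample, use_codepoint=True))
print(put_chars(sample, use_codepoint=False))
```

With use_codepoint=True the sample round-trips unchanged; with use_codepoint=False it yields a?b, the same shape as the CDATA output reported above.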
> Below is a patch. I'm not sure it covers all lynx options, though I hope it
> does:

thanks - will review, etc

-- 
Thomas E. Dickey <dickey@invisible-island.net>
https://invisible-island.net
