
Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections


From: Hiltjo Posthuma
Subject: Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections
Date: Tue, 3 Oct 2023 23:29:07 +0200

On Thu, Jul 27, 2023 at 10:25:13PM +0200, Hiltjo Posthuma wrote:
> Hi,
> 
> I use lynx to convert HTML to plain-text, but noticed an issue where part of
> the output is missing with UTF-8 in CDATA sections.
> 
> Below is a small test-case to reproduce it:
> 
> <p>Works correctly:</p>
> <p>a’b</p>
> 
> <p>Doesn't work correctly:</p>
> <p><![CDATA[a’b]]></p>
> 
> The UTF-8 byte sequence for this codepoint (U+2019) is: printf '\342\200\231'
> 
> 
> I use the following command to convert HTML to text:
> 
>       lynx -stdin -dump \
>               -underline_links -image_links \
>               -display_charset="utf-8" -assume_charset="utf-8"
> 
> 
> My system information:
> I tested on the latest lynx-cur: lynx2.9.0dev.12
> 
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
> 
> 
> What I found:
> 
> I think it only prints the first byte instead of the fully decoded
> codepoint (clong).  The same file, WWW/Library/Implementation/SGML.c,
> already handles a similar case for comments, at the "S_comment_put_c:" label.
> 
> Below is a patch. I'm not sure it covers all lynx options though. I hope it 
> does:
> 
> 
> diff --git a/WWW/Library/Implementation/SGML.c b/WWW/Library/Implementation/SGML.c
> index 2534606..8632670 100644
> --- a/WWW/Library/Implementation/SGML.c
> +++ b/WWW/Library/Implementation/SGML.c
> @@ -3502,9 +3502,13 @@ static void SGML_character(HTStream *me, int c_in)
>           me->state = S_text;
>           break;
>       }
> -     HTChunkPutc(string, c);
> -     break;
>  
> +     if (me->T.decode_utf8) {
> +          HTChunkPutUtf8Char(string, clong);
> +     } else {
> +          HTChunkPutc(string, c);
> +     }
> +     break;
>      case S_sgmlent:          /* Expecting ENTITY. - FM */
>       if (!me->first_dash && c == '-') {
>           HTChunkPutc(string, c);
> 
> 
> Thank you for lynx,
> 
> -- 
> Kind regards,
> Hiltjo
> 

Hi,

Any updates on the status / review of this patch?
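
In case it helps with review, here is a minimal standalone sketch of what I
believe the patched branch does. It is not lynx code: Buf, buf_putc,
buf_put_utf8 and cdata_put are hypothetical stand-ins for the HTChunk routines
and for the CDATA branch of SGML_character, and it assumes that clong holds
the fully decoded codepoint when that branch is reached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an HTChunk: a small fixed-size byte buffer. */
typedef struct {
    unsigned char data[64];
    size_t len;
} Buf;

/* Append one raw byte (plays the role of HTChunkPutc in this sketch). */
static void buf_putc(Buf *b, unsigned char c)
{
    assert(b->len < sizeof(b->data));
    b->data[b->len++] = c;
}

/* Append a codepoint encoded as UTF-8 (plays the role of
 * HTChunkPutUtf8Char in this sketch).  Surrogates and values above
 * U+10FFFF are skipped. */
static void buf_put_utf8(Buf *b, unsigned long cp)
{
    if (cp < 0x80) {
        buf_putc(b, (unsigned char) cp);
    } else if (cp < 0x800) {
        buf_putc(b, (unsigned char) (0xC0 | (cp >> 6)));
        buf_putc(b, (unsigned char) (0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return;
        buf_putc(b, (unsigned char) (0xE0 | (cp >> 12)));
        buf_putc(b, (unsigned char) (0x80 | ((cp >> 6) & 0x3F)));
        buf_putc(b, (unsigned char) (0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        buf_putc(b, (unsigned char) (0xF0 | (cp >> 18)));
        buf_putc(b, (unsigned char) (0x80 | ((cp >> 12) & 0x3F)));
        buf_putc(b, (unsigned char) (0x80 | ((cp >> 6) & 0x3F)));
        buf_putc(b, (unsigned char) (0x80 | (cp & 0x3F)));
    }
}

/* Same shape as the patched branch: emit the whole codepoint when UTF-8
 * decoding is enabled, otherwise emit the single byte as before. */
static void cdata_put(Buf *b, int decode_utf8, unsigned char c,
                      unsigned long clong)
{
    if (decode_utf8)
        buf_put_utf8(b, clong);
    else
        buf_putc(b, c);
}
```

Calling cdata_put(&b, 1, 0xE2, 0x2019) on an empty buffer leaves the three
bytes \342\200\231 in it, while the non-decoding path appends only the single
byte it was given, which matches the truncated output I am seeing.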

Thank you,

-- 
Kind regards,
Hiltjo


