Re: wget2 | html hex entities are not correctly decoded (#637)
From: @rockdaboot
Subject: Re: wget2 | html hex entities are not correctly decoded (#637)
Date: Sun, 27 Aug 2023 18:46:29 +0000
Tim Rühsen commented:
https://gitlab.com/gnuwget/wget2/-/issues/637#note_1531241091
I had to look it up, it was too long ago :smile:
So yes, URLs from HTML/XML documents are supposed to contain HTML/XML entities
including the `&#dddd;` and the `&#xhhhh;` forms.
The latter was not implemented in `wget_xml_decode_entities_inline()`. Now it
is (pushed to master) :).
The IRI unescape does URI/IRI unescaping, which is something different. So
there are two layers of unescaping when reading+parsing a URL from an HTML or
XML document.
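
To illustrate the two layers, here is a minimal Python sketch. It is not the
wget2 code: `decode_numeric_entities()` and `url_from_attribute()` are
hypothetical names made up for this example, only the numeric `&#dddd;` and
`&#xhhhh;` forms are handled (no named entities), and `urllib.parse.unquote()`
merely stands in for the URI/IRI unescape step.

```python
import re
from urllib.parse import unquote

# Matches the decimal (&#dddd;) and hexadecimal (&#xhhhh;) numeric forms.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_entities(text: str) -> str:
    """Layer 1 (hypothetical helper): decode numeric HTML/XML entities."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the reference untouched.
            return match.group(0)

    return _NUMERIC_REF.sub(replace, text)


def url_from_attribute(raw_attribute: str) -> str:
    """Apply both layers in order: entity decoding, then percent-unescaping."""
    entity_decoded = decode_numeric_entities(raw_attribute)  # layer 1
    return unquote(entity_decoded)                           # layer 2


if __name__ == "__main__":
    raw = "page&#x3f;q=a%20b&#38;lang=en"
    print(decode_numeric_entities(raw))  # after layer 1 only
    print(url_from_attribute(raw))       # after both layers
```

The order matters in this sketch: the entity layer runs first because it
belongs to the HTML/XML document, and only its result is a URL that can be
percent-unescaped. An input such as `&#x25;41` first becomes `%41` and only
then `A`.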