
Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)


From: Eli Zaretskii
Subject: Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)
Date: Tue, 15 Dec 2015 18:52:01 +0200

> From: Tim Ruehsen <address@hidden>
> Cc: Eli Zaretskii <address@hidden>
> Date: Tue, 15 Dec 2015 11:02:21 +0100
> 
> I pushed a conversion fix to master.

Thanks!

> There is another bug in wget that comes out with
> wget -d --local-encoding=cp1255 
> 'http://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4'
> 
> Wget double escapes/converts to UTF-8... Maybe you can address this when you 
> are working on the code !?

You mean, because http redirects to https?  Yes, I've seen that
already.  The simple patch below fixes that.  The problem seems to be
that wget assumes the redirected URL to be encoded in the same
encoding as the original one (which, as described earlier, starts with
the local encoding), whereas it is much more reasonable to use the
value provided by --remote-encoding.
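
To illustrate what such a double conversion does to the bytes, here is
a minimal sketch.  It is not wget code: latin1_to_utf8 is a
hypothetical helper, and ISO-8859-1 stands in for the local encoding
only because its mapping to UTF-8 is trivial (cp1255 would need a
lookup table).  The input is the UTF-8 form of the Hebrew letter shin
(U+05E9), i.e. text that is already UTF-8, as a redirected URL may be.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of wget: convert n bytes of ISO-8859-1
   to UTF-8.  The out buffer must have room for 2 * n bytes.  Returns
   the number of bytes written. */
static size_t
latin1_to_utf8 (const unsigned char *in, size_t n, unsigned char *out)
{
  size_t k = 0;

  assert (in != NULL && out != NULL);
  for (size_t j = 0; j < n; j++)
    {
      if (in[j] < 0x80)
        out[k++] = in[j];                   /* ASCII passes through */
      else
        {
          out[k++] = 0xC0 | (in[j] >> 6);   /* lead byte */
          out[k++] = 0x80 | (in[j] & 0x3F); /* continuation byte */
        }
    }
  return k;
}
```

Feeding it the two bytes D7 A9 yields the four bytes C3 97 C2 A9,
which no longer decode to the original letter.  That is the kind of
damage a second conversion does when the input is wrongly assumed to
be in a single-byte local encoding.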

And if the 'if' in the patch looks strange to you, it's rightfully
so.  Look at this strange logic in set_uri_encoding:

  /* Set uri_encoding of struct iri i. If a remote encoding was specified, use
     it unless force is true. */
  void
  set_uri_encoding (struct iri *i, const char *charset, bool force)
  {
    DEBUGP (("URI encoding = %s\n", charset ? quote (charset) : "None"));
    if (!force && opt.encoding_remote)
      return;

I understand the reason to prefer opt.encoding_remote when the 'force'
flag is false -- the user-provided remote encoding should take
preference.  But why return without making sure the URI's encoding is
in fact set to that??  I guess there's some assumption that
iri->uri_encoding is already set to opt.encoding_remote, but this
assumption is certainly false in this case.  So I think this function
should be changed to actually use opt.encoding_remote, if non-NULL,
and otherwise use 'charset' even if 'force' is false.  Then the patch
below could be simplified to avoid the test.  WDYT?
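
Roughly, the change I have in mind could look like the sketch below.
This is only one reading of the idea, not a tested patch: struct
iri_sketch, opt_sketch, same_charset and set_uri_encoding_sketch are
hypothetical stand-ins so that the snippet compiles on its own, and
the real function would keep using wget's own helpers and its DEBUGP
call.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for wget's struct iri and the global opt. */
struct iri_sketch { char *uri_encoding; };
static struct { const char *encoding_remote; } opt_sketch;

/* Case-insensitive comparison of two charset names. */
static bool
same_charset (const char *a, const char *b)
{
  for (; *a && *b; a++, b++)
    if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
      return false;
  return *a == *b;
}

/* Set the URI encoding.  Unless force is true, a user-supplied remote
   encoding replaces charset instead of causing an early return. */
static void
set_uri_encoding_sketch (struct iri_sketch *i, const char *charset, bool force)
{
  assert (i != NULL);
  if (!force && opt_sketch.encoding_remote)
    charset = opt_sketch.encoding_remote;

  if (i->uri_encoding)
    {
      /* Nothing to do if the requested encoding is already set. */
      if (charset && same_charset (i->uri_encoding, charset))
        return;
      free (i->uri_encoding);
      i->uri_encoding = NULL;
    }

  if (charset)
    {
      i->uri_encoding = malloc (strlen (charset) + 1);
      assert (i->uri_encoding != NULL);
      strcpy (i->uri_encoding, charset);
    }
}
```

With something like this, a non-forced call always leaves the URI
encoding equal to the remote encoding when one was given, so the
caller would not need its own test of opt.encoding_remote.  What the
caller should pass as 'charset' when no remote encoding was given is
still open.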

Here's the patch I promised.  With it, wget survives redirection from
http to https and successfully retrieves that page.


diff --git a/src/retr.c b/src/retr.c
index a6a9bd7..6af26a0 100644
--- a/src/retr.c
+++ b/src/retr.c
@@ -872,9 +872,11 @@ retrieve_url (struct url * orig_parsed, const char *origurl, char **file,
       xfree (mynewloc);
       mynewloc = construced_newloc;
 
-      /* Reset UTF-8 encoding state, keep the URI encoding and reset
+      /* Reset UTF-8 encoding state, set the URI encoding and reset
          the content encoding. */
       iri->utf8_encode = opt.enable_iri;
+      if (opt.encoding_remote)
+       set_uri_encoding (iri, opt.encoding_remote, true);
       set_content_encoding (iri, NULL);
       xfree (iri->orig_url);
 


