bug-wget

Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)


From: Eli Zaretskii
Subject: Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)
Date: Mon, 14 Dec 2015 18:33:38 +0200

> Date: Sun, 13 Dec 2015 20:04:31 +0100
> From: "Andries E. Brouwer" <address@hidden>
> Cc: "Andries E. Brouwer" <address@hidden>, address@hidden
> 
> On Sun, Dec 13, 2015 at 08:01:27PM +0200, Eli Zaretskii wrote:
> 
> > If no one is going to pick up the gauntlet, I will sit down and do it
> > myself, although I'm terribly busy with Emacs 25.1 release.
> 
> Good!

While working on this, I bumped into 2 related issues:

 1. The functions that call 'iconv' (in iri.c) don't make a point of
    flushing the last portion of the converted URL after 'iconv'
    returns successfully having converted the input string in its
    entirety.  IME, you need then to call 'iconv' one last time with
    either the 2nd or the 3rd argument set to NULL, otherwise
    sometimes the last converted character doesn't get output.  In my
    case, some URLs converted from CP1255 to UTF-8 lost their last
    character.  It sounds like no one has actually used this
    conversion in iri.c, except for trivially converting UTF-8 to
    itself.  Is that possible/reasonable?

 2. Wget assumes that the URL given on its command line is encoded in
    the locale's encoding.  This is a good assumption when the user
    herself types the URL at the shell prompt, but not when the URL is
    copy-pasted from a browser's address bar.  In the latter case, the
    URL tends to be in UTF-8 (sometimes hex-encoded).  At least that's
    what I get from Firefox.  We don't seem to have in wget any
    facilities to specify a separate (3rd) encoding for the URLs on
    the command line, do we?
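
To illustrate the flush described in item 1, here is a minimal sketch in C.
It is not the code in iri.c: the helper name 'convert_with_flush' and the
output-buffer sizing are assumptions made for this example.  After the main
'iconv' call succeeds, it calls 'iconv' once more with a NULL input buffer
and a real output buffer, which asks the converter to emit anything it is
still holding back.  A converter may hold back the last character when that
character could still combine with a following one, which may be what
happens with CP1255.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not wget's actual code: convert the NUL-terminated
   string IN from encoding FROM to encoding TO.  Returns a malloc'ed,
   NUL-terminated string, or NULL on any failure.  */
char *
convert_with_flush (const char *from, const char *to, const char *in)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t inleft = strlen (in);
  /* Assumption for this sketch: 4 output bytes per input byte plus some
     slack is enough.  If it is not, iconv fails and we return NULL.  */
  size_t cap = inleft * 4 + 16;
  char *out = malloc (cap + 1);   /* +1 for the terminating NUL */
  if (!out)
    {
      iconv_close (cd);
      return NULL;
    }

  char *inp = (char *) in;        /* iconv never writes through this */
  char *outp = out;
  size_t outleft = cap;

  /* First call: convert the whole input.
     Second call: NULL input, real output, to flush pending output.  */
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1
      || iconv (cd, NULL, NULL, &outp, &outleft) == (size_t) -1)
    {
      free (out);
      iconv_close (cd);
      return NULL;
    }

  *outp = '\0';
  iconv_close (cd);
  return out;
}
```

A caller would use it as, e.g., convert_with_flush ("CP1255", "UTF-8", url)
and free the result when done.  Real code would grow the output buffer on
E2BIG rather than give up.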

Thanks.


