
Re: [Bug-wget] Support non-ASCII URLs


From: Tim Rühsen
Subject: Re: [Bug-wget] Support non-ASCII URLs
Date: Sun, 20 Dec 2015 21:11:57 +0100
User-agent: KMail/4.14.10 (Linux/4.3.0-1-amd64; KDE/4.14.14; x86_64; ; )

On Sunday, 20 December 2015 at 19:23:05, Eli Zaretskii wrote:
> > From: Tim Rühsen <address@hidden>
> > Date: Sun, 20 Dec 2015 16:26:20 +0100
> > 
> > > Tim sent me the tarball and the log off-list (thanks!).  I didn't yet
> > > try to build Wget, but just looking at the test, I guess I don't
> > > understand its idea.  It has an index.html page that's encoded in
> > > ISO-8859-15, but Wget is invoked with --remote-encoding=iso-8859-1,
> > > and the URLs themselves in "my %urls" are all encoded in UTF-8.  How's
> > > this supposed to work?
> > 
> > According to the wget man page, --remote-encoding just sets the *default*
> > server encoding. It only comes into play when the HTTP header does not
> > contain a Content-Type with a charset *and* the HTML page does not
> > contain a <meta http-equiv="Content-Type" with 'content=... charset=...'.
> 
> Makes sense.
> 
> > 'index.html' in this test correctly has a meta tag with charset=utf-8
> > and URLs encoded in UTF-8.
> 
> That's not what I see: index.html says
> 
>   "Content-type" => "text/html; charset=ISO-8859-15"
> 
> and its contents indeed have URLs encoded in ISO-8859-15.

Yep, my oversight :-)

That has to be fixed as well, either way.
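To illustrate why the page's charset matters here: the same link name
percent-encodes to a different URL depending on which encoding is assumed for
the page containing it. A minimal Python sketch (the name "français" is just an
example, not taken from the test itself):

```python
from urllib.parse import quote

# The same name yields different percent-encoded URLs depending on the
# charset assumed for the page that contains the link.
name = "français"
print(quote(name, encoding="utf-8"))       # fran%C3%A7ais
print(quote(name, encoding="iso-8859-1"))  # fran%E7ais
```

So a test whose %urls are UTF-8-encoded only matches a page that is actually
served (or declared) as UTF-8.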

> > > Also, I'm not following the logic of overriding Content-type by the
> > > remote encoding: p1_fran%C3%A7ais.html states "charset=UTF-8", but
> > > includes a link encoded in ISO-8859-1, and the test seems to expect
> > > Wget to use the remote encoding in preference to what "charset=" says.
> > 
> > Either the test is wrong here or the man page is. I would say the man page
> > should be correct here - it makes the most sense to me. In that case the
> > test is wrong, and so is its comment.
> 
> OK.
> 
> > > Does the remote encoding override the encoding for the _contents_ of
> > > the URL, not just for the URL itself?  That seems to make little sense
> > > to me: the contents and the name can legitimately be encoded
> > > differently, I think.
> > 
> > The filenames in %expected_downloaded_files depend on --local-encoding.
> > Since this is not given on the command line, this test will behave
> > differently with different settings for LC_ALL ('make check' uses
> > LC_ALL=C; contrib/check-hard will also run 'make check' with a Turkish
> > UTF-8 locale).
> > 
> > To fix the test, we should set --local-encoding to some kind of UTF-8
> > locale (or something else, but then we have to fix the filenames for
> > that locale).
> 
> But then what would be the point of repeating the test with the
> Turkish locale? To verify that, when given --local-encoding, the locale
> is ignored?

No special point. contrib/check-hard runs configure and 'make check' with 
different combinations of flags and environment settings. We chose the Turkish 
UTF-8 locale because of its well-known issues with upper- and lowercase I.

With different locales (C and tr_TR.utf8, as used in contrib/check-hard) we 
either have to

1. check the locale in the perl test and amend the filenames in 
%expected_downloaded_files

or

2. use --local-encoding
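The dependency behind both options can be sketched as follows: the expected
local filename is, roughly, the percent-decoded URL interpreted in the local
encoding, so the same remote URL yields different filenames under different
--local-encoding assumptions. This is only an illustrative Python sketch of
that dependency, not Wget's actual conversion path (the filename is taken from
the test):

```python
from urllib.parse import unquote

# A link percent-encoded as UTF-8 on the server side...
url_path = "p1_fran%C3%A7ais.html"

# ...decoded with the matching local encoding yields the intended name,
print(unquote(url_path, encoding="utf-8"))       # p1_français.html

# ...while decoding under a mismatched Latin-1 assumption yields the
# mojibake filename the test would have to expect instead.
print(unquote(url_path, encoding="iso-8859-1"))  # p1_franÃ§ais.html
```

Hence option 1 (branch on the locale and amend %expected_downloaded_files) and
option 2 (pin --local-encoding so the expected names are fixed) are two ways of
resolving the same ambiguity.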

Regards, Tim



