Re: [Bug-wget] What ought to be a simple use of wget


From: Tim Rühsen
Subject: Re: [Bug-wget] What ought to be a simple use of wget
Date: Thu, 04 Aug 2016 23:00:20 +0200
User-agent: KMail/5.2.3 (Linux/4.6.0-1-amd64; KDE/5.23.0; x86_64; ; )

On Thursday, 4 August 2016 11:35:58 CEST Dale R. Worley wrote:
> Tim Ruehsen <address@hidden> writes:
> > Sounds like "download everything from www.iana.org/assignments/ plus all
> > page requisites on www.iana.org". Page requisites from other domains
> > shouldn't be pulled in !?
> > 
> > Then your first try was very close, it was basically:
> > wget -r --no-parent --page-requisites http://www.iana.org/assignments/index.html
> > 
> > With -d you can see that this page is being redirected to /protocols and
> > thus no further downloading takes place since /protocols would escape the
> > /assignments/ directory (not allowed due to --no-parent).
> 
> I'm getting something different than that...
> 
> First off, let's drop --page-requisites.  That seems to be working
> exactly as I want it, and it just complicates the discussion.
> 
> I'm also using wget 1.16.1, which is a couple of years old.
> 
> If I run the command quoted above, I get output which shows the
> redirection happening, and the file is fetched successfully:
> 
> [Quote characters ASCIIized.]
> 
>     $ wget -r --no-parent http://www.iana.org/assignments/index.html
>     --2016-08-04 11:22:48--  http://www.iana.org/assignments/index.html
>     Resolving www.iana.org (www.iana.org)... 192.0.32.8, 2620:0:2d0:200::8
>     Connecting to www.iana.org (www.iana.org)|192.0.32.8|:80... connected.
>     HTTP request sent, awaiting response... 302 Found
>     Location: /protocols [following]
>     --2016-08-04 11:22:48--  http://www.iana.org/protocols
>     Reusing existing connection to www.iana.org:80.
>     HTTP request sent, awaiting response... 200 OK
>     Length: unspecified [text/html]
>     Saving to: 'www.iana.org/assignments/index.html'
> 
>     www.iana.org/assign     [      <=>             ] 727.79K   578KB/s   in 1.3s
> 
>     2016-08-04 11:22:52 (578 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]
> 
>     FINISHED --2016-08-04 11:22:52--
>     Total wall clock time: 4.7s
>     Downloaded: 1 files, 728K in 1.3s (578 KB/s)
>     $ ls -lR .
>     .:
>     total 4
>     drwxr-xr-x. 3 worley worley 4096 Aug  4 11:22 www.iana.org
> 
>     ./www.iana.org:
>     total 4
>     drwxr-xr-x. 2 worley worley 4096 Aug  4 11:22 assignments
> 
>     ./www.iana.org/assignments:
>     total 728
>     -rw-r--r--. 1 worley worley 745252 Aug  4 11:22 index.html
>     $
> 
> I can argue from the wording of the man page that this is correct, as
> --no-parent is described as "Do not ever ascend to the parent directory
> when retrieving recursively."
> 
> What *seems* to be happening is that index.html is fetched, but its
> links are not fetched recursively, despite the -r and qualifying under
> --no-parent.  E.g., line 23441 of that file is
> 
>     <td><a href="/assignments/yang-parameters/yang-parameters.xhtml#yang-parameters-1">YANG Module Names</a></td>
> 
> which specifies a target URL of
> http://www.iana.org/assignments/yang-parameters/yang-parameters.xhtml.
> And yet, that file is not fetched.
> 
> 
> OK, using -d shows what the internal logic is:  After fetching
> index.html, the wget output is:
> 
>     2016-08-04 11:31:13 (576 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]
> 
>     Deciding whether to enqueue "http://www.iana.org/protocols".
>     Going to "" would escape "assignments" with no_parent on.
>     Decided NOT to load it.
>     Redirection "http://www.iana.org/protocols" failed the test.
> 

This is basically what I said; sorry for not being clear.
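
To spell out the test behind that debug message, here is a rough Python
sketch. It is not wget's actual C code: the names url_dir and
escapes_start_dir are made up for illustration, and it only compares the
directory parts of the two URLs, ignoring everything else a real
implementation has to consider.

```python
import posixpath
from urllib.parse import urlsplit


def url_dir(url):
    """Return the directory part of the URL path, without surrounding slashes."""
    path = urlsplit(url).path
    return posixpath.dirname(path).strip("/")


def escapes_start_dir(start_url, candidate_url):
    """Return True if candidate_url lies outside the directory of start_url."""
    start_dir = url_dir(start_url)
    cand_dir = url_dir(candidate_url)
    if start_dir == "":
        # Starting at the top level: nothing on the host can escape it.
        return False
    # Inside means the same directory or any directory below it.
    inside = cand_dir == start_dir or cand_dir.startswith(start_dir + "/")
    return not inside


start = "http://www.iana.org/assignments/index.html"
redirect = "http://www.iana.org/protocols"
print(repr(url_dir(redirect)), repr(url_dir(start)))
print(escapes_start_dir(start, redirect))
```

The directory of the redirect target is the empty string and the start
directory is "assignments", which matches the two quoted strings in the
debug line. Since the redirect target fails this test, the page is never
handed to the link extraction step.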

> I'm going to have to think about that, as the behavior is rather
> counter-intuitive.  It seems to me that if wget is willing to *fetch* a
> page, it should look at the links on the page for potential recursion.

This is what I meant by:
[It is debatable if this behavior regarding redirections should be changed or 
not, so feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget.]

If you or someone else comes up with a patch, that would be very nice.
Or just open a bug report, so it won't be forgotten.
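
For anyone who wants to experiment before touching the C code, the behavior
Dale describes could be prototyped roughly like this in Python. This is only
a sketch under my own assumptions, not a proposed patch: LinkCollector,
in_start_dir and links_to_enqueue are hypothetical names, and the idea is
simply to judge each link on the fetched page against the directory of the
original start URL, even though the page itself arrived via a redirect.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_start_dir(start_url, url):
    """True if url is on the same host and under the directory of start_url."""
    start, cand = urlsplit(start_url), urlsplit(url)
    start_dir = posixpath.dirname(start.path).rstrip("/") + "/"
    return cand.netloc == start.netloc and cand.path.startswith(start_dir)


def links_to_enqueue(start_url, fetched_url, html):
    """Resolve links against the URL actually fetched, filter by the start URL."""
    collector = LinkCollector()
    collector.feed(html)
    result = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(fetched_url, href))
        if in_start_dir(start_url, absolute) and absolute not in result:
            result.append(absolute)
    return result


page = (
    '<td><a href="/assignments/yang-parameters/yang-parameters.xhtml'
    '#yang-parameters-1">YANG Module Names</a></td>'
    '<a href="/protocols">Protocols</a>'
)
print(links_to_enqueue(
    "http://www.iana.org/assignments/index.html",
    "http://www.iana.org/protocols",
    page,
))
```

With the link from Dale's example, only the yang-parameters URL survives the
filter, while /protocols itself is still rejected. Whether wget should behave
this way for redirected pages is exactly the open question.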

Tim
