Re: [Bug-wget] What ought to be a simple use of wget
From: Tim Rühsen
Subject: Re: [Bug-wget] What ought to be a simple use of wget
Date: Thu, 04 Aug 2016 23:00:20 +0200
User-agent: KMail/5.2.3 (Linux/4.6.0-1-amd64; KDE/5.23.0; x86_64; ; )
On Thursday, 4 August 2016 11:35:58 CEST Dale R. Worley wrote:
> Tim Ruehsen <address@hidden> writes:
> > Sounds like "download everything from www.iana.org/assignments/ plus all
> > page requisites on www.iana.org". Page requisites from other domains
> > shouldn't be pulled in !?
> >
> > Then your first try was very close, it was basically:
> > wget -r --no-parent --page-requisites http://www.iana.org/assignments/index.html
> >
> > With -d you can see that this page is being redirected to /protocols and
> > thus no further downloading takes place since /protocols would escape the
> > /assignments/ directory (not allowed due to --no-parent).
>
> I'm getting something different than that...
>
> First off, let's drop --page-requisites. That seems to be working
> exactly as I want it, and it just complicates the discussion.
>
> I'm also using wget 1.16.1, which is a couple of years old.
>
> If I run the command quoted above, I get output which shows the
> redirection happening, and the file is fetched successfully:
>
> [Quote characters ASCIIized.]
>
> $ wget -r --no-parent http://www.iana.org/assignments/index.html
> --2016-08-04 11:22:48-- http://www.iana.org/assignments/index.html
> Resolving www.iana.org (www.iana.org)... 192.0.32.8, 2620:0:2d0:200::8
> Connecting to www.iana.org (www.iana.org)|192.0.32.8|:80... connected.
> HTTP request sent, awaiting response... 302 Found
> Location: /protocols [following]
> --2016-08-04 11:22:48-- http://www.iana.org/protocols
> Reusing existing connection to www.iana.org:80.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> Saving to: 'www.iana.org/assignments/index.html'
>
> www.iana.org/assign [ <=> ] 727.79K 578KB/s in 1.3s
>
> 2016-08-04 11:22:52 (578 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]
>
> FINISHED --2016-08-04 11:22:52--
> Total wall clock time: 4.7s
> Downloaded: 1 files, 728K in 1.3s (578 KB/s)
> $ ls -lR .
> .:
> total 4
> drwxr-xr-x. 3 worley worley 4096 Aug 4 11:22 www.iana.org
>
> ./www.iana.org:
> total 4
> drwxr-xr-x. 2 worley worley 4096 Aug 4 11:22 assignments
>
> ./www.iana.org/assignments:
> total 728
> -rw-r--r--. 1 worley worley 745252 Aug 4 11:22 index.html
> $
>
> I can argue from the wording of the man page that this is correct, as
> --no-parent is described as "Do not ever ascend to the parent directory
> when retrieving recursively."
>
> What *seems* to be happening is that index.html is fetched, but its
> links are not fetched recursively, despite the -r and qualifying under
> --no-parent. E.g., line 23441 of that file is
>
> <td><a
> href="/assignments/yang-parameters/yang-parameters.xhtml#yang-parameters-1"
> >YANG Module Names</a></td>
>
> which specifies a target URL of
> http://www.iana.org/assignments/yang-parameters/yang-parameters.xhtml.
> And yet, that file is not fetched.
>
>
> OK, using -d shows what the internal logic is: After fetching
> index.html, the wget output is:
>
> 2016-08-04 11:31:13 (576 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]
>
> Deciding whether to enqueue "http://www.iana.org/protocols".
> Going to "" would escape "assignments" with no_parent on.
> Decided NOT to load it.
> Redirection "http://www.iana.org/protocols" failed the test.
>
This is basically what I said; sorry for not being clear.
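To make the debug output above concrete, here is a rough shell sketch of the kind of directory comparison that --no-parent appears to apply. It is not wget's actual code: the helper names path_dir, is_subdir and check are made up for illustration, and the sketch only looks at URL paths (no hosts, schemes or query strings).

```shell
# Hypothetical helper: print the directory part of a URL path without
# leading or trailing slashes, e.g. "/assignments/index.html" -> "assignments".
path_dir() {
    p=${1#/}                              # drop the leading slash
    case $p in
        */*) printf '%s\n' "${p%/*}" ;;   # everything before the last slash
        *)   printf '\n' ;;               # no slash left: top-level directory
    esac
}

# Hypothetical helper: succeed when directory $2 is $1 itself or lies below it.
is_subdir() {
    case $2 in
        "$1"|"$1"/*) return 0 ;;
    esac
    [ -z "$1" ]    # an empty start directory is the site root and contains everything
}

# Decide whether a candidate path ($2) stays inside the start path's ($1) directory.
check() {
    if is_subdir "$(path_dir "$1")" "$(path_dir "$2")"; then
        echo "enqueue $2"
    else
        echo "skip $2"
    fi
}

check /assignments/index.html /protocols
check /assignments/index.html /assignments/yang-parameters/yang-parameters.xhtml
```

Under these assumptions the redirect target /protocols has the empty directory "", which is not below "assignments", so it is skipped; the yang-parameters link would pass the same test if it were ever examined.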
> I'm going to have to think about that, as the behavior is rather
> counter-intuitive. It seems to me that if wget is willing to *fetch* a
> page, it should look at the links on the page for potential recursion.
This is what I meant by:
[It is debatable if this behavior regarding redirections should be changed or
not, so feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget.]
If you or someone else comes up with a patch, that would be very nice.
Or just open a bug report, so it won't be forgotten.
Tim