bug-wget

Re: [Bug-wget] Unexpected result with -H and -D


From: Friso van Vollenhoven
Subject: Re: [Bug-wget] Unexpected result with -H and -D
Date: Thu, 18 Jan 2018 11:20:34 +0100

Hi,
Thanks for confirming it's a bug. I'm currently not fluent enough in C to
provide a fix myself, but I see a patch was already posted, so I hope
that's satisfactory.

Cheers,
Friso


On Wed, Jan 17, 2018 at 3:01 PM, Darshit Shah <address@hidden> wrote:

> Hi,
>
> This is a bug in Wget, apparently a really old one! Seems like the bug has
> been around since at least 1997.
>
> Looking at the source, the issue is that Wget does a very simple suffix
> match of the actual domain against the accepted domains list. This is
> obviously wrong, as you have just found out.
>
> I'm going to try and implement this correctly, but I'm currently a little
> short on time, so if anyone else wants to pick it up, please feel free to.
> It's simple: use libpsl to get the proper domain name and match against that.
>
>
> Of course, this change will require libpsl to no longer be an optional
> dependency.
>
> * Friso van Vollenhoven <address@hidden> [180117 14:40]:
> > Hello all,
> >
> > I am trying to do a recursive download of a webpage and span multiple
> > hosts within the same domain, but not cross to other domains. The issue is
> > that the crawl does extend to other domains. My full command is this:
> >
> > wget \
> > --recursive \
> > --no-clobber \
> > --page-requisites \
> > --adjust-extension \
> > --span-hosts \
> > --domains=scapino.nl \
> > --no-parent \
> > --tries=2 \
> > --wait=1 \
> > --random-wait \
> > --waitretry=2 \
> > --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
> > https://www.scapino.nl/winkels/scapino-utrecht-510061
> >
> > From this combination of --span-hosts and --domains, I would expect to
> > download assets from cdn.scapino.nl and www.scapino.nl, but not other
> > domains. For some reason that I don't understand, wget also starts to do
> > what looks like a full crawl of the domain werkenbijscapino.nl, which is
> > referenced from the original page.
> >
> > Any thoughts or direction would be much appreciated.
> >
> > I am using wget 1.18 on Debian.
> >
> >
> > Best regards,
> > Friso
>
> --
> Thanking You,
> Darshit Shah
> PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
>
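
To make the suffix-matching problem described in the quoted reply concrete, here is a minimal Python sketch. It is not Wget's actual code (Wget is written in C), and the function names are hypothetical. The first function mimics a plain string-suffix test; the second requires the match to fall on a label boundary (a dot). Note that the second function is a simpler stand-in chosen for illustration, not the libpsl-based approach proposed in the reply.

```python
def naive_suffix_match(host: str, domain: str) -> bool:
    """Plain string-suffix test, the kind of check the reply describes."""
    return host.lower().endswith(domain.lower())


def label_boundary_match(host: str, domain: str) -> bool:
    """Accept host only if it equals domain or ends with '.' + domain."""
    host = host.lower().rstrip(".")      # tolerate a trailing root dot
    domain = domain.lower().strip(".")   # tolerate '.scapino.nl' style input
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    accepted = "scapino.nl"
    for host in ("www.scapino.nl", "cdn.scapino.nl", "werkenbijscapino.nl"):
        print(host,
              naive_suffix_match(host, accepted),
              label_boundary_match(host, accepted))
```

With the hosts from the original report, the naive test accepts werkenbijscapino.nl because the string happens to end in "scapino.nl", while the boundary-aware test rejects it and still accepts www.scapino.nl and cdn.scapino.nl.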

