[Bug-wget] --page-requisites and robot exclusion issue
From: markk
Subject: [Bug-wget] --page-requisites and robot exclusion issue
Date: Sun, 4 Dec 2011 11:57:31 -0000
User-agent: SquirrelMail/1.4.21
Hi,
I'm using wget 1.13.4. There seems to be a problem with wget
over-zealously obeying robot exclusion when --page-requisites is used,
even when only downloading a single URL.
I attempted to download a single web page, specifying --page-requisites so
that the images, CSS and JavaScript files required by the page would also
be downloaded:
wget -x -S --page-requisites http://www.example.com/path/file.html
In the HTML page downloaded, there was this line:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
The presence of that line causes wget to skip the page requisites. (And
there is nothing in the log output to indicate that --page-requisites is
being ignored.)
I think wget should not pay attention to robot exclusion when downloading
page requisites.
Typically, you won't know in advance whether a particular page you're
about to download has a robots meta tag in its HTML source. So to ensure
all requisites are downloaded, you have to specify "-e robots=off"
whenever you use --page-requisites.
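For example, the safe invocation for a single page currently has to look
something like this (same illustrative URL as above):

wget -x -S -e robots=off --page-requisites http://www.example.com/path/file.html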
But in cases where you *are* downloading recursively with
--page-requisites, it would be polite to otherwise obey the robots
exclusion standard by default, which you can't do if you have to use -e
robots=off to ensure all requisites are downloaded.
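For instance, with a recursive crawl such as the following (illustrative
URL again):

wget -r -l 2 --page-requisites http://www.example.com/

there is currently no way to tell wget to honour robots.txt when deciding
which links to follow, while still fetching every retrieved page's
requisites.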
Mark