[Bug-wget] Re: How to ignore errors with time stamping
From: Morten Lemvigh
Subject: [Bug-wget] Re: How to ignore errors with time stamping
Date: Fri, 12 Dec 2008 09:03:58 +0100
User-agent: Thunderbird 2.0.0.18 (X11/20081125)

Andre Majorel wrote:
> On 2008-12-11 09:17 +0100, Morten Lemvigh wrote:
>> I'm having a problem retrieving a page, when I use the time
>> stamping option.
>>
>> When I run wget with:
>>
>> wget -N 'http://eur-lex.europa.eu/JOHtml.do?uri=OJ:C:2007:306:SOM:EN:HTML'
>>
>> the file is downloaded, but I get the message:
>> "Last-modified header missing -- time-stamps turned off."
>>
>> If I run the command a second time, I get an "ERROR 500: Internal Server
>> Error." and wget exits. If I leave the time stamping option out, the
>> document is retrieved again.
>>
>> Is there a way to make wget ignore missing Last-modified headers, and
>> just retrieve the document?
>
> I believe it's what it does by default. Wget only checks for the
> Last-modified header here because you told it to (-N).

Some of the documents on the site are sent with a Last-modified header,
and I don't want to retrieve those if I already have them, hence the -N.
But on some documents the header is missing, and in that situation wget
doesn't do anything with the page; it just continues with the next page.
I would like wget to continue looking at the links on that page.
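
To make the behaviour I'm after concrete, here is a rough sketch in shell.
It is not how wget is implemented. The function name timestamp_action and
its two arguments are made up for illustration, and I assume the response
headers have already been saved to a file (for example from the output of
wget -S).

```shell
# Hypothetical helper, not wget code.
# $1 = path of the local copy of the page
# $2 = file holding the HTTP response headers for that page
# Prints one of: download, compare, redownload
timestamp_action() {
    if [ ! -e "$1" ]; then
        # Nothing on disk yet: always fetch.
        echo download
    elif grep -qi '^[[:space:]]*last-modified:' "$2"; then
        # Header present: compare it with the local copy, skip if not newer.
        echo compare
    else
        # Header missing: fetch again, so the links on the page still get scanned.
        echo redownload
    fi
}
```

The third branch is the case I care about: a page that is already on disk
but arrives without a Last-modified header should still be retrieved and
have its links followed.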

>> When mirroring a site wget will stop and not follow any links
>> on a page, which doesn't send a Last-modified header.
>
> Do you have a log showing that behaviour ? Recursive retrieval of
> sites that don't return Last-modified works for me.

No links on a page with a missing Last-modified header are scanned if
the page is already on disk. If I run:
wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
--08:51:24--  http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
           => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
Resolving eur-lex.europa.eu... 147.67.136.2, 147.67.136.102, 147.67.119.2, ...
Connecting to eur-lex.europa.eu|147.67.136.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9.709 (9.5K) [text/html]
100%[=====================================================================================================================================>]
9.709 --.--K/s
Last-modified header missing -- time-stamps turned off.
08:51:24 (82.42 KB/s) - `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML' saved [9709/9709]
[....]
wget will retrieve the page and continue recursively getting all the
linked pages, as I would expect. If I issue this command a second time,
all I get is this:
wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
--08:53:18--  http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
           => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
Resolving eur-lex.europa.eu... 147.67.119.2, 147.67.119.102, 147.67.136.2, ...
Connecting to eur-lex.europa.eu|147.67.119.2|:80... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
08:53:18 ERROR 500: Internal Server Error.
FINISHED --08:53:18--
Downloaded: 0 bytes in 0 files
So all the pages linked from this page are ignored too. It's fine if wget
skips the problematic document, but I would prefer wget to continue the
recursive scan.
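
One workaround I could imagine, untested against this server, is to let
wget read the copy that is already on disk and follow the links from
there, using its -i, --force-html and --base options. The wrapper below is
only a sketch: rescan_from_disk and the WGET variable are names I made up,
and pages linked from the local copy may of course still run into the same
500 error when fetched with -N.

```shell
# Hypothetical wrapper, not part of wget.
# $1 = URL the page was originally fetched from
# $2 = path of the copy already on disk
# Set WGET=echo to print the wget arguments instead of running wget.
rescan_from_disk() {
    # -i FILE       read the start page from a local file
    # --force-html  treat that file as HTML
    # --base=URL    resolve relative links in the file against the original URL
    ${WGET:-wget} -r -N --force-html --base="$1" -i "$2"
}

# Example call (commented out, since it would hit the network):
# rescan_from_disk 'http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML' \
#     'eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
```

Both arguments need quoting on the command line because of the "?" in the
URL and in the local file name.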