[Bug-wget] Re: How to ignore errors with time stamping


From: Morten Lemvigh
Subject: [Bug-wget] Re: How to ignore errors with time stamping
Date: Fri, 12 Dec 2008 09:03:58 +0100
User-agent: Thunderbird 2.0.0.18 (X11/20081125)

Andre Majorel wrote:
On 2008-12-11 09:17 +0100, Morten Lemvigh wrote:

I'm having a problem retrieving a page, when I use the time
stamping option.

When I run wget with:
wget -N 'http://eur-lex.europa.eu/JOHtml.do?uri=OJ:C:2007:306:SOM:EN:HTML'

the file is downloaded, but I get the message:
"Last-modified header missing -- time-stamps turned off."

If I run the command a second time, I get "ERROR 500: Internal Server Error." and wget exits. If I leave the time stamping option out, the document is retrieved again.

Is there a way to make wget ignore missing Last-modified headers, and just retrieve the document?

I believe it's what it does by default. Wget only checks for the
Last-modified header here because you told it to (-N).

Some of the documents on the site are sent with a Last-modified header, and I don't want to retrieve those again if I already have them, hence the -N. But on some documents the header is missing, and in that situation wget doesn't do anything with the page; it just continues with the next one. I would like wget to continue looking at the links on that page.

When mirroring a site, wget will stop and not follow any links
on a page that doesn't send a Last-modified header.

Do you have a log showing that behaviour ? Recursive retrieval of
sites that don't return Last-modified works for me.

No links on a page with a missing Last-modified header are scanned if the page is already on disk. If I run:

wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML

--08:51:24-- http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
           => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
Resolving eur-lex.europa.eu... 147.67.136.2, 147.67.136.102, 147.67.119.2, ...
Connecting to eur-lex.europa.eu|147.67.136.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9.709 (9.5K) [text/html]

100%[=====================================================================================================================================>] 9.709 --.--K/s

Last-modified header missing -- time-stamps turned off.
08:51:24 (82.42 KB/s) - `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML' saved [9709/9709]
[....]

wget will retrieve the page and continue recursively getting all the linked pages, as I would expect. If I issue this command a second time, all I get is this:

wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
--08:53:18-- http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
           => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
Resolving eur-lex.europa.eu... 147.67.119.2, 147.67.119.102, 147.67.136.2, ...
Connecting to eur-lex.europa.eu|147.67.119.2|:80... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
08:53:18 ERROR 500: Internal Server Error.


FINISHED --08:53:18--
Downloaded: 0 bytes in 0 files

So all the pages linked from this page are ignored too. It's fine if wget skips the problematic document, but I would prefer that wget continue the recursive scan.
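Until wget handles this itself, the run could be wrapped so that when the timestamped attempt aborts (as with the ERROR 500 above), the start URL is retried once without -N, so its links still get scanned. A sketch under my own assumptions; mirror_with_fallback and the retry strategy are mine, not wget features:

```sh
#!/bin/sh
# If "wget -r -N URL" exits with a non-zero status (e.g. the server
# answered 500), retry the same URL once without -N. The plain -r run
# re-fetches the start page unconditionally, so its links are followed.
mirror_with_fallback() {
    url=$1
    if ! wget -r -N "$url"; then
        wget -r "$url"
    fi
}

# Example (the URL is the one from the report):
# mirror_with_fallback 'http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
```

Note this only rescues the start URL; pages deeper in the recursion that hit the same problem would need the same treatment.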







