Re: wget2 | ... not followed (disallowed by robots.txt) (#653)


From: Tim Rühsen
Subject: Re: wget2 | ... not followed (disallowed by robots.txt) (#653)
Date: Sun, 28 Jan 2024 12:00:58 +0100
User-agent: Mozilla Thunderbird

Hi,

On 1/27/24 12:24, Luuk n/a (@Luuk34) via Public discussion list for GNU Wget development wrote:


Luuk n/a created an issue: https://gitlab.com/gnuwget/wget2/-/issues/653



A download started using:
wget2.exe --no-parent -r --wait 5 --random-wait https://ghostscript.readthedocs.io/en/gs10.02.0/

produces some lines ending in `not followed (disallowed by robots.txt)`

```
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-docx.svg' not followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-odt.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-odt.svg' not followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-xlsx.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-xlsx.svg' not followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-pptx.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-pptx.svg' not followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg' not followed (disallowed by robots.txt)
```

I can't reproduce this with the latest code.
Can you share the output of `wget2.exe --version`?

Possibly try the wget2.exe from https://github.com/rockdaboot/wget2/releases/download/v2.1.0/wget2.exe


Trying to download a single file from the above result succeeds:
```
wget2 https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg
[0] Downloading 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg' ...
Saving 'icon-txt.svg'
HTTP response 200  [https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg]
```

The file `robots.txt` looks like:
```
D:\TEMP\gs2>dir robots.txt /s/b
D:\TEMP\gs2\ghostscript.readthedocs.io\robots.txt

D:\TEMP\gs2>type ghostscript.readthedocs.io\robots.txt
User-agent: *

Disallow: # Allow everything

Sitemap: https://ghostscript.readthedocs.io/sitemap.xml
```
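
As an independent cross-check of the robots.txt shown above, here is a minimal sketch that feeds the same content to Python's standard `urllib.robotparser` and asks whether one of the rejected URLs may be fetched. This is not how wget2 parses robots.txt internally; the helper name `is_allowed` and the user-agent string "wget2" are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content as quoted in the issue
ROBOTS_TXT = """\
User-agent: *

Disallow: # Allow everything

Sitemap: https://ghostscript.readthedocs.io/sitemap.xml
"""

# One of the URLs reported as "not followed (disallowed by robots.txt)"
URL = "https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg"


def is_allowed(robots_txt: str, url: str, user_agent: str = "wget2") -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`.

    Hypothetical helper; "wget2" as user-agent token is an assumption.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print("allowed:", is_allowed(ROBOTS_TXT, URL))
```

With this parser, an empty `Disallow:` value (the trailing comment is stripped) does not block any path, so the URL should be reported as allowed. A different parser may treat the blank line after `User-agent: *` differently, so this only shows what the file is meant to express.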


The download also succeeds here. After running your wget2 command line from above, the _images directory contains the following (I did this on Linux, but the OS should not matter):

$ l ghostscript.readthedocs.io/en/gs10.02.0/_images/
total 2552
drwxr-xr-x 2 tim tim   4096 Jan 28 11:49 .
drwxr-xr-x 4 tim tim   4096 Jan 28 11:49 ..
-rw-r--r-- 1 tim tim 257596 Sep 12 15:35 cm-fig1.png
-rw-r--r-- 1 tim tim  83208 Sep 12 15:35 cm-fig2.png
-rw-r--r-- 1 tim tim 451111 Sep 12 15:35 cm-fig3.png
-rw-r--r-- 1 tim tim 855244 Sep 12 15:35 cm-fig4.png
-rw-r--r-- 1 tim tim 449076 Sep 12 15:35 cm-fig5.png
-rw-r--r-- 1 tim tim 364278 Sep 12 15:35 cm-fig6.png
-rw-r--r-- 1 tim tim  82695 Sep 12 15:35 cm-fig7.png
-rw-r--r-- 1 tim tim    988 Sep 12 15:35 discord-mark-blue.svg
-rw-r--r-- 1 tim tim  23033 Sep 12 15:35 ghostscript-logo.png
-rw-r--r-- 1 tim tim   1103 Sep 12 15:35 icon-docx.svg
-rw-r--r-- 1 tim tim    720 Sep 12 15:35 icon-odt.svg
-rw-r--r-- 1 tim tim   1152 Sep 12 15:35 icon-pptx.svg
-rw-r--r-- 1 tim tim    617 Sep 12 15:35 icon-txt.svg
-rw-r--r-- 1 tim tim   1070 Sep 12 15:35 icon-xlsx.svg

Can you add '--debug -o log.txt' to the command and share the resulting file 'log.txt'?
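
If the debug log turns out to be large, a small filter can pull out the lines that mention robots.txt before sharing or reading it. The following is only a sketch: the helper name `robots_lines` is hypothetical, it assumes nothing about the wget2 debug log format beyond plain text lines, and it simply does a case-insensitive substring match.

```python
from pathlib import Path


def robots_lines(log_text: str, needle: str = "robots") -> list[str]:
    """Return the log lines that contain `needle`, ignoring case."""
    wanted = needle.lower()
    return [line for line in log_text.splitlines() if wanted in line.lower()]


if __name__ == "__main__":
    log_path = Path("log.txt")  # file name taken from the request above
    if log_path.exists():
        text = log_path.read_text(encoding="utf-8", errors="replace")
        for line in robots_lines(text):
            print(line)
    else:
        print("log.txt not found")
```

The full log is still the more useful thing to share, since the lines around each match may carry the relevant context.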

Regards, Tim


