Re: wget2 | ... not followed (disallowed by robots.txt) (#653)
From: Tim Rühsen
Subject: Re: wget2 | ... not followed (disallowed by robots.txt) (#653)
Date: Sun, 28 Jan 2024 12:00:58 +0100
User-agent: Mozilla Thunderbird
Hi,
On 1/27/24 12:24, Luuk n/a (@Luuk34) via Public discussion list for GNU
Wget development wrote:
Luuk n/a created an issue: https://gitlab.com/gnuwget/wget2/-/issues/653
A download started using:
wget2.exe --no-parent -r --wait 5 --random-wait
https://ghostscript.readthedocs.io/en/gs10.02.0/
produces some lines ending in `not followed (disallowed by robots.txt)`
```
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-docx.svg' not
followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-odt.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-odt.svg' not
followed (disallowed by robots.txt)
Adding URL:
https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-xlsx.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-xlsx.svg' not
followed (disallowed by robots.txt)
Adding URL:
https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-pptx.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-pptx.svg' not
followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg' not
followed (disallowed by robots.txt)
```
I can't reproduce this with the latest code.
Can you share the output of `wget2.exe --version`?
Possibly try the wget2.exe from
https://github.com/rockdaboot/wget2/releases/download/v2.1.0/wget2.exe
Trying to download a single file from the above results succeeds:
```
wget2 https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg
[0] Downloading
'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg' ...
Saving 'icon-txt.svg'
HTTP response 200
[https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg]
```
The file `robots.txt` looks like:
```
D:\TEMP\gs2>dir robots.txt /s/b
D:\TEMP\gs2\ghostscript.readthedocs.io\robots.txt
D:\TEMP\gs2>type ghostscript.readthedocs.io\robots.txt
User-agent: *
Disallow: # Allow everything
Sitemap: https://ghostscript.readthedocs.io/sitemap.xml
```
Same here. The _images directory contains the following after executing
your wget2 command line from above (I did this on Linux, but the OS
should not matter):
$ l ghostscript.readthedocs.io/en/gs10.02.0/_images/
total 2552
drwxr-xr-x 2 tim tim 4096 Jan 28 11:49 .
drwxr-xr-x 4 tim tim 4096 Jan 28 11:49 ..
-rw-r--r-- 1 tim tim 257596 Sep 12 15:35 cm-fig1.png
-rw-r--r-- 1 tim tim 83208 Sep 12 15:35 cm-fig2.png
-rw-r--r-- 1 tim tim 451111 Sep 12 15:35 cm-fig3.png
-rw-r--r-- 1 tim tim 855244 Sep 12 15:35 cm-fig4.png
-rw-r--r-- 1 tim tim 449076 Sep 12 15:35 cm-fig5.png
-rw-r--r-- 1 tim tim 364278 Sep 12 15:35 cm-fig6.png
-rw-r--r-- 1 tim tim 82695 Sep 12 15:35 cm-fig7.png
-rw-r--r-- 1 tim tim 988 Sep 12 15:35 discord-mark-blue.svg
-rw-r--r-- 1 tim tim 23033 Sep 12 15:35 ghostscript-logo.png
-rw-r--r-- 1 tim tim 1103 Sep 12 15:35 icon-docx.svg
-rw-r--r-- 1 tim tim 720 Sep 12 15:35 icon-odt.svg
-rw-r--r-- 1 tim tim 1152 Sep 12 15:35 icon-pptx.svg
-rw-r--r-- 1 tim tim 617 Sep 12 15:35 icon-txt.svg
-rw-r--r-- 1 tim tim 1070 Sep 12 15:35 icon-xlsx.svg
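As a cross-check of the robots.txt quoted above: an empty `Disallow:` value conventionally means "allow everything", so none of the icon URLs should be blocked. The following is a minimal sketch using Python's standard `urllib.robotparser`, which is not wget2's own parser and may differ from it in details. The user-agent token `wget2` and the helper name `is_allowed` are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content exactly as quoted in the issue report.
ROBOTS_TXT = """\
User-agent: *
Disallow: # Allow everything
Sitemap: https://ghostscript.readthedocs.io/sitemap.xml
"""

# Directory from the issue report that holds the "not followed" URLs.
BASE = "https://ghostscript.readthedocs.io/en/gs10.02.0/_images/"


def is_allowed(url: str, user_agent: str = "wget2") -> bool:
    """Return whether ROBOTS_TXT permits user_agent to fetch url.

    The default user-agent token "wget2" is an assumption; the rules
    above apply to every agent ("*"), so the token does not change
    the result here.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines; no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name in ("icon-docx.svg", "icon-odt.svg", "icon-xlsx.svg",
                 "icon-pptx.svg", "icon-txt.svg"):
        print(name, is_allowed(BASE + name))
```

If a parser reports these URLs as disallowed for this file, the likely suspects are the handling of the empty `Disallow:` value or of the trailing `#` comment.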
Can you add '--debug -o log.txt' to the command and share the resulting
file 'log.txt' afterwards?
Regards, Tim