wget-dev

wget2 | ... not followed (disallowed by robots.txt) (#653)


From: Luuk n/a (@Luuk34)
Subject: wget2 | ... not followed (disallowed by robots.txt) (#653)
Date: Sat, 27 Jan 2024 11:24:33 +0000


Luuk n/a created an issue: https://gitlab.com/gnuwget/wget2/-/issues/653



A download started using:
wget2.exe --no-parent -r --wait 5 --random-wait https://ghostscript.readthedocs.io/en/gs10.02.0/

produces some lines ending in `not followed (disallowed by robots.txt)`

```
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-docx.svg' not followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-odt.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-odt.svg' not followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-xlsx.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-xlsx.svg' not followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-pptx.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-pptx.svg' not followed (disallowed by robots.txt)
Adding URL: https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg
URL 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg' not followed (disallowed by robots.txt)
```

Downloading a single file from the output above on its own succeeds:
```
wget2 https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg
[0] Downloading 'https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg' ...
Saving 'icon-txt.svg'
HTTP response 200  [https://ghostscript.readthedocs.io/en/gs10.02.0/_images/icon-txt.svg]
```

The file `robots.txt` looks like:
```
D:\TEMP\gs2>dir robots.txt /s/b
D:\TEMP\gs2\ghostscript.readthedocs.io\robots.txt

D:\TEMP\gs2>type ghostscript.readthedocs.io\robots.txt
User-agent: *

Disallow: # Allow everything

Sitemap: https://ghostscript.readthedocs.io/sitemap.xml
```
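
For reference, under the robots exclusion rules a `#` starts a comment and an empty `Disallow:` value disallows nothing, so the file above should permit every URL. The following is a minimal Python sketch of that interpretation. It is not wget2's parser and does not attempt to explain why wget2 rejects the URLs; the helper names `disallow_rules` and `is_allowed` are hypothetical, and group handling is deliberately simplified to a single `User-agent` line per group with plain prefix matching.

```python
def disallow_rules(robots_txt: str, agent: str = "*") -> list[str]:
    """Collect non-empty Disallow prefixes that apply to `agent`.

    Simplified sketch: comments are stripped at '#', blank lines are
    ignored, and only one User-agent line per group is considered.
    """
    rules = []
    applies = False
    for raw in robots_txt.splitlines():
        # Everything from '#' to the end of the line is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            applies = value == agent
        elif field == "disallow" and applies and value:
            # An empty Disallow value disallows nothing, so it is skipped.
            rules.append(value)
    return rules


def is_allowed(path: str, rules: list[str]) -> bool:
    """A path is allowed unless it starts with a disallowed prefix."""
    return not any(path.startswith(rule) for rule in rules)


# The robots.txt content quoted in this report.
ROBOTS_TXT = """User-agent: *

Disallow: # Allow everything

Sitemap: https://ghostscript.readthedocs.io/sitemap.xml
"""

rules = disallow_rules(ROBOTS_TXT)
print(rules)
print(is_allowed("/en/gs10.02.0/_images/icon-txt.svg", rules))
```

With this reading, the rule list for the quoted file is empty, so none of the `_images/*.svg` paths from the log would be excluded.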
