Re: [Bug-wget] Infinite loop, and bad 'adjust extension' on pdf
From: Ángel González
Subject: Re: [Bug-wget] Infinite loop, and bad 'adjust extension' on pdf
Date: Sat, 24 Nov 2012 17:45:33 +0100
User-agent: Thunderbird
On 24/11/12 11:33, Lluís Batlle i Rossell wrote:
> Hello,
>
> I was downloading recursively. Specifically:
> wget --domains="data.inh.cat,data.jordibilbeny.com,www.inh.cat,www.jordibilbeny.com" \
>   -H --adjust-extension -k -r -c -l 3 http://www.jordibilbeny.com/
>
> And:
>
> 1) It went into an infinite loop while downloading
> http://www.inh.cat/robots.txt: the server returned HTTP 416 and wget kept
> retrying again and again. I had to remove '-c' to make wget go.
>
> 2) All links to '.pdf' files had their targets changed to '.pdf.html' (that is,
> -k and --adjust-extension, I guess). But the downloaded pdf files didn't have
> the ".html" suffix added, so the local links failed.
>
> I used a "sed -i" in the files of my interest, to rewrite the anchor targets.
>
> I'm running 1.13.4.
>
> Thank you,
> Lluís.
Hi Lluís,
I tested (2) with wget 1.14:
> wget --domains="data.inh.cat,data.jordibilbeny.com,www.inh.cat,www.jordibilbeny.com" \
>   -H --adjust-extension -k -r -c -l 1 \
>   http://www.jordibilbeny.com/altres-investigadors-id.php?Id=35
and it didn't change the .pdf links to .pdf.html.
I did see some odd behavior when downloading robots.txt while a copy was
already present locally, but not really a loop.
It may be related to the fact that the 416 response is served with
Content-Type: text/html, so with --adjust-extension wget checks whether the
file is present as robots.txt.html.
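
To make the 416 case concrete, here is a minimal Python sketch of how a resuming client could decide what to do with the server's answer to a ranged request. It is not wget's actual logic; the function names and the returned action strings are made up for illustration. The only protocol fact it relies on is that a 416 response may carry a "Content-Range: bytes */N" header giving the full length of the resource.

```python
import re


def parse_unsatisfied_range(content_range):
    """Return N from a 'bytes */N' Content-Range value, or None if absent/unparseable."""
    if not content_range:
        return None
    match = re.fullmatch(r"\s*bytes\s+\*/(\d+)\s*", content_range)
    return int(match.group(1)) if match else None


def resume_decision(status, local_size, content_range=None):
    """Decide how to proceed after sending a 'Range: bytes=<local_size>-' request.

    Hypothetical action names:
      "append"   - server sent the missing part (206), append it to the local file
      "restart"  - fetch the whole file again, without a Range header
      "complete" - the local file already has the full length, nothing to do
      "fail"     - any other status, give up on this URL
    """
    if status == 206:
        return "append"
    if status == 200:
        # Server ignored the Range header and is sending the whole body.
        return "restart"
    if status == 416:
        total = parse_unsatisfied_range(content_range)
        if total is not None and total == local_size:
            return "complete"
        # Unknown or different length: retry once without Range
        # instead of repeating the same ranged request.
        return "restart"
    return "fail"
```

The point of the sketch is that neither branch for 416 repeats the same ranged request, so the client cannot loop on it.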
While testing http://www.inh.cat/, I also found that wget was
following the URLs in:
> <script language="javascript">
> function show_language_bar() {
> // comentat:
> // bar = "<a class=notranslate
> href=\"http://www.inh.cat//?lang=ct&translate=0\"
> classe=\"minia_selected\"><img
> src='/kms/css/aqua/img/flags/f_cat.gif'></a> <a class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|de&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_de.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|en&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_en.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|es&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_es.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|fr&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_fr.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|it&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_it.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|pt&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_pt.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|ja&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_ja.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|ko&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_ko.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|nl&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_nl.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|pl&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_pl.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|ru&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_ru.gif'></a> <a
> class=notranslate
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|zh&u=http://www.inh.cat//?translate=1\"
> class=\"minia\"><img src='/kms/css/aqua/img/flags/f_zh.gif'></a>";
> // $('div#lanbar').html(bar);
> }
> if (window==window.top) { show_language_bar(); }
> </script>
which it shouldn't.
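
For comparison, here is a small Python sketch (not wget's parser) showing the difference between scanning the page for href= patterns and using a parser that treats the body of <script> as raw text. The sample page is a shortened, made-up stand-in for the markup quoted above, and the function names are hypothetical. Python's standard html.parser does not report tags that appear inside <script>, so the commented-out links are never collected.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags; <script> bodies are handled as plain data."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Links found by the HTML parser (tags inside <script> are not seen)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def naive_links(html):
    """Links found by a plain pattern scan, which also matches inside scripts."""
    return re.findall(r'href=\\?"([^"\\]+)', html)


# Shortened, hypothetical stand-in for the page quoted above.
SAMPLE = r'''<a href="/real.html">real</a>
<script language="javascript">
// bar = "<a class=notranslate href=\"http://translate.google.com/translate?u=x\">";
</script>
<a href="/other.html">other</a>'''

print(extract_links(SAMPLE))  # the two real links only
print(naive_links(SAMPLE))    # also includes the translate.google.com URL
```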