
Re: [Bug-wget] Infinite loop, and bad 'adjust extension' on pdf


From: Ángel González
Subject: Re: [Bug-wget] Infinite loop, and bad 'adjust extension' on pdf
Date: Sat, 24 Nov 2012 17:45:33 +0100
User-agent: Thunderbird

On 24/11/12 11:33, Lluís Batlle i Rossell wrote:
> Hello,
>
> I was downloading recursively. Specifically:
> wget --domains="data.inh.cat,data.jordibilbeny.com,www.inh.cat,www.jordibilbeny.com" \
>     -H --adjust-extension -k -r -c -l 3 http://www.jordibilbeny.com/
>
> And:
>
> 1) It went into an infinite loop while downloading
> http://www.inh.cat/robots.txt: the server kept returning HTTP 416 and wget
> kept retrying. I had to remove '-c' to make wget proceed.
>
> 2) All links to '.pdf' files had their targets changed to '.pdf.html' (that
> is the effect of -k and --adjust-extension, I guess). But the downloaded pdf
> files didn't have the ".html" suffix added to their names, so the local
> links failed.
>
> I used "sed -i" on the files I was interested in to rewrite the anchor targets.
>
> I'm running 1.13.4.
>
> Thank you,
> Lluís.
Hi Lluís,

I tested (2) with wget 1.14:
> wget --domains="data.inh.cat,data.jordibilbeny.com,www.inh.cat,www.jordibilbeny.com" \
>     -H --adjust-extension -k -r -c -l 1 \
>     http://www.jordibilbeny.com/altres-investigadors-id.php?Id=35
and it didn't change the .pdf links to .pdf.html.
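
For anyone stuck on 1.13.4, the "sed -i" workaround mentioned above could
look roughly like the sketch below. The function name fix_pdf_links is made
up for illustration, it assumes GNU sed and double-quoted href attributes,
and it has not been run against this particular site.

```shell
# Hypothetical helper (not part of wget): undo bogus ".pdf.html" link targets
# in already-mirrored pages. Assumes GNU sed, whose -i takes no suffix argument.
fix_pdf_links() {
    # Turn href="....pdf.html" back into href="....pdf" in each given file.
    # Only double-quoted href values are handled; other links are untouched.
    sed -i 's/\(href="[^"]*\.pdf\)\.html"/\1"/g' "$@"
}
```

It would be called on the converted pages, for example
fix_pdf_links www.jordibilbeny.com/*.html, since wget -r stores each host
under a directory named after it.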

I did see some strange behavior when downloading robots.txt while it was
already present, but not really a loop. It may be related to the fact that
the 416 is served with Content-Type: text/html, so with --adjust-extension
wget checks whether the file is present as robots.txt.html.
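
One way to check that theory is to look at the response headers that wget
prints with -S. The helper below is only a sketch: is_html_416 is a
hypothetical name, and it assumes the headers arrive on stdin in the
indented form that wget -S writes to stderr.

```shell
# Hypothetical helper: read response headers on stdin and succeed (exit 0)
# only if they contain both a 416 status line and a text/html Content-Type.
is_html_416() {
    awk '
        /^ *HTTP\/[0-9.]+ 416/                         { status = 1 }
        tolower($0) ~ /^ *content-type: *text\/html/   { ctype = 1 }
        END { exit !(status && ctype) }
    '
}
```

In a directory that already contains robots.txt, it could be used as:

    wget -S -c http://www.inh.cat/robots.txt 2>&1 | is_html_416 \
        && echo "416 served as text/html"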


While testing http://www.inh.cat/, I also found that wget was following the
URLs inside this script block:
> <script language="javascript">
> function show_language_bar() {
> // comentat:
> //    bar = "<a class=notranslate 
> href=\"http://www.inh.cat//?lang=ct&translate=0\"; 
> classe=\"minia_selected\"><img 
> src='/kms/css/aqua/img/flags/f_cat.gif'></a>&nbsp;<a class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|de&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_de.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|en&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_en.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|es&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_es.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|fr&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_fr.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|it&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_it.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|pt&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_pt.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|ja&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_ja.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|ko&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_ko.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|nl&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_nl.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|pl&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_pl.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|ru&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_ru.gif'></a>&nbsp;<a 
> class=notranslate 
> href=\"http://translate.google.com/translate?client=tmpg&langpair=ca|zh&u=http://www.inh.cat//?translate=1\";
>  class=\"minia\"><img src='/kms/css/aqua/img/flags/f_zh.gif'></a>";
> //    $('div#lanbar').html(bar);
> }
> if (window==window.top) { show_language_bar();  }
> </script>
which it shouldn't.



