Re: [Bug-wget] mirroring a Blogger blog without the comments


From: Darshit Shah
Subject: Re: [Bug-wget] mirroring a Blogger blog without the comments
Date: Fri, 25 Apr 2014 08:55:31 +0200

Hello (sorry, I didn't catch your name)!

Thanks for your kind words.

Regarding your issue: the reason Wget downloads the comment-spam pages
is that you have enabled the --span-hosts option. By default, Wget does
not download pages from a different domain; since you explicitly asked
it to, it does your bidding. I'm not sure whether you really need this
functionality or whether the option just slipped in there by mistake.
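
If it just slipped in, simply dropping --span-hosts should keep the crawl
on your own domain. Note, though, that page requisites hosted elsewhere
(for instance images that Blogger serves from a separate host) would then
be skipped as well:

  wget --tries=2 -e robots=off --timestamping --recursive --level=2 \
       --no-remove-listing --adjust-extension --convert-links \
       --page-requisites <MY_BLOG_URL>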

If you do need the --span-hosts switch, you can use the --accept-regex
or --reject-regex switch to define a regular expression that governs
which URLs are actually downloaded. I realize this might be a lot of
work if your blog has many outbound links that you do want to download,
but as of now we have nothing that provides finer-grained control. If
you did not find these options in your man page, you may be using a
slightly older version of Wget which does not have them.
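
For example, something along these lines should keep the spam hosts out
of the mirror; the pattern here is only a placeholder, so replace it with
whatever the spam URLs have in common:

  wget --tries=2 -e robots=off --span-hosts --timestamping \
       --recursive --level=2 --no-remove-listing --adjust-extension \
       --convert-links --page-requisites \
       --reject-regex '(spam-host1|spam-host2)\.com' <MY_BLOG_URL>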

I'm not sure I completely follow your suggestion based on
--next-urls-cmd, but what you are looking for can probably be
accomplished by using --spider mode and some parsing of the output.
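
As a rough sketch (the grep/awk step is just one way of pulling the URLs
out of the log; you would still filter the list by hand or with your own
script before the second run):

  # collect the URLs wget would visit, without saving anything
  wget --spider --recursive --level=2 --span-hosts <MY_BLOG_URL> 2>&1 \
    | grep '^--' | awk '{print $3}' | sort -u > urls.txt

  # remove the spam links from urls.txt, then fetch only what remains
  wget --input-file=urls.txt --page-requisites --adjust-extension \
       --convert-links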

On Thu, Apr 24, 2014 at 6:41 PM,  <address@hidden> wrote:
> Dear wget community,
>
> I'm playing with wget's mirroring functionality for the first time, and
> first off, so far it's fantastic. Thanks for the great work!
>
> I'm using a command like the following to create a (shallow) offline mirror
> of my Blogger blog:
>
> wget --tries=2 -e robots=off --span-hosts --timestamping
> --recursive --level=2 --no-remove-listing --adjust-extension
> --convert-links --page-requisites <MY_BLOG_URL>
>
>
> Unfortunately the blog has some comment spam, and wget is dutifully
> mirroring the spammers' pages which are linked to from the comments.
>
> It occurs to me that it could be useful to be able to tell wget to just
> ignore all comments sections of pages altogether. Is something like that
> possible? I looked through the documentation and just found
> --exclude-domains, which only helps when you know the domains you don't
> want in advance.
>
> I imagine an option like --exclude-crawling-within=<CSS_SELECTOR> could
> accomplish this, where wget would ignore any DOM subtrees matching the
> provided CSS selector (e.g. "#comments" in this case).
>
> Even more general would be something like --next-urls-cmd=<CMD>, where you
> could supply a command that accepts an HTTP response on stdin and writes to
> stdout the set of URLs that should be crawled based on it. wget could
> consult this command in --recursive mode to allow more customizable
> crawling behavior. This leaves any HTML parsing or regular expression
> matching entirely up to the user.
>
> Is there any interest in this? Is it feasible?
>
> Thanks, and thanks again for the great work on wget.



-- 
Thanking You,
Darshit Shah


