Parallel spider


From: Ole Tange
Subject: Parallel spider
Date: Wed, 27 Jul 2011 19:31:41 +0200

This small script will be an example in the next version. It is a
parallel web spider that walks a site breadth first.

Since it is going to be an example, I need to know whether anything in
the script needs explaining or whether it is clear what is going on.

Run like this:

PARALLEL=-j50 ./parallel-spider http://www.gnu.org/software/parallel

If you can change it into a parallel web mirroring tool (similar to
wget -m), that would be great. I gave up after trying for 30
minutes.


/Ole


  #!/bin/bash

  # E.g. http://www.gnu.org/software/parallel
  URL=$1
  URLLIST=$(mktemp urllist.XXXX)
  URLLIST2=$(mktemp urllist.XXXX)
  SEEN=$(mktemp seen.XXXX)

  # Spider to get the URLs
  echo "$URL" >"$URLLIST"
  cp "$URLLIST" "$SEEN"

  # Each pass spiders one level: list the links on every page in URLLIST,
  # strip #fragments and the link numbers, keep URLs containing the start
  # URL, drop those already in SEEN, and use the rest as the next URLLIST.
  while [ -s "$URLLIST" ] ; do
    cat "$URLLIST" |
      parallel lynx -listonly -image_links -dump {} \; echo Spidered: {} \>\&2 |
      perl -ne 's/#.*//; s/\s+\d+\.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
      grep -F "$URL" |
      grep -v -x -F -f "$SEEN" | tee -a "$SEEN" > "$URLLIST2"
    mv "$URLLIST2" "$URLLIST"
  done

  rm -f "$URLLIST" "$URLLIST2" "$SEEN"
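
To see what the loop does without touching the network, here is a rough
offline dry run. It is only a sketch: fake_lynx is a hypothetical
stand-in for "lynx -listonly -image_links -dump", the example.org site is
made up, the numbered-link format it prints is what I assume lynx emits,
and a plain while-read loop replaces parallel. The perl and grep stages
are the same as in the script.

```shell
#!/bin/bash
# Offline dry run of the spider loop (illustration only).
# fake_lynx is a hypothetical replacement for
# `lynx -listonly -image_links -dump URL`; the site is invented.
fake_lynx() {
  case "$1" in
    http://example.org/site)
      printf '   1. http://example.org/site/a.html\n'
      printf '   2. http://example.org/site/b.html#top\n' ;;
    http://example.org/site/a.html)
      printf '   1. http://example.org/site/b.html\n'
      printf '   2. http://elsewhere.example/\n' ;;
    http://example.org/site/b.html)
      printf '   1. http://example.org/site\n'
      printf '   2. http://example.org/site/c.html\n' ;;
  esac
}

spider() {
  local url=$1 dir urllist urllist2 seen level=0
  dir=$(mktemp -d) || return 1
  urllist=$dir/urllist urllist2=$dir/urllist2 seen=$dir/seen

  echo "$url" > "$urllist"
  cp "$urllist" "$seen"

  while [ -s "$urllist" ]; do
    level=$((level + 1))
    # Sequential stand-in for: cat "$urllist" | parallel lynx ...
    while read -r u; do fake_lynx "$u"; done < "$urllist" |
      perl -ne 's/#.*//; s/\s+\d+\.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
      grep -F "$url" |
      grep -v -x -F -f "$seen" | tee -a "$seen" > "$urllist2"
    # Show which URLs were discovered in this pass
    sed "s/^/level $level: /" "$urllist2"
    mv "$urllist2" "$urllist"
  done

  rm -rf "$dir"
}

spider http://example.org/site
```

Each pass prints only URLs that were not seen before, so a.html and
b.html show up at level 1 and c.html at level 2. The link back to the
start page and the off-site link are filtered out, and the loop stops
once a pass finds nothing new.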


