Parallel spider
From: Ole Tange
Subject: Parallel spider
Date: Wed, 27 Jul 2011 19:31:41 +0200
This small script will be an example in the next version. It is a parallel
web spider (walking breadth first).
As it is going to be an example, I need to know whether anything in the
script needs explaining or whether it is clear what is going on.
Run like this:
PARALLEL=-j50 ./parallel-spider http://www.gnu.org/software/parallel
If you can change it into a parallel web mirroring tool (similar to
wget -m), that would be great. I gave up after trying for 30
minutes.
/Ole
#!/bin/bash
# E.g. http://www.gnu.org/software/parallel
URL=$1
# Work files: URLs to fetch now, URLs found for the next round, URLs already seen
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)
# Spider to get the URLs
echo "$URL" >"$URLLIST"
cp "$URLLIST" "$SEEN"
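# Each pass handles one level of the breadth-first walk:
#   parallel/lynx: fetch every URL in the list and dump the links on each page
#   perl:          strip #fragments and the "  1. " numbering, print each URL once
#   grep -F:       keep only URLs containing the start URL
#   grep -v/tee:   drop URLs already seen, remember the new ones for the next pass
while [ -s "$URLLIST" ] ; do
  cat "$URLLIST" |
    parallel lynx -listonly -image_links -dump {} \; echo Spidered: {} \>\&2 |
    perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
    grep -F "$URL" |
    grep -v -x -F -f "$SEEN" | tee -a "$SEEN" > "$URLLIST2"
  mv "$URLLIST2" "$URLLIST"
done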
rm -f $URLLIST $URLLIST2 $SEEN
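
To see the breadth-first loop without touching the network, here is a rough offline sketch of the same idea. It is not part of the script above: links_of is a hypothetical stand-in for the lynx call, the example.test pages are made up, a plain while-read loop replaces parallel, and awk stands in for the perl de-duplication. The grep and tee steps are the same as in the script.

```shell
#!/bin/bash
# Offline sketch of the spider's breadth-first loop.
# links_of is a hypothetical stand-in for "lynx -listonly -dump":
# it prints the links found on a page of a made-up site.
links_of() {
  case "$1" in
    http://example.test/)  echo http://example.test/a
                           echo http://example.test/b
                           echo http://elsewhere.test/ ;;
    http://example.test/a) echo http://example.test/b
                           echo http://example.test/c ;;
    http://example.test/b) echo http://example.test/
                           echo http://example.test/c ;;
  esac
}

spider() {
  local url="$1" workdir urllist urllist2 seen level=0
  workdir=$(mktemp -d) || return 1
  urllist=$workdir/urllist      # URLs to visit at the current level
  urllist2=$workdir/urllist2    # URLs discovered for the next level
  seen=$workdir/seen            # every URL queued so far

  echo "$url" > "$urllist"
  cp "$urllist" "$seen"
  while [ -s "$urllist" ]; do
    echo "Level $level: $(paste -s -d ' ' "$urllist")"
    level=$((level + 1))
    # Sequential here; the real script runs this step through parallel.
    while read -r page; do links_of "$page"; done < "$urllist" |
      awk '!dup[$0]++' |        # de-duplicate within this level
      grep -F "$url" |          # keep only URLs containing the start URL
      grep -v -x -F -f "$seen" | tee -a "$seen" > "$urllist2"
    mv "$urllist2" "$urllist"
  done
  rm -rf "$workdir"
}

spider http://example.test/
```

With this made-up site the sketch should print three levels: the start page, then a and b, then c. The link to elsewhere.test is filtered out by grep -F, and the links back to pages that were already queued are filtered out by the SEEN file.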