
GNU Parallel - line-based data distribution?


From: Jacek Wielemborek
Subject: GNU Parallel - line-based data distribution?
Date: Thu, 31 Oct 2013 16:05:35 +0100

Hi,

I recently needed to match OS fingerprints across the entire Internet
Census 2012 data collection (
http://internetcensus2012.bitbucket.org/paper.html ). To do that, I
found fingermatch - a tool that expects an Nmap fingerprint on its
standard input and (after my modifications) prints out a single line
of output. Then I needed a tool to merge columns from the input with
the output, so that if I entered:

<some ip> <timestamp> <fingerprint>

the <fingerprint> column went to fingermatch's standard input, and my
supervisor script read its output and printed something like this:

<some ip> <timestamp> <fingermatch output>

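To make the column merge concrete, here is a minimal Python sketch of the
one-process-per-line variant. It is not my actual supervisor script:
STAND_IN_CMD is a hypothetical placeholder that merely upper-cases its
input, standing in for the modified fingermatch, and the function name
merge_line is made up for illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for the matcher: upper-cases whatever arrives on stdin
# and prints one line. Replace with the real command line.
STAND_IN_CMD = [
    sys.executable, "-c",
    "import sys; print(sys.stdin.read().strip().upper())",
]


def merge_line(line, cmd=STAND_IN_CMD):
    """Return '<ip> <timestamp> <matcher output>' for one input line."""
    # Split off the first two columns; the rest of the line is the fingerprint.
    ip, timestamp, fingerprint = line.rstrip("\n").split(None, 2)
    # Start a fresh process, feed it the fingerprint, and collect its output.
    result = subprocess.run(
        cmd, input=fingerprint + "\n",
        capture_output=True, text=True, check=True,
    )
    return f"{ip} {timestamp} {result.stdout.strip()}"


if __name__ == "__main__":
    for input_line in sys.stdin:
        print(merge_line(input_line))
```

Starting one process per line is simple, but the startup cost is paid for
every record.
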
I quickly realized that the performance of fingermatch wasn't
satisfactory for me and I enhanced my supervisor script with
multithreading support. I wasn't aware of GNU parallel back then, so I
wrote some of its functionality myself.

Now that I have another massive task to perform (bulk rDNS querying of
some of the hosts), I wanted to use GNU parallel for it, but even
after reading the tool's man page (except for the examples, so far),
I find it hard to replicate the following pattern:

1. Read lots of newline-delimited input
2. Spawn N processes
3. Feed all the idle processes (i.e. those not in the middle of a
read operation) with the input, line by line
4. Perform a blocking read on the processes in order to read the
output line from them
5. Print the line in a synchronized manner, so that stdout from
the programs doesn't overlap (AFAIR, GNU parallel already does that)
6. Should any of the processes die, respawn it
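
For reference, the six steps above can be sketched in Python roughly as
follows. This is only an illustration of the pattern, not GNU parallel
functionality and not my original script: WORKER_CMD is a hypothetical
stand-in worker that answers each input line with exactly one output line,
and the names spawn, worker, run, max_retries and the "<failed>" marker
are made up for this example.

```python
import queue
import subprocess
import sys
import threading

# Hypothetical long-lived worker: one upper-cased output line per input line.
WORKER_CMD = [
    sys.executable, "-u", "-c",
    "import sys\nfor l in sys.stdin:\n    print(l.strip().upper(), flush=True)",
]


def spawn(cmd):
    # Line-buffered text pipes so each request/response is a single line.
    return subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            text=True, bufsize=1)


def worker(cmd, jobs, emit, max_retries=3):
    proc = spawn(cmd)                         # step 2: one long-lived process
    while True:
        line = jobs.get()                     # step 3: idle worker takes a line
        if line is None:                      # sentinel: no more input
            break
        for _ in range(max_retries):
            try:
                proc.stdin.write(line + "\n")
                proc.stdin.flush()
                out = proc.stdout.readline()  # step 4: blocking read
            except OSError:                   # includes BrokenPipeError
                out = ""
            if out:
                emit(line, out.rstrip("\n"))
                break
            proc.kill()                       # step 6: worker died, respawn
            proc.wait()
            proc = spawn(cmd)
        else:
            emit(line, "<failed>")            # gave up after max_retries
    proc.stdin.close()
    proc.wait()


def run(lines, n_workers=4, cmd=WORKER_CMD):
    jobs = queue.Queue(maxsize=n_workers * 2)
    results = []
    lock = threading.Lock()

    def emit(line, out):
        with lock:                            # step 5: one writer at a time
            print(f"{line}\t{out}")
            results.append(f"{line}\t{out}")

    threads = [threading.Thread(target=worker, args=(cmd, jobs, emit))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for line in lines:                        # step 1: newline-delimited input
        jobs.put(line)
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    run(l.rstrip("\n") for l in sys.stdin)
```

Output order is not preserved with more than one worker; each result line
carries its input so the two can be matched up afterwards.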

How much of the functionality can currently be achieved with GNU
Parallel? Please note the shift from many short-time worker processes
that terminate after one piece of input to few long-living processes
and the monitoring of their state (whether we're reading from them or
not).

Yours,
Jacek Wielemborek


