From: Ole Tange
Subject: Splitting STDIN to parallel processes (map-reduce on blocks of data)
Date: Tue, 11 Jan 2011 16:32:32 +0100

You are hereby invited to help design a block-wise map-reduce feature
for GNU Parallel. These are my current thoughts. Feel free to give
your input, especially if you need something similar.

In my work I have often been in the situation where I have a file or a
stream of output that I need to process. GNU Parallel works fine when
the input is arguments for a command line, but it cannot split STDIN
among different processes.

The typical situation is a big file that I would like to split into
blocks, where each block can be processed independently in parallel.

It can be considered a map-reduce on data split into blocks (where the
reduce is simply 'cat').

The input data will either have a record separator or fixed sized
records. The record separator defaults to \n.

The user can specify how many records each block should contain
(default: 1), and GNU Parallel will read that many records into each
block.
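
As a sketch of what the invocation might look like under this proposal
(--recsep and --records-per-block are placeholder names, not existing
options):

# Hypothetical syntax: blocks of 1000 newline-separated records.
cat bigfile | parallel --recsep '\n' --records-per-block 1000 mycmd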

The user can alternatively specify the minimal size of a block in
bytes. GNU Parallel will then read records until the block reaches
that size. Would it make more sense to use a maximum block size
instead? And what should happen if a single record is bigger than the
maximum block size?
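
For comparison, GNU split already has max-block-size semantics: 'split
-C' puts as many complete lines as fit under the limit into each
chunk, and a single line longer than the limit is broken across
chunks. The whole feature can be crudely emulated with it today:

# Chunks of at most 10 MB, split at line boundaries; oversized lines are broken.
split -C 10M bigfile chunk.
ls chunk.* | parallel 'sort {} >{}.out'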

The user can alternatively specify that every N'th record is sent to
the same job slot (where N is the number of job slots). This one may
be hard to get right if multiple machines with different speeds are
involved. How do I check whether a write to a pipe would block? Maybe
I could just run 'tail -f' on the buffer; then slow machines will slow
down the overall run, but maybe that is OK.
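
A rough emulation with existing tools shows exactly where the blocking
problem appears: if one reader is slow, the write into its FIFO blocks
and stalls the distributor ('mycmd' and the 4 job slots are
placeholders):

# Distribute every 4th line to one of 4 job slots via FIFOs.
mkfifo slot0 slot1 slot2 slot3
for i in 0 1 2 3; do
  mycmd <slot$i >out$i &            # one job per slot
done
awk '{ print > ("slot" NR % 4) }' bigfile   # blocks when a slot's reader is slow
wait
rm slot0 slot1 slot2 slot3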

Each block is then buffered and fed to the command on STDIN. Buffering
is needed because the command may read slowly from STDIN, and it is
unacceptable to wait for the command to finish reading before starting
another job. Buffering can be done by writing the block to a temporary
file; the temp file will be removed afterwards. The buffering will be
done on the local machine.
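
The mechanics for a single block could look like this ('mycmd' and the
10 MB block size are placeholders):

# Buffer one block in a temp file so reading the next block is not delayed.
tmp=$(mktemp)
head -c 10M bigfile >"$tmp"         # one block's worth of input
( mycmd <"$tmp"; rm -f "$tmp" ) &   # the job reads at its own pace; buffer removed afterwards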

If the command reads from a file instead of STDIN, then use {} as the
filename and GNU Parallel will replace it with the name of the buffer
file. This requires the file to be completely written, so 'tail -f' is
not OK here.
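
With the proposed --block-size option that could look like:

# The command is given the buffer file's name instead of reading STDIN.
cat bigfile | parallel --block-size 10m 'wc -c {}'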

When a command finishes and frees up a job slot, a new command is
started with the next block.


= Optimizations =

If the input is a file and the command reads from STDIN, buffering is
unneeded: we can simply read from a given position in the file to the
end of the block. This requires scanning just enough of the file to
identify where each block starts.
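
Once the block starts are known, feeding a job is just a byte-range
read; the offsets below are illustrative and would be adjusted to
record boundaries ('mycmd' is a placeholder):

# Stream the second 10 MB block (starting at byte 10485761) straight to the job.
tail -c +10485761 bigfile | head -c 10485760 | mycmd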


= Example: sorting =

cat bigfile | parallel --block-size 10m "sort {} >{}.out; echo {}.out"
| parallel -X sort -m >bigfile.sorted
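
The first parallel sorts each block into its own file and prints the
file name; the second collects the names and merge-sorts the
already-sorted files.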

= Example: compressing =

Here we use the fact that two .gz files can be catted together.

cat bigfile | parallel --block-size 10m -k gzip > bigfile.gz
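
A quick check of the result (gzip decompresses concatenated members as
one stream):

zcat bigfile.gz | cmp - bigfile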


/Ole


