Re: Splitting STDIN to parallel processes (map-reduce on blocks of data)
From: Jay Hacker
Subject: Re: Splitting STDIN to parallel processes (map-reduce on blocks of data)
Date: Wed, 12 Jan 2011 13:56:48 -0500
I think this is a great feature, and I would definitely use it.
> Will it make more sense to do max-block size? What should happen if a single
> record is bigger than max-block size?
I think you could do both, and just stipulate that at least one record
will always be read, regardless of max-block size.
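Something like this sketch of that policy, in Python, assuming
newline-delimited records (the generator and the 1 MB limit are
illustrative, not parallel's actual reader):

    import sys

    def blocks(records, max_block):
        # Group records into blocks of at most max_block bytes, but
        # always emit at least one record per block, even when a
        # single record is bigger than max_block on its own.
        block, size = [], 0
        for rec in records:
            if block and size + len(rec) > max_block:
                yield b"".join(block)
                block, size = [], 0
            block.append(rec)
            size += len(rec)
        if block:
            yield b"".join(block)

    for blk in blocks(sys.stdin.buffer, 1 << 20):   # 1 MB max block
        sys.stdout.buffer.write(blk)

Because a block is only flushed when it already holds at least one
record, an oversized record simply becomes a block by itself.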
> how do I check if a write to a pipe would block?
Spawn a thread to feed each jobslot, and just let it block. Give each
"feeder" thread a record queue, and have an input reader thread fill
them round-robin. In buffered mode, just keep filling them to allow
fast processes to go at their own pace. In unbuffered mode, limit the
max queue size to 1 and block the reader on a full queue. Overall
progress is limited by the slowest machine in the unbuffered case, but
you don't need any extra memory. Preferably keep the queues on the
remote machines with -S (using a wrapper script) to take advantage of
distributed memory.
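As a rough Python sketch of that scheme (one record per line; the
"wc -c" command, the jobslot count, and the function names are
placeholders, and the remote -S wrapper is left out):

    import sys
    import threading
    import queue
    import subprocess

    def feeder(q, proc):
        # Pull records off this jobslot's queue and write them to the
        # worker's stdin; if the pipe is full, the write just blocks.
        while True:
            rec = q.get()
            if rec is None:            # sentinel: input exhausted
                proc.stdin.close()
                return
            proc.stdin.write(rec)

    def run(cmd, jobslots, buffered):
        # maxsize=0 means an unbounded queue: fast workers run ahead.
        # maxsize=1 blocks the reader on a full queue, so progress is
        # paced by the slowest worker but no extra memory is used.
        maxsize = 0 if buffered else 1
        queues = [queue.Queue(maxsize) for _ in range(jobslots)]
        procs = [subprocess.Popen(cmd, shell=True, bufsize=0,
                                  stdin=subprocess.PIPE)
                 for _ in range(jobslots)]
        feeders = [threading.Thread(target=feeder, args=(q, p))
                   for q, p in zip(queues, procs)]
        for t in feeders:
            t.start()
        # The input reader fills the queues round-robin.
        for i, record in enumerate(sys.stdin.buffer):
            queues[i % jobslots].put(record)
        for q in queues:
            q.put(None)              # tell each feeder to finish
        for t in feeders:
            t.join()
        for p in procs:
            p.wait()

    run("wc -c", jobslots=4, buffered=False)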
Great idea.