Re: Splitting STDIN to parallel processes (map-reduce on blocks of data)
From: Jay Hacker
Subject: Re: Splitting STDIN to parallel processes (map-reduce on blocks of data)
Date: Wed, 12 Jan 2011 13:56:48 -0500
I think this is a great feature, and I would definitely use it.
> Will it make more sense to do max-block size? What should happen if a single
> record is bigger than max-block size?
I think you could do both, and just stipulate that at least one record
will always be read, regardless of max-block size.
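Something like this sketch of that policy, in Python, assuming
newline-delimited records (the generator and the 1 MB limit are
illustrative, not parallel's actual reader):

    import sys

    def blocks(records, max_block):
        # Group records into blocks of at most max_block bytes, but
        # always emit at least one record per block, even when a
        # single record is bigger than max_block on its own.
        block, size = [], 0
        for rec in records:
            if block and size + len(rec) > max_block:
                yield b"".join(block)
                block, size = [], 0
            block.append(rec)
            size += len(rec)
        if block:
            yield b"".join(block)

    for blk in blocks(sys.stdin.buffer, 1 << 20):   # 1 MB max block
        sys.stdout.buffer.write(blk)

Because a block is only flushed when it already holds at least one
record, an oversized record simply becomes a block by itself.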
> how do I check if a write to a pipe would block?
Spawn a thread to feed each jobslot, and just let it block. Give each
"feeder" thread a record queue, and have an input reader thread fill
them round-robin. In buffered mode, just keep filling them to allow
fast processes to go at their own pace. In unbuffered mode, limit the
max queue size to 1 and block the reader on a full queue. Overall
progress is limited by the slowest machine in the unbuffered case, but
you don't need any extra memory. Preferably keep the queues on the
remote machines with -S (using a wrapper script) to take advantage of
distributed memory.
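As a rough Python sketch of that scheme (one record per line; the
"wc -c" command, the jobslot count, and the function names are
placeholders, and the remote -S wrapper is left out):

    import sys
    import threading
    import queue
    import subprocess

    def feeder(q, proc):
        # Pull records off this jobslot's queue and write them to the
        # worker's stdin; if the pipe is full, the write just blocks.
        while True:
            rec = q.get()
            if rec is None:            # sentinel: input exhausted
                proc.stdin.close()
                return
            proc.stdin.write(rec)

    def run(cmd, jobslots, buffered):
        # maxsize=0 means an unbounded queue: fast workers run ahead.
        # maxsize=1 blocks the reader on a full queue, so progress is
        # paced by the slowest worker but no extra memory is used.
        maxsize = 0 if buffered else 1
        queues = [queue.Queue(maxsize) for _ in range(jobslots)]
        procs = [subprocess.Popen(cmd, shell=True, bufsize=0,
                                  stdin=subprocess.PIPE)
                 for _ in range(jobslots)]
        feeders = [threading.Thread(target=feeder, args=(q, p))
                   for q, p in zip(queues, procs)]
        for t in feeders:
            t.start()
        # The input reader fills the queues round-robin.
        for i, record in enumerate(sys.stdin.buffer):
            queues[i % jobslots].put(record)
        for q in queues:
            q.put(None)              # tell each feeder to finish
        for t in feeders:
            t.join()
        for p in procs:
            p.wait()

    run("wc -c", jobslots=4, buffered=False)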
Great idea.