
Re: Splitting STDIN to parallel processes (map-reduce on blocks of data)


From: Ole Tange
Subject: Re: Splitting STDIN to parallel processes (map-reduce on blocks of data)
Date: Thu, 13 Jan 2011 17:11:29 +0100

On Wed, Jan 12, 2011 at 10:54 PM, Ole Tange <tange@gnu.org> wrote:

> A few considerations:
>
> You might borrow the semantics of the BSD `split` and `csplit`
> commands to determine block boundaries, allowing:
>
>        parallel --block 'split -l 100' # 100 line blocks
>
>        parallel --block 'csplit "%^>%" "/^>/" "{*}"' # one fasta
> record per block
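For reference, both commands already split this way when run on their own, so the proposed `--block` syntax would mostly be borrowing their boundary semantics. A rough standalone demonstration (file names are just examples, not part of any proposed interface):

```shell
# Split a file into pieces of 100 lines each (pieces named xaa, xab, ...)
split -l 100 input.txt

# Split a FASTA file at every header line (^>), one record per piece.
# GNU csplit: -z suppresses the empty leading piece; pieces are xx00, xx01, ...
csplit -z input.fasta '/^>/' '{*}'
```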

I have given this some more thought and I believe Malcolm has a point.

To help come up with a simple and useful solution, it would be
helpful to think about what kinds of files we might use this feature
on. We may not be able to make something that works for everything,
but we may find some common patterns. Please help me think of the
types of files that it would make sense to use. If you cannot express
the boundary as a simple begin or end record separator, then explain
in other words how to find a record.

* Normal ascii text files (End record separator = \n)
* FASTA files (Begin record separator = ^>)
* mbox files (Begin record separator = "^From ")
* fastq files (Begin record separator = ^@)
* MOVINS (can this be expressed as a regexp? EQD?)
* BAPLIE (can this be expressed as a regexp?)

* Binary files (Record = 1 byte)
* Fixed length records (Record = the given length)
* Fixed number of lines (End record separator = \n and record count = k*n)
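To illustrate the begin-record-separator idea on the FASTA case, the grouping itself is simple once you know the separator; here is a rough awk sketch that packs every n records into a numbered chunk file (the chunk naming and the assumption that the input starts with a record header are mine, purely for illustration):

```shell
# Group records that begin with "^>" into chunks of n records per file
# (chunk_0, chunk_1, ...). Assumes the first input line starts a record.
awk -v n=2 '/^>/ { if (r % n == 0) file = "chunk_" (r / n); r++ }
            { print > file }' input.fasta
```

Each chunk could then be fed to a separate job, which is essentially what a begin-separator-aware `--block` would have to do internally.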


/Ole


