Re: feature suggestion: --preserve-blocking-factor


From: Ole Tange
Subject: Re: feature suggestion: --preserve-blocking-factor
Date: Fri, 17 Feb 2017 22:00:53 +0100

On Thu, Feb 16, 2017 at 7:16 PM, Cook, Malcolm <MEC@stowers.org> wrote:

> When using the --spreadstdin option, it may be desirable to ensure that the 
> blocks "keep together" certain blocks of data.

Yes. We use --recend --recstart for that.

> For example the input may be sorted on column 3, and it may be the case that 
> all lines having the same value for column 3 must be processed together.

So a record is a run of lines with the same value in column 3.

Parsing a CSV file correctly is expensive (e.g. values containing
tabs, quotes, and newlines). I do not see that becoming part of GNU
Parallel.

So how do you deal with the column issue?

Let us use this as an example:

  paste <(seq 105) <(parallel yes {}'|head -n {#}' ::: {a..n}) <(seq 105 | shuf) > example

We want to group this by column 2, so all consecutive lines with the
same column 2 will be treated as a single record and not be split.
However, it will be OK to join multiple records.

We will make a small program to insert a record separator. This has to
be a string not found in the file. Here I have chosen '\0' but it
could be "p-O-P-p-y i'M poPpY", $(mmencode /dev/urandom|head), or
$(mktemp).

  cat example | perl -ape '$F[1] ne $old and print "\0"; $old = $F[1]'

Now it is trivially simple to tell GNU Parallel to keep the records
together and remove the record separator. Pipe the output of the
command above into:

  parallel --recend '\0' --rrs --pipe --block 200 wc
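
If you want to convince yourself that no group gets split, something
like the following sketch might help. split_keys is a hypothetical
helper name, not something GNU Parallel provides. It assumes GNU
Parallel and perl are installed and that the file is tab-separated with
the key in column 2. Each job prints the distinct keys in its block,
and uniq -d then reports any key that was seen in more than one block.

```shell
# Hypothetical helper: print every column-2 key that shows up in more
# than one block. Empty output means no group was split.
# $1 is a tab-separated file that is already grouped on column 2.
split_keys() {
  perl -ape '$F[1] ne $old and print "\0"; $old = $F[1]' "$1" |
    parallel --recend '\0' --rrs --pipe --block 200 'cut -f2 | sort -u' |
    sort | uniq -d
}
```

Running "split_keys example" should then print nothing.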

We might need something for --pipepart, so you can feed in potential
split positions, but you would still have to write the program that
finds the positions yourself.
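
As a rough idea of what such a program could look like: the sketch
below prints the byte offset at which each new column-2 group starts.
group_offsets is a hypothetical name, and how --pipepart would accept
these positions is not defined anywhere, so treat it purely as an
illustration. It assumes single-byte characters and "\n" line endings.

```shell
# Hypothetical helper: print the byte offset at which each new
# column-2 group starts. $1 is the input file.
group_offsets() {
  awk '{
    if (NR == 1 || $2 != old) print pos + 0   # a new group starts here
    old = $2
    pos += length($0) + 1                     # +1 for the newline
  }' "$1"
}
```

For a file with the four lines "1 a 9", "2 b 4", "3 b 7", "4 c 1"
(tab-separated, 6 bytes each including the newline) this prints 0, 6
and 18.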


/Ole


