From: Derek Wilson
Subject: Re: GNU Parallel Bug Reports option --compress and --line-buffer causing lots of disk usage and hangs and stderr output
Date: Thu, 6 Feb 2014 15:19:50 -0800

Getting --compress to work with --line-buffer is not worth that much $$ to me, but a nice failure message would be useful.

Having --compress work with --pipe would definitely be nice.

If you want a good use case for --line-buffer, mine is streaming data through multiple stages of parallel processing while keeping memory and disk usage as low as possible. I had memory problems with the default of grouping output by job, because I run long-running jobs that produce a great deal of output. So, for something like this:

< input.txt parallel --pipe --line-buffer -N1000 -L1 -j128 ./stage1.pl 2> stage1.err | parallel --pipe --line-buffer -N10000 -L1 -j16 ./stage2.pl 2> stage2.err > output.txt

As soon as I start the job I want to start seeing output. My input is one JSON object per line, and so is my output.

Being able to use --compress would be really nice to keep disk usage down; my input and output are large enough that too much temporary storage can cause me trouble.



On Thu, Feb 6, 2014 at 4:03 AM, Ole Tange <address@hidden> wrote:
On Wed, Feb 5, 2014 at 7:55 PM, Derek Wilson <address@hidden> wrote:

> I'm using 20140122 because I noticed the addition of the --line-buffer
> option.

--line-buffer is somewhat of a hack. It does not work with --compress.

The reason for this is that the running program sends data directly to
lzop, which saves it to a file. Only when the program finishes do I
rewind the file and pass it as STDIN to lzop -d. This will of course
fail if the file is not complete.
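
In shell terms, the current scheme is roughly this (a simplified sketch
for a single job, not the actual Perl code; 'myprog' just stands for
the command being run):

  tmpfile=$(mktemp)
  myprog | lzop > "$tmpfile"   # compressed output goes straight to disk
  lzop -d < "$tmpfile"         # decompressed and printed only after myprog exits
  rm -f "$tmpfile"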

What needs to be done for --line-buffer --compress to work is something along these lines:

1. Create tmpfile: true > tmpfile
2. Start 'tail -f tmpfile | lzop -d' and get a file handle for its stdout.
3. Start program | lzop >> tmpfile.
4. Remove tmpfile (to avoid manual cleanup if GNU Parallel crashes).
5. Every now and then do a non-blocking read on all 'lzop -d' file handles and print if there is a full line.
6. When the program stops, somehow tell tail -f to send all remaining data (even incomplete lines) to lzop -d and exit without sending a SIGPIPE (not sure how to do that).
7. Read until EOF from the 'lzop -d' file handle.
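
A rough shell-level sketch of those steps, for a single job, might look
like the following. This is only my illustration, not code from GNU
Parallel: 'myprog' is a placeholder, tail's --pid option stands in for
the "tell tail -f to stop" step (whether it reliably flushes the last
partial data is exactly the open question above), and the non-blocking
reads across many jobs' file handles are not shown.

  tmpfile=$(mktemp)                 # create the compressed buffer file

  # The job is started first here so its PID can be handed to tail.
  myprog | lzop >> "$tmpfile" &     # the job writes compressed data
  jobpid=$!                         # PID of the compressing lzop

  # Follow the file from the first byte, decompressing as it grows;
  # tail terminates once the compressing lzop has died.
  tail -c +1 -f --pid="$jobpid" "$tmpfile" | lzop -d &
  readerpid=$!

  # Unlink the file so nothing is left behind if the wrapper crashes.
  # (Racy as written: a real implementation would make sure both
  # pipelines have opened the file before unlinking.)
  rm -f "$tmpfile"

  wait "$jobpid"                    # wait for the job to finish
  wait "$readerpid"                 # read until EOF from lzop -d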

It is probably doable, but definitely a lot of work. And unless
someone convinces me that this is a killer feature, I will only
implement it on a consultancy basis (150 USD/hr).

> Thanks for that by the way.

For the manual I need good examples of what people use this option
for. I have yet to find the killer usage, so it would be great if you
could describe what you use it for.

> Then i saw --compress and I know this
> will be very useful, but I'm running into issues using it.
>
> I tend to use parallel to process streaming data and I'm doing something
> that looks like this:
>
> yes whatisthisnow | head -n100000000 | parallel --pipe -N100000 -L1 -j512
> cat 2> cattest.log > cattest.out
>
> if i add --compress:
>
> * my log file has a bunch of lines like: "lzop: <stdin>: not a lzop file"
> * i do get streaming output (but is it per job like the default?)
> * not all of the lzop processes exit and parallel hangs indefinitely

It seems you have found a bug. I can even reproduce it with:

  echo k| parallel --pipe  --compress  cat

So --pipe and --compress currently do not work together. I will assume
that it is a minor fix (but it could very well be a major debugging
task).

> if i add --line-buffer to that I don't get any output at all ever.

This is to be expected, and GNU Parallel should probably fail with a
decent error message if you do --line-buffer --compress.

> i do hope there is an easy way to resolve these issues and enable me to use
> lzop for temp files while still getting lines as soon as they are available.

That, however, will not work (but see above).

/Ole

