Re: Feature request: halt on threshold
From: Ole Tange
Subject: Re: Feature request: halt on threshold
Date: Sat, 19 Jul 2014 01:17:47 +0200
On Fri, Jul 18, 2014 at 11:22 PM, Ben Rusholme <rusholme@caltech.edu> wrote:
> There are currently three options to "--halt" - ignore (0), stop new jobs (1),
> or kill everything (2).
>
> I propose an additional option; to set the number of job failures before
> doing anything. This would then allow some tolerance of failure but would
> catch global problems.
>
> Consider this example - running a 1000 jobs each of around 1hr, where a
> random handful will fail due to unexpected bad data or other unforeseen bug,
> but the overwhelming majority will complete successfully.
>
> Setting --halt 0 all jobs will run, and I can check for the failures
> afterwards. Great! However, say I forget to create the results directory, so
> every "good" job runs for full time then fails right at the end…if I wasn’t
> monitoring I just wasted 1000hrs of processing time.
I do not understand this. With --results, GNU Parallel 20140622 creates
the directories before running the jobs, so if that fails for you, your
version is broken:
$ parallel --results /tmp/this/does/not/exist echo ::: 1
1
$ ls /tmp/this/does/not/exist/1/1/
stderr stdout
> Setting halt > 0 the job will stop at or just after the first problem. I have
> to check the logs, figure out and fix if possible, rerun with previous
> success excluded etc.
That is what --resume-failed is for: it reruns only the failed jobs,
using the job log written by --joblog.
> What I would like is to say set the number of tolerable failures to the
> number of workers. Then a serious bug would be caught after the first
> iteration, but the entire job would run and handle some measure of bad input
> data.
You need to give a reproducible example where you cannot just use
--halt 0 and then later --resume-failed when you have fixed the
bug/the input data.
> Does this make sense? Unfortunately it would require changing the current
> flags, either adding another or changing the current halt options.
One possibility for syntax is --halt 10% to allow 10% to fail.
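
To make the intended meaning concrete, here is a small stand-alone
sketch in plain shell. It is not GNU Parallel code: run_with_fail_pct is
a hypothetical helper that runs the jobs one at a time, and it assumes
that "allow 10% to fail" means "stop as soon as more than 10% of all
jobs have failed".

```shell
# Hypothetical helper, not part of GNU Parallel.
# Usage: run_with_fail_pct MAX_PCT COMMAND ARG...
# Runs COMMAND once per ARG, sequentially, and stops as soon as more than
# MAX_PCT percent of the total number of jobs have failed.
run_with_fail_pct() {
  local max_pct cmd total failed started arg
  max_pct=$1
  cmd=$2
  shift 2
  total=$#
  failed=0
  started=0
  for arg in "$@"; do
    started=$((started + 1))
    "$cmd" "$arg" || failed=$((failed + 1))
    # Integer form of: failed / total > max_pct / 100
    if [ $((failed * 100)) -gt $((max_pct * total)) ]; then
      echo "halt: $failed of $total jobs failed (limit ${max_pct}%), stopped after $started jobs" >&2
      return 1
    fi
  done
  echo "done: $failed of $total jobs failed" >&2
  return 0
}
```

With 10 jobs and a limit of 10%, a single failure is tolerated and the
second failure stops the run, so a global problem is caught early while
a few bad inputs are not.
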
/Ole