Re: Using parallel over several computers


From: Andy Loftus
Subject: Re: Using parallel over several computers
Date: Wed, 15 Mar 2017 14:48:46 +0000

Anders,
To my knowledge, parallel doesn't support any way to suspend a task. If you send parallel a TERM signal, it will not start new tasks and will wait for existing tasks to complete.

However, parallel can retry failed tasks (see the --retry-failed option).  So if your task is written so that it can be killed and restarted, you can approximate a "suspend" operation.  This depends entirely on the task being able to save its state and restart from where it left off.  You would first send TERM to parallel, then send an appropriate signal to the individual tasks telling them to save state and exit.  Tasks must exit in a way that tells parallel they failed (i.e., with a non-zero return code such as 1), so parallel will retry them when asked.
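
As a rough illustration of such a restartable task, here is a minimal bash sketch. Everything in it is an assumption made for illustration: the function name run_task, the checkpoint file, the counter standing in for real work, and TERM as the "save state and exit" signal. It ends with exit, so it is meant to be called in a subshell or as the body of a task script.

```shell
#!/usr/bin/env bash
# Hypothetical restartable task; none of these names come from parallel itself.
# Usage: ( run_task CHECKPOINT_FILE TOTAL_UNITS )
run_task() {
    local state_file=$1 total=$2 i=0

    # Resume from the checkpoint if a previous run left one behind.
    [ -f "$state_file" ] && i=$(cat "$state_file")

    # On TERM: save progress, then exit non-zero so the job counts as failed.
    # bash runs the trap once the current foreground command has returned.
    trap 'echo "$i" > "$state_file"; exit 1' TERM

    while [ "$i" -lt "$total" ]; do
        sleep 0.2          # stand-in for one unit of real work (fractional sleep assumed)
        i=$((i + 1))
    done

    rm -f "$state_file"    # finished cleanly: the checkpoint is no longer needed
    exit 0
}
```

Because the interrupted run exits with 1, it is recorded as a failure, and a later rerun picks up from the saved counter instead of starting over.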

Just looked up in the manpage: to kill parallel, send the TERM signal:
https://www.gnu.org/software/parallel/man.html#COMPLETE-RUNNING-JOBS-BUT-DO-NOT-START-NEW-JOBS

To ask parallel to kill tasks, see --halt and --termseq options.

Cheers,
--Andy

On Wed, Mar 15, 2017 at 5:12 AM Anders Lind <anders.lind@icm.uu.se> wrote:
Hi Andy and Douglas.

Thank you both for your suggestions.
I'll look into both ways. Andy, being able to send a kill signal will, I think, be key for me: some of these analyses take weeks to finish, so I need to be able to kill or suspend them when needed.

Cheers
//Anders


On 15/03/17 04:36, Andy Loftus wrote:
Anders,
Take a look at the --sqlmaster and --sqlworker options.

I use them to effectively create a job queue that any node can pull tasks from. I do this for long-running backups on a parallel filesystem (all nodes have read/write access to the data and the sql joblog file).

1. Create a list of "tasks" and send that to parallel invoked with the --sqlmaster option.  The sqlmaster option will create the joblog and exit.

2. On any machine that has access to the joblog file AND the data, run parallel with the --sqlworker option.  As new machines become available, you can start parallel on them in the same manner.  To stop work on a particular node, send a TERM signal to the parallel process on that node; it will stop spawning new jobs and exit after existing tasks have completed.
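
One way to make stopping a node less error-prone is to record the worker's PID when it is started. The helpers below are a sketch only: the function names and the pidfile path are invented for illustration, and the signal is TERM, the one named in the manpage section linked elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the pidfile location is an assumption.
PIDFILE="${PIDFILE:-/tmp/parallel-worker.pid}"

# Start a worker command in the background and remember its PID.
# Intended use: start_worker parallel --sqlworker "$DBURL"
start_worker() {
    "$@" &
    echo "$!" > "$PIDFILE"
}

# Send TERM to the recorded worker process.
stop_worker() {
    [ -f "$PIDFILE" ] || return 1
    kill -TERM "$(cat "$PIDFILE")"
}
```

This avoids signalling by process name, which could hit an unrelated parallel run started by someone else on the same node.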

In my case, each "task" is a bash script file, and I list them, one per line, in a tasklist file, such as:
/path/to/001.cmd
/path/to/002.cmd
...
/path/to/675.cmd
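
A tasklist like the one above could be generated rather than written by hand. The sketch below assumes the task scripts all sit in one directory and end in .cmd; make_tasklist and its two arguments are made-up names.

```shell
#!/usr/bin/env bash
# Write one script path per line, sorted, into the tasklist file.
# $1: directory holding the *.cmd scripts (assumed layout)
# $2: tasklist file to create
make_tasklist() {
    local cmd_dir=$1 tasklist=$2
    find "$cmd_dir" -maxdepth 1 -type f -name '*.cmd' | sort > "$tasklist"
}
```

For the paths shown here, the call would be: make_tasklist /path/to /path/to/tasklist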

The parallel sqlmaster cmdline is then:
parallel -a "/path/to/tasklist" --sqlmaster "$DBURL" bash

The DBURL is now a task queue as well as a joblog.

The parallel sqlworker cmdline is:
parallel --sqlworker "$DBURL"

Some advantages here are:
+ The original (sqlmaster) host does not have to control the parallel process and keep spawning new tasks on all the workers.
+ The worker nodes can each run at their own width (-j option).  This might allow you to run a low task count on the worker nodes without interfering with other users on the node.  You could even stop and restart with different -j values as needed throughout the day.
+ Worker nodes can be started simply by running parallel on each, and stopped by sending a TERM signal to the local parallel on that node.

NOTE: The sql* options have very recent changes to them so make sure you are using the most recent version of parallel.  

Hope this is helpful.

Cheers,
--Andy

On Tue, Mar 14, 2017 at 10:56 AM Douglas A. Augusto <daaugusto@gmail.com> wrote:
On 14/03/2017 at 10:54,
Anders Lind <anders.lind@icm.uu.se> wrote:

> I could perhaps set this up using the ssh functionality of parallel, but I
> would need to be able to on the fly stop some machines from running jobs,
> since the computers belong to co-workers who sometimes need their computers
> for their own work.

Hi Anders,

The following thread may interest you:

   Dynamically changing remote servers list
   https://lists.nongnu.org/archive/html/parallel/2014-08/msg00012.html

Based on that, at the time I made a shell script that keeps parallel's
sshloginfile updated by filtering out unreachable remote servers and also
allowing the user to edit (include and/or exclude remote servers) on-the-fly:

   https://github.com/daaugusto/gnuparallel

PS: It worked with older versions of GNU Parallel (I haven't tested it with
more recent ones yet), so your mileage may vary.

--
Douglas A. Augusto

-- 
Anders Lind
Molecular Evolution
Department of Cell and Molecular Biology
Biomedical Centre
Uppsala University
Box 596
751 23 Uppsala
Sweden
phone: +46 18 471 4058 
