parallel

Re: fault tolerance, retry task on different node, recovery orientation?


From: Ole Tange
Subject: Re: fault tolerance, retry task on different node, recovery orientation?
Date: Fri, 30 May 2014 00:17:04 +0200

As mentioned in the man page: Computers will only be reused if the
number of retries > number of computers (or more correctly:
sshlogins).

The order in which computers are tried follows the order in which the
values are extracted from a Perl hash using 'values'. I am still puzzled
why you believe this order will be important. I would believe it is much
more important to know that a computer on which the job has failed
will not be chosen again unless number of retries > number of sshlogins.
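
To make the rule above concrete, here is a minimal Python sketch of that
selection policy. It is not GNU parallel's actual code (which is Perl); the
function names pick_host and run_with_retries are hypothetical, the host
order is a plain list rather than Perl hash order, and it assumes that
--retries n means at most n attempts in total.

```python
from typing import List, Set


def pick_host(hosts: List[str], failed_on: Set[str]) -> str:
    """Pick a host for the next attempt (hypothetical helper).

    Hosts on which the job has already failed are skipped. Only when
    every host has failed once does the sketch start reusing hosts.
    """
    untried = [h for h in hosts if h not in failed_on]
    if untried:
        return untried[0]
    # Every host has failed this job: forget the failures and reuse.
    failed_on.clear()
    return hosts[0]


def run_with_retries(hosts: List[str], retries: int,
                     succeeds_on: Set[str]) -> List[str]:
    """Simulate one job; return the hosts tried, in order.

    succeeds_on is the (assumed) set of hosts where the job would work.
    """
    failed_on: Set[str] = set()
    attempts: List[str] = []
    for _ in range(retries):
        host = pick_host(hosts, failed_on)
        attempts.append(host)
        if host in succeeds_on:
            break  # job succeeded, no more retries needed
        failed_on.add(host)
    return attempts


if __name__ == "__main__":
    hosts = ["a", "b", "c"]
    # retries <= number of hosts: no host is tried twice
    print(run_with_retries(hosts, 3, set()))
    # retries > number of hosts: hosts are reused after all have failed
    print(run_with_retries(hosts, 5, set()))
```

With 3 hosts and 3 attempts, each host is tried at most once; a host only
shows up a second time when the number of attempts exceeds the number of
hosts.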


/Ole

On Thu, May 29, 2014 at 9:27 PM, Mitchell Wyle <mfw@wyle.org> wrote:
> Hi Ole,
>
> Thanks for the quick reply.  I meant: if I have 10 SSHLOGIN computers, how
> does parallel choose which one gets the next job, and which one gets a
> failed job that it is retrying?  That is, what selection method does it
> use when it does what the man page says: "retry it on another computer"?
> Round-robin is better than random (zookeeper) and better than "least
> loaded."
>
> Thanks again.
>
>
>
>
> On Thu, May 29, 2014 at 12:20 PM, Ole Tange <ole@tange.dk> wrote:
>>
>> On Thu, May 29, 2014 at 8:54 PM, Mitchell Wyle <mfw@wyle.org> wrote:
>> > Cool!  I shall try simple --retries and verify it works.  Does it
>> > "round robin" the tries?  Thanks!
>>
>> No. It does what it says in the man page:
>>
>>        --retries n
>>                 If a job fails, retry it on another computer. Do
>>                 this n times. If there are fewer than n computers
>>                 in --sshlogin GNU parallel will re-use the
>>                 computers. This is useful if some jobs fail for no
>>                 apparent reason (such as network failure).
>>
>> Why do you think it would do something other than what it says in the
>> man page?
>>
>>
>> /Ole
>
>


