
Re: parallel + blast + LSF


From: Giuseppe Aprea
Subject: Re: parallel + blast + LSF
Date: Wed, 22 Apr 2015 09:57:38 +0200

Hi Ole,

after investigating, it looks like the 15 missing sequence IDs were just filtered out by the e-value filter (the corresponding sequences were very short).

So far my commands for GNU parallel + BLAST under LSF are:
---------------------------------------------------------------------------------------------------------------
cat queryFile.fasta | parallel --no-notice -j LSFSLOTS --tmpdir /network_filesystem_partition/tmp --wait --slf serversFile --block 200k --recstart '>' --pipe blastp -evalue 1e-05 -outfmt 6 -db dbFile -query - -out resultFile_{#}
wait
---------------------------------------------------------------------------------------------------------------

- I specified a shared folder as --tmpdir since it looks like GNU parallel uses the local /tmp by default, and I am not sure whether that would be the master node's /tmp (which cannot be read by the other nodes).
- I used the final wait to be sure LSF releases the worker nodes only after every blastp instance has finished (do you think I should add --fg as well, or instead?)

Anyway, my main concern now is: how do I specify the maximum number of jobs per host when I am using multiple worker nodes?
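
One untested idea, based on the ncpu/sshlogin form that the parallel man page documents for --slf entries, would be to build the server file with per-host slot counts straight from the LSF hostfile:

# Untested sketch: collapse LSB_DJOB_HOSTFILE into one "ncpu/host" line
# per server (e.g. a host with 3 slots becomes "3/server_2").
sort ${LSB_DJOB_HOSTFILE} | uniq -c | awk '{print $1 "/" $2}' > servers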

best,

giuseppe



On Tue, Apr 21, 2015 at 8:00 PM, Giuseppe Aprea <giuseppe.aprea@gmail.com> wrote:
Hi Ole,

sorry for the late reply, but our cluster had to undergo maintenance.
I have some notes and questions, please.

Remote nodes. LSF just reserves slots on several remote servers and launches your command line on one of those servers, which we can call the master node. The LSF-reserved nodes are written to a file whose path is in the LSF environment variable LSB_DJOB_HOSTFILE. As an example, if LSF gives you 2 slots on server_1 and 3 slots on server_2, this file looks like:
server_1
server_1
server_2
server_2
server_2
LSF slots should correspond to server cores. That doesn't mean LSF is able to enforce the number of program instances; that must be done by the users, who may be given slots on the same server. Following LSF syntax, which is also similar to MPI hostfile syntax, I repeated the server names, but you are saying that's useless. My question is: (Q1) How do I specify the maximum number of jobs per host? Is it something like (following the previous example):
2/server_1
3/server_2

Empty result files. I guess I retrieved empty result files for different reasons; one was, as you noticed, the wrong replacement string ({%} instead of {#}), but I also had the wrong temporary directory (which must be on a shared filesystem in my case). Now I think I have reached a good point with the following script:

#!/bin/bash


#BSUB -J gnuParallel_blast_test         # Name of the job.
#BSUB -o %J.out                         # Appends std output to file %J.out. (%J is the Job ID)
#BSUB -e %J.err                         # Appends std error to file %J.err.
#BSUB -q cresco3_h144                   # Queue name.
#BSUB -n 70                            # Number of CPUs.

module load 4.8.3/ncbi/12.0.0
module load 4.8.3/parallel/20150122

SLOTS=$(wc -l < "${LSB_DJOB_HOSTFILE}")

# Build the --slf server list from the hosts LSF assigned to this job.
# Truncate first so a rerun does not append to a stale file.
: > servers
for i in $(sort "${LSB_DJOB_HOSTFILE}")
do
        echo "${i}" >> servers
done

cat /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/goodProteins.fasta | parallel --no-notice -vv -j ${SLOTS} --tmpdir /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/tmp --wait --slf servers --block 200k --recstart '>' --pipe blastp -evalue 1e-05 -outfmt 6 -db /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/goodProteins -query - -out /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/resultd_{#}
wait


The server file generated at runtime was:


(I had not read your message about repeated hostnames when I launched.)
This time the stderr did not seem too bad (just a few warnings):
parallel: Warning: ssh to cresco3x004.portici.enea.it only allows for 0 simultaneous logins.
You may raise this by changing /etc/ssh/sshd_config:MaxStartups and MaxSessions on cresco3x004.portici.enea.it.
Using only -1 connections to avoid race conditions.
parallel: Warning: ssh to cresco3x013.portici.enea.it only allows for 0 simultaneous logins.
You may raise this by changing /etc/ssh/sshd_config:MaxStartups and MaxSessions on cresco3x013.portici.enea.it.
Using only -1 connections to avoid race conditions.
parallel: Warning: ssh to cresco3x011.portici.enea.it only allows for 0 simultaneous logins.
You may raise this by changing /etc/ssh/sshd_config:MaxStartups and MaxSessions on cresco3x011.portici.enea.it.
Using only -1 connections to avoid race conditions.
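If raising those limits is the right fix, my understanding is that it would be a change along these lines in /etc/ssh/sshd_config on each worker node (illustrative values, which our admins would have to apply before reloading sshd):

# /etc/ssh/sshd_config on each worker node (illustrative values)
MaxStartups 100
MaxSessions 100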
(Q2) Do you have any comments on that?
I retrieved 348 result files (all of them non-empty) and concatenated them into a single file. The problem now is that for this test I ran an all-vs-all BLAST, so I expect at least 1 hit for each sequence in the input (each sequence vs. itself). Unfortunately, that is not the case:

awk '{print $1}' resultd_all | sort | uniq | wc -l
175610
egrep "^>" goodProteins.fasta |wc -l
175625

As you can see, I have 15 sequence IDs missing. I am still investigating, but I would like to ask you (Q3) whether those IDs could have been lost during data chunk creation (I used "--block 200k --recstart '>' --pipe") and, if so, how I could avoid that?
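
In the meantime, these are the checks I am sketching (untested; they assume each FASTA header is exactly the query ID that ends up in column 1 of the tabular output, as in the excerpt below):

# Which IDs are in the FASTA but missing from the concatenated results?
comm -13 \
    <(awk '{print $1}' resultd_all | sort -u) \
    <(grep '^>' goodProteins.fasta | sed 's/^>//' | sort -u)

# Do --pipe/--recstart chunks cover every record? The per-chunk record
# counts summed over all chunks should equal the number of headers.
cat goodProteins.fasta |
    parallel --no-notice --pipe --block 200k --recstart '>' grep -c '^>' |
    awk '{s += $1} END {print s}'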

This is the input file structure:

head -n 12 goodProteins.fasta 
>tom|Solyc00g005000.2.1
MFVPSIFLVFIMSCIISASVSYESKSTSGHAISFPTHEHLDVNQAIKEIIQPPETVHDNI
NNIVDDDDDNSRWKLKLLHRDKLPFSHFTDHPHSFQARMKRDLKRVHTLTNTTTNDNNKV
IKEEELGFGFGSEVISGMEQGSGEYFVRIGVGSPVRQQYMVIDAGSDIVWVQCQPCTHCY
HQSDPVFDPSLSASFTGVPCSSSLCNRIDNSGCHAGRCKYQVMYGDGSYTKGTMALETLT
FGRTVIRDVAIGCGHSNHGMFIGAAGGAFSYCLVSRGTNTGSTGSLEFGREVLPAGAAWV
PLIRNPRAPSFYYIGMLGLGVGGVRVPIPEDAFRLTEEGDGGVVMDTGTAVTRLPHEAYV
AFRDAFVAQTSSLPRAPAMSIFDTCYDLNGFVTVRVPTISFFLMGGPILTLPARNFLIPV
DTKGTFCFAFAPSPSRLSIIGNIQQEGIQISIDGANGFVGFGPNIC*
>tom|Solyc00g005020.1.1
MYVICKCICIDILIYMLLKVVEEKPQKDKKRRASDRGVLAQSHENVTNTEMAQERNVNER
LSRGRGITQHSQTSSEANCSGGVLGRGKRPAEHEDTSEGQTRPFKWPRMVGVGIYQAEDG
.....


Many thanks,

giuseppe



On Fri, Apr 17, 2015 at 6:28 PM, Ole Tange <ole@tange.dk> wrote:
On Wed, Apr 15, 2015 at 3:34 PM, Giuseppe Aprea
<giuseppe.aprea@gmail.com> wrote:

> I am trying to use GNU parallel v. 20150122 with blast for a very large
> sequences alignment. I am using Parallel on a cluster which uses LSF as
> queue system.

I have never run anything on a LSF system, so take my advice with 1
mmol of NaCl.

> "servers" is this file:
>
> /afs/enea.it/software/bin/blaunch.sh cresco3x013.portici.enea.it
> /afs/enea.it/software/bin/blaunch.sh cresco3x013.portici.enea.it
:

Duplicate lines in a --slf file are merged. It does no harm to have
the duplicate lines, but the duplicates are simply merged into 1.

> My problems are
:
> - the result files are empy and I can see the following messages:

It has been a while since I used blastp. Does it append to the file
given in '-out'? If not then you are overwriting it for every 24
sequences. Maybe you meant {#} instead?

>  sh -c 'dd bs=1 count=1 of=/tmp/pariINik.chr 2>/dev/null';  test ! -s
> "/tmp/pariINik.chr" && rm -f "/tmp/pariINik.chr" && exec true;  (cat
> /tmp/pariINik.chr; rm /tmp/pariINik.chr; cat - ) |
> (/afs/enea.it/software/bin/blaunch.sh cresco3x018.portici.enea.it exec perl
> -e
> \\\$ENV\\\{\\\"PARALLEL_PID\\\"\\\}=\\\"30669\\\"\\\;\\\$ENV\\\{\\\"PARALLEL_SEQ\\\"\\\}=\\\"687\\\"\\\;\\\$bashfunc\\\
> =\\\ \\\"\\\"\\\;@ARGV=\\\"blastp\\\ -evalue\\\ 1e-05\\\ -outfmt\\\ 6\\\
> -db\\\
> /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/goodProteins_first_0010000\\\
> -query\\\ -\\\ -out\\\
> /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/resultd_24\\\"\\\;\\\$SIG\\\{CHLD\\\}=sub\\\{\\\$done=1\\\;\\\}\\\;\\\$pid=fork\\\;unless\\\(\\\$pid\\\)\\\{setpgrp\\\;exec\\\$ENV\\\{SHELL\\\},\\\"-c\\\",\\\(\\\$bashfunc.\\\"@ARGV\\\"\\\)\\\;die\\\"exec:\\\$\\\!\\\\n\\\"\\\;\\\}do\\\{\\\$s=\\\$s\\\<1\\\?0.001+\\\$s\\\*1.03:\\\$s\\\;select\\\(undef,undef,undef,\\\$s\\\)\\\;\\\}until\\\(\\\$done\\\|\\\|getppid==1\\\)\\\;kill\\\(SIGHUP,-\\\$\\\{pid\\\}\\\)unless\\\$done\\\;wait\\\;exit\\\(\\\$\\\?\\\&127\\\?128+\\\(\\\$\\\?\\\&127\\\):1+\\\$\\\?\\\>\\\>8\\\););

-vv is really only useful for debugging: it is extremely hard to read,
even if you are the author of GNU Parallel.

I would highly recommend using '-v' first and only resorting to '-vv'
if '-v' shows what is expected.
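
For illustration, '-v' just prints each generated command line before its output (a minimal local example, not from the run above):

$ parallel -v echo ::: a b
echo a
a
echo b
b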


/Ole


