A simple solution to submit LSF jobs¶
Motivation¶
Suppose I’m going to run the following command(s) for 4 files:
DP_GP_cluster.py -i file1 -o output_prefix --plot --plot_types png -a 0.2 -s 10
Instead of running it 4 times manually, I can submit a job array to the LSF system.
Submitting a job array can be a little time-consuming every time. Moreover, your pipeline is likely a tree structure: job2 depends on job1, job3 may be parallelized, and job4 has to wait until every job3 task has finished. This script is designed to help in all of these cases.
Summary¶
This solution, based on some simple new syntax, helps you:
run the same script for a list of files
declare job dependencies
run jobs in parallel
Usage¶
1. Prepare input list
You should have a tab-separated file containing all the inputs, parameters, and outputs. I often put input files in the first column, output names in the second column, and parameters in the remaining columns.
The keywords for using these columns are ${COL1}, ${COL2}, ${COL3}, up to ${COL8}. You will see how to use them later.
An example is shown below:
ETS1_Jurkat.fastq ETS1_Jurkat 3
Jurkat_input.fastq Jurkat_input 5
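Each row of the input list corresponds to one job-array index, and each column becomes a ${COL…} value. A minimal sketch of this mapping, using the same head/tail/awk extraction the generated job file uses (the file name input.tsv is an assumption):

```shell
# Recreate the example input list above (tab-separated)
printf 'ETS1_Jurkat.fastq\tETS1_Jurkat\t3\n' >  input.tsv
printf 'Jurkat_input.fastq\tJurkat_input\t5\n' >> input.tsv

# For job-array index 2, extract the columns of row 2
id=2
COL1=$(head -n $id input.tsv | tail -n1 | awk '{print $1}')
COL2=$(head -n $id input.tsv | tail -n1 | awk '{print $2}')
COL3=$(head -n $id input.tsv | tail -n1 | awk '{print $3}')
echo "$COL1 $COL2 $COL3"   # Jurkat_input.fastq Jurkat_input 5
```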
2. Prepare job file
A normal LSF job file will look like the following:
#BSUB -P chip_seq_single
#BSUB -o chip_seq_single_yli11_2019-06-20.macs2.message_%J_%I.out -e chip_seq_single_yli11_2019-06-20.macs2.message_%J_%I.err
#BSUB -n 1
#BSUB -q standard
#BSUB -R "span[hosts=1] rusage[mem=30000]"
#BSUB -J "macs2[1-1]"
#BSUB -w "ended(81082927)"
module purge
module load macs2/2.1.1
module load bedtools/2.25.0
id=$LSB_JOBINDEX
COL1=`head -n $id peakcall.tsv|tail -n1|awk '{print $1}'`
COL2=`head -n $id peakcall.tsv|tail -n1|awk '{print $2}'`
COL3=`head -n $id peakcall.tsv|tail -n1|awk '{print $3}'`
LINE=`head -n $id peakcall.tsv|tail -n1`
macs2 callpeak -t ${COL1}.bam -c ${COL2}.bam -g hs --keep-dup all -n ${COL3} -B
YOUR_COMMANDS
With this simple solution, you don't need to remember or write all of that boilerplate. You only need to write down your dependencies (e.g., module load macs2/2.1.1), your input file (e.g., peakcall.tsv), and your commands. For example:
=cut run1 1
module load conda3
source activate py2
inputFile=input
ncore=1
mem=8000
DP_GP_cluster.py -i ${COL1} -o ${COL2} --plot --plot_types png -a 0.01 -s 1
In the above example, the keywords ${COL1} and ${COL2} specify the input and output names, which are the strings in the first and second columns. =cut is a keyword declaring a new job, followed by the job name, an index number, and an optional dependent job. inputFile=input names the input TSV file; you must include this line in every =cut-declared job that reads an input TSV file, and can omit it otherwise. ncore and mem specify how many CPUs and how much memory (in MB) you need; if you omit these lines, the defaults are 1 CPU and 4 GB of memory.
In summary, only the =cut job_name job_index_number line is required. All other lines are optional, but you need to include some commands; otherwise you are running an empty job.
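The stated defaults can be expressed with ordinary shell parameter expansion; this is only a sketch of the documented behaviour, not the tool's actual implementation:

```shell
# ncore/mem may already be set by the job block; otherwise fall back to the
# documented defaults (1 CPU, 4000 MB). ${var:-default} keeps an existing
# value and substitutes the default otherwise.
ncore=${ncore:-1}
mem=${mem:-4000}
echo "requesting $ncore CPU(s) and $mem MB"
```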
Hello World example¶
=cut H1 1
module load python
inputFile=input
ncore=1
mem=4000
echo "Hello 1"
=cut H2 1
inputFile=input
ncore=1
mem=4000
echo "Hello 2"
=cut H1.2 2 H1
inputFile=input
ncore=1
mem=4000
echo "Hello 1 * 2"
=cut H1.2.1 3 H1.2
inputFile=input
ncore=1
mem=4000
echo "Hello 1 * 2 * 1"
=cut email 4 all
module load python/2.7.13
cd {{jid}}
send_email_v1.py -m "{{jid}} is finished" -j {{jid}}
In the above example, H1 and H2 run in parallel because they have no parent jobs. H1.2 runs after H1, and H1.2.1 runs after H1.2. After all jobs have finished, an email is sent to the user. Here all is a keyword: based on the job indices, the current job waits until all previous jobs have finished.
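The same ordering constraints can be mimicked locally with background processes and wait; this is only a sketch to illustrate the dependency graph, not how LSF actually schedules the jobs:

```shell
# H1 and H2 have no parents, so they start in parallel
echo "Hello 1" & h1=$!
echo "Hello 2" & h2=$!

wait "$h1"               # H1.2 depends on H1
echo "Hello 1 * 2"
echo "Hello 1 * 2 * 1"   # H1.2.1 depends on H1.2 (already done at this point)

wait "$h2"               # the 'email' job waits for all previous jobs
echo "all jobs finished"
```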