PRO-seq analysis pipeline
usage: run_proseq.py [-h] [-i INPUT_FILE] [-j JOB_NAME] [-g {hg38,hg19,mm10,custom}] [--genome_index GENOME_INDEX] [-s GENOME_SIZE_FILE] [-a1 READ1_ADAPTER] [-a2 READ2_ADAPTER] [-l MIN_LENGTH] [--umi_pattern {regex,string}]
[-u1 READ1_UMI] [-u2 READ2_UMI] [-n CPU]
optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input_file INPUT_FILE
tab separated file with fastq file and name
-j JOB_NAME, --job_name JOB_NAME
this will be used to create output directory
-g {hg38,hg19,mm10,custom}, --genome {hg38,hg19,mm10,custom}
different genome versions available. Default = hg19. incase of custom genome provide index path in --genome_index
--genome_index GENOME_INDEX
genome index for custom genome file
-s GENOME_SIZE_FILE, --genome_size_file GENOME_SIZE_FILE
genome size file for custom genome
-a1 READ1_ADAPTER, --read1_adapter READ1_ADAPTER
adapter sequence for read1
-a2 READ2_ADAPTER, --read2_adapter READ2_ADAPTER
adapter sequence for read2
-l MIN_LENGTH, --min_length MIN_LENGTH
minimum length after trimming adapters. Default value is set to 17
--umi_pattern {regex,string}
default value is regex for umi extraction
-u1 READ1_UMI, --read1_umi READ1_UMI
umi regex pattern for read1, this is for the umi at 3 prime end
-u2 READ2_UMI, --read2_umi READ2_UMI
umi regex pattern for read2, this is for the umi at 3 prime end
-n CPU, --cpu CPU number of processors, default = 10
Summary
This pipeline takes FASTQ files as input and performs adapter and UMI processing on them. The processed FASTQ files are then mapped to the reference genome, and the resulting BAM file is split to generate forward and reverse bigWig signal tracks.
Input
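fastq.tsv
A tab-separated file with one sample per line: the read 1 FASTQ file, the read 2 FASTQ file, and the sample name. Use --guess_input to automatically generate this.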
Banana_R1.fastq.gz	Banana_R2.fastq.gz	Banana_lovers
Orange_R1.fastq.gz	Orange_R2.fastq.gz	Orange_lovers
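If you prefer to build the input file yourself, the three-column layout shown above can be produced with a short script. The following is a minimal sketch, not part of the pipeline: it assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, and the helper names `pair_fastqs` and `to_tsv` are hypothetical. It uses the filename prefix as the sample name, which you may want to edit afterwards.

```python
import re

# Assumed naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
R1_PATTERN = re.compile(r"^(?P<sample>.+)_R1\.fastq\.gz$")


def pair_fastqs(filenames):
    """Return (read1, read2, sample) rows for every R1 file with a matching R2."""
    names = set(filenames)
    rows = []
    for name in sorted(names):
        match = R1_PATTERN.match(name)
        if match is None:
            continue  # not a read 1 file under the assumed convention
        sample = match.group("sample")
        read2 = f"{sample}_R2.fastq.gz"
        if read2 in names:  # skip R1 files that have no mate
            rows.append((name, read2, sample))
    return rows


def to_tsv(rows):
    """Format rows as tab-separated text, one sample per line."""
    return "".join("\t".join(row) + "\n" for row in rows)
```

To write the file for the FASTQ files in the current directory, pass `[p.name for p in pathlib.Path(".").glob("*.fastq.gz")]` to `pair_fastqs` and save the output of `to_tsv` as fastq.tsv.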
Output
The output folder is named after the value passed to the -j parameter, followed by the date the program is run. For example, if the user passes ‘-j result’ and the program is run on 03/31/2025, the output folder ‘result_03_31_2025’ is created. Inside this folder, the following folders are generated:
adapter_umi_processed: Contains the FASTQ files after adapter and UMI processing.
bam: Contains the BAM files produced by mapping the FASTQ files to the reference genome. The BAM file labeled dedup has had duplicates removed based on UMI.
bw: Contains normalized bigWig files generated from the BAM files.
split_bam_bw: Contains the split forward and reverse BAM and bigWig files.
fastqc: Contains individual FastQC files for the adapter and UMI processed FASTQ files.
Logs: Contains program logs for the different processes and QC stat files.
multiqc_data: Output folder generated by MultiQC.
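The naming rule and folder layout described above can be checked from a script after a run. This is a minimal sketch under the assumptions stated in this section (the date is formatted as month_day_year and the subfolders are named exactly as listed). The helper names `output_dir_name` and `missing_subdirs` are hypothetical and are not part of run_proseq.py.

```python
from datetime import date
from pathlib import Path

# Subfolder names as listed in this section
EXPECTED_SUBDIRS = [
    "adapter_umi_processed",
    "bam",
    "bw",
    "split_bam_bw",
    "fastqc",
    "Logs",
    "multiqc_data",
]


def output_dir_name(job_name, run_date):
    """Build the expected folder name: <job_name>_<MM>_<DD>_<YYYY>."""
    return f"{job_name}_{run_date.strftime('%m_%d_%Y')}"


def missing_subdirs(output_dir):
    """List the expected subfolders that do not exist under output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

For example, `output_dir_name("result", date(2025, 3, 31))` gives `result_03_31_2025`, and an empty list from `missing_subdirs` indicates that every expected subfolder is present.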
Usage
The only required parameter is the input file. You can run the pipeline with the following command:
run_proseq.py -i input_fastq.tsv
This runs the program with the default values for all other parameters, which are as follows:
-j = proseq_result (default output folder)
-g = hg19 (reference genome version)
-l = 17 (minimum length of reads acceptable after trimming adapters)
-a1 = TGGAATTCTCGGGTGCCAAGG (read1 adapter sequence)
-a2 = GATCGTCGGACTGTAGAACTCT (read2 adapter sequence)
--umi_pattern = regex (choose ‘string’ if you know the exact sequence)
-u1 = .+(?P<umi_1>G.{6}$) (regex pattern for read1 UMI)
-u2 = .+(?P<umi_2>T.{6}$) (regex pattern for read2 UMI)
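To see what the default UMI patterns capture, they can be tried on a read sequence with Python's `re` module. This is an illustrative sketch only: it does not reproduce the extraction tool the pipeline calls, the helper `split_umi` is hypothetical, and the example reads are made up. Each pattern anchors at the 3' end and captures a fixed base (G for read 1, T for read 2) followed by six arbitrary bases, so a read matches only if that base sits seven positions from its end.

```python
import re

# Default patterns from the list above
READ1_UMI = r".+(?P<umi_1>G.{6}$)"
READ2_UMI = r".+(?P<umi_2>T.{6}$)"


def split_umi(read, pattern, group):
    """Return (read_without_umi, umi), or None if the pattern does not match."""
    match = re.match(pattern, read)
    if match is None:
        return None
    # Everything before the named group is the read with the UMI removed
    return read[: match.start(group)], match.group(group)


# Made-up reads for illustration
print(split_umi("ACGTACGTACGTGAAAAAA", READ1_UMI, "umi_1"))
print(split_umi("ACGTACGTACGTCAAAAAA", READ1_UMI, "umi_1"))
```

The first read ends in G plus six bases, so the last seven bases are returned as the UMI. The second read has C in that position, so the function returns None. If your UMI has a different length or anchor base, adjust `{6}` or the leading letter and pass the new pattern with -u1 or -u2.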
Currently, the program supports hg19, hg38 and mm10 as genome versions. To use any other genome, pass ‘-g custom’, provide the path to the custom genome index using ‘--genome_index’, and provide the path to the custom genome size file using ‘-s’. For all other parameters, change the default values according to the needs of your experiment.
export PATH=$PATH:"/home/yli11/HemTools/bin"
hpcf_interative.sh
module load conda3/202402
source activate /home/yli11/.conda/envs/jupyterlab_2024
run_proseq.py -i input_fastq.tsv
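For a custom genome, the command needs the three extra options described above. The sketch below assembles such a command line from Python, which can be handy when launching several runs from a script. The function name `build_custom_genome_command` is hypothetical and the paths in the example are placeholders, not real locations. Only options shown in the help text are used.

```python
import shlex


def build_custom_genome_command(input_tsv, genome_index, genome_size_file,
                                job_name="proseq_result"):
    """Return a shell-quoted run_proseq.py command for a custom genome."""
    argv = [
        "run_proseq.py",
        "-i", input_tsv,
        "-j", job_name,
        "-g", "custom",                   # switch off the built-in genomes
        "--genome_index", genome_index,   # index for the custom genome
        "-s", genome_size_file,           # genome size file for the custom genome
    ]
    return shlex.join(argv)  # requires Python 3.8 or later


# Placeholder paths for illustration only
print(build_custom_genome_command(
    "input_fastq.tsv", "/path/to/index", "/path/to/genome.sizes"))
```

The printed string can be pasted into the shell session shown above in place of the default command.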