PRO-seq analysis pipeline

usage: run_proseq.py [-h] [-i INPUT_FILE] [-j JOB_NAME] [-g {hg38,hg19,mm10,custom}] [--genome_index GENOME_INDEX] [-s GENOME_SIZE_FILE] [-a1 READ1_ADAPTER] [-a2 READ2_ADAPTER] [-l MIN_LENGTH] [--umi_pattern {regex,string}]
                     [-u1 READ1_UMI] [-u2 READ2_UMI] [-n CPU]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        tab separated file with fastq file and name
  -j JOB_NAME, --job_name JOB_NAME
                        this will be used to create output directory
  -g {hg38,hg19,mm10,custom}, --genome {hg38,hg19,mm10,custom}
                        different genome versions available. Default = hg19. incase of custom genome provide index path in --genome_index
  --genome_index GENOME_INDEX
                        genome index for custom genome file
  -s GENOME_SIZE_FILE, --genome_size_file GENOME_SIZE_FILE
                        genome size file for custom genome
  -a1 READ1_ADAPTER, --read1_adapter READ1_ADAPTER
                        adapter sequence for read1
  -a2 READ2_ADAPTER, --read2_adapter READ2_ADAPTER
                        adapter sequence for read2
  -l MIN_LENGTH, --min_length MIN_LENGTH
                        minimum length after trimming adapters. Default value is set to 17
  --umi_pattern {regex,string}
                        default value is regex for umi extraction
  -u1 READ1_UMI, --read1_umi READ1_UMI
                        umi regex pattern for read1, this is for the umi at 3 prime end
  -u2 READ2_UMI, --read2_umi READ2_UMI
                        umi regex pattern for read2, this is for the umi at 3 prime end
  -n CPU, --cpu CPU     number of processors, default = 10

Summary

This pipeline takes FASTQ files as input and performs adapter trimming and UMI processing. The processed reads are then mapped to the reference genome, and the resulting BAM file is split by strand to generate forward and reverse bigwig signal tracks.

Input

fastq.tsv

Use --guess_input to automatically generate this.

Banana_R1.fastq.gz      Banana_R2.fastq.gz      Banana_lovers
Orange_R1.fastq.gz      Orange_R2.fastq.gz      Orange_lovers
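
If --guess_input is not available, a table in the same three-column layout can be put together by hand. The sketch below is only an illustration and is not how the pipeline itself builds the file: it assumes paired files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, and it uses the shared prefix as the sample name (the names in the third column, such as Banana_lovers, are free to choose). The helper names build_fastq_table and format_fastq_table are hypothetical.

```python
import re


def build_fastq_table(fastq_names):
    """Pair *_R1/*_R2 FASTQ names and return (read1, read2, sample_name) rows."""
    names = set(fastq_names)
    rows = []
    for r1 in sorted(names):
        match = re.fullmatch(r"(.+)_R1\.fastq\.gz", r1)
        if match is None:
            continue  # not a read1 file under the assumed naming scheme
        r2 = f"{match.group(1)}_R2.fastq.gz"
        if r2 in names:
            # Assumption: the shared prefix doubles as the sample name.
            rows.append((r1, r2, match.group(1)))
    return rows


def format_fastq_table(rows):
    """Render rows as tab-separated text, one sample per line, no header."""
    return "".join("\t".join(row) + "\n" for row in rows)


rows = build_fastq_table(
    [
        "Banana_R1.fastq.gz",
        "Banana_R2.fastq.gz",
        "Orange_R1.fastq.gz",
        "Orange_R2.fastq.gz",
    ]
)
table_text = format_fastq_table(rows)
# To save the table: write table_text to fastq.tsv, then edit the names as needed.
```

Files without a matching mate are skipped, so it is worth checking that every expected sample appears in the result.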

Output

The output folder is named after the value passed to -j, followed by the date on which the program is run. For example, if the user passes ‘-j result’ and the run date is 03/31/2025, the output folder ‘result_03_31_2025’ is created. Inside this folder, the following folders are generated:

  • adapter_umi_processed: This contains the FASTQ files after adapter and UMI processing

  • bam: Contains the bam files produced by mapping the FASTQ files to the reference genome. Bam files with "dedup" in the name have had duplicates removed based on UMI.

  • bw: Contains normalized bigwig files based on the bam files

  • split_bam_bw: Contains the split forward and reverse bam and bigwig files.

  • fastqc: Contains individual fastqc files for the adapter and UMI processed FASTQ files.

  • Logs: Contains program logs for the different processes and QC stat files.

  • multiqc_data: Output folder generated by the MultiQC program
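
To locate the results folder from a script, the name can be rebuilt from the job name and run date. This is a minimal sketch that assumes the format is always <job_name>_<MM>_<DD>_<YYYY> with zero-padded month and day, which is inferred only from the single example above; output_folder_name is a hypothetical helper, not part of the pipeline.

```python
from datetime import date


def output_folder_name(job_name, run_date):
    """Return '<job_name>_<MM>_<DD>_<YYYY>', the format inferred from the example."""
    return f"{job_name}_{run_date.strftime('%m_%d_%Y')}"


# Reproduces the documented example: -j result, run on 03/31/2025.
folder = output_folder_name("result", date(2025, 3, 31))
```

For a run started today, date.today() can be passed as run_date.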

Usage

The only required parameter is the input file. The simplest way to run the pipeline is with the following command:

run_proseq.py -i input_fastq.tsv

The above command runs the program with default values for all other parameters, which are as follows:

-j = proseq_result (default output folder)
-g = hg19 (reference genome version)
-l = 17 (minimum length of reads acceptable after trimming adapters)
-a1 = TGGAATTCTCGGGTGCCAAGG (read1 adapter sequence)
-a2 = GATCGTCGGACTGTAGAACTCT (read2 adapter sequence)
--umi_pattern = regex (choose ‘string’ if you know the exact sequence)
-u1 = .+(?P<umi_1>G.{6}$) (regex pattern for read1 UMI)
-u2 = .+(?P<umi_2>T.{6}$) (regex pattern for read2 UMI)
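
The default UMI patterns can be checked against a read before starting a run. The sketch below uses Python's re module purely to show what the two regexes capture: the last 7 bases of the read, provided the first of those 7 is G (read1) or T (read2). The tool the pipeline uses for the actual extraction is not named in this document, so this is an illustration of the patterns only; split_umi and the example reads are made up.

```python
import re

# Default patterns copied from the parameter list above.
READ1_UMI = r".+(?P<umi_1>G.{6}$)"
READ2_UMI = r".+(?P<umi_2>T.{6}$)"


def split_umi(read, pattern, group):
    """Return (read_without_umi, umi), or None when the pattern does not match."""
    match = re.match(pattern, read)
    if match is None:
        return None
    return read[: match.start(group)], match.group(group)


# Made-up read whose last 7 bases start with G, so the read1 pattern matches.
result1 = split_umi("ACGTACGTACGTTT" + "GACGTAC", READ1_UMI, "umi_1")

# Made-up read whose last 7 bases start with T, so the read2 pattern matches.
result2 = split_umi("CCCCAAAAGGGG" + "TTTACGA", READ2_UMI, "umi_2")

# Here the 7th base from the end is A, so the read1 pattern does not match.
no_match = split_umi("ACGTACGTACGTTT" + "AACGTAC", READ1_UMI, "umi_1")
```

Because each pattern starts with .+, at least one base must precede the UMI for a read to match.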

Currently, the program supports hg19, hg38 and mm10 as genome versions. For any other genome, use ‘-g custom’ and provide the path to the custom genome index with ‘--genome_index’ and the path to the custom genome size file with ‘-s’. Change the default values of the other parameters as needed for your experiment.
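
When runs are launched from a script, the custom-genome options can be assembled programmatically. The sketch below builds an argument list from the flags documented above and refuses a custom genome without both extra files. build_proseq_command is a hypothetical wrapper, the two paths are placeholders, and whether the pipeline itself enforces this check is an assumption.

```python
def build_proseq_command(input_file, genome="hg19", genome_index=None, genome_size_file=None):
    """Assemble an argument list for run_proseq.py from the documented flags."""
    command = ["run_proseq.py", "-i", input_file, "-g", genome]
    if genome == "custom":
        # The text above asks for both files when a custom genome is used.
        if genome_index is None or genome_size_file is None:
            raise ValueError("custom genome needs genome_index and genome_size_file")
        command += ["--genome_index", genome_index, "-s", genome_size_file]
    return command


command = build_proseq_command(
    "input_fastq.tsv",
    genome="custom",
    genome_index="/path/to/custom_index",  # placeholder path
    genome_size_file="/path/to/custom.chrom.sizes",  # placeholder path
)
# Once the environment is set up, the list can be passed to
# subprocess.run(command, check=True).
```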

export PATH=$PATH:"/home/yli11/HemTools/bin"
hpcf_interative.sh
module load conda3/202402
source activate /home/yli11/.conda/envs/jupyterlab_2024
run_proseq.py -i input_fastq.tsv
