STARR-seq analysis pipeline¶
usage: starr_seq.py [-h] [-j JID] -f FASTQ_TSV -d DESIGN_MATRIX [-bs BIN_SIZE]
[-ss STEP_SIZE] [-c FDR_CUTOFF] [-min MIN_FRAG]
[-max MAX_FRAG] [-q MAPQ] [-a ADDON_PARAMETERS]
[-g GENOME] [-i INDEX_FILE] [-s CHROM_SIZE] [-b BLACKLIST]
[--gc_cov GC_COV] [--map_cov MAP_COV]
[--conv_cov CONV_COV]
starr-seq pipeline
optional arguments:
-h, --help show this help message and exit
-j JID, --jid JID enter a job ID, which is used to make a new directory.
Every output will be moved into this folder. (default:
starr_seq_yli11_2021-10-27)
-f FASTQ_TSV, --fastq_tsv FASTQ_TSV
paired-end fastq tsv (default: None)
-d DESIGN_MATRIX, --design_matrix DESIGN_MATRIX
3 column tsv for design matrix (default: None)
-bs BIN_SIZE, --bin_size BIN_SIZE
bin_size (default: 500)
-ss STEP_SIZE, --step_size STEP_SIZE
step_size (default: 100)
-c FDR_CUTOFF, --FDR_cutoff FDR_CUTOFF
FDR_cutoff (default: 0.05)
-min MIN_FRAG, --min_frag MIN_FRAG
min_frag size (default: 200)
-max MAX_FRAG, --max_frag MAX_FRAG
max_frag size (default: 1000)
-q MAPQ, --MAPQ MAPQ MAPQ cutoff (default: 40)
-a ADDON_PARAMETERS, --addon_parameters ADDON_PARAMETERS
other parameters to add to starrPeaker (default: )
Genome Info:
-g GENOME, --genome GENOME
genome version: hg19, hg38, mm9, mm10. By default,
specifying a genome version will automatically update
index file, black list, chrom size and
effectiveGenomeSize, unless a user explicitly sets
those options. (default: hg19)
-i INDEX_FILE, --index_file INDEX_FILE
BWA index file (default: /home/yli11/Data/Human/hg19/index/bwa_16a_index/hg19.fa)
-s CHROM_SIZE, --chrom_size CHROM_SIZE
chrome size (default: /home/yli11/Data/Human/STARR_seq/hg19.chrom.sizes.simple.sorted)
-b BLACKLIST, --blacklist BLACKLIST
blacklist size (default: /home/yli11/Data/Human/hg19/annotations/hg19.blacklist.bed)
--gc_cov GC_COV gc_cov (default: /home/yli11/Data/Human/STARR_seq/STARRPeaker_cov_hg19_ucsc-gc-5bp.bw)
--map_cov MAP_COV map_cov (default: /home/yli11/Data/Human/STARR_seq/STARRPeaker_cov_hg19_gem-mappability-100mer.bw)
--conv_cov CONV_COV conv_cov (default: /home/yli11/Data/Human/STARR_seq/STARRPeaker_cov_hg19_linearfold-folding-energy-100bp.bw)
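To make `--bin_size` and `--step_size` concrete, the sketch below enumerates overlapping genomic bins. It assumes that bins are sliding windows of `bin_size` bp that advance by `step_size` bp; the help text does not state this explicitly. The function `sliding_bins` is hypothetical and is not part of starr_seq.py.

```python
def sliding_bins(chrom_length, bin_size=500, step_size=100):
    """Yield (start, end) half-open bins of bin_size bp, advancing by step_size bp.

    Assumption: bins that would run past the chromosome end are dropped.
    """
    start = 0
    while start + bin_size <= chrom_length:
        yield (start, start + bin_size)
        start += step_size


# Bins on a toy 1000 bp chromosome, using the pipeline's default sizes.
bins = list(sliding_bins(1000))
```

With the defaults (500 and 100), consecutive bins overlap by 400 bp.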
Summary¶
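STARR-seq (self-transcribing active regulatory region sequencing) is an assay that profiles regulatory regions, such as enhancers, that drive their own transcription. This pipeline calls peaks for these active regions by comparing RNA samples against DNA samples.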
Generating FASTQ files for STARR-seq from BCL files¶
1. Two DNA barcodes and one RNA barcode + RNA UMI¶
# log in to a compute node
hpcf_interactive.sh
module load bcl2fastq
bcl2fastq --no-lane-splitting -o starr_seq_fastq --sample-sheet /home/yli11/HemTools/share/misc/starr_seq_SampleSheet.csv --create-fastq-for-index-reads
cd starr_seq_fastq
module load conda3/202011
source activate /home/yli11/.conda/envs/captureC
starr_seq_demultiplex.py ATTACTCG TATAGCCT ATAGAGGC 1
# Arguments, in order: DNA barcode R1, DNA barcode R2, RNA barcode, mismatch cutoff. Replace the example barcode sequences with your own.
2. One DNA barcode and one RNA barcode + DNA UMI and RNA UMI¶
# log in to a compute node
hpcf_interactive.sh
module load bcl2fastq
bcl2fastq --no-lane-splitting -o starr_seq_fastq --sample-sheet /home/yli11/HemTools/share/misc/starr_seq_SampleSheet2.csv --create-fastq-for-index-reads
cd starr_seq_fastq
module load conda3/202011
source activate /home/yli11/.conda/envs/captureC
starr_seq_demultiplex2.py AGGCTATA AGGATAGG 1
# Arguments, in order: DNA barcode, RNA barcode, mismatch cutoff. Replace the example barcode sequences with your own.
Input¶
1. fastq.tsv¶
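A tab-separated file with three columns: read 1 FASTQ, read 2 FASTQ, and sample name. It is passed to the pipeline with -f. Run run_lsf.py --guess_input to generate it automatically.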
myDNA1_R1.fastq.gz myDNA1_R2.fastq.gz myDNA1
myDNA2_R1.fastq.gz myDNA2_R2.fastq.gz myDNA2
myRNA1_R1.fastq.gz myRNA1_R2.fastq.gz myRNA1
myRNA2_R1.fastq.gz myRNA2_R2.fastq.gz myRNA2
myRNA3_R1.fastq.gz myRNA3_R2.fastq.gz myRNA3
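Later steps refer to samples by the third column, so it can help to check this file before submitting a job. The sketch below is a minimal check, assuming the file is tab-separated with exactly three columns (R1 file, R2 file, sample name) and that sample names should be unique. `parse_fastq_tsv` is a hypothetical helper, not part of HemTools.

```python
def parse_fastq_tsv(lines):
    """Return {sample_name: (r1, r2)} from tab-separated fastq.tsv lines."""
    samples = {}
    for n, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        if len(fields) != 3:
            raise ValueError(
                "line {}: expected 3 tab-separated columns, got {}".format(n, len(fields)))
        r1, r2, name = fields
        if name in samples:
            # Assumption: each sample name may appear only once.
            raise ValueError("line {}: duplicate sample name {!r}".format(n, name))
        samples[name] = (r1, r2)
    return samples


example = [
    "myDNA1_R1.fastq.gz\tmyDNA1_R2.fastq.gz\tmyDNA1\n",
    "myRNA1_R1.fastq.gz\tmyRNA1_R2.fastq.gz\tmyRNA1\n",
]
samples = parse_fastq_tsv(example)
```

With a real file, pass an open file handle instead of a list, for example `parse_fastq_tsv(open("fastq.tsv"))`.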
2. peakcall.tsv¶
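A tab-separated file with three columns that define the comparisons: RNA sample name, DNA sample name, and comparison name. It is passed to the pipeline with -d. The RNA and DNA sample names must match names in the third column of fastq.tsv; the comparison name can be any label.
Each comparison is always RNA vs. DNA, so the RNA sample goes in the first column and the DNA sample in the second.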
myRNA1 myDNA1 myRNA1.vs.myDNA1
myRNA2 myDNA1 anyName
myRNA3 myDNA2 Who
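The name-matching rule can be checked before launching the pipeline. The sketch below assumes both files are tab-separated. `sample_names` and `check_design` are hypothetical helpers that report RNA or DNA sample names missing from the third column of fastq.tsv.

```python
def sample_names(fastq_tsv_lines):
    """Collect the third column (sample name) of fastq.tsv."""
    return {line.rstrip("\n").split("\t")[2].strip()
            for line in fastq_tsv_lines if line.strip()}


def check_design(design_lines, known_samples):
    """Return a list of (line_number, missing_name) for unknown RNA/DNA samples."""
    problems = []
    for n, line in enumerate(design_lines, 1):
        if not line.strip():
            continue  # skip blank lines
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        if len(fields) != 3:
            raise ValueError("line {}: expected 3 tab-separated columns".format(n))
        rna, dna, _comparison = fields
        for name in (rna, dna):
            if name not in known_samples:
                problems.append((n, name))
    return problems


fastq_tsv = [
    "myDNA1_R1.fastq.gz\tmyDNA1_R2.fastq.gz\tmyDNA1\n",
    "myRNA1_R1.fastq.gz\tmyRNA1_R2.fastq.gz\tmyRNA1\n",
]
design = [
    "myRNA1\tmyDNA1\tmyRNA1.vs.myDNA1\n",
    "myRNA2\tmyDNA1\tanyName\n",  # myRNA2 is deliberately absent from fastq_tsv
]
problems = check_design(design, sample_names(fastq_tsv))
```

An empty `problems` list means every RNA and DNA name in the design matrix was found in fastq.tsv.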
Usage¶
hpcf_interactive
module load python/2.7.13
run_lsf.py --guess_input # to generate fastq.tsv
starr_seq.py -f fastq.tsv -d peakcall.tsv -g hg19
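To override defaults, add the corresponding options from the help text to the command. The sketch below assembles such a command in Python, which can be convenient when scripting several runs. `build_starr_seq_command` is a hypothetical helper, and the option values shown are illustrative, not recommendations.

```python
def build_starr_seq_command(fastq_tsv, design_matrix, genome="hg19", **options):
    """Build the argument list for starr_seq.py.

    options maps long option names from the help text (e.g. bin_size,
    FDR_cutoff, MAPQ) to values; nothing is validated here.
    """
    cmd = ["starr_seq.py", "-f", fastq_tsv, "-d", design_matrix, "-g", genome]
    for name in sorted(options):
        cmd.extend(["--" + name, str(options[name])])
    return cmd


cmd = build_starr_seq_command("fastq.tsv", "peakcall.tsv", genome="hg38",
                              FDR_cutoff=0.01, MAPQ=30)
print(" ".join(cmd))
# To launch it from Python: import subprocess; subprocess.check_call(cmd)
```

Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.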