STARR-seq analysis pipeline¶
Summary¶
ATAC-STARR-seq is an assay to profile self-transcribed active regions (e.g., enhancer). This pipeline is adopted from Tyler J Hansen 2022, Genome Research paper with minor modifications.
This pipeline assumes both DNA and RNA have UMI.
This pipeline assumes the DNA group and RNA group have equal number of replicates.
(Previous Note) To generate fastq for starr-seq from bcl file¶
If you already have fastq files, please skip this step.
The UMI should be added to the read name as @oldReadName_UMI
.
1. One DNA barcodes and One RNA barcode + DNA UMI and RNA UMI¶
# login to compute node
hpcf_interactive.sh
module load bcl2fastq
bcl2fastq --no-lane-splitting -o starr_seq_fastq --sample-sheet /home/yli11/HemTools/share/misc/starr_seq_SampleSheet2.csv --create-fastq-for-index-reads
cd starr_seq_fastq
module load conda3/202011
source activate /home/yli11/.conda/envs/captureC
starr_seq_demultiplex2.py AGGCTATA AGGATAGG 1
# DNA barcode, RNA barcode, mismatch cutoff. Please replace barcode sequence with your barcode sequence.
Input¶
fastq.tsv¶
Use run_lsf.py --guess_input
to automatically generate this. Then you have to manually add a 4th column, with keywords of DNA
or RNA
.
myDNA1_R1.fastq.gz myDNA1_R2.fastq.gz myDNA1 DNA
myDNA2_R1.fastq.gz myDNA2_R2.fastq.gz myDNA2 DNA
myRNA1_R1.fastq.gz myRNA1_R2.fastq.gz myRNA1 RNA
myRNA2_R1.fastq.gz myRNA2_R2.fastq.gz myRNA2 RNA
Note
This pipeline assumes you have equal number of replicates for the DNA group and the RNA group.
Usage¶
hpcf_interactive
module load python/2.7.13
run_lsf.py --guess_input # to generate fastq.tsv
run_lsf.py -f fastq.tsv -p atac_starr_seq --MAPQ 0 --macs_genome hs -g hg19
For strand separated results, run:¶
run_lsf.py -f fastq.tsv -p atac_starr_seq_forward --MAPQ 0 --macs_genome hs -g hg19
run_lsf.py -f fastq.tsv -p atac_starr_seq_reverse --MAPQ 0 --macs_genome hs -g hg19
When using DNA reads without UMI, run:¶
run_lsf.py -f fastq.tsv -p atac_starr_seq_nodnaumi --MAPQ 0 --macs_genome hs -g hg19
Output¶
QC html¶
A QC report will be emailed to you when the job is done.
General mapping stats. Some important QC are shown below.
There are many blanks in the table because the QC program extracts text directly from file and usually each file only contain one stat.
insertion size/ fragment size.
library complexity
Differential analysis¶
This is the main result.
Take a look at the diff_analysis_plots.html
file inside the Differential_analysis
.
This html provides a volcano plot
(the file is volcano.pdf
), correlation plots (corr_R1-R2.pdf
), number and size disitribution of active/silent elements bins_donut.pdf
/ regulatory-regions_size_distribution.pdf
, and number of active/silent peaks peaks_donut.pdf
.
Active and Silent regions¶
Use silent_regions.bed
and active_regions.bed
. The program create 50bp overlapping bins to define region. So these regions are merged from those bins.
Use *peak.bed
for peak level annotation.
fold change bw files¶
DESEQ outputs can also be viewed as bw files. Please go to the bdg_to_bw*
folder inside the Differential_analysis
.
Active and Silent peaks¶
#get counts
ChrAcc='../Merged_peaks.rmblck.narrowPeak'
#print number of chracc peaks
system(paste0("wc -l ",ChrAcc),intern = TRUE)
#intersect peaks with active regions and print line number
system(paste0("bedtools intersect -u -a ",ChrAcc," -b active_regions.bed > active.peak.bed"),intern = TRUE)
#intersect peaks with silent regions and print line number
system(paste0("bedtools intersect -u -a ",ChrAcc," -b silent_regions.bed > silent.peak.bed"),intern = TRUE)
#intersect peaks with overlapping active and silent regions and print line number
system(paste0("bedtools intersect -u -a ",ChrAcc," -b active_regions.bed | bedtools intersect -u -a - -b silent_regions.bed > both.peak.bed"),intern = TRUE)
#report peaks that do not intersect active or silent regions and print line number
system(paste0("bedtools intersect -v -a ",ChrAcc," -b active_regions.bed | bedtools intersect -v -a - -b silent_regions.bed > neutral.peak.bed"),intern = TRUE)
active.peak.bed
is any peaks overlapped with active.
silent.peak.bed
is any peaks overlapped with silent.
active.peak.bed
may overlap with both.peak.bed
, silent.peak.bed
may overlap with both.peak.bed