eCLIP-seq data analysis pipeline¶
Summary¶
Pipeline adopted from https://www.encodeproject.org/documents/739ca190-8d43-4a68-90ce-1a0ddfffc6fd/@@download/attachment/eCLIP_analysisSOP_v2.2.pdf
Only work for hg19 right now, by 6/6/2022.
Pipeline has been tested using the ENCODE data from K562: blood_regulome/chenggrp/Projects/Siqi_data/CLIP/RBM9_Public/RBM9_K562
Input¶
fastq.tsv¶
Depending on single-end or paired-end, you might use run_lsf.py --guess_input
or run_lsf.py --guess_input --single
to automatically generate this file.
Banana_R1.fastq.gz Banana_R2.fastq.gz Banana_lovers
Orange_R1.fastq.gz Orange_R2.fastq.gz Orange_lovers
Usage¶
hpcf_interactive
module load python/2.7.13
# for paired-end data
run_lsf.py -f fastq.tsv -p eclip_pe
# for single-end data
run_lsf.py -f fastq.tsv -p eclip_se
Output¶
1. eCLIP QC report¶
Please check the QC in the html file.
2. strand specific signals¶
See the bw files
3. called peaks¶
See the bed files.
clipper
results looks more accurate than pureCLIP
, because pureCLIP
predicted binding sites are basically merged bed file from the predicted cross-link sites, and if we look at the signals, these binding sites do not align well with the binding sites. P.S., I don’t know why crosslink site is different than binding sites yet.
clipper
takes a week to finish for 100-200M bam file (UMI-deduplicated).
Example of clipper output:
# column names
chr, start, end
gene_ID|unique ID|read count (default read count cutoff is 3)
minimal pvalue (clipper has a p-value for each position)
strand, peak center start, peak center end
chr1 133723 133804 ENSG00000233750.3_0_4 0.006532397293615632 + 133761 133765
chr1 235687 235773 ENSG00000228463.4_0_3 0.021506732213281816 - 235722 235726
chr1 329595 329633 ENSG00000233653.3_0_3 0.023548354478527544 - 329611 329615
chr1 564499 564571 ENSG00000230021.2_0_29 3.452872201838815e-29 - 564545 56454
QC¶
eCLIP experiments should have 1 million unique fragments or have saturated peak detection in each biological replicate.
The following stats are obtained by re-analysis ENCODE data, not part of the data standards.
FASTqc: duplicates, 30%-40%, input control maybe up to 60%.
STAR align of rRNA removed reads, ~40% mapping rate. Input control maybe lower.