Quantify base editing efficiency/cas9 indel frequency for Hybrid-Capture assay¶
usage: crispressoWGS_BE.py [-h] [-j JID] -r REGION_FILE -f BAM_LIST
[--ref REF] [--alt ALT] [--center CENTER]
[--addon_parameters ADDON_PARAMETERS]
[--queue QUEUE] [-g GENOME]
[--genome_fasta GENOME_FASTA]
optional arguments:
-h, --help show this help message and exit
-j JID, --jid JID enter a job ID, which is used to make a new directory.
Every output will be moved into this folder. (default:
crispressoWGS_BE_yli11_2024-11-20)
-r REGION_FILE, --region_file REGION_FILE
gRNA_bed required (default: None)
-f BAM_LIST, --bam_list BAM_LIST
gRNA_bed required (default: None)
--ref REF reference base (default: A)
--alt ALT alternative base (default: G)
--center CENTER center (default: -10)
--addon_parameters ADDON_PARAMETERS
additional paramteeres, such as
--min_paired_end_reads_overlap for crispresso
(default: )
--queue QUEUE which queue to use (default: standard)
Genome Info:
-g GENOME, --genome GENOME
genome version: hg19, hg38, mm9, mm10. By default,
specifying a genome version will automatically update
index file, black list, chrom size and
effectiveGenomeSize, unless a user explicitly sets
those options. (default: hg19)
--genome_fasta GENOME_FASTA
genome fasta file (default:
/home/yli11/Data/Human/hg19/fasta/hg19.fa)
Summary¶
Code is designed to analyze Hybrid-Capture assay based on crispressoWGS, but should be generic.
Hybrid-Capture is a cost-effective assay comparing to rhAmp-seq if you have >500 regions. In our data, >90% region have at least 1000 UMI-deduplicated reads.
Updates for Cas9 samples, 2/24/2025¶
For cas9 samples, we still use the same pipeline, but we can ignore the ref to alt base conversion stats, and just focus on indel freqeuncy. Some key parameters need to be changed.
First, the input region file, instead of extending +/- 2bp. Users need to add more bases to the PAM side, for example, 2bp to 5end but 10bp added to the 3end.
Seconding, quantification window and window size, we will change to cas9 setting. See below.
crispressoWGS_BE.py -r gRNA.bed -f bam.tsv -g hg38 --ref A --alt G --w_size 1 --center "-3" --addon_parameters " --exclude_bp_from_right 0 --exclude_bp_from_left 0 --plot_window_size 12" --queue priority
Input¶
Target region bed file¶
5-col tsv, chr, start, end, name, gRNA spacer sequence
name, the 4-th col name should start with gRNA name, required to use my chi-square test script to do treatment-control comparison
chr17 82316684 82316712 gRNAName1_OT1_any_name GAGGTCAATGTCTACGGCTC
chrX 10739958 10739986 gRNAName1_OT2 GAGGTCAATGCCTACGCTTG
chr21 27717636 27717664 gRNAName2_OT3 GATGTCAATGTCTACAGCTT
gRNA spacer length can only be 18nt to 22nt. Adjust your gRNA if not in this range.
Note, crispressoWGS requires bam read to cover the entire given region, so your input region file better not too big. We recommand start and end to be 2bp flanking the spacer sequence. crispressoWGS uses samtools faidx hg38.fa chr1:55555559-55555561 to extract fasta, all start and end are 1-index.
Also crispressoWGS has a “bug” when extract fasta sequence, the end-1 position is used, instead of end.
So you need to make sure your start and end is also 1-index. So if your spacer is 3-22 (1-index), the start and end should be 1 and 25.
Output¶
For each sample, it outputs a tsv file containing the gRNA name, gRNA sequence, base editing efficiency for each position (only consider the 20bp gRNA length), and ‘indel_frequency’,’total_indel’,’Reads_total’. So totally 25 columns.
file name: $sample_id.gRNA_length.edit_eff.tsv contain indel infomation and edit% each position
file name: $sample_id.allele.edit.tsv contain allele edit information
file name: $sample_id.max_edit.tsv (pdf) max edit each position
If editing efficiency is -1, meaning no reads mapped to the amplicon sequence. or total reads < 50
Usage¶
Copy fastq files, target bed file in the working dir and run the following:
hpcf_interactive
PATH=/home/yli11/HemTools/bin:/hpcf/lsf/lsf_prod/10.1/linux3.10-glibc2.17-x86_64/etc:/hpcf/lsf/lsf_prod/10.1/linux3.10-glibc2.17-x86_64/bin:/usr/lpp/mmfs/bin:/usr/lpp/mmfs/lib:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/sbin:/cm/local/apps/environment-modules/3.2.10/bin:/opt/puppetlabs/bin
export PATH=$PATH:"/home/yli11/HemTools/bin"
module load python/2.7.13
run_lsf.py --guess_input
crispressoWGS_BE.py -r ABE.target.bed -f ABE.bam.list -g hg38 --ref A --alt G --addon_parameters " --exclude_bp_from_right 0 --exclude_bp_from_left 0 --plot_window_size 12 " --queue standard
Chi-square test¶
1. Prepare a design tsv¶
sample label, group label, replicates
each design.tsv file corresponse to the same control file, if you have two groups using two different control, then create two design.tsv, because Control is a keywork in design.tsv
==> D2.design.tsv <==
GM_VK484_S1 Control 1
GM_VK485_S2 P27_D2 1
GM_VK486_S3 P26_D2 1
GM_VK487_S4 Control 2
GM_VK488_S5 P27_D2 2
GM_VK489_S6 P26_D2 2
GM_VK490_S7 P27_D2 3
GM_VK491_S8 P26_D2 3
==> D3.design.tsv <==
GM_VK492_S9 Control 1
GM_VK493_S10 P27_D3 1
GM_VK494_S11 P26_D3 1
GM_VK495_S12 Control 2
GM_VK496_S13 P27_D3 2
GM_VK497_S14 P26_D3 2
GM_VK498_S15 P27_D3 3
GM_VK499_S16 P26_D3 3
2. Run the code¶
In the crispresso jid folder, where you have allele_edit.tsv and eff_edit.tsv
hpcf_interactive
module load conda3/202402
source activate /home/yli11/.conda/envs/jupyterlab_2024
hybrid_capture_chi_square.py design.tsv
3. Outputs¶
The stats tables are provided for ABE and Cas9 (indel%). You should select one of them based on your assay.
Outputs are Your_group_label.hybrid_capture.pivot_tabe.add_chi_square_stats.ABE.csv and Your_group_label.hybrid_capture.pivot_tabe.add_chi_square_stats.Cas9.csv