Variant calling on PacBio HiFi reads¶
Summary¶
This pipeline is from SVNN with some modifications: https://github.com/YichaoOU/PacBio_data_analysis
SVNN first maps reads with minimap2, because it is 3X faster than ngmlr, a more sensitive mapper. It then uses a machine learning classifier to extract possibly misaligned reads (reads that should have been split but were not split by minimap2) and remaps them with ngmlr. Next, it merges the mapping results and sorts the BAM file. Finally, it merges the variants called by sniffles and svim.
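The read-routing idea in this summary can be pictured with a small Python sketch. It only illustrates the data flow and is not SVNN's code: the `Read` record, the `looks_misaligned` rule (a stand-in for the machine learning classifier), the `max_clipped` threshold, and `fake_ngmlr` are all hypothetical.

```python
from collections import namedtuple

# Hypothetical, simplified alignment record; not SVNN's data model.
Read = namedtuple("Read", ["name", "mapper", "clipped_fraction"])


def looks_misaligned(read, max_clipped=0.2):
    """Stand-in for the classifier: flag reads with heavy soft-clipping.

    The real classifier and its features are not described here; this
    threshold rule is an assumption made only for illustration.
    """
    return read.clipped_fraction > max_clipped


def route_reads(minimap2_reads, realign):
    """Keep confident minimap2 alignments, realign the rest, then merge."""
    kept = [r for r in minimap2_reads if not looks_misaligned(r)]
    suspect = [r for r in minimap2_reads if looks_misaligned(r)]
    realigned = [realign(r) for r in suspect]
    # Sorted by read name to keep the toy example simple; a real BAM file
    # would be sorted by genomic coordinate.
    return sorted(kept + realigned, key=lambda r: r.name)


def fake_ngmlr(read):
    """Placeholder for remapping a read with ngmlr."""
    return read._replace(mapper="ngmlr", clipped_fraction=0.0)


reads = [Read("r2", "minimap2", 0.45), Read("r1", "minimap2", 0.01)]
merged = route_reads(reads, fake_ngmlr)
print([(r.name, r.mapper) for r in merged])
```

Only the suspect reads go through the slower mapper, which is where the speed advantage of starting with minimap2 comes from.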
Input¶
fastq.tsv
Use run_lsf.py --guess_input --single to generate this file automatically. The first column is the read FASTQ file and the second column is the sample label. In the following example, I manually edited the second column.
demultiplex.bc1001_BAK8A_OA--bc1001_BAK8A_OA.hifi_reads.fastq.gz aBC1
demultiplex.bc1002_BAK8A_OA--bc1002_BAK8A_OA.hifi_reads.fastq.gz aBC2
demultiplex.bc1003_BAK8A_OA--bc1003_BAK8A_OA.hifi_reads.fastq.gz aBC3
demultiplex.bc1008_BAK8A_OA--bc1008_BAK8A_OA.hifi_reads.fastq.gz aBC8
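Before submitting, it can help to sanity-check a hand-edited fastq.tsv. The following is a minimal sketch, not part of run_lsf.py: the function name `parse_fastq_tsv` is hypothetical, the tab delimiter is assumed from the .tsv extension, and the requirement that sample labels be unique is an assumption rather than a documented rule.

```python
def parse_fastq_tsv(text):
    """Return a list of (fastq, label) pairs from two-column text.

    Assumes tab-separated columns and unique sample labels; both are
    assumptions of this sketch, not documented pipeline requirements.
    """
    rows = []
    seen_labels = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 2 or not fields[0] or not fields[1]:
            raise ValueError(
                "line %d: expected 2 non-empty tab-separated columns" % lineno
            )
        fastq, label = fields
        if label in seen_labels:
            raise ValueError("line %d: duplicate label %r" % (lineno, label))
        seen_labels.add(label)
        rows.append((fastq, label))
    return rows


example = (
    "demultiplex.bc1001_BAK8A_OA--bc1001_BAK8A_OA.hifi_reads.fastq.gz\taBC1\n"
    "demultiplex.bc1002_BAK8A_OA--bc1002_BAK8A_OA.hifi_reads.fastq.gz\taBC2\n"
)
for fastq, label in parse_fastq_tsv(example):
    print("%s\t%s" % (label, fastq))
```

To check a real file, pass its contents, for example `parse_fastq_tsv(open("fastq.tsv").read())`.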
Usage¶
hpcf_interactive
module load python/2.7.13
run_lsf.py -f fastq.tsv -p pacbio
Output¶
All files starting with temp_ are intermediate files generated by SVNN.
Called SV¶
final_results.vcf, all SVs in VCF format
SV_summary.csv, all SVs in a table; contains less information than the VCF
SV_summary.bedpe, all translocations in BEDPE format, for interaction visualization
SV.read.list, all reads supporting the called SVs
SV.bam, a subset BAM file for visualizing the SVs
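For a quick overview of the calls, the SV types in final_results.vcf can be tallied with plain Python. This is a minimal sketch that assumes each record carries the standard `SVTYPE` key in its INFO column; the function name `count_sv_types` is hypothetical and the example records are made up for illustration, not real pipeline output.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count records per SVTYPE value found in the VCF INFO column."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        svtype = "UNKNOWN"
        for entry in fields[7].split(";"):
            if entry.startswith("SVTYPE="):
                svtype = entry[len("SVTYPE="):]
                break
        counts[svtype] += 1
    return counts


# Made-up records, used only to demonstrate the function.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2000\n",
    "chr2\t5000\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=5000\n",
    "chr3\t700\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=900\n",
]
print(dict(count_sv_types(example_vcf)))
```

On a real run, the function can be given an open file handle, for example `count_sv_types(open("final_results.vcf"))`.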