Variant calling on PacBio HiFi reads

Summary

This pipeline is from SVNN with some modifications: https://github.com/YichaoOU/PacBio_data_analysis

SVNN first maps reads with minimap2, because it is 3x faster than ngmlr, a more sensitive mapper. It then uses a machine learning classifier to extract reads that are possibly misaligned (reads that should have been split but were not split by minimap2) and realigns them with ngmlr. Next, it merges the mapping results and sorts the BAM file. Finally, it merges the variants called by sniffles and svim.

Input

fastq.tsv

Use run_lsf.py --guess_input --single to generate this file automatically. The first column is the read FASTQ file and the second column is the sample label. In the following example, I manually edited the second column.

demultiplex.bc1001_BAK8A_OA--bc1001_BAK8A_OA.hifi_reads.fastq.gz        aBC1
demultiplex.bc1002_BAK8A_OA--bc1002_BAK8A_OA.hifi_reads.fastq.gz        aBC2
demultiplex.bc1003_BAK8A_OA--bc1003_BAK8A_OA.hifi_reads.fastq.gz        aBC3
demultiplex.bc1008_BAK8A_OA--bc1008_BAK8A_OA.hifi_reads.fastq.gz        aBC8
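
Since the second column is often edited by hand, it can help to check the file before submitting the job. The following is a minimal sketch, not part of the pipeline: the function name read_fastq_tsv is hypothetical, and the rules it enforces (exactly two whitespace-separated columns, no repeated sample labels, no spaces in file names) are assumptions rather than documented requirements of run_lsf.py.

```python
def read_fastq_tsv(lines):
    """Return a list of (fastq_path, label) pairs from fastq.tsv lines.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumed rules (not taken from run_lsf.py): two columns per line,
    and each sample label appears only once.
    """
    pairs = []
    seen_labels = set()
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()  # split on tabs or runs of spaces
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 2 columns, found %d"
                % (line_number, len(fields))
            )
        fastq, label = fields
        if label in seen_labels:
            raise ValueError(
                "line %d: duplicate sample label %r" % (line_number, label)
            )
        seen_labels.add(label)
        pairs.append((fastq, label))
    return pairs


# Two rows taken from the example above.
example = [
    "demultiplex.bc1001_BAK8A_OA--bc1001_BAK8A_OA.hifi_reads.fastq.gz\taBC1\n",
    "demultiplex.bc1002_BAK8A_OA--bc1002_BAK8A_OA.hifi_reads.fastq.gz\taBC2\n",
]
pairs = read_fastq_tsv(example)
```

To check a real file, pass an open handle instead of the list, for example read_fastq_tsv(open("fastq.tsv")).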

Usage

hpcf_interactive

module load python/2.7.13

run_lsf.py -f fastq.tsv -p pacbio

Output

All files starting with temp_ are intermediate files generated by SVNN.

Called SV

final_results.vcf, all SVs in VCF format

SV_summary.csv, all SVs in a table; contains less information than the VCF

SV_summary.bedpe, all translocations in BEDPE format, for interaction visualization

SV.read.list, all reads supporting the called SVs

SV.bam, a subset BAM file for visualizing the SVs
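
As a quick sanity check on the called SVs, the records in final_results.vcf can be tallied by type. The sketch below is an illustration only: the function name count_sv_types is hypothetical, it assumes each record carries the standard SVTYPE key in the INFO column (the eighth VCF column), and the two sample records are made up for the example rather than taken from real pipeline output.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count VCF records by the SVTYPE value found in the INFO column."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # not a complete VCF record
        sv_type = "unknown"  # used when no SVTYPE key is present
        for entry in columns[7].split(";"):
            if entry.startswith("SVTYPE="):
                sv_type = entry[len("SVTYPE="):]
                break
        counts[sv_type] += 1
    return counts


# Made-up records for illustration; real records will have more fields.
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2000\n",
    "chr2\t5000\tsv2\tN\t<INS>\t.\tPASS\tSVLEN=300;SVTYPE=INS\n",
]
sv_counts = count_sv_types(toy_vcf)
```

For the real output, pass an open handle, for example count_sv_types(open("final_results.vcf")).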

code @ github.