Variant calling on PacBio HiFi reads

Summary

This pipeline is from SVNN with some modifications: https://github.com/YichaoOU/PacBio_data_analysis

SVNN first maps reads with minimap2, because it is 3X faster than ngmlr, a more sensitive mapper. It then uses a machine learning classifier to extract reads that are likely misaligned (reads that should have been split but were not split by minimap2) and passes them to ngmlr for re-mapping. Next, it merges the mapping results and sorts the BAM file. Finally, it merges the variants called by sniffles and svim.
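The routing step described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not SVNN's code: `route_reads`, `merge_alignments`, the `soft_clipped` field, and the threshold-based stand-in for the classifier are all hypothetical, and the read records are made-up dictionaries rather than real alignments.

```python
def route_reads(reads, looks_misaligned):
    """Split reads into those kept from the first mapper and those to re-map.

    `looks_misaligned` is a hypothetical stand-in for the trained classifier.
    """
    keep, remap = [], []
    for read in reads:
        if looks_misaligned(read):
            remap.append(read)
        else:
            keep.append(read)
    return keep, remap


def merge_alignments(kept, remapped):
    """Combine both sets and sort by chromosome name, then position.

    This is a simplification of coordinate sorting; in the real pipeline the
    flagged reads would be re-aligned by the second mapper before merging.
    """
    return sorted(kept + remapped, key=lambda r: (r["chrom"], r["pos"]))


# Made-up records; a large soft clip is used here as a toy misalignment signal.
reads = [
    {"name": "r1", "chrom": "chr2", "pos": 50, "soft_clipped": 10},
    {"name": "r2", "chrom": "chr1", "pos": 900, "soft_clipped": 4000},
    {"name": "r3", "chrom": "chr1", "pos": 100, "soft_clipped": 0},
]
keep, remap = route_reads(reads, lambda r: r["soft_clipped"] > 1000)
merged = merge_alignments(keep, remap)
print([r["name"] for r in merged])
```

Only the flagged subset goes through the slower mapper, which is where the time saving of the two-mapper design comes from.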

Input

fastq.tsv

Use run_lsf.py --guess_input --single to generate this file automatically. The first column is the read FASTQ file and the second column is the sample label. In the following example, the second column was edited manually.

demultiplex.bc1001_BAK8A_OA--bc1001_BAK8A_OA.hifi_reads.fastq.gz        aBC1
demultiplex.bc1002_BAK8A_OA--bc1002_BAK8A_OA.hifi_reads.fastq.gz        aBC2
demultiplex.bc1003_BAK8A_OA--bc1003_BAK8A_OA.hifi_reads.fastq.gz        aBC3
demultiplex.bc1008_BAK8A_OA--bc1008_BAK8A_OA.hifi_reads.fastq.gz        aBC8
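Before submitting, it can help to check that a manually edited fastq.tsv still has exactly two columns per line. The following is a minimal sketch, assuming the columns are separated by tabs or spaces; `read_fastq_table` is a hypothetical helper and is not part of run_lsf.py.

```python
import io


def read_fastq_table(handle):
    """Return a list of (fastq_file, sample_label) pairs from a two-column table."""
    pairs = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()  # splits on tabs or runs of spaces
        if len(fields) != 2:
            raise ValueError(
                "line {}: expected 2 columns, got {}".format(line_number, len(fields))
            )
        pairs.append((fields[0], fields[1]))
    return pairs


# In-memory example using two rows from the table above.
example = io.StringIO(
    "demultiplex.bc1001_BAK8A_OA--bc1001_BAK8A_OA.hifi_reads.fastq.gz\taBC1\n"
    "demultiplex.bc1002_BAK8A_OA--bc1002_BAK8A_OA.hifi_reads.fastq.gz\taBC2\n"
)
for fastq_file, label in read_fastq_table(example):
    print(label, fastq_file)
```

To check a real file, pass an open file handle instead, for example `read_fastq_table(open("fastq.tsv"))`.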

Usage

hpcf_interactive

module load python/2.7.13

run_lsf.py -f fastq.tsv -p pacbio

Output

All files starting with temp_ are intermediate files generated by SVNN.

Called SV

final_results.vcf, all SVs in VCF format

SV_summary.csv, all SVs in a table; contains less information than the VCF

SV_summary.bedpe, all translocations in BEDPE format, for interaction visualization

SV.read.list, all reads supporting the called SVs

SV.bam, subset BAM file for visualizing the SVs
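For a quick overview of final_results.vcf, the SVs can be tallied by type. The sketch below assumes each record carries an SVTYPE entry in the INFO column, which is the usual convention for structural variant VCFs; `count_sv_types` is a hypothetical helper, and the example records are made up rather than taken from real pipeline output.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count records per SVTYPE value found in the INFO column (column 8)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            raise ValueError("record has fewer than 8 columns: " + line.strip())
        sv_type = "UNKNOWN"  # used when no SVTYPE entry is present
        for entry in fields[7].split(";"):
            if entry.startswith("SVTYPE="):
                sv_type = entry.split("=", 1)[1]
                break
        counts[sv_type] += 1
    return counts


# Made-up records for illustration only.
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=500\n",
    "chr2\t200\tsv2\tN\t<INS>\t.\tPASS\tSVLEN=50;SVTYPE=INS\n",
    "chr3\t300\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\n",
]
sv_counts = count_sv_types(example_lines)
print(dict(sv_counts))
```

For the real file, pass an open handle, for example `count_sv_types(open("final_results.vcf"))`.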

code @ github.