Standard GATK variant calling for both human and non-human species

usage: GATK_variant_call.py [-h] [-j JID] -f INPUT_LIST -s KNOWN_SNP
                            [-g GENOME] [--genome_fasta GENOME_FASTA]

optional arguments:
  -h, --help            show this help message and exit
  -j JID, --jid JID     enter a job ID, which is used to make a new directory.
                        Every output will be moved into this folder. (default:
                        GATK_variant_call_yli11_2020-01-09)
  -f INPUT_LIST, --input_list INPUT_LIST
                        tsv 2 columns, bam file and output name (default:
                        None)
  -s KNOWN_SNP, --known_SNP KNOWN_SNP
                        give a known snp vcf file. Known variants are used for
                        base recalibration, without known SNPs, this pipeline
                        will run GATK twice and use the first result as known
                        variants. If no known SNPs are available, use NA. This
                        is used for non-human species. (default: None)

Genome Info:
  -g GENOME, --genome GENOME
                        genome version: hg19, hg38, mm9, mm10, custom. By
                        default, specifying a genome version will
                        automatically update index file, black list, chrom
                        size and effectiveGenomeSize, unless a user explicitly
                        sets those options. (default: hg19)
  --genome_fasta GENOME_FASTA
                        genome fasta file (default:
                        /home/yli11/Data/Human/hg19/fasta/hg19.fa)

Summary

This germline variant calling pipeline is designed for non-human species but it also useful for human. Standard GATK pipeline includes BWA-MEM mapping, bam sort and remove duplicates, GATK base recalibration, GATK haplotype caller. Note that variant annotation is not included in this pipeline.

Ref: https://wiki.stjude.org/pages/viewpage.action?pageId=53318458

For species without known variants, we have to perform GATK variant calling at least twice (till convergence), suggested in https://software.broadinstitute.org/gatk/documentation/article?id=11081, because base recalibration is a mandatory step in GATK pipeline and base recalibration can detect systematic sequencing errors and thus increase the accuracy of variant calling.

In practice, two step variant calling is probably enough. See: https://genome.cshlp.org/content/25/12/1921.full.html, “Given the lack of existing polymorphism data, a first round of conservative variant calling provided input to the quality score recalibration procedure. SNPs were then called using the GATK UnifiedGenotyper algorithm on all samples simultaneously.” UnifiedGenotyper is later on replaced by HaplotypeCaller in GATK 3.3 or later.

Input

bam.list

A tsv file with 2 columns: path to bam file and a sample ID. Each line will produce a vcf file named as [sampleID].BSQR.vcf.

1659315_cell1_treat1_ATAC_S13_L001.rmdup.uq.bam 1659315_cell1_treat1
1659316_cell1_control_ATAC_S14_L001.rmdup.uq.bam        1659316_cell1_control
1659317_cell2_treat1_ATAC_S15_L001.rmdup.uq.bam 1659317_cell2_treat1
1659318_cell2_control_ATAC_S16_L001.rmdup.uq.bam        1659318_cell2_control
1659315_cell1_treat1_ATAC_S13_L002.rmdup.uq.bam 1659315_cell1_treat1_2
1659316_cell1_control_ATAC_S14_L002.rmdup.uq.bam        1659316_cell1_control_2
1659317_cell2_treat1_ATAC_S15_L002.rmdup.uq.bam 1659317_cell2_treat1_2
1659318_cell2_control_ATAC_S16_L002.rmdup.uq.bam        1659318_cell2_control_2

Usage

For hg19, use

    hpcf_interactive

module load python/2.7.13

general_variant_calling.py -f bam.list

For all other species, you have to provide a SNP vcf file. If such file is not available, use -s NA. Not recommanded for hg39, mm9, mm10, because such vcf files are available, just not currently provided by this pipeline.

The following example is for green monkey.

GATK_variant_call.py -f bam.list -g custom --genome_fasta chlsab2_myxoma.fa -s NA

Output

Once the job is finished, you will be notified by email.

*.BSQR.vcf contains the called variants, which is inside the {{jid}}/[sample_id] folder

Latest updates

For hg38 only (because SNPdb is hg38), starting from PE fastq files. Output is vcf files and multiQC html file.

run_lsf.py --guess_input

run_lsf.py -f fastq.tsv -g hg38 -p GATK_variant_call_bwa

Pipeline script

=cut GATK 1

inputFile=input_list

ncore=1
mem=16000


module load picard/2.9.4 gatk/3.5
module load samtools/1.9

genome_fasta={{genome_fasta}}
bam=${COL1}
output=${COL2}
mkdir {{jid}}/$output

cp $genome_fasta {{jid}}/$output/



ref={{jid}}/$output/$(basename $genome_fasta)


two_step_GATK={{no_known_SNPs}}

known_SNP={{known_SNP}}




java -jar /hpcf/apps/picard/install/2.9.4/picard.jar CreateSequenceDictionary R= $ref  O= ${ref%.*}.dict

samtools faidx $ref


java -jar /hpcf/apps/picard/install/2.9.4/picard.jar AddOrReplaceReadGroups I= $bam O= $output.bam RGID=test RGLB=test RGPL=illumina RGPU=Hart_Center RGSM=test

java -jar /hpcf/apps/picard/install/2.9.4/picard.jar ReorderSam I= $output.bam O= $output.sorted.bam R= $ref CREATE_INDEX=TRUE


if [ "$two_step_GATK" = true ] ; then

        echo -e "["$(date)"]\tStart two step GATK.."

    java -jar /hpcf/apps/gatk/install/3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R $ref -I $output.sorted.bam -o $output.known.vcf

        known_SNP=$output.known.vcf

        echo -e "["$(date)"]\tFiltering Variants.."
        java -jar /hpcf/apps/gatk/install/3.5/GenomeAnalysisTK.jar -T VariantFiltration -R $ref -V $known_SNP -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o $output.filtered.known.vcf

        known_SNP=$output.filtered.known.vcf

fi



#Perform BQSR
echo -e "["$(date)"]\tPerforming BQSR.."
java -jar /hpcf/apps/gatk/install/3.5/GenomeAnalysisTK.jar -T BaseRecalibrator -I $output.sorted.bam -R $ref -knownSites $known_SNP -o $output.Base.Recal.table

#Print recalibrated reads
echo -e "["$(date)"]\tPrinting recalibrated reads.."
java -jar /hpcf/apps/gatk/install/3.5/GenomeAnalysisTK.jar -T PrintReads -R $ref -I $output.sorted.bam -BQSR $output.Base.Recal.table -o $output.BQSR.bam


#Run HaplotypeCaller
echo -e "["$(date)"]\tRunning HaplotypeCaller.."
java -jar /hpcf/apps/gatk/install/3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R $ref -I $output.BQSR.bam  -o $output.BQSR.vcf

mv $output*.vcf {{jid}}/$output/
mv $output*.bam {{jid}}/$output/
mv $output*.table {{jid}}/$output/

Report bug

$ HemTools report_bug

Comments

code @ github.