This vignette intends to present a ‘lineage aware’ variant calling pipeline that uses GATK4 in conjunction with a custom script heranca
High false positive rates in SNP and Indel predictions from GATK4 make using the output directly difficult.
This vignette focuses on a heranca, a custom script that seeks to flag likely low-quality predictions originating from homopolymer / polynucleotide runs or those the mutations that break the infinite site assumption, occurring in at least three unrelated strains. Heranca relies on a tab-delimited file to specify the variant names, vcf file locations, and lineage of strains.
Heranca is described more deeply here
--sampleploidy
). The Gresham lab frequently uses haploid
yeast which will result in lower GATK4 performance if used with default
(diploid) settings.VariantFiltration -filter-name "MQ_filter" -filter "MQ < 40.0"
is too strict and should be relaxed to 30.git clone https://github.com/pspealman/heranca.git
cd heranca
python heranca.py
Using the defaults the demo will filter each of the VCFs. Edited VCFs and Filter summary logs are produced for each VCF and saved in the output_path if defined or the current working directory. Some details are printed to screen as excerpted below:
Edited vcf file saved to: /scratch/ps163/Anc_heranca.vcf
Log file saved to: /scratch/ps163/Anc_heranca.log
MQ_FILTER 1
PASS 6
QH_exceeds_polyn 8
SOR_FILTER 1
Edited vcf file saved to: /scratch/ps163/S1_heranca.vcf
Log file saved to: /scratch/ps163/S1_heranca.log
MQ_FILTER 1
PASS 5
QH_exceeds_polyn 16
QH_fails_ISA 2
QH_indel_fails_ISA 4
SOR_FILTER 3
Heranca uses a tab delimited metadata file to coordinate information. This metadata file consists of three columns. 1. Name column for the strain/sample must be unique in the file 2. File_path column with an absolute path to the VCF file 3. Lineage column containing the comma-delimited Names of strains it is descended from.
Note: a strain with it’s own name in it’s Lineage field will be treated as an ancestor, that is to say it will not be descended from any other strains.
The figure below shows a schematic relating 7 strains - one ancestor (Anc), five direct descendants (S1-S5), and one strain descended from previous strain (S6).
To represent these relationships in the metadata file we first specify
names, locations, and then lineages. For the ancestor (Anc) we put it’s
own name in the lineage field, denoting it’s ancestral status. Because
all other strains are descended from the ancestor (Anc) we also populate
those fields with that name as well. Finally, because S6 is also
descended from S5 we include it’s name in the lineage field as well,
using a comma as a separator.
Anc demo/Anc.vcf Anc
S1 demo/S1.vcf Anc
S2 demo/S2.vcf Anc
S3 demo/S3.vcf Anc
S4 demo/S4.vcf Anc
S5 demo/S5.vcf Anc
S6 demo/S6.vcf Anc, S5
Standard format will result in VCFs listed in the
<input_metadata_file.tab> being corrected and output to the
python heranca.py \
-i <input_metadata_file.tab> \
-fa <reference_genome_fasta_file> \
-o <output_path>
input_metadata_file:
Project_Carolino_new/combine_vcf_metadata_indels.txt
DGY1657 Project_Carolino_new/HG7CVAFXX_mini02_40_filtered_indels_2.vcf DGY1657
DGY1726 Project_Carolino_new/HKFYTBGX2_mini02_partii_61_filtered_indels_2.vcf DGY1657
reference_genome_fasta_file:
Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
output_path:
mkdir heranca
python heranca.py \
-i Project_Carolino_new/combine_vcf_metadata_indels.txt
-fa Ensembl/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa \
-o heranca/example_
This will produce two edited VCF files and two summary log files:
heranca/example_HG7CVAFXX_mini02_40_filtered_indels_2_heranca.vcf
heranca/example_HG7CVAFXX_mini02_40_filtered_indels_2_heranca.log
heranca/example_HKFYTBGX2_mini02_partii_61_filtered_indels_2_heranca.vcf
heranca/example_HKFYTBGX2_mini02_partii_61_filtered_indels_2_heranca.log
Output will be a modified VCF file inluding all variants present in the original, but with those candidates that failed the polynucelotide or ISA criteria flagged as described above.
XIII 680936 . T C 1282.06 PASS AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.37;SOR=1.143 GT:AD:DP:GQ:PL 1/1:0,30:30:90:1296,90,0
XIII 680940 . C T 1226.06 PASS AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.70;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,28:28:84:1240,84,0
XIII 809198 . A G 2052.06 PASS AC=2;AF=1.00;AN=2;DP=57;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.78;SOR=0.846 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2066,162,0
XIII 908174 . G T 567.64 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.372;DP=57;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=55.71;MQRankSum=-2.084;QD=11.58;ReadPosRankSum=-3.264;SOR=0.681 GT:AD:DP:GQ:PL 0/1:32,17:49:99:575,0,1292
remove_filtered = True
): Heranca
finds two of theses variants fail and omits these recordsXIII 680936 . T C 1282.06 PASS AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.37;SOR=1.143 GT:AD:DP:GQ:PL 1/1:0,30:30:90:1296,90,0
XIII 809198 . A G 2052.06 PASS AC=2;AF=1.00;AN=2;DP=57;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.78;SOR=0.846 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2066,162,0
remove_filtered = False
): Setting Heranca
to include filtered reads, we find rows 2 (XIII:680940) and 4
(XIII:908174)XIII 680936 . T C 1282.06 PASS AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.37;SOR=1.143 GT:AD:DP:GQ:PL 1/1:0,30:30:90:1296,90,0
XIII 680940 . C T 1226.06 QH_Filter_exceeds_polyn_5 AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.70;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,28:28:84:1240,84,0
XIII 809198 . A G 2052.06 PASS AC=2;AF=1.00;AN=2;DP=57;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.78;SOR=0.846 GT:AD:DP:GQ:PL 1/1:0,54:54:99:2066,162,0
XIII 908174 . G T 567.64 QH_fails_ISA_5 AC=1;AF=0.500;AN=2;BaseQRankSum=0.372;DP=57;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=55.71;MQRankSum=-2.084;QD=11.58;ReadPosRankSum=-3.264;SOR=0.681 GT:AD:DP:GQ:PL 0/1:32,17:49:99:575,0,1292
Heranca will also generate summary log files of the counts of variant filter status categories. A complete description of the ‘QH’ filters is available here
MQ_FILTER 4
PASS 55
QH_exceeds_polyn 23
QH_low_genotype_relative_likelihood 6
SOR_FILTER 4
Here we see variants filtered by GATK4 include 4 filtered for MQ, and SOR criteria. 23 are filtered by heranca for exceeding the polyN criteria and 6 fail to meet the relative genotype likelihood criteria. Finally, 55 variants meet all the criteria. Notably, no variants are filtered for violating the ISA criteria as this is a ancestral strain.
PASS 27
QH_exceeds_polyn 22
QH_indel_fails_ISA 15
QH_low_genotype_relative_likelihood 1
Here we see that no variants were filtered by GATK4. 22 are filtered by heranca for exceeding the polyN criteria and 1 fails to meet the relative genotype likelihood criteria. 15 variants are filtered for violating ISA. Only, 27 variants meet all the criteria.