SMaSH: Scoring and Rescue

Scoring

SMaSH compares two VCF files and scores their concordance. In the usual use case, we consider one of those callsets ground truth and the other a set of predicted variants produced by a variant calling pipeline; we compare the predicted variants to the ground truth variants to determine how accurate the pipeline was. We may also compare two predicted callsets to see how well they agree; the algorithm for comparison is exactly the same, but fields should be relabeled (for example, true positives should be considered concordance, etc).

SMaSH broadly categorizes variants as SNPs (changes of a single base), indels (variants that are 50 basepairs in length or less) and structural variants (variants that are longer than 50 basepairs). Within those categories, we further distinguish indels as deletions, insertions, inversions and other, and structural variants as deletions, insertions and other.

SNPs and indels are evaluated strictly; a variant in one callset is the same as one in the other callset if it is at the same position and has the same alleles. Structural variants are evaluated more loosely; two SV calls are considered equivalent if they are of the same basic type (deletion, insertion, or many-to-many) and if their positions and length are within a given basepair tolerance. The tolerance can be set at runtime; the default value is 50.

Rescue

However, since it is possible to express the same underlying change in multiple different ways in the VCF format, simply evaluating based on exact matches would misclassify some equivalent variants. SMaSH addresses this problem with the rescue algorithm. After identifying true positive, false positive and false negative calls, SMaSH iterates over all false negatives to see if they can be rescued.

Ground truth callset
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	SMaSH classification
chr1	2	.	T	A	20	PASS	.	GT	0/1	SNP - False Negative
chr1	4	.	C	G	20	PASS	.	GT	0/1	SNP - False Negative

Ground truth callset #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 SMaSH classification chr1 2 . T A 20 PASS . GT 0/1 SNP - False Negative chr1 4 . C G 20 PASS . GT 0/1 SNP - False Negative Predicted callset #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 SMaSH classfication chr1 2 . TAC AAG 20 PASS . GT 0/1 Indel Other - False Positive

Predicted callset
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	SMaSH classfication
chr1	2	.	TAC	AAG	20	PASS	.	GT	0/1	Indel Other - False Positive

To attempt to rescue a variant, we take all the variants that fall within a window it. The size of this window can be configured; the default value is 50 basepairs. We use those variants and the reference to build up the underlying haplotypes for the true and predicted callsets. If those haplotypes are equivalent, we "rescue" all variants within that window by marking false negative variants as true positives and removing the false positive variants from the counts.

In this example, we pick the first false negative SNP and build up a window around it (abbreviated here for clarity). Since all the true and predicted variants are heterozygous reference, we expand and compare the reference sequence and the alternate sequence within the window. (Had one of these variants been homozygous alt, we would have made that change on both strands for comparison; for this reason, we only rescue variants with matching genotyping.) We can see here that both underlying haplotypes are equivalent, so rescue is successful.

Ground truth callset
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	SMaSH classification
chr1	2	.	T	A	20	PASS	.	GT	0/1	SNP - True Positive
chr1	4	.	C	G	20	PASS	.	GT	0/1	SNP - True Positive

Ground truth callset #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 SMaSH classification chr1 2 . T A 20 PASS . GT 0/1 SNP - True Positive chr1 4 . C G 20 PASS . GT 0/1 SNP - True Positive Predicted callset #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 SMaSH classfication chr1 2 . TAC AAG 20 PASS . GT 0/1 Indel Other - Rescued

Predicted callset
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	SMaSH classfication
chr1	2	.	TAC	AAG	20	PASS	.	GT	0/1	Indel Other - Rescued

Genotype Evaluation

Beyond evaluating variant matches, we also compare the genotyping of the two callsets. For all true positive calls, we count up the calls with matching genotypes and report the percent of true positive calls with incorrect genotyping. Phasing evaluation is not yet supported.

Ground truth callset
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	SMaSH classification
chr4	5	.	T	A	20	PASS	.	GT	0/1	SNP - True Positive

Ground truth callset #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 SMaSH classification chr4 5 . T A 20 PASS . GT 0/1 SNP - True Positive Predicted callset #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 SMaSH classfication chr4 5 . T A 20 PASS . GT 1/1 SNP - True Positive

Predicted callset
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	SMaSH classfication
chr4	5	.	T	A	20	PASS	.	GT	1/1	SNP - True Positive

In the above example, the SNP call is correct, but the genotypes do not match. We score this as 1 true positive SNP and 0 correctly genotyped SNP, yielding a precision of 100% and a non-reference discrepancy of 100%.