Normalization

A VCF file represents variants: locations at which an individual's genome differs from a reference genome. Unfortunately, the same underlying sequence change can be written as a VCF variant in many different ways. SMaSH has two main strategies for dealing with these ambiguities: normalization and rescue. Here, we describe how variants are left-normalized prior to evaluation.

Here is a sequence from a reference genome, with a sequence from an individual genome below it. We can see that a small deletion is present in the individual. The VCF format describes this variant by the position at which it occurs, the reference allele, and an alternative allele. (Multiple alternative alleles can be represented in a single VCF record, but for now we'll only consider the case in which there is one.) The same variant could be described equally well in several different ways in this format:

#CHROM  POS  ID  REF      ALT   QUAL  FILTER  INFO  FORMAT  NA00001
chr1    5    .   GCCGA    GA    20    PASS    .     GT      0/1

#CHROM  POS  ID  REF      ALT   QUAL  FILTER  INFO  FORMAT  NA00001
chr1    4    .   CGCCG    CG    20    PASS    .     GT      0/1

#CHROM  POS  ID  REF      ALT   QUAL  FILTER  INFO  FORMAT  NA00001
chr1    2    .   ACGCCGA  ACGA  20    PASS    .     GT      0/1

#CHROM  POS  ID  REF      ALT   QUAL  FILTER  INFO  FORMAT  NA00001
chr1    2    .   ACGC     A     20    PASS    .     GT      0/1
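
To make the three fields that matter here concrete, the following is a minimal sketch of pulling the position, reference allele, and alternative alleles out of one tab-delimited VCF record. The function name parse_record is hypothetical and is not part of SMaSH; the sketch assumes a well-formed data line (not a header line) and splits ALT on commas to allow for multiple alternative alleles.

```python
def parse_record(line):
    """Return (chrom, pos, ref, alts) from one tab-delimited VCF data line.

    Hypothetical helper for illustration only; assumes the line has at
    least the first five VCF columns: CHROM, POS, ID, REF, ALT.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # POS is a 1-based integer; ALT may hold several comma-separated alleles.
    return chrom, int(pos), ref, alt.split(",")


record = "chr1\t2\t.\tACGC\tA\t20\tPASS\t.\tGT\t0/1"
chrom, pos, ref, alts = parse_record(record)
```

Only POS, REF, and ALT take part in normalization; the remaining columns are carried along unchanged.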

In order to fairly compare different representations of the same variant, we standardize them via left-normalization. Let's take this example:

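As a first step, we remove the longest common proper suffix from the reference and alternative alleles, that is, the longest run of trailing bases shared by every allele that still leaves at least one base in each.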

Then, we slide the alleles to the left: we add the preceding base from the reference to the beginning of every allele and remove the last base from each. We repeat this until the alleles no longer all end in the same base.

Executing this procedure on any of the different representations above will yield the same output, allowing us to easily compare them.
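
The two steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SMaSH's actual implementation: the function name left_normalize and the reference string REFERENCE are hypothetical, only a single alternative allele is handled, and positions are 1-based offsets into that assumed string rather than the coordinates of the records shown earlier.

```python
# Hypothetical reference sequence, chosen only for this illustration.
REFERENCE = "TACGCCGAT"


def left_normalize(ref_seq, pos, ref, alt):
    """Left-normalize one variant; pos is 1-based, as in VCF."""
    alleles = [ref, alt]

    def same_last_base():
        return len({a[-1] for a in alleles}) == 1

    # Step 1: strip the longest common proper suffix, so that every
    # allele keeps at least one base.
    while min(len(a) for a in alleles) > 1 and same_last_base():
        alleles = [a[:-1] for a in alleles]

    # Step 2: slide left while all alleles still end in the same base,
    # prepending the preceding reference base and dropping the last one.
    while pos > 1 and same_last_base():
        pos -= 1
        prev_base = ref_seq[pos - 1]  # 1-based position to 0-based index
        alleles = [prev_base + a[:-1] for a in alleles]

    return pos, alleles[0], alleles[1]


# Four ways of writing the same 3-base deletion against REFERENCE.
representations = [
    (4, "GCCGA", "GA"),
    (3, "CGCCG", "CG"),
    (2, "ACGCCGA", "ACGA"),
    (2, "ACGC", "A"),
]
normalized = {left_normalize(REFERENCE, *rep) for rep in representations}
```

Collecting the results in a set makes the comparison explicit: if the representations are equivalent, the set holds a single normalized (position, reference allele, alternative allele) triple.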