User Guide

A guide to running SMaSH. If you have a question that isn't answered here, please visit our support forum.

This guide covers:

  • How to run the SMaSH evaluation scripts.
  • Tips on setting up EC2 instances and gathering metrics.
  • Documentation for our rescue and normalization algorithms.

How to run the SMaSH evaluation scripts

The evaluation scripts can be run on consumer hardware; allow about 30 minutes of runtime and about 5 GB of RAM.
  • Check out the repository.
  • Install all dependencies. These are:
    • Python 2.7 or higher
    • PyVCF 0.6 or higher
    • PyFASTA 0.4.5 or higher
    • NumPy
  • Download the ground truth VCF and the reference FASTA for the dataset you wish to evaluate.
  • If using your own VCF files, make sure they are sorted by chromosome in the same order as your reference FASTA index. If not, you can sort them with:
    perl smash/scripts/sortByRef.pl /PATH/TO/YOUR/VCF /PATH/TO/YOUR/REFERENCE.FA(STA).FAI
  • The first stage in the SMaSH pipeline is to normalize the input variants. You can normalize your files in a single step and write them out to disk, using this command:
    python smash/smashbenchmarking/normalize_vcf.py /PATH/TO/YOUR/VCF /PATH/TO/REFERENCE.FA(STA) myvcf 50
    Alternatively, you can normalize as part of the evaluation script by adding the --normalize option (see the next step).
  • Run the evaluation script smash/smashbenchmarking/bench.py on the normalized VCFs, or add the --normalize option if your VCF files are not yet normalized:
    • For synthetic datasets:
      python smash/smashbenchmarking/bench.py /PATH/TO/TRUE_VCF /PATH/TO/PREDICTED_VCF /PATH/TO/REFERENCE /PATH/TO/REFERENCE/INDEX --snp_err 0.0 --indel_err 0.0 --sv_err 0.0 --sv_bp 100 -w 50
    • For mouse datasets:
      python smash/smashbenchmarking/bench.py /PATH/TO/TRUE_VCF /PATH/TO/PREDICTED_VCF /PATH/TO/REFERENCE /PATH/TO/REFERENCE/INDEX --snp_err 0.002 --indel_err 0.002 --sv_err 0.003 --sv_bp 100 -w 50
    • For sampled human datasets:
      python smash/smashbenchmarking/bench.py /PATH/TO/TRUE_VCF /PATH/TO/PREDICTED_VCF /PATH/TO/REFERENCE /PATH/TO/REFERENCE/INDEX --snp_err 0.0004 --indel_err 0 --sv_err 0.01 --sv_bp 100 -w 50
  • The resulting output will contain precision and recall metrics for your predicted VCF.
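
For reference, here is the whole pipeline in one place as a minimal shell sketch. It rests on several assumptions not stated above: the file names (predicted.vcf, true.vcf, ref.fa, ref.fa.fai) are placeholders for your own files, the pip package names and the stdout redirections after the sort and normalize steps are assumptions rather than documented behavior, the downloaded ground truth is assumed to already be normalized, and the error-rate flags shown are the sampled-human values from above.

    # install the Python dependencies (PyPI package names are assumed)
    pip install PyVCF pyfasta numpy

    # sort the predicted calls to match the reference index order, if needed
    # (capturing the sorted output from stdout is an assumption)
    perl smash/scripts/sortByRef.pl predicted.vcf ref.fa.fai > predicted.sorted.vcf

    # normalize the predicted calls ("myvcf" label and 50 bp window taken from above;
    # capturing the normalized VCF from stdout is an assumption)
    python smash/smashbenchmarking/normalize_vcf.py predicted.sorted.vcf ref.fa myvcf 50 > predicted.norm.vcf

    # evaluate against the downloaded ground truth (assumed to be distributed
    # already normalized), using the sampled-human error rates from above
    python smash/smashbenchmarking/bench.py true.vcf predicted.norm.vcf ref.fa ref.fa.fai \
        --snp_err 0.0004 --indel_err 0 --sv_err 0.01 --sv_bp 100 -w 50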

Gathering metrics on EC2

Here are some recommendations for benchmarking the performance of aligners and variant callers on EC2.
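
As a starting point, wall-clock time and peak memory for a single run can be captured with GNU time on the instance. This is only a sketch, not part of the SMaSH scripts themselves, and the bwa mem command and file names are placeholders for whichever aligner or variant caller you are measuring.

    # GNU time (the /usr/bin/time binary, not the shell builtin) reports wall-clock
    # time and maximum resident set size with -v; -o writes the report to a file
    /usr/bin/time -v -o metrics.txt bwa mem ref.fa reads_1.fq reads_2.fq > aln.sam

    # pull out the two fields most relevant for benchmarking
    grep -E "Elapsed \(wall clock\)|Maximum resident set size" metrics.txt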