User Guide

A guide to running SMaSH. If you have a question that isn't answered here, please visit our support forum.

This guide covers:

  • How to run the SMaSH evaluation scripts.
  • Tips on setting up EC2 instances and gathering metrics.
  • Documentation for our rescue and normalization algorithms.

How to run the SMaSH evaluation scripts

The evaluation scripts can be run on consumer hardware; allow about 30 minutes of runtime and about 5 GB of RAM.
  • Check out the repository.
  • Install all dependencies. These are:
    • Python 2.7 or higher
    • PyVCF 0.6 or higher
    • PyFASTA 0.4.5 or higher
    • NumPy
  • Download the ground truth VCF and the reference FASTA for the dataset you wish to evaluate.
  • If using your own VCF files, make sure they are sorted by chromosome in the same order as your reference FASTA index. If not, you can sort them with:
    perl smash/scripts/sortByRef.pl /PATH/TO/YOUR/VCF /PATH/TO/YOUR/REFERENCE.FA(STA).FAI
  • The first stage in the SMaSH pipeline is to normalize the input variants. You can normalize your files in a single step and write them out to disk, using this command:
    python smash/smashbenchmarking/normalize_vcf.py /PATH/TO/YOUR/VCF /PATH/TO/REFERENCE.FA(STA) myvcf 50
    Alternatively, you can normalize as part of the evaluation script by adding the --normalize option (see the next step).
  • Run the evaluation script smash/smashbenchmarking/bench.py on the normalized VCFs, or add the --normalize option if your VCF files are not yet normalized:
    • For synthetic datasets:
      python smash/smashbenchmarking/bench.py /PATH/TO/TRUE_VCF /PATH/TO/PREDICTED_VCF /PATH/TO/REFERENCE /PATH/TO/REFERENCE/INDEX --snp_err 0.0 --indel_err 0.0 --sv_err 0.0 --sv_bp 100 -w 50
    • For mouse datasets:
      python smash/smashbenchmarking/bench.py /PATH/TO/TRUE_VCF /PATH/TO/PREDICTED_VCF /PATH/TO/REFERENCE /PATH/TO/REFERENCE/INDEX --snp_err 0.002 --indel_err 0.002 --sv_err 0.003 --sv_bp 100 -w 50
    • For sampled human datasets:
      python smash/smashbenchmarking/bench.py /PATH/TO/TRUE_VCF /PATH/TO/PREDICTED_VCF /PATH/TO/REFERENCE /PATH/TO/REFERENCE/INDEX --snp_err 0.0004 --indel_err 0 --sv_err 0.01 --sv_bp 100 -w 50
  • The resulting output will contain precision and recall metrics for your predicted VCF.
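
For reference, here is the whole pipeline in one place as a minimal shell sketch. It rests on several assumptions not stated above: the file names (predicted.vcf, true.vcf, ref.fa, ref.fa.fai) are placeholders for your own files, the pip package names and the stdout redirections after the sort and normalize steps are assumptions rather than documented behavior, the downloaded ground truth is assumed to already be normalized, and the error-rate flags shown are the sampled-human values from above.

    # install the Python dependencies (PyPI package names are assumed)
    pip install PyVCF pyfasta numpy

    # sort the predicted calls to match the reference index order, if needed
    # (capturing the sorted output from stdout is an assumption)
    perl smash/scripts/sortByRef.pl predicted.vcf ref.fa.fai > predicted.sorted.vcf

    # normalize the predicted calls ("myvcf" label and 50 bp window taken from above;
    # capturing the normalized VCF from stdout is an assumption)
    python smash/smashbenchmarking/normalize_vcf.py predicted.sorted.vcf ref.fa myvcf 50 > predicted.norm.vcf

    # evaluate against the downloaded ground truth (assumed to be distributed
    # already normalized), using the sampled-human error rates from above
    python smash/smashbenchmarking/bench.py true.vcf predicted.norm.vcf ref.fa ref.fa.fai \
        --snp_err 0.0004 --indel_err 0 --sv_err 0.01 --sv_bp 100 -w 50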

Gathering metrics on EC2

Here are some recommendations for benchmarking the performance of aligners and variant callers on EC2.
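
As a starting point, wall-clock time and peak memory for a single run can be captured with GNU time on the instance. This is only a sketch, not part of the SMaSH scripts themselves, and the bwa mem command and file names are placeholders for whichever aligner or variant caller you are measuring.

    # GNU time (the /usr/bin/time binary, not the shell builtin) reports wall-clock
    # time and maximum resident set size with -v; -o writes the report to a file
    /usr/bin/time -v -o metrics.txt bwa mem ref.fa reads_1.fq reads_2.fq > aln.sam

    # pull out the two fields most relevant for benchmarking
    grep -E "Elapsed \(wall clock\)|Maximum resident set size" metrics.txt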