SNVer is a statistical tool for calling common and rare variants in analysis of pool or individual next-generation sequencing data. It reports one single overall p-value for evaluating the significance of a candidate locus being a variant, based on which multiplicity control can be obtained. Loci with any (low) coverage can be tested and depth of coverage will be quantitatively factored into final significance calculation. SNVer runs very fast, making it feasible for analysis of whole-exome sequencing data, or even whole-genome sequencing data.

Manual


SNVer for Pooled Sequencing


Usage: java -jar SNVerPool.jar -i input_directory -r reference_file 
	[-c pool_info_file | -n number_of_haploids] [options]

Required:

 -i	Input directory must be a directory contains a batch of 
	standard SAM/BAM files.
	
 -r	Reference file must be the same as what are used in the input 
	bam files. Inconsistent reference file, such as different chromosome 
	names and different chromosome length, will prevent SNVer from 
	running. See how to prepare a reference in Getting Started. 
	
 -c	Pool info file, a tab-delim file containing five columns, the 
	first column is for file names and the second column is for the 
	number of haploids in the pool and the third column is the number
	of samples. For example, if assuming diploid individual, the number
	should be 2 * no. of individuals in a pool. The file is preferred 
	when dealing with pools with different pool size and the output vcf 
	file is also based on the order of file names listed. The following 
	is an example of the pool info file, the line starting with # will 
	be omitted automatically. In the new SNVer, user is able to set 
	different base quality and mapping quality threshold among different 
	pools, which can be configured in the fouth and fifth columns, if 
	missing, -bq and -mq will be used for the analysis.
	
	#names	no.haploids	no.samples	mq	bq
	test1.bam	2	1		17	30
	test2.bam	2	1		20	20
	
 -n	The number of haploids, similar with the second column in pool 
	info file, which is required when option -c is not given. If a number 
	is give here, SNVer will assign each pool with the same number of 
	haploids for calculation. For example, if each pool has 5 samples, 
	we need to input 10 haploids here.

 

SNVer for Individual Sequencing


Usage: java -jar SNVerIndividual.jar -i input_file -r reference_file [options]

Required:

 -i	Input file must be a standard SAM/BAM file.
 
 -r	Reference file must be the same as what is used in the input bam 
	file. Inconsistent reference file, such as different chromosome names 
	and different chromosome length, will prevent SNVer from running.
	See how to prepare a reference in Getting Started.  
   

More Options:


 -l	Target regions in bed file format, if absent, SNVer will pileup
	for the entire reference. The default is null.
    
 -o	The prefix of output file, SNVer output results will generate 
	two vcf files, one is called prefix.raw.vcf and the other is called 
	prefix.filter.vcf according to the p-value cutoff uses. The default 
	is input_file.filter.vcf for SNVerIndividual and input_directory.
	filter.vcf for SNVerPool.
	
 -n	The number of haploids. For SNVerIndividual, the default is 2, 
	assuming diploid individual. For SNVerPool, which is mentioned in the
	requirement of SNVerPool.
	
 -mq	The mapping quality threshold, meaning that only consider reads with 
	mapping quality above the cutoff. The default is 20.
	 
 -bq	The base quality threshold, meaning that only consider bases with 
	base quality above the cutoff. The default is 17.
	
 -s	The strand bias threshold, aiming to remove potential false postives 
	due to strand bias issue. SNVer uses a one-sided binomial test for 
	alternative forward count, and alternative reverse count. The default 
	p-value cutoff is 0.0001.
	
 -f	The fisher exact threshold, aim to remove potential false postives
	due to allele imbalance issue. SNVer uses a one-sided fisher's exact 
	test for contingency table of alternative forward count, alternative 
	reverse count, reference forward count, reference reverse count. 
	The default p-value cutoff is 0.0001.
	
 -p	The SNVer p-value threshold for testing significant variants. The 
	default p-value cutoff is based on Bonferroni correction, the definition
	is 0.05/the number of tests. If specify a p-value cutoff, say 0.5, 
	the loci with p-value greater than the cutoff would be filtered out.
	P-value range is [0-1].
	
 -a	Require at least this number of reads supporting each strand for 
	alternative allele. For example, if alternative forward count is 0 and 
	alternative reverse count is 10. The loci would be discarded. The 
	default is at least 1 supported read.
	
 -b	Require the ratio of alt/ref above the threshold, aiming to filter out 
	loci with reference bias problem. The default ratio is 0.25. This is only
	for individual data.
	
 -t	Allele frequency threshold, which can be only used in analysis of pool
	data. For example, if you want to test all variants, this should be 0.
	The default is 0.01. Allele frequency range is [0-1].
	
 -het	The heterozygosity, the prior for computing posterior probability of 
	genotypes. This is only for computing the genotype for individual data. 
	The default is 0.001.
	
 -u	Inactivate -s and -f above this threshold. The default is 30, which means 
	if observing 30 or more alterative count, SNVer will not conduct such tests. 
	Decreasing -u will increase the sensitivity but lower the specificity. 

 -db	Support query for snp_id, if users provide a dbSNP file and the column 
	number of chr, pos, snp_id. The format is [path for dbSNP, column number of 
	chromosome, position and snp_id]. The default is null, meaning that no such 
	query needed.


Understand VCF Output


The basic VCF fields, from CHROM to FILTER, are standard as VCF 4.0. SNVer will give a p-value for each variant instead of a score in QUAL field.

VCF fields used by SNVerPool:

INFO Meaning
DP   total depth above base quality threshold, among all the pools
NP   number of pools without no coverage or strand bias
AF   estimated alternative allele frequency
PV   p-value generated by SNVerPool
FORMAT Meaning
AC   alternative allele count above base quality threshold in this pool
DP   total depth above base quality threshold in this pool

VCF fields used by SNVerIndividual:

INFO Meaning
DP   total depth above base quality threshold
AC   alternative allele count above base quality threshold
SP   strand bias test p-value
FS   fisher's exact test p-value
PV   p-value generated by SNVerIndividual
FORMAT Meaning
AC1   alternative allele forward count above base quality threshold
AC2   alternative allele reverse count above base quality threshold
RC1   reference allele forward count above base quality threshold
RC2   reference allele reverse count above base quality threshold
GT   genotype: 1 for alt, 0 for ref
  1/1: homozygous alternate
  1/0: heterozygous
  0/0: homozygous reference
PL   Phred posterior probablity of 1/1,1/0,0/0

Understand heterozygosity


We currently adapt GATK's prior when computing SNVer's posterior probablity of 1/1,1/0,0/0, which slightly improves our genotyping concordance. You may change the expected heterozygosity value by -h option. The default priors are:
  het = 1e-3
  P(hom-ref genotype) = 1 - 3 * het / 2
  P(het genotype) = het
  P(hom-var genotype) = het / 2

Empirical error rate estimation


Instead of -e 0.01 error rate threshold in the previous version, the current version estimates the empirical error rate with 0.01+weighted mean of base quality, reasoning that the error could be coming from two parts:
 1. One is from mapping error, which is currently set to be a simple value 
    of 0.01, since it can give a reasonable accuracy on real data. In later 
    release, this part might be also replaced by an empirical estimation.
 2. The other part is from base error, we adapt similar idea (Supplementary 
    Equation8) from MAQ paper(Li, Ruan et al. 2008), in order to estimate 
    empirically for each locus by considering the error dependency, which is 
    governed by the weight.