Manual
SNVer for Pooled Sequencing
Usage: java -jar SNVerPool.jar -i input_directory -r reference_file [-c pool_info_file | -n number_of_haploids] [options] Required: -i Input directory must be a directory contains a batch of standard SAM/BAM files. -r Reference file must be the same as what are used in the input bam files. Inconsistent reference file, such as different chromosome names and different chromosome length, will prevent SNVer from running. See how to prepare a reference in Getting Started. -c Pool info file, a tab-delim file containing five columns, the first column is for file names and the second column is for the number of haploids in the pool and the third column is the number of samples. For example, if assuming diploid individual, the number should be 2 * no. of individuals in a pool. The file is preferred when dealing with pools with different pool size and the output vcf file is also based on the order of file names listed. The following is an example of the pool info file, the line starting with # will be omitted automatically. In the new SNVer, user is able to set different base quality and mapping quality threshold among different pools, which can be configured in the fouth and fifth columns, if missing, -bq and -mq will be used for the analysis. #names no.haploids no.samples mq bq test1.bam 2 1 17 30 test2.bam 2 1 20 20 -n The number of haploids, similar with the second column in pool info file, which is required when option -c is not given. If a number is give here, SNVer will assign each pool with the same number of haploids for calculation. For example, if each pool has 5 samples, we need to input 10 haploids here.
SNVer for Individual Sequencing
Usage: java -jar SNVerIndividual.jar -i input_file -r reference_file [options] Required: -i Input file must be a standard SAM/BAM file. -r Reference file must be the same as what is used in the input bam file. Inconsistent reference file, such as different chromosome names and different chromosome length, will prevent SNVer from running. See how to prepare a reference in Getting Started.
More Options:
-l Target regions in bed file format, if absent, SNVer will pileup for the entire reference. The default is null. -o The prefix of output file, SNVer output results will generate two vcf files, one is called prefix.raw.vcf and the other is called prefix.filter.vcf according to the p-value cutoff uses. The default is input_file.filter.vcf for SNVerIndividual and input_directory. filter.vcf for SNVerPool. -n The number of haploids. For SNVerIndividual, the default is 2, assuming diploid individual. For SNVerPool, which is mentioned in the requirement of SNVerPool. -mq The mapping quality threshold, meaning that only consider reads with mapping quality above the cutoff. The default is 20. -bq The base quality threshold, meaning that only consider bases with base quality above the cutoff. The default is 17. -s The strand bias threshold, aiming to remove potential false postives due to strand bias issue. SNVer uses a one-sided binomial test for alternative forward count, and alternative reverse count. The default p-value cutoff is 0.0001. -f The fisher exact threshold, aim to remove potential false postives due to allele imbalance issue. SNVer uses a one-sided fisher's exact test for contingency table of alternative forward count, alternative reverse count, reference forward count, reference reverse count. The default p-value cutoff is 0.0001. -p The SNVer p-value threshold for testing significant variants. The default p-value cutoff is based on Bonferroni correction, the definition is 0.05/the number of tests. If specify a p-value cutoff, say 0.5, the loci with p-value greater than the cutoff would be filtered out. P-value range is [0-1]. -a Require at least this number of reads supporting each strand for alternative allele. For example, if alternative forward count is 0 and alternative reverse count is 10. The loci would be discarded. The default is at least 1 supported read. -b Require the ratio of alt/ref above the threshold, aiming to filter out loci with reference bias problem. The default ratio is 0.25. This is only for individual data. -t Allele frequency threshold, which can be only used in analysis of pool data. For example, if you want to test all variants, this should be 0. The default is 0.01. Allele frequency range is [0-1]. -het The heterozygosity, the prior for computing posterior probability of genotypes. This is only for computing the genotype for individual data. The default is 0.001. -u Inactivate -s and -f above this threshold. The default is 30, which means if observing 30 or more alterative count, SNVer will not conduct such tests. Decreasing -u will increase the sensitivity but lower the specificity. -db Support query for snp_id, if users provide a dbSNP file and the column number of chr, pos, snp_id. The format is [path for dbSNP, column number of chromosome, position and snp_id]. The default is null, meaning that no such query needed.
Understand VCF Output
The basic VCF fields, from CHROM to FILTER, are standard as
VCF 4.0. SNVer will give a p-value for each variant instead of a score in QUAL field.
VCF fields used by SNVerPool:
VCF fields used by SNVerIndividual:
VCF fields used by SNVerPool:
INFO | Meaning |
---|---|
DP | total depth above base quality threshold, among all the pools |
NP | number of pools without no coverage or strand bias |
AF | estimated alternative allele frequency |
PV | p-value generated by SNVerPool |
FORMAT | Meaning |
AC | alternative allele count above base quality threshold in this pool |
DP | total depth above base quality threshold in this pool |
VCF fields used by SNVerIndividual:
INFO | Meaning |
---|---|
DP | total depth above base quality threshold |
AC | alternative allele count above base quality threshold |
SP | strand bias test p-value |
FS | fisher's exact test p-value |
PV | p-value generated by SNVerIndividual |
FORMAT | Meaning |
AC1 | alternative allele forward count above base quality threshold |
AC2 | alternative allele reverse count above base quality threshold |
RC1 | reference allele forward count above base quality threshold |
RC2 | reference allele reverse count above base quality threshold |
GT | genotype: 1 for alt, 0 for ref
1/1: homozygous alternate 1/0: heterozygous 0/0: homozygous reference |
PL | Phred posterior probablity of 1/1,1/0,0/0 |
Understand heterozygosity
We currently adapt GATK's prior when computing SNVer's posterior probablity of 1/1,1/0,0/0, which slightly improves our genotyping concordance. You may change the expected heterozygosity value by -h option. The default priors are:
het = 1e-3
P(hom-ref genotype) = 1 - 3 * het / 2
P(het genotype) = het
P(hom-var genotype) = het / 2
Empirical error rate estimation
Instead of -e 0.01 error rate threshold in the previous version, the current version estimates the empirical error rate with 0.01+weighted mean of base quality, reasoning that the error could be coming from two parts:
1. One is from mapping error, which is currently set to be a simple value of 0.01, since it can give a reasonable accuracy on real data. In later release, this part might be also replaced by an empirical estimation. 2. The other part is from base error, we adapt similar idea (Supplementary Equation8) from MAQ paper(Li, Ruan et al. 2008), in order to estimate empirically for each locus by considering the error dependency, which is governed by the weight.