Fall2012SNPAnalysisPage

= Genotyping & SNP Analysis Workshop = Emilia Huerta, Filipe G. Vieira, Matteo Fumagalli
 * Oct 29, 2012 **

This wiki will guide you through some basic exercises for calling SNPs and genotypes. We will mainly use a program called 'ANGSD' (Analysis of Nexg Generation Sequencing Data) currently being developed in out lab. More information about usage, rational and some of the implemented methods can be found [|here] and on ANGSD wiki [|here].

The data needed for this tutorial is in a single tar.gz file that you can download [|here].
 * bin/ precompiled binary files for 32 and 64 bit Linux and MacOSX.
 * input/ input files for the analysis: //lycaeides.bed// (list of sites to analyze) and //lycaeides.files_list// (list of BAM files to analyze)
 * input/bam input BAM files for 20 Lycaeides butterflies ([])

In case you have some problem running the example, the outputs can be downloaded [|here].

toc

Installing
Create a folder and unpack the //workshop.tar.gz// file into it: code mkdir workshop cd workshop wget http://palin.popgen.dk/workshop/workshop.tar.gz __OR__  click above link tar -zxvf workshop.tar.gz code

This creates the bin/ and input/ folders. We will use 3 programs, 'ANGSD', 'optimSFS' and 'sfstools'. We can use the precompiled binaries or we can build the programs ourselves.

Using precompiled binaries
Just use the binary that best matches your system and run: code ln -fs bin/angsd.LINUX-x86_64 angsd ln -fs bin/optimSFS.LINUX-x86_64 optimSFS ln -fs bin/sfstools.LINUX-x86_64 sfstools code code ln -fs bin/angsd.OSX-i386 angsd ln -fs bin/optimSFS.OSX-i386 optimSFS ln -fs bin/sfstools.OSX-i386 sfstools code Test the binaries by running: code ./angsd ./optimSFS ./sfstools code
 * For x86_64 Linux:
 * For i386 MacOSX:

Building source files
If the precompiled binaries won't run you can compile the source code for youself.

Download the source code from and untar the file: code tar -zxvf angsd0.204.tar.gz code And compile: code cd angsd0.204/samtools-0.1.17 make clean; make cd ../misc make clean; make cd .. make clean; make cd .. ln -fs angsd0.204/angsd.g++ angsd ln -fs angsd0.204/misc/optimSFS.g++ optimSFS ln -fs angsd0.204/misc/sfstools.g++ sfstools code

Running Examples
We will now walk through an example. There are 20 BAM files in the input/bam subdirectory. The file called //lycaeides.files_list// is simply a file containing the relative filepaths to the actual BAM files. code head -n 3 input/lycaeides.files_list code The 20 BAM files are a small sample subset and we'll analyze around 100kb. More information about the original aligned files can be obtained on the original paper []. In this paper authors analyzed a butterfly Next-Generation Sequencing dataset sequenced using Illumina GAII technology. The dataset consists of a total of 381 samples of Lycaeides idas, Lycaeides melissa and an hybrid collection (Jackson Hole). DNA resequencing was conducted on custom reduced genomic complexity libraries. We will analyzed only a subset of 20 samples.
 * 1) ./input/bam/vic_10_05f.bam
 * 2) ./input/bam/sin_10_19f.bam
 * 3) ./input/bam/lan_10_18m.bam

Getting the MAF (Minor Allele Frequency) and calling SNPs
We will use the program ANGSD to analyze our data code ./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \ -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \ mpileup -gIDS -q 0 -Q 20 -C 50 -f input/referenceseq.fasta \ -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/null code where:
 * -ref Reference sequence (FASTA)
 * -anc Ancestral sequence (FASTA), for this example we will set the ancestral as reference because we don't have this information
 * -outfiles PREFIX for our output files
 * -doGlf Output Genotype Likelihoods (not used in this workshop)
 * -doMajorMinor Determine Major/Minor alleles
 * -doMaf Estimate the MAF using an EM algorithm
 * -doSNP Calculate the LRT for the site being variable

The above command generates some files, all prefixed with //lycaeides//. One of the most useful is //lycaeides.mafs//, which contains some basic infomation. You can view these files using 'less' and exit with 'q'. code less -S lycaeides.mafs code code ... ref_contig     183     C       A       C       C       0.000000        -0.000001       20 ref_contig     184     T       A       T       T       0.000000        -0.000001       20 ref_contig     185     G       A       G       G       0.000001        -0.000083       20 ref_contig     186     C       T       C       C       0.000000        -0.000002       20 ref_contig     187     T       A       T       T       0.000000        -0.000001       20 ref_contig     188     T       A       T       T       0.000000        -0.000001       20 ref_contig     189     G       A       G       G       0.000012        -0.000349       20 ref_contig     190     T       A       T       T       0.000000        -0.000001       20 ref_contig     191     A       G       A       A       0.033086        6.012643        20 ref_contig     192     T       C       T       T       0.026010        8.255082        20 ref_contig     193     A       C       A       A       0.000000        -0.000001       20 ref_contig     194     C       A       C       C       0.000000        -0.000001       20 ref_contig     195     T       A       T       T       0.000000        -0.000001       20 ref_contig     196     A       C       A       A       0.000000        -0.000001       20 ref_contig     197     C       A       C       C       0.000000        -0.000001       20 ref_contig     198     A       C       A       A       0.000000        -0.000001       20 ref_contig     199     G       A       G       G       0.016429        0.152365        20 ref_contig     200     T       A       T       T       0.000000        -0.000002       20 ref_contig     201     A       C       A       A       0.000000        -0.000001       20 ref_contig     202     A       C       A       A       0.000000        -0.000001       20 ... code

Column1-2 is the position, column3 and 4 is major and minor. Column 5-6 is the reference and the ancestral (the reference is not used for any caluculations in this workshop). Column7 is an estimate of the MAF, this estimate is based on the likelihoods and using an EM-algorithm. Column 8 is a 2*loglike test of the site being variable. And the last column is the number of individuals for which we have data at the site.

With this output we can already perform a crude SNP calling, based on the MAF. In fact, ANGSD can easily provide with the //-minMaf// argument. code ./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides_VAR -minMaf 0.05 \ -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/null code Which gives us the following output in //lycaeides_VAR.mafs//: code chromo position        major   minor   ref     anc     knownEM pK-EM   nInd ref_contig     321     G       A       G       G       0.084433        6.943010        19 ref_contig     387     T       C       C       C       0.152956        70.097210       20 ref_contig     620     A       G       A       A       0.107351        14.582970       18 ref_contig     681     T       C       T       T       0.245610        78.355335       18 ref_contig     1114    A       G       A       A       0.118452        3.339191        7 ref_contig     1370    G       A       G       G       0.050942        31.441695       20 ref_contig     1381    C       T       C       C       0.139262        27.149469       20 ref_contig     1442    A       G       A       A       0.081306        10.207323       20 ... code However, this methodology can be error prone and rely on an arbitrary choice of a cutoff. Another more robust way to do it is to use the results from Column 8. Those values are chi-square distributed with 1 d.f. Therefore the 0.05 threshold is at 3.84. We can print all sites which have an LRT value above a certain threshold using the parameter -minLRT. code ./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides_VAR2 \ -minLRT 5 -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/null code code chromo   position    major    minor    ref    anc    knownEM    pK-EM    nInd ref_contig   191    A    G    A    A    0.032944    6.003593    20 ref_contig   192    T    C    T    T    0.025917    8.248013    20 ref_contig   321    G    A    G    G    0.081707    6.812460    19 ref_contig   329    G    A    G    G    0.041448    21.143322    20 ref_contig   387    T    C    C    C    0.177486    76.792329    20 ref_contig   549    G    T    G    G    0.024890    7.788239    20 ref_contig   620    A    G    A    A    0.104913    14.438307    18 ref_contig   643    G    A    G    G    0.047299    8.346708    18 ref_contig   681    T    C    T    T    0.209771    66.714527    18 ... code

Estimating the Site Frequency Spectrum (SFS)
For estimating the SFS we will need to generate a .sfs file which is a binary file that contains the posterior probabilities of sample allele frequencies for the sites where we have data, and where we have information about the ancestral state. code ./angsd -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \ -realSFS 1 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/null code

We can obtain the global spectrum using optimSFS code ./optimSFS -nThread 10 -binput lycaeides.sfs -nChr 40 -outnames lycaeides.sfs code This generates a file called //lycaeides.sfs.ml//, which contains our SFS estimate. code cat lycaeides.sfs.ml code But we have too few sites to get a good estimate so we'll use a pre-computed file of the entire genome. Notice that not all sites are represented, since we don't have reference/ancestral information for all sites within our region.
 * 1) 0.978381   0.008968    0.002196    0.001873    0.000761    0.001017    0.000828    0.000855    0.000000    0.000141    0.001058    0.000000    0.000367    0.000350    0.000010    0.000315    0.000000    0.000269    0.000303    0.000312    0.000000    0.000137    0.000091    0.000000    0.000000    0.000139    0.000191    0.000133    0.000000    0.000062    0.000000    0.000000    0.000023    0.000123    0.000000    0.000089    0.000000    0.000113    0.000000    0.000184    0.000711

To view the SFS, and compare the estimates that we have from a short region or from the whole-genome, we can use R: code R sfs=scan("lycaeides.sfs.ml"); sfs_full=scan("input/lycaeides.sfs.ml") sfs[1]=sfs_full[1]=NA; barplot(rbind(sfs,sfs_full), beside=T, legend=c("estimated on a short region", "estimated on a genome-wide scale"), main="SFS")
 * 1) we may want to 'mask' the frequency of non variable sites

code

Now we can finally compute the per site posterior probabilities of sample allele frequencies using the overall SFS a a prior: code ./sfstools -nChr 40 -priorFile input/lycaeides.sfs.ml -sfsFile lycaeides.sfs -tajima lycaeides.sfs.ml.tajima -dumpBinary 0 > lycaeides.sfs.ml.post code This will output two files, one with posterior probabilities //lycaeides.sfs.ml.post// and another with Tajima values //lycaeides.sfs.ml.tajima//. Notably, we can use these information to implement a new way to call SNPs. For instance we can call a SNP if the probability of the site being at sample allele frequency zero is lower than a certain threshold (e.g. 0.05).

Comparing results
Here we've looked into 3 different ways of calling SNPs:
 * setting a threshold on the estimated MAF
 * using a Likelihood-Ratio Test (LRT)
 * using the posterior probability of sample allele frequencies

But how do they compare? Let's merge all the results into a single file: code awk 'BEGIN{print "post_probSFS"} {print 1-$1}' lycaeides.sfs.ml.post | paste lycaeides.mafs - > lycaeides.ALL code From this file we can see the number of sites that we'd call SNP's under each approach:
 * ~ Method ||~ # # SNPs ||
 * MAF > 0.01 ||> 2984 ||
 * MAF > 0.025 ||> 2411 ||
 * MAF > 0.05 ||> 1258 ||
 * LRT > 3.84 (Chi2 p=0.05; df=1) ||> 1947 ||
 * LRT > 6.63 (Chi2 p=0.01; df=1) ||> 1532 ||
 * SFS_post > 0.95 ||> 888 ||
 * SFS_post > 0.99 ||> 746 ||

Checking some of the sites in detail: code ... ref_contig     190     T       A       T       T       0.000000        -0.000001       20      0.000273 ref_contig     191     A       G       A       A       0.032944        6.003593        20      0.167791 ref_contig     192     T       C       T       T       0.025917        8.248013        20      0.416617 ref_contig     193     A       C       A       A       0.000000        -0.000001       20      0.000294 ... ref_contig     328     C       A       C       C       0.000000        -0.000011       20      0.000709 ref_contig     329     G       A       G       G       0.041448        21.143322       20      0.996964 ref_contig     330     T       A       T       T       0.000000        -0.000014       20      0.000756 ... ref_contig     1661    A       C       A       A       0.000008        -0.000031       2       0.017503 ref_contig     1662    A       C       A       A       0.332954        9.540601        3       0.01851 ref_contig     1663    G       A       G       G       0.000008        -0.000062       4       0.019638 ... code

Calling Genotypes
To call genotypes we can use the //doGeno// argument. For that, we can use the previous analysis to get a list of putatively variable sites. We'll use the posterior probabilities of sample allele frequency criteria with a p-value of 0.05 as an example: code awk '$10 > 0.95 {print $1,$2-1,$2}' lycaeides.ALL > lycaeides_VAR.bed code And we feed this file into our genotype calling: code ./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \ -doMajorMinor 1 -doMaf 2 -doGeno 64 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l lycaeides_VAR.bed > /dev/null code Which will give us a binary file with the posterior probabilities of each genotype for each individual and site. Binary is very convenient if you want to use it later on (small size, fast access) but it can be a problem to manually inspect. We can use the //doGeno 5// argument to produce a text file instead. code ./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \ -doMajorMinor 1 -doMaf 2 -doGeno 5 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l lycaeides_VAR.bed > /dev/null code Which will give us a file with the position (col 1-2), the major and minor alleles (col 3-4) and the called genotypes (number coded: 0=AA, 1=AC, 2=AG, 3=AT, 4=CC, 5=CG, 6=CT, 7=GG, 8=GT, 9=TT). code chromo position        major   minor   MAXLRT ref_contig     329     G       A       7       7       7       7       7       7       7       7       7       7       7       7       7       7       2       7       7       7       7       7 ref_contig     387     T       C       6       6       9       9       6       9       9       9       9       9       9       6       9       9       6       6       9       9       6       9 ref_contig     681     T       C       9       0       6       9       6       9       9       9       9       6       0       6       6       9       6       9       9       6       9       9 ref_contig     1370    G       A       7       7       7       7       7       2       7       7       7       7       7       7       7       7       7       2       7       7       2       7 ref_contig     1381    C       T       6       4       6       9       4       4       4       4       4       4       6       6       4       4       4       4       4       4       4       4 ref_contig     1500    G       T       7       7       7       7       7       8       7       7       7       7       7       7       7       7       7       7       7       7       7       7 ref_contig     1543    A       G       0       0       0       0       0       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0 ref_contig     1565    A       T       0       3       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0 ref_contig     1747    A       C       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0 ref_contig     2341    A       C       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0 ref_contig     2412    C       A       4       4       4       4       4       4       4       4       0       4       0       4       4       0       4       4       4       4       4       4 ref_contig     2616    A       G       0       0       0       0       0       0       0       0       0       2       0       0       0       0       0       0       0       0       0       0 ref_contig     2666    C       T       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4 ref_contig     2715    T       A       0       0       9       9       0       0       0       9       0       0       0       0       9       9       0       0       9       9       9       0 ref_contig     3118    T       A       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9 ref_contig     3205    C       A       4       0       4       0       4       4       4       4       4       0       0       4       4       4       4       4       4       4       4       4 ref_contig     3239    A       C       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0 ref_contig     3285    T       A       9       0       9       0       9       9       9       9       0       0       0       9       9       9       0       9       9       9       9       0 ref_contig     3326    A       C       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0 ref_contig     3584    G       A       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7 ... code