Genotyping & SNP Analysis Workshop

Oct 29, 2012
Emilia Huerta, Filipe G. Vieira, Matteo Fumagalli

This wiki will guide you through some basic exercises for calling SNPs and genotypes. We will mainly use a program called 'ANGSD' (Analysis of Nexg Generation Sequencing Data) currently being developed in out lab. More information about usage, rational and some of the implemented methods can be found here and on ANGSD wiki here.

The data needed for this tutorial is in a single tar.gz file that you can download here.
  • bin/ precompiled binary files for 32 and 64 bit Linux and MacOSX.
  • input/ input files for the analysis: lycaeides.bed (list of sites to analyze) and lycaeides.files_list (list of BAM files to analyze)
  • input/bam input BAM files for 20 Lycaeides butterflies (http://www.ncbi.nlm.nih.gov/pubmed/22759293)

In case you have some problem running the example, the outputs can be downloaded here.


Installing

Create a folder and unpack the workshop.tar.gz file into it:
mkdir workshop
cd workshop
wget http://palin.popgen.dk/workshop/workshop.tar.gz  __OR__  click above link
tar -zxvf workshop.tar.gz

This creates the bin/ and input/ folders.
We will use 3 programs, 'ANGSD', 'optimSFS' and 'sfstools'.
We can use the precompiled binaries or we can build the programs ourselves.

Using precompiled binaries

Just use the binary that best matches your system and run:
  • For x86_64 Linux:
ln -fs bin/angsd.LINUX-x86_64 angsd
ln -fs bin/optimSFS.LINUX-x86_64 optimSFS
ln -fs bin/sfstools.LINUX-x86_64 sfstools
  • For i386 MacOSX:
ln -fs bin/angsd.OSX-i386 angsd
ln -fs bin/optimSFS.OSX-i386 optimSFS
ln -fs bin/sfstools.OSX-i386 sfstools
Test the binaries by running:
./angsd
./optimSFS
./sfstools

Building source files

If the precompiled binaries won't run you can compile the source code for youself.

Download the source code from here and untar the file:
tar -zxvf angsd0.204.tar.gz
And compile:
cd angsd0.204/samtools-0.1.17
make clean; make
cd ../misc
make clean; make
cd ..
make clean; make
cd ..
ln -fs angsd0.204/angsd.g++ angsd
ln -fs angsd0.204/misc/optimSFS.g++ optimSFS
ln -fs angsd0.204/misc/sfstools.g++ sfstools

Running Examples

We will now walk through an example. There are 20 BAM files in the input/bam subdirectory. The file called lycaeides.files_list is simply a file containing the relative filepaths to the actual BAM files.
head -n 3 input/lycaeides.files_list
# ./input/bam/vic_10_05f.bam
# ./input/bam/sin_10_19f.bam
# ./input/bam/lan_10_18m.bam
The 20 BAM files are a small sample subset and we'll analyze around 100kb. More information about the original aligned files can be obtained on the original paper http://www.ncbi.nlm.nih.gov/pubmed/22759293. In this paper authors analyzed a butterfly Next-Generation Sequencing dataset sequenced using Illumina GAII technology. The dataset consists of a total of 381 samples of Lycaeides idas, Lycaeides melissa and an hybrid collection (Jackson Hole). DNA resequencing was conducted on custom reduced genomic complexity libraries. We will analyzed only a subset of 20 samples.

Getting the MAF (Minor Allele Frequency) and calling SNPs

We will use the program ANGSD to analyze our data
./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \
        -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \
        mpileup -gIDS -q 0 -Q 20 -C 50 -f input/referenceseq.fasta \
        -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/null
where:
  • -ref Reference sequence (FASTA)
  • -anc Ancestral sequence (FASTA), for this example we will set the ancestral as reference because we don't have this information
  • -outfiles PREFIX for our output files
  • -doGlf Output Genotype Likelihoods (not used in this workshop)
  • -doMajorMinor Determine Major/Minor alleles
  • -doMaf Estimate the MAF using an EM algorithm
  • -doSNP Calculate the LRT for the site being variable

The above command generates some files, all prefixed with lycaeides. One of the most useful is lycaeides.mafs, which contains some basic infomation. You can view these files using 'less' and exit with 'q'.
less -S lycaeides.mafs
...
ref_contig      183     C       A       C       C       0.000000        -0.000001       20
ref_contig      184     T       A       T       T       0.000000        -0.000001       20
ref_contig      185     G       A       G       G       0.000001        -0.000083       20
ref_contig      186     C       T       C       C       0.000000        -0.000002       20
ref_contig      187     T       A       T       T       0.000000        -0.000001       20
ref_contig      188     T       A       T       T       0.000000        -0.000001       20
ref_contig      189     G       A       G       G       0.000012        -0.000349       20
ref_contig      190     T       A       T       T       0.000000        -0.000001       20
ref_contig      191     A       G       A       A       0.033086        6.012643        20
ref_contig      192     T       C       T       T       0.026010        8.255082        20
ref_contig      193     A       C       A       A       0.000000        -0.000001       20
ref_contig      194     C       A       C       C       0.000000        -0.000001       20
ref_contig      195     T       A       T       T       0.000000        -0.000001       20
ref_contig      196     A       C       A       A       0.000000        -0.000001       20
ref_contig      197     C       A       C       C       0.000000        -0.000001       20
ref_contig      198     A       C       A       A       0.000000        -0.000001       20
ref_contig      199     G       A       G       G       0.016429        0.152365        20
ref_contig      200     T       A       T       T       0.000000        -0.000002       20
ref_contig      201     A       C       A       A       0.000000        -0.000001       20
ref_contig      202     A       C       A       A       0.000000        -0.000001       20
...

Column1-2 is the position, column3 and 4 is major and minor. Column 5-6 is the reference and the ancestral (the reference is not used for any caluculations in this workshop). Column7 is an estimate of the MAF, this estimate is based on the likelihoods and using an EM-algorithm. Column 8 is a 2*loglike test of the site being variable. And the last column is the number of individuals for which we have data at the site.

With this output we can already perform a crude SNP calling, based on the MAF. In fact, ANGSD can easily provide with the -minMaf argument.
./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides_VAR -minMaf 0.05 \
        -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \
        mpileup -gIDS -q 0 -Q 20 -C 50 \
        -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/null
Which gives us the following output in lycaeides_VAR.mafs:
chromo  position        major   minor   ref     anc     knownEM pK-EM   nInd
ref_contig      321     G       A       G       G       0.084433        6.943010        19
ref_contig      387     T       C       C       C       0.152956        70.097210       20
ref_contig      620     A       G       A       A       0.107351        14.582970       18
ref_contig      681     T       C       T       T       0.245610        78.355335       18
ref_contig      1114    A       G       A       A       0.118452        3.339191        7
ref_contig      1370    G       A       G       G       0.050942        31.441695       20
ref_contig      1381    C       T       C       C       0.139262        27.149469       20
ref_contig      1442    A       G       A       A       0.081306        10.207323       20
...
However, this methodology can be error prone and rely on an arbitrary choice of a cutoff. Another more robust way to do it is to use the results from Column 8. Those values are chi-square distributed with 1 d.f. Therefore the 0.05 threshold is at 3.84. We can print all sites which have an LRT value above a certain threshold using the parameter -minLRT.
./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides_VAR2 \
        -minLRT 5 -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \
        mpileup -gIDS -q 0 -Q 20 -C 50 \
        -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/null
chromo    position    major    minor    ref    anc    knownEM    pK-EM    nInd
ref_contig    191    A    G    A    A    0.032944    6.003593    20
ref_contig    192    T    C    T    T    0.025917    8.248013    20
ref_contig    321    G    A    G    G    0.081707    6.812460    19
ref_contig    329    G    A    G    G    0.041448    21.143322    20
ref_contig    387    T    C    C    C    0.177486    76.792329    20
ref_contig    549    G    T    G    G    0.024890    7.788239    20
ref_contig    620    A    G    A    A    0.104913    14.438307    18
ref_contig    643    G    A    G    G    0.047299    8.346708    18
ref_contig    681    T    C    T    T    0.209771    66.714527    18
...



Estimating the Site Frequency Spectrum (SFS)

For estimating the SFS we will need to generate a .sfs file which is a binary file that contains the posterior probabilities of sample allele frequencies for the sites where we have data, and where we have information about the ancestral state.
./angsd -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \
        -realSFS 1 \
        mpileup -gIDS -q 0 -Q 20 -C 50 \
        -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/null

We can obtain the global spectrum using optimSFS
./optimSFS -nThread 10 -binput lycaeides.sfs -nChr 40 -outnames lycaeides.sfs
This generates a file called lycaeides.sfs.ml, which contains our SFS estimate.
cat lycaeides.sfs.ml
#0.978381    0.008968    0.002196    0.001873    0.000761    0.001017    0.000828    0.000855    0.000000    0.000141    0.001058    0.000000    0.000367    0.000350    0.000010    0.000315    0.000000    0.000269    0.000303    0.000312    0.000000    0.000137    0.000091    0.000000    0.000000    0.000139    0.000191    0.000133    0.000000    0.000062    0.000000    0.000000    0.000023    0.000123    0.000000    0.000089    0.000000    0.000113    0.000000    0.000184    0.000711
But we have too few sites to get a good estimate so we'll use a pre-computed file of the entire genome. Notice that not all sites are represented, since we don't have reference/ancestral information for all sites within our region.

To view the SFS, and compare the estimates that we have from a short region or from the whole-genome, we can use R:
R
sfs=scan("lycaeides.sfs.ml");
sfs_full=scan("input/lycaeides.sfs.ml")
# we may want to 'mask' the frequency of non variable sites
sfs[1]=sfs_full[1]=NA;
barplot(rbind(sfs,sfs_full), beside=T, legend=c("estimated on a short region", "estimated on a genome-wide scale"), main="SFS")
 

Now we can finally compute the per site posterior probabilities of sample allele frequencies using the overall SFS a a prior:
./sfstools -nChr 40 -priorFile input/lycaeides.sfs.ml -sfsFile lycaeides.sfs -tajima lycaeides.sfs.ml.tajima -dumpBinary 0 > lycaeides.sfs.ml.post
This will output two files, one with posterior probabilities lycaeides.sfs.ml.post and another with Tajima values lycaeides.sfs.ml.tajima. Notably, we can use these information to implement a new way to call SNPs. For instance we can call a SNP if the probability of the site being at sample allele frequency zero is lower than a certain threshold (e.g. 0.05).

Comparing results


Here we've looked into 3 different ways of calling SNPs:
  • setting a threshold on the estimated MAF
  • using a Likelihood-Ratio Test (LRT)
  • using the posterior probability of sample allele frequencies

But how do they compare? Let's merge all the results into a single file:
awk 'BEGIN{print "post_probSFS"} {print 1-$1}' lycaeides.sfs.ml.post | paste lycaeides.mafs - > lycaeides.ALL
From this file we can see the number of sites that we'd call SNP's under each approach:
Method
  1. # SNPs
MAF > 0.01
2984
MAF > 0.025
2411
MAF > 0.05
1258
LRT > 3.84 (Chi2 p=0.05; df=1)
1947
LRT > 6.63 (Chi2 p=0.01; df=1)
1532
SFS_post > 0.95
888
SFS_post > 0.99
746

Checking some of the sites in detail:
...
ref_contig      190     T       A       T       T       0.000000        -0.000001       20      0.000273
ref_contig      191     A       G       A       A       0.032944        6.003593        20      0.167791
ref_contig      192     T       C       T       T       0.025917        8.248013        20      0.416617
ref_contig      193     A       C       A       A       0.000000        -0.000001       20      0.000294
...
ref_contig      328     C       A       C       C       0.000000        -0.000011       20      0.000709
ref_contig      329     G       A       G       G       0.041448        21.143322       20      0.996964
ref_contig      330     T       A       T       T       0.000000        -0.000014       20      0.000756
...
ref_contig      1661    A       C       A       A       0.000008        -0.000031       2       0.017503
ref_contig      1662    A       C       A       A       0.332954        9.540601        3       0.01851
ref_contig      1663    G       A       G       G       0.000008        -0.000062       4       0.019638
...


Calling Genotypes


To call genotypes we can use the doGeno argument. For that, we can use the previous analysis to get a list of putatively variable sites. We'll use the posterior probabilities of sample allele frequency criteria with a p-value of 0.05 as an example:
awk '$10 > 0.95 {print $1,$2-1,$2}' lycaeides.ALL > lycaeides_VAR.bed
And we feed this file into our genotype calling:
./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \
        -doMajorMinor 1 -doMaf 2 -doGeno 64 \
        mpileup -gIDS -q 0 -Q 20 -C 50 \
        -f input/referenceseq.fasta -b input/lycaeides.files_list -l lycaeides_VAR.bed > /dev/null
Which will give us a binary file with the posterior probabilities of each genotype for each individual and site. Binary is very convenient if you want to use it later on (small size, fast access) but it can be a problem to manually inspect. We can use the doGeno 5 argument to produce a text file instead.
./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \
        -doMajorMinor 1 -doMaf 2 -doGeno 5 \
        mpileup -gIDS -q 0 -Q 20 -C 50 \
        -f input/referenceseq.fasta -b input/lycaeides.files_list -l lycaeides_VAR.bed > /dev/null
Which will give us a file with the position (col 1-2), the major and minor alleles (col 3-4) and the called genotypes (number coded: 0=AA, 1=AC, 2=AG, 3=AT, 4=CC, 5=CG, 6=CT, 7=GG, 8=GT, 9=TT).
chromo  position        major   minor   MAXLRT
ref_contig      329     G       A       7       7       7       7       7       7       7       7       7       7       7       7       7       7       2       7       7       7       7       7
ref_contig      387     T       C       6       6       9       9       6       9       9       9       9       9       9       6       9       9       6       6       9       9       6       9
ref_contig      681     T       C       9       0       6       9       6       9       9       9       9       6       0       6       6       9       6       9       9       6       9       9
ref_contig      1370    G       A       7       7       7       7       7       2       7       7       7       7       7       7       7       7       7       2       7       7       2       7
ref_contig      1381    C       T       6       4       6       9       4       4       4       4       4       4       6       6       4       4       4       4       4       4       4       4
ref_contig      1500    G       T       7       7       7       7       7       8       7       7       7       7       7       7       7       7       7       7       7       7       7       7
ref_contig      1543    A       G       0       0       0       0       0       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0
ref_contig      1565    A       T       0       3       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
ref_contig      1747    A       C       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
ref_contig      2341    A       C       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
ref_contig      2412    C       A       4       4       4       4       4       4       4       4       0       4       0       4       4       0       4       4       4       4       4       4
ref_contig      2616    A       G       0       0       0       0       0       0       0       0       0       2       0       0       0       0       0       0       0       0       0       0
ref_contig      2666    C       T       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4
ref_contig      2715    T       A       0       0       9       9       0       0       0       9       0       0       0       0       9       9       0       0       9       9       9       0
ref_contig      3118    T       A       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9       9
ref_contig      3205    C       A       4       0       4       0       4       4       4       4       4       0       0       4       4       4       4       4       4       4       4       4
ref_contig      3239    A       C       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
ref_contig      3285    T       A       9       0       9       0       9       9       9       9       0       0       0       9       9       9       0       9       9       9       9       0
ref_contig      3326    A       C       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
ref_contig      3584    G       A       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7       7
...