Oct 29, 2012
Emilia Huerta, Filipe G. Vieira, Matteo Fumagalli
This wiki will guide you through some basic exercises for calling SNPs and genotypes. We will mainly use a program called 'ANGSD' (Analysis of Nexg Generation Sequencing Data) currently being developed in out lab. More information about usage, rational and some of the implemented methods can be found here and on ANGSD wiki here.
The data needed for this tutorial is in a single tar.gz file that you can download here.
bin/ precompiled binary files for 32 and 64 bit Linux and MacOSX.
input/ input files for the analysis: lycaeides.bed (list of sites to analyze) and lycaeides.files_list (list of BAM files to analyze)
Create a folder and unpack the workshop.tar.gz file into it:
mkdir workshop
cd workshop
wget http://palin.popgen.dk/workshop/workshop.tar.gz __OR__ click above link
tar -zxvf workshop.tar.gz
This creates the bin/ and input/ folders.
We will use 3 programs, 'ANGSD', 'optimSFS' and 'sfstools'.
We can use the precompiled binaries or we can build the programs ourselves.
Using precompiled binaries
Just use the binary that best matches your system and run:
If the precompiled binaries won't run you can compile the source code for youself.
Download the source code from here and untar the file:
tar -zxvf angsd0.204.tar.gz
And compile:
cd angsd0.204/samtools-0.1.17
make clean; make
cd ../misc
make clean; make
cd ..
make clean; make
cd ..
ln -fs angsd0.204/angsd.g++ angsd
ln -fs angsd0.204/misc/optimSFS.g++ optimSFS
ln -fs angsd0.204/misc/sfstools.g++ sfstools
Running Examples
We will now walk through an example. There are 20 BAM files in the input/bam subdirectory. The file called lycaeides.files_list is simply a file containing the relative filepaths to the actual BAM files.
head -n 3 input/lycaeides.files_list
# ./input/bam/vic_10_05f.bam
# ./input/bam/sin_10_19f.bam
# ./input/bam/lan_10_18m.bam
The 20 BAM files are a small sample subset and we'll analyze around 100kb. More information about the original aligned files can be obtained on the original paper http://www.ncbi.nlm.nih.gov/pubmed/22759293. In this paper authors analyzed a butterfly Next-Generation Sequencing dataset sequenced using Illumina GAII technology. The dataset consists of a total of 381 samples of Lycaeides idas, Lycaeides melissa and an hybrid collection (Jackson Hole). DNA resequencing was conducted on custom reduced genomic complexity libraries. We will analyzed only a subset of 20 samples.
Getting the MAF (Minor Allele Frequency) and calling SNPs
-anc Ancestral sequence (FASTA), for this example we will set the ancestral as reference because we don't have this information
-outfiles PREFIX for our output files
-doGlf Output Genotype Likelihoods (not used in this workshop)
-doMajorMinor Determine Major/Minor alleles
-doMaf Estimate the MAF using an EM algorithm
-doSNP Calculate the LRT for the site being variable
The above command generates some files, all prefixed with lycaeides. One of the most useful is lycaeides.mafs, which contains some basic infomation. You can view these files using 'less' and exit with 'q'.
less -S lycaeides.mafs
...
ref_contig 183 C A C C 0.000000 -0.000001 20
ref_contig 184 T A T T 0.000000 -0.000001 20
ref_contig 185 G A G G 0.000001 -0.000083 20
ref_contig 186 C T C C 0.000000 -0.000002 20
ref_contig 187 T A T T 0.000000 -0.000001 20
ref_contig 188 T A T T 0.000000 -0.000001 20
ref_contig 189 G A G G 0.000012 -0.000349 20
ref_contig 190 T A T T 0.000000 -0.000001 20
ref_contig 191 A G A A 0.033086 6.012643 20
ref_contig 192 T C T T 0.026010 8.255082 20
ref_contig 193 A C A A 0.000000 -0.000001 20
ref_contig 194 C A C C 0.000000 -0.000001 20
ref_contig 195 T A T T 0.000000 -0.000001 20
ref_contig 196 A C A A 0.000000 -0.000001 20
ref_contig 197 C A C C 0.000000 -0.000001 20
ref_contig 198 A C A A 0.000000 -0.000001 20
ref_contig 199 G A G G 0.016429 0.152365 20
ref_contig 200 T A T T 0.000000 -0.000002 20
ref_contig 201 A C A A 0.000000 -0.000001 20
ref_contig 202 A C A A 0.000000 -0.000001 20
...
Column1-2 is the position, column3 and 4 is major and minor. Column 5-6 is the reference and the ancestral (the reference is not used for any caluculations in this workshop). Column7 is an estimate of the MAF, this estimate is based on the likelihoods and using an EM-algorithm. Column 8 is a 2*loglike test of the site being variable. And the last column is the number of individuals for which we have data at the site.
With this output we can already perform a crude SNP calling, based on the MAF. In fact, ANGSD can easily provide with the -minMaf argument.
Which gives us the following output in lycaeides_VAR.mafs:
chromo position major minor ref anc knownEM pK-EM nInd
ref_contig 321 G A G G 0.084433 6.943010 19
ref_contig 387 T C C C 0.152956 70.097210 20
ref_contig 620 A G A A 0.107351 14.582970 18
ref_contig 681 T C T T 0.245610 78.355335 18
ref_contig 1114 A G A A 0.118452 3.339191 7
ref_contig 1370 G A G G 0.050942 31.441695 20
ref_contig 1381 C T C C 0.139262 27.149469 20
ref_contig 1442 A G A A 0.081306 10.207323 20
...
However, this methodology can be error prone and rely on an arbitrary choice of a cutoff. Another more robust way to do it is to use the results from Column 8. Those values are chi-square distributed with 1 d.f. Therefore the 0.05 threshold is at 3.84. We can print all sites which have an LRT value above a certain threshold using the parameter -minLRT.
chromo position major minor ref anc knownEM pK-EM nInd
ref_contig 191 A G A A 0.032944 6.003593 20
ref_contig 192 T C T T 0.025917 8.248013 20
ref_contig 321 G A G G 0.081707 6.812460 19
ref_contig 329 G A G G 0.041448 21.143322 20
ref_contig 387 T C C C 0.177486 76.792329 20
ref_contig 549 G T G G 0.024890 7.788239 20
ref_contig 620 A G A A 0.104913 14.438307 18
ref_contig 643 G A G G 0.047299 8.346708 18
ref_contig 681 T C T T 0.209771 66.714527 18
...
Estimating the Site Frequency Spectrum (SFS)
For estimating the SFS we will need to generate a .sfs file which is a binary file that contains the posterior probabilities of sample allele frequencies for the sites where we have data, and where we have information about the ancestral state.
But we have too few sites to get a good estimate so we'll use a pre-computed file of the entire genome. Notice that not all sites are represented, since we don't have reference/ancestral information for all sites within our region.
To view the SFS, and compare the estimates that we have from a short region or from the whole-genome, we can use R:
R
sfs=scan("lycaeides.sfs.ml");
sfs_full=scan("input/lycaeides.sfs.ml")
# we may want to 'mask' the frequency of non variable sites
sfs[1]=sfs_full[1]=NA;
barplot(rbind(sfs,sfs_full), beside=T, legend=c("estimated on a short region", "estimated on a genome-wide scale"), main="SFS")
Now we can finally compute the per site posterior probabilities of sample allele frequencies using the overall SFS a a prior:
This will output two files, one with posterior probabilities lycaeides.sfs.ml.post and another with Tajima values lycaeides.sfs.ml.tajima. Notably, we can use these information to implement a new way to call SNPs. For instance we can call a SNP if the probability of the site being at sample allele frequency zero is lower than a certain threshold (e.g. 0.05).
Comparing results
Here we've looked into 3 different ways of calling SNPs:
setting a threshold on the estimated MAF
using a Likelihood-Ratio Test (LRT)
using the posterior probability of sample allele frequencies
But how do they compare? Let's merge all the results into a single file:
From this file we can see the number of sites that we'd call SNP's under each approach:
Method
# SNPs
MAF > 0.01
2984
MAF > 0.025
2411
MAF > 0.05
1258
LRT > 3.84 (Chi2 p=0.05; df=1)
1947
LRT > 6.63 (Chi2 p=0.01; df=1)
1532
SFS_post > 0.95
888
SFS_post > 0.99
746
Checking some of the sites in detail:
...
ref_contig 190 T A T T 0.000000 -0.000001 20 0.000273
ref_contig 191 A G A A 0.032944 6.003593 20 0.167791
ref_contig 192 T C T T 0.025917 8.248013 20 0.416617
ref_contig 193 A C A A 0.000000 -0.000001 20 0.000294
...
ref_contig 328 C A C C 0.000000 -0.000011 20 0.000709
ref_contig 329 G A G G 0.041448 21.143322 20 0.996964
ref_contig 330 T A T T 0.000000 -0.000014 20 0.000756
...
ref_contig 1661 A C A A 0.000008 -0.000031 2 0.017503
ref_contig 1662 A C A A 0.332954 9.540601 3 0.01851
ref_contig 1663 G A G G 0.000008 -0.000062 4 0.019638
...
Calling Genotypes
To call genotypes we can use the doGeno argument. For that, we can use the previous analysis to get a list of putatively variable sites. We'll use the posterior probabilities of sample allele frequency criteria with a p-value of 0.05 as an example:
Which will give us a binary file with the posterior probabilities of each genotype for each individual and site. Binary is very convenient if you want to use it later on (small size, fast access) but it can be a problem to manually inspect. We can use the doGeno 5 argument to produce a text file instead.
Which will give us a file with the position (col 1-2), the major and minor alleles (col 3-4) and the called genotypes (number coded: 0=AA, 1=AC, 2=AG, 3=AT, 4=CC, 5=CG, 6=CT, 7=GG, 8=GT, 9=TT).
Genotyping & SNP Analysis Workshop
Oct 29, 2012Emilia Huerta, Filipe G. Vieira, Matteo Fumagalli
This wiki will guide you through some basic exercises for calling SNPs and genotypes. We will mainly use a program called 'ANGSD' (Analysis of Nexg Generation Sequencing Data) currently being developed in out lab. More information about usage, rational and some of the implemented methods can be found here and on ANGSD wiki here.
The data needed for this tutorial is in a single tar.gz file that you can download here.
In case you have some problem running the example, the outputs can be downloaded here.
Table of Contents
Installing
Create a folder and unpack the workshop.tar.gz file into it:This creates the bin/ and input/ folders.
We will use 3 programs, 'ANGSD', 'optimSFS' and 'sfstools'.
We can use the precompiled binaries or we can build the programs ourselves.
Using precompiled binaries
Just use the binary that best matches your system and run:- For i386 MacOSX:
Test the binaries by running:Building source files
If the precompiled binaries won't run you can compile the source code for youself.Download the source code from here and untar the file:
And compile:
Running Examples
We will now walk through an example. There are 20 BAM files in the input/bam subdirectory. The file called lycaeides.files_list is simply a file containing the relative filepaths to the actual BAM files.The 20 BAM files are a small sample subset and we'll analyze around 100kb. More information about the original aligned files can be obtained on the original paper http://www.ncbi.nlm.nih.gov/pubmed/22759293. In this paper authors analyzed a butterfly Next-Generation Sequencing dataset sequenced using Illumina GAII technology. The dataset consists of a total of 381 samples of Lycaeides idas, Lycaeides melissa and an hybrid collection (Jackson Hole). DNA resequencing was conducted on custom reduced genomic complexity libraries. We will analyzed only a subset of 20 samples.
Getting the MAF (Minor Allele Frequency) and calling SNPs
We will use the program ANGSD to analyze our data./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \ -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \ mpileup -gIDS -q 0 -Q 20 -C 50 -f input/referenceseq.fasta \ -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/nullwhere:The above command generates some files, all prefixed with lycaeides. One of the most useful is lycaeides.mafs, which contains some basic infomation. You can view these files using 'less' and exit with 'q'.
Column1-2 is the position, column3 and 4 is major and minor. Column 5-6 is the reference and the ancestral (the reference is not used for any caluculations in this workshop). Column7 is an estimate of the MAF, this estimate is based on the likelihoods and using an EM-algorithm. Column 8 is a 2*loglike test of the site being variable. And the last column is the number of individuals for which we have data at the site.
With this output we can already perform a crude SNP calling, based on the MAF. In fact, ANGSD can easily provide with the -minMaf argument.
./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides_VAR -minMaf 0.05 \ -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/nullWhich gives us the following output in lycaeides_VAR.mafs:However, this methodology can be error prone and rely on an arbitrary choice of a cutoff. Another more robust way to do it is to use the results from Column 8. Those values are chi-square distributed with 1 d.f. Therefore the 0.05 threshold is at 3.84. We can print all sites which have an LRT value above a certain threshold using the parameter -minLRT.
./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides_VAR2 \ -minLRT 5 -doGlf 3 -doMajorMinor 1 -doMaf 2 -doSNP 1 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/nullEstimating the Site Frequency Spectrum (SFS)
For estimating the SFS we will need to generate a .sfs file which is a binary file that contains the posterior probabilities of sample allele frequencies for the sites where we have data, and where we have information about the ancestral state../angsd -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \ -realSFS 1 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l input/lycaeides.bed > /dev/nullWe can obtain the global spectrum using optimSFS
This generates a file called lycaeides.sfs.ml, which contains our SFS estimate.
But we have too few sites to get a good estimate so we'll use a pre-computed file of the entire genome. Notice that not all sites are represented, since we don't have reference/ancestral information for all sites within our region.
To view the SFS, and compare the estimates that we have from a short region or from the whole-genome, we can use R:
R sfs=scan("lycaeides.sfs.ml"); sfs_full=scan("input/lycaeides.sfs.ml") # we may want to 'mask' the frequency of non variable sites sfs[1]=sfs_full[1]=NA; barplot(rbind(sfs,sfs_full), beside=T, legend=c("estimated on a short region", "estimated on a genome-wide scale"), main="SFS")Now we can finally compute the per site posterior probabilities of sample allele frequencies using the overall SFS a a prior:
This will output two files, one with posterior probabilities lycaeides.sfs.ml.post and another with Tajima values lycaeides.sfs.ml.tajima. Notably, we can use these information to implement a new way to call SNPs. For instance we can call a SNP if the probability of the site being at sample allele frequency zero is lower than a certain threshold (e.g. 0.05).
Comparing results
Here we've looked into 3 different ways of calling SNPs:
But how do they compare? Let's merge all the results into a single file:
awk 'BEGIN{print "post_probSFS"} {print 1-$1}' lycaeides.sfs.ml.post | paste lycaeides.mafs - > lycaeides.ALLFrom this file we can see the number of sites that we'd call SNP's under each approach:Checking some of the sites in detail:
Calling Genotypes
To call genotypes we can use the doGeno argument. For that, we can use the previous analysis to get a list of putatively variable sites. We'll use the posterior probabilities of sample allele frequency criteria with a p-value of 0.05 as an example:
awk '$10 > 0.95 {print $1,$2-1,$2}' lycaeides.ALL > lycaeides_VAR.bedAnd we feed this file into our genotype calling:./angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \ -doMajorMinor 1 -doMaf 2 -doGeno 64 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l lycaeides_VAR.bed > /dev/nullWhich will give us a binary file with the posterior probabilities of each genotype for each individual and site. Binary is very convenient if you want to use it later on (small size, fast access) but it can be a problem to manually inspect. We can use the doGeno 5 argument to produce a text file instead../angsd -nThreads 10 -ref input/referenceseq.fasta -anc input/referenceseq.fasta -outfiles lycaeides \ -doMajorMinor 1 -doMaf 2 -doGeno 5 \ mpileup -gIDS -q 0 -Q 20 -C 50 \ -f input/referenceseq.fasta -b input/lycaeides.files_list -l lycaeides_VAR.bed > /dev/nullWhich will give us a file with the position (col 1-2), the major and minor alleles (col 3-4) and the called genotypes (number coded: 0=AA, 1=AC, 2=AG, 3=AT, 4=CC, 5=CG, 6=CT, 7=GG, 8=GT, 9=TT).