This workshop will cover key steps in analyzing 16S bacterial amplicon sequencing data. We will learn how to QC raw data, then use Mothur software to cluster OTUs and make taxonomic assignments.This tutorial reflects the way that I usually do my analyses.
Before the workshop, you need to:
1.Download and unzip Mothur software (v1.39.5) OR according to your OS. https://github.com/mothur/mothur/releases/tag/v1.35.1 2. Download and install “FastQC” – make sure that you have Java installed too. If you have (Java) trouble, consider this as an optional installation. http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
3. Download the “Amplicon_Workshop_CGRL” folder – it contains two subfolders. One subfolder is called “Workshop_files” and includes the fastq (sequencing) files, the stability.files (mapping file), and reference files (SILVA). These reference files can be downloaded here for your future analyses http://www.mothur.org/wiki/Silva_reference_files. The second subfolder contains the output files of mothur from the analysis that we are going to run. I have already run it and I give you the files in case you get stuck somewhere in the tutorial today, so that we can all proceed with the analysis.
4. Download and install: R: http://cran.rstudio.com
RStudio: http://www.rstudio.com/products/rstudio/download/ Optional
5. Install the following packages in R: “calibrate”, “shape”, “direct labels”, “knitr”, “clustsig”, “ellipse, “plyr”, “vegan”, “ggplot2”, and “phyloseq”. Except for “phyloseq”, all packages can simply be installed by running: install.packages(“package_name”). Please follow the directions to successfully install “phyloseq”: http://joey711.github.io/phyloseq/install
It is imperative that you download the workshop material BEFORE the workshop. Working in mothur
In Windows, you can double-click the executable mothur icon. In Mac, you can either right click and hit open (not recommended) or open it from the terminal ($ ./mothur). The easiest way to run this tutorial will be to transfer all files found in “Workshop_files” folder into the unzipped mothur folder that you downloaded in step 1.
When using, please cite: Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.
In the first command, we will extract the information from the forward (R1) and reverse (R2) fastq files.
mothur > fastq.info(fastq=soil.R1.fastq)
This command will extract the sequence and quality score data from your fastq files.
Output File Names:
soil.R1.fasta
soil.R1.qual
[WARNING]: your sequence names contained ':'. I changed them to '_' to avoid problems in your downstream analysis.
In every step of the analysis, we can see what the sequences look like with the “summary.seqs” command.
Because the table does not look good, please check your workshop material. I will paste it again only in couple of steps below.
Repeat for R2 on your own.
Now, we will pair the forward and reverse reads with the “make.contigs” command, which requires the “stability.files” as input. You should format the stability.files based on the form that you get your data back from your sequencing facility. In this tutorial, we only need a three-column format, where the first column corresponds to the sample, and the second and third columns to the R1 and R2 fastq files of this sample, respectively.
There are some troublesome observations, such as the 500 bp long sequences, the 12 and 92 ambiguous bases in some of them etc. The next command enables us to keep sequences that fulfill certain user-defined criteria.
mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.unique.fasta, name=soil_stability.trim.contigs.good.names, processors=4)
.......
# of unique seqs: 73327
total # of seqs: 283406
Now, let’s create a count table that makes everything more affordable computationally.
mothur > count.seqs(name=soil_stability.trim.contigs.good.names, group=soil_stability.contigs.good.groups)
Using 4 processors.
It took 1 secs to create a table for 283406 sequences.
Total number of sequences: 283406
Output File Names:
soil_stability.trim.contigs.good.count_table
We will need to align our sequences to a reference database (fasta & taxonomy). This is how mothur works! I downloaded silva_v119 (//http://www.mothur.org/wiki/Silva_reference_files//) and trimmed it to the region of interest (V4) of the 16S rRNA gene with the “pcr.seqs” command. This is not necessary, but it will make your computer’s life easier. Let's see what I have there.
We will re-run screen.seqs to make sure that everything overlaps the same region. We will get sequences to start at or before position 7436 and let mothur optimize the end (or optimize it at 17018). We will also set the maximum homopolymer length to 8 (this could have been done in the first execution of screen.seqs above).
Now that we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we will filter the sequences to remove the overhangs at both ends.
If you did something wrong in the “screen.seqs” you will end up with nothing here. If you get the summary, you will see that all is well. Now, we will remove any redundancy that we created by running “unique.seqs” again.
In the summary, we see that we have 72,621 unique sequences out of a total of 282,291 sequences, which is about 26% of the total. This is a very high percentage and we we will try to reduce it as much as possible by running a couple of commands that will "remove" the sequencing error. We do not get rid of data, we just group them in meaningful ways and save our computer from crashing.
The first command is an algorith developed by Sue Huse (doi: 10.1111/j.1462-2920.2010.02193.x.). By default, mothur runs the pre.cluster command by looking for sequences that are within one (diffs=1) mismatch of the sequence being considered. You can change this threshold with the diffs option.
In the summary, we see that we have 37,219 unique sequences out of a total of 282,291 sequences, which is about a 12.5% reduction of the dataset. And now, let’s identify any chimeric sequences and remove them.
We now have 18,179 unique sequences out of a total of 246,946, which is about 7%. It is time we classified our sequences. I like doing this so as to remove unspecific sequences within mothur. Others do this further down in R. I do not see the reason of not further reducing my dataset as soon as possible, but this is my personal preference.
[WARNING]: M02248_9_000000000-AB7RF_1_1105_10669_16474 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
The primers we use are only supposed to amplify members of the Bacteria and if they are hitting Eukaryota or Archaea, it is a mistake. Also, chloroplasts and mitochondria have no functional role in a microbial community. There's also just the random stuff that we want to get rid of (“unknown” -> no domain). Now, we will remove these sequences.
mothur > summary.seqs(fasta=current, count=current, processors=4)
............
# of unique seqs: 17967
total # of seqs: 244310
Our unique sequences are 7% of our dataset. I think this is still a high percentage of unique sequences and if you run the following command, you will be able to spit the abundance into abundant sequences and rare sequences. I will use a cutoff of one read to define a rare sequence. Then I will summarize the abundant fasta.
mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.abund.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.abund.count_table)
...........
# of unique seqs: 9265
total # of seqs: 235608
Interestingly, almost half of our unique sequences were singleton reads, and while this produced a great reduction in our unique sequences (which will make things easier for clustering and probably more meaningful ecologically), the total reduction of the whole dataset was only 3.5%. This is your call to make in your analysis.
Let’s continue with traditional analysis and proceed to clustering the sequences into OTUs. There are two ways to do this in mothur: 1) the traditional way that we will do consists of calculating uncorrected pairwise distances between aligned DNA sequences and the clustering into OTUs based on your favorite clustering method implemented in mothur (furthest, average, and nearest neighbor), 2) the faster and memory saving cluster.split way, where taxonomic information is used to split the sequences into bins and then cluster within each bin.
Mothur provides several other options for β-diversity analyses (e.g., calculation of ordination matrices, venn diagrams, heatmaps etc), as well as other ecological approaches such as the Indicator Species or the LEfSE (Linear discriminant analysis Effect Size) analyses. Plotting and other statistics can be done in R.
Particularly useful commands might be the “get.oturep”, which generates a fasta-formatted sequence file containing only a representative sequence for each OTU, and the “bin.seqs”, which can provide the full list of sequences for each OTU and a fasta file, if you provide a fasta file in the optional parameters.
Open R – Basic R Package - This is not a R workshop, so we will explore basic ways of plotting data from mothur with R. Thing can get more sophisticated with packages such as vegan, phyloseq etc. We will: 1) Plot a rarefaction curve 2) Plot a nMDS ordination plot using a mothur-outputted matrix.
Use the “phyloseq: package, make a “phyloseq object” and graph relative abundances bars.
library("phyloseq)
library("ggplot2")
library("plyr")
library("vegan")
sharedFile = read.table("soil_final.0.03.subsample.shared")
sharedFile = t(sharedFile)
rownames(sharedFile) = sharedFile[,1]
colnames(sharedFile) = sharedFile[2,]
sharedFile = sharedFile[,2:5]
sharedFile = sharedFile[4:5241,]
class(sharedFile) <- "numeric"
taxFile = read.table('soil_final.taxonomy.txt', header=T, sep='\t')
rownames(taxFile) = taxFile[,1]
taxFile = taxFile[,3:8]
taxFile = as.matrix(taxFile)
metaFile = read.table('soil_stability.meta.txt', header=T, sep='\t')
rownames(metaFile) = metaFile[,1]
metaFile = metaFile[,2:3]
OTU = otu_table(sharedFile, taxa_are_rows = TRUE)
TAX = tax_table(taxFile)
META = sample_data(metaFile)
Soil_physeq = phyloseq(OTU, TAX, META)
Soil_physeq <- prune_taxa(taxa_sums(Soil_physeq) > 0, Soil_physeq)
sum(taxa_sums(Soil_physeq) == 0) #check if you have OTUs with "0" abundance
#Transform to proportions
Soil_physeqP = transform_sample_counts(Soil_physeq, function(x) 100 * x/sum(x))
#Plot the top20 OTUs
top20otus = names(sort(taxa_sums(Soil_physeqP), TRUE)[1:20])
taxtab20 = cbind(tax_table(Soil_physeqP), genus20 = NA)
taxtab20[top20otus, "genus20"] <- as(tax_table(Soil_physeqP)[top20otus, "Genus"], "character")
tax_table(Soil_physeqP) <- tax_table(taxtab20)
Soil_physeq_P20 = prune_taxa(top20otus, Soil_physeqP)
title = "Relative Abundance of Top20 OTUs"
p1 = plot_bar(Soil_physeq_P20, fill = "genus20", title = title) + labs(colour = "genus")
# plot with panelling by Method for an easier method-wise comparison
p1 + facet_wrap(~Method)
INSTRUCTOR: Despoina Lymperopoulou
WORKSHOP INFORMATION AND REQUIREMENTS
This workshop will cover key steps in analyzing 16S bacterial amplicon sequencing data. We will learn how to QC raw data, then use Mothur software to cluster OTUs and make taxonomic assignments.This tutorial reflects the way that I usually do my analyses.Before the workshop, you need to:
1.Download and unzip Mothur software (v1.39.5) OR according to your OS.
https://github.com/mothur/mothur/releases/tag/v1.35.1
2. Download and install “FastQC” – make sure that you have Java installed too. If you have (Java) trouble, consider this as an optional installation. http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
3. Download the “Amplicon_Workshop_CGRL” folder – it contains two subfolders. One subfolder is called “Workshop_files” and includes the fastq (sequencing) files, the stability.files (mapping file), and reference files (SILVA). These reference files can be downloaded here for your future analyses http://www.mothur.org/wiki/Silva_reference_files. The second subfolder contains the output files of mothur from the analysis that we are going to run. I have already run it and I give you the files in case you get stuck somewhere in the tutorial today, so that we can all proceed with the analysis.
4. Download and install:
R: http://cran.rstudio.com
RStudio: http://www.rstudio.com/products/rstudio/download/
Optional
5. Install the following packages in R: “calibrate”, “shape”, “direct labels”, “knitr”, “clustsig”, “ellipse, “plyr”, “vegan”, “ggplot2”, and “phyloseq”. Except for “phyloseq”, all packages can simply be installed by running: install.packages(“package_name”). Please follow the directions to successfully install “phyloseq”: http://joey711.github.io/phyloseq/install
It is imperative that you download the workshop material BEFORE the workshop.
Working in mothur
In Windows, you can double-click the executable mothur icon. In Mac, you can either right click and hit open (not recommended) or open it from the terminal ($ ./mothur). The easiest way to run this tutorial will be to transfer all files found in “Workshop_files” folder into the unzipped mothur folder that you downloaded in step 1.
When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.
In the first command, we will extract the information from the forward (R1) and reverse (R2) fastq files.
This command will extract the sequence and quality score data from your fastq files.
In every step of the analysis, we can see what the sequences look like with the “summary.seqs” command.
mothur > summary.seqs(fasta=soil.R1.fasta, processors=4) Using 4 processors. Start End NBases Ambigs Polymer NumSeqs Minimum: 1 250 250 0 3 1 2.5%-tile: 1 250 250 0 4 9374 25%-tile: 1 250 250 0 4 93740 Median: 1 250 250 0 5 187480 75%-tile: 1 250 250 0 5 281219 97.5%-tile: 1 250 250 0 8 365585 Maximum: 1 250 250 0 59 374958 Mean: 1 250 250 0 4.72324 # of Seqs: 374958 Output File Names: soil.R1.summary It took 4 secs to summarize 374958 sequences.Because the table does not look good, please check your workshop material. I will paste it again only in couple of steps below.Repeat for R2 on your own.
Now, we will pair the forward and reverse reads with the “make.contigs” command, which requires the “stability.files” as input. You should format the stability.files based on the form that you get your data back from your sequencing facility. In this tutorial, we only need a three-column format, where the first column corresponds to the sample, and the second and third columns to the R1 and R2 fastq files of this sample, respectively.
Let's see what happened:
mothur > summary.seqs(fasta=soil_stability.trim.contigs.fasta, processors=4) Start End NBases Ambigs Polymer NumSeqs Minimum: 1 247 247 0 3 1 2.5%-tile: 1 253 253 0 4 9374 25%-tile: 1 253 253 0 4 93740 Median: 1 253 253 0 4 187480 75%-tile: 1 253 253 0 5 281219 97.5%-tile: 1 254 254 12 6 365585 Maximum: 1 500 500 92 250 374958 Mean: 1 253.178 253.178 1.16067 4.55027 # of Seqs: 374958There are some troublesome observations, such as the 500 bp long sequences, the 12 and 92 ambiguous bases in some of them etc. The next command enables us to keep sequences that fulfill certain user-defined criteria.mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.fasta, processors=4) Using 4 processors. Start End NBases Ambigs Polymer NumSeqs Minimum: 1 250 250 0 3 1 2.5%-tile: 1 253 253 0 4 7086 25%-tile: 1 253 253 0 4 70852 Median: 1 253 253 0 5 141704 75%-tile: 1 253 253 0 5 212555 97.5%-tile: 1 254 254 0 6 276321 Maximum: 1 280 280 0 11 283406 Mean: 1 253.12 253.12 0 4.55898 # of Seqs: 283406We anticipate that many of our sequences are duplicates of each other. Let’s de-replicate our dataset.mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.unique.fasta, name=soil_stability.trim.contigs.good.names, processors=4) ....... # of unique seqs: 73327 total # of seqs: 283406Now, let’s create a count table that makes everything more affordable computationally.We will need to align our sequences to a reference database (fasta & taxonomy). This is how mothur works! I downloaded silva_v119 (//http://www.mothur.org/wiki/Silva_reference_files//) and trimmed it to the region of interest (V4) of the 16S rRNA gene with the “pcr.seqs” command. This is not necessary, but it will make your computer’s life easier. Let's see what I have there.
mothur > summary.seqs(fasta=silva.v4.fasta, processors=4) Using 4 processors. Start End NBases Ambigs Polymer NumSeqs Minimum: 1 20730 501 0 4 1 2.5%-tile: 2 21228 522 0 4 374 25%-tile: 2 21228 528 0 4 3740 Median: 2 21228 545 0 5 7479 75%-tile: 2 21228 549 0 5 11218 97.5%-tile: 2 21228 551 2 6 14583 Maximum: 10 21228 587 5 9 14956 Mean: 2.00053 21228 539.276 0.13192 4.97727 # of Seqs: 14956We can align our own sequences now. After the alignment, the previously ".fasta" file is now called ".align".
We will re-run screen.seqs to make sure that everything overlaps the same region. We will get sequences to start at or before position 7436 and let mothur optimize the end (or optimize it at 17018). We will also set the maximum homopolymer length to 8 (this could have been done in the first execution of screen.seqs above).
Now that we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we will filter the sequences to remove the overhangs at both ends.
If you did something wrong in the “screen.seqs” you will end up with nothing here. If you get the summary, you will see that all is well. Now, we will remove any redundancy that we created by running “unique.seqs” again.
In the summary, we see that we have 72,621 unique sequences out of a total of 282,291 sequences, which is about 26% of the total. This is a very high percentage and we we will try to reduce it as much as possible by running a couple of commands that will "remove" the sequencing error. We do not get rid of data, we just group them in meaningful ways and save our computer from crashing.
The first command is an algorith developed by Sue Huse (doi: 10.1111/j.1462-2920.2010.02193.x.). By default, mothur runs the pre.cluster command by looking for sequences that are within one (diffs=1) mismatch of the sequence being considered. You can change this threshold with the diffs option.
In the summary, we see that we have 37,219 unique sequences out of a total of 282,291 sequences, which is about a 12.5% reduction of the dataset. And now, let’s identify any chimeric sequences and remove them.
We now have 18,179 unique sequences out of a total of 246,946, which is about 7%. It is time we classified our sequences. I like doing this so as to remove unspecific sequences within mothur. Others do this further down in R. I do not see the reason of not further reducing my dataset as soon as possible, but this is my personal preference.
We got some warnings such as:
The primers we use are only supposed to amplify members of the Bacteria and if they are hitting Eukaryota or Archaea, it is a mistake. Also, chloroplasts and mitochondria have no functional role in a microbial community. There's also just the random stuff that we want to get rid of (“unknown” -> no domain). Now, we will remove these sequences.
Our unique sequences are 7% of our dataset. I think this is still a high percentage of unique sequences and if you run the following command, you will be able to spit the abundance into abundant sequences and rare sequences. I will use a cutoff of one read to define a rare sequence. Then I will summarize the abundant fasta.
Interestingly, almost half of our unique sequences were singleton reads, and while this produced a great reduction in our unique sequences (which will make things easier for clustering and probably more meaningful ecologically), the total reduction of the whole dataset was only 3.5%. This is your call to make in your analysis.
Let’s continue with traditional analysis and proceed to clustering the sequences into OTUs. There are two ways to do this in mothur: 1) the traditional way that we will do consists of calculating uncorrected pairwise distances between aligned DNA sequences and the clustering into OTUs based on your favorite clustering method implemented in mothur (furthest, average, and nearest neighbor), 2) the faster and memory saving cluster.split way, where taxonomic information is used to split the sequences into bins and then cluster within each bin.
Now, we will create a table with the number of sequences per OTU and per sample.
And we will need to classify these OTUs
Let's give some easier names to our final files. Use "copy" instead of "cp" if you work in Windows OS.
We need to figure out how many sequences we have per sample…
…and then normalize the sampling effort by sub-sampling with the lowest number of sequences observed.
We can evaluate the sequencing depth in our samples in mothur and plot it in R, as shown below.
Indices are important, not only to calculate and plot them, but also to understand the information they provide and the limitations they might have.
Mothur provides several other options for β-diversity analyses (e.g., calculation of ordination matrices, venn diagrams, heatmaps etc), as well as other ecological approaches such as the Indicator Species or the LEfSE (Linear discriminant analysis Effect Size) analyses. Plotting and other statistics can be done in R.
Particularly useful commands might be the “get.oturep”, which generates a fasta-formatted sequence file containing only a representative sequence for each OTU, and the “bin.seqs”, which can provide the full list of sequences for each OTU and a fasta file, if you provide a fasta file in the optional parameters.
Open R – Basic R Package - This is not a R workshop, so we will explore basic ways of plotting data from mothur with R. Thing can get more sophisticated with packages such as vegan, phyloseq etc. We will:
1) Plot a rarefaction curve
2) Plot a nMDS ordination plot using a mothur-outputted matrix.
Use the “phyloseq: package, make a “phyloseq object” and graph relative abundances bars. library("phyloseq) library("ggplot2") library("plyr") library("vegan") sharedFile = read.table("soil_final.0.03.subsample.shared") sharedFile = t(sharedFile) rownames(sharedFile) = sharedFile[,1] colnames(sharedFile) = sharedFile[2,] sharedFile = sharedFile[,2:5] sharedFile = sharedFile[4:5241,] class(sharedFile) <- "numeric" taxFile = read.table('soil_final.taxonomy.txt', header=T, sep='\t') rownames(taxFile) = taxFile[,1] taxFile = taxFile[,3:8] taxFile = as.matrix(taxFile) metaFile = read.table('soil_stability.meta.txt', header=T, sep='\t') rownames(metaFile) = metaFile[,1] metaFile = metaFile[,2:3] OTU = otu_table(sharedFile, taxa_are_rows = TRUE) TAX = tax_table(taxFile) META = sample_data(metaFile) Soil_physeq = phyloseq(OTU, TAX, META) Soil_physeq <- prune_taxa(taxa_sums(Soil_physeq) > 0, Soil_physeq) sum(taxa_sums(Soil_physeq) == 0) #check if you have OTUs with "0" abundance #Transform to proportions Soil_physeqP = transform_sample_counts(Soil_physeq, function(x) 100 * x/sum(x)) #Plot the top20 OTUs top20otus = names(sort(taxa_sums(Soil_physeqP), TRUE)[1:20]) taxtab20 = cbind(tax_table(Soil_physeqP), genus20 = NA) taxtab20[top20otus, "genus20"] <- as(tax_table(Soil_physeqP)[top20otus, "Genus"], "character") tax_table(Soil_physeqP) <- tax_table(taxtab20) Soil_physeq_P20 = prune_taxa(top20otus, Soil_physeqP) title = "Relative Abundance of Top20 OTUs" p1 = plot_bar(Soil_physeq_P20, fill = "genus20", title = title) + labs(colour = "genus") # plot with panelling by Method for an easier method-wise comparison p1 + facet_wrap(~Method)WORKSHOP TUTORIAL