INSTRUCTOR: Despoina Lymperopoulou


This workshop will cover key steps in analyzing 16S bacterial amplicon sequencing data. We will learn how to QC raw data, then use Mothur software to cluster OTUs and make taxonomic assignments.This tutorial reflects the way that I usually do my analyses.

Before the workshop, you need to:
1.Download and unzip Mothur software (v1.39.5) OR according to your OS.
2. Download and install “FastQC” – make sure that you have Java installed too. If you have (Java) trouble, consider this as an optional installation.
3. Download the “Amplicon_Workshop_CGRL” folder – it contains two subfolders. One subfolder is called “Workshop_files” and includes the fastq (sequencing) files, the stability.files (mapping file), and reference files (SILVA). These reference files can be downloaded here for your future analyses The second subfolder contains the output files of mothur from the analysis that we are going to run. I have already run it and I give you the files in case you get stuck somewhere in the tutorial today, so that we can all proce
ed with the analysis.
4. Download and install:


5. Install the following packages in R: “calibrate”, “shape”, “direct labels”, “knitr”, “clustsig”, “ellipse, “plyr”, “vegan”, “ggplot2”, and “phyloseq”. Except for “phyloseq”, all packages can simply be installed by running: install.packages(“package_name”). Please follow the directions to successfully install “phyloseq”:

It is imperative that you download the workshop material BEFORE the workshop.

Working in mothur

In Windows, you can double-click the executable mothur icon. In Mac, you can either right click and hit open (not recommended) or open it from the terminal ($ ./mothur). The easiest way to run this tutorial will be to transfer all files found in “Workshop_files” folder into the unzipped mothur folder that you downloaded in step 1.

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

In the first command, we will extract the information from the forward (R1) and reverse (R2) fastq files.
mothur >
This command will extract the sequence and quality score data from your fastq files.
Output File Names:
[WARNING]: your sequence names contained ':'.  I changed them to '_' to avoid problems in your downstream analysis.
In every step of the analysis, we can see what the sequences look like with the “summary.seqs” command.
mothur > summary.seqs(fasta=soil.R1.fasta, processors=4)
Using 4 processors.
        Start    End    NBases    Ambigs    Polymer    NumSeqs
Minimum:    1    250    250    0    3    1
2.5%-tile:    1    250    250    0    4    9374
25%-tile:    1    250    250    0    4    93740
Median:     1    250    250    0    5    187480
75%-tile:    1    250    250    0    5    281219
97.5%-tile:    1    250    250    0    8    365585
Maximum:    1    250    250    0    59    374958
Mean:    1    250    250    0    4.72324
# of Seqs:    374958
Output File Names:
It took 4 secs to summarize 374958 sequences.
Because the table does not look good, please check your workshop material. I will paste it again only in couple of steps below.

Repeat for R2 on your own.

Now, we will pair the forward and reverse reads with the “make.contigs” command, which requires the “stability.files” as input. You should format the stability.files based on the form that you get your data back from your sequencing facility. In this tutorial, we only need a three-column format, where the first column corresponds to the sample, and the second and third columns to the R1 and R2 fastq files of this sample, respectively.
mothur > make.contigs(file=soil_stability.files, processors=4)
Let's see what happened:
mothur > summary.seqs(fasta=soil_stability.trim.contigs.fasta, processors=4)
        Start    End    NBases    Ambigs    Polymer    NumSeqs
Minimum:    1    247    247    0    3    1
2.5%-tile:    1    253    253    0    4    9374
25%-tile:    1    253    253    0    4    93740
Median:     1    253    253    0    4    187480
75%-tile:    1    253    253    0    5    281219
97.5%-tile:    1    254    254    12    6    365585
Maximum:    1    500    500    92    250    374958
Mean:    1    253.178    253.178    1.16067    4.55027
# of Seqs:    374958
There are some troublesome observations, such as the 500 bp long sequences, the 12 and 92 ambiguous bases in some of them etc. The next command enables us to keep sequences that fulfill certain user-defined criteria.
mothur > screen.seqs(fasta=soil_stability.trim.contigs.fasta, group=soil_stability.contigs.groups, maxambig=0, maxlength=280)
mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.fasta, processors=4)
Using 4 processors.
        Start    End    NBases    Ambigs    Polymer    NumSeqs
Minimum:    1    250    250    0    3    1
2.5%-tile:    1    253    253    0    4    7086
25%-tile:    1    253    253    0    4    70852
Median:     1    253    253    0    5    141704
75%-tile:    1    253    253    0    5    212555
97.5%-tile:    1    254    254    0    6    276321
Maximum:    1    280    280    0    11    283406
Mean:    1    253.12    253.12    0    4.55898
# of Seqs:    283406
We anticipate that many of our sequences are duplicates of each other. Let’s de-replicate our dataset.
mothur > unique.seqs(fasta=soil_stability.trim.contigs.good.fasta)
283406    73327
Output File Names:
mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.unique.fasta, name=soil_stability.trim.contigs.good.names, processors=4)
# of unique seqs:    73327
total # of seqs:    283406
Now, let’s create a count table that makes everything more affordable computationally.
mothur > count.seqs(name=soil_stability.trim.contigs.good.names, group=soil_stability.contigs.good.groups)
Using 4 processors.
It took 1 secs to create a table for 283406 sequences.
Total number of sequences: 283406
Output File Names:
We will need to align our sequences to a reference database (fasta & taxonomy). This is how mothur works! I downloaded silva_v119 (// and trimmed it to the region of interest (V4) of the 16S rRNA gene with the “pcr.seqs” command. This is not necessary, but it will make your computer’s life easier. Let's see what I have there.
mothur > summary.seqs(fasta=silva.v4.fasta, processors=4)
Using 4 processors.
        Start    End    NBases    Ambigs    Polymer    NumSeqs
Minimum:    1    20730    501    0    4    1
2.5%-tile:    2    21228    522    0    4    374
25%-tile:    2    21228    528    0    4    3740
Median:     2    21228    545    0    5    7479
75%-tile:    2    21228    549    0    5    11218
97.5%-tile:    2    21228    551    2    6    14583
Maximum:    10    21228    587    5    9    14956
Mean:    2.00053    21228    539.276    0.13192    4.97727
# of Seqs:    14956

We can align our own sequences now. After the alignment, the previously ".fasta" file is now called ".align".
mothur > align.seqs(fasta=soil_stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)
Output File Names:
We will re-run screen.seqs to make sure that everything overlaps the same region. We will get sequences to start at or before position 7436 and let mothur optimize the end (or optimize it at 17018). We will also set the maximum homopolymer length to 8 (this could have been done in the first execution of screen.seqs above).
mothur > screen.seqs(fasta=soil_stability.trim.contigs.good.unique.align, count=soil_stability.trim.contigs.good.count_table, summary=soil_stability.trim.contigs.good.unique.summary, start=7436, optimize=end, maxhomop=8)
Now that we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we will filter the sequences to remove the overhangs at both ends.
mothur > filter.seqs(fasta=soil_stability.trim.contigs.good.unique.good.align, vertical=T, trump=.)
If you did something wrong in the “screen.seqs” you will end up with nothing here. If you get the summary, you will see that all is well. Now, we will remove any redundancy that we created by running “unique.seqs” again.
mothur > unique.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.fasta, count=soil_stability.trim.contigs.good.good.count_table)
In the summary, we see that we have 72,621 unique sequences out of a total of 282,291 sequences, which is about 26% of the total. This is a very high percentage and we we will try to reduce it as much as possible by running a couple of commands that will "remove" the sequencing error. We do not get rid of data, we just group them in meaningful ways and save our computer from crashing.

The first command is an algorith developed by Sue Huse (doi: 10.1111/j.1462-2920.2010.02193.x.). By default, mothur runs the pre.cluster command by looking for sequences that are within one (diffs=1) mismatch of the sequence being considered. You can change this threshold with the diffs option.

mothur > pre.cluster(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.count_table, diffs=2)
In the summary, we see that we have 37,219 unique sequences out of a total of 282,291 sequences, which is about a 12.5% reduction of the dataset. And now, let’s identify any chimeric sequences and remove them.
mothur > chimera.uchime(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)
mothur > remove.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.accnos)

We now have 18,179 unique sequences out of a total of 246,946, which is about 7%. It is time we classified our sequences. I like doing this so as to remove unspecific sequences within mothur. Others do this further down in R. I do not see the reason of not further reducing my dataset as soon as possible, but this is my personal preference.
mothur > classify.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table,,, cutoff=65)

We got some warnings such as:
[WARNING]: M02248_9_000000000-AB7RF_1_1105_10669_16474 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences.
The primers we use are only supposed to amplify members of the Bacteria and if they are hitting Eukaryota or Archaea, it is a mistake. Also, chloroplasts and mitochondria have no functional role in a microbial community. There's also just the random stuff that we want to get rid of (“unknown” -> no domain). Now, we will remove these sequences.
mothur > remove.lineage(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table,, taxon=Archaea-Chloroplast-mitochondria-Eukaryota-unknown)
mothur > summary.seqs(fasta=current, count=current, processors=4)
# of unique seqs:    17967
total # of seqs:    244310
Our unique sequences are 7% of our dataset. I think this is still a high percentage of unique sequences and if you run the following command, you will be able to spit the abundance into abundant sequences and rare sequences. I will use a cutoff of one read to define a rare sequence. Then I will summarize the abundant fasta.
mothur > split.abund(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, cutoff=1)
mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.abund.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.abund.count_table)
# of unique seqs:    9265
total # of seqs:    235608
Interestingly, almost half of our unique sequences were singleton reads, and while this produced a great reduction in our unique sequences (which will make things easier for clustering and probably more meaningful ecologically), the total reduction of the whole dataset was only 3.5%. This is your call to make in your analysis.

Let’s continue with traditional analysis and proceed to clustering the sequences into OTUs. There are two ways to do this in mothur: 1) the traditional way that we will do consists of calculating uncorrected pairwise distances between aligned DNA sequences and the clustering into OTUs based on your favorite clustering method implemented in mothur (furthest, average, and nearest neighbor), 2) the faster and memory saving cluster.split way, where taxonomic information is used to split the sequences into bins and then cluster within each bin.
mothur > dist.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.20)
mothur > cluster(column=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table)
Now, we will create a table with the number of sequences per OTU and per sample.
mothur > make.shared(, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, label=0.03)
And we will need to classify these OTUs
classify.otu(list=...., count=...., taxonomy=...., label=0.03)
Let's give some easier names to our final files. Use "copy" instead of "cp" if you work in Windows OS.
mothur > system(cp soil_final.shared)
mothur > system(cp soil_final.cons.taxonomy)
We need to figure out how many sequences we have per sample…
mothur > count.groups(shared=soil_final.shared)
2A contains 68535.
3A contains 64649.
4A contains 55745.
61A contains 55381.
Total seqs: 244310.
…and then normalize the sampling effort by sub-sampling with the lowest number of sequences observed.
mothur > sub.sample(shared=soil_final.shared, size=55381)
We can evaluate the sequencing depth in our samples in mothur and plot it in R, as shown below.
mothur > rarefaction.single(shared=soil_final.shared, label=0.03, freq=1000)
Indices are important, not only to calculate and plot them, but also to understand the information they provide and the limitations they might have.
mothur > summary.single(shared=soil_final.shared, calc=nseqs-coverage-sobs-shannon-invsimpson, subsample=55381)
Mothur provides several other options for β-diversity analyses (e.g., calculation of ordination matrices, venn diagrams, heatmaps etc), as well as other ecological approaches such as the Indicator Species or the LEfSE (Linear discriminant analysis Effect Size) analyses. Plotting and other statistics can be done in R.
mothur > dist.shared(shared= soil_final.shared, calc=braycurtis, subsample=55381)
Particularly useful commands might be the “get.oturep”, which generates a fasta-formatted sequence file containing only a representative sequence for each OTU, and the “bin.seqs”, which can provide the full list of sequences for each OTU and a fasta file, if you provide a fasta file in the optional parameters.

Open R – Basic R Package - This is not a R workshop, so we will explore basic ways of plotting data from mothur with R. Thing can get more sophisticated with packages such as vegan, phyloseq etc. We will:
1) Plot a rarefaction curve
2) Plot a nMDS ordination plot using a mothur-outputted matrix.

Use the “phyloseq: package, make a “phyloseq object” and graph relative abundances bars.
sharedFile = read.table("soil_final.0.03.subsample.shared")
sharedFile = t(sharedFile)
rownames(sharedFile) = sharedFile[,1]
colnames(sharedFile) = sharedFile[2,]
sharedFile = sharedFile[,2:5]
sharedFile = sharedFile[4:5241,]
class(sharedFile) <- "numeric"
taxFile = read.table('soil_final.taxonomy.txt', header=T, sep='\t')
rownames(taxFile) = taxFile[,1]
taxFile = taxFile[,3:8]
taxFile = as.matrix(taxFile)
metaFile = read.table('soil_stability.meta.txt', header=T, sep='\t')
rownames(metaFile) = metaFile[,1]
metaFile = metaFile[,2:3]
OTU = otu_table(sharedFile, taxa_are_rows = TRUE)
TAX = tax_table(taxFile)
META = sample_data(metaFile)
Soil_physeq = phyloseq(OTU, TAX, META)
Soil_physeq <- prune_taxa(taxa_sums(Soil_physeq) > 0, Soil_physeq)
sum(taxa_sums(Soil_physeq) == 0) #check if you have OTUs with "0" abundance
#Transform to proportions
Soil_physeqP = transform_sample_counts(Soil_physeq, function(x) 100 * x/sum(x))
#Plot the top20 OTUs
top20otus = names(sort(taxa_sums(Soil_physeqP), TRUE)[1:20])
taxtab20 = cbind(tax_table(Soil_physeqP), genus20 = NA)
taxtab20[top20otus, "genus20"] <- as(tax_table(Soil_physeqP)[top20otus, "Genus"], "character")
tax_table(Soil_physeqP) <- tax_table(taxtab20)
Soil_physeq_P20 = prune_taxa(top20otus, Soil_physeqP)
title = "Relative Abundance of Top20 OTUs"
p1 = plot_bar(Soil_physeq_P20, fill = "genus20", title = title) + labs(colour = "genus")
# plot with panelling by Method for an easier method-wise comparison
p1 + facet_wrap(~Method)