16S+Amplicon+Sequencing+Data+Analysis+Workshop

=INSTRUCTOR: Despoina Lymperopoulou =

WORKSHOP INFORMATION AND REQUIREMENTS
This workshop will cover key steps in analyzing 16S bacterial amplicon sequencing data. We will learn how to QC raw data, then use Mothur software to cluster OTUs and make taxonomic assignments.This tutorial reflects the way that I usually do my analyses.

Before the workshop, you need to: 1.Download and unzip Mothur software (v1.39.5) OR according to your OS. [] 2. Download and install “FastQC” – make sure that you have Java installed too. If you have (Java) trouble, consider this as an optional installation. [|http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc] 3. Download the “Amplicon_Workshop_CGRL” folder – it contains two subfolders. One subfolder is called “Workshop_files” and includes the fastq (sequencing) files, the stability.files (mapping file), and reference files (SILVA). These reference files can be downloaded here for your future analyses []. The second subfolder contains the output files of mothur from the analysis that we are going to run. I have already run it and I give you the files in case you get stuck somewhere in the tutorial today, so that we can all proce  ed with the analysis. 4. Download and install: R: [|http://cran.rstudio.com] RStudio: [] **Optional** 5. Install the following packages in R: “calibrate”, “shape”, “direct labels”, “knitr”, “clustsig”, “ellipse, “plyr”, “vegan”, “ggplot2”, and “phyloseq”. Except for “phyloseq”, all packages can simply be installed by running: **install.packages(“package_name”)**. Please follow the directions to successfully install “phyloseq”: []

__** It is imperative that you download the workshop material BEFORE the workshop. **__ **Working in mothur**

In Windows, you can double-click the executable mothur icon. In Mac, you can either right click and hit open (not recommended) or open it from the terminal ($ ./mothur). **The easiest way to run this tutorial will be to transfer all files found in “Workshop_files” folder into the unzipped mothur folder that you downloaded in step 1.**

When using, please cite: Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

// In the first command, we will extract the information from the forward (R1) and reverse (R2) fastq files. // code mothur > fastq.info(fastq=soil.R1.fastq) code // This command will extract the sequence and quality score data from your fastq files. // code Output File Names: soil.R1.fasta soil.R1.qual

[WARNING]: your sequence names contained ':'. I changed them to '_' to avoid problems in your downstream analysis.

code // In every step of the analysis, we can see what the sequences look like with the “summary.seqs” command. // code mothur > summary.seqs(fasta=soil.R1.fasta, processors=4)

Using 4 processors.

Start   End    NBases    Ambigs    Polymer    NumSeqs Minimum:   1    250    250    0    3    1 2.5%-tile:   1    250    250    0    4    9374 25%-tile:   1    250    250    0    4    93740 Median:    1    250    250    0    5    187480 75%-tile:   1    250    250    0    5    281219 97.5%-tile:   1    250    250    0    8    365585 Maximum:   1    250    250    0    59    374958 Mean:   1    250    250    0    4.72324
 * 1) of Seqs:    374958

Output File Names: soil.R1.summary

It took 4 secs to summarize 374958 sequences.

code //Because the table does not look good, please check your workshop material. I will paste it again only in couple of steps below.//

//Repeat for R2 on your own.//

// Now, we will pair the forward and reverse reads with the “make.contigs” command, which requires the “stability.files” as input. You should format the stability.files based on the form that you get your data back from your sequencing facility. In this tutorial, we only need a three-column format, where the first column corresponds to the sample, and the second and third columns to the R1 and R2 fastq files of this sample, respectively. // code mothur > make.contigs(file=soil_stability.files, processors=4) code // Let's see what happened: // code mothur > summary.seqs(fasta=soil_stability.trim.contigs.fasta, processors=4)

Start   End    NBases    Ambigs    Polymer    NumSeqs Minimum:   1    247    247    0    3    1 2.5%-tile:   1    253    253    0    4    9374 25%-tile:   1    253    253    0    4    93740 Median:    1    253    253    0    4    187480 75%-tile:   1    253    253    0    5    281219 97.5%-tile:   1    254    254    12    6    365585 Maximum:   1    500    500    92    250    374958 Mean:   1    253.178    253.178    1.16067    4.55027
 * 1) of Seqs:    374958

code // There are some troublesome observations, such as the 500 bp long sequences, the 12 and 92 ambiguous bases in some of them etc. The next command enables us to keep sequences that fulfill certain user-defined criteria. // code mothur > screen.seqs(fasta=soil_stability.trim.contigs.fasta, group=soil_stability.contigs.groups, maxambig=0, maxlength=280) code code mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.fasta, processors=4)

Using 4 processors.

Start   End    NBases    Ambigs    Polymer    NumSeqs Minimum:   1    250    250    0    3    1 2.5%-tile:   1    253    253    0    4    7086 25%-tile:   1    253    253    0    4    70852 Median:    1    253    253    0    5    141704 75%-tile:   1    253    253    0    5    212555 97.5%-tile:   1    254    254    0    6    276321 Maximum:   1    280    280    0    11    283406 Mean:   1    253.12    253.12    0    4.55898
 * 1) of Seqs:    283406

code // We anticipate that many of our sequences are duplicates of each other. Let’s de-replicate our dataset. // code mothur > unique.seqs(fasta=soil_stability.trim.contigs.good.fasta) 283406   73327

Output File Names: soil_stability.trim.contigs.good.names soil_stability.trim.contigs.good.unique.fasta

code code mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.unique.fasta, name=soil_stability.trim.contigs.good.names, processors=4) ....... total # of seqs:   283406
 * 1) of unique seqs:    73327

code // Now, let’s create a count table that makes everything more affordable computationally. // code mothur > count.seqs(name=soil_stability.trim.contigs.good.names, group=soil_stability.contigs.good.groups)

Using 4 processors. It took 1 secs to create a table for 283406 sequences.

Total number of sequences: 283406

Output File Names: soil_stability.trim.contigs.good.count_table

code // We will need to align our sequences to a reference database (fasta & taxonomy). This is how mothur works! I downloaded silva_v119 ( //[|//http://www.mothur.org/wiki/Silva_reference_files//]// ) and trimmed it to the region of interest (V4) of the 16S rRNA gene with the “pcr.seqs” command. This is not necessary, but it will make your computer’s life easier. Let's see what I have there. // code mothur > summary.seqs(fasta=silva.v4.fasta, processors=4)

Using 4 processors.

Start   End    NBases    Ambigs    Polymer    NumSeqs Minimum:   1    20730    501    0    4    1 2.5%-tile:   2    21228    522    0    4    374 25%-tile:   2    21228    528    0    4    3740 Median:    2    21228    545    0    5    7479 75%-tile:   2    21228    549    0    5    11218 97.5%-tile:   2    21228    551    2    6    14583 Maximum:   10    21228    587    5    9    14956 Mean:   2.00053    21228    539.276    0.13192    4.97727
 * 1) of Seqs:    14956

code

// We can align our own sequences now. After the alignment, the previously ".fasta" file is now called ".align". // code mothur > align.seqs(fasta=soil_stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)

Output File Names: soil_stability.trim.contigs.good.unique.align soil_stability.trim.contigs.good.unique.align.report soil_stability.trim.contigs.good.unique.flip.accnos code // We will re-run screen.seqs to make sure that everything overlaps the same region. We will get sequences to start at or before position 7436 and let mothur optimize the end (or optimize it at 17018). We will also set the maximum homopolymer length to 8 (this could have been done in the first execution of screen.seqs above). // code mothur > screen.seqs(fasta=soil_stability.trim.contigs.good.unique.align, count=soil_stability.trim.contigs.good.count_table, summary=soil_stability.trim.contigs.good.unique.summary, start=7436, optimize=end, maxhomop=8) code // Now that we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we will filter the sequences to remove the overhangs at both ends. // code mothur > filter.seqs(fasta=soil_stability.trim.contigs.good.unique.good.align, vertical=T, trump=.) code // If you did something wrong in the “screen.seqs” you will end up with nothing here. If you get the summary, you will see that all is well. //// Now, we will remove any redundancy that we created by running “unique.seqs” again. // code mothur > unique.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.fasta, count=soil_stability.trim.contigs.good.good.count_table) code //In the summary, we see that we have 72,621 unique sequences out of a total of 282,291 sequences, which is about 26% of the total. This is a very high percentage and we we will try to reduce it as much as possible by running a couple of commands that will "remove" the sequencing error. We do not get rid of data, we just group them in meaningful ways and save our computer from crashing.//

//The first command is an algorith developed by Sue Huse (doi: 10.1111/j.1462-2920.2010.02193.x.). By default, mothur runs the pre.cluster command by looking for sequences that are within one (diffs=1) mismatch of the sequence being considered. You can change this threshold with the diffs option.//

code mothur > pre.cluster(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.count_table, diffs=2) code //In the summary, we see that we have 37,219 unique sequences out of a total of 282,291 sequences, which is about a 12.5% reduction of the dataset. And now, let’s identify any chimeric sequences and remove them.// code mothur > chimera.uchime(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

mothur > remove.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.fasta, accnos=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.accnos)

code

// We now have 18,179 unique sequences out of a total of //// 246,946, which is about 7%. //// It is time we classified our sequences. I like doing this so as to remove unspecific sequences within mothur. Others do this further down in R. I do not see the reason of not further reducing my dataset as soon as possible, but this is my personal preference. // code mothur > classify.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, reference=silva.nr_v119.ng.fasta, taxonomy=silva.nr_v119.tax, cutoff=65) code

// We got some warnings such as: // code [WARNING]: M02248_9_000000000-AB7RF_1_1105_10669_16474 could not be classified. You can use the remove.lineage command with taxon=unknown; to remove such sequences. code // The primers we use are only supposed to amplify members of the Bacteria and if they are hitting Eukaryota or Archaea, it is a mistake. Also, chloroplasts and mitochondria have no functional role in a microbial community. There's also just the random stuff that we want to get rid of (“unknown” -> no domain). Now, we will remove these sequences. // code mothur > remove.lineage(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.nr_v119.wang.taxonomy, taxon=Archaea-Chloroplast-mitochondria-Eukaryota-unknown) code code mothur > summary.seqs(fasta=current, count=current, processors=4) ............ total # of seqs:   244310
 * 1) of unique seqs:    17967

code // Our unique sequences are 7% of our dataset. I think this is still a high percentage of unique sequences and if you run the following command, you will be able to spit the abundance into abundant sequences and rare sequences. I will use a cutoff of one read to define a rare sequence. Then I will summarize the abundant fasta. // code mothur > split.abund(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, cutoff=1) code code mothur > summary.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.abund.fasta, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.abund.count_table) ........... total # of seqs:   235608
 * 1) of unique seqs:    9265

code // Interestingly, almost half of our unique sequences were singleton reads, and while this produced a great reduction in our unique sequences (which will make things easier for clustering and probably more meaningful ecologically), the total reduction of the whole dataset was only 3.5%. This is your call to make in your analysis. //

// Let’s continue with traditional analysis and proceed to clustering the sequences into OTUs. There are two ways to do this in mothur: 1) the traditional way that we will do consists of calculating uncorrected pairwise distances between aligned DNA sequences and the clustering into OTUs based on your favorite clustering method implemented in mothur (furthest, average, and nearest neighbor), 2) the faster and memory saving cluster.split way, where taxonomic information is used to split the sequences into bins and then cluster within each bin. // code mothur > dist.seqs(fasta=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.fasta, cutoff=0.20)

mothur > cluster(column=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.dist, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table)

code // Now, we will create a table with the number of sequences per OTU and per sample. // code mothur > make.shared(list=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.list, count=soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table, label=0.03) code //And we will need to classify these OTUs// code classify.otu(list=...., count=...., taxonomy=...., label=0.03) code //Let's give some easier names to our final files. Use "copy" instead of "cp" if you work in Windows OS.// code mothur > system(cp soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.shared soil_final.shared)

mothur > system(cp soil_stability.trim.contigs.good.unique.good.filter.unique.precluster.pick.pick.an.unique_list.0.03.cons.taxonomy soil_final.cons.taxonomy)

code // We need to figure out how many sequences we have per sample… // code mothur > count.groups(shared=soil_final.shared) 2A contains 68535. 3A contains 64649. 4A contains 55745. 61A contains 55381.

Total seqs: 244310.

code // …and then normalize the sampling effort by sub-sampling with the lowest number of sequences observed. // code mothur > sub.sample(shared=soil_final.shared, size=55381) code // We can evaluate the sequencing depth in our samples in mothur and plot it in R, as shown below. // code mothur > rarefaction.single(shared=soil_final.shared, label=0.03, freq=1000) code // Indices are important, not only to calculate and plot them, but also to understand the information they provide and the limitations they might have. // code mothur > summary.single(shared=soil_final.shared, calc=nseqs-coverage-sobs-shannon-invsimpson, subsample=55381) code // Mothur provides several other options for β-diversity analyses (e.g., calculation of ordination matrices, venn diagrams, heatmaps etc), as well as other ecological approaches such as the Indicator Species or the LEfSE (Linear discriminant analysis Effect Size) analyses. Plotting and other statistics can be done in R. // code mothur > dist.shared(shared= soil_final.shared, calc=braycurtis, subsample=55381) nmds(phylip=soil_final. braycurtis.0.03.lt.ave.dist)

code // Particularly useful commands might be the “**get.oturep**”, which generates a fasta-formatted sequence file containing only a representative sequence for each OTU, and the “**bin.seqs**”, which can provide the full list of sequences for each OTU and a fasta file, if you provide a fasta file in the optional parameters. //


 * Open R – Basic R Package - This is not a R workshop, so we will explore basic ways of plotting data from mothur with R. Thing can get more sophisticated with packages such as vegan, phyloseq etc. We will: **
 * 1) Plot a rarefaction curve **
 * 2) Plot a nMDS ordination plot using a mothur-outputted matrix. **

code Use the “phyloseq: package, make a “phyloseq object” and graph relative abundances bars.

library("phyloseq) library("ggplot2") library("plyr") library("vegan")

sharedFile = read.table("soil_final.0.03.subsample.shared") sharedFile = t(sharedFile) rownames(sharedFile) = sharedFile[,1] colnames(sharedFile) = sharedFile[2,] sharedFile = sharedFile[,2:5] sharedFile = sharedFile[4:5241,] class(sharedFile) <- "numeric" taxFile = read.table('soil_final.taxonomy.txt', header=T, sep='\t') rownames(taxFile) = taxFile[,1] taxFile = taxFile[,3:8] taxFile = as.matrix(taxFile) metaFile = read.table('soil_stability.meta.txt', header=T, sep='\t') rownames(metaFile) = metaFile[,1] metaFile = metaFile[,2:3]

OTU = otu_table(sharedFile, taxa_are_rows = TRUE) TAX = tax_table(taxFile) META = sample_data(metaFile) Soil_physeq = phyloseq(OTU, TAX, META)

Soil_physeq <- prune_taxa(taxa_sums(Soil_physeq) > 0, Soil_physeq) sum(taxa_sums(Soil_physeq) == 0) #check if you have OTUs with "0" abundance

Soil_physeqP = transform_sample_counts(Soil_physeq, function(x) 100 * x/sum(x))
 * 1) Transform to proportions

top20otus = names(sort(taxa_sums(Soil_physeqP), TRUE)[1:20]) taxtab20 = cbind(tax_table(Soil_physeqP), genus20 = NA) taxtab20[top20otus, "genus20"] <- as(tax_table(Soil_physeqP)[top20otus, "Genus"], "character") tax_table(Soil_physeqP) <- tax_table(taxtab20)
 * 1) Plot the top20 OTUs

Soil_physeq_P20 = prune_taxa(top20otus, Soil_physeqP) title = "Relative Abundance of Top20 OTUs" p1 = plot_bar(Soil_physeq_P20, fill = "genus20", title = title) + labs(colour = "genus")

p1 + facet_wrap(~Method) code
 * 1) plot with panelling by Method for an easier method-wise comparison

**WORKSHOP TUTORIAL**

==