space.template.RNASeq

=Normalization and Differential Expression in RNA-Seq Data=

=Introduction=

What is edgeR?
edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used in edgeR do NOT support FPKM, RPKM or other types of data that are not counts. You can obtain more information about edgeR from the [|bioconductor website]. Here you’ll find a user’s guide/vignette with detailed examples, a reference manual for all edgeR functions, and the [R] code for installing the package. A useful discussion on a forum about edgeR can be found on a [|SeqAnswers Thread].

Getting Help with edgeR
The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive and very helpful; I have used it several times. To subscribe go to []. For guides to posting questions go to []. If you just need help understanding an edgeR function or want to see an example of how it’s used simply type code ?functionname code into the [R] command line. This will bring up a text page with detailed information about the function

Getting Help with [R]
Google your question! Write a short description and add on at the end “R-help”. This will most likely bring up a thread in the R-help forum that deals with the same or similar problem. Almost every problem you could have can be found answered on this forum. You can also subscribe to the [R] mailing list by going to []. A guide for posting questions can be found at [] Elizabeth Purdom has several [R] tutorials that can be found at []. In particular, see [] for using directories and saving objects and [|www.stat.berkeley.edu/~epurdom/BrownLab/]for creating and manipulating data sets in [R].

Installing edgeR
With an internet connection, edgeR can be easily installed by running the following commands in [R] code source("http://www.bioconductor.org/biocLite.R") biocLite("edgeR") code

Make sure you have a recent version of R (>=2.15), or some of the initial commands will not work.

=The Data=

Usually, a RNA-Seq data analysis starts with a set of [|FASTQ files] which contain information on both the sequence and the quality of the short reads.

There are several tools to align the reads to the reference genome (e.g. Bowtie, TopHat, ...). You have just seen how to align the reads to a reference genome using Tophat. Assume we already performed this step for all the replicates and we produced a table of read counts for each gene.

We will consider an example based on the data analyzed in [1]. The Sherlock Lab in the Department of Genetics at Stanford sequenced 14 strains of //Saccharomyces Cerevisiae// grown in three media, namely YPD, Delft and Glycerol each with 3-8 biological replicates. The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie, considering only unique mapping and allowing up to two mismatches. The read count for a given gene is defined as the number of reads with 5’-end falling within the corresponding region. The gene-level counts for this example are provided in the [|yeastRNASeqRisso2011 R package].

In addition to the read counts, the R package contains some sample and gene information, such as GC-content and gene length.

To install the package you need to download the file from this [|link] then in an R session type (on a Mac): code install.packages("~/Downloads/yeastRNASeqRisso2011_0.0.2.tar.gz",repos=NULL,type="source") code Windows users should simply replace "~/Downloads/" with the appropriate path to their Downloads directory.

If you can't install the package, download the data from here: ==

=Exploratory Analysis=

Setting up the count matrix
First, we load the data into [R] and make sure we can access the read counts code library(yeastRNASeqRisso2011)

data(geneLevelCounts) data(laneInfo) data(geneInfo)

head(geneLevelCounts) head(geneInfo) laneInfo code

We want to filter out the non-expressed genes. For simplicity, we consider only the genes with an average read count of 10 or more. code means <- rowMeans(geneLevelCounts) filter <- means >= 10 table(filter) geneLevelCounts <- geneLevelCounts[filter,] dim(geneLevelCounts) code

Looking at the data
One of the main characteristics of RNA-Seq data is that each library will generally have a different //sequencing depth//. We can visualize the total number of mapped reads to known genes with the //barplot// function. To check for systematic effects we can color-code the plot by different biological or technical variables code colors <- c(rep("yellow3",8),rep("blue",3),rep("green",3)) totCounts <- colSums(geneLevelCounts) barplot(totCounts, las=2, col=colors) barplot(totCounts, las=2, col=c("red","blue")[laneInfo[,4]]) code

The //boxplot// function provides an easy way to visualize the difference in distribution between each experiment. code boxplot(geneLevelCounts, las=2, col=colors) boxplot(log2(geneLevelCounts+1), las=2, col=colors) code

=Normalization=

Building the edgeR object
//DGEList// is the function that coverts the count matrix into an edgeR object. In addition to the counts, we need to group the samples according to the variable of interest in our experiment. We can then see the elements that the object contains by using the //names// function code library(edgeR) group <- laneInfo[,2] group <- droplevels(group) counts <- geneLevelCounts cds <- DGEList( counts, group = group ) names(cds) code

These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object. code head(cds$counts) # original count matrix cds$samples # contains a summary of your samples sum(cds$all.zeros) # How many genes have 0 counts across all samples code

Normalization in edgeR
We can calculate the normalization factors which correct for the different sequencing depth of each library (or //library size//). code cds <- calcNormFactors(cds) cds$samples code

By default, the function //calcNormFactors// normalize the data using the "weighted trimmed mean of M-values" (TMM) method, proposed by [2]. Other options are RLE [3] and upper-quartile [4]. If we want to use the upper-quartile to normalize, we can add an extra argument to the function code cds <- calcNormFactors(cds, method="upperquartile") cds$samples code

If we want the normalized pseudo-counts, useful for instance for cluster analysis, we can get them with the following commands. code scale <- cds$samples$lib.size*cds$samples$norm.factors normCounts <- round(t(t(counts)/scale)*mean(scale)) boxplot(log2(normCounts+1), las=2, col=colors) code

Multi-Dimensional Scaling Plot
An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples. code plotMDS(cds, main="MDS Plot for Count Data", labels=colnames(cds$counts)) code = = =Differential Expression Testing= =A two-class comparison=

As a first example, let's focus on the comparison between the three Delft and the three Glycerol yeast libraries. code group <- laneInfo[9:14,2] group <- droplevels(group) counts <- geneLevelCounts[, 9:14] cds <- DGEList( counts, group = group ) cds <- calcNormFactors(cds, method="upperquartile") code

Estimating Dispersion
First, we need to estimate the //common dispersion//: this assumes that all the genes have the same dispersion. code cds <- estimateCommonDisp(cds, verbose=TRUE) code

To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ 2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be code sqrt(200) # poisson sd sqrt(200 + 200^2 * cds$common.dispersion) # negative binomial sqrt(200 + 200^2 * cds$common.dispersion) / sqrt(200) # ratio: NB sd is almost 5 times larger code

In real experiments, the assumption of common dispersion is almost never met. Often, we observe a relation between mean counts and dispersion, i.e., the more expressed genes have less dispersion.

The way edgeR estimates a tagwise (i.e. gene-wise) dispersion parameter is by "shrinking" the gene-wise dispersions toward a common value (the common dispersion estimated in the previous step). Alternatively, one can shrink the gene-wise estimates to a common trend, by estimating a smooth function prior to the shrinkage (using the //estimateTrendedDisp// function). Here we keep things simple and shrink the estimates to the common value code cds <- estimateTagwiseDisp(cds) code

We can plot the biological coefficient of variation (i.e. the square root of the dispersions) with the following code plotBCV(cds) code

Another useful plot is the mean-variance relation. code meanVarPlot <- plotMeanVar(cds, show.raw.vars=TRUE, show.tagwise.vars=TRUE, show.binned.common.disp.vars=FALSE, show.ave.raw.vars=FALSE, NBline = TRUE, nbins = 100 , #these are arguments about what is plotted   pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) #these arguments are to make it look prettier code

Testing for DE
The function //exactTest// performs pair-wise tests for differential expression between two groups. The important parameter is pair which indicates which two groups should be compared. The output of //exactTest// is a list of elements: we can get the table of the results with the //topTags// function. code et <- exactTest(cds, pair=levels(group)) topTags(et) top <- topTags(et, n=nrow(cds$counts))$table head(top) code

We can store the ID of the DE genes and look at the distribution of the //p//-values code de <- rownames(top[top$FDR<0.05,]) length(de) head(de) hist(top$PValue, breaks=20) code

Visualizing the results
We can use the function //plotSmear// to produce a mean-difference plot code plotSmear(cds, de.tags=de) abline(h=c(-2, 2), col="green") code

Another useful plot is the "volcano plot" that relates log-fold-changes and //p//-values code plot(top$logFC, -log10(top$PValue), pch=20, cex=.5, ylab="-log10(p-value)", xlab="logFC", col=as.numeric(rownames(top) %in% de)+1) abline(v=c(-2, 2), col="green") code

Finally, we can plot a heatmap of the top genes and extract lists of up- and down-regulated genes. code heatmap(log(normCounts[de[1:500],9:14]+1), ColSideColor=colors[9:14], col=rev(heat.colors(20))) upreg <- rownames(top[rownames(top) %in% de & top$logFC > 0,]) downreg <- rownames(top[rownames(top) %in% de & top$logFC < 0,]) code

Outputting the results
We can write the results to a file that we can later open in excel or in other programs code write.table(top, file="two-class-results.txt", sep='\t', quote=FALSE) code

=A three-class comparison=

In some experiments, the design is more complicated than a two-class comparison. edgeR can deal with complicated designs via a Generalized Linear Model (GLM) approach. As an example, we will consider the three-class comparison of YPD, Delft and Glycerol in the yeast data.

The main ingredient that we will need in the GLM approach is the //design matrix//. This matrix tells the model what are the important variables in the experiment. These can be either biological (e.g., growth condition, KO/WT, disease status, ...) or technical (e.g., library prep protocol, flow-cell, ...). To keep things simple, we will model only the growth condition.

Building the edgeR object
Again, we need to build an edgeR object (this time with all the samples) code group <- laneInfo[,2] counts <- geneLevelCounts cds <- DGEList( counts, group = group ) names(cds) code

In addition, we need the design matrix. This can be constructed via the //model.matrix// function code design <- model.matrix(~group) design code

Normalization
The normalization step is the same as in the two-class comparison. code cds <- calcNormFactors(cds, method="upperquartile") cds$samples code

Estimating Dispersion
As before, we need to estimate dispersion. There are "GLM versions" of the functions we used before. Remember to add the design matrix as parameter. code cds <- estimateGLMCommonDisp(cds, design, verbose=TRUE) cds <- estimateGLMTagwiseDisp(cds, design) code

We can plot the biological coefficient of variation and the mean-variance relation code plotBCV(cds) meanVarPlot <- plotMeanVar(cds, show.raw.vars=TRUE, show.tagwise.vars=TRUE, show.binned.common.disp.vars=FALSE, show.ave.raw.vars=FALSE, NBline = TRUE, nbins = 100 , pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) code

Testing for DE
The function //glmFit// fits a GLM to the data and the function //glmLRT// performs a Likelihood Ratio Test (LRT) for the null hypothesis that the variable of interest does not influence the gene expression. The //coef// argument refers to the column numbers of the design matrix to be tested. Alternatively, one can use contrasts to test a specific linear combination of the variables (see the package vignette for details). code fit <- glmFit(cds, design) lrt <- glmLRT(fit, coef=2:3) code

The //topTags// function can still be used to extract the significant genes. code top <- topTags(lrt, n=nrow(cds$counts))$table head(top) code

We can store the ID of the DE genes and look at the distribution of the //p//-values. code de <- rownames(top[top$FDR<0.05,]) length(de) head(de) hist(top$PValue, breaks=20) code

Visualizing the results
We can use the function //plotSmear// to produce a mean-difference plot. code plotSmear(cds, de.tags=de, pair=c("YPD", "Del")) abline(h=c(-2, 2), col="green") plotSmear(cds, de.tags=de, pair=c("YPD", "Gly")) abline(h=c(-2, 2), col="green") plotSmear(cds, de.tags=de, pair=c("Del", "Gly")) abline(h=c(-2, 2), col="green") code

Finally, we can plot a heatmap of the top genes and again extract lists of genes highly and lowly expressed in each condition. code heatmap(log(X.upperquartile[de[1:500],]+1), ColSideColor=colors, col=rev(heat.colors(20)))

upregDel <- rownames(top[rownames(top) %in% de & top$logFC.groupGly < 0 & top$logFC.groupYPD < 0,]) downregDel <- rownames(top[rownames(top) %in% de & top$logFC.groupGly > 0 & top$logFC.groupYPD > 0,])

code code upregYPD <- rownames(top[rownames(top) %in% de & top$logFC.groupYPD > 0 & top$logFC.groupYPD > top$logFC.groupGly,]) downregYPD <- rownames(top[rownames(top) %in% de & top$logFC.groupYPD < 0 & top$logFC.groupYPD < top$logFC.groupGly,])

code code upregGly <- rownames(top[rownames(top) %in% de & top$logFC.groupGly > 0 & top$logFC.groupGly > top$logFC.groupYPD,]) downregGly <- rownames(top[rownames(top) %in% de & top$logFC.groupGly < 0 & top$logFC.groupGly < top$logFC.groupYPD,]) code

Outputting the results
We can write the results to a file that we can later open in excel or in other programs. code write.table(top, file="three-class-results.txt", sep='\t', quote=FALSE) code

R script
Here is the script from the workshop demonstration, which just compiles most of the code provided above.

=References=
 * 1) Risso D, Schwartz K, Sherlock G, Dudoit S (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics 12:480
 * 2) Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11:R25.
 * 3) Anders S, Huber W (2010). Differential expression analysis for sequence count data. Genome Biology 11:R106.
 * 4) Bullard JH, Purdom E, Hansen KD, Dudoit S (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11:94
 * 5) Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
 * 6) Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887
 * 7) Robinson MD and Smyth GK (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9, 321-332