Normalization and Differential Expression in RNA-Seq Data
Introduction
What is edgeR?
edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used in edgeR do NOT support FPKM, RPKM or other types of data that are not counts.
You can obtain more information about edgeR from the bioconductor website. Here you’ll find a user’s guide/vignette with detailed examples, a reference manual for all edgeR functions, and the [R] code for installing the package. A useful discussion on a forum about edgeR can be found on a SeqAnswers Thread .
Getting Help with edgeR
The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive and very helpful; I have used it several times. To subscribe go to https://stat.ethz.ch/mailman/listinfo/bioconductor . For guides to posting questions go to http://www.bioconductor.org/help/mailing-list/posting-guide/ .
If you just need help understanding an edgeR function or want to see an example of how it’s used simply type
?functionname
into the [R] command line. This will bring up a text page with detailed information about the function
Make sure you have a recent version of R (>=2.15), or some of the initial commands will not work.
The Data
Usually, a RNA-Seq data analysis starts with a set of FASTQ files which contain information on both the sequence and the quality of the short reads.
There are several tools to align the reads to the reference genome (e.g. Bowtie, TopHat, ...).
You have just seen how to align the reads to a reference genome using Tophat. Assume we already performed this step for all the replicates and we produced a table of read counts for each gene.
We will consider an example based on the data analyzed in [1]. The Sherlock Lab in the Department of Genetics at Stanford sequenced 10 strains of Saccharomyces Cerevisiae grown in three media, namely YPD, Delft and Glycerol each with 3-4 biological replicates.
The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie, considering only unique mapping and allowing up to two mismatches.
The read count for a given gene is defined as the number of reads with 5’-end falling within the corresponding region. The gene-level counts for this example are provided in the yeastRNASeqRisso2011 R package.
In addition to the read counts, the R package contains some sample and gene information, such as GC-content and gene length.
To install the package you need to download the file from this link and in the directory where you downloaded the file type
R CMD INSTALL yeastRNASeqRisso2011_0.0.2.tar.gz
If you can't install the package, download the data from here:
We want to filter out the non-expressed genes. For simplicity, we consider only the genes with an average read count of 10 or more.
means <- rowMeans(geneLevelCounts)
filter <- means >= 10
table(filter)
geneLevelCounts <- geneLevelCounts[filter,]
dim(geneLevelCounts)
Looking at the data
One of the main characteristics of RNA-Seq data is that each library will generally have a different sequencing depth.
We can visualize the total number of mapped reads to known genes with the barplot function. To check for systematic effects we can color-code the plot by different biological or technical variables
As a first example, let's focus on the comparison between the three Delft and the three Glycerol yeast libraries.
Building the edgeR object
DGEList is the function that coverts the count matrix into an edgeR object.
In addition to the counts, we need to group the samples according to the variable of interest in our experiment.
We can then see the elements that the object contains by using the names function
library(edgeR)
group <- laneInfo[9:14,2]
group <- droplevels(group)
counts <- geneLevelCounts[, 9:14]
cds <- DGEList( counts , group = group )
names(cds)
These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object.
head(cds$counts) # original count matrix
cds$samples # contains a summary of your samples
sum(cds$all.zeros) # How many genes have 0 counts across all samples
Normalization
We can calculate the normalization factors which correct for the different sequencing depth of each library (or library size).
cds <- calcNormFactors(cds)
cds$samples
By default, the function calcNormFactors normalize the data using the "weighted trimmed mean of M-values" (TMM) method, proposed by [2]. Other options are RLE [3] and upper-quartile [4]. If we want to use the upper-quartile to normalize, we can add an extra argument to the function
An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples.
plotMDS(cds, main="MDS Plot for Count Data", labels=colnames(cds$counts))
Estimating Dispersion
First, we need to estimate the common dispersion: this assumes that all the genes have the same dispersion.
cds <- estimateCommonDisp(cds, verbose=TRUE)
To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be
sqrt(200) # poisson sd
sqrt(200 + 200^2 * cds$common.dispersion) # negative binomial
sqrt(200 + 200^2 * cds$common.dispersion) / sqrt(200) # NB sd is almost 5 times larger
In real experiments, the assumption of common dispersion is almost never met. Often, we observe a relation between mean counts and dispersion, i.e., the more expressed genes have less dispersion.
The way edgeR estimates a tagwise (i.e. gene-wise) dispersion parameter is by "shrinking" the gene-wise dispersions toward a common value (the common dispersion estimated in the previous step). Alternatively, one can shrink the gene-wise estimates to a common trend, by estimating a smooth function prior to the shrinkage (using the estimateTrendedDisp function). Here we keep things simple and shrink the estimates to the common value
cds <- estimateTagwiseDisp(cds)
We can plot the biological coefficient of variation (i.e. the square root of the dispersions) with the following
plotBCV(cds)
Another useful plot is the mean-variance relation.
meanVarPlot <- plotMeanVar(cds, show.raw.vars=TRUE, show.tagwise.vars=TRUE, show.binned.common.disp.vars=FALSE, show.ave.raw.vars=FALSE, NBline = TRUE , nbins = 100 , #these are arguments about what is plotted
pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) #these arguments are to make it look prettier
Testing for DE
The function exactTest performs pair-wise tests for differential expression between two groups. The important parameter is pair which indicates which two groups should be compared. The output of exactTest is a list of elements: we can get the table of the results with the topTags function.
et <- exactTest(cds, pair=levels(group))
topTags(et)
top <- topTags(et, n=nrow(cds$counts))$table
head(top)
We can store the ID of the DE genes and look at the distribution of the p-values
de <- rownames(top[top$FDR<0.05,])
length(de)
head(de)
hist(top$PValue, breaks=20)
Visualizing the results
We can use the function plotSmear to produce a mean-difference plot
In some experiments, the design is more complicated than a two-class comparison. edgeR can deal with complicated designs via a Generalized Linear Model (GLM) approach.
As an example, we will consider the three-class comparison of YPD, Delft and Glycerol in the yeast data.
The main ingredient that we will need in the GLM approach is the design matrix. This matrix tells the model what are the important variables in the experiment. These can be either biological (e.g., growth condition, KO/WT, disease status, ...) or technical (e.g., library prep protocol, flow-cell, ...). To keep things simple, we will model only the growth condition.
Building the edgeR object
Again, we need to build an edgeR object (this time with all the samples)
group <- laneInfo[,2]
counts <- geneLevelCounts
cds <- DGEList( counts , group = group )
names(cds)
In addition, we need the design matrix. This can be constructed via the model.matrix function
design <- model.matrix(~group)
design
Normalization
The normalization step is the same as in the two-class comparison.
An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples.
plotMDS(cds, main="MDS Plot for Count Data", labels=colnames(cds$counts))
Estimating Dispersion
As before, we need to estimate dispersion. There are "GLM versions" of the functions we used before. Remember to add the design matrix as parameter.
The function glmFit fits a GLM to the data and the function glmLRT performs a Likelihood Ratio Test (LRT) for the null hypothesis that the variable of interest does not influence the gene expression. The coef argument refers to the column numbers of the design matrix to be tested. Alternatively, one can use contrasts to test a specific linear combination of the variables (see the package vignette for details).
fit <- glmFit(cds, design)
lrt <- glmLRT(fit, coef=2:3)
The topTags function can still be used to extract the significant genes.
top <- topTags(lrt, n=nrow(cds$counts))$table
head(top)
We can store the ID of the DE genes and look at the distribution of the p-values.
de <- rownames(top[top$FDR<0.05,])
length(de)
head(de)
hist(top$PValue, breaks=20)
Visualizing the results
We can use the function plotSmear to produce a mean-difference plot.
Risso D, Schwartz K, Sherlock G, Dudoit S (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics 12:480
Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11:R25.
Anders S, Huber W (2010). Differential expression analysis for sequence count data. Genome Biology 11:R106.
Bullard JH, Purdom E, Hansen KD, Dudoit S (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11:94
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887
Robinson MD and Smyth GK (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9, 321-332
Normalization and Differential Expression in RNA-Seq Data
Introduction
What is edgeR?
edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used in edgeR do NOT support FPKM, RPKM or other types of data that are not counts.You can obtain more information about edgeR from the bioconductor website. Here you’ll find a user’s guide/vignette with detailed examples, a reference manual for all edgeR functions, and the [R] code for installing the package. A useful discussion on a forum about edgeR can be found on a SeqAnswers Thread .
Getting Help with edgeR
The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive and very helpful; I have used it several times. To subscribe go to https://stat.ethz.ch/mailman/listinfo/bioconductor . For guides to posting questions go to http://www.bioconductor.org/help/mailing-list/posting-guide/ .If you just need help understanding an edgeR function or want to see an example of how it’s used simply type
into the [R] command line. This will bring up a text page with detailed information about the function
Getting Help with [R]
Google your question! Write a short description and add on at the end “R-help”. This will most likely bring up a thread in the R-help forum that deals with the same or similar problem. Almost every problem you could have can be found answered on this forum. You can also subscribe to the [R] mailing list by going to https://stat.ethz.ch/mailman/listinfo/r-help. A guide for posting questions can be found at http://www.r-project.org/posting-guide.htmlElizabeth Purdom has several [R] tutorials that can be found at http://www.stat.berkeley.edu/~epurdom/. In particular, see http://www.stat.berkeley.edu/~epurdom/Saving/Saving.html for using directories and saving objects and www.stat.berkeley.edu/~epurdom/BrownLab/ for creating and manipulating data sets in [R].
Installing edgeR
With an internet connection, edgeR can be easily installed by running the following commands in [R]source("http://www.bioconductor.org/biocLite.R") biocLite("edgeR")Make sure you have a recent version of R (>=2.15), or some of the initial commands will not work.
The Data
Usually, a RNA-Seq data analysis starts with a set of FASTQ files which contain information on both the sequence and the quality of the short reads.
There are several tools to align the reads to the reference genome (e.g. Bowtie, TopHat, ...).
You have just seen how to align the reads to a reference genome using Tophat. Assume we already performed this step for all the replicates and we produced a table of read counts for each gene.
We will consider an example based on the data analyzed in [1]. The Sherlock Lab in the Department of Genetics at Stanford sequenced 10 strains of Saccharomyces Cerevisiae grown in three media, namely YPD, Delft and Glycerol each with 3-4 biological replicates.
The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie, considering only unique mapping and allowing up to two mismatches.
The read count for a given gene is defined as the number of reads with 5’-end falling within the corresponding region. The gene-level counts for this example are provided in the yeastRNASeqRisso2011 R package.
In addition to the read counts, the R package contains some sample and gene information, such as GC-content and gene length.
To install the package you need to download the file from this link and in the directory where you downloaded the file type
If you can't install the package, download the data from here:
Exploratory Analysis
Setting up the count matrix
First, we load the data into [R] and make sure we can access the read countsWe want to filter out the non-expressed genes. For simplicity, we consider only the genes with an average read count of 10 or more.
Looking at the data
One of the main characteristics of RNA-Seq data is that each library will generally have a different sequencing depth.We can visualize the total number of mapped reads to known genes with the barplot function. To check for systematic effects we can color-code the plot by different biological or technical variables
The boxplot function provides an easy way to visualize the difference in distribution between each experiment.
A two-class comparison
As a first example, let's focus on the comparison between the three Delft and the three Glycerol yeast libraries.
Building the edgeR object
DGEList is the function that coverts the count matrix into an edgeR object.In addition to the counts, we need to group the samples according to the variable of interest in our experiment.
We can then see the elements that the object contains by using the names function
These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object.
Normalization
We can calculate the normalization factors which correct for the different sequencing depth of each library (or library size).By default, the function calcNormFactors normalize the data using the "weighted trimmed mean of M-values" (TMM) method, proposed by [2]. Other options are RLE [3] and upper-quartile [4]. If we want to use the upper-quartile to normalize, we can add an extra argument to the function
If we want the normalized pseudo-counts, useful for instance for cluster analysis, we can get them with the following commands.
Multi-Dimensional Scaling Plot
An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples.Estimating Dispersion
First, we need to estimate the common dispersion: this assumes that all the genes have the same dispersion.To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be
In real experiments, the assumption of common dispersion is almost never met. Often, we observe a relation between mean counts and dispersion, i.e., the more expressed genes have less dispersion.
The way edgeR estimates a tagwise (i.e. gene-wise) dispersion parameter is by "shrinking" the gene-wise dispersions toward a common value (the common dispersion estimated in the previous step). Alternatively, one can shrink the gene-wise estimates to a common trend, by estimating a smooth function prior to the shrinkage (using the estimateTrendedDisp function). Here we keep things simple and shrink the estimates to the common value
We can plot the biological coefficient of variation (i.e. the square root of the dispersions) with the following
Another useful plot is the mean-variance relation.
meanVarPlot <- plotMeanVar(cds, show.raw.vars=TRUE, show.tagwise.vars=TRUE, show.binned.common.disp.vars=FALSE, show.ave.raw.vars=FALSE, NBline = TRUE , nbins = 100 , #these are arguments about what is plotted pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) #these arguments are to make it look prettierTesting for DE
The function exactTest performs pair-wise tests for differential expression between two groups. The important parameter is pair which indicates which two groups should be compared. The output of exactTest is a list of elements: we can get the table of the results with the topTags function.We can store the ID of the DE genes and look at the distribution of the p-values
Visualizing the results
We can use the function plotSmear to produce a mean-difference plotAnother useful plot is the "volcano plot" that relates log-fold-changes and p-values
Finally, we can plot a heatmap of the top genes
Outputting the results
We can write the results to a file that we can later open in excel or in other programsA three-class comparison
In some experiments, the design is more complicated than a two-class comparison. edgeR can deal with complicated designs via a Generalized Linear Model (GLM) approach.
As an example, we will consider the three-class comparison of YPD, Delft and Glycerol in the yeast data.
The main ingredient that we will need in the GLM approach is the design matrix. This matrix tells the model what are the important variables in the experiment. These can be either biological (e.g., growth condition, KO/WT, disease status, ...) or technical (e.g., library prep protocol, flow-cell, ...). To keep things simple, we will model only the growth condition.
Building the edgeR object
Again, we need to build an edgeR object (this time with all the samples)In addition, we need the design matrix. This can be constructed via the model.matrix function
Normalization
The normalization step is the same as in the two-class comparison.We can retrieve the normalized pseudo-counts in the same way as in the previous section.
Multi-Dimensional Scaling Plot
An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples.Estimating Dispersion
As before, we need to estimate dispersion. There are "GLM versions" of the functions we used before. Remember to add the design matrix as parameter.We can plot the biological coefficient of variation and the mean-variance relation
Testing for DE
The function glmFit fits a GLM to the data and the function glmLRT performs a Likelihood Ratio Test (LRT) for the null hypothesis that the variable of interest does not influence the gene expression. The coef argument refers to the column numbers of the design matrix to be tested. Alternatively, one can use contrasts to test a specific linear combination of the variables (see the package vignette for details).The topTags function can still be used to extract the significant genes.
We can store the ID of the DE genes and look at the distribution of the p-values.
Visualizing the results
We can use the function plotSmear to produce a mean-difference plot.plotSmear(cds, de.tags=de, pair=c("YPD", "Del")) abline(h=c(-2, 2), col=colors[2]) plotSmear(cds, de.tags=de, pair=c("YPD", "Gly")) abline(h=c(-2, 2), col=colors[2]) plotSmear(cds, de.tags=de, pair=c("Del", "Gly")) abline(h=c(-2, 2), col=colors[2])Finally, we can plot a heatmap of the top genes.
Outputting the results
We can write the results to a file that we can later open in excel or in other programs.References