edgeR Tutorial: Differential Expression in RNA-Seq Data
Introduction
What is edgeR?
edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used in edgeR do NOT support FPKM, RPKM or other types of data that are not counts.
You can obtain more information about edgeR from the bioconductor website http://www.bioconductor.org/packages/2.8/bioc/html/edgeR.html . Here you’ll find a user’s guide/vignette with detailed examples, a reference manual for all edgeR functions, and the [R] code for installing the package. A useful discussion on a forum about edgeR can be found on a SeqAnswers Thread .
Getting Help with edgeR
The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive and very helpful; I have used it several times. To subscribe go to https://stat.ethz.ch/mailman/listinfo/bioconductor . For guides to posting questions go to http://www.bioconductor.org/help/mailing-list/posting-guide/ .
If you just need help understanding an edgeR function or want to see an example of how it’s used simply type
?functionname
into the [R] command line. This will bring up a text page with detailed information about the function.
Make sure you have the right version of R (2.13.1), or some of the initial commands will not work.
A Straightforward RNA-Seq Differential Expression Analysis using edgeR
Description
This is an RNA-Seq data set from an article by Li et al. [2008] titled, ”Determination of tag density required for digital transcriptome analysis: Application to an androgen-sensitive prostate cancer model.” The study compared hormone treated cancer cells to non-treated cancer cells. For the non-treated, there are 4 biological replicates. For the treated, there are 3 biological replicates. A full analysis of this data set can also be found in section 10 of the edgeR vignette.
Setting Up the Count Matrix
First, we load the egdeR package into [R] and set the directory to the folder where the data is contained. If copying from Windows finder, a “\\” is needed rather than the Windows "\" for the path. However, using the standard unix “/” is accepted on all platforms (Windows included). We will set the directory to be where the example data is contained.
DGEList() is the function that coverts the count matrix into an edgeR object. First, we create a group variable that tells edgeR with samples belong to which group and supply that to DGEList in addition to the count matrix. We can then see the elements that the object contains by using the names() function.
> group <- c(rep("C", 4) , rep("T", 3))
> cds <- DGEList( counts , group = group )
Calculating library sizes from column totals.
> names( cds )
[1] "samples" "counts" "all.zeros"
These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object.
> head(cds$counts) # original count matrix
C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000215696 0 0 0 0 0 0 0
ENSG00000215700 0 0 0 0 0 0 0
ENSG00000215699 0 0 0 0 0 0 0
ENSG00000215784 0 0 0 0 0 0 0
ENSG00000212914 0 0 0 0 0 0 0
ENSG00000212042 0 0 0 0 0 0 0
> cds$samples # contains a summary of your samples
group lib.size norm.factors
C_R1 C 978576 1
C_R2 C 1156844 1
C_R3 C 1442169 1
C_R4 C 1485604 1
T_R1 T 1823460 1
T_R2 T 1834335 1
T_R3 T 681743 1
> sum( cds$all.zeros ) # How many genes have 0 counts across all samples
[1] 15558
In fact, we can just type in cds and see summaries (but be warned that for some other objects, there may not be nice summarizations like this)
> cds # or type the name of the object
An object of class "DGEList"
$samples
group lib.size norm.factors
C_R1 C 978576 1
C_R2 C 1156844 1
C_R3 C 1442169 1
C_R4 C 1485604 1
T_R1 T 1823460 1
T_R2 T 1834335 1
T_R3 T 681743 1
$counts
C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000215696 0 0 0 0 0 0 0
ENSG00000215700 0 0 0 0 0 0 0
ENSG00000215699 0 0 0 0 0 0 0
ENSG00000215784 0 0 0 0 0 0 0
ENSG00000212914 0 0 0 0 0 0 0
37430 more rows ...
$all.zeros
ENSG00000215696 ENSG00000215700 ENSG00000215699 ENSG00000215784 ENSG00000212914
TRUE TRUE TRUE TRUE TRUE
37430 more elements ...
We need to filter out low count reads since it would be impossible to detect differential expression. The method used in the edgeR vignette is to keep only those genes that have at least 1 read per million in at least 3 samples.
Once this filtering is done, we can calculate the normalization factors which correct for the different compositions of the samples.
> cds <- calcNormFactors( cds )
> cds$samples
group lib.size norm.factors
C_R1 C 978576 1.0296636
C_R2 C 1156844 1.0372521
C_R3 C 1442169 1.0362662
C_R4 C 1485604 1.0378383
T_R1 T 1823460 0.9537095
T_R2 T 1834335 0.9525624
T_R3 T 681743 0.9583181
The effective library sizes are then the product of the actual library sizes and these factors.
An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples.
# To view the plot immediately
plotMDS.dge( cds , main = "MDS Plot for Count Data", labels = colnames( cds$counts ) )
When creating plots in R, you can also save plots as pdf’s, png’s, etc. (see ?pdf ).
# Output plot as a pdf
pdf( "MDS_plot_1_ex1.pdf" , width = 7 , height = 7 ) #sets up the pdf that will save to; width/height in inches
plotMDS.dge( cds , main = "MDS Plot for Count Data", labels = colnames( cds$counts ) )
dev.off() # this tells [R] to close and stop writing to the pdf.
Estimating Dispersions
The first dispersion type to calculate is the common dispersion. In the common dispersion setting, each gene gets assigned the same dispersion estimate. The output of the estimation will include the estimate as well as some other elements added to the edgeR object, cds.
To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be,
Once the common dispersion is estimated we can estimate the tagwise dispersions (i.e. gene-wise). In this scenario, each gene will get its own unique dispersion estimate, but the common dispersion is still used in the calculation. The tagwise dispersions are squeezed toward the common value. The amount of squeezing is governed by the paramter prior.n which by default is 10. The higher prior.n, the closer the estimates will be to the common dispersion. The recommended value is the nearest integer to 50/(#samples − #groups). For this data set that’s 50/(7 − 2) = 10, so it is equal to the default.
We can see what the effect is of changing the prior.n to 25. Not much changed, but the extremes got squeezed in quite a bit.
# More shrinkage/sqeezing toward the common
> cds <- estimateTagwiseDisp( cds , prior.n = 25 )
Using grid search to estimate tagwise dispersion.
> summary( cds$tagwise.dispersion )
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01631 0.02151 0.02151 0.02083 0.02151 0.07089
The recommended setting for this data set is the default of 10. Let’s stick with that. We'll set it back to the recommended for the rest of the tutorial
> cds <- estimateTagwiseDisp( cds , prior.n = 10 )
Using grid search to estimate tagwise dispersion.
Mean-Variance Plot
Now that we’ve estimated the dispersion parameters we can see how well they fit the data by plotting the mean-variance relationship. If you are working on a remote server, you will only see this plot if you have already set your display settings appropriately. If you are working with a R GUI on your laptop, the plot should show up right away on your screen.
> meanVarPlot <- plotMeanVar( cds ,show.raw.vars=TRUE , show.tagwise.vars=TRUE , show.binned.common.disp.vars=FALSE , show.ave.raw.vars=FALSE , dispersion.method = "qcml" , NBline = TRUE , nbins = 100 , #these are arguments about what is plotted
pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) #these arguments are to make it look prettier
Four things are shown in the plot: the raw variances of the counts (grey dots), the variances using the tagwise dispersions (light blue dots), the variances using the common dispersion (solid blue line), and the variance = mean a.k.a. poisson variance (solid black line). The plot function outputs the variances which will be stored in the data set meanVarPlot.
Testing
The function exactTest() performs pair-wise tests for differential expression between two groups. The important parameters are common.disp which if true uses the common dispersion estimate for testing and if false uses the tagwise dispersion estimates; and pair which indicates which two groups should be compared. The output of exactTest() is a list of elements, one of which is a table of the results. The third line of code uses the dispersion parameter to set a dispersion of zero, ie. the poisson model to which we can compare the negative binomial results.
> de.cmn <- exactTest( cds , common.disp = TRUE , pair = c( "C" , "T" ) ) #single common dispersion across all genes
Comparison of groups: T - C
> de.tgw <- exactTest( cds , common.disp = FALSE , pair = c( "C" , "T" ) ) #each gene its own dispersion
Comparison of groups: T - C
> de.poi <- exactTest( cds , dispersion = 1e-06 , pair = c( "C" , "T" ) ) #a poisson model (no dispersion)
Comparison of groups: T - C
The table from exactTest() does not contain p-values adjusted for multiple testing. The function topTags() takes the output from exactTest(), adjusts the raw p-values using the False Discovery Rate (FDR) correction, and returns the top differentially expressed genes. The output is similar to that of exactTest() but with a column of adjusted p-values and sorted by increasing p-value.
The sort.by argument allows you to sort the table by p-value, concentration or fold-change if desired. Also, by setting the n parameter to the total number of genes, we can save the entire topTags() results table.
Now that we have the full results table with the adjusted p-values we can compare those p-values to the significance level and see how many differentially expressed genes we find. Here we use a significance level of 0.05. In addition, we can use the function decideTestsDGE() to determine how many DE genes are up or down regulated compared to control.
First we find the names of the significantly different genes -- those with adjusted p values less than 0.05.
We can also can also compare the analyses to see how many DE gene are in common.
> sum( de.genes.tgw %in% de.genes.cmn ) / length( de.genes.tgw ) * 100 # Tagwise to Common
[1] 96.9
> sum( de.genes.cmn %in% de.genes.tgw ) / length( de.genes.cmn ) * 100 # Common to Tagwise
[1] 97.8
> sum( de.genes.tgw %in% de.genes.poi ) / length( de.genes.tgw ) * 100 # Tagwise to Poisson
[1] 100
>
> # Percent shared out of top 10, 100 & 1000 between tagwise and common
> sum( de.genes.tgw[1:10] %in% de.genes.cmn[1:10] ) / 10 * 100
[1] 80
> sum( de.genes.tgw[1:100] %in% de.genes.cmn[1:100] )
[1] 88
> sum( de.genes.tgw[1:1000] %in% de.genes.cmn[1:1000] ) / 1000 * 100
[1] 90.7
>
> # Percent shared out of top 10, 100 & 1000 between tagwise and poisson
> sum( de.genes.tgw[1:10] %in% de.genes.poi[1:10] ) / 10 * 100
[1] 90
> sum( de.genes.tgw[1:100] %in% de.genes.poi[1:100] )
[1] 55
> sum( de.genes.tgw[1:1000] %in% de.genes.poi[1:1000] ) / 1000 * 100
[1] 72.4
Visualizing Results
The first thing we’ll look at is the spread of expression levels for the top DE genes. Below, we plot a histogram of log concentrations for the top 100 genes for each analysis.
par( mfrow=c(3 ,1) ) #makes 3 plots on same screen
hist(resultsTbl.poi[de.genes.poi[1:100],"logConc"] , breaks=10 , xlab="Log Concentration" , col="red" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Poisson: Top 100" )
hist(resultsTbl.cmn[de.genes.cmn[1:100],"logConc"] , breaks=25 , xlab="Log Concentration" , col="green" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Common: Top 100" )
hist( resultsTbl.tgw[de.genes.tgw[1:100],"logConc"] , breaks=25 , xlab="Log Concentration" , col="blue" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Tagwise: Top 100" )
par( mfrow=c(1,1) ) #set it back to default
The next plot is an MA plot that shows the relationship between concentration and fold-change across the genes. The differentially expressed genes are colored red and the non-differentially expressed are colored black. The orange dots represent genes in which the counts were zero in all samples of one of the groups. Here, we just look at the tagwise and poisson analyses since the common analyses in this case is very similar. The blue line is added at a log-FC of 2 to represent a level for biological significance.
We can also see the difference in log concentrations between the DE genes under the negative binomial model and under the poisson model in the MA plot. Below, we plot only the top 500 DE genes and color the rest black. We can see that within the first 500 DE genes, the tagwise analysis puts more density toward the lower concentration values compared to the poisson.
The analysis is essentially complete and now we want to output a table or a csv file containing all the results. We’ll pull together the concentrations, fold-changes, p-values, adjusted p-values, up/down regulated variable, dispersions, and the count matrix. Here’s how that’s done:
First change column names in each table to be specific to the analysis; logConc and logFC are the same in both so we don't need to change those names.
We can combine these two datasets together with cbind(). In fact, we'll also throw in some other information, such as the tagwise dispersion estimates and whether the genes were up/down regulated.
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887
Robinson MD and Smyth GK (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9, 321-332
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289300.
edgeR Tutorial: Differential Expression in RNA-Seq Data
Introduction
What is edgeR?
edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used in edgeR do NOT support FPKM, RPKM or other types of data that are not counts.You can obtain more information about edgeR from the bioconductor website http://www.bioconductor.org/packages/2.8/bioc/html/edgeR.html . Here you’ll find a user’s guide/vignette with detailed examples, a reference manual for all edgeR functions, and the [R] code for installing the package. A useful discussion on a forum about edgeR can be found on a SeqAnswers Thread .
Getting Help with edgeR
The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive and very helpful; I have used it several times. To subscribe go to https://stat.ethz.ch/mailman/listinfo/bioconductor . For guides to posting questions go to http://www.bioconductor.org/help/mailing-list/posting-guide/ .If you just need help understanding an edgeR function or want to see an example of how it’s used simply type
into the [R] command line. This will bring up a text page with detailed information about the function.
Getting Help with [R]
Google your question! Write a short description and add on at the end “R-help”. This will most likely bring up a thread in the R-help forum that deals with the same or similar problem. Almost every problem you could have can be found answered on this forum. You can also subscribe to the [R] mailing list by going to https://stat.ethz.ch/mailman/listinfo/r-help. A guide for posting questions can be found at http://www.r-project.org/posting-guide.htmlElizabeth Purdom has several [R] tutorials that can be found at http://www.stat.berkeley.edu/~epurdom/. In particular, see http://www.stat.berkeley.edu/~epurdom/Saving/Saving.html for using directories and saving objects and www.stat.berkeley.edu/~epurdom/BrownLab/ for creating and manipulating data sets in [R].
Installing edgeR
With an internet connection, edgeR can be easily installed by running the following commands in [R]source("http://www.bioconductor.org/biocLite.R") biocLite("edgeR")Make sure you have the right version of R (2.13.1), or some of the initial commands will not work.
A Straightforward RNA-Seq Differential Expression Analysis using edgeR
Description
This is an RNA-Seq data set from an article by Li et al. [2008] titled, ”Determination of tag density required for digital transcriptome analysis: Application to an androgen-sensitive prostate cancer model.” The study compared hormone treated cancer cells to non-treated cancer cells. For the non-treated, there are 4 biological replicates. For the treated, there are 3 biological replicates. A full analysis of this data set can also be found in section 10 of the edgeR vignette.Setting Up the Count Matrix
First, we load the egdeR package into [R] and set the directory to the folder where the data is contained. If copying from Windows finder, a “\\” is needed rather than the Windows "\" for the path. However, using the standard unix “/” is accepted on all platforms (Windows included). We will set the directory to be where the example data is contained.> library(edgeR) > setwd("/global/courses/rnaseqworkshop/web/RNA_SEQ")The data is in a text file and we need to reading it into R and then we can look at it> raw.data <- read.table( file = "pnas_expression.txt" , header = TRUE ) > head( raw.data ) ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8 len 1 ENSG00000215696 0 0 0 0 0 0 0 330 2 ENSG00000215700 0 0 0 0 0 0 0 2370 3 ENSG00000215699 0 0 0 0 0 0 0 1842 4 ENSG00000215784 0 0 0 0 0 0 0 2393 5 ENSG00000212914 0 0 0 0 0 0 0 384 6 ENSG00000212042 0 0 0 0 0 0 0 92edgeR requires the the dataset to contain only the counts, with the row names as the gene ids and the column names as the sample ids:
> counts <- raw.data[ , -c(1,ncol(raw.data)) ] > rownames( counts ) <- raw.data[ , 1 ] # gene names > colnames( counts ) <- paste(c(rep("C_R",4),rep("T_R",3)),c(1:4,1:3),sep="") # sample names > head( counts ) C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000215696 0 0 0 0 0 0 0 ENSG00000215700 0 0 0 0 0 0 0 ENSG00000215699 0 0 0 0 0 0 0 ENSG00000215784 0 0 0 0 0 0 0 ENSG00000212914 0 0 0 0 0 0 0 ENSG00000212042 0 0 0 0 0 0 0Some quick summaries:
> dim( counts ) [1] 37435 7 > colSums( counts ) # Library Sizes C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 978576 1156844 1442169 1485604 1823460 1834335 681743 > colSums( counts ) / 1e06 # Library Sizes in millions of reads C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 0.978576 1.156844 1.442169 1.485604 1.823460 1.834335 0.681743 > table( rowSums( counts ) )[ 1:30 ] # Number of genes with low counts 0 1 2 3 4 5 6 7 8 9 10 11 12 15558 1826 1149 717 635 388 447 291 308 240 248 191 252 13 14 15 16 17 18 19 20 21 22 23 24 25 157 190 155 162 126 166 99 169 125 123 100 124 100 26 27 28 29 87 100 103 73Building the edgeR Object
DGEList() is the function that coverts the count matrix into an edgeR object. First, we create a group variable that tells edgeR with samples belong to which group and supply that to DGEList in addition to the count matrix. We can then see the elements that the object contains by using the names() function.
> group <- c(rep("C", 4) , rep("T", 3)) > cds <- DGEList( counts , group = group ) Calculating library sizes from column totals. > names( cds ) [1] "samples" "counts" "all.zeros"These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object.> head(cds$counts) # original count matrix C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000215696 0 0 0 0 0 0 0 ENSG00000215700 0 0 0 0 0 0 0 ENSG00000215699 0 0 0 0 0 0 0 ENSG00000215784 0 0 0 0 0 0 0 ENSG00000212914 0 0 0 0 0 0 0 ENSG00000212042 0 0 0 0 0 0 0 > cds$samples # contains a summary of your samples group lib.size norm.factors C_R1 C 978576 1 C_R2 C 1156844 1 C_R3 C 1442169 1 C_R4 C 1485604 1 T_R1 T 1823460 1 T_R2 T 1834335 1 T_R3 T 681743 1 > sum( cds$all.zeros ) # How many genes have 0 counts across all samples [1] 15558In fact, we can just type in cds and see summaries (but be warned that for some other objects, there may not be nice summarizations like this)> cds # or type the name of the object An object of class "DGEList" $samples group lib.size norm.factors C_R1 C 978576 1 C_R2 C 1156844 1 C_R3 C 1442169 1 C_R4 C 1485604 1 T_R1 T 1823460 1 T_R2 T 1834335 1 T_R3 T 681743 1 $counts C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000215696 0 0 0 0 0 0 0 ENSG00000215700 0 0 0 0 0 0 0 ENSG00000215699 0 0 0 0 0 0 0 ENSG00000215784 0 0 0 0 0 0 0 ENSG00000212914 0 0 0 0 0 0 0 37430 more rows ... $all.zeros ENSG00000215696 ENSG00000215700 ENSG00000215699 ENSG00000215784 ENSG00000212914 TRUE TRUE TRUE TRUE TRUE 37430 more elements ...We need to filter out low count reads since it would be impossible to detect differential expression. The method used in the edgeR vignette is to keep only those genes that have at least 1 read per million in at least 3 samples.
Once this filtering is done, we can calculate the normalization factors which correct for the different compositions of the samples.
> cds <- calcNormFactors( cds ) > cds$samples group lib.size norm.factors C_R1 C 978576 1.0296636 C_R2 C 1156844 1.0372521 C_R3 C 1442169 1.0362662 C_R4 C 1485604 1.0378383 T_R1 T 1823460 0.9537095 T_R2 T 1834335 0.9525624 T_R3 T 681743 0.9583181The effective library sizes are then the product of the actual library sizes and these factors.Multi-Dimensional Scaling Plot
An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples.When creating plots in R, you can also save plots as pdf’s, png’s, etc. (see ?pdf ).
Estimating Dispersions
The first dispersion type to calculate is the common dispersion. In the common dispersion setting, each gene gets assigned the same dispersion estimate. The output of the estimation will include the estimate as well as some other elements added to the edgeR object, cds.
We can see the estimate of the common dispersion:
To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be,
Once the common dispersion is estimated we can estimate the tagwise dispersions (i.e. gene-wise). In this scenario, each gene will get its own unique dispersion estimate, but the common dispersion is still used in the calculation. The tagwise dispersions are squeezed toward the common value. The amount of squeezing is governed by the paramter prior.n which by default is 10. The higher prior.n, the closer the estimates will be to the common dispersion. The recommended value is the nearest integer to 50/(#samples − #groups). For this data set that’s 50/(7 − 2) = 10, so it is equal to the default.
We can see what the effect is of changing the prior.n to 25. Not much changed, but the extremes got squeezed in quite a bit.
The recommended setting for this data set is the default of 10. Let’s stick with that. We'll set it back to the recommended for the rest of the tutorial
Mean-Variance Plot
Now that we’ve estimated the dispersion parameters we can see how well they fit the data by plotting the mean-variance relationship. If you are working on a remote server, you will only see this plot if you have already set your display settings appropriately. If you are working with a R GUI on your laptop, the plot should show up right away on your screen.> meanVarPlot <- plotMeanVar( cds ,show.raw.vars=TRUE , show.tagwise.vars=TRUE , show.binned.common.disp.vars=FALSE , show.ave.raw.vars=FALSE , dispersion.method = "qcml" , NBline = TRUE , nbins = 100 , #these are arguments about what is plotted pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) #these arguments are to make it look prettierFour things are shown in the plot: the raw variances of the counts (grey dots), the variances using the tagwise dispersions (light blue dots), the variances using the common dispersion (solid blue line), and the variance = mean a.k.a. poisson variance (solid black line). The plot function outputs the variances which will be stored in the data set meanVarPlot.Testing
The function exactTest() performs pair-wise tests for differential expression between two groups. The important parameters are common.disp which if true uses the common dispersion estimate for testing and if false uses the tagwise dispersion estimates; and pair which indicates which two groups should be compared. The output of exactTest() is a list of elements, one of which is a table of the results. The third line of code uses the dispersion parameter to set a dispersion of zero, ie. the poisson model to which we can compare the negative binomial results.We can look at these objects
> names( de.tgw ) [1] "table" "comparison" "genes" > de.tgw$comparison # which groups have been compared [1] "C" "T" > head( de.tgw$table ) # results table in order of your count matrix. logConc logFC p.value ENSG00000124208 -11.26081 -0.43041686 0.01255780 ENSG00000182463 -15.32155 0.69290959 0.00436565 ENSG00000124201 -12.43969 -0.05991576 0.71782933 ENSG00000124207 -14.11940 -0.42126690 0.03996537 ENSG00000125835 -12.87722 -0.24892772 0.19096504 ENSG00000125834 -14.15590 0.30959561 0.15119062 > head( cds$counts ) C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000124208 478 619 628 744 483 716 240 ENSG00000182463 27 20 27 26 48 55 24 ENSG00000124201 180 218 293 275 373 301 88 ENSG00000124207 76 80 85 97 80 81 37 ENSG00000125835 132 200 200 228 280 204 52 ENSG00000125834 42 60 72 86 131 99 30The table from exactTest() does not contain p-values adjusted for multiple testing. The function topTags() takes the output from exactTest(), adjusts the raw p-values using the False Discovery Rate (FDR) correction, and returns the top differentially expressed genes. The output is similar to that of exactTest() but with a column of adjusted p-values and sorted by increasing p-value.> # Top tags for tagwise analysis > options( digits = 3 ) # print only 3 digits > topTags( de.tgw , n = 20 , sort.by = "p.value" ) # top 20 DE genes Comparison of groups: T-C logConc logFC P.Value FDR ENSG00000151503 -11.94 5.82 1.09e-311 1.79e-307 ENSG00000096060 -11.32 5.01 1.51e-269 1.25e-265 ENSG00000127954 -15.62 8.24 1.61e-171 8.85e-168 ENSG00000166451 -12.28 4.69 7.20e-159 2.97e-155 ENSG00000162772 -10.81 3.32 1.32e-144 4.35e-141 ENSG00000116133 -11.73 3.25 2.68e-126 7.38e-123 ENSG00000113594 -12.83 4.12 4.68e-119 1.10e-115 ENSG00000123983 -12.09 3.66 2.06e-109 4.25e-106 ENSG00000116285 -13.56 4.21 8.94e-107 1.64e-103 ENSG00000115648 -8.82 2.60 1.65e-102 2.73e-99 ENSG00000166086 -15.24 5.51 6.47e-100 9.71e-97 ENSG00000130066 -10.31 2.61 7.12e-98 9.78e-95 ENSG00000140263 -12.17 2.83 1.33e-92 1.68e-89 ENSG00000131016 -14.42 5.31 1.07e-91 1.26e-88 ENSG00000091879 -14.88 4.55 1.40e-83 1.54e-80 ENSG00000116883 -15.02 4.70 2.07e-83 2.13e-80 ENSG00000116574 -14.52 4.12 6.46e-81 6.27e-78 ENSG00000162545 -14.09 -4.27 6.59e-80 6.04e-77 ENSG00000167751 -11.67 2.95 3.92e-79 3.41e-76 ENSG00000067113 -10.28 2.32 7.50e-79 6.18e-76We can also use topTags() to return the original counts of the top differentially expressed genes.> cds$counts[ rownames( topTags( de.tgw , n = 15 )$table ) , ] C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000151503 35 35 49 59 3307 3439 1224 ENSG00000096060 65 79 105 113 3975 3727 1451 ENSG00000127954 0 0 3 3 607 602 220 ENSG00000166451 41 52 57 57 1750 1654 728 ENSG00000162772 172 204 250 304 2972 3269 1112 ENSG00000116133 97 107 154 143 1632 1611 555 ENSG00000113594 37 36 57 43 936 959 418 ENSG00000123983 62 76 94 108 1354 1258 628 ENSG00000116285 18 28 23 32 645 630 218 ENSG00000115648 940 1084 1317 1345 9730 9942 3272 ENSG00000166086 9 2 3 6 296 298 121 ENSG00000130066 309 386 497 477 3198 3306 1374 ENSG00000140263 79 103 122 122 978 1007 386 ENSG00000131016 9 5 18 6 564 377 213 ENSG00000091879 8 6 8 14 276 274 110The sort.by argument allows you to sort the table by p-value, concentration or fold-change if desired. Also, by setting the n parameter to the total number of genes, we can save the entire topTags() results table.> # Sort tagwise results by Fold-Change instead of p-value, and get all genes > resultsByFC.tgw <- topTags( de.tgw , n = nrow( de.tgw$table ) , sort.by = "logFC" )$table > head( resultsByFC.tgw ) logConc logFC P.Value adj.P.Val ENSG00000091972 -31.8 -36.5 7.87e-59 3.33e-56 ENSG00000164120 -32.2 35.5 1.88e-44 4.55e-42 ENSG00000100373 -33.0 -34.1 4.16e-16 1.79e-14 ENSG00000118513 -33.0 -34.0 1.37e-14 5.03e-13 ENSG00000081237 -33.2 -33.7 3.75e-12 1.09e-10 ENSG00000196660 -33.2 -33.5 2.07e-11 5.49e-10We can store these results for all three of the methods> # Store full topTags results table > resultsTbl.cmn <- topTags( de.cmn , n = nrow( de.cmn$table ) )$table > resultsTbl.tgw <- topTags( de.tgw , n = nrow( de.tgw$table ) )$table > resultsTbl.poi <- topTags( de.poi , n = nrow( de.poi$table ) )$table > head( resultsTbl.tgw ) logConc logFC P.Value adj.P.Val ENSG00000151503 -11.9 5.82 1.09e-311 1.79e-307 ENSG00000096060 -11.3 5.01 1.51e-269 1.25e-265 ENSG00000127954 -15.6 8.24 1.61e-171 8.85e-168 ENSG00000166451 -12.3 4.69 7.20e-159 2.97e-155 ENSG00000162772 -10.8 3.32 1.32e-144 4.35e-141 ENSG00000116133 -11.7 3.25 2.68e-126 7.38e-123Now that we have the full results table with the adjusted p-values we can compare those p-values to the significance level and see how many differentially expressed genes we find. Here we use a significance level of 0.05. In addition, we can use the function decideTestsDGE() to determine how many DE genes are up or down regulated compared to control.First we find the names of the significantly different genes -- those with adjusted p values less than 0.05.
We can see how many of them there are by looking at the length.
We can also get a summary of the up/down regulation. Here we show for the tagwise results.
We can also can also compare the analyses to see how many DE gene are in common.
Visualizing Results
The first thing we’ll look at is the spread of expression levels for the top DE genes. Below, we plot a histogram of log concentrations for the top 100 genes for each analysis.The next plot is an MA plot that shows the relationship between concentration and fold-change across the genes. The differentially expressed genes are colored red and the non-differentially expressed are colored black. The orange dots represent genes in which the counts were zero in all samples of one of the groups. Here, we just look at the tagwise and poisson analyses since the common analyses in this case is very similar. The blue line is added at a log-FC of 2 to represent a level for biological significance.
par( mfrow=c(2,1) ) plotSmear(cds , de.tags=de.genes.poi , main="Poisson" , pair = c("C","T") , cex = .35 , xlab="Log Concentration" , ylab="Log Fold-Change" ) = c(-2, 2) , col = "dodgerblue" ) abline( h=c(-2,2) , col="dodgerblue" ) plotSmear( cds , de.tags=de.genes.tgw , main="Tagwise" , pair = c("C","T") , cex = .35 , xlab="Log Concentration" , ylab="Log Fold-Change" ) abline( h=c(-2,2) , col="dodgerblue" ) par( mfrow=c(1,1) )We can also see the difference in log concentrations between the DE genes under the negative binomial model and under the poisson model in the MA plot. Below, we plot only the top 500 DE genes and color the rest black. We can see that within the first 500 DE genes, the tagwise analysis puts more density toward the lower concentration values compared to the poisson.par( mfrow = c(2,1) ) plotSmear( cds , de.tags=de.genes.poi[1:500] , main="Poisson" ,pair=c("C","T") , cex=.5 , xlab="Log Concentration" , ylab="Log Fold-Change" ) abline( h=c(-2,2) , col="dodgerblue" ) plotSmear( cds , de.tags=de.genes.tgw[1:500] , main="Tagwise" , pair=c("C","T") ,cex = .5 , xlab="Log Concentration" , ylab="Log Fold-Change" ) abline( h=c(-2,2) , col="dodgerblue" ) par( mfrow=c(1,1) )Outputting Results
The analysis is essentially complete and now we want to output a table or a csv file containing all the results. We’ll pull together the concentrations, fold-changes, p-values, adjusted p-values, up/down regulated variable, dispersions, and the count matrix. Here’s how that’s done:First change column names in each table to be specific to the analysis; logConc and logFC are the same in both so we don't need to change those names.
If we'd like, we could just write these tables to a file
If we'd like to keep the count information too, we need to re-order the count matrix to be in line with the order of the results
We can combine these two datasets together with cbind(). In fact, we'll also throw in some other information, such as the tagwise dispersion estimates and whether the genes were up/down regulated.
> combResults.tgw <- cbind(resultsTbl.tgw ,"Tgw.Disp" = cds$tagwise.dispersion[ wh.rows.tgw ] ,"UpDown.Tgw" = decideTestsDGE( de.tgw , p.value = 0.05 )[ wh.rows.tgw ] ,cds$counts[ wh.rows.tgw , ] ) > head( combResults.tgw ) logConc logFC pVal.Tgw adj.pVal.Tgw Tgw.Disp UpDown.Tgw C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000151503 -11.9 5.82 1.09e-311 1.79e-307 0.0112 1 35 35 49 59 3307 3439 1224 ENSG00000096060 -11.3 5.01 1.51e-269 1.25e-265 0.0112 1 65 79 105 113 3975 3727 1451 ENSG00000127954 -15.6 8.24 1.61e-171 8.85e-168 0.0163 1 0 0 3 3 607 602 220 ENSG00000166451 -12.3 4.69 7.20e-159 2.97e-155 0.0163 1 41 52 57 57 1750 1654 728 ENSG00000162772 -10.8 3.32 1.32e-144 4.35e-141 0.0112 1 172 204 250 304 2972 3269 1112 ENSG00000116133 -11.7 3.25 2.68e-126 7.38e-123 0.0112 1 97 107 154 143 1632 1611 555Again, we can write this to a fileWe can do the same for both and then combine all the information together
> combResults.cmn <- cbind(resultsTbl.cmn ,"Cmn.Disp" = cds$common.dispersion ,"UpDown.Cmn" = decideTestsDGE( de.cmn , p.value = 0.05 )[ wh.rows.cmn ] ,cds$counts[ wh.rows.cmn , ] ) > > wh.rows <- match( rownames( combResults.cmn ) , rownames( combResults.tgw ) ) > > combResults.all <- cbind(combResults.cmn[,1:4] ,combResults.tgw[wh.rows,3:4] ,"Cmn.Disp" = combResults.cmn[,5],"Tgw.Disp" = combResults.tgw[wh.rows,5],"UpDown.Cmn" = combResults.cmn[,6],"UpDown.Tgw" = combResults.tgw[wh.rows,6],combResults.cmn[,7:ncol(combResults.cmn)] ) > head( combResults.all ) logConc logFC pVal.Cmn adj.pVal.Cmn pVal.Tgw adj.pVal.Tgw Cmn.Disp Tgw.Disp UpDown.Cmn UpDown.Tgw ENSG00000151503 -11.9 5.82 6.51e-193 1.07e-188 1.09e-311 1.79e-307 0.02 0.0112 1 1 ENSG00000096060 -11.3 5.01 1.23e-162 1.02e-158 1.51e-269 1.25e-265 0.02 0.0112 1 1 ENSG00000127954 -15.6 8.24 2.43e-153 1.34e-149 1.61e-171 8.85e-168 0.02 0.0163 1 1 ENSG00000166451 -12.3 4.69 1.05e-134 4.32e-131 7.20e-159 2.97e-155 0.02 0.0163 1 1 ENSG00000131016 -14.4 5.31 4.12e-110 1.36e-106 1.07e-91 1.26e-88 0.02 0.0268 1 1 ENSG00000113594 -12.8 4.12 5.31e-102 1.46e-98 4.68e-119 1.10e-115 0.02 0.0163 1 1 C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000151503 35 35 49 59 3307 3439 1224 ENSG00000096060 65 79 105 113 3975 3727 1451 ENSG00000127954 0 0 3 3 607 602 220 ENSG00000166451 41 52 57 57 1750 1654 728 ENSG00000131016 9 5 18 6 564 377 213 ENSG00000113594 37 36 57 43 936 959 418 > write.table( combResults.all , file = "combResults_all_ex1.csv" , sep = "," , row.names = TRUE )References