edgeR Tutorial: Differential Expression in RNA-Seq Data


Introduction


What is edgeR?

edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used in edgeR do NOT support FPKM, RPKM or other types of data that are not counts.
You can obtain more information about edgeR from the bioconductor website http://www.bioconductor.org/packages/2.8/bioc/html/edgeR.html . Here you’ll find a user’s guide/vignette with detailed examples, a reference manual for all edgeR functions, and the [R] code for installing the package. A useful discussion on a forum about edgeR can be found on a SeqAnswers Thread .

Getting Help with edgeR

The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive and very helpful; I have used it several times. To subscribe go to https://stat.ethz.ch/mailman/listinfo/bioconductor . For guides to posting questions go to http://www.bioconductor.org/help/mailing-list/posting-guide/ .
If you just need help understanding an edgeR function or want to see an example of how it’s used simply type
?functionname
into the [R] command line. This will bring up a text page with detailed information about the function.

Getting Help with [R]

Google your question! Write a short description and add on at the end “R-help”. This will most likely bring up a thread in the R-help forum that deals with the same or similar problem. Almost every problem you could have can be found answered on this forum. You can also subscribe to the [R] mailing list by going to https://stat.ethz.ch/mailman/listinfo/r-help. A guide for posting questions can be found at http://www.r-project.org/posting-guide.html
Elizabeth Purdom has several [R] tutorials that can be found at http://www.stat.berkeley.edu/~epurdom/. In particular, see http://www.stat.berkeley.edu/~epurdom/Saving/Saving.html for using directories and saving objects and www.stat.berkeley.edu/~epurdom/BrownLab/ for creating and manipulating data sets in [R].

Installing edgeR

With an internet connection, edgeR can be easily installed by running the following commands in [R]
source("http://www.bioconductor.org/biocLite.R")
biocLite("edgeR")

Make sure you have the right version of R (2.13.1), or some of the initial commands will not work.



A Straightforward RNA-Seq Differential Expression Analysis using edgeR


Description

This is an RNA-Seq data set from an article by Li et al. [2008] titled, ”Determination of tag density required for digital transcriptome analysis: Application to an androgen-sensitive prostate cancer model.” The study compared hormone treated cancer cells to non-treated cancer cells. For the non-treated, there are 4 biological replicates. For the treated, there are 3 biological replicates. A full analysis of this data set can also be found in section 10 of the edgeR vignette.

Setting Up the Count Matrix

First, we load the egdeR package into [R] and set the directory to the folder where the data is contained. If copying from Windows finder, a “\\” is needed rather than the Windows "\" for the path. However, using the standard unix “/” is accepted on all platforms (Windows included). We will set the directory to be where the example data is contained.

> library(edgeR)
> setwd("/global/courses/rnaseqworkshop/web/RNA_SEQ")
The data is in a text file and we need to reading it into R and then we can look at it
> raw.data <- read.table( file = "pnas_expression.txt" , header = TRUE )
> head( raw.data )
       ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
1 ENSG00000215696     0     0     0     0     0     0     0  330
2 ENSG00000215700     0     0     0     0     0     0     0 2370
3 ENSG00000215699     0     0     0     0     0     0     0 1842
4 ENSG00000215784     0     0     0     0     0     0     0 2393
5 ENSG00000212914     0     0     0     0     0     0     0  384
6 ENSG00000212042     0     0     0     0     0     0     0   92
 

edgeR requires the the dataset to contain only the counts, with the row names as the gene ids and the column names as the sample ids:
> counts <- raw.data[ , -c(1,ncol(raw.data)) ]
> rownames( counts ) <- raw.data[ , 1 ] # gene names
> colnames( counts ) <- paste(c(rep("C_R",4),rep("T_R",3)),c(1:4,1:3),sep="") # sample names
> head( counts )
                C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000215696    0    0    0    0    0    0    0
ENSG00000215700    0    0    0    0    0    0    0
ENSG00000215699    0    0    0    0    0    0    0
ENSG00000215784    0    0    0    0    0    0    0
ENSG00000212914    0    0    0    0    0    0    0
ENSG00000212042    0    0    0    0    0    0    0
 

Some quick summaries:
> dim( counts )
[1] 37435     7
> colSums( counts ) # Library Sizes
   C_R1    C_R2    C_R3    C_R4    T_R1    T_R2    T_R3
 978576 1156844 1442169 1485604 1823460 1834335  681743
> colSums( counts ) / 1e06 # Library Sizes in millions of reads
    C_R1     C_R2     C_R3     C_R4     T_R1     T_R2     T_R3
0.978576 1.156844 1.442169 1.485604 1.823460 1.834335 0.681743
> table( rowSums( counts ) )[ 1:30 ] # Number of genes with low counts
 
    0     1     2     3     4     5     6     7     8     9    10    11    12
15558  1826  1149   717   635   388   447   291   308   240   248   191   252
   13    14    15    16    17    18    19    20    21    22    23    24    25
  157   190   155   162   126   166    99   169   125   123   100   124   100
   26    27    28    29
   87   100   103    73




Building the edgeR Object


DGEList() is the function that coverts the count matrix into an edgeR object. First, we create a group variable that tells edgeR with samples belong to which group and supply that to DGEList in addition to the count matrix. We can then see the elements that the object contains by using the names() function.

> group <- c(rep("C", 4) , rep("T", 3))
> cds <- DGEList( counts , group = group )
Calculating library sizes from column totals.
> names( cds )
[1] "samples"   "counts"    "all.zeros"
 
These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object.
> head(cds$counts) # original count matrix
                C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000215696    0    0    0    0    0    0    0
ENSG00000215700    0    0    0    0    0    0    0
ENSG00000215699    0    0    0    0    0    0    0
ENSG00000215784    0    0    0    0    0    0    0
ENSG00000212914    0    0    0    0    0    0    0
ENSG00000212042    0    0    0    0    0    0    0
> cds$samples # contains a summary of your samples
     group lib.size norm.factors
C_R1     C   978576            1
C_R2     C  1156844            1
C_R3     C  1442169            1
C_R4     C  1485604            1
T_R1     T  1823460            1
T_R2     T  1834335            1
T_R3     T   681743            1
> sum( cds$all.zeros ) # How many genes have 0 counts across all samples
[1] 15558
 
In fact, we can just type in cds and see summaries (but be warned that for some other objects, there may not be nice summarizations like this)
> cds # or type the name of the object
An object of class "DGEList"
$samples
     group lib.size norm.factors
C_R1     C   978576            1
C_R2     C  1156844            1
C_R3     C  1442169            1
C_R4     C  1485604            1
T_R1     T  1823460            1
T_R2     T  1834335            1
T_R3     T   681743            1
 
$counts
                C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000215696    0    0    0    0    0    0    0
ENSG00000215700    0    0    0    0    0    0    0
ENSG00000215699    0    0    0    0    0    0    0
ENSG00000215784    0    0    0    0    0    0    0
ENSG00000212914    0    0    0    0    0    0    0
37430 more rows ...
 
$all.zeros
ENSG00000215696 ENSG00000215700 ENSG00000215699 ENSG00000215784 ENSG00000212914
           TRUE            TRUE            TRUE            TRUE            TRUE
37430 more elements ...
 

We need to filter out low count reads since it would be impossible to detect differential expression. The method used in the edgeR vignette is to keep only those genes that have at least 1 read per million in at least 3 samples.

> cds <- cds[rowSums(1e+06 * cds$counts/expandAsMatrix(cds$samples$lib.size, dim(cds)) > 1) >= 3, ]
> dim( cds )
[1] 16494     7
Once this filtering is done, we can calculate the normalization factors which correct for the different compositions of the samples.
> cds <- calcNormFactors( cds )
> cds$samples
     group lib.size norm.factors
C_R1     C   978576    1.0296636
C_R2     C  1156844    1.0372521
C_R3     C  1442169    1.0362662
C_R4     C  1485604    1.0378383
T_R1     T  1823460    0.9537095
T_R2     T  1834335    0.9525624
T_R3     T   681743    0.9583181
The effective library sizes are then the product of the actual library sizes and these factors.
# effective library sizes
> cds$samples$lib.size*cds$samples$norm.factors
[1] 1007604.1 1199938.9 1494471.0 1541816.7 1739051.1 1747318.6 653326.7

Multi-Dimensional Scaling Plot

An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples.
# To view the plot immediately
plotMDS.dge( cds , main = "MDS Plot for Count Data", labels = colnames( cds$counts ) )
When creating plots in R, you can also save plots as pdf’s, png’s, etc. (see ?pdf ).
# Output plot as a pdf
pdf( "MDS_plot_1_ex1.pdf" , width = 7 , height = 7 ) #sets up the pdf that will save to; width/height in inches
plotMDS.dge( cds , main = "MDS Plot for Count Data", labels = colnames( cds$counts ) )
dev.off() # this tells [R] to close and stop writing to the pdf.

Estimating Dispersions


The first dispersion type to calculate is the common dispersion. In the common dispersion setting, each gene gets assigned the same dispersion estimate. The output of the estimation will include the estimate as well as some other elements added to the edgeR object, cds.
> cds <- estimateCommonDisp( cds )
> names( cds )
[1] "samples"           "common.dispersion" "counts"            "pseudo.alt"        "genes"
[6] "all.zeros"         "conc"              "common.lib.size"
We can see the estimate of the common dispersion:
> cds$common.dispersion
 [1] 0.0199892
 
To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be,
> sqrt( 200 ) # poisson sd
[1] 14.14214
> sqrt( 200 + 200^2 * cds$common.dispersion ) # negative binomial
[1] 31.61595
> sqrt( 200 + 200^2 * cds$common.dispersion ) / sqrt( 200 ) # NB sd is over 2 times larger
[1] 2.235585
 
Once the common dispersion is estimated we can estimate the tagwise dispersions (i.e. gene-wise). In this scenario, each gene will get its own unique dispersion estimate, but the common dispersion is still used in the calculation. The tagwise dispersions are squeezed toward the common value. The amount of squeezing is governed by the paramter prior.n which by default is 10. The higher prior.n, the closer the estimates will be to the common dispersion. The recommended value is the nearest integer to 50/(#samples − #groups). For this data set that’s 50/(7 − 2) = 10, so it is equal to the default.
> cds <- estimateTagwiseDisp( cds , prior.n = 10 )
Using grid search to estimate tagwise dispersion.
> names( cds )
 [1] "samples"            "common.dispersion"  "prior.n"            "tagwise.dispersion" "counts"
 [6] "pseudo.alt"         "genes"              "all.zeros"          "conc"               "common.lib.size"
> summary( cds$tagwise.dispersion )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01115 0.01631 0.02151 0.02094 0.02151 0.15130
 
We can see what the effect is of changing the prior.n to 25. Not much changed, but the extremes got squeezed in quite a bit.
# More shrinkage/sqeezing toward the common
> cds <- estimateTagwiseDisp( cds , prior.n = 25 )
Using grid search to estimate tagwise dispersion.
> summary( cds$tagwise.dispersion )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01631 0.02151 0.02151 0.02083 0.02151 0.07089
 
The recommended setting for this data set is the default of 10. Let’s stick with that. We'll set it back to the recommended for the rest of the tutorial
> cds <- estimateTagwiseDisp( cds , prior.n = 10 )
Using grid search to estimate tagwise dispersion.

Mean-Variance Plot

Now that we’ve estimated the dispersion parameters we can see how well they fit the data by plotting the mean-variance relationship. If you are working on a remote server, you will only see this plot if you have already set your display settings appropriately. If you are working with a R GUI on your laptop, the plot should show up right away on your screen.
> meanVarPlot <- plotMeanVar( cds ,show.raw.vars=TRUE , show.tagwise.vars=TRUE , show.binned.common.disp.vars=FALSE , show.ave.raw.vars=FALSE , dispersion.method = "qcml" , NBline = TRUE , nbins = 100 , #these are arguments about what is plotted
    pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) #these arguments are to make it look prettier
Four things are shown in the plot: the raw variances of the counts (grey dots), the variances using the tagwise dispersions (light blue dots), the variances using the common dispersion (solid blue line), and the variance = mean a.k.a. poisson variance (solid black line). The plot function outputs the variances which will be stored in the data set meanVarPlot.



Testing

The function exactTest() performs pair-wise tests for differential expression between two groups. The important parameters are common.disp which if true uses the common dispersion estimate for testing and if false uses the tagwise dispersion estimates; and pair which indicates which two groups should be compared. The output of exactTest() is a list of elements, one of which is a table of the results. The third line of code uses the dispersion parameter to set a dispersion of zero, ie. the poisson model to which we can compare the negative binomial results.

> de.cmn <- exactTest( cds , common.disp = TRUE , pair = c( "C" , "T" ) ) #single common dispersion across all genes
Comparison of groups:  T - C
> de.tgw <- exactTest( cds , common.disp = FALSE , pair = c( "C" , "T" ) ) #each gene its own dispersion
Comparison of groups:  T - C
> de.poi <- exactTest( cds , dispersion = 1e-06 , pair = c( "C" , "T" ) ) #a poisson model (no dispersion)
Comparison of groups:  T - C
 
We can look at these objects
> names( de.tgw )
[1] "table"      "comparison" "genes"
> de.tgw$comparison # which groups have been compared
[1] "C" "T"
> head( de.tgw$table ) # results table in order of your count matrix.
                  logConc       logFC    p.value
ENSG00000124208 -11.26081 -0.43041686 0.01255780
ENSG00000182463 -15.32155  0.69290959 0.00436565
ENSG00000124201 -12.43969 -0.05991576 0.71782933
ENSG00000124207 -14.11940 -0.42126690 0.03996537
ENSG00000125835 -12.87722 -0.24892772 0.19096504
ENSG00000125834 -14.15590  0.30959561 0.15119062
> head( cds$counts )
                C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000124208  478  619  628  744  483  716  240
ENSG00000182463   27   20   27   26   48   55   24
ENSG00000124201  180  218  293  275  373  301   88
ENSG00000124207   76   80   85   97   80   81   37
ENSG00000125835  132  200  200  228  280  204   52
ENSG00000125834   42   60   72   86  131   99   30
The table from exactTest() does not contain p-values adjusted for multiple testing. The function topTags() takes the output from exactTest(), adjusts the raw p-values using the False Discovery Rate (FDR) correction, and returns the top differentially expressed genes. The output is similar to that of exactTest() but with a column of adjusted p-values and sorted by increasing p-value.
> # Top tags for tagwise analysis
> options( digits = 3 ) # print only 3 digits
> topTags( de.tgw , n = 20 , sort.by = "p.value" ) # top 20 DE genes
Comparison of groups: T-C
                logConc logFC   P.Value       FDR
ENSG00000151503  -11.94  5.82 1.09e-311 1.79e-307
ENSG00000096060  -11.32  5.01 1.51e-269 1.25e-265
ENSG00000127954  -15.62  8.24 1.61e-171 8.85e-168
ENSG00000166451  -12.28  4.69 7.20e-159 2.97e-155
ENSG00000162772  -10.81  3.32 1.32e-144 4.35e-141
ENSG00000116133  -11.73  3.25 2.68e-126 7.38e-123
ENSG00000113594  -12.83  4.12 4.68e-119 1.10e-115
ENSG00000123983  -12.09  3.66 2.06e-109 4.25e-106
ENSG00000116285  -13.56  4.21 8.94e-107 1.64e-103
ENSG00000115648   -8.82  2.60 1.65e-102  2.73e-99
ENSG00000166086  -15.24  5.51 6.47e-100  9.71e-97
ENSG00000130066  -10.31  2.61  7.12e-98  9.78e-95
ENSG00000140263  -12.17  2.83  1.33e-92  1.68e-89
ENSG00000131016  -14.42  5.31  1.07e-91  1.26e-88
ENSG00000091879  -14.88  4.55  1.40e-83  1.54e-80
ENSG00000116883  -15.02  4.70  2.07e-83  2.13e-80
ENSG00000116574  -14.52  4.12  6.46e-81  6.27e-78
ENSG00000162545  -14.09 -4.27  6.59e-80  6.04e-77
ENSG00000167751  -11.67  2.95  3.92e-79  3.41e-76
ENSG00000067113  -10.28  2.32  7.50e-79  6.18e-76
We can also use topTags() to return the original counts of the top differentially expressed genes.
> cds$counts[ rownames( topTags( de.tgw , n = 15 )$table ) , ]
                C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000151503   35   35   49   59 3307 3439 1224
ENSG00000096060   65   79  105  113 3975 3727 1451
ENSG00000127954    0    0    3    3  607  602  220
ENSG00000166451   41   52   57   57 1750 1654  728
ENSG00000162772  172  204  250  304 2972 3269 1112
ENSG00000116133   97  107  154  143 1632 1611  555
ENSG00000113594   37   36   57   43  936  959  418
ENSG00000123983   62   76   94  108 1354 1258  628
ENSG00000116285   18   28   23   32  645  630  218
ENSG00000115648  940 1084 1317 1345 9730 9942 3272
ENSG00000166086    9    2    3    6  296  298  121
ENSG00000130066  309  386  497  477 3198 3306 1374
ENSG00000140263   79  103  122  122  978 1007  386
ENSG00000131016    9    5   18    6  564  377  213
ENSG00000091879    8    6    8   14  276  274  110
 
The sort.by argument allows you to sort the table by p-value, concentration or fold-change if desired. Also, by setting the n parameter to the total number of genes, we can save the entire topTags() results table.
> # Sort tagwise results by Fold-Change instead of p-value, and get all genes
> resultsByFC.tgw <- topTags( de.tgw , n = nrow( de.tgw$table ) , sort.by = "logFC" )$table
> head( resultsByFC.tgw )
                logConc logFC  P.Value adj.P.Val
ENSG00000091972   -31.8 -36.5 7.87e-59  3.33e-56
ENSG00000164120   -32.2  35.5 1.88e-44  4.55e-42
ENSG00000100373   -33.0 -34.1 4.16e-16  1.79e-14
ENSG00000118513   -33.0 -34.0 1.37e-14  5.03e-13
ENSG00000081237   -33.2 -33.7 3.75e-12  1.09e-10
ENSG00000196660   -33.2 -33.5 2.07e-11  5.49e-10
We can store these results for all three of the methods
> # Store full topTags results table
> resultsTbl.cmn <- topTags( de.cmn , n = nrow( de.cmn$table ) )$table
> resultsTbl.tgw <- topTags( de.tgw , n = nrow( de.tgw$table ) )$table
> resultsTbl.poi <- topTags( de.poi , n = nrow( de.poi$table ) )$table
> head( resultsTbl.tgw )
                logConc logFC   P.Value adj.P.Val
ENSG00000151503   -11.9  5.82 1.09e-311 1.79e-307
ENSG00000096060   -11.3  5.01 1.51e-269 1.25e-265
ENSG00000127954   -15.6  8.24 1.61e-171 8.85e-168
ENSG00000166451   -12.3  4.69 7.20e-159 2.97e-155
ENSG00000162772   -10.8  3.32 1.32e-144 4.35e-141
ENSG00000116133   -11.7  3.25 2.68e-126 7.38e-123
 
Now that we have the full results table with the adjusted p-values we can compare those p-values to the significance level and see how many differentially expressed genes we find. Here we use a significance level of 0.05. In addition, we can use the function decideTestsDGE() to determine how many DE genes are up or down regulated compared to control.

First we find the names of the significantly different genes -- those with adjusted p values less than 0.05.
# Names/IDs of DE genes
> de.genes.cmn <- rownames( resultsTbl.cmn )[ resultsTbl.cmn$adj.P.Val <= 0.05 ]
> de.genes.tgw <- rownames( resultsTbl.tgw )[ resultsTbl.tgw$adj.P.Val <= 0.05 ]
> de.genes.poi <- rownames( resultsTbl.poi )[ resultsTbl.poi$adj.P.Val <= 0.05 ]
We can see how many of them there are by looking at the length.
> length( de.genes.cmn )
[1] 4936
> length( de.genes.tgw )
[1] 4979
> length( de.genes.poi )
[1] 7048
We can also get a summary of the up/down regulation. Here we show for the tagwise results.
> summary( decideTestsDGE( de.tgw , p.value = 0.05 ) )
   [,1]
-1  2465
0  11515
1   2514
 
We can also can also compare the analyses to see how many DE gene are in common.
> sum( de.genes.tgw %in% de.genes.cmn ) / length( de.genes.tgw ) * 100 # Tagwise to Common
[1] 96.9
> sum( de.genes.cmn %in% de.genes.tgw ) / length( de.genes.cmn ) * 100 # Common to Tagwise
[1] 97.8
> sum( de.genes.tgw %in% de.genes.poi ) / length( de.genes.tgw ) * 100 # Tagwise to Poisson
[1] 100
>
> # Percent shared out of top 10, 100 & 1000 between tagwise and common
> sum( de.genes.tgw[1:10] %in% de.genes.cmn[1:10] ) / 10 * 100
[1] 80
> sum( de.genes.tgw[1:100] %in% de.genes.cmn[1:100] )
[1] 88
> sum( de.genes.tgw[1:1000] %in% de.genes.cmn[1:1000] ) / 1000 * 100
[1] 90.7
>
> # Percent shared out of top 10, 100 & 1000 between tagwise and poisson
> sum( de.genes.tgw[1:10] %in% de.genes.poi[1:10] ) / 10 * 100
[1] 90
> sum( de.genes.tgw[1:100] %in% de.genes.poi[1:100] )
[1] 55
> sum( de.genes.tgw[1:1000] %in% de.genes.poi[1:1000] ) / 1000 * 100
[1] 72.4
 



Visualizing Results

The first thing we’ll look at is the spread of expression levels for the top DE genes. Below, we plot a histogram of log concentrations for the top 100 genes for each analysis.
par( mfrow=c(3 ,1) ) #makes 3 plots on same screen
 
hist(resultsTbl.poi[de.genes.poi[1:100],"logConc"] , breaks=10 , xlab="Log Concentration" , col="red" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Poisson: Top 100" )
hist(resultsTbl.cmn[de.genes.cmn[1:100],"logConc"] , breaks=25 , xlab="Log Concentration" , col="green" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Common: Top 100" )
hist( resultsTbl.tgw[de.genes.tgw[1:100],"logConc"] , breaks=25 , xlab="Log Concentration" , col="blue" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Tagwise: Top 100" )
 
par( mfrow=c(1,1) ) #set it back to default
The next plot is an MA plot that shows the relationship between concentration and fold-change across the genes. The differentially expressed genes are colored red and the non-differentially expressed are colored black. The orange dots represent genes in which the counts were zero in all samples of one of the groups. Here, we just look at the tagwise and poisson analyses since the common analyses in this case is very similar. The blue line is added at a log-FC of 2 to represent a level for biological significance.

par( mfrow=c(2,1) )
 
plotSmear(cds , de.tags=de.genes.poi , main="Poisson" , pair = c("C","T") , cex = .35 , xlab="Log Concentration" , ylab="Log Fold-Change" ) = c(-2, 2) , col = "dodgerblue" )
abline( h=c(-2,2) , col="dodgerblue" )
 
plotSmear( cds , de.tags=de.genes.tgw , main="Tagwise" , pair = c("C","T") , cex = .35 , xlab="Log Concentration" , ylab="Log Fold-Change" )
abline( h=c(-2,2) , col="dodgerblue" )
 
par( mfrow=c(1,1) )
We can also see the difference in log concentrations between the DE genes under the negative binomial model and under the poisson model in the MA plot. Below, we plot only the top 500 DE genes and color the rest black. We can see that within the first 500 DE genes, the tagwise analysis puts more density toward the lower concentration values compared to the poisson.
par( mfrow = c(2,1) )
plotSmear( cds , de.tags=de.genes.poi[1:500] , main="Poisson" ,pair=c("C","T") , cex=.5 , xlab="Log Concentration" , ylab="Log Fold-Change" )
abline( h=c(-2,2) , col="dodgerblue" )
 
plotSmear( cds , de.tags=de.genes.tgw[1:500] , main="Tagwise" , pair=c("C","T") ,cex = .5 ,
xlab="Log Concentration" , ylab="Log Fold-Change" )
abline( h=c(-2,2) , col="dodgerblue" )
 
par( mfrow=c(1,1) )



Outputting Results

The analysis is essentially complete and now we want to output a table or a csv file containing all the results. We’ll pull together the concentrations, fold-changes, p-values, adjusted p-values, up/down regulated variable, dispersions, and the count matrix. Here’s how that’s done:

First change column names in each table to be specific to the analysis; logConc and logFC are the same in both so we don't need to change those names.
colnames( resultsTbl.cmn ) <- c( "logConc" , "logFC" , "pVal.Cmn" , "adj.pVal.Cmn" )
colnames( resultsTbl.tgw ) <- c( "logConc" , "logFC" , "pVal.Tgw" , "adj.pVal.Tgw" )
If we'd like, we could just write these tables to a file
write.table(resultsTbl.cmn, file=""tagwiseResults.csv", sep = "," , row.names = TRUE)
If we'd like to keep the count information too, we need to re-order the count matrix to be in line with the order of the results
> wh.rows.tgw <- match( rownames( resultsTbl.tgw ) , rownames( cds$counts ) )
> wh.rows.cmn <- match( rownames( resultsTbl.cmn ) , rownames( cds$counts ) )
> head( wh.rows.tgw )
[1] 8321 7435 5428 4344 13950 13995

We can combine these two datasets together with cbind(). In fact, we'll also throw in some other information, such as the tagwise dispersion estimates and whether the genes were up/down regulated.
> combResults.tgw <- cbind(resultsTbl.tgw ,"Tgw.Disp" = cds$tagwise.dispersion[ wh.rows.tgw ] ,"UpDown.Tgw" = decideTestsDGE( de.tgw , p.value = 0.05 )[ wh.rows.tgw ] ,cds$counts[ wh.rows.tgw , ] )
> head( combResults.tgw )
                logConc logFC  pVal.Tgw adj.pVal.Tgw Tgw.Disp UpDown.Tgw C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000151503   -11.9  5.82 1.09e-311    1.79e-307   0.0112          1   35   35   49   59 3307 3439 1224
ENSG00000096060   -11.3  5.01 1.51e-269    1.25e-265   0.0112          1   65   79  105  113 3975 3727 1451
ENSG00000127954   -15.6  8.24 1.61e-171    8.85e-168   0.0163          1    0    0    3    3  607  602  220
ENSG00000166451   -12.3  4.69 7.20e-159    2.97e-155   0.0163          1   41   52   57   57 1750 1654  728
ENSG00000162772   -10.8  3.32 1.32e-144    4.35e-141   0.0112          1  172  204  250  304 2972 3269 1112
ENSG00000116133   -11.7  3.25 2.68e-126    7.38e-123   0.0112          1   97  107  154  143 1632 1611  555
Again, we can write this to a file
write.table(combResults.tgw, file=""tagwiseResultsDetailed.csv", sep = "," , row.names = TRUE)
We can do the same for both and then combine all the information together
> combResults.cmn <- cbind(resultsTbl.cmn ,"Cmn.Disp" = cds$common.dispersion ,"UpDown.Cmn" = decideTestsDGE( de.cmn , p.value = 0.05 )[ wh.rows.cmn ] ,cds$counts[ wh.rows.cmn , ] )
>
> wh.rows <- match( rownames( combResults.cmn ) , rownames( combResults.tgw ) )
>
> combResults.all <- cbind(combResults.cmn[,1:4] ,combResults.tgw[wh.rows,3:4] ,"Cmn.Disp" = combResults.cmn[,5],"Tgw.Disp" = combResults.tgw[wh.rows,5],"UpDown.Cmn" = combResults.cmn[,6],"UpDown.Tgw" = combResults.tgw[wh.rows,6],combResults.cmn[,7:ncol(combResults.cmn)] )
> head( combResults.all )
                logConc logFC  pVal.Cmn adj.pVal.Cmn  pVal.Tgw adj.pVal.Tgw Cmn.Disp Tgw.Disp UpDown.Cmn UpDown.Tgw
ENSG00000151503   -11.9  5.82 6.51e-193    1.07e-188 1.09e-311    1.79e-307     0.02   0.0112          1          1
ENSG00000096060   -11.3  5.01 1.23e-162    1.02e-158 1.51e-269    1.25e-265     0.02   0.0112          1          1
ENSG00000127954   -15.6  8.24 2.43e-153    1.34e-149 1.61e-171    8.85e-168     0.02   0.0163          1          1
ENSG00000166451   -12.3  4.69 1.05e-134    4.32e-131 7.20e-159    2.97e-155     0.02   0.0163          1          1
ENSG00000131016   -14.4  5.31 4.12e-110    1.36e-106  1.07e-91     1.26e-88     0.02   0.0268          1          1
ENSG00000113594   -12.8  4.12 5.31e-102     1.46e-98 4.68e-119    1.10e-115     0.02   0.0163          1          1
                C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3
ENSG00000151503   35   35   49   59 3307 3439 1224
ENSG00000096060   65   79  105  113 3975 3727 1451
ENSG00000127954    0    0    3    3  607  602  220
ENSG00000166451   41   52   57   57 1750 1654  728
ENSG00000131016    9    5   18    6  564  377  213
ENSG00000113594   37   36   57   43  936  959  418
 
> write.table( combResults.all , file = "combResults_all_ex1.csv" , sep = "," , row.names = TRUE )


References

  1. Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
  2. Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887
  3. Robinson MD and Smyth GK (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9, 321-332
  4. Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289300.