edgeRWikiVersion

=edgeR Tutorial: Differential Expression in RNA-Seq Data=

=Introduction=

What is edgeR?
edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used in edgeR do NOT support FPKM, RPKM or other types of data that are not counts. You can obtain more information about edgeR from the bioconductor website []. Here you’ll find a user’s guide/vignette with detailed examples, a reference manual for all edgeR functions, and the [R] code for installing the package. A useful discussion on a forum about edgeR can be found on a [|SeqAnswers Thread].

Getting Help with edgeR
The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive and very helpful; I have used it several times. To subscribe go to []. For guides to posting questions go to []. If you just need help understanding an edgeR function or want to see an example of how it’s used simply type code ?functionname code into the [R] command line. This will bring up a text page with detailed information about the function.

Getting Help with [R]
Google your question! Write a short description and add on at the end “R-help”. This will most likely bring up a thread in the R-help forum that deals with the same or similar problem. Almost every problem you could have can be found answered on this forum. You can also subscribe to the [R] mailing list by going to []. A guide for posting questions can be found at [] Elizabeth Purdom has several [R] tutorials that can be found at []. In particular, see [] for using directories and saving objects and [|www.stat.berkeley.edu/~epurdom/BrownLab/]for creating and manipulating data sets in [R].

Installing edgeR
With an internet connection, edgeR can be easily installed by running the following commands in [R] code source("http://www.bioconductor.org/biocLite.R") biocLite("edgeR") code

Make sure you have the right version of R (2.13.1), or some of the initial commands will not work.

=A Straightforward RNA-Seq Differential Expression Analysis using edgeR=

Description
This is an RNA-Seq data set from an article by Li et al. [2008] titled, ”Determination of tag density required for digital transcriptome analysis: Application to an androgen-sensitive prostate cancer model.” The study compared hormone treated cancer cells to non-treated cancer cells. For the non-treated, there are 4 biological replicates. For the treated, there are 3 biological replicates. A full analysis of this data set can also be found in section 10 of the edgeR vignette.

Setting Up the Count Matrix
First, we load the egdeR package into [R] and set the directory to the folder where the data is contained. If copying from Windows finder, a “\\” is needed rather than the Windows "\" for the path. However, using the standard unix “/” is accepted on all platforms (Windows included). We will set the directory to be where the example data is contained.

code > library(edgeR) > setwd("/global/courses/rnaseqworkshop/web/RNA_SEQ") code The data is in a text file and we need to reading it into R and then we can look at it code > raw.data <- read.table( file = "pnas_expression.txt", header = TRUE ) > head( raw.data ) ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8 len 1 ENSG00000215696    0     0     0     0     0     0     0  330 2 ENSG00000215700    0     0     0     0     0     0     0 2370 3 ENSG00000215699    0     0     0     0     0     0     0 1842 4 ENSG00000215784    0     0     0     0     0     0     0 2393 5 ENSG00000212914    0     0     0     0     0     0     0  384 6 ENSG00000212042    0     0     0     0     0     0     0   92

code

edgeR requires the the dataset to contain only the counts, with the row names as the gene ids and the column names as the sample ids: code > counts <- raw.data[, -c(1,ncol(raw.data)) ] > rownames( counts ) <- raw.data[, 1 ] # gene names > colnames( counts ) <- paste(c(rep("C_R",4),rep("T_R",3)),c(1:4,1:3),sep="") # sample names > head( counts ) C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000215696   0    0    0    0    0    0    0 ENSG00000215700   0    0    0    0    0    0    0 ENSG00000215699   0    0    0    0    0    0    0 ENSG00000215784   0    0    0    0    0    0    0 ENSG00000212914   0    0    0    0    0    0    0 ENSG00000212042   0    0    0    0    0    0    0

code

Some quick summaries: code > dim( counts ) [1] 37435    7 > colSums( counts ) # Library Sizes C_R1   C_R2    C_R3    C_R4    T_R1    T_R2    T_R3 978576 1156844 1442169 1485604 1823460 1834335 681743 > colSums( counts ) / 1e06 # Library Sizes in millions of reads C_R1    C_R2     C_R3     C_R4     T_R1     T_R2     T_R3 0.978576 1.156844 1.442169 1.485604 1.823460 1.834335 0.681743 > table( rowSums( counts ) )[ 1:30 ] # Number of genes with low counts

0    1     2     3     4     5     6     7     8     9    10    11    12 15558  1826  1149   717   635   388   447   291   308   240   248   191   252   13    14    15    16    17    18    19    20    21    22    23    24    25  157   190   155   162   126   166    99   169   125   123   100   124   100   26    27    28    29   87   100   103    73 code

=Building the edgeR Object=

DGEList is the function that coverts the count matrix into an edgeR object. First, we create a group variable that tells edgeR with samples belong to which group and supply that to DGEList in addition to the count matrix. We can then see the elements that the object contains by using the names function.

code > group <- c(rep("C", 4), rep("T", 3)) > cds <- DGEList( counts, group = group ) Calculating library sizes from column totals. > names( cds ) [1] "samples"  "counts"    "all.zeros"

code These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object. code > head(cds$counts) # original count matrix C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000215696   0    0    0    0    0    0    0 ENSG00000215700   0    0    0    0    0    0    0 ENSG00000215699   0    0    0    0    0    0    0 ENSG00000215784   0    0    0    0    0    0    0 ENSG00000212914   0    0    0    0    0    0    0 ENSG00000212042   0    0    0    0    0    0    0 > cds$samples # contains a summary of your samples group lib.size norm.factors C_R1    C   978576            1 C_R2    C  1156844            1 C_R3    C  1442169            1 C_R4    C  1485604            1 T_R1    T  1823460            1 T_R2    T  1834335            1 T_R3    T   681743            1 > sum( cds$all.zeros ) # How many genes have 0 counts across all samples [1] 15558

code In fact, we can just type in cds and see summaries (but be warned that for some other objects, there may not be nice summarizations like this) code > cds # or type the name of the object An object of class "DGEList" $samples group lib.size norm.factors C_R1    C   978576            1 C_R2    C  1156844            1 C_R3    C  1442169            1 C_R4    C  1485604            1 T_R1    T  1823460            1 T_R2    T  1834335            1 T_R3    T   681743            1

$counts C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000215696   0    0    0    0    0    0    0 ENSG00000215700   0    0    0    0    0    0    0 ENSG00000215699   0    0    0    0    0    0    0 ENSG00000215784   0    0    0    0    0    0    0 ENSG00000212914   0    0    0    0    0    0    0 37430 more rows ...

$all.zeros ENSG00000215696 ENSG00000215700 ENSG00000215699 ENSG00000215784 ENSG00000212914 TRUE           TRUE            TRUE            TRUE            TRUE 37430 more elements ...

code

We need to filter out low count reads since it would be impossible to detect differential expression. The method used in the edgeR vignette is to keep only those genes that have at least 1 read per million in at least 3 samples.

code > cds <- cds[rowSums(1e+06 * cds$counts/expandAsMatrix(cds$samples$lib.size, dim(cds)) > 1) >= 3, ] > dim( cds ) [1] 16494    7 code Once this filtering is done, we can calculate the normalization factors which correct for the different compositions of the samples. code > cds <- calcNormFactors( cds ) > cds$samples group lib.size norm.factors C_R1    C   978576    1.0296636 C_R2    C  1156844    1.0372521 C_R3    C  1442169    1.0362662 C_R4    C  1485604    1.0378383 T_R1    T  1823460    0.9537095 T_R2    T  1834335    0.9525624 T_R3    T   681743    0.9583181 code The effective library sizes are then the product of the actual library sizes and these factors. code > cds$samples$lib.size*cds$samples$norm.factors [1] 1007604.1 1199938.9 1494471.0 1541816.7 1739051.1 1747318.6 653326.7 code
 * 1) effective library sizes

Multi-Dimensional Scaling Plot
An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples. code plotMDS.dge( cds, main = "MDS Plot for Count Data", labels = colnames( cds$counts ) ) code When creating plots in R, you can also save plots as pdf’s, png’s, etc. (see ?pdf ). code pdf( "MDS_plot_1_ex1.pdf", width = 7 , height = 7 ) #sets up the pdf that will save to; width/height in inches plotMDS.dge( cds, main = "MDS Plot for Count Data", labels = colnames( cds$counts ) ) dev.off # this tells [R] to close and stop writing to the pdf. code
 * 1) To view the plot immediately
 * 1) Output plot as a pdf

Estimating Dispersions
The first dispersion type to calculate is the common dispersion. In the common dispersion setting, each gene gets assigned the same dispersion estimate. The output of the estimation will include the estimate as well as some other elements added to the edgeR object, cds. code > cds <- estimateCommonDisp( cds ) > names( cds ) [1] "samples"          "common.dispersion" "counts"            "pseudo.alt"        "genes" [6] "all.zeros"        "conc"              "common.lib.size" code We can see the estimate of the common dispersion: code > cds$common.dispersion [1] 0.0199892

code To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ 2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be, code > sqrt( 200 ) # poisson sd [1] 14.14214 > sqrt( 200 + 200^2 * cds$common.dispersion ) # negative binomial [1] 31.61595 > sqrt( 200 + 200^2 * cds$common.dispersion ) / sqrt( 200 ) # NB sd is over 2 times larger [1] 2.235585

code Once the common dispersion is estimated we can estimate the tagwise dispersions (i.e. gene-wise). In this scenario, each gene will get its own unique dispersion estimate, but the common dispersion is still used in the calculation. The tagwise dispersions are squeezed toward the common value. The amount of squeezing is governed by the paramter prior.n which by default is 10. The higher prior.n, the closer the estimates will be to the common dispersion. The recommended value is the nearest integer to 50/(#samples − #groups). For this data set that’s 50/(7 − 2) = 10, so it is equal to the default. code > cds <- estimateTagwiseDisp( cds, prior.n = 10 ) Using grid search to estimate tagwise dispersion. > names( cds ) [1] "samples"           "common.dispersion"  "prior.n"            "tagwise.dispersion" "counts" [6] "pseudo.alt"        "genes"              "all.zeros"          "conc"               "common.lib.size" > summary( cds$tagwise.dispersion ) Min. 1st Qu. Median   Mean 3rd Qu. Max. 0.01115 0.01631 0.02151 0.02094 0.02151 0.15130

code We can see what the effect is of changing the prior.n to 25. Not much changed, but the extremes got squeezed in quite a bit. code > cds <- estimateTagwiseDisp( cds, prior.n = 25 ) Using grid search to estimate tagwise dispersion. > summary( cds$tagwise.dispersion ) Min. 1st Qu. Median   Mean 3rd Qu. Max. 0.01631 0.02151 0.02151 0.02083 0.02151 0.07089
 * 1) More shrinkage/sqeezing toward the common

code The recommended setting for this data set is the default of 10. Let’s stick with that. We'll set it back to the recommended for the rest of the tutorial code > cds <- estimateTagwiseDisp( cds, prior.n = 10 ) Using grid search to estimate tagwise dispersion. code

Mean-Variance Plot
Now that we’ve estimated the dispersion parameters we can see how well they fit the data by plotting the mean-variance relationship. If you are working on a remote server, you will only see this plot if you have already set your display settings appropriately. If you are working with a R GUI on your laptop, the plot should show up right away on your screen. code > meanVarPlot <- plotMeanVar( cds ,show.raw.vars=TRUE, show.tagwise.vars=TRUE , show.binned.common.disp.vars=FALSE , show.ave.raw.vars=FALSE , dispersion.method = "qcml" , NBline = TRUE , nbins = 100 , #these are arguments about what is plotted   pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) #these arguments are to make it look prettier code Four things are shown in the plot: the raw variances of the counts (grey dots), the variances using the tagwise dispersions (light blue dots), the variances using the common dispersion (solid blue line), and the variance = mean a.k.a. poisson variance (solid black line). The plot function outputs the variances which will be stored in the data set meanVarPlot.

=Testing= The function exactTest performs pair-wise tests for differential expression between two groups. The important parameters are common.disp which if true uses the common dispersion estimate for testing and if false uses the tagwise dispersion estimates; and pair which indicates which two groups should be compared. The output of exactTest is a list of elements, one of which is a table of the results. The third line of code uses the dispersion parameter to set a dispersion of zero, ie. the poisson model to which we can compare the negative binomial results.

code > de.cmn <- exactTest( cds, common.disp = TRUE , pair = c( "C" , "T" ) ) #single common dispersion across all genes Comparison of groups: T - C > de.tgw <- exactTest( cds, common.disp = FALSE , pair = c( "C" , "T" ) ) #each gene its own dispersion Comparison of groups: T - C > de.poi <- exactTest( cds, dispersion = 1e-06 , pair = c( "C" , "T" ) ) #a poisson model (no dispersion) Comparison of groups: T - C

code We can look at these objects code > names( de.tgw ) [1] "table"     "comparison" "genes" > de.tgw$comparison # which groups have been compared [1] "C" "T" > head( de.tgw$table ) # results table in order of your count matrix. logConc      logFC    p.value ENSG00000124208 -11.26081 -0.43041686 0.01255780 ENSG00000182463 -15.32155 0.69290959 0.00436565 ENSG00000124201 -12.43969 -0.05991576 0.71782933 ENSG00000124207 -14.11940 -0.42126690 0.03996537 ENSG00000125835 -12.87722 -0.24892772 0.19096504 ENSG00000125834 -14.15590 0.30959561 0.15119062 > head( cds$counts ) C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000124208 478  619  628  744  483  716  240 ENSG00000182463  27   20   27   26   48   55   24 ENSG00000124201 180  218  293  275  373  301   88 ENSG00000124207  76   80   85   97   80   81   37 ENSG00000125835 132  200  200  228  280  204   52 ENSG00000125834  42   60   72   86  131   99   30 code The table from exactTest does not contain p-values adjusted for multiple testing. The function topTags takes the output from exactTest, adjusts the raw p-values using the False Discovery Rate (FDR) correction, and returns the top differentially expressed genes. The output is similar to that of exactTest but with a column of adjusted p-values and sorted by increasing p-value. code > # Top tags for tagwise analysis > options( digits = 3 ) # print only 3 digits > topTags( de.tgw, n = 20 , sort.by = "p.value" ) # top 20 DE genes Comparison of groups: T-C logConc logFC  P.Value       FDR ENSG00000151503 -11.94  5.82 1.09e-311 1.79e-307 ENSG00000096060 -11.32  5.01 1.51e-269 1.25e-265 ENSG00000127954 -15.62  8.24 1.61e-171 8.85e-168 ENSG00000166451 -12.28  4.69 7.20e-159 2.97e-155 ENSG00000162772 -10.81  3.32 1.32e-144 4.35e-141 ENSG00000116133 -11.73  3.25 2.68e-126 7.38e-123 ENSG00000113594 -12.83  4.12 4.68e-119 1.10e-115 ENSG00000123983 -12.09  3.66 2.06e-109 4.25e-106 ENSG00000116285 -13.56  4.21 8.94e-107 1.64e-103 ENSG00000115648  -8.82  2.60 1.65e-102  2.73e-99 ENSG00000166086 -15.24  5.51 6.47e-100  9.71e-97 ENSG00000130066 -10.31  2.61  7.12e-98  9.78e-95 ENSG00000140263 -12.17  2.83  1.33e-92  1.68e-89 ENSG00000131016 -14.42  5.31  1.07e-91  1.26e-88 ENSG00000091879 -14.88  4.55  1.40e-83  1.54e-80 ENSG00000116883 -15.02  4.70  2.07e-83  2.13e-80 ENSG00000116574 -14.52  4.12  6.46e-81  6.27e-78 ENSG00000162545 -14.09 -4.27  6.59e-80  6.04e-77 ENSG00000167751 -11.67  2.95  3.92e-79  3.41e-76 ENSG00000067113 -10.28  2.32  7.50e-79  6.18e-76 code We can also use topTags to return the original counts of the top differentially expressed genes. code > cds$counts[ rownames( topTags( de.tgw, n = 15 )$table ) , ] C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000151503  35   35   49   59 3307 3439 1224 ENSG00000096060  65   79  105  113 3975 3727 1451 ENSG00000127954   0    0    3    3  607  602  220 ENSG00000166451  41   52   57   57 1750 1654  728 ENSG00000162772 172  204  250  304 2972 3269 1112 ENSG00000116133  97  107  154  143 1632 1611  555 ENSG00000113594  37   36   57   43  936  959  418 ENSG00000123983  62   76   94  108 1354 1258  628 ENSG00000116285  18   28   23   32  645  630  218 ENSG00000115648 940 1084 1317 1345 9730 9942 3272 ENSG00000166086   9    2    3    6  296  298  121 ENSG00000130066 309  386  497  477 3198 3306 1374 ENSG00000140263  79  103  122  122  978 1007  386 ENSG00000131016   9    5   18    6  564  377  213 ENSG00000091879   8    6    8   14  276  274  110

code The sort.by argument allows you to sort the table by p-value, concentration or fold-change if desired. Also, by setting the n parameter to the total number of genes, we can save the entire topTags results table. code > # Sort tagwise results by Fold-Change instead of p-value, and get all genes > resultsByFC.tgw <- topTags( de.tgw, n = nrow( de.tgw$table ) , sort.by = "logFC" )$table > head( resultsByFC.tgw ) logConc logFC P.Value adj.P.Val ENSG00000091972  -31.8 -36.5 7.87e-59  3.33e-56 ENSG00000164120  -32.2  35.5 1.88e-44  4.55e-42 ENSG00000100373  -33.0 -34.1 4.16e-16  1.79e-14 ENSG00000118513  -33.0 -34.0 1.37e-14  5.03e-13 ENSG00000081237  -33.2 -33.7 3.75e-12  1.09e-10 ENSG00000196660  -33.2 -33.5 2.07e-11  5.49e-10 code We can store these results for all three of the methods code > # Store full topTags results table > resultsTbl.cmn <- topTags( de.cmn, n = nrow( de.cmn$table ) )$table > resultsTbl.tgw <- topTags( de.tgw, n = nrow( de.tgw$table ) )$table > resultsTbl.poi <- topTags( de.poi, n = nrow( de.poi$table ) )$table > head( resultsTbl.tgw ) logConc logFC  P.Value adj.P.Val ENSG00000151503  -11.9  5.82 1.09e-311 1.79e-307 ENSG00000096060  -11.3  5.01 1.51e-269 1.25e-265 ENSG00000127954  -15.6  8.24 1.61e-171 8.85e-168 ENSG00000166451  -12.3  4.69 7.20e-159 2.97e-155 ENSG00000162772  -10.8  3.32 1.32e-144 4.35e-141 ENSG00000116133  -11.7  3.25 2.68e-126 7.38e-123

code Now that we have the full results table with the adjusted p-values we can compare those p-values to the significance level and see how many differentially expressed genes we find. Here we use a significance level of 0.05. In addition, we can use the function decideTestsDGE to determine how many DE genes are up or down regulated compared to control.

First we find the names of the significantly different genes -- those with adjusted p values less than 0.05. code > de.genes.cmn <- rownames( resultsTbl.cmn )[ resultsTbl.cmn$adj.P.Val <= 0.05 ] > de.genes.tgw <- rownames( resultsTbl.tgw )[ resultsTbl.tgw$adj.P.Val <= 0.05 ] > de.genes.poi <- rownames( resultsTbl.poi )[ resultsTbl.poi$adj.P.Val <= 0.05 ] code We can see how many of them there are by looking at the length. code > length( de.genes.cmn ) [1] 4936 > length( de.genes.tgw ) [1] 4979 > length( de.genes.poi ) [1] 7048 code We can also get a summary of the up/down regulation. Here we show for the tagwise results. code > summary( decideTestsDGE( de.tgw, p.value = 0.05 ) ) [,1] -1 2465 0  11515 1   2514
 * 1) Names/IDs of DE genes

code We can also can also compare the analyses to see how many DE gene are in common. code > sum( de.genes.tgw %in% de.genes.cmn ) / length( de.genes.tgw ) * 100 # Tagwise to Common [1] 96.9 > sum( de.genes.cmn %in% de.genes.tgw ) / length( de.genes.cmn ) * 100 # Common to Tagwise [1] 97.8 > sum( de.genes.tgw %in% de.genes.poi ) / length( de.genes.tgw ) * 100 # Tagwise to Poisson [1] 100 > > # Percent shared out of top 10, 100 & 1000 between tagwise and common > sum( de.genes.tgw[1:10] %in% de.genes.cmn[1:10] ) / 10 * 100 [1] 80 > sum( de.genes.tgw[1:100] %in% de.genes.cmn[1:100] ) [1] 88 > sum( de.genes.tgw[1:1000] %in% de.genes.cmn[1:1000] ) / 1000 * 100 [1] 90.7 > > # Percent shared out of top 10, 100 & 1000 between tagwise and poisson > sum( de.genes.tgw[1:10] %in% de.genes.poi[1:10] ) / 10 * 100 [1] 90 > sum( de.genes.tgw[1:100] %in% de.genes.poi[1:100] ) [1] 55 > sum( de.genes.tgw[1:1000] %in% de.genes.poi[1:1000] ) / 1000 * 100 [1] 72.4

code

=Visualizing Results= The first thing we’ll look at is the spread of expression levels for the top DE genes. Below, we plot a histogram of log concentrations for the top 100 genes for each analysis. code par( mfrow=c(3 ,1) ) #makes 3 plots on same screen

hist(resultsTbl.poi[de.genes.poi[1:100],"logConc"], breaks=10 , xlab="Log Concentration" , col="red" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Poisson: Top 100" ) hist(resultsTbl.cmn[de.genes.cmn[1:100],"logConc"], breaks=25 , xlab="Log Concentration" , col="green" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Common: Top 100" ) hist( resultsTbl.tgw[de.genes.tgw[1:100],"logConc"], breaks=25 , xlab="Log Concentration" , col="blue" , xlim=c(-18,-6) , ylim=c(0,0.4) , freq=FALSE , main="Tagwise: Top 100" )

par( mfrow=c(1,1) ) #set it back to default code The next plot is an MA plot that shows the relationship between concentration and fold-change across the genes. The differentially expressed genes are colored red and the non-differentially expressed are colored black. The orange dots represent genes in which the counts were zero in all samples of one of the groups. Here, we just look at the tagwise and poisson analyses since the common analyses in this case is very similar. The blue line is added at a log-FC of 2 to represent a level for biological significance.

code par( mfrow=c(2,1) )

plotSmear(cds, de.tags=de.genes.poi , main="Poisson" , pair = c("C","T") , cex = .35 , xlab="Log Concentration" , ylab="Log Fold-Change" ) = c(-2, 2) , col = "dodgerblue" ) abline( h=c(-2,2) , col="dodgerblue" )

plotSmear( cds, de.tags=de.genes.tgw , main="Tagwise" , pair = c("C","T") , cex = .35 , xlab="Log Concentration" , ylab="Log Fold-Change" ) abline( h=c(-2,2), col="dodgerblue" )

par( mfrow=c(1,1) ) code We can also see the difference in log concentrations between the DE genes under the negative binomial model and under the poisson model in the MA plot. Below, we plot only the top 500 DE genes and color the rest black. We can see that within the first 500 DE genes, the tagwise analysis puts more density toward the lower concentration values compared to the poisson. code par( mfrow = c(2,1) ) plotSmear( cds, de.tags=de.genes.poi[1:500] , main="Poisson" ,pair=c("C","T") , cex=.5 , xlab="Log Concentration" , ylab="Log Fold-Change" ) abline( h=c(-2,2), col="dodgerblue" )

plotSmear( cds, de.tags=de.genes.tgw[1:500] , main="Tagwise" , pair=c("C","T") ,cex = .5 , xlab="Log Concentration" , ylab="Log Fold-Change" ) abline( h=c(-2,2), col="dodgerblue" )

par( mfrow=c(1,1) ) code

=Outputting Results= The analysis is essentially complete and now we want to output a table or a csv file containing all the results. We’ll pull together the concentrations, fold-changes, p-values, adjusted p-values, up/down regulated variable, dispersions, and the count matrix. Here’s how that’s done:

First change column names in each table to be specific to the analysis; logConc and logFC are the same in both so we don't need to change those names. code colnames( resultsTbl.cmn ) <- c( "logConc", "logFC" , "pVal.Cmn" , "adj.pVal.Cmn" ) colnames( resultsTbl.tgw ) <- c( "logConc", "logFC" , "pVal.Tgw" , "adj.pVal.Tgw" ) code If we'd like, we could just write these tables to a file code write.table(resultsTbl.cmn, file=""tagwiseResults.csv", sep = ",", row.names = TRUE) code If we'd like to keep the count information too, we need to re-order the count matrix to be in line with the order of the results code > wh.rows.tgw <- match( rownames( resultsTbl.tgw ) , rownames( cds$counts ) ) > wh.rows.cmn <- match( rownames( resultsTbl.cmn ) , rownames( cds$counts ) ) > head( wh.rows.tgw ) [1] 8321 7435 5428 4344 13950 13995 code

We can combine these two datasets together with cbind. In fact, we'll also throw in some other information, such as the tagwise dispersion estimates and whether the genes were up/down regulated. code > combResults.tgw <- cbind(resultsTbl.tgw ,"Tgw.Disp" = cds$tagwise.dispersion[ wh.rows.tgw ] ,"UpDown.Tgw" = decideTestsDGE( de.tgw, p.value = 0.05 )[ wh.rows.tgw ] ,cds$counts[ wh.rows.tgw , ] ) > head( combResults.tgw ) logConc logFC pVal.Tgw adj.pVal.Tgw Tgw.Disp UpDown.Tgw C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000151503  -11.9  5.82 1.09e-311    1.79e-307   0.0112          1   35   35   49   59 3307 3439 1224 ENSG00000096060  -11.3  5.01 1.51e-269    1.25e-265   0.0112          1   65   79  105  113 3975 3727 1451 ENSG00000127954  -15.6  8.24 1.61e-171    8.85e-168   0.0163          1    0    0    3    3  607  602  220 ENSG00000166451  -12.3  4.69 7.20e-159    2.97e-155   0.0163          1   41   52   57   57 1750 1654  728 ENSG00000162772  -10.8  3.32 1.32e-144    4.35e-141   0.0112          1  172  204  250  304 2972 3269 1112 ENSG00000116133  -11.7  3.25 2.68e-126    7.38e-123   0.0112          1   97  107  154  143 1632 1611  555 code Again, we can write this to a file code write.table(combResults.tgw, file=""tagwiseResultsDetailed.csv", sep = ",", row.names = TRUE) code We can do the same for both and then combine all the information together code > combResults.cmn <- cbind(resultsTbl.cmn ,"Cmn.Disp" = cds$common.dispersion ,"UpDown.Cmn" = decideTestsDGE( de.cmn , p.value = 0.05 )[ wh.rows.cmn ] ,cds$counts[ wh.rows.cmn , ] ) > > wh.rows <- match( rownames( combResults.cmn ) , rownames( combResults.tgw ) ) > > combResults.all <- cbind(combResults.cmn[,1:4] ,combResults.tgw[wh.rows,3:4] ,"Cmn.Disp" = combResults.cmn[,5],"Tgw.Disp" = combResults.tgw[wh.rows,5],"UpDown.Cmn" = combResults.cmn[,6],"UpDown.Tgw" = combResults.tgw[wh.rows,6],combResults.cmn[,7:ncol(combResults.cmn)] ) > head( combResults.all )               logConc logFC  pVal.Cmn adj.pVal.Cmn  pVal.Tgw adj.pVal.Tgw Cmn.Disp Tgw.Disp UpDown.Cmn UpDown.Tgw ENSG00000151503   -11.9  5.82 6.51e-193    1.07e-188 1.09e-311    1.79e-307     0.02   0.0112          1          1 ENSG00000096060  -11.3  5.01 1.23e-162    1.02e-158 1.51e-269    1.25e-265     0.02   0.0112          1          1 ENSG00000127954  -15.6  8.24 2.43e-153    1.34e-149 1.61e-171    8.85e-168     0.02   0.0163          1          1 ENSG00000166451  -12.3  4.69 1.05e-134    4.32e-131 7.20e-159    2.97e-155     0.02   0.0163          1          1 ENSG00000131016  -14.4  5.31 4.12e-110    1.36e-106  1.07e-91     1.26e-88     0.02   0.0268          1          1 ENSG00000113594  -12.8  4.12 5.31e-102     1.46e-98 4.68e-119    1.10e-115     0.02   0.0163          1          1 C_R1 C_R2 C_R3 C_R4 T_R1 T_R2 T_R3 ENSG00000151503  35   35   49   59 3307 3439 1224 ENSG00000096060  65   79  105  113 3975 3727 1451 ENSG00000127954   0    0    3    3  607  602  220 ENSG00000166451  41   52   57   57 1750 1654  728 ENSG00000131016   9    5   18    6  564  377  213 ENSG00000113594  37   36   57   43  936  959  418

> write.table( combResults.all, file = "combResults_all_ex1.csv" , sep = "," , row.names = TRUE ) code

= =