DESeq Tutorial

Hana Lee (hanalee@berkeley.edu)
Brem Lab
CGRL RNAseq Workshop
April 30, 2012



Background


The question we are trying to answer is: Which genes are being expressed at different levels in different conditions? (Conditions can be environment, genetic background, etc.) Or in statistical terms: Do our measurements for the expression of a gene in different RNAseq experiments come from two different distributions or the same distribution? We test for differential expression by calculating a p-value that represents the probability of the null hypothesis (i.e. the measurements come from the same distribution and the gene is being expressed at the same level across conditions).

DESeq is a R package for Bioconductor designed to assess the statistical significance of expression differences measured in RNAseq.

Negative binomial vs. Poisson


Count data follows a Poisson distribution. However, Poisson assumes that the mean equals the variance/dispersion (μ = σ2). However, in RNAseq data, genes with larger mean counts tend to show greater variance, which is described as the overdispersion problem.


deseq-fig1.jpg
Fig. 1 from Anders & Huber, 2010: Dependence of the variance on the mean for condition A in the fly RNA-Seq data


One solution is to approximate count data with a negative binomial distribution, which is typically used to address overdispersion. The above figure shows the relationship of variance to mean in fly RNAseq data. The purple line represents what is expected from a Poisson distribution, while the orange line represents a negative binomial distribution as used in DESeq. (The dotted orange line represents edgeR, another Bioconductor package that relies on the negative binomial, although it parametrizes the distribution a little differently.)

Estimating dispersion


DESeq estimates the variance by taking into account both the coverage within each library ("size factors"), the expression strength under each condition, and the per-gene variance. Library coverage is determined by comparing the ratio of gene counts between replicate experiments. Expression strength is determined by the mean counts for a gene for each condition. The per-gene variance is assumed to be a function of the mean that is approximated by empirical fit to the data. For more details, see Anders & Huber, 2010.

By comparison, edgeR estimates the variance by estimating a single parameter that is assumed to be constant for all genes in an experiment.

Statistical test


P-values are calculated through a method that is analogous to a Fisher's exact test, using a 2x2 contingency table, but instead of assuming that the probabilities follow a hypergeometric distribution, they follow a negative binomial distribution parametrized from the mean and the estimated dispersion.

References





Installing DEseq


> source("http://www.bioconductor.org/biocLite.R")
> biocLite("DESeq")



Preparing the Input


We will be using the following yeast sample data: yeast_sample_data.txt (right-click to download and save locally to your computer)

(Note: if you have already used EDAseq to normalize your data, please skip ahead as your data is already loaded into R as an object that DESeq can read.)

Load the data into R:
> count_table <- read.table("yeast_sample_data.txt", header=T, sep="\t", row.names=1)
> head(count_table)
        WE1_counts WE2_counts M1_counts M2_counts M3_counts
YAL008W        465        291       911       926       946
YBR255W       1040       1357      1428      1304      1112
YGR131W         95        170       230       113       138
YNL003C       1800       3107      3979      3012      2899
YBR135W       1094       2143       936       860       561
YBR160W       2122       2926      3950      3630      2980

In this data set, the rows are labeled by systematic gene names, and each column represents an RNAseq library. For the purposes of this tutorial, treat WE and M as two separate conditions, with two and three replicates respectively. Note that DESeq takes unnormalized count values as input so do not normalize by library size.

Create a vector in R that will describe the condition corresponding to each column name in the count table.
> expt_design <- data.frame(row.names = colnames(count_table),
+     condition = c("WE","WE","M","M","M"))
> expt_design
           condition
WE1_counts        WE
WE2_counts        WE
M1_counts          M
M2_counts          M
M3_counts          M
> conditions = expt_design$condition
> conditions
[1] WE WE M M M
Levels: M WE



Running DEseq


First, we create a CountDataSet object:
> library("DESeq")
> data <- newCountDataSet(count_table, conditions)
> head(counts(data))
        WE1_counts WE2_counts M1_counts M2_counts M3_counts
YAL008W        465        291       911       926       946
YBR255W       1040       1357      1428      1304      1112
YGR131W         95        170       230       113       138
YNL003C       1800       3107      3979      3012      2899
YBR135W       1094       2143       936       860       561
YBR160W       2122       2926      3950      3630      2980

Next, we estimate the size factors (which is a representation of the differences in coverage between replicates):
> data <- estimateSizeFactors(data)
> sizeFactors(data)
WE1_counts WE2_counts  M1_counts  M2_counts  M3_counts
 0.8225420  1.0636676  1.1830475  1.1250534  0.8588237

(Note: if you've already used EDAseq on your data, run the following commands on your data to transform it to a CountDataSet object:
library("DESeq")
data <- as(normalized_data, "CountDataSet")
After this point, proceed with the rest of the tutorial.)

Finally, we estimate the per-gene variance as a function of the mean:
> data <- estimateVarianceFunctions(data)
> results <- nbinomTest(data, "M", "WE")

(Note: Depending on the version you install, the name of the function differs from estimateDispersions() provided in the vignette.)

If you don't have any replicates, there's an option that allows you to treat all conditions as replicates in order to estimate the per-gene variance. Please note that not having any replicates greatly reduces DESeq's power to detect differential expression:

> data2 <- data[, c("WE1_counts","M1_counts")]
> data2 <- estimateSizeFactors(data2)
> data2 <- estimateVarianceFunctions(data2, method="blind")
> results2 <- nbinomTest(data2, "M", "WE")

Results can be saved as a file:
> write.table(results,
+     file="DESeq_results.txt",
+     sep="\t",
+     row.names=rownames(results),
+     col.names=colnames(results),
+     quote=F)



Understanding the Output


What the results look like:
> head(results)
       id  baseMean baseMeanA baseMeanB foldChange log2FoldChange         pval
1 YAL008W  706.7053  898.2080  419.4512  0.4669867    -1.09854657 6.641443e-04
2 YBR255W 1240.2100 1220.3008 1270.0738  1.0407874     0.05767547 7.435133e-01
3 YGR131W  146.1715  151.8459  137.6600  0.9065769    -0.14149875 8.722571e-01
4 YNL003C 2905.0929 3138.7004 2554.6817  0.8139298    -0.29702379 2.657303e-01
5 YBR135W 1110.7109  736.2681 1672.3752  2.2714216     1.18359548 2.550313e-07
6 YBR160W 3073.1755 3345.0702 2665.3334  0.7967945    -0.32772048 1.890174e-01
          padj    resVarA      resVarB
1 6.044244e-03 2.18947587 1.0247063951
2 8.925745e-01 0.14417477 0.0006701733
3 9.528324e-01 1.30581931 0.1901169941
4 5.462501e-01 1.05583592 0.6348775975
5 4.679414e-06 0.20112792 2.8304917137
6 4.485319e-01 0.08728659 0.0310454492
 
  • baseMean - mean estimated from both conditions (with normalization by size factors)
  • baseMeanA - mean estimated from the first condition specified as an argument to nbinomTest (in this case, "M")
  • baseMeanB - mean estimated from the second condition (in this case, "WE")
  • foldChange - ratio of the estimated means
  • log2FoldChange - log2 of the ratio
  • pval - uncorrected p-value from the negative binomial test
  • padj - p-value adjusted for multiple testing using Benjamini-Hochberg to estimate the false discovery rate
  • resVarA - dispersion estimated from the first condition
  • resVarB - dispersion estimated from the second condition

Results from the case with no replicates:
> head(results2)
       id  baseMean baseMeanA baseMeanB foldChange log2FoldChange      pval
1 YAL008W  658.7190  760.1870   557.251  0.7330447     -0.4480269 0.6968153
2 YBR255W 1218.9621 1191.5994  1246.325  1.0459262      0.0647810 0.8608814
3 YGR131W  152.8856  191.9243   113.847  0.5931871     -0.7534410 0.8802727
4 YNL003C 2738.6953 3320.2898  2157.101  0.6496725     -0.6222155 0.3482909
5 YBR135W 1046.0431  781.0483  1311.038  1.6785619      0.7472258 0.2773562
6 YBR160W 2919.5364 3296.0907  2542.982  0.7715146     -0.3742345 0.5954106
  padj resVarA resVarB
1    1      NA      NA
2    1      NA      NA
3    1      NA      NA
4    1      NA      NA
5    1      NA      NA
6    1      NA      NA
 
Note that resVarA and resVarB are NA because the variance was calculated across conditions rather than within each condition.