The question we are trying to answer is: Which genes are being expressed at different levels in different conditions? (Conditions can be environment, genetic background, etc.) Or in statistical terms: Do our measurements for the expression of a gene in different RNAseq experiments come from two different distributions or the same distribution? We test for differential expression by calculating a p-value that represents the probability of the null hypothesis (i.e. the measurements come from the same distribution and the gene is being expressed at the same level across conditions).
DESeq is a R package for Bioconductor designed to assess the statistical significance of expression differences measured in RNAseq.
Negative binomial vs. Poisson
Count data follows a Poisson distribution. However, Poisson assumes that the mean equals the variance/dispersion (μ = σ2). However, in RNAseq data, genes with larger mean counts tend to show greater variance, which is described as the overdispersion problem.
Fig. 1 from Anders & Huber, 2010: Dependence of the variance on the mean for condition A in the fly RNA-Seq data
One solution is to approximate count data with a negative binomial distribution, which is typically used to address overdispersion. The above figure shows the relationship of variance to mean in fly RNAseq data. The purple line represents what is expected from a Poisson distribution, while the orange line represents a negative binomial distribution as used in DESeq. (The dotted orange line represents edgeR, another Bioconductor package that relies on the negative binomial, although it parametrizes the distribution a little differently.)
Estimating dispersion
DESeq estimates the variance by taking into account both the coverage within each library ("size factors"), the expression strength under each condition, and the per-gene variance. Library coverage is determined by comparing the ratio of gene counts between replicate experiments. Expression strength is determined by the mean counts for a gene for each condition. The per-gene variance is assumed to be a function of the mean that is approximated by empirical fit to the data. For more details, see Anders & Huber, 2010.
By comparison, edgeR estimates the variance by estimating a single parameter that is assumed to be constant for all genes in an experiment.
Statistical test
P-values are calculated through a method that is analogous to a Fisher's exact test, using a 2x2 contingency table, but instead of assuming that the probabilities follow a hypergeometric distribution, they follow a negative binomial distribution parametrized from the mean and the estimated dispersion.
References
Anders & Huber (2010). Differential expression analysis for sequence count data. Genome Biology11: R106. [doi:10.1186/gb-2010-11-10-r106]
We will be using the following yeast sample data: yeast_sample_data.txt (right-click to download and save locally to your computer)
(Note: if you have already used EDAseq to normalize your data, please skip ahead as your data is already loaded into R as an object that DESeq can read.)
In this data set, the rows are labeled by systematic gene names, and each column represents an RNAseq library. For the purposes of this tutorial, treat WE and M as two separate conditions, with two and three replicates respectively. Note that DESeq takes unnormalized count values as input so do not normalize by library size.
Create a vector in R that will describe the condition corresponding to each column name in the count table.
> expt_design <- data.frame(row.names = colnames(count_table),
+ condition = c("WE","WE","M","M","M"))
> expt_design
condition
WE1_counts WE
WE2_counts WE
M1_counts M
M2_counts M
M3_counts M
> conditions = expt_design$condition
> conditions
[1] WE WE M M M
Levels: M WE
(Note: if you've already used EDAseq on your data, run the following commands on your data to transform it to a CountDataSet object:
library("DESeq")
data <- as(normalized_data, "CountDataSet")
After this point, proceed with the rest of the tutorial.)
Finally, we estimate the per-gene variance as a function of the mean:
> data <- estimateVarianceFunctions(data)
> results <- nbinomTest(data, "M", "WE")
(Note: Depending on the version you install, the name of the function differs from estimateDispersions() provided in the vignette.)
If you don't have any replicates, there's an option that allows you to treat all conditions as replicates in order to estimate the per-gene variance. Please note that not having any replicates greatly reduces DESeq's power to detect differential expression:
DESeq Tutorial
Table of Contents
Brem Lab
CGRL RNAseq Workshop
April 30, 2012
Background
The question we are trying to answer is: Which genes are being expressed at different levels in different conditions? (Conditions can be environment, genetic background, etc.) Or in statistical terms: Do our measurements for the expression of a gene in different RNAseq experiments come from two different distributions or the same distribution? We test for differential expression by calculating a p-value that represents the probability of the null hypothesis (i.e. the measurements come from the same distribution and the gene is being expressed at the same level across conditions).
DESeq is a R package for Bioconductor designed to assess the statistical significance of expression differences measured in RNAseq.
Negative binomial vs. Poisson
Count data follows a Poisson distribution. However, Poisson assumes that the mean equals the variance/dispersion (μ = σ2). However, in RNAseq data, genes with larger mean counts tend to show greater variance, which is described as the overdispersion problem.
One solution is to approximate count data with a negative binomial distribution, which is typically used to address overdispersion. The above figure shows the relationship of variance to mean in fly RNAseq data. The purple line represents what is expected from a Poisson distribution, while the orange line represents a negative binomial distribution as used in DESeq. (The dotted orange line represents edgeR, another Bioconductor package that relies on the negative binomial, although it parametrizes the distribution a little differently.)
Estimating dispersion
DESeq estimates the variance by taking into account both the coverage within each library ("size factors"), the expression strength under each condition, and the per-gene variance. Library coverage is determined by comparing the ratio of gene counts between replicate experiments. Expression strength is determined by the mean counts for a gene for each condition. The per-gene variance is assumed to be a function of the mean that is approximated by empirical fit to the data. For more details, see Anders & Huber, 2010.
By comparison, edgeR estimates the variance by estimating a single parameter that is assumed to be constant for all genes in an experiment.
Statistical test
P-values are calculated through a method that is analogous to a Fisher's exact test, using a 2x2 contingency table, but instead of assuming that the probabilities follow a hypergeometric distribution, they follow a negative binomial distribution parametrized from the mean and the estimated dispersion.
References
Installing DEseq
> source("http://www.bioconductor.org/biocLite.R") > biocLite("DESeq")Preparing the Input
We will be using the following yeast sample data: yeast_sample_data.txt (right-click to download and save locally to your computer)
(Note: if you have already used EDAseq to normalize your data, please skip ahead as your data is already loaded into R as an object that DESeq can read.)
Load the data into R:
> count_table <- read.table("yeast_sample_data.txt", header=T, sep="\t", row.names=1) > head(count_table) WE1_counts WE2_counts M1_counts M2_counts M3_counts YAL008W 465 291 911 926 946 YBR255W 1040 1357 1428 1304 1112 YGR131W 95 170 230 113 138 YNL003C 1800 3107 3979 3012 2899 YBR135W 1094 2143 936 860 561 YBR160W 2122 2926 3950 3630 2980In this data set, the rows are labeled by systematic gene names, and each column represents an RNAseq library. For the purposes of this tutorial, treat WE and M as two separate conditions, with two and three replicates respectively. Note that DESeq takes unnormalized count values as input so do not normalize by library size.
Create a vector in R that will describe the condition corresponding to each column name in the count table.
> expt_design <- data.frame(row.names = colnames(count_table), + condition = c("WE","WE","M","M","M")) > expt_design condition WE1_counts WE WE2_counts WE M1_counts M M2_counts M M3_counts M > conditions = expt_design$condition > conditions [1] WE WE M M M Levels: M WERunning DEseq
First, we create a CountDataSet object:
> library("DESeq") > data <- newCountDataSet(count_table, conditions) > head(counts(data)) WE1_counts WE2_counts M1_counts M2_counts M3_counts YAL008W 465 291 911 926 946 YBR255W 1040 1357 1428 1304 1112 YGR131W 95 170 230 113 138 YNL003C 1800 3107 3979 3012 2899 YBR135W 1094 2143 936 860 561 YBR160W 2122 2926 3950 3630 2980Next, we estimate the size factors (which is a representation of the differences in coverage between replicates):
(Note: if you've already used EDAseq on your data, run the following commands on your data to transform it to a CountDataSet object:
library("DESeq") data <- as(normalized_data, "CountDataSet")After this point, proceed with the rest of the tutorial.)Finally, we estimate the per-gene variance as a function of the mean:
(Note: Depending on the version you install, the name of the function differs from estimateDispersions() provided in the vignette.)
If you don't have any replicates, there's an option that allows you to treat all conditions as replicates in order to estimate the per-gene variance. Please note that not having any replicates greatly reduces DESeq's power to detect differential expression:
> data2 <- data[, c("WE1_counts","M1_counts")] > data2 <- estimateSizeFactors(data2) > data2 <- estimateVarianceFunctions(data2, method="blind") > results2 <- nbinomTest(data2, "M", "WE")Results can be saved as a file:
Understanding the Output
What the results look like:
> head(results) id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval 1 YAL008W 706.7053 898.2080 419.4512 0.4669867 -1.09854657 6.641443e-04 2 YBR255W 1240.2100 1220.3008 1270.0738 1.0407874 0.05767547 7.435133e-01 3 YGR131W 146.1715 151.8459 137.6600 0.9065769 -0.14149875 8.722571e-01 4 YNL003C 2905.0929 3138.7004 2554.6817 0.8139298 -0.29702379 2.657303e-01 5 YBR135W 1110.7109 736.2681 1672.3752 2.2714216 1.18359548 2.550313e-07 6 YBR160W 3073.1755 3345.0702 2665.3334 0.7967945 -0.32772048 1.890174e-01 padj resVarA resVarB 1 6.044244e-03 2.18947587 1.0247063951 2 8.925745e-01 0.14417477 0.0006701733 3 9.528324e-01 1.30581931 0.1901169941 4 5.462501e-01 1.05583592 0.6348775975 5 4.679414e-06 0.20112792 2.8304917137 6 4.485319e-01 0.08728659 0.0310454492Results from the case with no replicates:
> head(results2) id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval 1 YAL008W 658.7190 760.1870 557.251 0.7330447 -0.4480269 0.6968153 2 YBR255W 1218.9621 1191.5994 1246.325 1.0459262 0.0647810 0.8608814 3 YGR131W 152.8856 191.9243 113.847 0.5931871 -0.7534410 0.8802727 4 YNL003C 2738.6953 3320.2898 2157.101 0.6496725 -0.6222155 0.3482909 5 YBR135W 1046.0431 781.0483 1311.038 1.6785619 0.7472258 0.2773562 6 YBR160W 2919.5364 3296.0907 2542.982 0.7715146 -0.3742345 0.5954106 padj resVarA resVarB 1 1 NA NA 2 1 NA NA 3 1 NA NA 4 1 NA NA 5 1 NA NA 6 1 NA NANote that resVarA and resVarB are NA because the variance was calculated across conditions rather than within each condition.