The question we are trying to answer is: Which genes are being expressed at different levels in different conditions? (Conditions can be environment, genetic background, etc.) Or in statistical terms: Do our measurements for the expression of a gene in different RNAseq experiments come from two different distributions or the same distribution? We test for differential expression by calculating a p-value that represents the probability of the null hypothesis (i.e. the measurements come from the same distribution and the gene is being expressed at the same level across conditions).

DESeq is a R package for Bioconductor designed to assess the statistical significance of expression differences measured in RNAseq.

Negative binomial vs. Poisson

Count data follows a Poisson distribution. However, Poisson assumes that the mean equals the variance/dispersion (μ = σ2). However, in RNAseq data, genes with larger mean counts tend to show greater variance, which is described as the overdispersion problem.

Fig. 1 from Anders & Huber, 2010: Dependence of the variance on the mean for condition A in the fly RNA-Seq data

One solution is to approximate count data with a negative binomial distribution, which is typically used to address overdispersion. The above figure shows the relationship of variance to mean in fly RNAseq data. The purple line represents what is expected from a Poisson distribution, while the orange line represents a negative binomial distribution as used in DESeq. (The dotted orange line represents edgeR, another Bioconductor package that relies on the negative binomial, although it parametrizes the distribution a little differently.)

Estimating dispersion

DESeq estimates the variance by taking into account both the coverage within each library ("size factors"), the expression strength under each condition, and the per-gene variance. Library coverage is determined by comparing the ratio of gene counts between replicate experiments. Expression strength is determined by the mean counts for a gene for each condition. The per-gene variance is assumed to be a function of the mean that is approximated by empirical fit to the data. For more details, see Anders & Huber, 2010.

By comparison, edgeR estimates the variance by estimating a single parameter that is assumed to be constant for all genes in an experiment.

Statistical test

P-values are calculated through a method that is analogous to a Fisher's exact test, using a 2x2 contingency table, but instead of assuming that the probabilities follow a hypergeometric distribution, they follow a negative binomial distribution parametrized from the mean and the estimated dispersion.

References

Anders & Huber (2010). Differential expression analysis for sequence count data. Genome Biology11: R106. [doi:10.1186/gb-2010-11-10-r106]

We will be using the following yeast sample data: yeast_sample_data.txt (right-click to download and save locally to your computer)

(Note: if you have already used EDAseq to normalize your data, please skip ahead as your data is already loaded into R as an object that DESeq can read.)

In this data set, the rows are labeled by systematic gene names, and each column represents an RNAseq library. For the purposes of this tutorial, treat WE and M as two separate conditions, with two and three replicates respectively. Note that DESeq takes unnormalized count values as input so do not normalize by library size.

Create a vector in R that will describe the condition corresponding to each column name in the count table.

> expt_design <- data.frame(row.names = colnames(count_table),
+ condition = c("WE","WE","M","M","M"))
> expt_design
condition
WE1_counts WE
WE2_counts WE
M1_counts M
M2_counts M
M3_counts M
> conditions = expt_design$condition
> conditions
[1] WE WE M M M
Levels: M WE

(Note: if you've already used EDAseq on your data, run the following commands on your data to transform it to a CountDataSet object:

library("DESeq")
data <- as(normalized_data, "CountDataSet")

After this point, proceed with the rest of the tutorial.)

Finally, we estimate the per-gene variance as a function of the mean:

> data <- estimateVarianceFunctions(data)
> results <- nbinomTest(data, "M", "WE")

(Note: Depending on the version you install, the name of the function differs from estimateDispersions() provided in the vignette.)

If you don't have any replicates, there's an option that allows you to treat all conditions as replicates in order to estimate the per-gene variance. Please note that not having any replicates greatly reduces DESeq's power to detect differential expression:

## DESeq Tutorial

## Table of Contents

Brem Lab

CGRL RNAseq Workshop

April 30, 2012

## Background

The question we are trying to answer is: Which genes are being expressed at different levels in different conditions? (Conditions can be environment, genetic background, etc.) Or in statistical terms: Do our measurements for the expression of a gene in different RNAseq experiments come from two different distributions or the same distribution? We test for differential expression by calculating a p-value that represents the probability of the null hypothesis (i.e. the measurements come from the same distribution and the gene is being expressed at the same level across conditions).

DESeq is a R package for Bioconductor designed to assess the statistical significance of expression differences measured in RNAseq.

## Negative binomial vs. Poisson

Count data follows a Poisson distribution. However, Poisson assumes that the mean equals the variance/dispersion (

μ=σ2). However, in RNAseq data, genes with larger mean counts tend to show greater variance, which is described as the overdispersion problem.One solution is to approximate count data with a negative binomial distribution, which is typically used to address overdispersion. The above figure shows the relationship of variance to mean in fly RNAseq data. The purple line represents what is expected from a Poisson distribution, while the orange line represents a negative binomial distribution as used in DESeq. (The dotted orange line represents edgeR, another Bioconductor package that relies on the negative binomial, although it parametrizes the distribution a little differently.)

## Estimating dispersion

DESeq estimates the variance by taking into account both the coverage within each library ("size factors"), the expression strength under each condition, and the per-gene variance. Library coverage is determined by comparing the ratio of gene counts between replicate experiments. Expression strength is determined by the mean counts for a gene for each condition. The per-gene variance is assumed to be a function of the mean that is approximated by empirical fit to the data. For more details, see Anders & Huber, 2010.

By comparison, edgeR estimates the variance by estimating a single parameter that is assumed to be constant for all genes in an experiment.

## Statistical test

P-values are calculated through a method that is analogous to a Fisher's exact test, using a 2x2 contingency table, but instead of assuming that the probabilities follow a hypergeometric distribution, they follow a negative binomial distribution parametrized from the mean and the estimated dispersion.

## References

Genome Biology11: R106. [doi:10.1186/gb-2010-11-10-r106]## Installing DEseq

## Preparing the Input

We will be using the following yeast sample data: yeast_sample_data.txt (right-click to download and save locally to your computer)

(Note: if you have already used EDAseq to normalize your data, please skip ahead as your data is already loaded into R as an object that DESeq can read.)

Load the data into R:

In this data set, the rows are labeled by systematic gene names, and each column represents an RNAseq library. For the purposes of this tutorial, treat WE and M as two separate conditions, with two and three replicates respectively. Note that DESeq takes unnormalized count values as input so do not normalize by library size.

Create a vector in R that will describe the condition corresponding to each column name in the count table.

## Running DEseq

First, we create a CountDataSet object:

Next, we estimate the size factors (which is a representation of the differences in coverage between replicates):

(Note: if you've already used EDAseq on your data, run the following commands on your data to transform it to a CountDataSet object:

After this point, proceed with the rest of the tutorial.)

Finally, we estimate the per-gene variance as a function of the mean:

(Note: Depending on the version you install, the name of the function differs from estimateDispersions() provided in the vignette.)

If you don't have any replicates, there's an option that allows you to treat all conditions as replicates in order to estimate the per-gene variance. Please note that not having any replicates greatly reduces DESeq's power to detect differential expression:

Results can be saved as a file:

## Understanding the Output

What the results look like:

Results from the case with no replicates:

Note that resVarA and resVarB are NA because the variance was calculated across conditions rather than within each condition.