Spring+2012+DESeq+Tutorial

=DESeq Tutorial= toc Hana Lee (hanalee@berkeley.edu) Brem Lab CGRL RNAseq Workshop April 30, 2012

Background
The question we are trying to answer is: Which genes are being expressed at different levels in different conditions? (Conditions can be environment, genetic background, etc.) Or in statistical terms: Do our measurements for the expression of a gene in different RNAseq experiments come from two different distributions or the same distribution? We test for differential expression by calculating a p-value that represents the probability of the null hypothesis (i.e. the measurements come from the same distribution and the gene is being expressed at the same level across conditions).

DESeq is a [|R] package for [|Bioconductor] designed to assess the statistical significance of expression differences measured in RNAseq.

Negative binomial vs. Poisson
Count data follows a [|Poisson distribution]. However, Poisson assumes that the mean equals the variance/dispersion (//μ// = // σ //2). However, in RNAseq data, genes with larger mean counts tend to show greater variance, which is described as the overdispersion problem.



One solution is to approximate count data with a [|negative binomial distribution], which is typically used to address overdispersion. The above figure shows the relationship of variance to mean in fly RNAseq data. The purple line represents what is expected from a Poisson distribution, while the orange line represents a negative binomial distribution as used in DESeq. (The dotted orange line represents [|edgeR], another Bioconductor package that relies on the negative binomial, although it parametrizes the distribution a little differently.)

Estimating dispersion
DESeq estimates the variance by taking into account both the coverage within each library ("size factors"), the expression strength under each condition, and the per-gene variance. Library coverage is determined by comparing the ratio of gene counts between replicate experiments. Expression strength is determined by the mean counts for a gene for each condition. The per-gene variance is assumed to be a function of the mean that is approximated by empirical fit to the data. For more details, see Anders & Huber, 2010.

By comparison, edgeR estimates the variance by estimating a single parameter that is assumed to be constant for all genes in an experiment.

Statistical test
P-values are calculated through a method that is analogous to a Fisher's exact test, using a 2x2 contingency table, but instead of assuming that the probabilities follow a hypergeometric distribution, they follow a negative binomial distribution parametrized from the mean and the estimated dispersion.

Installing DEseq
code > source("http://www.bioconductor.org/biocLite.R") > biocLite("DESeq") code

Preparing the Input
We will be using the following yeast sample data: [|yeast_sample_data.txt] (right-click to download and save locally to your computer)

(Note: if you have already used [|EDAseq] to normalize your data, please skip ahead as your data is already loaded into R as an object that DESeq can read.)

Load the data into R: code > count_table <- read.table("yeast_sample_data.txt", header=T, sep="\t", row.names=1) > head(count_table) WE1_counts WE2_counts M1_counts M2_counts M3_counts YAL008W       465        291       911       926       946 YBR255W      1040       1357      1428      1304      1112 YGR131W        95        170       230       113       138 YNL003C      1800       3107      3979      3012      2899 YBR135W      1094       2143       936       860       561 YBR160W      2122       2926      3950      3630      2980 code

In this data set, the rows are labeled by systematic gene names, and each column represents an RNAseq library. For the purposes of this tutorial, treat WE and M as two separate conditions, with two and three replicates respectively. Note that DESeq takes unnormalized count values as input so do not normalize by library size.

Create a vector in R that will describe the condition corresponding to each column name in the count table. code > expt_design <- data.frame(row.names = colnames(count_table), +    condition = c("WE","WE","M","M","M")) > expt_design condition WE1_counts       WE WE2_counts        WE M1_counts          M M2_counts          M M3_counts          M > conditions = expt_design$condition > conditions [1] WE WE M M M Levels: M WE code

Running DEseq
First, we create a CountDataSet object: code > library("DESeq") > data <- newCountDataSet(count_table, conditions) > head(counts(data)) WE1_counts WE2_counts M1_counts M2_counts M3_counts YAL008W       465        291       911       926       946 YBR255W      1040       1357      1428      1304      1112 YGR131W        95        170       230       113       138 YNL003C      1800       3107      3979      3012      2899 YBR135W      1094       2143       936       860       561 YBR160W      2122       2926      3950      3630      2980 code

Next, we estimate the size factors (which is a representation of the differences in coverage between replicates): code > data <- estimateSizeFactors(data) > sizeFactors(data) WE1_counts WE2_counts M1_counts  M2_counts  M3_counts 0.8225420 1.0636676  1.1830475  1.1250534  0.8588237 code

(Note: if you've already used [|EDAseq] on your data, run the following commands on your data to transform it to a CountDataSet object: code library("DESeq") data <- as(normalized_data, "CountDataSet") code After this point, proceed with the rest of the tutorial.)

Finally, we estimate the per-gene variance as a function of the mean: code > data <- estimateVarianceFunctions(data) > results <- nbinomTest(data, "M", "WE") code

(Note: Depending on the version you install, the name of the function differs from estimateDispersions provided in the vignette.)

If you don't have any replicates, there's an option that allows you to treat all conditions as replicates in order to estimate the per-gene variance. Please note that not having any replicates greatly reduces DESeq's power to detect differential expression:

code > data2 <- data[, c("WE1_counts","M1_counts")] > data2 <- estimateSizeFactors(data2) > data2 <- estimateVarianceFunctions(data2, method="blind") > results2 <- nbinomTest(data2, "M", "WE") code

Results can be saved as a file: code > write.table(results, +    file="DESeq_results.txt", +     sep="\t", +     row.names=rownames(results), +     col.names=colnames(results), +     quote=F) code

Understanding the Output
What the results look like: code > head(results) id baseMean baseMeanA baseMeanB foldChange log2FoldChange         pval 1 YAL008W 706.7053  898.2080  419.4512  0.4669867    -1.09854657 6.641443e-04 2 YBR255W 1240.2100 1220.3008 1270.0738 1.0407874     0.05767547 7.435133e-01 3 YGR131W 146.1715  151.8459  137.6600  0.9065769    -0.14149875 8.722571e-01 4 YNL003C 2905.0929 3138.7004 2554.6817 0.8139298    -0.29702379 2.657303e-01 5 YBR135W 1110.7109 736.2681 1672.3752  2.2714216     1.18359548 2.550313e-07 6 YBR160W 3073.1755 3345.0702 2665.3334 0.7967945    -0.32772048 1.890174e-01 padj   resVarA      resVarB 1 6.044244e-03 2.18947587 1.0247063951 2 8.925745e-01 0.14417477 0.0006701733 3 9.528324e-01 1.30581931 0.1901169941 4 5.462501e-01 1.05583592 0.6348775975 5 4.679414e-06 0.20112792 2.8304917137 6 4.485319e-01 0.08728659 0.0310454492

code
 * baseMean - mean estimated from both conditions (with normalization by size factors)
 * baseMeanA - mean estimated from the first condition specified as an argument to nbinomTest (in this case, "M")
 * baseMeanB - mean estimated from the second condition (in this case, "WE")
 * foldChange - ratio of the estimated means
 * log2FoldChange - log2 of the ratio
 * pval - uncorrected p-value from the negative binomial test
 * padj - p-value adjusted for multiple testing using [|Benjamini-Hochberg] to estimate the false discovery rate
 * resVarA - dispersion estimated from the first condition
 * resVarB - dispersion estimated from the second condition

Results from the case with no replicates: code > head(results2) id baseMean baseMeanA baseMeanB foldChange log2FoldChange      pval 1 YAL008W 658.7190  760.1870   557.251  0.7330447     -0.4480269 0.6968153 2 YBR255W 1218.9621 1191.5994 1246.325  1.0459262      0.0647810 0.8608814 3 YGR131W 152.8856  191.9243   113.847  0.5931871     -0.7534410 0.8802727 4 YNL003C 2738.6953 3320.2898 2157.101  0.6496725     -0.6222155 0.3482909 5 YBR135W 1046.0431 781.0483  1311.038  1.6785619      0.7472258 0.2773562 6 YBR160W 2919.5364 3296.0907 2542.982  0.7715146     -0.3742345 0.5954106 padj resVarA resVarB 1   1      NA      NA 2    1      NA      NA 3    1      NA      NA 4    1      NA      NA 5    1      NA      NA 6    1      NA      NA

code Note that resVarA and resVarB are NA because the variance was calculated across conditions rather than within each condition.