Functional+Enrichment+Analysis

Spring 2012 Workshop Yulia Mostovoy

Requirement:
 * Rpy2 python library

Presentation: [|CGRLworkshop_functionalenrichment.pdf]

=Evaluating enrichment=

You have a list of genes that were differentially expressed (DE) in your experiment - so what can you do with them? You want to characterize them, see what kinds of genes belong to the group, and determine whether they have anything in common. A common approach is to look for significant overlap with various functional categories: do the genes tend to have similar functions? Do they tend to belong to shared pathways, or are they influenced by the same regulatory factors?

We can determine whether your genes have significant overlap with a given category by using Fisher’s exact test. First, we set up a 2x2 contingency table:
 * ~  ||~ Genes in category ||~ Genes not in category ||~ total ||
 * ~ DE genes ||= **k** ||= m - k ||= **m** ||
 * ~ Not DE genes ||= n - k ||= N - m - n + k ||= N - m ||
 * ~ total ||= **n** ||= N - n ||= **N** ||

You can get values for the contingency table using a few UNIX commands, given that you have the following files, each containing one gene per line: code $ sort DE_genelist -o DE_genelist $ sort category_genelist -o category_genelist $ sort all_valid_genelist -o all_valid_genelist
 * List of your DE genes ("DE_genelist" in the example below)
 * List of genes in your category ("category_genelist")
 * List of all genes with valid data in your study ("all_valid_genelist")

$ join all_valid_genelist DE_genelist | wc -l 211 (m)

$ join all_valid_genelist category_genelist | wc -l 39 (n)

$ wc -l all_valid_genelist 6152 (N)

$ join DE_genelist category_genelist | wc -l 14 (k) code Plug those values in to get the final table:
 * ~  ||~ Genes in category ||~ Genes not in category ||
 * ~ Differentially expressed genes ||= 14 ||= 197 ||
 * ~ Not differentially expressed genes ||= 25 ||= 5916 ||

(In this example, the DE genes were upregulated in yeast during heat shock (Gasch et al. 2000, data from here), while the category represents the targets of transcription factor HSF1, which is known to regulate the transcriptional response to heat shock.)

Once you have your table, Fisher's exact test determines the probability of getting that contingency table under the null hypothesis that the DE genes are just as likely to belong to the category as any other genes in the genome. A low p-value using the one-tailed version of the test indicates that the DE genes are statistically overrepresented within the category.

We can perform this test using the following R code: code > table <- matrix(c(14,25,197,5916), nrow=2, ncol=2) > fisher.result <- fisher.test(table, alternative=‘g’) > p.value <- fisher.result$p.value > p.value [1] 1.43551e-11 code We conclude that the DE genes are significantly overrepresented among the targets of HSF1.

If you are only interesting in a single category, that's all you need to do. When you have many categories, you would run the above R code for each category, generating a list of p-values (a script that does this is described below). Then it's essential to correct your p-values to account for multiple testing. Here's how you would do that in R: code > pvals.example <- c(0.05, 0.00001, 0.4, 0.2, 0.002) > pvals.adjusted <- p.adjust(pvals.example, method='bonferroni') > pvals.adjusted [1] 0.25000 0.00005 1.00000 1.00000 0.01000 code

=Sources of functional categories=
 * ==Gene ontology (GO)==
 * GO terms describe the properties of gene products in a structured and standardized way. They are hierarchical, with broad terms branching into increasingly specific terms. They are standardized such that the same set of terms can be used for any species. Terms are split into the following three broad types:
 * Biological process
 * Molecular function
 * Cellular component
 * GO term descriptions and annotations for many species can be downloaded from [|geneontology.org]
 * ==Biochemical/metabolic pathways==
 * Examples include:
 * [|KEGG pathways] - available for many species
 * [|Pathways for Saccharomyces cerevisiae from SGD]
 * ==Transcription factor targets==
 * May involve motif data and/or ChIP binding data, with availability varying by species.
 * Example: Transcription factor target data in //S. cerevisiae// from MacIsaac et al. 2006, [|available here].
 * ==Many others!==
 * Other kinds of classifications that may be useful include protein complexes, co-expressed genes, microRNA targets, and genes with disease phenotypes.

=Tutorial: performing an enrichment analysis=

Included in this tutorial is a script that will perform enrichment analysis on any set of genes and functional categories. The script takes three files as input:
 * 1) A list of the genes in your subset of interest (e.g. differentially expressed genes)
 * Format: one gene name per line
 * 1) A list of all the genes for which there is valid data in your study
 * Format: one gene name per line
 * This list should include the genes in your subset of interest
 * 1) Categories to be tested
 * One group per line, in the following format:
 * Group name [tab] list of genes (each separated by a tab)
 * Optionally, add group descriptions like so:
 * Group name | group description [tab] list of genes (separated by tabs)

The file [|enrichment_analysis.tar.gz] includes the script, as well as sample files and several category files for //S. cerevisiae//.

Sample run: code $ python functional_enrichment.py upregulated_heat_shock.txt all_valid_genes.txt TF_target_file_S_cerevisiae.txt > output.txt Genes in sample: 211 Genes in population: 6152 Genes in sample rejected for not being present in population: 0

$ head output.txt Group  Frequency       pval_raw        pval_adjusted   Description HSF1   14/39 (35.90%)  1.435510e-11    1.435510e-09    Trimeric heat shock transcription factor, activates multiple genes in response to stresses that include hyperthermia; recognizes variable heat shock elements (HSEs) consisting of inverted NGAAN repeats; posttranslationally regulated MSN2   17/78 (21.79%)  6.524301e-10    6.524301e-08    Transcriptional activator related to Msn4p; activated in stress conditions, which results in translocation from the cytoplasm to the nucleus; binds DNA at stress response elements of responsive genes, inducing gene expression SKN7   18/129 (13.95%) 3.178556e-07    3.178556e-05    Nuclear response regulator and transcription factor, part of a branched two-component signaling system; required for optimal induction of heat-shock genes in response to oxidative stress; involved in osmoregulation MSN4   12/68 (17.65%)  2.594244e-06    2.594244e-04    Transcriptional activator related to Msn2p; activated in stress conditions, which results in translocation from the cytoplasm to the nucleus; binds DNA at stress response elements of responsive genes, inducing gene expression NRG1   14/101 (13.86%) 7.410118e-06    7.410118e-04    Transcriptional repressor that recruits the Cyc8p-Tup1p complex to promoters; mediates glucose repression and negatively regulates a variety of processes including filamentous growth and alkaline pH response ROX1   8/42 (19.05%)   7.235930e-05    7.235930e-03    Heme-dependent repressor of hypoxic genes; contains an HMG domain that is responsible for DNA bending activity RTG3   8/45 (17.78%)   1.209457e-04    1.209457e-02    Basic helix-loop-helix-leucine zipper (bHLH/Zip) transcription factor that forms a complex with another bHLH/Zip protein, Rtg1p, to activate the retrograde (RTG) and TOR pathways SUT1   9/64 (14.06%)   2.957879e-04    2.957879e-02    Transcription factor of the Zn[II]2Cys6 family involved in sterol uptake; involved in induction of hypoxic gene expression MOT3   5/21 (23.81%)   5.876120e-04    5.876120e-02    Nuclear transcription factor with two Cys2-His2 zinc fingers; involved in repression of a subset of hypoxic genes by Rox1p, repression of several DAN/TIR genes during aerobic growth, and repression of ergosterol biosynthetic genes code

Also included are several scripts that convert files of different formats into the category format readable by this script.
 * [|format_GO_files.tar.gz] - can be used for any gene_association file from [|geneontology.org]
 * [|format_TF_target_file_S_cerevisiae.tar.gz] - formats ORFs_by_factor files from [|MacIsaac et al. 2006]

=References=

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. 2000. Gene Ontology: tool for the unification of biology. //Nat. Genet.// **25:** 25-29.

Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel Orna, Eisen, M. B., Storz, G., Botstein, D., and Brown, P. O. 2000. Genomic expression programs in the response of yeast cells to environmental changes. //Mol. Biol. Cell// **11:** 4241.

MacIsaac, K., Wang, T., Gordon, D. B., Gifford, D., Stormo, G., and Fraenkel, E. 2006. An improved map of conserved regulatory sites for //Saccharomyces cerevisiae//. //BMC Bioinformatics// **7:** 113.