Functional Enrichment

Are there any biological functions associated with your differentially expressed gene sets?

Online tools
There are MANY, here's an example...
  • Neurospora crassa FUNCAT (functional categories)
http://mips.helmholtz-muenchen.de/cgi-bin/proj/funcatDB/search_advanced.pl?gene=2
  • Set of functionally related genes:
NC_sample_genes.jpg
  • Input sample gene set:
NC_input.jpg
  • Output: 43 functional terms enrichedNc_output.jpg


Gene Ontology

Gene Ontology (GO) Project http://www.geneontology.org/
"Major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data."
  • Sets of functional terms separated by broad category: Cellular Component, Molecular Function, or Biological Process
  • GO Ontology File (all terms and descriptions): OBO v1.2 available http://www.geneontology.org/GO.downloads.ontology.shtml
  • Example:
    [Term]
    id: GO:0000009
    name: alpha-1,6-mannosyltransferase activity
    namespace: molecular_function
    def: "Catalysis of the transfer of a mannose residue from GDP-mannose to an oligosaccharide, forming an alpha-1,6-linkage." [GOC:mcc, PMID:2644248]
    xref: EC:2.4.1.-
    xref: Reactome:REACT_22295 "Addition of a third mannose to the N-glycan precursor by Alg2, Saccharomyces cerevisiae"
    xref: Reactome:REACT_22383 "Addition of a third mannose to the N-glycan precursor by ALG2, Homo sapiens"
    is_a: GO:0000030 ! mannosyltransferase activity


Species with defined GO terms
http://www.geneontology.org/GO.annotation.species_db.shtml

Example GO term file (Candida)

Transferring GO terms
  • Reciprocal best hits (BLAST) for most closely-related species with associated GO terms
  • Transfer GO terms from RBH


Pfam Domains

http://pfam.sanger.ac.uk/

Alternatively, functional enrichment can be done using functional domain predictions...

HMMER
http://hmmer.janelia.org/
  • hmmscan program from HMMER3 package allows you to "search protein sequences against collections of profiles, e.g. Pfam".
  • Output is long and detailed... use the python script included here:

    • If the Pfam hmm isn't accessible, you will need to download the Pfam alignment files here:
      ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.seed.gz
      And then run the following to decompress and create an hmm database using the program hmmbuild (HMMER3 package). These files are big, delete or compress when finished running hmmscan.
      $ gunzip Pfam-A.seed.gz
      $ hmmbuild Pfam-A.hmm Pfam-A.seed
      $ ls Pfam-A.* # this gives you:
      Pfam-A.hmm Pfam-A.hmm.h3f Pfam-A.hmm.h3i Pfam-A.hmm.h3m Pfam-A.hmm.h3p Pfam-A.seed

    • The wrapper script run_hmmscan.py will run hmmscan for you, but it's not the most efficient and it takes quite a while for a full genome set of proteins (10,000 genes = appx. 1.5hr). This test set only has 8 sequences:

$ python test.fa test_pfam # this gives you 5 files (the tmp files can be deleted):
    • $ ls ../get_Pfam_predictions
    • run_hmmscan.py test.fa test_pfam.assoc test_pfam.desc test_pfam.pop tmp.fa tmp.out
    • $ head test_pfam.*
    • > test_pfam.assoc <
    • Bcer_06612-RA Ldh_2;
    • Bcer_06601-RA
    • Bcer_06600-RA
    • Bcer_06610-RA
    • Bcer_06603-RA Hus1;
    • Bcer_06606-RA Hydrolase;E1-E2_ATPase;
    • Bcer_06622-RA MMtag;
    • Bcer_06616-RA APH;

    • > test_pfam.desc <
    • Ldh_2 Malate/L-lactate dehydrogenase
    • Hus1 Hus1-like protein
    • E1-E2_ATPase E1-E2 ATPase
    • APH Phosphotransferase enzyme family
    • Hydrolase haloacid dehalogenase-like hydrolase
    • MMtag Kinase phosphorylation protein

    • > test_pfam.pop <
    • Bcer_06612-RA
    • Bcer_06601-RA
    • Bcer_06600-RA
    • Bcer_06610-RA
    • Bcer_06603-RA
    • Bcer_06606-RA
    • Bcer_06622-RA
    • Bcer_06616-RA

    • The output files from run_hmmscan.py (association, description and population) should be correctly formatted for running the following functional enrichment program...


GeneMerge: Term enrichment test

http://genemerge.cbcb.umd.edu/
Castillo-Davis, C.I. and D.L. Hartl 2003. Bioinformatics 19(7):891-892 (If you use output from GeneMerge, please cite this paper)
  • Hypergeometric distribution: probability of successes in n draws without replacement
  • Here is the GeneMerge code
    • $ perl GeneMerge1.2.pl _.assoc _.desc ___.pop list_of_interest.txt output.out
    • Those first 3 files (assoc, desc, pop) are the output from above
    • The list of interest is simply your list of genes of interest (1 gene ID per line, obviously must match the gene ID's in the assoc and pop files)
    • And... it writes to the output file
  • Here is a very old script I wrote to make the output file slightly more user-friendly
    • $ python sort_geneMerge_output.py output.out -(flag)
    • There are 3 flags in the code (use just 1 at a time...)-s (basic summary statistics will print)
      -u (GO terms with uncorrected p-vals > 0.05 will print)
      -c (GO terms with bonferroni corrected p-vals > 0.05 will print
  • The programs and an example are in this file:
    • To download, unpack the directory and run the example:
      $ wget http://cgrlucb.wikispaces.com/file/view/Cocci_example.tar.gz
      $ tar -zxvf Cocci_example.tar.gz
      $ cd Cocci_example
      $ ls
      Cimmitis_pfam.assoc Cimmitis_pfam.pop geneMerge_use.txt upH_allpop.txt
      Cimmitis_pfam.desc GeneMerge1.2.pl sort_geneMerge_output.py upS_allpop.txt
      $ perl GeneMerge1.2.pl Cimmitis_pfam.assoc Cimmitis_pfam.desc Cimmitis_pfam.pop upH_allpop.txt upH.out
      $ python sort_geneMerge_output.py upH.out -c
      Pfam_ID Pop_freq Pop_frac Study_frac raw_Pval corr_Pval Description Genes_with_term
      Kinesin 0.00112485939257593 11/9779 8/1299 1.07985654931451e-05 0.00555046266347657 Kinesin motor domain CIMG_01713 CIMG_04554 CIMG_05502 CIMG_07460 CIMG_09273 CIMG_11124 CIMG_12417 CIMG_13268