Isoform+Estimation+Spring+2013

__Section 1:__ Simple Analysis with Reference Genome and Single Sample

The first step is to map the reads using TopHat. Since we have already done this in a previous section, we will work with the file ~/tophat_out/accepted_hits.bam.

We will now use Cufflinks to assemble the reads into transfrags using the command below: code $ cufflinks -o cuff_out tophat_out/accepted_hits.bam code This command will produce three new files: transcripts.gtf, isoforms.fpkm_tracking, and genes.fpkm_tracking in the cuff_out folder. The transcripts.gtf file contains the structure of the assembled transfrags in GTF format. The other two files contain the expression estimates for the isoforms and genes, respectively.

We will first take a look at the GTF file by viewing it directly with the command code $ less cuff_out/transcripts.gtf code You can also view the Cufflinks abundance estimates by opening the fpkm_tracking files. A simple way to view the file is using the command code $ less cuff_out/isoforms.fpkm_tracking code The FPKM column shows the estimated abundance in **F**ragments **P**er **K**ilobase per **M**illion sequenced.

Note that if you were not interested in assembling novel isoforms, you could have provided a reference annotation in GFF or GTF format using the -G option to just calculate abundance estimates. code $ cufflinks -o cuff_out -G ~/annotations/hg19-refGene.gtf tophat_out/accepted_hits.bam code

Again, it is helpful to write this as a script. code format="bash"
 * 1) !/bin/bash

EXP_NAME=exp1 DATA_DIR=~/exp1 OUT_DIR=$EXP_NAME BAM_FILE=$EXP_NAME/accepted_hits.bam
 * 1) Change these variables to match your experiment

CMD="cufflinks -o $OUT_DIR $BAM_FILE"

echo "Running $CMD" $CMD 2>&1 | tee $EXP_NAME_cuff.log code
 * 1) Show output on the screen AND save in $EXP_NAME.cuff.log

__Section 2:__ Cuffcompare
If you have run Cufflinks allowing for novel isoforms, Cuffcompare allows you to compared your assembled transcripts to a reference annotation. code $ cuffcompare -r ~/annotations/hg19-refGene.gtf ~/cuff_out/transcripts.gtf code To look out the output, use the less command: code $ less cuffcmp.combined.gtf chr1   Cufflinks       exon    1362934 1363327. +      .       gene_id "XLOC_000006"; transcript_id "TCONS_00000008"; exon_number "2"; gene_name "NM_001146685"; oId "CUFF.35.1"; nearest_ref "NM_001146685"; class_code "="; tss_id "TSS7"; chr1   Cufflinks       exon    1370803 1371201. +      .       gene_id "XLOC_000007"; transcript_id "TCONS_00000009"; exon_number "1"; gene_name "NM_022834"; oId "CUFF.37.1"; nearest_ref "NM_022834"; class_code "j"; tss_id "TSS8"; chr1   Cufflinks       exon    1372307 1372864. +      .       gene_id "XLOC_000007"; transcript_id "TCONS_00000009"; exon_number "2"; gene_name "NM_022834"; oId "CUFF.37.1"; nearest_ref "NM_022834"; class_code "j"; tss_id "TSS8"; code In this output from cuffcompare, we can see from the class code label that TCONS_00000008 has class code "=" which means it matches a known transcripts. TCONS_00000009 has class code "j", which denotes a novel transcript. More information about Cuffcompare can be found in Cufflinks Tutorial.

__Section 3:__ Analysis with Reference Genome and Multiple Samples
It is likely that you will have several samples (and multiple replicates per sample) that you will wish to compare to calculate differential expression. The Cufflinks package includes a program called Cuffdiff that will handle this situation for you.

If you wish to find novel isoforms, you should run Cufflinks on each sample independently as shown above. You will then use the Cuffmerge utility to combine the assemblies (along with an optional reference annotation) into a single GTF. To use Cuffmerge, you will need a text file (here called assembly.txt) with the names of the transcript.gtf file produced from the Cufflinks runs you want to run through Cuffdiff. code $ cuffmerge -g ~/annotations/hg19-refGene.gtf assembly.txt code

You can now run Cuffdiff with the following command where merged.gtf is the output from Cuffmerge: code $ cuffdiff -o diff_out -b genome.fa -G merged.gtf sample1rep1.bam,sample1rep2.bam sample2rep1.bam,sample2rep2.bam code This command will estimate the bias-corrected abundances of the individual samples along with a between-replicate variance in order to find cases of significant differential expression. The method is similar to the count-based statistical methods described in an earlier section, but it also takes uncertainties in isoform deconvolution. Furthermore, Cuffdiff will determine differential splicing, coding (CDS), and promoter use using the Jenson-Shannon divergence.

The output to this command includes fpkm_tracking files similar to those produced by Cufflinks as well as the results of tests for differential expression at both the gene and isoform level, differential splicing, differential coding (CDS), and differential promoter use. Each test file contains a p-value, q-value, and FDR-based significance decision (YES or NO). Full details on the output can be found in the [|Cufflinks Manual].

__Section 4:__ References and Further Reading
>
 * References:**
 * Pachter, L (2011). **Models for transcript quantification from RNA-Seq.** arXiv:1104.3889v2
 * Roberts A, Pimentel H, Trapnell C, Pachter L. **Identification of novel transcripts in annotated genomes using RNA-Seq.** Bioinformatics. doi:10.1093/bioinformatics/btr355
 * Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). **Improving RNA-Seq expression estimates by correcting for fragment bias.** //Genome Biology.// [|doi:10.1186/gb-2011-12-3-r22]
 * Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). **Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.** Nature Biotechnology. [|doi:10.1038/nbt.162]
 * Cufflinks Manual and [|Tutorial]
 * eXpress Manual and Tutorial


 * Tools of Interest:**
 * Cufflinks (reference-based transcriptome assembly, abundance estimation, and differential analysis)
 * Scripture (reference-based transcriptome assembly)
 * Trinity (//de novo// transcript assembly)
 * Trans-ABySS(//de novo// transcript assembly)
 * Oases (//de novo// transcript assembly)
 * eXpress (transcript-level abundance estimation)
 * RSEM (transcript-level abundance estimation)