Section 1: Simple Analysis with Reference Genome and Single Sample

The first step is to map the reads using TopHat. Since we have already done this in a previous section, we will work with the file ~/tophat_out/accepted_hits.bam.

We will now use Cufflinks to assemble the reads into transfrags using the command below:
$ cufflinks -o cuff_out tophat_out/accepted_hits.bam
This command will produce three new files: transcripts.gtf, isoforms.fpkm_tracking, and genes.fpkm_tracking in the cuff_out folder. The transcripts.gtf file contains the structure of the assembled transfrags in GTF format. The other two files contain the expression estimates for the isoforms and genes, respectively.

We will first take a look at the GTF file by viewing it directly with the command
$ less cuff_out/transcripts.gtf
You can also view the Cufflinks abundance estimates by opening the fpkm_tracking files. A simple way to view the file is using the command
$ less cuff_out/isoforms.fpkm_tracking
The FPKM column shows the estimated abundance in Fragments Per Kilobase per Million sequenced.

Note that if you were not interested in assembling novel isoforms, you could have provided a reference annotation in GFF or GTF format using the -G option to just calculate abundance estimates.
$ cufflinks -o cuff_out -G ~/annotations/hg19-refGene.gtf tophat_out/accepted_hits.bam

Again, it is helpful to write this as a script.
#!/bin/bash
 
# Change these variables to match your experiment
EXP_NAME=exp1
DATA_DIR=~/exp1
OUT_DIR=$EXP_NAME
BAM_FILE=$EXP_NAME/accepted_hits.bam
 
CMD="cufflinks -o $OUT_DIR $BAM_FILE"
 
echo "Running $CMD"
# Show output on the screen AND save in $EXP_NAME.cuff.log
$CMD 2>&1 | tee $EXP_NAME_cuff.log

Section 2: Cuffcompare


If you have run Cufflinks allowing for novel isoforms, Cuffcompare allows you to compared your assembled transcripts to a reference annotation.
$ cuffcompare -r ~/annotations/hg19-refGene.gtf ~/cuff_out/transcripts.gtf
To look out the output, use the less command:
$ less cuffcmp.combined.gtf
chr1    Cufflinks       exon    1362934 1363327 .       +       .       gene_id "XLOC_000006"; transcript_id "TCONS_00000008"; exon_number "2"; gene_name "NM_001146685"; oId "CUFF.35.1"; nearest_ref "NM_001146685"; class_code "="; tss_id "TSS7";
chr1    Cufflinks       exon    1370803 1371201 .       +       .       gene_id "XLOC_000007"; transcript_id "TCONS_00000009"; exon_number "1"; gene_name "NM_022834"; oId "CUFF.37.1"; nearest_ref "NM_022834"; class_code "j"; tss_id "TSS8";
chr1    Cufflinks       exon    1372307 1372864 .       +       .       gene_id "XLOC_000007"; transcript_id "TCONS_00000009"; exon_number "2"; gene_name "NM_022834"; oId "CUFF.37.1"; nearest_ref "NM_022834"; class_code "j"; tss_id "TSS8";
In this output from cuffcompare, we can see from the class code label that TCONS_00000008 has class code "=" which means it matches a known transcripts. TCONS_00000009 has class code "j", which denotes a novel transcript. More information about Cuffcompare can be found in Cufflinks Tutorial.

Section 3: Analysis with Reference Genome and Multiple Samples


It is likely that you will have several samples (and multiple replicates per sample) that you will wish to compare to calculate differential expression. The Cufflinks package includes a program called Cuffdiff that will handle this situation for you.

If you wish to find novel isoforms, you should run Cufflinks on each sample independently as shown above. You will then use the Cuffmerge utility to combine the assemblies (along with an optional reference annotation) into a single GTF. To use Cuffmerge, you will need a text file (here called assembly.txt) with the names of the transcript.gtf file produced from the Cufflinks runs you want to run through Cuffdiff.
$ cuffmerge -g  ~/annotations/hg19-refGene.gtf assembly.txt

You can now run Cuffdiff with the following command where merged.gtf is the output from Cuffmerge:
$ cuffdiff -o diff_out -b genome.fa -G merged.gtf sample1rep1.bam,sample1rep2.bam sample2rep1.bam,sample2rep2.bam
This command will estimate the bias-corrected abundances of the individual samples along with a between-replicate variance in order to find cases of significant differential expression. The method is similar to the count-based statistical methods described in an earlier section, but it also takes uncertainties in isoform deconvolution. Furthermore, Cuffdiff will determine differential splicing, coding (CDS), and promoter use using the Jenson-Shannon divergence.

The output to this command includes fpkm_tracking files similar to those produced by Cufflinks as well as the results of tests for differential expression at both the gene and isoform level, differential splicing, differential coding (CDS), and differential promoter use. Each test file contains a p-value, q-value, and FDR-based significance decision (YES or NO). Full details on the output can be found in the Cufflinks Manual.


Section 4: References and Further Reading


References:
  • Pachter, L (2011). Models for transcript quantification from RNA-Seq. arXiv:1104.3889v2
  • Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btr355
  • Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology. doi:10.1186/gb-2011-12-3-r22
  • Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. doi:10.1038/nbt.162

  • Cufflinks Manual and Tutorial
  • eXpress Manual and Tutorial

Tools of Interest:
  • Cufflinks (reference-based transcriptome assembly, abundance estimation, and differential analysis)
  • Scripture (reference-based transcriptome assembly)
  • Trinity (de novo transcript assembly)
  • Trans-ABySS(de novo transcript assembly)
  • Oases (de novo transcript assembly)
  • eXpress (transcript-level abundance estimation)
  • RSEM (transcript-level abundance estimation)