Section 1: Why Isoform Deconvolution is Necessary and How it is Done
When dealing with organisms with alternative splicing, there will often be ambiguous reads that could have been generated from multiple transcripts that share common sequence. For example, in the figure below, the purple paired-end reads are ambiguous meaning they could have come from the red or blue transcript.
It is important to determine from which isoform a fragment count originates in order to properly calculate the abundance for use in differential expression, otherwise changes in abundance of individual isoforms could "even out" at the gene level and vice-versa. Count-based methods that simply look at regions of the genome can then miss real differences in gene and isoform-level expression.
The most widely used tools for determining isoform-level abundances employ the Expectation-Maximization (EM) algorithm to determine the most likely (in terms of probability) origin of each fragment. While the full details of this method are outside the scope of this tutorial, Figure 4 of this paper is a simple cartoon explanation.
Section 2: Outline of Analysis for Organisms with Alternative Splicing Using "Tuxedo Tools"
Your analysis pipeline is dependent on whether or not your organism has a reference genome and/or transcriptome.
If your organism has a reference genome:
Map reads to the genome allowing for splice junctions with TopHat
If interested in novel isoforms, assemble reads into transfrags with Cufflinks.
Calculate isoform abundances with Cufflinks.
If interested in differential expression, compare samples with Cuffdiff.
If your organism has no reference genome:
If no transcriptome assembly exists, assemble reads into contigs/transfrags using Trinity, TransABySS, or Oases.
Map reads to the contigs using Bowtie.
Probabilistically assign reads using eXpress.
Use a count-based differential analysis tool with the (rounded) eXpress-estimated counts.
Section 3: Simple Analysis with Reference Genome and Single Sample
The first step is to map the reads using TopHat. Since we have already done this in a previous section, we will work with the file ~/exp1/tophat_out/accepted_hits.bam.
We will now use Cufflinks to assemble the reads into transfrags using the command below:
This command will produce thre new files: transcripts.gtf, isoforms.fpkm_tracking, and genes.fpkm_tracking in the cuff_out folder. The transcripts.gtf file contains the structure of the assembled transfrags in GTF format. The other two files contain the expression estimates for the isoforms and genes, respectively.
We will first take a look at the GTF file by viewing it directly with the command
$ less cuff_out/transcripts.gtf
Even if you are familiar with the GTF format, it is nearly impossible to interpret the transcript assemblies by looking at it in this way. Instead, we can view the GTF in a genome browser called IGV, which you can launch directly from the website by clicking here.
Once opened, you should choose the "Human hg19" reference and load the transcripts.gtf file from the File menu. You can now view what Cufflinks has assembled in comparison to the known reference.
You now need to copy the assembled annotation from the server to your local machine
You can also view the Cufflinks abundance estimates by opening the fpkm_tracking files. A simple way to view the file is using the command
$ less cuff_out/isoforms.fpkm_tracking
The FPKM column shows the estimated abundance in Fragments Per Kilobase per Million sequenced.
Note that if you were not interested in assembling novel isoforms, you could have provided a reference annotation in GFF or GTF format using the -G option to just calculate abundance estimates.
Section 4: Improved Results with Bias Correction
There are several common biases in RNA-Seq experiments stemming primarily from the library preparation. One involves sequence-specificity near the ends of reads, and another involves uneven coverage based on relative position along the transcript (i.e., 5' or 3' bias). Cufflinks optionally corrects for both of these biases when provided with the genome reference. For example, we can repeat the above experiment with bias correction using this command:
It is highly recommended that you take advantage of bias correction, as it has been shown to greatly improve results in multiple studies (here and here).
Section 5: Analysis with Reference Genome and Multiple Samples
It is likely that you will have several samples (and multiple replicates per sample) that you will wish to compare to calculate differential expression. The Cufflinks package includes a program called Cuffdiff that will handle this situation for you!
If you wish to find novel isoforms, you should run Cufflinks on each sample independently as shown above. You will then use the Cuffmerge utility to combine the assemblies (along with an optional reference annotation) into a single GTF. See the Cufflinks Tutorial for more details on how to do this.
You can now run Cuffdiff with the following command:
This command will estimate the bias-corrected abundances of the individual samples along with a between-replicate variance in order to find cases of significant differential expression. The method is similar to the count-based statistical methods described in an earlier section, but it also takes uncertainties in isoform deconvolution. Furthermore, Cuffdiff will determine differential splicing, coding (CDS), and promoter use using the Jenson-Shannon divergence.
The output to this command includes fpkm_tracking files similar to those produced by Cufflinks as well as the results of tests for differential expression at both the gene and isoform level, differential splicing, differential coding (CDS), and differential promoter use. Each test file contains a p-value, q-value, and FDR-based significance decision (YES or NO). Full details on the output can be found in the Cufflinks Manual.
Section 6: Example Analysis without a Reference Genome
Cufflinks is only usable when a reference genome exists. If your organism of interest does not have a reference genome, we recommend you use eXpress.
For this tutorial, we will assume that you do know the transcripts sequences. However, if you also lack this data, you can assemble transcript sequences from your RNA-Seq reads de novo using a tool such as Trinity, Trans-ABySS, or Oases.
We will base our example analysis on the data sample dataset included with eXpress, which is in the sample_data folder of the main express directory. We will jump to the eXpress Tutorial page for this example.
Note that eXpress uses a nearly identical model to Cufflinks, including bias correction. There is no differential expression method built to handle this type of output as of yet, but you could use rounded values for the estimated counts (post_counts_mean) in the results.xprs file as input for one of the count-based methods described in a previous section.
Section 7: References and Further Reading
References:
Pachter, L (2011). Models for transcript quantification from RNA-Seq.arXiv:1104.3889v2
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btr355
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). Improving RNA-Seq expression estimates by correcting for fragment bias.Genome Biology.doi:10.1186/gb-2011-12-3-r22
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. doi:10.1038/nbt.162
Isoform Deconvolution and Unannotated Species
Section 1: Why Isoform Deconvolution is Necessary and How it is Done
When dealing with organisms with alternative splicing, there will often be ambiguous reads that could have been generated from multiple transcripts that share common sequence. For example, in the figure below, the purple paired-end reads are ambiguous meaning they could have come from the red or blue transcript.
It is important to determine from which isoform a fragment count originates in order to properly calculate the abundance for use in differential expression, otherwise changes in abundance of individual isoforms could "even out" at the gene level and vice-versa. Count-based methods that simply look at regions of the genome can then miss real differences in gene and isoform-level expression.
The most widely used tools for determining isoform-level abundances employ the Expectation-Maximization (EM) algorithm to determine the most likely (in terms of probability) origin of each fragment. While the full details of this method are outside the scope of this tutorial, Figure 4 of this paper is a simple cartoon explanation.
Section 2: Outline of Analysis for Organisms with Alternative Splicing Using "Tuxedo Tools"
Your analysis pipeline is dependent on whether or not your organism has a reference genome and/or transcriptome.
If your organism has a reference genome:
If your organism has no reference genome:
Section 3: Simple Analysis with Reference Genome and Single Sample
The first step is to map the reads using TopHat. Since we have already done this in a previous section, we will work with the file ~/exp1/tophat_out/accepted_hits.bam.
We will now use Cufflinks to assemble the reads into transfrags using the command below:
This command will produce thre new files: transcripts.gtf, isoforms.fpkm_tracking, and genes.fpkm_tracking in the cuff_out folder. The transcripts.gtf file contains the structure of the assembled transfrags in GTF format. The other two files contain the expression estimates for the isoforms and genes, respectively.
We will first take a look at the GTF file by viewing it directly with the command
Even if you are familiar with the GTF format, it is nearly impossible to interpret the transcript assemblies by looking at it in this way. Instead, we can view the GTF in a genome browser called IGV, which you can launch directly from the website by clicking here.
Once opened, you should choose the "Human hg19" reference and load the transcripts.gtf file from the File menu. You can now view what Cufflinks has assembled in comparison to the known reference.
You now need to copy the assembled annotation from the server to your local machine
You can also view the Cufflinks abundance estimates by opening the fpkm_tracking files. A simple way to view the file is using the command
The FPKM column shows the estimated abundance in Fragments Per Kilobase per Million sequenced.
Note that if you were not interested in assembling novel isoforms, you could have provided a reference annotation in GFF or GTF format using the -G option to just calculate abundance estimates.
Section 4: Improved Results with Bias Correction
There are several common biases in RNA-Seq experiments stemming primarily from the library preparation. One involves sequence-specificity near the ends of reads, and another involves uneven coverage based on relative position along the transcript (i.e., 5' or 3' bias). Cufflinks optionally corrects for both of these biases when provided with the genome reference. For example, we can repeat the above experiment with bias correction using this command:
It is highly recommended that you take advantage of bias correction, as it has been shown to greatly improve results in multiple studies (here and here).
Section 5: Analysis with Reference Genome and Multiple Samples
It is likely that you will have several samples (and multiple replicates per sample) that you will wish to compare to calculate differential expression. The Cufflinks package includes a program called Cuffdiff that will handle this situation for you!
If you wish to find novel isoforms, you should run Cufflinks on each sample independently as shown above. You will then use the Cuffmerge utility to combine the assemblies (along with an optional reference annotation) into a single GTF. See the Cufflinks Tutorial for more details on how to do this.
You can now run Cuffdiff with the following command:
This command will estimate the bias-corrected abundances of the individual samples along with a between-replicate variance in order to find cases of significant differential expression. The method is similar to the count-based statistical methods described in an earlier section, but it also takes uncertainties in isoform deconvolution. Furthermore, Cuffdiff will determine differential splicing, coding (CDS), and promoter use using the Jenson-Shannon divergence.
The output to this command includes fpkm_tracking files similar to those produced by Cufflinks as well as the results of tests for differential expression at both the gene and isoform level, differential splicing, differential coding (CDS), and differential promoter use. Each test file contains a p-value, q-value, and FDR-based significance decision (YES or NO). Full details on the output can be found in the Cufflinks Manual.
Section 6: Example Analysis without a Reference Genome
Cufflinks is only usable when a reference genome exists. If your organism of interest does not have a reference genome, we recommend you use eXpress.
For this tutorial, we will assume that you do know the transcripts sequences. However, if you also lack this data, you can assemble transcript sequences from your RNA-Seq reads de novo using a tool such as Trinity, Trans-ABySS, or Oases.
We will base our example analysis on the data sample dataset included with eXpress, which is in the sample_data folder of the main express directory. We will jump to the eXpress Tutorial page for this example.
Note that eXpress uses a nearly identical model to Cufflinks, including bias correction. There is no differential expression method built to handle this type of output as of yet, but you could use rounded values for the estimated counts (post_counts_mean) in the results.xprs file as input for one of the count-based methods described in a previous section.
Section 7: References and Further Reading
References:
Tools of Interest: