Section 1: Simple Analysis with Reference Genome and Single Sample
The first step is to map the reads using TopHat. Since we have already done this in a previous section, we will work with the file ~/tophat_out/accepted_hits.bam.
We will now use Cufflinks to assemble the reads into transfrags using the command below:
This command will produce three new files: transcripts.gtf, isoforms.fpkm_tracking, and genes.fpkm_tracking in the cuff_out folder. The transcripts.gtf file contains the structure of the assembled transfrags in GTF format. The other two files contain the expression estimates for the isoforms and genes, respectively.
We will first take a look at the GTF file by viewing it directly with the command
$ less cuff_out/transcripts.gtf
You can also view the Cufflinks abundance estimates by opening the fpkm_tracking files. A simple way to view the file is using the command
$ less cuff_out/isoforms.fpkm_tracking
The FPKM column shows the estimated abundance in Fragments Per Kilobase per Million sequenced.
Note that if you were not interested in assembling novel isoforms, you could have provided a reference annotation in GFF or GTF format using the -G option to just calculate abundance estimates.
#!/bin/bash# Change these variables to match your experimentEXP_NAME=exp1
DATA_DIR=~/exp1
OUT_DIR=$EXP_NAMEBAM_FILE=$EXP_NAME/accepted_hits.bam
CMD="cufflinks -o $OUT_DIR$BAM_FILE"echo"Running $CMD"# Show output on the screen AND save in $EXP_NAME.cuff.log$CMD2>&1|tee$EXP_NAME_cuff.log
Section 2: Cuffcompare
If you have run Cufflinks allowing for novel isoforms, Cuffcompare allows you to compared your assembled transcripts to a reference annotation.
In this output from cuffcompare, we can see from the class code label that TCONS_00000008 has class code "=" which means it matches a known transcripts. TCONS_00000009 has class code "j", which denotes a novel transcript. More information about Cuffcompare can be found in Cufflinks Tutorial.
Section 3: Analysis with Reference Genome and Multiple Samples
It is likely that you will have several samples (and multiple replicates per sample) that you will wish to compare to calculate differential expression. The Cufflinks package includes a program called Cuffdiff that will handle this situation for you.
If you wish to find novel isoforms, you should run Cufflinks on each sample independently as shown above. You will then use the Cuffmerge utility to combine the assemblies (along with an optional reference annotation) into a single GTF. To use Cuffmerge, you will need a text file (here called assembly.txt) with the names of the transcript.gtf file produced from the Cufflinks runs you want to run through Cuffdiff.
This command will estimate the bias-corrected abundances of the individual samples along with a between-replicate variance in order to find cases of significant differential expression. The method is similar to the count-based statistical methods described in an earlier section, but it also takes uncertainties in isoform deconvolution. Furthermore, Cuffdiff will determine differential splicing, coding (CDS), and promoter use using the Jenson-Shannon divergence.
The output to this command includes fpkm_tracking files similar to those produced by Cufflinks as well as the results of tests for differential expression at both the gene and isoform level, differential splicing, differential coding (CDS), and differential promoter use. Each test file contains a p-value, q-value, and FDR-based significance decision (YES or NO). Full details on the output can be found in the Cufflinks Manual.
Section 4: References and Further Reading
References:
Pachter, L (2011). Models for transcript quantification from RNA-Seq.arXiv:1104.3889v2
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btr355
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). Improving RNA-Seq expression estimates by correcting for fragment bias.Genome Biology.doi:10.1186/gb-2011-12-3-r22
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. doi:10.1038/nbt.162
The first step is to map the reads using TopHat. Since we have already done this in a previous section, we will work with the file ~/tophat_out/accepted_hits.bam.
We will now use Cufflinks to assemble the reads into transfrags using the command below:
This command will produce three new files: transcripts.gtf, isoforms.fpkm_tracking, and genes.fpkm_tracking in the cuff_out folder. The transcripts.gtf file contains the structure of the assembled transfrags in GTF format. The other two files contain the expression estimates for the isoforms and genes, respectively.
We will first take a look at the GTF file by viewing it directly with the command
You can also view the Cufflinks abundance estimates by opening the fpkm_tracking files. A simple way to view the file is using the command
The FPKM column shows the estimated abundance in Fragments Per Kilobase per Million sequenced.
Note that if you were not interested in assembling novel isoforms, you could have provided a reference annotation in GFF or GTF format using the -G option to just calculate abundance estimates.
Again, it is helpful to write this as a script.
Section 2: Cuffcompare
If you have run Cufflinks allowing for novel isoforms, Cuffcompare allows you to compared your assembled transcripts to a reference annotation.
To look out the output, use the less command:
In this output from cuffcompare, we can see from the class code label that TCONS_00000008 has class code "=" which means it matches a known transcripts. TCONS_00000009 has class code "j", which denotes a novel transcript. More information about Cuffcompare can be found in Cufflinks Tutorial.
Section 3: Analysis with Reference Genome and Multiple Samples
It is likely that you will have several samples (and multiple replicates per sample) that you will wish to compare to calculate differential expression. The Cufflinks package includes a program called Cuffdiff that will handle this situation for you.
If you wish to find novel isoforms, you should run Cufflinks on each sample independently as shown above. You will then use the Cuffmerge utility to combine the assemblies (along with an optional reference annotation) into a single GTF. To use Cuffmerge, you will need a text file (here called assembly.txt) with the names of the transcript.gtf file produced from the Cufflinks runs you want to run through Cuffdiff.
You can now run Cuffdiff with the following command where merged.gtf is the output from Cuffmerge:
This command will estimate the bias-corrected abundances of the individual samples along with a between-replicate variance in order to find cases of significant differential expression. The method is similar to the count-based statistical methods described in an earlier section, but it also takes uncertainties in isoform deconvolution. Furthermore, Cuffdiff will determine differential splicing, coding (CDS), and promoter use using the Jenson-Shannon divergence.
The output to this command includes fpkm_tracking files similar to those produced by Cufflinks as well as the results of tests for differential expression at both the gene and isoform level, differential splicing, differential coding (CDS), and differential promoter use. Each test file contains a p-value, q-value, and FDR-based significance decision (YES or NO). Full details on the output can be found in the Cufflinks Manual.
Section 4: References and Further Reading
References:
Tools of Interest: