Section 1: Sequence alignment overview

In the next few sections, we are going to cover how to use the "Tuxedo" tools, beginning with raw sequence files and finishing with differential expression analysis. A rough outline of the process is shown below.


Slide1.png

Almost every application that does some sort RNA-Seq analysis requires you to first align your raw reads. Raw reads are output from the base caller (e.g. BayesCall) after analyzing an intensity file generated from the image file from a sequencing experiment. Raw reads from the sequencer come in two formats:

FASTQ (most common):
@SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50
ACAGAGCAAAAGACAAGAAAACAGATGAAATTGTGGCTCTAAAGCGGCTG
+SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<<9:<66<9

FASTA:
>SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50
ACAGAGCAAAAGACAAGAAAACAGATGAAATTGTGGCTCTAAAGCGGCTG
In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the genome or transcriptome. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section.


Section 2: Building a Bowtie index


The first step in sequence alignment is acquiring a Bowtie index of your reference(s). You will typically use the same index file to map numerous experiments.

You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself.

To acquire an index, do the following:
  • Check if your index is already built ( Bowtie Indexes)
    • If it does, download it and unzip to your index location
    • If it doesn't, download the genome from UCSC Genome Browser
      • On the left hand side, click "Downloads"
      • Find your model organism and click it (e.g. Human)
      • Find the build (e.g. hg19 or hg19)
      • Click "Full data set"
      • Download "chromFa.tar.gz"
      • Unzip it tar -xvf chromFa.tar.gz
  • For this tutorial, please use the file /global/courses/rnaseqworkshop/web/references/hg19_chr1.fa which has been downloaded from the UCSC genome browser.
  • Once you have all of your sequences, the easiest way to make an index is to pipe all of your files into a multi-FASTA file:
$ cd Downloads/
$ cat *.fa > hg19_chr1.fa
  • Move the file to a safe place you can remember.
$ mkdir -p ~/bwt_idx/hg19_chr1
$ cp hg19_chr1.fa ~/bwt_idx/hg19_chr1
  • Build the actual index:
$ bowtie2-build -f ~/bwt_idx/hg19_chr1/hg19_chr1.fa ~/bwt_idx/hg19_chr1/hg19_chr1

You should now see a number of files in directory ~/bwt_idx/hg19_chr1/hg19_chr1:
$ ls ~/bwt_idx/hg19_chr1/
hg19_chr1.1.ebwt  hg19_chr1.2.ebwt  hg19_chr1.3.ebwt  hg19_chr1.4.ebwt
hg19_chr1.fa  hg19_chr1.rev.1.ebwt  hg19_chr1.rev.2.ebwt

IMPORTANT Remember to keep the original multi-FASTA file with the index (hg19_chr1.fa). Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.



Section 3: Mapping with TopHat


How TopHat works

TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.

It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.

It also finds junctions by using the assembly of mapped reads which imply exons.

For more information, check out the TopHat papers [1] [2]

Choosing an annotation file

It is important to choose a relevant annotation. How reads maps all the way to how isoforms will be quantified depends on this part. If you use an annotation file from the beginning you should stick with it or start the analysis over. Alternatively you could not use an annotation at all.

If you decide to use an annotation, the most complete place to get annotations is the UCSC Genome Browser. Here is how to do it:

  • Go to http://genome.ucsc.edu
  • Click "Table Browser" on the left hand side
  • Select the genome genome and assembly you will be working on
    • Make sure you choose the same assembly for both your genome reference and annotation
  • Select "Group: Genes and Gene Prediction Tracks"
  • Choose a track that contains the isoforms of interest. If you don't know:
    • RefSeq is rather conservative
    • Ensembl is much more complete
  • Ensure "Region: genome" is selected
  • Select "Output: GTF - gene transfer format"
  • Type a relevant "output file" name
    • e.g.: hg19_refGene.gtf
  • Click "get output"

Running a TopHat process


TopHat has many parameters that are important and relevant to each experiment. You should take care to choose your parameters wisely. This is only a very short list of the set of parameters available in the TopHat manual

First, make sure you have a Bowtie index (see Section 2).

Parameter
Short description
Description
Examples
-p
Number of threads
How many concurrent threads will be run in Bowtie. If you have more than one processor, you should use at most the number of processors you have
>= 1
-G/--GTF
Annotation
This annotation will supply known junctions. In future versions of TopHat it will also map directory to the transcriptome which increases sensitivity.
~/annotations/hg19-refGene.gtf
-r/--mate-inner-dist <int>
Mate inner distance
This is the expected (mean) inner distance between mate pairs. For, example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired end runs.
200
--mate-std-dev <int>
Mate standard deviation
The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp. This is used to discard mates that are too far apart. It is better to overestimate this than underestimate.
80
--no-novel-juncs

Only look for reads across junctions indicated in the supplied GFF or junctions file.


Note that you will get very different results depending on how choose the mean and standard deviation. This is because TopHat discards mate pairs that are one standard deviation away from the mean.

Let's map the supplied data we have:

$ tophat --GTF ~/annotations/hg19-refGene.gtf -r 200 --mate-std-dev 80 ~/bwt_idx/hg19_chr1/hg19_chr1 ~/exp1/chr1_2M_1.fq ~/exp1/chr1_2M_2.fq

Once it has completed, you should see a number of files:

$ ls
accepted_hits.bam  deletions.bed  insertions.bed  junctions.bed
left_kept_reads.info  logs  right_kept_reads.info

It is helpful to write a script in Bash to keep track of the parameters that were used. Below is the same example in a Bash script:

#!/bin/bash
 
# Change these variables to match your experiment
EXP_NAME=exp1
GTF=~/annotations/hg19-refGene.gtf
DATA_DIR=~/exp1
LEFT="chr1_2M_1.fq"
RIGHT="chr1_2M_2.fq"
BWT_IDX=~/bwt_idx/hg19_chr1/hg19_chr1
INNER_LEN=200
STD_DEV=80
OUT_DIR=$EXP_NAME
 
CMD="tophat --GTF $GTF -r $INNER_LEN --mate-std-dev $STD_DEV -o $OUT_DIR $BWT_IDX $DATA_DIR/$LEFT $DATA_DIR/$RIGHT"
 
echo "Running $CMD"
# Show output on the screen AND save in $EXP_NAME.log
$CMD 2>&1 | tee $EXP_NAME.log

Once you've written your script, save it as something like exp1.sh, change the permissions (chmod 755 exp1.sh) and run it.



Section 4: Understanding the output


The standard output will be three output files in the directory you specified (using -o or the default tophat_out/ )
  • accepted_hits.bam
  • junctioned.bed

accepted_hits.bam

This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file. It will appear in the "CL:" field:

$ samtools view -H accepted_hits.bam

We can also look at the alignments in the BAM file. You can read more about BAM files here

$ samtools view accepted_hits.bam | less
SRR039628.2785118       97      chr1    548651  255     50M     =       1630902 1082301 CAGGCACCTGTAATCACAGCTACTTGGGAGGCTGAGGCAGCAGAATCTCT      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<8<:8::9::8:48      NM:i:2  NH:i:1
SRR039628.3262529       145     chr1    564650  255     50M     =       1229900 665300  TTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGC      7:::<9:<::<:<:<<:<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<      NM:i:1  NH:i:1
SRR039628.6640390       145     chr1    564650  255     50M     =       1575683 1011083 TTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGC      4//:<:<</:<<::<<4<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<      NM:i:1  NH:i:1
 

The accepted_hits.bam file is the file that you should later use for abundance estimation.

junctions.bed


The junctioned.bed file contains a list of the junctions from the annotation and the novel junctions that TopHat found.
  1. ^ Trapnell C, Pachter L, Salzberg SL (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009 May 1;25(9):1105-11. Epub 2009 Mar 16. dx.doi.org/10.1093/bioinformatics/btp120
  2. ^ Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks Nature Protocol Epub 2012 Mar 1; 7(3):562-578. http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html