Sequence+Alignment+Spring+2013

=__Section 1__: Sequence alignment overview= ====In the next few sections, we are going to cover how to use the "Tuxedo" tools, beginning with raw sequence files and finishing with differential expression analysis. A rough outline of the process is shown below.====



Almost every application that does some sort RNA-Seq analysis requires you to first align your raw reads. Raw reads are output from the base caller (e.g. BayesCall) after analyzing an intensity file generated from the image file from a sequencing experiment. Raw reads from the sequencer come in two formats:

FASTQ (most common): code @SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50 ACAGAGCAAAAGACAAGAAAACAGATGAAATTGTGGCTCTAAAGCGGCTG +SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<<9:<66<9 code

FASTA: code >SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50 ACAGAGCAAAAGACAAGAAAACAGATGAAATTGTGGCTCTAAAGCGGCTG code In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the **genome** or **transcriptome**. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section.

=__Section 2__: Building a Bowtie index=

The first step in sequence alignment is acquiring a Bowtie index of your reference(s). You will typically use the same index file to map numerous experiments.

You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself.

To acquire an index, do the following: code $ cd Downloads/ $ cat *.fa > hg19_chr1.fa code code $ mkdir -p ~/bwt_idx/hg19_chr1 $ cp hg19_chr1.fa ~/bwt_idx/hg19_chr1 code code $ bowtie2-build -f ~/bwt_idx/hg19_chr1/hg19_chr1.fa ~/bwt_idx/hg19_chr1/hg19_chr1 code
 * Check if your index is already built ( Bowtie Indexes)
 * If it does, download it and unzip to your index location
 * If it doesn't, download the genome from UCSC Genome Browser
 * On the left hand side, click "Downloads"
 * Find your model organism and click it (e.g. Human)
 * Find the build (e.g. hg19 or hg19)
 * Click "Full data set"
 * Download "chromFa.tar.gz"
 * Unzip it
 * For this tutorial, please use the file which has been downloaded from the UCSC genome browser.
 * Once you have all of your sequences, the easiest way to make an index is to pipe all of your files into a multi-FASTA file:
 * Move the file to a safe place you can remember.
 * Build the actual index:

You should now see a number of files in directory : code $ ls ~/bwt_idx/hg19_chr1/ hg19_chr1.1.ebwt hg19_chr1.2.ebwt  hg19_chr1.3.ebwt  hg19_chr1.4.ebwt hg19_chr1.fa hg19_chr1.rev.1.ebwt  hg19_chr1.rev.2.ebwt code


 * IMPORTANT** Remember to keep the original multi-FASTA file with the index . Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.

=__Section 3__: Mapping with TopHat=

How TopHat works
TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.

It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.

It also finds junctions by using the assembly of mapped reads which imply exons.

For more information, check out the TopHat papers

Choosing an annotation file
It is important to choose a relevant annotation. How reads maps all the way to how isoforms will be quantified depends on this part. If you use an annotation file from the beginning you should stick with it or start the analysis over. Alternatively you could not use an annotation at all.

If you decide to use an annotation, the most complete place to get annotations is the UCSC Genome Browser. Here is how to do it:


 * Go to @http://genome.ucsc.edu
 * Click "Table Browser" on the left hand side
 * Select the genome genome and assembly you will be working on
 * Make sure you choose the same assembly for both your genome reference and annotation
 * Select "Group: Genes and Gene Prediction Tracks"
 * Choose a track that contains the isoforms of interest. If you don't know:
 * RefSeq is rather conservative
 * Ensembl is much more complete
 * Ensure "Region: genome" is selected
 * Select "Output: GTF - gene transfer format"
 * Type a relevant "output file" name
 * e.g.: hg19_refGene.gtf
 * Click "get output"

Running a TopHat process
TopHat has many parameters that are important and relevant to each experiment. You should take care to choose your parameters wisely. This is only a very short list of the set of parameters available in the TopHat manual

First, make sure you have a Bowtie index (see Section 2).


 * ~ Parameter ||~ Short description ||~ Description ||~ Examples ||
 * < || Number of threads || How many concurrent threads will be run in Bowtie. If you have more than one processor, you should use at most the number of processors you have ||< >= 1 ||
 * < || Annotation || This annotation will supply known junctions. In future versions of TopHat it will also map directory to the transcriptome which increases sensitivity. ||<  ||
 * < ||< Mate inner distance ||< This is the expected (mean) inner distance between mate pairs. For, example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired end runs. ||< 200 ||
 * < ||< Mate standard deviation ||< The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp. This is used to discard mates that are too far apart. It is better to overestimate this than underestimate. ||< 80 ||
 * ||  ||< Only look for reads across junctions indicated in the supplied GFF or junctions file. ||   ||

Note that you will get very different results depending on how choose the mean and standard deviation. This is because TopHat discards mate pairs that are one standard deviation away from the mean.

Let's map the supplied data we have:

code $ tophat --GTF ~/annotations/hg19-refGene.gtf -r 200 --mate-std-dev 80 ~/bwt_idx/hg19_chr1/hg19_chr1 ~/exp1/chr1_2M_1.fq ~/exp1/chr1_2M_2.fq code

Once it has completed, you should see a number of files:

code $ ls accepted_hits.bam deletions.bed  insertions.bed  junctions.bed left_kept_reads.info logs  right_kept_reads.info code

It is helpful to write a script in Bash to keep track of the parameters that were used. Below is the same example in a Bash script:

code format="bash"
 * 1) !/bin/bash

EXP_NAME=exp1 GTF=~/annotations/hg19-refGene.gtf DATA_DIR=~/exp1 LEFT="chr1_2M_1.fq" RIGHT="chr1_2M_2.fq" BWT_IDX=~/bwt_idx/hg19_chr1/hg19_chr1 INNER_LEN=200 STD_DEV=80 OUT_DIR=$EXP_NAME
 * 1) Change these variables to match your experiment

CMD="tophat --GTF $GTF -r $INNER_LEN --mate-std-dev $STD_DEV -o $OUT_DIR $BWT_IDX $DATA_DIR/$LEFT $DATA_DIR/$RIGHT"

echo "Running $CMD" $CMD 2>&1 | tee $EXP_NAME.log code
 * 1) Show output on the screen AND save in $EXP_NAME.log

Once you've written your script, save it as something like, change the permissions and run it.

=__Section 4__: Understanding the output=

The standard output will be three output files in the directory you specified (using or the default  )

This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file. It will appear in the "CL:" field:

code $ samtools view -H accepted_hits.bam code

We can also look at the alignments in the BAM file. You can read more about BAM files here

code $ samtools view accepted_hits.bam | less SRR039628.2785118      97      chr1    548651  255     50M     =       1630902 1082301 CAGGCACCTGTAATCACAGCTACTTGGGAGGCTGAGGCAGCAGAATCTCT      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<8<:8::9::8:48      NM:i:2  NH:i:1 SRR039628.3262529      145     chr1    564650  255     50M     =       1229900 665300  TTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGC      7:::<9:<::<:<:<<:<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<      NM:i:1  NH:i:1 SRR039628.6640390      145     chr1    564650  255     50M     =       1575683 1011083 TTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGC      4//:<:<</:<<::<<4<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<      NM:i:1  NH:i:1

code

The file is the file that you should later use for abundance estimation.

The file contains a list of the junctions from the annotation and the novel junctions that TopHat found.