Almost every application that does some sort RNA-Seq analysis requires you to first align your raw reads. Raw reads are output from the base caller (e.g. BayesCall) after analyzing an intensity file generated from the image file from a sequencing experiment. Raw reads from the sequencer come in two formats:
You can find complete datasets from the Short Read Archive. After you find a dataset of interest, you can convert the SRA files into FASTQ files by installing the SRA Toolkit and using the tool fastq-dump.
In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the genome or transcriptome. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section.
Depending on how you will be doing your analysis (transcriptome or genome) you will do your mapping differently. Below are the different pipelines:
Genome pipeline that we will be covering right now:
Transcriptome pipeline:
Section 2: Building a Bowtie index
The first step in sequence alignment is acquiring a Bowtie index of your reference(s). You will typically use the same index file to map numerous experiments.
You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself.
Find your model organism and click it (e.g. Human)
Find the build (e.g. hg19 or hg19)
Click "Full data set"
Download "chromFa.tar.gz"
Unzip it tar -xvf chromFa.tar.gz
For this tutorial, please use the file /global/courses/rnaseqworkshop/web/references/hg19_chr1.fa which has been downloaded from the UCSC genome browser.
Once you have all of your sequences, the easiest way to make an index is to pipe all of your files into a multi-FASTA file:
$ cd Downloads/
$ cat *.fa > hg19_chr1.fa
Move the file to a safe place you can remember. I like to use ~/bwt_idx (see the note at the end of this section).
You should now see a number of files in directory ~/bwt_idx/hg19_chr1/hg19_chr1:
$ ls ~/bwt_idx/hg19_chr1/
hg19_chr1.1.ebwt hg19_chr1.2.ebwt hg19_chr1.3.ebwt hg19_chr1.4.ebwt
hg19_chr1.fa hg19_chr1.rev.1.ebwt hg19_chr1.rev.2.ebwt
IMPORTANT Remember to keep the original multi-FASTA file with the index (hg19_chr1.fa). Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.
Note: Why the redundant name? Personally I like doing this because when you start collecting numerous indexes your folder becomes huge (each index makes several files). I also like including a README file if the name isn't completely obvious or I have some extra notes I want to include about the index.
Section 3: Mapping with TopHat
How TopHat works
TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.
It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.
It also finds junctions by using the assembly of mapped reads which imply exons.
For more information, check out the TopHat paper [1]
Choosing an annotation file
It is important to choose a relevant annotation. How reads maps all the way to how isoforms will be quantified depends on this part. If you use an annotation file from the beginning you should stick with it or start the analysis over. Alternatively you could not use an annotation at all.
If you decide to use an annotation, the most complete place to get annotations is the UCSC Genome Browser. Here is how to do it:
Select the genome genome and assembly you will be working on
Make sure you choose the same assembly for both your genome reference and annotation
Select "Group: Genes and Gene Prediction Tracks"
Choose a track that contains the isoforms of interest. If you don't know:
RefSeq is rather conservative
Ensembl is much more complete
Ensure "Region: genome" is selected
Select "Output: GTF - gene transfer format"
Type a relevant "output file" name
e.g.: hg19_refGene.gtf
Click "get output"
Running a TopHat process
TopHat has many parameters that are important and relevant to each experiment. You should take care to choose your parameters wisely. This is only a very short list of the set of parameters available in the TopHat manual
First, make sure you have a Bowtie index (see Section 2).
Parameter
Short description
Description
Examples
-p
Number of threads
How many concurrent threads will be run in Bowtie. If you have more than one processor, you should use at most the number of processors you have
>= 1
-G/--GTF
Annotation
This annotation will supply known junctions. In future versions of TopHat it will also map directory to the transcriptome which increases sensitivity.
~/annotations/hg19-refGene.gtf
-r/--mate-inner-dist <int>
Mate inner distance
This is the expected (mean) inner distance between mate pairs. For, example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired end runs.
200
--mate-std-dev <int>
Mate standard deviation
The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp. This is used to discard mates that are too far apart. It is better to overestimate this than underestimate.
80
Note that you will get very different results depending on how choose the mean and standard deviation. This is because TopHat discards mate pairs that are one standard deviation away from the mean.
Once it has completed, you should see a number of files:
$ ls
accepted_hits.bam deletions.bed insertions.bed junctions.bed
left_kept_reads.info logs right_kept_reads.info
I like to write a script in Bash to keep track of the parameters I used. Below is the same example in a Bash script:
#!/bin/bash# Change these variables to match your experimentEXP_NAME=exp1
GTF=~/annotations/hg19-refGene.gtf
DATA_DIR=~/exp1
LEFT="chr1_2M_1.fq"RIGHT="chr1_2M_2.fq"BWT_IDX=~/bwt_idx/hg19_chr1/hg19_chr1
INNER_LEN=200STD_DEV=80OUT_DIR=$EXP_NAMECMD="tophat --GTF $GTF -r $INNER_LEN --mate-std-dev $STD_DEV -o $OUT_DIR$BWT_IDX$DATA_DIR/$LEFT$DATA_DIR/$RIGHT"echo"Running $CMD"# Show output on the screen AND save in $EXP_NAME.log$CMD2>&1|tee$EXP_NAME.log
Once you've written your script, save it as something like exp1.sh, change the permissions (chmod 755 exp1.sh) and run it.
Section 4: Understanding the output
The standard output will be three output files in the directory you specified (using -o or the default tophat_out/ )
accepted_hits.bam
junctioned.bed
accepted_hits.bam
This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file. It will appear in the "CL:" field:
$ samtools view -H accepted_hits.bam
The accepted_hits.bam file is the file that you should later use for abundance estimation.
junctions.bed
The junctioned.bed file contains a list of the junctions from the annotation and the novel junctions that TopHat found.
Section 5: Samtools
After the reads are aligned, a common question is to get summaries of the data, such as how many reads overlap each exon or gene -- we will talk more about what are good estimates of gene expression (especially if there is alternative splicing in the organism). The first set of tools that are used for bam files are called 'samtools'. With these you can quickly access parts of the data. We will give just a couple of examples here.
First you have to create an index for the bam file, so that the reads can be quickly accessed. It will create an '*.bai' file.
samtools index accepted_hits.bam
We can get some basic alignment summaries:
$ samtools flagstat accepted_hits.bam
138154 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
138154 + 0 mapped (100.00%:nan%)
138154 + 0 paired in sequencing
69211 + 0 read1
68943 + 0 read2
88314 + 0 properly paired (63.92%:nan%)
125798 + 0 with itself and mate mapped
12356 + 0 singletons (8.94%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
We can get basic summaries per chromosome (reference sequence name)
You can also have it count the alignments, which is just basically a line count of all the alignments listed (so mate pairs count separately if they are listed separately in the bam file); same with multiple alignments. Furthermore, if the intron gap of the alignment (the 'N') overlaps the region, that will be counted too.
The 'b' options tells samtools that the output should be in bam format. The -f is the filter (0x2) corresponds to proper pairs. You can actually do this all in one line
To get specific types of summaries, many people just write programs using samtools. bedTools (http://code.google.com/p/bedtools/) provides a lot of summarization tools as well as other tools for comparing overlap of annotation, etc. They can work with BED files, but also GTF annotation files and BAM files. We are not going to discuss how to get the right annotation file for the summaries you want, but bedTools also have a lot tools for this. All of these tools can also be used between annotations, for example to find intersections of exons, etc.
We will work with the '.gtf' that we used to align the files (its important that the annotation for the alignment match the annotation for the summaries, in terms of the name of the chromosome, the genome build, etc.). Since we are only working with chr1 reads, we can limit ourselves to these to save processing time.
Notice that the regions are by exon, and they overlap and are repeated. This is because what is an exon region is different depending on what transcript it is in. bedTools finds the overlap for a region, so each of these regions will have their own overlap (and so a read might get counted twice.) So this may not be the perfect annotation depending on what kind of summaries you want.
For simplicity later on, we will also make a file for just the information for a single gene.
Then we can run coverageBed to find the coverage for each region in the gtf file.
$ coverageBed -split -abam accepted_hits.bam -b hg19-chr1.gtf > chr1Coverage.bed
Error: malformed GFF entry at line 15525. Start was greater than end. Exiting.
There is some error in the gtf file (we are not sure why) where the starts are less than the ends for a few entries, so we will work with a version where these have been taken out (code not shown). For simplicity later on, we will also make a file for just the information for a single gene.
The -split option is important for RNA-Seq: it says do not count the overlap if the overlap is in the intron region (in contrast the summary in samtools appears to counts this)
The old annotation information is there, as well as 4 additional columns:
1) The number of features in A (the bam file) that overlapped (by at least one base pair) the B interval (the exon).
2) The number of bases in B that had non-zero coverage from features in A.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage from features in A.
To save space, we'll look at just the columns of interest
You can do other things by combining the filtering of samtools with the coverage. For example, we can use the flag that tells us whether an alignment is a proper-pair, to get only overlaps of proper-pairs, and compare the counts of the two:
We can also get summaries of only those reads that fall entirely within the exon, by using another bed tool 'intersectBed' and pipe the results to coverageBed.
We can do other things, like only do alignments if they match the strand ('-s' option), which would be appropriate for a stranded protocol. Another that might be of interest, is the coverage per base within each exon. This takes much longer, so we'll do only the single gene we made before:
We might want to get gene information. We're not going to go through any details, but there are different ways this could be done. You could get all reads that overlap the entire region of the gene, in which case you would want an annotation with each gene being an entry and follow the previous commands with this type of gtf.
Authors:
This section was written by Harold Pimentel and Elizabeth Perdom. Please feel free to contact Harold if you have any questions.
Section 1: Sequence alignment overview
Below is a typical Illumina pipeline:Almost every application that does some sort RNA-Seq analysis requires you to first align your raw reads. Raw reads are output from the base caller (e.g. BayesCall) after analyzing an intensity file generated from the image file from a sequencing experiment. Raw reads from the sequencer come in two formats:
FASTQ (most common):
FASTA:
You can find complete datasets from the Short Read Archive. After you find a dataset of interest, you can convert the SRA files into FASTQ files by installing the SRA Toolkit and using the tool fastq-dump.
In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the genome or transcriptome. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section.
Depending on how you will be doing your analysis (transcriptome or genome) you will do your mapping differently. Below are the different pipelines:
Genome pipeline that we will be covering right now:
Transcriptome pipeline:
Section 2: Building a Bowtie index
The first step in sequence alignment is acquiring a Bowtie index of your reference(s). You will typically use the same index file to map numerous experiments.
You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself.
To acquire an index, do the following:
You should now see a number of files in directory ~/bwt_idx/hg19_chr1/hg19_chr1:
IMPORTANT Remember to keep the original multi-FASTA file with the index (hg19_chr1.fa). Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.
Section 3: Mapping with TopHat
How TopHat works
TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.
It also finds junctions by using the assembly of mapped reads which imply exons.
For more information, check out the TopHat paper [1]
Choosing an annotation file
It is important to choose a relevant annotation. How reads maps all the way to how isoforms will be quantified depends on this part. If you use an annotation file from the beginning you should stick with it or start the analysis over. Alternatively you could not use an annotation at all.If you decide to use an annotation, the most complete place to get annotations is the UCSC Genome Browser. Here is how to do it:
Running a TopHat process
TopHat has many parameters that are important and relevant to each experiment. You should take care to choose your parameters wisely. This is only a very short list of the set of parameters available in the TopHat manual
First, make sure you have a Bowtie index (see Section 2).
Note that you will get very different results depending on how choose the mean and standard deviation. This is because TopHat discards mate pairs that are one standard deviation away from the mean.
Let's map the supplied data we have:
Once it has completed, you should see a number of files:
I like to write a script in Bash to keep track of the parameters I used. Below is the same example in a Bash script:
Once you've written your script, save it as something like exp1.sh, change the permissions (chmod 755 exp1.sh) and run it.
Section 4: Understanding the output
The standard output will be three output files in the directory you specified (using -o or the default tophat_out/ )
accepted_hits.bam
This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file. It will appear in the "CL:" field:The accepted_hits.bam file is the file that you should later use for abundance estimation.
junctions.bed
The junctioned.bed file contains a list of the junctions from the annotation and the novel junctions that TopHat found.
Section 5: Samtools
After the reads are aligned, a common question is to get summaries of the data, such as how many reads overlap each exon or gene -- we will talk more about what are good estimates of gene expression (especially if there is alternative splicing in the organism). The first set of tools that are used for bam files are called 'samtools'. With these you can quickly access parts of the data. We will give just a couple of examples here.
First you have to create an index for the bam file, so that the reads can be quickly accessed. It will create an '*.bai' file.
We can get some basic alignment summaries:
We can get basic summaries per chromosome (reference sequence name)
Each line consists of reference sequence name, sequence length, # mapped reads and # unmapped reads.
We can get the reads that overlap a region, in this case corresponding to a gene "NM_033487", using 'view'. We'll save it to another file
You can also have it count the alignments, which is just basically a line count of all the alignments listed (so mate pairs count separately if they are listed separately in the bam file); same with multiple alignments. Furthermore, if the intron gap of the alignment (the 'N') overlaps the region, that will be counted too.
You can also use samtools to filter out reads based on some flags. A common one is to only get reads that align as a proper pair.
The 'b' options tells samtools that the output should be in bam format. The -f is the filter (0x2) corresponds to proper pairs. You can actually do this all in one line
Section 6: bedTools
To get specific types of summaries, many people just write programs using samtools. bedTools (http://code.google.com/p/bedtools/) provides a lot of summarization tools as well as other tools for comparing overlap of annotation, etc. They can work with BED files, but also GTF annotation files and BAM files. We are not going to discuss how to get the right annotation file for the summaries you want, but bedTools also have a lot tools for this. All of these tools can also be used between annotations, for example to find intersections of exons, etc.
We will work with the '.gtf' that we used to align the files (its important that the annotation for the alignment match the annotation for the summaries, in terms of the name of the chromosome, the genome build, etc.). Since we are only working with chr1 reads, we can limit ourselves to these to save processing time.
Notice that the regions are by exon, and they overlap and are repeated. This is because what is an exon region is different depending on what transcript it is in. bedTools finds the overlap for a region, so each of these regions will have their own overlap (and so a read might get counted twice.) So this may not be the perfect annotation depending on what kind of summaries you want.
For simplicity later on, we will also make a file for just the information for a single gene.
Then we can run coverageBed to find the coverage for each region in the gtf file.
There is some error in the gtf file (we are not sure why) where the starts are less than the ends for a few entries, so we will work with a version where these have been taken out (code not shown). For simplicity later on, we will also make a file for just the information for a single gene.
The -split option is important for RNA-Seq: it says do not count the overlap if the overlap is in the intron region (in contrast the summary in samtools appears to counts this)
The old annotation information is there, as well as 4 additional columns:
1) The number of features in A (the bam file) that overlapped (by at least one base pair) the B interval (the exon).
2) The number of bases in B that had non-zero coverage from features in A.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage from features in A.
To save space, we'll look at just the columns of interest
You can do other things by combining the filtering of samtools with the coverage. For example, we can use the flag that tells us whether an alignment is a proper-pair, to get only overlaps of proper-pairs, and compare the counts of the two:
We can also get summaries of only those reads that fall entirely within the exon, by using another bed tool 'intersectBed' and pipe the results to coverageBed.
We can do other things, like only do alignments if they match the strand ('-s' option), which would be appropriate for a stranded protocol. Another that might be of interest, is the coverage per base within each exon. This takes much longer, so we'll do only the single gene we made before:
We might want to get gene information. We're not going to go through any details, but there are different ways this could be done. You could get all reads that overlap the entire region of the gene, in which case you would want an annotation with each gene being an entry and follow the previous commands with this type of gtf.
Authors:
This section was written by Harold Pimentel and Elizabeth Perdom. Please feel free to contact Harold if you have any questions.References:
BowtieTopHat
Other tools
GMAP / GSNAP