Section 1: Sequence alignment overview

Below is a typical Illumina pipeline:

illumina_pipeline_2.png

Almost every application that does some sort RNA-Seq analysis requires you to first align your raw reads. Raw reads are output from the base caller (e.g. BayesCall) after analyzing an intensity file generated from the image file from a sequencing experiment. Raw reads from the sequencer come in two formats:

FASTQ (most common):
@SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50
ACAGAGCAAAAGACAAGAAAACAGATGAAATTGTGGCTCTAAAGCGGCTG
+SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<<9:<66<9

FASTA:
>SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50
ACAGAGCAAAAGACAAGAAAACAGATGAAATTGTGGCTCTAAAGCGGCTG

You can find complete datasets from the Short Read Archive. After you find a dataset of interest, you can convert the SRA files into FASTQ files by installing the SRA Toolkit and using the tool fastq-dump.

In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the genome or transcriptome. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section.

Depending on how you will be doing your analysis (transcriptome or genome) you will do your mapping differently. Below are the different pipelines:

Genome pipeline that we will be covering right now:
genome.png

Transcriptome pipeline:
transcriptome_pipeline.png


Section 2: Building a Bowtie index


The first step in sequence alignment is acquiring a Bowtie index of your reference(s). You will typically use the same index file to map numerous experiments.

You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself.

To acquire an index, do the following:
  • Check if your index is already built ( Bowtie Indexes)
    • If it does, download it and unzip to your index location
    • If it doesn't, download the genome from UCSC Genome Browser
      • On the left hand side, click "Downloads"
      • Find your model organism and click it (e.g. Human)
      • Find the build (e.g. hg19 or hg19)
      • Click "Full data set"
      • Download "chromFa.tar.gz"
      • Unzip it tar -xvf chromFa.tar.gz
  • For this tutorial, please use the file /global/courses/rnaseqworkshop/web/references/hg19_chr1.fa which has been downloaded from the UCSC genome browser.
  • Once you have all of your sequences, the easiest way to make an index is to pipe all of your files into a multi-FASTA file:
$ cd Downloads/
$ cat *.fa > hg19_chr1.fa
  • Move the file to a safe place you can remember. I like to use ~/bwt_idx (see the note at the end of this section).
$ mkdir -p ~/bwt_idx/hg19_chr1
$ cp hg19_chr1.fa ~/bwt_idx/hg19_chr1
  • Build the actual index:
$ bowtie-build -f ~/bwt_idx/hg19_chr1/hg19_chr1.fa ~/bwt_idx/hg19_chr1/hg19_chr1

You should now see a number of files in directory ~/bwt_idx/hg19_chr1/hg19_chr1:

$ ls ~/bwt_idx/hg19_chr1/
hg19_chr1.1.ebwt  hg19_chr1.2.ebwt  hg19_chr1.3.ebwt  hg19_chr1.4.ebwt
hg19_chr1.fa  hg19_chr1.rev.1.ebwt  hg19_chr1.rev.2.ebwt

IMPORTANT Remember to keep the original multi-FASTA file with the index (hg19_chr1.fa). Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.

  • Note: Why the redundant name? Personally I like doing this because when you start collecting numerous indexes your folder becomes huge (each index makes several files). I also like including a README file if the name isn't completely obvious or I have some extra notes I want to include about the index.



Section 3: Mapping with TopHat


How TopHat works

TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.

It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.

It also finds junctions by using the assembly of mapped reads which imply exons.

For more information, check out the TopHat paper [1]

Choosing an annotation file

It is important to choose a relevant annotation. How reads maps all the way to how isoforms will be quantified depends on this part. If you use an annotation file from the beginning you should stick with it or start the analysis over. Alternatively you could not use an annotation at all.

If you decide to use an annotation, the most complete place to get annotations is the UCSC Genome Browser. Here is how to do it:

  • Go to http://genome.ucsc.edu
  • Click "Table Browser" on the left hand side
  • Select the genome genome and assembly you will be working on
    • Make sure you choose the same assembly for both your genome reference and annotation
  • Select "Group: Genes and Gene Prediction Tracks"
  • Choose a track that contains the isoforms of interest. If you don't know:
    • RefSeq is rather conservative
    • Ensembl is much more complete
  • Ensure "Region: genome" is selected
  • Select "Output: GTF - gene transfer format"
  • Type a relevant "output file" name
    • e.g.: hg19_refGene.gtf
  • Click "get output"

Running a TopHat process


TopHat has many parameters that are important and relevant to each experiment. You should take care to choose your parameters wisely. This is only a very short list of the set of parameters available in the TopHat manual

First, make sure you have a Bowtie index (see Section 2).

Parameter
Short description
Description
Examples
-p
Number of threads
How many concurrent threads will be run in Bowtie. If you have more than one processor, you should use at most the number of processors you have
>= 1
-G/--GTF
Annotation
This annotation will supply known junctions. In future versions of TopHat it will also map directory to the transcriptome which increases sensitivity.
~/annotations/hg19-refGene.gtf
-r/--mate-inner-dist <int>
Mate inner distance
This is the expected (mean) inner distance between mate pairs. For, example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired end runs.
200
--mate-std-dev <int>
Mate standard deviation
The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp. This is used to discard mates that are too far apart. It is better to overestimate this than underestimate.
80

Note that you will get very different results depending on how choose the mean and standard deviation. This is because TopHat discards mate pairs that are one standard deviation away from the mean.

Let's map the supplied data we have:

$ tophat --GTF ~/annotations/hg19-refGene.gtf -r 200 --mate-std-dev 80 ~/bwt_idx/hg19_chr1/hg19_chr1 chr1_2M_1.fq chr1_2M_2.fq

Once it has completed, you should see a number of files:

$ ls
accepted_hits.bam  deletions.bed  insertions.bed  junctions.bed
left_kept_reads.info  logs  right_kept_reads.info

I like to write a script in Bash to keep track of the parameters I used. Below is the same example in a Bash script:

#!/bin/bash
 
# Change these variables to match your experiment
EXP_NAME=exp1
GTF=~/annotations/hg19-refGene.gtf
DATA_DIR=~/exp1
LEFT="chr1_2M_1.fq"
RIGHT="chr1_2M_2.fq"
BWT_IDX=~/bwt_idx/hg19_chr1/hg19_chr1
INNER_LEN=200
STD_DEV=80
OUT_DIR=$EXP_NAME
 
CMD="tophat --GTF $GTF -r $INNER_LEN --mate-std-dev $STD_DEV -o $OUT_DIR $BWT_IDX $DATA_DIR/$LEFT $DATA_DIR/$RIGHT"
 
echo "Running $CMD"
# Show output on the screen AND save in $EXP_NAME.log
$CMD 2>&1 | tee $EXP_NAME.log

Once you've written your script, save it as something like exp1.sh, change the permissions (chmod 755 exp1.sh) and run it.



Section 4: Understanding the output


The standard output will be three output files in the directory you specified (using -o or the default tophat_out/ )
  • accepted_hits.bam
  • junctioned.bed

accepted_hits.bam

This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file. It will appear in the "CL:" field:

$ samtools view -H accepted_hits.bam

The accepted_hits.bam file is the file that you should later use for abundance estimation.

junctions.bed


The junctioned.bed file contains a list of the junctions from the annotation and the novel junctions that TopHat found.

Section 5: Samtools


After the reads are aligned, a common question is to get summaries of the data, such as how many reads overlap each exon or gene -- we will talk more about what are good estimates of gene expression (especially if there is alternative splicing in the organism). The first set of tools that are used for bam files are called 'samtools'. With these you can quickly access parts of the data. We will give just a couple of examples here.

First you have to create an index for the bam file, so that the reads can be quickly accessed. It will create an '*.bai' file.
samtools index accepted_hits.bam

We can get some basic alignment summaries:


$ samtools flagstat accepted_hits.bam
 
138154 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
138154 + 0 mapped (100.00%:nan%)
138154 + 0 paired in sequencing
69211 + 0 read1
68943 + 0 read2
88314 + 0 properly paired (63.92%:nan%)
125798 + 0 with itself and mate mapped
12356 + 0 singletons (8.94%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
We can get basic summaries per chromosome (reference sequence name)
samtools idxstats accepted_hits.bam
chr1 249250621 138154 0
* 0 0 0
Each line consists of reference sequence name, sequence length, # mapped reads and # unmapped reads.

We can get the reads that overlap a region, in this case corresponding to a gene "NM_033487", using 'view'. We'll save it to another file

$ samtools view accepted_hits.bam chr1:1571100-1655775 > geneOverlap.sam
$ head geneOverlap.sam
SRR039630.576251    83    chr1    1532846    255    25M289602N25M    =    1822308    289512    CCCCCCCCGCCGCCCCCCCCCCCGCCTCCGTCCGCCCCTCAGACGCCTCC    3.394998&99&:4..9&<8"<<&<<"<<4"<<<<<<<8<<<<<<<<<<<    NM:i:2    XS:A:-    NH:i:1
SRR039630.6482376    83    chr1    1532863    255    8M289602N42M    =    1822298    289485    CCCCCCCCCTCCGTCCGCCCCTCAGACGCCTCCAGCCATCGGGATGGGCG    &94&88&:9":8&&;<8<<<<;<<<<<<<<;<<<<<<<<<<<<<<<<<<<    NM:i:1    XS:A:-    NH:i:1
SRR039628.4097271    355    chr1    1571051    3    50M    =    1571302    301    ATCGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGGGGTCTC    <<<<<<<<<<<<<<<<<<<<<<;<<<<<-<<<<<6:<86<8.8-:+<28:    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0
SRR039628.5629287    99    chr1    1571051    3    50M    =    1571302    301    ATCGCAGCTCCAGCCGAAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTC    ;;;7;;4;;;;;;.;;(8;;;0;8;:89;+5492777.46864".77478    NM:i:1    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0
SRR039632.632305    163    chr1    1571051    3    50M    =    1571299    298    ATCGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGGCCCGGCTGAGTTCTT    <<<<<<<<<<<<<<<<<<<5;<<5+9;5994544+45/54/54+4'4+4"    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0
SRR039628.1159782    403    chr1    1571052    3    50M    =    1570899    -203    TCGCAGCCCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCC    +;5;5"<'<:7'<:9<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<    NM:i:1    NH:i:2    CC:Z:=    CP:i:1634272    HI:i:0
SRR039628.639841    419    chr1    1571053    3    50M    =    1571299    296    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<<<<<<<<<<<<<<<<<*<<<<<<:<<53<<;:+39/7993939;    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0
SRR039628.3362328    345    chr1    1571053    3    50M    *    0    0    CCCAGCCCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <*<''<'<<'8<<:<<<;<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0
SRR039630.1486911    419    chr1    1571053    3    50M    =    1571298    295    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<8<<<<<<7<6<<<<<<<<<<<:<+2<4<;444844+8+8;;;44    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0
SRR039630.3604171    355    chr1    1571053    3    50M    =    1571299    296    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<:9<<<<;:;;;9.:8;::8;;    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0
You can also have it count the alignments, which is just basically a line count of all the alignments listed (so mate pairs count separately if they are listed separately in the bam file); same with multiple alignments. Furthermore, if the intron gap of the alignment (the 'N') overlaps the region, that will be counted too.
$ samtools view -c accepted_hits.bam chr1:1571100-1655775
10603
$ wc -l geneOverlap.sam
   10603 geneOverlap.sam
You can also use samtools to filter out reads based on some flags. A common one is to only get reads that align as a proper pair.
$ samtools view -b -f 0x2 accepted_hits.bam > properPairs.bam
$ samtools index properPairs.bam
$ samtools view -c properPairs.bam chr1:1571100-1655775
7150
The 'b' options tells samtools that the output should be in bam format. The -f is the filter (0x2) corresponds to proper pairs. You can actually do this all in one line
$ samtools view -cbf 0x2 accepted_hits.bam chr1:1571100-1655775
7150

Section 6: bedTools


To get specific types of summaries, many people just write programs using samtools. bedTools (http://code.google.com/p/bedtools/) provides a lot of summarization tools as well as other tools for comparing overlap of annotation, etc. They can work with BED files, but also GTF annotation files and BAM files. We are not going to discuss how to get the right annotation file for the summaries you want, but bedTools also have a lot tools for this. All of these tools can also be used between annotations, for example to find intersections of exons, etc.

We will work with the '.gtf' that we used to align the files (its important that the annotation for the alignment match the annotation for the summaries, in terms of the name of the chromosome, the genome build, etc.). Since we are only working with chr1 reads, we can limit ourselves to these to save processing time.

$ grep -P 'chr1\t' hg19-refGene.gtf > hg19-chr1.gtf
$ head hg19-chr1.gtf
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 66999825 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67101627 67101698 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67105460 67105516 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
Notice that the regions are by exon, and they overlap and are repeated. This is because what is an exon region is different depending on what transcript it is in. bedTools finds the overlap for a region, so each of these regions will have their own overlap (and so a read might get counted twice.) So this may not be the perfect annotation depending on what kind of summaries you want.

For simplicity later on, we will also make a file for just the information for a single gene.

$ grep 'gene_id NM_033487' hg19-chr1-corrected.gtf > gene.gtf
Then we can run coverageBed to find the coverage for each region in the gtf file.
$ coverageBed -split -abam accepted_hits.bam -b hg19-chr1.gtf > chr1Coverage.bed
Error: malformed GFF entry at line 15525. Start was greater than end. Exiting.
There is some error in the gtf file (we are not sure why) where the starts are less than the ends for a few entries, so we will work with a version where these have been taken out (code not shown). For simplicity later on, we will also make a file for just the information for a single gene.

$ coverageBed -split -abam accepted_hits.bam -b hg19-chr1-corrected.gtf > chr1Coverage.bed
The -split option is important for RNA-Seq: it says do not count the overlap if the overlap is in the intron region (in contrast the summary in samtools appears to counts this)

The old annotation information is there, as well as 4 additional columns:

1) The number of features in A (the bam file) that overlapped (by at least one base pair) the B interval (the exon).
2) The number of bases in B that had non-zero coverage from features in A.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage from features in A.

To save space, we'll look at just the columns of interest
$ head -20 chr1Coverage.bed | cut -f 1,3-5,7,10-13,9
chr1    CDS    161480624    161480746    +    gene_id NM_001136219; transcript_id NM_001136219;     0    0    123    0.0000000
chr1    exon    161480624    161480746    +    gene_id NM_001136219; transcript_id NM_001136219;     0    0    123    0.0000000
chr1    CDS    161480624    161480746    +    gene_id NM_021642; transcript_id NM_021642;     0    0    123    0.0000000
chr1    exon    161480624    161480746    +    gene_id NM_021642; transcript_id NM_021642;     0    0    123    0.0000000
chr1    CDS    247463824    247464578    -    gene_id NM_032752; transcript_id NM_032752;     0    0    755    0.0000000
chr1    exon    247463623    247464578    -    gene_id NM_032752; transcript_id NM_032752;     0    0    956    0.0000000
chr1    CDS    248512077    248513010    +    gene_id NM_001001918; transcript_id NM_001001918;     0    0    934    0.0000000
chr1    exon    248512077    248513013    +    gene_id NM_001001918; transcript_id NM_001001918;     0    0    937    0.0000000
chr1    CDS    1572770    1572875    -    gene_id NM_033487; transcript_id NM_033487;     188    106    106    1.0000000
chr1    exon    1572770    1572875    -    gene_id NM_033487; transcript_id NM_033487;     188    106    106    1.0000000
chr1    exon    1310534    1310818    -    gene_id NM_017900; transcript_id NM_017900;     129    230    285    0.8070176
chr1    CDS    1572770    1572875    -    gene_id NM_033486; transcript_id NM_033486;     188    106    106    1.0000000
chr1    exon    1572770    1572875    -    gene_id NM_033486; transcript_id NM_033486;     188    106    106    1.0000000
chr1    CDS    1572770    1572875    -    gene_id NM_033493; transcript_id NM_033493;     188    106    106    1.0000000
chr1    exon    1572770    1572875    -    gene_id NM_033493; transcript_id NM_033493;     188    106    106    1.0000000
chr1    CDS    1179571    1179655    -    gene_id NM_001014980; transcript_id NM_001014980;     2    75    85    0.8823529
chr1    exon    1179571    1179655    -    gene_id NM_001014980; transcript_id NM_001014980;     2    75    85    0.8823529
chr1    CDS    1572770    1572875    -    gene_id NM_033492; transcript_id NM_033492;     188    106    106    1.0000000
chr1    exon    1572770    1572875    -    gene_id NM_033492; transcript_id NM_033492;     188    106    106    1.0000000
chr1    CDS    1572770    1572875    -    gene_id NM_033489; transcript_id NM_033489;     188    106    106    1.0000000
 
 
You can do other things by combining the filtering of samtools with the coverage. For example, we can use the flag that tells us whether an alignment is a proper-pair, to get only overlaps of proper-pairs, and compare the counts of the two:
$ samtools view -uf 0x2 accepted_hits.bam | coverageBed -split -abam stdin -b hg19-chr1-corrected.gtf > chr1Coverage-properPairs.bed
$ grep 'gene_id NM_033487' chr1Coverage.bed | head | cut -f 1,3-5,7,10-13
chr1    CDS    1572770    1572875    -    188    106    106    1.0000000
chr1    exon    1572770    1572875    -    188    106    106    1.0000000
chr1    stop_codon    1571126    1571128    -    37    3    3    1.0000000
chr1    CDS    1571129    1571218    -    67    90    90    1.0000000
chr1    exon    1571100    1571218    -    124    119    119    1.0000000
chr1    CDS    1571299    1571488    -    169    190    190    1.0000000
chr1    exon    1571299    1571488    -    169    190    190    1.0000000
chr1    CDS    1571695    1571843    -    132    149    149    1.0000000
chr1    exon    1571695    1571843    -    132    149    149    1.0000000
chr1    CDS    1572044    1572160    -    100    117    117    1.0000000
 
$ grep 'gene_id NM_033487' chr1Coverage-properPairs.bed | head | cut -f 1,3-5,7,10-13
chr1    CDS    1572770    1572875    -    31    106    106    1.0000000
chr1    exon    1572770    1572875    -    31    106    106    1.0000000
chr1    stop_codon    1571126    1571128    -    35    3    3    1.0000000
chr1    CDS    1571129    1571218    -    65    90    90    1.0000000
chr1    exon    1571100    1571218    -    115    119    119    1.0000000
chr1    CDS    1571299    1571488    -    94    190    190    1.0000000
chr1    exon    1571299    1571488    -    94    190    190    1.0000000
chr1    CDS    1571695    1571843    -    37    149    149    1.0000000
chr1    exon    1571695    1571843    -    37    149    149    1.0000000
chr1    CDS    1572044    1572160    -    54    117    117    1.0000000
 
 
We can also get summaries of only those reads that fall entirely within the exon, by using another bed tool 'intersectBed' and pipe the results to coverageBed.
$ intersectBed -abam accepted_hits.bam -b hg19-chr1-corrected.gtf -f 1.0 | coverageBed -abam stdin -b hg19-chr1-corrected.gtf > chr1Coverage-contained.bed
$ grep 'gene_id NM_033487' chr1Coverage.bed | head | cut -f 1,3-5,7,10-13
chr1    CDS    1572770    1572875    -    188    106    106    1.0000000
chr1    exon    1572770    1572875    -    188    106    106    1.0000000
chr1    stop_codon    1571126    1571128    -    37    3    3    1.0000000
chr1    CDS    1571129    1571218    -    67    90    90    1.0000000
chr1    exon    1571100    1571218    -    124    119    119    1.0000000
chr1    CDS    1571299    1571488    -    169    190    190    1.0000000
chr1    exon    1571299    1571488    -    169    190    190    1.0000000
chr1    CDS    1571695    1571843    -    132    149    149    1.0000000
chr1    exon    1571695    1571843    -    132    149    149    1.0000000
chr1    CDS    1572044    1572160    -    100    117    117    1.0000000
 
$ grep 'gene_id NM_033487' chr1Coverage-contained.bed | head | cut -f 1,3-5,7,10-13
chr1    CDS    1572770    1572875    -    54    105    106    0.9905660
chr1    exon    1572770    1572875    -    54    105    106    0.9905660
chr1    stop_codon    1571126    1571128    -    14    3    3    1.0000000
chr1    CDS    1571129    1571218    -    36    89    90    0.9888889
chr1    exon    1571100    1571218    -    36    112    119    0.9411765
chr1    CDS    1571299    1571488    -    151    186    190    0.9789473
chr1    exon    1571299    1571488    -    151    186    190    0.9789473
chr1    CDS    1571695    1571843    -    87    148    149    0.9932886
chr1    exon    1571695    1571843    -    87    148    149    0.9932886
chr1    CDS    1572044    1572160    -    33    117    117    1.0000000
We can do other things, like only do alignments if they match the strand ('-s' option), which would be appropriate for a stranded protocol. Another that might be of interest, is the coverage per base within each exon. This takes much longer, so we'll do only the single gene we made before:
$ coverageBed -d -split -abam accepted_hits.bam -b gene.gtf > genePerBase.bed
$ head genePerBase.bed
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     1    59
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     2    59
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     3    59
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     4    59
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     5    60
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     6    60
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     7    61
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     8    61
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     9    61
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     10    61
 
 
We might want to get gene information. We're not going to go through any details, but there are different ways this could be done. You could get all reads that overlap the entire region of the gene, in which case you would want an annotation with each gene being an entry and follow the previous commands with this type of gtf.


Authors:

This section was written by Harold Pimentel and Elizabeth Perdom. Please feel free to contact Harold if you have any questions.

References:

Bowtie
TopHat

Other tools

GMAP / GSNAP
  1. ^ Trapnell C, Pachter L, Salzberg SL (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009 May 1;25(9):1105-11. Epub 2009 Mar 16. dx.doi.org/10.1093/bioinformatics/btp120