Sequence+alignment

=__Section 1__: Sequence alignment overview= Below is a typical Illumina pipeline:



Almost every application that does some sort RNA-Seq analysis requires you to first align your raw reads. Raw reads are output from the base caller (e.g. BayesCall) after analyzing an intensity file generated from the image file from a sequencing experiment. Raw reads from the sequencer come in two formats:

FASTQ (most common): code @SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50 ACAGAGCAAAAGACAAGAAAACAGATGAAATTGTGGCTCTAAAGCGGCTG +SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<<9:<66<9 code

FASTA: code >SRR039628.234 HWI-EAS220_1_FC304WC_GEX:3:1:51:335 length=50 ACAGAGCAAAAGACAAGAAAACAGATGAAATTGTGGCTCTAAAGCGGCTG code

You can find complete datasets from the Short Read Archive. After you find a dataset of interest, you can convert the SRA files into FASTQ files by installing the SRA Toolkit and using the tool.

In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the **genome** or **transcriptome**. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section.

Depending on how you will be doing your analysis (transcriptome or genome) you will do your mapping differently. Below are the different pipelines:

Genome pipeline that we will be covering right now:

Transcriptome pipeline:

=__Section 2__: Building a Bowtie index=

The first step in sequence alignment is acquiring a Bowtie index of your reference(s). You will typically use the same index file to map numerous experiments.

You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself.

To acquire an index, do the following: code $ cd Downloads/ $ cat *.fa > hg19_chr1.fa code code $ mkdir -p ~/bwt_idx/hg19_chr1 $ cp hg19_chr1.fa ~/bwt_idx/hg19_chr1 code code $ bowtie-build -f ~/bwt_idx/hg19_chr1/hg19_chr1.fa ~/bwt_idx/hg19_chr1/hg19_chr1 code
 * Check if your index is already built ( Bowtie Indexes)
 * If it does, download it and unzip to your index location
 * If it doesn't, download the genome from UCSC Genome Browser
 * On the left hand side, click "Downloads"
 * Find your model organism and click it (e.g. Human)
 * Find the build (e.g. hg19 or hg19)
 * Click "Full data set"
 * Download "chromFa.tar.gz"
 * Unzip it
 * For this tutorial, please use the file which has been downloaded from the UCSC genome browser.
 * Once you have all of your sequences, the easiest way to make an index is to pipe all of your files into a multi-FASTA file:
 * Move the file to a safe place you can remember. I like to use (see the note at the end of this section).
 * Build the actual index:

You should now see a number of files in directory :

code $ ls ~/bwt_idx/hg19_chr1/ hg19_chr1.1.ebwt hg19_chr1.2.ebwt  hg19_chr1.3.ebwt  hg19_chr1.4.ebwt hg19_chr1.fa hg19_chr1.rev.1.ebwt  hg19_chr1.rev.2.ebwt code


 * IMPORTANT** Remember to keep the original multi-FASTA file with the index . Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.


 * Note: Why the redundant name? Personally I like doing this because when you start collecting numerous indexes your folder becomes huge (each index makes several files). I also like including a README file if the name isn't completely obvious or I have some extra notes I want to include about the index.

=__Section 3__: Mapping with TopHat=

How TopHat works
TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.

It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.

It also finds junctions by using the assembly of mapped reads which imply exons.

For more information, check out the TopHat paper

Choosing an annotation file
It is important to choose a relevant annotation. How reads maps all the way to how isoforms will be quantified depends on this part. If you use an annotation file from the beginning you should stick with it or start the analysis over. Alternatively you could not use an annotation at all.

If you decide to use an annotation, the most complete place to get annotations is the UCSC Genome Browser. Here is how to do it:


 * Go to @http://genome.ucsc.edu
 * Click "Table Browser" on the left hand side
 * Select the genome genome and assembly you will be working on
 * Make sure you choose the same assembly for both your genome reference and annotation
 * Select "Group: Genes and Gene Prediction Tracks"
 * Choose a track that contains the isoforms of interest. If you don't know:
 * RefSeq is rather conservative
 * Ensembl is much more complete
 * Ensure "Region: genome" is selected
 * Select "Output: GTF - gene transfer format"
 * Type a relevant "output file" name
 * e.g.: hg19_refGene.gtf
 * Click "get output"

Running a TopHat process
TopHat has many parameters that are important and relevant to each experiment. You should take care to choose your parameters wisely. This is only a very short list of the set of parameters available in the TopHat manual

First, make sure you have a Bowtie index (see Section 2).


 * ~ Parameter ||~ Short description ||~ Description ||~ Examples ||
 * < || Number of threads || How many concurrent threads will be run in Bowtie. If you have more than one processor, you should use at most the number of processors you have ||< >= 1 ||
 * < || Annotation || This annotation will supply known junctions. In future versions of TopHat it will also map directory to the transcriptome which increases sensitivity. ||<  ||
 * < ||< Mate inner distance ||< This is the expected (mean) inner distance between mate pairs. For, example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired end runs. ||< 200 ||
 * < ||< Mate standard deviation ||< The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp. This is used to discard mates that are too far apart. It is better to overestimate this than underestimate. ||< 80 ||

Note that you will get very different results depending on how choose the mean and standard deviation. This is because TopHat discards mate pairs that are one standard deviation away from the mean.

Let's map the supplied data we have:

code $ tophat --GTF ~/annotations/hg19-refGene.gtf -r 200 --mate-std-dev 80 ~/bwt_idx/hg19_chr1/hg19_chr1 chr1_2M_1.fq chr1_2M_2.fq code

Once it has completed, you should see a number of files:

code $ ls accepted_hits.bam deletions.bed  insertions.bed  junctions.bed left_kept_reads.info logs  right_kept_reads.info code

I like to write a script in Bash to keep track of the parameters I used. Below is the same example in a Bash script:

code format="bash"
 * 1) !/bin/bash

EXP_NAME=exp1 GTF=~/annotations/hg19-refGene.gtf DATA_DIR=~/exp1 LEFT="chr1_2M_1.fq" RIGHT="chr1_2M_2.fq" BWT_IDX=~/bwt_idx/hg19_chr1/hg19_chr1 INNER_LEN=200 STD_DEV=80 OUT_DIR=$EXP_NAME
 * 1) Change these variables to match your experiment

CMD="tophat --GTF $GTF -r $INNER_LEN --mate-std-dev $STD_DEV -o $OUT_DIR $BWT_IDX $DATA_DIR/$LEFT $DATA_DIR/$RIGHT"

echo "Running $CMD" $CMD 2>&1 | tee $EXP_NAME.log code
 * 1) Show output on the screen AND save in $EXP_NAME.log

Once you've written your script, save it as something like, change the permissions and run it.

=__Section 4__: Understanding the output=

The standard output will be three output files in the directory you specified (using or the default  )

This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file. It will appear in the "CL:" field:

code $ samtools view -H accepted_hits.bam code

The file is the file that you should later use for abundance estimation.

The file contains a list of the junctions from the annotation and the novel junctions that TopHat found.

= __Section 5:__ Samtools =

After the reads are aligned, a common question is to get summaries of the data, such as how many reads overlap each exon or gene -- we will talk more about what are good estimates of gene expression (especially if there is alternative splicing in the organism). The first set of tools that are used for bam files are called 'samtools'. With these you can quickly access parts of the data. We will give just a couple of examples here.

First you have to create an index for the bam file, so that the reads can be quickly accessed. It will create an '*.bai' file. code samtools index accepted_hits.bam code

We can get some basic alignment summaries:

code $ samtools flagstat accepted_hits.bam

138154 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 138154 + 0 mapped (100.00%:nan%) 138154 + 0 paired in sequencing 69211 + 0 read1 68943 + 0 read2 88314 + 0 properly paired (63.92%:nan%) 125798 + 0 with itself and mate mapped 12356 + 0 singletons (8.94%:nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) code We can get basic summaries per chromosome (reference sequence name) code samtools idxstats accepted_hits.bam chr1 249250621 138154 0 code Each line consists of reference sequence name, sequence length, # mapped reads and # unmapped reads.
 * 0 0 0

We can get the reads that overlap a region, in this case corresponding to a gene "NM_033487", using 'view'. We'll save it to another file

code $ samtools view accepted_hits.bam chr1:1571100-1655775 > geneOverlap.sam $ head geneOverlap.sam SRR039630.576251   83    chr1    1532846    255    25M289602N25M    =    1822308    289512    CCCCCCCCGCCGCCCCCCCCCCCGCCTCCGTCCGCCCCTCAGACGCCTCC    3.394998&99&:4..9&<8"<<&<<"<<4"<<<<<<<8<<<<<<<<<<<    NM:i:2    XS:A:-    NH:i:1 SRR039630.6482376    83    chr1    1532863    255    8M289602N42M    =    1822298    289485    CCCCCCCCCTCCGTCCGCCCCTCAGACGCCTCCAGCCATCGGGATGGGCG    &94&88&:9":8&&;<8<<<<;<<<<<<<<;<<<<<<<<<<<<<<<<<<<    NM:i:1    XS:A:-    NH:i:1 SRR039628.4097271   355    chr1    1571051    3    50M    =    1571302    301    ATCGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGGGGTCTC    <<<<<<<<<<<<<<<<<<<<<<;<<<<<-<<<<<6:<86<8.8-:+<28:    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0 SRR039628.5629287   99    chr1    1571051    3    50M    =    1571302    301    ATCGCAGCTCCAGCCGAAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTC    ;;;7;;4;;;;;;.;;(8;;;0;8;:89;+5492777.46864".77478    NM:i:1    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0 SRR039632.632305    163    chr1    1571051    3    50M    =    1571299    298    ATCGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGGCCCGGCTGAGTTCTT    <<<<<<<<<<<<<<<<<<<5;<<5+9;5994544+45/54/54+4'4+4"    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0 SRR039628.1159782    403    chr1    1571052    3    50M    =    1570899    -203    TCGCAGCCCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCC    +;5;5"<'<:7'<:9<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<    NM:i:1    NH:i:2    CC:Z:=    CP:i:1634272    HI:i:0 SRR039628.639841    419    chr1    1571053    3    50M    =    1571299    296    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<<<<<<<<<<<<<<<<<*<<<<<<:<<53<<;:+39/7993939;    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0 SRR039628.3362328   345    chr1    1571053    3    50M    *    0    0    CCCAGCCCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <*<''<'<<'8<<:<<<;<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0 SRR039630.1486911   419    chr1    1571053    3    50M    =    1571298    295    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<8<<<<<<7<6<<<<<<<<<<<:<+2<4<;444844+8+8;;;44    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0 SRR039630.3604171   355    chr1    1571053    3    50M    =    1571299    296    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<:9<<<<;:;;;9.:8;::8;;    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0 code You can also have it count the alignments, which is just basically a line count of all the alignments listed (so mate pairs count separately if they are listed separately in the bam file); same with multiple alignments. Furthermore, if the intron gap of the alignment (the 'N') overlaps the region, that will be counted too. code $ samtools view -c accepted_hits.bam chr1:1571100-1655775 10603 $ wc -l geneOverlap.sam 10603 geneOverlap.sam code You can also use samtools to filter out reads based on some flags. A common one is to only get reads that align as a proper pair. code $ samtools view -b -f 0x2 accepted_hits.bam > properPairs.bam $ samtools index properPairs.bam $ samtools view -c properPairs.bam chr1:1571100-1655775 7150 code The 'b' options tells samtools that the output should be in bam format. The -f is the filter (0x2) corresponds to proper pairs. You can actually do this all in one line code $ samtools view -cbf 0x2 accepted_hits.bam chr1:1571100-1655775 7150 code = __Section 6__: bedTools =

To get specific types of summaries, many people just write programs using samtools. **bedTools** ( [] ) provides a lot of summarization tools as well as other tools for comparing overlap of annotation, etc. They can work with BED files, but also GTF annotation files and BAM files. We are not going to discuss how to get the right annotation file for the summaries you want, but bedTools also have a lot tools for this. All of these tools can also be used between annotations, for example to find intersections of exons, etc.

We will work with the '.gtf' that we used to align the files (its important that the annotation for the alignment match the annotation for the summaries, in terms of the name of the chromosome, the genome build, etc.). Since we are only working with chr1 reads, we can limit ourselves to these to save processing time.

code $ grep -P 'chr1\t' hg19-refGene.gtf > hg19-chr1.gtf $ head hg19-chr1.gtf chr1 hg19_refGene start_codon 67000042 67000044 0.000000 +. gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene exon 66999825 67000051 0.000000 +. gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene exon 67091530 67091593 0.000000 +. gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene exon 67098753 67098777 0.000000 +. gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene exon 67101627 67101698 0.000000 +. gene_id "NM_032291"; transcript_id "NM_032291"; chr1 hg19_refGene CDS 67105460 67105516 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291"; code Notice that the regions are by exon, and they overlap and are repeated. This is because what is an exon region is different depending on what transcript it is in. bedTools finds the overlap for a region, so each of these regions will have their own overlap (and so a read might get counted twice.) So this may not be the perfect annotation depending on what kind of summaries you want.

For simplicity later on, we will also make a file for just the information for a single gene.

code $ grep 'gene_id NM_033487' hg19-chr1-corrected.gtf > gene.gtf code Then we can run coverageBed to find the coverage for each region in the gtf file. code $ coverageBed -split -abam accepted_hits.bam -b hg19-chr1.gtf > chr1Coverage.bed Error: malformed GFF entry at line 15525. Start was greater than end. Exiting. code There is some error in the gtf file (we are not sure why) where the starts are less than the ends for a few entries, so we will work with a version where these have been taken out (code not shown). For simplicity later on, we will also make a file for just the information for a single gene.

code $ coverageBed -split -abam accepted_hits.bam -b hg19-chr1-corrected.gtf > chr1Coverage.bed code The -split option is important for RNA-Seq: it says do not count the overlap if the overlap is in the intron region (in contrast the summary in samtools appears to counts this)

The old annotation information is there, as well as 4 additional columns:

1) The number of features in A (the bam file) that overlapped (by at least one base pair) the B interval (the exon). 2) The number of bases in B that had non-zero coverage from features in A. 3) The length of the entry in B. 4) The fraction of bases in B that had non-zero coverage from features in A.

To save space, we'll look at just the columns of interest code $ head -20 chr1Coverage.bed | cut -f 1,3-5,7,10-13,9 chr1   CDS    161480624    161480746    +    gene_id NM_001136219; transcript_id NM_001136219;     0    0    123    0.0000000 chr1   exon    161480624    161480746    +    gene_id NM_001136219; transcript_id NM_001136219;     0    0    123    0.0000000 chr1   CDS    161480624    161480746    +    gene_id NM_021642; transcript_id NM_021642;     0    0    123    0.0000000 chr1   exon    161480624    161480746    +    gene_id NM_021642; transcript_id NM_021642;     0    0    123    0.0000000 chr1   CDS    247463824    247464578    -    gene_id NM_032752; transcript_id NM_032752;     0    0    755    0.0000000 chr1   exon    247463623    247464578    -    gene_id NM_032752; transcript_id NM_032752;     0    0    956    0.0000000 chr1   CDS    248512077    248513010    +    gene_id NM_001001918; transcript_id NM_001001918;     0    0    934    0.0000000 chr1   exon    248512077    248513013    +    gene_id NM_001001918; transcript_id NM_001001918;     0    0    937    0.0000000 chr1   CDS    1572770    1572875    -    gene_id NM_033487; transcript_id NM_033487;     188    106    106    1.0000000 chr1   exon    1572770    1572875    -    gene_id NM_033487; transcript_id NM_033487;     188    106    106    1.0000000 chr1   exon    1310534    1310818    -    gene_id NM_017900; transcript_id NM_017900;     129    230    285    0.8070176 chr1   CDS    1572770    1572875    -    gene_id NM_033486; transcript_id NM_033486;     188    106    106    1.0000000 chr1   exon    1572770    1572875    -    gene_id NM_033486; transcript_id NM_033486;     188    106    106    1.0000000 chr1   CDS    1572770    1572875    -    gene_id NM_033493; transcript_id NM_033493;     188    106    106    1.0000000 chr1   exon    1572770    1572875    -    gene_id NM_033493; transcript_id NM_033493;     188    106    106    1.0000000 chr1   CDS    1179571    1179655    -    gene_id NM_001014980; transcript_id NM_001014980;     2    75    85    0.8823529 chr1   exon    1179571    1179655    -    gene_id NM_001014980; transcript_id NM_001014980;     2    75    85    0.8823529 chr1   CDS    1572770    1572875    -    gene_id NM_033492; transcript_id NM_033492;     188    106    106    1.0000000 chr1   exon    1572770    1572875    -    gene_id NM_033492; transcript_id NM_033492;     188    106    106    1.0000000 chr1   CDS    1572770    1572875    -    gene_id NM_033489; transcript_id NM_033489;     188    106    106    1.0000000

code You can do other things by combining the filtering of samtools with the coverage. For example, we can use the flag that tells us whether an alignment is a proper-pair, to get only overlaps of proper-pairs, and compare the counts of the two: code $ samtools view -uf 0x2 accepted_hits.bam | coverageBed -split -abam stdin -b hg19-chr1-corrected.gtf > chr1Coverage-properPairs.bed code code $ grep 'gene_id NM_033487' chr1Coverage.bed | head | cut -f 1,3-5,7,10-13 chr1   CDS    1572770    1572875    -    188    106    106    1.0000000 chr1   exon    1572770    1572875    -    188    106    106    1.0000000 chr1   stop_codon    1571126    1571128    -    37    3    3    1.0000000 chr1   CDS    1571129    1571218    -    67    90    90    1.0000000 chr1   exon    1571100    1571218    -    124    119    119    1.0000000 chr1   CDS    1571299    1571488    -    169    190    190    1.0000000 chr1   exon    1571299    1571488    -    169    190    190    1.0000000 chr1   CDS    1571695    1571843    -    132    149    149    1.0000000 chr1   exon    1571695    1571843    -    132    149    149    1.0000000 chr1   CDS    1572044    1572160    -    100    117    117    1.0000000

$ grep 'gene_id NM_033487' chr1Coverage-properPairs.bed | head | cut -f 1,3-5,7,10-13 chr1   CDS    1572770    1572875    -    31    106    106    1.0000000 chr1   exon    1572770    1572875    -    31    106    106    1.0000000 chr1   stop_codon    1571126    1571128    -    35    3    3    1.0000000 chr1   CDS    1571129    1571218    -    65    90    90    1.0000000 chr1   exon    1571100    1571218    -    115    119    119    1.0000000 chr1   CDS    1571299    1571488    -    94    190    190    1.0000000 chr1   exon    1571299    1571488    -    94    190    190    1.0000000 chr1   CDS    1571695    1571843    -    37    149    149    1.0000000 chr1   exon    1571695    1571843    -    37    149    149    1.0000000 chr1   CDS    1572044    1572160    -    54    117    117    1.0000000

code We can also get summaries of only those reads that fall entirely within the exon, by using another bed tool 'intersectBed' and pipe the results to coverageBed. code $ intersectBed -abam accepted_hits.bam -b hg19-chr1-corrected.gtf -f 1.0 | coverageBed -abam stdin -b hg19-chr1-corrected.gtf > chr1Coverage-contained.bed $ grep 'gene_id NM_033487' chr1Coverage.bed | head | cut -f 1,3-5,7,10-13 chr1   CDS    1572770    1572875    -    188    106    106    1.0000000 chr1   exon    1572770    1572875    -    188    106    106    1.0000000 chr1   stop_codon    1571126    1571128    -    37    3    3    1.0000000 chr1   CDS    1571129    1571218    -    67    90    90    1.0000000 chr1   exon    1571100    1571218    -    124    119    119    1.0000000 chr1   CDS    1571299    1571488    -    169    190    190    1.0000000 chr1   exon    1571299    1571488    -    169    190    190    1.0000000 chr1   CDS    1571695    1571843    -    132    149    149    1.0000000 chr1   exon    1571695    1571843    -    132    149    149    1.0000000 chr1   CDS    1572044    1572160    -    100    117    117    1.0000000

$ grep 'gene_id NM_033487' chr1Coverage-contained.bed | head | cut -f 1,3-5,7,10-13 chr1   CDS    1572770    1572875    -    54    105    106    0.9905660 chr1   exon    1572770    1572875    -    54    105    106    0.9905660 chr1   stop_codon    1571126    1571128    -    14    3    3    1.0000000 chr1   CDS    1571129    1571218    -    36    89    90    0.9888889 chr1   exon    1571100    1571218    -    36    112    119    0.9411765 chr1   CDS    1571299    1571488    -    151    186    190    0.9789473 chr1   exon    1571299    1571488    -    151    186    190    0.9789473 chr1   CDS    1571695    1571843    -    87    148    149    0.9932886 chr1   exon    1571695    1571843    -    87    148    149    0.9932886 chr1   CDS    1572044    1572160    -    33    117    117    1.0000000 code We can do other things, like only do alignments if they match the strand ('-s' option), which would be appropriate for a stranded protocol. Another that might be of interest, is the coverage per base within each exon. This takes much longer, so we'll do only the single gene we made before: code $ coverageBed -d -split -abam accepted_hits.bam -b gene.gtf > genePerBase.bed $ head genePerBase.bed chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     1    59 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     2    59 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     3    59 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     4    59 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     5    60 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     6    60 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     7    61 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     8    61 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     9    61 chr1   hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     10    61

code We might want to get gene information. We're not going to go through any details, but there are different ways this could be done. You could get all reads that overlap the entire region of the gene, in which case you would want an annotation with each gene being an entry and follow the previous commands with this type of gtf. = = = = = =

= __Authors__: = This section was written by Harold Pimentel and Elizabeth Perdom. Please feel free to contact Harold if you have any questions.

=__References__:= Bowtie TopHat

Other tools
GMAP / GSNAP