Section 5: Samtools


After the reads are aligned, a common question is to get summaries of the data, such as how many reads overlap each exon or gene -- we will talk more about what are good estimates of gene expression (especially if there is alternative splicing in the organism). The first set of tools that are used for bam files are called 'samtools'. With these you can quickly access parts of the data. We will give just a couple of examples here.

First you have to create an index for the bam file, so that the reads can be quickly accessed. It will create an '*.bai' file.

samtools index accepted_hits.bam

We can get some basic alignment summaries:
$ samtools flagstat accepted_hits.bam
 
138154 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
138154 + 0 mapped (100.00%:nan%)
138154 + 0 paired in sequencing
69211 + 0 read1
68943 + 0 read2
88314 + 0 properly paired (63.92%:nan%)
125798 + 0 with itself and mate mapped
12356 + 0 singletons (8.94%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

We can get basic summaries per chromosome (reference sequence name)

samtools idxstats accepted_hits.bam
chr1 249250621 138154 0
* 0 0 0
Each line consists of reference sequence name, sequence length, # mapped reads and # unmapped reads.

We can get the reads that overlap a region, in this case corresponding to a gene "NM_033487", using 'view'. We'll save it to another file

$ samtools view accepted_hits.bam chr1:1571100-1655775 > geneOverlap.sam
$ head geneOverlap.sam
SRR039630.576251    83    chr1    1532846    255    25M289602N25M    =    1822308    289512    CCCCCCCCGCCGCCCCCCCCCCCGCCTCCGTCCGCCCCTCAGACGCCTCC    3.394998&99&:4..9&<8"<<&<<"<<4"<<<<<<<8<<<<<<<<<<<    NM:i:2    XS:A:-    NH:i:1
SRR039630.6482376    83    chr1    1532863    255    8M289602N42M    =    1822298    289485    CCCCCCCCCTCCGTCCGCCCCTCAGACGCCTCCAGCCATCGGGATGGGCG    &94&88&:9":8&&;<8<<<<;<<<<<<<<;<<<<<<<<<<<<<<<<<<<    NM:i:1    XS:A:-    NH:i:1
SRR039628.4097271    355    chr1    1571051    3    50M    =    1571302    301    ATCGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGGGGTCTC    <<<<<<<<<<<<<<<<<<<<<<;<<<<<-<<<<<6:<86<8.8-:+<28:    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0
SRR039628.5629287    99    chr1    1571051    3    50M    =    1571302    301    ATCGCAGCTCCAGCCGAAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTC    ;;;7;;4;;;;;;.;;(8;;;0;8;:89;+5492777.46864".77478    NM:i:1    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0
SRR039632.632305    163    chr1    1571051    3    50M    =    1571299    298    ATCGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGGCCCGGCTGAGTTCTT    <<<<<<<<<<<<<<<<<<<5;<<5+9;5994544+45/54/54+4'4+4"    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634271    HI:i:0
SRR039628.1159782    403    chr1    1571052    3    50M    =    1570899    -203    TCGCAGCCCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCC    +;5;5"<'<:7'<:9<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<    NM:i:1    NH:i:2    CC:Z:=    CP:i:1634272    HI:i:0
SRR039628.639841    419    chr1    1571053    3    50M    =    1571299    296    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<<<<<<<<<<<<<<<<<*<<<<<<:<<53<<;:+39/7993939;    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0
SRR039628.3362328    345    chr1    1571053    3    50M    *    0    0    CCCAGCCCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <*<''<'<<'8<<:<<<;<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<    NM:i:2    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0
SRR039630.1486911    419    chr1    1571053    3    50M    =    1571298    295    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<8<<<<<<7<6<<<<<<<<<<<:<+2<4<;444844+8+8;;;44    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0
SRR039630.3604171    355    chr1    1571053    3    50M    =    1571299    296    CGCAGCTCCAGCCGCAGTAGCCACGCCTGTGGTCCCGGCTGAGTTCTCCC    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<:9<<<<;:;;;9.:8;::8;;    NM:i:0    NH:i:2    CC:Z:=    CP:i:1634273    HI:i:0
 

You can also have it count the alignments, which is just basically a line count of all the alignments listed (so mate pairs count separately if they are listed separately in the bam file); same with multiple alignments. Furthermore, if the intron gap of the alignment (the 'N') overlaps the region, that will be counted too.
$ samtools view -c accepted_hits.bam chr1:1571100-1655775
10603
$ wc -l geneOverlap.sam
   10603 geneOverlap.sam
 
You can also use samtools to filter out reads based on some flags. A common one is to only get reads that align as a proper pair.

$ samtools view -b -f 0x2 accepted_hits.bam > properPairs.bam
$ samtools index properPairs.bam
$ samtools view -c properPairs.bam chr1:1571100-1655775
7150
The 'b' options tells samtools that the output should be in bam format. The -f is the filter (0x2) corresponds to proper pairs. You can actually do this all in one line
$ samtools view -cbf 0x2 accepted_hits.bam chr1:1571100-1655775
7150

Section 6: bedTools


To get specific types of summaries, many people just write programs using samtools. bedTools (http://code.google.com/p/bedtools/) provides a lot of summarization tools as well as other tools for comparing overlap of annotation, etc. They can work with BED files, but also GTF annotation files and BAM files. We are not going to discuss how to get the right annotation file for the summaries you want, but bedTools also have a lot tools for this. All of these tools can also be used between annotations, for example to find intersections of exons, etc.

We will work with the '.gtf' that we used to align the files (its important that the annotation for the alignment match the annotation for the summaries, in terms of the name of the chromosome, the genome build, etc.). Since we are only working with chr1 reads, we can limit ourselves to these to save processing time.

$ grep -P 'chr1\t' hg19-refGene.gtf > hg19-chr1.gtf
$ head hg19-chr1.gtf
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 66999825 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67101627 67101698 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67105460 67105516 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
Notice that the regions are by exon, and they overlap and are repeated. This is because what is an exon region is different depending on what transcript it is in. bedTools finds the overlap for a region, so each of these regions will have their own overlap (and so a read might get counted twice.) So this may not be the perfect annotation depending on what kind of summaries you want.

$ coverageBed -split -abam accepted_hits.bam -b hg19-chr1.gtf > chr1Coverage.bed
Error: malformed GFF entry at line 15525. Start was greater than end. Exiting.
There is some error in the gtf file (we are not sure why) where the starts are less than the ends for a few entries, so we will work with a version where these have been taken out (code not shown). For simplicity later on, we will also make a file for just the information for a single gene.
grep 'gene_id NM_033487' hg19-chr1-corrected.gtf > gene.gtf
 

coverageBed -split -abam accepted_hits.bam -b hg19-chr1-corrected.gtf > chr1Coverage.bed

The -split option is important for RNA-Seq: it says do not count the overlap if the overlap is in the intron region (in contrast the summary in samtools appears to counts this)

The old annotation information is there, as well as 4 additional columns:

1) The number of features in A (the bam file) that overlapped (by at least one base pair) the B interval (the exon).
2) The number of bases in B that had non-zero coverage from features in A.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage from features in A.

To save space, we'll look at just the columns of interest

$ head -20 chr1Coverage.bed | cut -f 1,3-5,7,10-13,9
chr1    CDS    161480624    161480746    +    gene_id NM_001136219; transcript_id NM_001136219;     0    0    123    0.0000000
chr1    exon    161480624    161480746    +    gene_id NM_001136219; transcript_id NM_001136219;     0    0    123    0.0000000
chr1    CDS    161480624    161480746    +    gene_id NM_021642; transcript_id NM_021642;     0    0    123    0.0000000
chr1    exon    161480624    161480746    +    gene_id NM_021642; transcript_id NM_021642;     0    0    123    0.0000000
chr1    CDS    247463824    247464578    -    gene_id NM_032752; transcript_id NM_032752;     0    0    755    0.0000000
chr1    exon    247463623    247464578    -    gene_id NM_032752; transcript_id NM_032752;     0    0    956    0.0000000
chr1    CDS    248512077    248513010    +    gene_id NM_001001918; transcript_id NM_001001918;     0    0    934    0.0000000
chr1    exon    248512077    248513013    +    gene_id NM_001001918; transcript_id NM_001001918;     0    0    937    0.0000000
chr1    CDS    1572770    1572875    -    gene_id NM_033487; transcript_id NM_033487;     188    106    106    1.0000000
chr1    exon    1572770    1572875    -    gene_id NM_033487; transcript_id NM_033487;     188    106    106    1.0000000
chr1    exon    1310534    1310818    -    gene_id NM_017900; transcript_id NM_017900;     129    230    285    0.8070176
chr1    CDS    1572770    1572875    -    gene_id NM_033486; transcript_id NM_033486;     188    106    106    1.0000000
chr1    exon    1572770    1572875    -    gene_id NM_033486; transcript_id NM_033486;     188    106    106    1.0000000
chr1    CDS    1572770    1572875    -    gene_id NM_033493; transcript_id NM_033493;     188    106    106    1.0000000
chr1    exon    1572770    1572875    -    gene_id NM_033493; transcript_id NM_033493;     188    106    106    1.0000000
chr1    CDS    1179571    1179655    -    gene_id NM_001014980; transcript_id NM_001014980;     2    75    85    0.8823529
chr1    exon    1179571    1179655    -    gene_id NM_001014980; transcript_id NM_001014980;     2    75    85    0.8823529
chr1    CDS    1572770    1572875    -    gene_id NM_033492; transcript_id NM_033492;     188    106    106    1.0000000
chr1    exon    1572770    1572875    -    gene_id NM_033492; transcript_id NM_033492;     188    106    106    1.0000000
chr1    CDS    1572770    1572875    -    gene_id NM_033489; transcript_id NM_033489;     188    106    106    1.0000000
 
You can do other things by combining the filtering of samtools with the coverage. For example, we can use the flag that tells us whether an alignment is a proper-pair, to get only overlaps of proper-pairs:

samtools view -uf 0x2 accepted_hits.bam | coverageBed -split -abam stdin -b hg19-chr1-corrected.gtf > chr1Coverage-properPairs.bed
We can compare the number of counts for the two:

$ grep 'gene_id NM_033487' chr1Coverage.bed | head | cut -f 1,3-5,7,10-13
chr1    CDS    1572770    1572875    -    188    106    106    1.0000000
chr1    exon    1572770    1572875    -    188    106    106    1.0000000
chr1    stop_codon    1571126    1571128    -    37    3    3    1.0000000
chr1    CDS    1571129    1571218    -    67    90    90    1.0000000
chr1    exon    1571100    1571218    -    124    119    119    1.0000000
chr1    CDS    1571299    1571488    -    169    190    190    1.0000000
chr1    exon    1571299    1571488    -    169    190    190    1.0000000
chr1    CDS    1571695    1571843    -    132    149    149    1.0000000
chr1    exon    1571695    1571843    -    132    149    149    1.0000000
chr1    CDS    1572044    1572160    -    100    117    117    1.0000000
 
$ grep 'gene_id NM_033487' chr1Coverage-properPairs.bed | head | cut -f 1,3-5,7,10-13
chr1    CDS    1572770    1572875    -    31    106    106    1.0000000
chr1    exon    1572770    1572875    -    31    106    106    1.0000000
chr1    stop_codon    1571126    1571128    -    35    3    3    1.0000000
chr1    CDS    1571129    1571218    -    65    90    90    1.0000000
chr1    exon    1571100    1571218    -    115    119    119    1.0000000
chr1    CDS    1571299    1571488    -    94    190    190    1.0000000
chr1    exon    1571299    1571488    -    94    190    190    1.0000000
chr1    CDS    1571695    1571843    -    37    149    149    1.0000000
chr1    exon    1571695    1571843    -    37    149    149    1.0000000
chr1    CDS    1572044    1572160    -    54    117    117    1.0000000
 
We can also get summaries of only those reads that fall entirely within the exon, by using another bed tool 'intersectBed' and pipe the results to coverageBed.

$ intersectBed -abam accepted_hits.bam -b hg19-chr1-corrected.gtf -f 1.0 | coverageBed -abam stdin -b hg19-chr1-corrected.gtf > chr1Coverage-contained.bed
$ grep 'gene_id NM_033487' chr1Coverage.bed | head | cut -f 1,3-5,7,10-13
chr1    CDS    1572770    1572875    -    188    106    106    1.0000000
chr1    exon    1572770    1572875    -    188    106    106    1.0000000
chr1    stop_codon    1571126    1571128    -    37    3    3    1.0000000
chr1    CDS    1571129    1571218    -    67    90    90    1.0000000
chr1    exon    1571100    1571218    -    124    119    119    1.0000000
chr1    CDS    1571299    1571488    -    169    190    190    1.0000000
chr1    exon    1571299    1571488    -    169    190    190    1.0000000
chr1    CDS    1571695    1571843    -    132    149    149    1.0000000
chr1    exon    1571695    1571843    -    132    149    149    1.0000000
chr1    CDS    1572044    1572160    -    100    117    117    1.0000000
 
$ grep 'gene_id NM_033487' chr1Coverage-contained.bed | head | cut -f 1,3-5,7,10-13
chr1    CDS    1572770    1572875    -    54    105    106    0.9905660
chr1    exon    1572770    1572875    -    54    105    106    0.9905660
chr1    stop_codon    1571126    1571128    -    14    3    3    1.0000000
chr1    CDS    1571129    1571218    -    36    89    90    0.9888889
chr1    exon    1571100    1571218    -    36    112    119    0.9411765
chr1    CDS    1571299    1571488    -    151    186    190    0.9789473
chr1    exon    1571299    1571488    -    151    186    190    0.9789473
chr1    CDS    1571695    1571843    -    87    148    149    0.9932886
chr1    exon    1571695    1571843    -    87    148    149    0.9932886
chr1    CDS    1572044    1572160    -    33    117    117    1.0000000
 

We can do other things, like only do alignments if they match the strand ('-s' option), which would be appropriate for a stranded protocol. Another that might be of interest, is the coverage per base within each exon. This takes much longer, so we'll do only the single gene we made before:

$ coverageBed -d -split -abam accepted_hits.bam -b gene.gtf > genePerBase.bed
$ head genePerBase.bed
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     1    59
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     2    59
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     3    59
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     4    59
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     5    60
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     6    60
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     7    61
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     8    61
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     9    61
chr1    hg19_refGene    CDS    1572770    1572875    0    -    0    gene_id NM_033487; transcript_id NM_033487;     10    61
 

We might want to get gene information. We're not going to go through any details, but there are different ways this could be done. You could get all reads that overlap the entire region of the gene, in which case you would want an annotation with each gene being an entry and follow the previous commands with this type of gtf.