After the reads are aligned, a common question is to get summaries of the data, such as how many reads overlap each exon or gene -- we will talk more about what are good estimates of gene expression (especially if there is alternative splicing in the organism). The first set of tools that are used for bam files are called 'samtools'. With these you can quickly access parts of the data. We will give just a couple of examples here.
First you have to create an index for the bam file, so that the reads can be quickly accessed. It will create an '*.bai' file.
samtools index accepted_hits.bam
We can get some basic alignment summaries:
$ samtools flagstat accepted_hits.bam
138154 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
138154 + 0 mapped (100.00%:nan%)
138154 + 0 paired in sequencing
69211 + 0 read1
68943 + 0 read2
88314 + 0 properly paired (63.92%:nan%)
125798 + 0 with itself and mate mapped
12356 + 0 singletons (8.94%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
We can get basic summaries per chromosome (reference sequence name)
You can also have it count the alignments, which is just basically a line count of all the alignments listed (so mate pairs count separately if they are listed separately in the bam file); same with multiple alignments. Furthermore, if the intron gap of the alignment (the 'N') overlaps the region, that will be counted too.
The 'b' options tells samtools that the output should be in bam format. The -f is the filter (0x2) corresponds to proper pairs. You can actually do this all in one line
To get specific types of summaries, many people just write programs using samtools. bedTools (http://code.google.com/p/bedtools/) provides a lot of summarization tools as well as other tools for comparing overlap of annotation, etc. They can work with BED files, but also GTF annotation files and BAM files. We are not going to discuss how to get the right annotation file for the summaries you want, but bedTools also have a lot tools for this. All of these tools can also be used between annotations, for example to find intersections of exons, etc.
We will work with the '.gtf' that we used to align the files (its important that the annotation for the alignment match the annotation for the summaries, in terms of the name of the chromosome, the genome build, etc.). Since we are only working with chr1 reads, we can limit ourselves to these to save processing time.
Notice that the regions are by exon, and they overlap and are repeated. This is because what is an exon region is different depending on what transcript it is in. bedTools finds the overlap for a region, so each of these regions will have their own overlap (and so a read might get counted twice.) So this may not be the perfect annotation depending on what kind of summaries you want.
$ coverageBed -split -abam accepted_hits.bam -b hg19-chr1.gtf > chr1Coverage.bed
Error: malformed GFF entry at line 15525. Start was greater than end. Exiting.
There is some error in the gtf file (we are not sure why) where the starts are less than the ends for a few entries, so we will work with a version where these have been taken out (code not shown). For simplicity later on, we will also make a file for just the information for a single gene.
The -split option is important for RNA-Seq: it says do not count the overlap if the overlap is in the intron region (in contrast the summary in samtools appears to counts this)
The old annotation information is there, as well as 4 additional columns:
1) The number of features in A (the bam file) that overlapped (by at least one base pair) the B interval (the exon).
2) The number of bases in B that had non-zero coverage from features in A.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage from features in A.
To save space, we'll look at just the columns of interest
You can do other things by combining the filtering of samtools with the coverage. For example, we can use the flag that tells us whether an alignment is a proper-pair, to get only overlaps of proper-pairs:
We can also get summaries of only those reads that fall entirely within the exon, by using another bed tool 'intersectBed' and pipe the results to coverageBed.
We can do other things, like only do alignments if they match the strand ('-s' option), which would be appropriate for a stranded protocol. Another that might be of interest, is the coverage per base within each exon. This takes much longer, so we'll do only the single gene we made before:
We might want to get gene information. We're not going to go through any details, but there are different ways this could be done. You could get all reads that overlap the entire region of the gene, in which case you would want an annotation with each gene being an entry and follow the previous commands with this type of gtf.
Section 5: Samtools
After the reads are aligned, a common question is to get summaries of the data, such as how many reads overlap each exon or gene -- we will talk more about what are good estimates of gene expression (especially if there is alternative splicing in the organism). The first set of tools that are used for bam files are called 'samtools'. With these you can quickly access parts of the data. We will give just a couple of examples here.
First you have to create an index for the bam file, so that the reads can be quickly accessed. It will create an '*.bai' file.
We can get some basic alignment summaries:
We can get basic summaries per chromosome (reference sequence name)
Each line consists of reference sequence name, sequence length, # mapped reads and # unmapped reads.
We can get the reads that overlap a region, in this case corresponding to a gene "NM_033487", using 'view'. We'll save it to another file
You can also have it count the alignments, which is just basically a line count of all the alignments listed (so mate pairs count separately if they are listed separately in the bam file); same with multiple alignments. Furthermore, if the intron gap of the alignment (the 'N') overlaps the region, that will be counted too.
You can also use samtools to filter out reads based on some flags. A common one is to only get reads that align as a proper pair.
The 'b' options tells samtools that the output should be in bam format. The -f is the filter (0x2) corresponds to proper pairs. You can actually do this all in one line
Section 6: bedTools
To get specific types of summaries, many people just write programs using samtools. bedTools (http://code.google.com/p/bedtools/) provides a lot of summarization tools as well as other tools for comparing overlap of annotation, etc. They can work with BED files, but also GTF annotation files and BAM files. We are not going to discuss how to get the right annotation file for the summaries you want, but bedTools also have a lot tools for this. All of these tools can also be used between annotations, for example to find intersections of exons, etc.
We will work with the '.gtf' that we used to align the files (its important that the annotation for the alignment match the annotation for the summaries, in terms of the name of the chromosome, the genome build, etc.). Since we are only working with chr1 reads, we can limit ourselves to these to save processing time.
Notice that the regions are by exon, and they overlap and are repeated. This is because what is an exon region is different depending on what transcript it is in. bedTools finds the overlap for a region, so each of these regions will have their own overlap (and so a read might get counted twice.) So this may not be the perfect annotation depending on what kind of summaries you want.
There is some error in the gtf file (we are not sure why) where the starts are less than the ends for a few entries, so we will work with a version where these have been taken out (code not shown). For simplicity later on, we will also make a file for just the information for a single gene.
The -split option is important for RNA-Seq: it says do not count the overlap if the overlap is in the intron region (in contrast the summary in samtools appears to counts this)
The old annotation information is there, as well as 4 additional columns:
1) The number of features in A (the bam file) that overlapped (by at least one base pair) the B interval (the exon).
2) The number of bases in B that had non-zero coverage from features in A.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage from features in A.
To save space, we'll look at just the columns of interest
You can do other things by combining the filtering of samtools with the coverage. For example, we can use the flag that tells us whether an alignment is a proper-pair, to get only overlaps of proper-pairs:
We can compare the number of counts for the two:
We can also get summaries of only those reads that fall entirely within the exon, by using another bed tool 'intersectBed' and pipe the results to coverageBed.
We can do other things, like only do alignments if they match the strand ('-s' option), which would be appropriate for a stranded protocol. Another that might be of interest, is the coverage per base within each exon. This takes much longer, so we'll do only the single gene we made before:
We might want to get gene information. We're not going to go through any details, but there are different ways this could be done. You could get all reads that overlap the entire region of the gene, in which case you would want an annotation with each gene being an entry and follow the previous commands with this type of gtf.