SAMtools and BEDtools workshop

Introduction

  • Learning to program is essential for performing sophisticated/customized analyses but one can accomplish an impressive amount of computational tasks by simply taking advantage of pre-existing software, without ever having to write your own script.
  • The types of analyses that are possible are increasing rapidly as the field matures and the community adopts standardized file formats.
  • One of the most common types of file formats related to next-gen sequence is called SAM and is used to store information about how and where short reads align to a reference sequence. Often, in conjunction with SAM, you will see BAM files mentioned. BAM just stands for Binary SAM. They store the same exact information except that BAM files are compressed so that they take up less disk space. Two other file formats that are commonly used in genomics that we will also talk a bit about are GFF and BED, which are ways of storing the coordinates of genome features.
  • For basically all types of analyses related to next-gen sequence, you will at some point end up having to align your reads to a reference sequence. Back in the early days, every piece of software for aligning short reads output the alignment information in a slightly different format which meant that you either had to write your own scripts to convert alignments between formats, or hope to find a script on the internet. These days, the vast majority of short read aligners output alignments in SAM format. This has made it easier to develop suites of utilities for downstream processing of short read alignments. The two that I'm going to cover in detail today are SAMtools and BEDtools
  • The goal of this workshop is introduce you to these two suites of utilities and some of the fairly sophisticated computation that they will allow you to perform, without ever having to write your own script.


Links of Interest


A note about the format of this wiki

Examples of input and output to and from the command line are included below. Commands that you will type into the command line are shown in Bold Lucida font and are preceded by a dollar sign. You should not actually type the dollar sign into the command line. Output is shown in Italic Lucida. Example:

$ input command
output from command


If you are working on one of the CGRL servers

You will need to type in the following commands to make the software we will use today available to you:

$ module load samtools/0.1.18
$ module load bedtools
$ module load picard-tools
$ module load R


If you logout and log back in, you will need to type these commands in again. You most likely will not need to do this if you have these utilities installed on your own machine or are working on a different server.


SAMtools


We're going to start with SAMtools, which by itself actually doesn't allow you to do anything really interesting, but you will still end up using it all the time to convert between SAM and BAM formats. Also, many downstream tools, such as BEDtools and many SNP callers, require that BAM files be indexed and sorted, which is easily accomplished with SAMtools.

The SAM format, while not exactly easy on the eyes, encodes almost every piece of information you would ever want to know about how and where a read aligns to a reference sequence.

Let's take a look:

$ less example.sam
@HD VN:1.0 SO:unsorted
@SQ SN:chromosome_I LN:230218
@PG ID:Bowtie VN:0.12.7 CL:"bowtie -v 1 -m1 -S chromosome_I example.fastq example.sam"
Frag_1 0 chromosome_I 128138 255 33M * 0 0 ATACAAATATATGATGAATCATTAAAGAGGAGG AIGH?IFIIIF+GGGHIHII+=FD?FIFH;IHF XA:i:0 MD:Z:101 NM:i:0
Frag_2 16 chromosome_I 54594 255 33M * 0 0 AATCTGAAGTGTCTATAAGTACTGTATTCTTAG ICCGIHIFHIIIHI=EIIFFG@?IICEEDHI>H XA:i:0 MD:Z:101 NM:i:0
Frag_3 0 chromosome_I 181175 255 33M * 0 0 CAAGAAATATCTTGACCGCAGTGAACTGTGGGA HCIIIHIH?HIG:IIHGDHFBHIIIGI2HIII@ XA:i:0 MD:Z:101 NM:i:0
Frag_4 16 chromosome_I 63331 255 33M * 0 0 GTTTCAACAAGGTGCGCCCCTCTGTATACTTTT HEIIGIBIBGH=2GFIIIHGIHIIH7HGI?HHI XA:i:0 MD:Z:101 NM:i:0
Frag_5 0 chromosome_I 141473 255 33M * 0 0 TGATTACTTTTCTTTTGATGTGCTTATCTTACA IHI6IIIIIHEEIFIIHEIFIHIIHICHIII%C XA:i:0 MD:Z:101 NM:i:0
...


Every SAM files starts with a number of lines that begin with the '@' symbol. The amount of information included in this section varies depending on the short read aligner that was used but all will include at least the '@SQ' lines which list the IDs and lengths of the all sequences present within the reference file that the short reads were aligned to. In this example, two other lines are
included. The '@HD' line lists the SAM format version that was used and whether or not the file has been sorted. The @PG line lists the aligner name and version that was used to align the reads as well as the actual command line.

Below these lines, the actual read alignment information begins.

Column:
  1. Read ID
  2. Bitwise flag (encodes additional alignment information such as whether mate of current was mapped)
  3. ID of reference sequence that read aligned to
  4. Coordinate on reference sequence of left-most position of first matching base
  5. Mapping quality
  6. CIGAR string (encodes how each base of read aligned, e.g. match, mismatch, insertion, deletion, etc.)
  7. Reference ID where mate aligned (if paired-end)
  8. Reference position where mate aligned (if paired-end)
  9. Fragment length (if paired-end)
  10. Read sequence
  11. Read quality scores
  12. Alignment section (see SAM format specification and aligner manual for more details)

Let's go over some of the basic things you will want to do with SAMtools. First let's see what tools are available within this suite:

$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.16 (r963:234)

Usage: samtools <command> [options]

Command: view SAM<->BAM conversion
sort sort alignment file
pileup generate pileup output
mpileup multi-way pileup
depth compute the depth
faidx index/extract FASTA
index index alignment
idxstats BAM index stats (r595 or later)
fixmate fix mate information
glfview print GLFv3 file
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates
reheader replace BAM header
cat concatenate BAMs
targetcut cut fosmid regions (for fosmid pool only)
phase phase heterozygotes

The most common thing you are likely to be using SAMtools for is converting between SAM and BAM files. Let's figure out how to do that:

$ samtools view

Usage: samtools view [options] <in.bam>|<in.sam> [region1 [...]]

Options:
-b output BAM
-h print header for the SAM output
-H print header only (no alignments)
-S input is SAM
-u uncompressed BAM output (force -b)
-1 fast compression (force -b)
-x output FLAG in HEX (samtools-C specific)
-X output FLAG in string (samtools-C specific)
-c print only the count of matching records
-L FILE output alignments overlapping the input BED FILE [null]
-t FILE list of reference names and lengths (force -S) [null]
-T FILE reference sequence file (force -S) [null]
-o FILE output file name [stdout]
-R FILE list of read groups to be outputted [null]
-f INT required flag, 0 for unset [0]
-F INT filtering flag, 0 for unset [0]
-q INT minimum mapping quality [0]
-l STR only output reads in library STR [null]
-r STR only output reads in read group STR [null]
-? longer help

If we want to go from SAM to BAM, it looks like we have to specify that the input file is in SAM format (-S) and that we want samtools to output a BAM file (-b) and we have to capture the output in a new file:

$ samtools view -bS example.sam > example.bam
[samopen] SAM header is present: 1 sequences.

The opposite conversion doesn't require any arguments:

$ samtools view example.bam > example.sam

The other two most commonly used samtools utilties are sort and index.

$ samtools sort
Usage: samtools sort [-on] [-m <maxMem>] <in.bam> <out.prefix>

It looks like we need specify the BAM file to sort as well as the prefix of the output file:

$ samtools sort example.bam example.sorted
This should create a new sorted file called example.sorted.bam.

Often, after you have sorted a BAM file you will also need to index it:

$ samtools index
Usage: samtools index <in.bam> [out.index]

This time, the name of the output file is optional. If you don't include anything, if will create an index with the same name as the BAM file ending with ".bai"

$ samtools index example.sorted.bam

This should create a file called example.sorted.bam.bai

If you want to visualize short read alignments, you'll also need to index the reference fasta file:

$ samtools faidx reference.fasta

Now that we have all the required files, we can explore a couple of ways of visualizing short read alignments. SAMtools comes with its own utility for visualizing read alignments called tview. You don't have to specify the BAM and reference indexes, but they need to be in the same directory as the BAM and reference FASTA files.

$ samtools tview example.sorted.bam reference.fasta
Screen_Shot_2012-04-22_at_7.12.39_PM.png

If you want to see the menu that lists the various options for displaying the alignments and moving around type '?'. If you want to jump to a specific location along the chromosome, type 'g' followed by the location in the format referenceID:coordinate e.g. chromosome_I:150000.

SAMtools tview is fast and runs directly in the terminal window which makes it nice if you want to quickly look at read alignments at a specific location in the reference sequence. For browsing read alignments, I prefer the standalone program Tablet which is more feature-rich and easier to navigate.

Moving files off of the server onto your laptop

You are going to run Tablet locally on your own laptop which means you need to move the appropriate files from the server onto your laptop. The easiest way to do this if you have a Mac is with the program Fugu otherwise you can use the command 'scp'. First open a new terminal window then type:

$ scp username@poset.cgrl.berkeley.edu:example.bam .
Password:
example.bam 100% 5448KB 454.0KB/s 00:12

Before transferring the file it asks you for a password. Type in the password for your account on the remote server. Also, don't forget to include the period at the end of the command and to replace 'username' in the above example with your actual username. Repeat this command for all the files you need:

  • sorted BAM file (e.g. example.sorted.bam)
  • index for sorted BAM file (e.g. example.sorted.bam.bai)
  • reference FASTA file (e.g. chromosome_I.fasta)
  • reference FASTA index (e.g. chromosome_I.fasta.fai)

Tablet


After downloading and installing Tablet, start the program and click on 'Import an assembly into Tablet'. The only tricky thing about using Tablet is figuring out which files need to be imported into which box. You should click on the 'Browse' button and locate the appropriate files so that the sorted BAM file goes into the box labeled 'Primary assembly file' and the reference FASTA file goes into the Reference/consensus box.

BEDtools


BEDtools is a great suite of tools that can be used for comparing large sets of genome features. A genome feature is basically anything that can represented by a start and end coordinate within the genome sequence space. Examples include genes, transposable elements, short read alignments, etc.

Standard file formats for representing genome features

GFF

- 1-based (coordinate system starts at 1)
- start and end positions are inclusive
- always has 9 columns
- brief description of columns can be found here

BED

- 0-based (coordinate system starts at 0)
- start is inclusive but end is not
- basic version has 3 columns with 9 additional optional columns
- description of format can be found here

BEDtools will accept GFF, BED, and BAM formats as input but only outputs BED files. It is aware of the different coordinate representations in the different file formats (e.g. GFF vs BED) so you don't have to worry about that.

Most tools within the BEDtools suite perform comparisons between two different sets of genome features. The filename that contains the first set (Set A) is given after the -a flag while the second (Set B) is given after the -b flag. If one of the feature sets is a BAM file, it should be given as Set A (Set B is read into memory and since BAM files can have millions of lines, avoiding using it as Set B will save memory) and the -a flag has to be changed to -abam. All genome feature files need to be tab-delimited. A quick example for two BED files:

Set_A:
chromosome_1 10 200
chromosome_2 500 1000
chromosome_3 2000 3000

Set_B:
chromosome_1 190 250
chromosome_2 400 600
chromosome_3 3001 4000

The most basic BEDtools utility is called intersectBed and returns the coordinates of the portion of each feature in Set_A that overlaps with a feature in Set_B

$ intersectBed -a Set_A -b Set_B
chromosome_1 190 200
chromosome_2 500 600

Typing the name of the utility without any arguments will return a list of all the possible arguments you can use to control what the utility reports. For example, adding the -wo argument tells intersectBed to report the full coordinates of both features that overlap plus the number of basepairs that lie within the overlap:

$ intersectBed -a Set_A -b Set_B -wo
chromosome_1 10 200 chromosome_1 190 250 10
chromosome_2 500 1000 chromosome_2 400 600 100

The BEDtools genome file

Some of the BEDtools utilities require a genome file which is simply a tab-delimited list of each sequence ID found in the reference FASTA file, followed by the length of the sequence.

You will have a chance to explore more BEDtools utilities in the excercises.


Exercises


1. You will recognize the premise of the first exercise if you took the Intro to Unix workshop:

Another lab in the department has been studying a mutant in S. cerevisiae S288C and have successfully mapped the single point mutation to a 20kb region of the genome. They PCR amplified this region in overlapping segments, pooled the segments and sequenced them on an Illumina machine, 101 basepair single-end reads.

In the Unix workshop, the bonus exercise involved a fairly convoluted approach to find the mutation using only built-in Unix utilities like grep. This time, examine the read alignments in Tablet to see if you can simply find the mutation by eye.

You can find the read alignments here:
/global/courses/spr2012/SAMtoolsworkshop/S288C_mutant_chromV.sam
Reference file:
/global/courses/spr2012/SAMtoolsworkshop/S288C_chromosome_V.fa

You will have to download and install Tablet on your laptop and use samtools to generate the appropriate files for Tablet. You may want to play around with how Tablet displays the read alignments to make it easier to spot read bases that differ from the reference.

Once you have identified the coordinate of the mutation, use samtools tview to visualize the read alignments. Jump to the location of the mutation and make sure tview displays it as well.

2. The premise of this exercise is that you evolved a strain of yeast under high salt conditions and generated iIlumina sequence from the genome of the evolved strain. You've aligned the reads back to the reference genome and would like to determine if any deletions or duplications of genomic regions occurred in the evolved strain by examining read coverage along the chromosomes.

You can find the read alignments here (only chromosome #1 is included to reduce the file size):
/global/courses/spr2012/SAMtoolsworkshop/S288C_evolved.sam

2A. Read the BEDtools manual to find a utility that you can use to calculate read coverage along the genome. Run this utility to calculate per-base read depth coverage along chromosome #1. Hint: you will need to know the length of chromosome #1 to make the BEDtools genome file. Luckily there is a samtools utility that will do this. Look at the samtools website to figure out which one.

2B. Create a plot of read coverage along chrom #1 using the output from 2A. You can either move the file to your laptop and use your graphing method of choice, do it yourself in R, or use this R script (rename the coverage file R.in first):

$ R --no-save < /global/courses/spr2012/SAMtoolsworkshop/plotCoverage.R

3. The premise of this exercise is that you have ChIP-Seq data from two different histone modifications in Drosophila, a "silencing" mark associated with constitutive heterochromatin (H3K9me3) and an "active" mark associated with transcribed genes (H3K36me3). You have already identified the genomic locations of peaks of binding for each mark and want to quickly determine if the peak locations make biological sense by determining, for each mark, the fraction of peaks that overlap genes versus repetitive elements.

3A. You identified repetitive elements in the D. melanogaster genome using RepeatMasker, however this program isn't very good at identifying full-length elements and will often report multiple segments of a given element that are directly adjacent or within a few hundred basepairs of each other in the genome as separate elements. Find a BEDtools utility that you can use to merge features that are within a specified distance of each other into a single feature and save the output as a new BED file.

The un-merged output of RepeatMasker is here:
/global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_repeatmasker.gff

3B. Use the merged RepeatMasker output, along with BEDtools, to determine, for each mark, the fraction of peaks that overlap genes versus repetitive elements.

The other files you will need are here:
Dmel gene coordinates: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_genes.gff
H3K9me3 peak locations: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K9me3_E0-12h.gff
H3K36me3 peak locations: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K36me3_E0-12h.gff

Bonus Exercises

1. There is a utility that comes with SAMtools that can be used to identify sequence variants from short read alignments. Figure out which tool and run it on the
S288C_mutant_chromV.sam alignments to see if it identifies the same mutation that you found in Tablet and/or tview.

2. Use BEDtools to create 10 random permutations of the locations of the ChIP-seq peaks. Use these permutations to get a rough idea of whether the overlap that you observed between the peaks and genes and/or repetitive elements is more than you would expect by chance.

3. If you finish early and are still excited for more, figure out how to run one of the Picard utilities on one of the SAM/BAM files you used previously.


Solutions


1. To get your files ready to input into Tablet, you should have converted the SAM file to BAM, sorted the BAM file, and then indexed the sorted BAM file:

$ samtools view -bS /global/courses/spr2012/SAMtoolsworkshop/S288C_mutant_chromV.sam > S288C_mutant_chromV.bam
$ samtools sort S288C_mutant_chromV.bam S288C_mutant_chromV.sorted
$ samtools index S288C_mutant_chromV.sorted.bam

You will also need to index the reference FASTA file:

$ cp /global/courses/spr2012/SAMtoolsworkshop/S288C_chromosome_V.fa .
$ samtools faidx S288C_chromosome_V.fa

Now you will need to move the BAM file and the reference FASTA file to your local machine, along with their respective indexes.

Open a new terminal window so that you are working from your own machine, rather than on the CGRL server

$ scp cellison@poset.cgrl.berkeley.edu:S288C_mutant_chromV.sorted.bam Documents/.
$ scp cellison@poset.cgrl.berkeley.edu:S288C_mutant_chromV.sorted.bam.bai Documents/.
$ scp cellison@poset.cgrl.berkeley.edu:S288C_chromosome_V.fasta Documents/.
$ scp cellison@poset.cgrl.berkeley.edu:S288C_chromosome_V.fa Documents/.
$ scp cellison@poset.cgrl.berkeley.edu:S288C_chromosome_V.fa.fai Documents/.

Open Tablet and load the BAM and reference FASTA files as described in the wiki above. You won't see any reads immediately but that is because you only sequenced PCR product from a 20kb region of this chromosome. The easiest way to figure out which region of chromosome V you should scroll to is to look at the first few lines within the SAM file to see which coordinates they are mapping to.

You can control how easy it is to see the variants by sliding the 'Variant' scrollbar or by clicking on 'Read Colors' and selecting 'Variants'.

The mutation should be G->A and should be at position 505393 of chromosome V.

2. We have to start this exercise the same way as the last one: convert to BAM, sort BAM, index BAM

$ cp /global/courses/spr2012/SAMtoolsworkshop/S288C_evolved.sam .
$ samtools view -bS S288C_evolved.sam > S288C_evolved.bam
$ samtools sort S288C_evolved.bam S288C_evolved.sorted
$ samtools index S288C_evolved.sorted.bam

The samtools utility called idxstats will tell you the length of each sequence that is present in the reference FASTA file.

$ samtools idxstats S288C_evolved.sorted.bam

chromosome_I 230218 51928 0
* 0 0 1458

Although the BEDtools utility coverageBed will also work for this exercise, genomeCoverageBed is a bit more straightforward for what we want to do in this case. genomeCoverageBed requires a genome file which is simply a tab-delimited text file that contains the ID of each sequence in the reference FASTA followed by its length.

You can just use your favorite text editor to create the genome file. In this case I named mine genome.bed:
$ cat genome.bed
chromosome_I 230218

You have to include the -d argument when you run genomeCoverageBed to get it to print out the read depth per basepair. In this case, I capture the output in a file named S288C_evolved.cov.

$ genomeCoverageBed -ibam S288C_evolved.sorted.bam -d -g genome.bed > S288C_evolved.cov


To use the R script that is on the CGRL server to plot the coverage data, we have to first rename the file R.in:
$ cp S288C_evolved.cov R.in
$ R --no-save < /global/courses/spr2012/SAMtoolsworkshop/plotCoverage.R

This will create a PDF file named genomeCoverageBed.Rplot.pdf. Move this file back onto your laptop to look at it:
$ scp cellison@poset.cgrl.berkeley.edu:genomeCoverageBed.Rplot.pdf Documents/.



It looks like there was one large deletion and one large duplication!

3. We can use intersectBed to print out all the ChIP-seq peaks that overlap at least one gene in the GFF file. The -u argument makes it so that intersectBed prints out each feature in set A only once even if that feature overlaps more than one feature in set B. We want to know the total number of peaks that overlap one or more features so we simply count the number of lines output by intersectBed:

$ intersectBed -u -a 620_MAT_H3K36me3_E0-12h.bed_dcc.gff -b dmel-genes-r5.6.gff | wc -l
6074

We also count the total number of peaks:
$ wc -l 620_MAT_H3K36me3_E0-12h.bed_dcc.gff
6846 620_MAT_H3K36me3_E0-12h.bed_dcc.gff

Same thing for the H3K9 mark:
$ intersectBed -u -a 621_MAT_H3K9me3_E0-12h.bed_dcc.gff -b dmel-genes-r5.6.gff | wc -l
1898
$ wc -l 621_MAT_H3K9me3_E0-12h.bed_dcc.gff
3543 621_MAT_H3K9me3_E0-12h.bed_dcc.gff

Now we repeat everything to find the overlap between the ChIP-seq peaks and repetitive elements, but first we merge repetitive elements that are within 100bp from each other in the genome:

$ mergeBed -i dmel_r5.6_repeatmasker.gff -d 100 > dmel_r5.6_repeatmasker.merged.bed
$ intersectBed -u -a 621_MAT_H3K9me3_E0-12h.bed_dcc.gff -b dmel_r5.6_repeatmasker.merged.bed | wc -l
1303
$ intersectBed -u -a 620_MAT_H3K36me3_E0-12h.bed_dcc.gff -b dmel_r5.6_repeatmasker.merged.bed | wc -l
512

Fraction of peaks overlapping genes: H3K9 1898/3543 (54%) H3K36 6074/6846 (89%)
Fraction of peaks overlapping TEs : H3K9 1303/3543 (37%) H3K36 512/6846 (7%)

The H3K36 mark makes perfect sense: it's a mark associated with active transcription and the vast majority of peaks overlap genes while only a small percentage overlap repetitive elements.
The H3K9 mark makes less: it's a mark associated with silenced regions of the genome but more peaks overlap genes than repetitive elements. If this was my experiment I'd say the H3K36 data seems good but the H3K9 data is questionable.