BEDTools / SAMTools for NGS Data Analysis

Chris Ellison


March 4, 2013

Here are the files from Excercise #2 if you would like to try this on your own laptop (but you will first need to install samtools and BEDtools)


(this is the actual sequence of chromosome I, which isn't necessary for excercise #2, but useful if you want to play around with other tools such as Tablet)

Introduction

  • Learning to program is essential for performing sophisticated/customized analyses but one can accomplish an impressive amount of computational tasks by simply taking advantage of pre-existing software, without ever having to write your own script.
  • The types of analyses that are possible are increasing rapidly as the field matures and the community adopts standardized file formats.
  • One of the most common types of file formats related to next-gen sequence is called SAM and is used to store information about how and where short reads align to a reference sequence. Often, in conjunction with SAM, you will see BAM files mentioned. BAM just stands for Binary SAM. They store the same exact information except that BAM files are compressed so that they take up less disk space. Two other file formats that are commonly used in genomics that we will also talk a bit about are GFF and BED, which are ways of storing the coordinates of genome features.
  • For basically all types of analyses related to next-gen sequence, you will at some point end up having to align your reads to a reference sequence. Back in the early days, every piece of software for aligning short reads output the alignment information in a slightly different format which meant that you either had to write your own scripts to convert alignments between formats, or hope to find a script on the internet. These days, the vast majority of short read aligners output alignments in SAM format. This has made it easier to develop suites of utilities for downstream processing of short read alignments. The two that I'm going to cover in detail today are SAMtools and BEDtools
  • The goal of this workshop is introduce you to these two suites of utilities and some of the fairly sophisticated computation that they will allow you to perform, without ever having to write your own script.


Links of Interest


A note about the format of this wiki

Examples of input and output to and from the command line are included below. Commands that you will type into the command line are shown in Bold Lucida font and are preceded by a dollar sign. You should not actually type the dollar sign into the command line. Output is shown in Italic Lucida. Example:

$ input command
output from command


If you are working on one of the CGRL servers

You will need to type in the following commands to make the software we will use today available to you:

$ module load samtools/0.1.18
$ module load bedtools/2.17.0
$ module load R


If you logout and log back in, you will need to type these commands in again. You most likely will not need to do this if you have these utilities installed on your own machine or are working on a different server.


SAMtools


We're going to start with SAMtools, which by itself actually doesn't allow you to do anything really interesting, but you will still end up using it all the time to convert between SAM and BAM formats. Also, many downstream tools, such as BEDtools and many SNP callers, require that BAM files be indexed and sorted, which is easily accomplished with SAMtools.

The SAM format, while not exactly easy on the eyes, encodes almost every piece of information you would ever want to know about how and where a read aligns to a reference sequence.

Let's take a look:

$ less example.sam
@HD VN:1.0 SO:unsorted
@SQ SN:chromosome_I LN:230218
@PG ID:Bowtie VN:0.12.7 CL:"bowtie -v 1 -m1 -S chromosome_I example.fastq example.sam"
Frag_1 0 chromosome_I 128138 255 33M * 0 0 ATACAAATATATGATGAATCATTAAAGAGGAGG AIGH?IFIIIF+GGGHIHII+=FD?FIFH;IHF XA:i:0 MD:Z:101 NM:i:0
Frag_2 16 chromosome_I 54594 255 33M * 0 0 AATCTGAAGTGTCTATAAGTACTGTATTCTTAG ICCGIHIFHIIIHI=EIIFFG@?IICEEDHI>H XA:i:0 MD:Z:101 NM:i:0
Frag_3 0 chromosome_I 181175 255 33M * 0 0 CAAGAAATATCTTGACCGCAGTGAACTGTGGGA HCIIIHIH?HIG:IIHGDHFBHIIIGI2HIII@ XA:i:0 MD:Z:101 NM:i:0
Frag_4 16 chromosome_I 63331 255 33M * 0 0 GTTTCAACAAGGTGCGCCCCTCTGTATACTTTT HEIIGIBIBGH=2GFIIIHGIHIIH7HGI?HHI XA:i:0 MD:Z:101 NM:i:0
Frag_5 0 chromosome_I 141473 255 33M * 0 0 TGATTACTTTTCTTTTGATGTGCTTATCTTACA IHI6IIIIIHEEIFIIHEIFIHIIHICHIII%C XA:i:0 MD:Z:101 NM:i:0
...


Every SAM files starts with a number of lines that begin with the '@' symbol. The amount of information included in this section varies depending on the short read aligner that was used but all will include at least the '@SQ' lines which list the IDs and lengths of the all sequences present within the reference file that the short reads were aligned to. In this example, two other lines are
included. The '@HD' line lists the SAM format version that was used and whether or not the file has been sorted. The @PG line lists the aligner name and version that was used to align the reads as well as the actual command line.

Below these lines, the actual read alignment information begins.

Column:
  1. Read ID
  2. Bitwise flag (encodes additional alignment information such as whether mate of current was mapped)
  3. ID of reference sequence that read aligned to
  4. Coordinate on reference sequence of left-most position of first matching base
  5. Mapping quality
  6. CIGAR string (encodes how each base of read aligned, e.g. match, mismatch, insertion, deletion, etc.)
  7. Reference ID where mate aligned (if paired-end)
  8. Reference position where mate aligned (if paired-end)
  9. Fragment length (if paired-end)
  10. Read sequence
  11. Read quality scores
  12. Alignment section (see SAM format specification and aligner manual for more details)

Let's go over some of the basic things you will want to do with SAMtools. First let's see what tools are available within this suite:

$ samtools

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.16 (r963:234)

Usage: samtools <command> [options]

Command: view SAM<->BAM conversion
sort sort alignment file
pileup generate pileup output
mpileup multi-way pileup
depth compute the depth
faidx index/extract FASTA
index index alignment
idxstats BAM index stats (r595 or later)
fixmate fix mate information
glfview print GLFv3 file
flagstat simple stats
calmd recalculate MD/NM tags and '=' bases
merge merge sorted alignments
rmdup remove PCR duplicates
reheader replace BAM header
cat concatenate BAMs
targetcut cut fosmid regions (for fosmid pool only)
phase phase heterozygotes

The most common thing you are likely to be using SAMtools for is converting between SAM and BAM files. Let's figure out how to do that:

$ samtools view

Usage: samtools view [options] <in.bam>|<in.sam> [region1 [...]]

Options:
-b output BAM
-h print header for the SAM output
-H print header only (no alignments)
-S input is SAM
-u uncompressed BAM output (force -b)
-1 fast compression (force -b)
-x output FLAG in HEX (samtools-C specific)
-X output FLAG in string (samtools-C specific)
-c print only the count of matching records
-L FILE output alignments overlapping the input BED FILE [null]
-t FILE list of reference names and lengths (force -S) [null]
-T FILE reference sequence file (force -S) [null]
-o FILE output file name [stdout]
-R FILE list of read groups to be outputted [null]
-f INT required flag, 0 for unset [0]
-F INT filtering flag, 0 for unset [0]
-q INT minimum mapping quality [0]
-l STR only output reads in library STR [null]
-r STR only output reads in read group STR [null]
-? longer help

If we want to go from SAM to BAM, it looks like we have to specify that the input file is in SAM format (-S) and that we want samtools to output a BAM file (-b) and we have to capture the output in a new file:

$ samtools view -bS example.sam > example.bam
[samopen] SAM header is present: 1 sequences.

The opposite conversion doesn't require any arguments:

$ samtools view example.bam > example.sam

The other two most commonly used samtools utilties are sort and index.

$ samtools sort
Usage: samtools sort [-on] [-m <maxMem>] <in.bam> <out.prefix>

It looks like we need specify the BAM file to sort as well as the prefix of the output file:

$ samtools sort example.bam example.sorted
This should create a new sorted file called example.sorted.bam.

Often, after you have sorted a BAM file you will also need to index it:

$ samtools index
Usage: samtools index <in.bam> [out.index]

This time, the name of the output file is optional. If you don't include anything, if will create an index with the same name as the BAM file ending with ".bai"

$ samtools index example.sorted.bam

This should create a file called example.sorted.bam.bai

If you want to visualize short read alignments, you'll also need to index the reference fasta file:

$ samtools faidx reference.fasta

Now that we have all the required files, we can explore a couple of ways of visualizing short read alignments. SAMtools comes with its own utility for visualizing read alignments called tview. You don't have to specify the BAM and reference indexes, but they need to be in the same directory as the BAM and reference FASTA files.

$ samtools tview example.sorted.bam reference.fasta
Screen_Shot_2012-04-22_at_7.12.39_PM.png

If you want to see the menu that lists the various options for displaying the alignments and moving around type '?'. If you want to jump to a specific location along the chromosome, type 'g' followed by the location in the format referenceID:coordinate e.g. chromosome_I:150000.

SAMtools tview is fast and runs directly in the terminal window which makes it nice if you want to quickly look at read alignments at a specific location in the reference sequence. For browsing read alignments, I prefer the standalone program Tablet which is more feature-rich and easier to navigate.

Moving files off of the server onto your laptop

You are going to run Tablet locally on your own laptop which means you need to move the appropriate files from the server onto your laptop. The easiest way to do this if you have a Mac is with the program Fugu otherwise you can use the command 'scp'. First open a new terminal window then type:

$ scp username@poset.cgrl.berkeley.edu:example.bam .
Password:
example.bam 100% 5448KB 454.0KB/s 00:12

Before transferring the file it asks you for a password. Type in the password for your account on the remote server. Also, don't forget to include the period at the end of the command and to replace 'username' in the above example with your actual username. Repeat this command for all the files you need:

  • sorted BAM file (e.g. example.sorted.bam)
  • index for sorted BAM file (e.g. example.sorted.bam.bai)
  • reference FASTA file (e.g. chromosome_I.fasta)
  • reference FASTA index (e.g. chromosome_I.fasta.fai)

Tablet


After downloading and installing Tablet, start the program and click on 'Import an assembly into Tablet'. The only tricky thing about using Tablet is figuring out which files need to be imported into which box. You should click on the 'Browse' button and locate the appropriate files so that the sorted BAM file goes into the box labeled 'Primary assembly file' and the reference FASTA file goes into the Reference/consensus box.

BEDtools


BEDtools is a great suite of tools that can be used for comparing large sets of genome features. A genome feature is basically anything that can represented by a start and end coordinate within the genome sequence space. Examples include genes, transposable elements, short read alignments, etc.

Standard file formats for representing genome features

GFF

- 1-based (coordinate system starts at 1)
- start and end positions are inclusive
- always has 9 columns
- brief description of columns can be found here

BED

- 0-based (coordinate system starts at 0)
- start is inclusive but end is not
- basic version has 3 columns with 9 additional optional columns
- description of format can be found here

BEDtools will accept GFF, BED, and BAM formats as input but only outputs BED files. It is aware of the different coordinate representations in the different file formats (e.g. GFF vs BED) so you don't have to worry about that.

Most tools within the BEDtools suite perform comparisons between two different sets of genome features. The filename that contains the first set (Set A) is given after the -a flag while the second (Set B) is given after the -b flag. If one of the feature sets is a BAM file, it should be given as Set A (Set B is read into memory and since BAM files can have millions of lines, avoiding using it as Set B will save memory) and the -a flag has to be changed to -abam. All genome feature files need to be tab-delimited. A quick example for two BED files:

Set_A:
chromosome_1 10 200
chromosome_2 500 1000
chromosome_3 2000 3000

Set_B:
chromosome_1 190 250
chromosome_2 400 600
chromosome_3 3001 4000

The most basic BEDtools utility is called intersect and returns the coordinates of the portion of each feature in Set_A that overlaps with a feature in Set_B

$ bedtools intersect -a Set_A -b Set_B
chromosome_1 190 200
chromosome_2 500 600

Typing the name of the utility without any arguments will return a list of all the possible arguments you can use to control what the utility reports. For example, adding the -wo argument tells intersect to report the full coordinates of both features that overlap plus the number of basepairs that lie within the overlap:

$ bedtools intersect -a Set_A -b Set_B -wo
chromosome_1 10 200 chromosome_1 190 250 10
chromosome_2 500 1000 chromosome_2 400 600 100

How could we get back the coordinates of the features in Set_A that do NOT overlap any of the features in Set_B?

The BEDtools genome file

Some of the BEDtools utilities require a genome file which is simply a tab-delimited list of each sequence ID found in the reference FASTA file, followed by the length of the sequence.

You will have a chance to explore more BEDtools utilities in the excercises.


Exercises


1. You will recognize the premise of the first exercise if you took the Intro to Unix workshop:

Another lab in the department has been studying a mutant in S. cerevisiae S288C and have successfully mapped the single point mutation to a 20kb region of the genome. They PCR amplified this region in overlapping segments, pooled the segments and sequenced them on an Illumina machine, 101 basepair single-end reads.

In the Unix workshop, the bonus exercise involved an approach to find the mutation using only built-in Unix utilities like grep, the results of which suggest the mutation is located at position 505394 of chromosome_V. Now you just want to use tview to visually confirm that the location is in fact correct.

You can find the read alignments here:
/global/courses/spr2012/SAMtoolsworkshop/S288C_mutant_chromV.sam
Reference file:
/global/courses/spr2012/SAMtoolsworkshop/S288C_chromosome_V.fa

You will have to use samtools to generate the appropriate index files required by tview.


2. The premise of this exercise is that you evolved a strain of yeast under high salt conditions and generated iIlumina sequence from the genome of the evolved strain. You've aligned the reads back to the reference genome and would like to determine if any deletions or duplications of genomic regions occurred in the evolved strain by examining read coverage along the chromosomes.

You can find the read alignments here (only chromosome #1 is included to reduce the file size):
/global/courses/spr2012/SAMtoolsworkshop/S288C_evolved.sam

2A. Read the BEDtools manual to find a utility that you can use to calculate read coverage along the genome. Run this utility to calculate per-base read depth coverage along chromosome #1. Hint: you will need to know the length of chromosome #1 to make the BEDtools genome file. Luckily there is a samtools utility that will do this. Look at the samtools website to figure out which one.

2B. Create a plot of read coverage along chrom #1 using the output from 2A. You can either move the file to your laptop and use your graphing method of choice, do it yourself in R, or use this R script (rename the coverage file R.in first):

$ R --no-save < /global/courses/spr2012/SAMtoolsworkshop/plotCoverage.R

3. The premise of this exercise is that you have ChIP-Seq data from two different histone modifications in Drosophila, a "silencing" mark associated with constitutive heterochromatin (H3K9me3) and an "active" mark associated with transcribed genes (H3K36me3). You have already identified the genomic locations of peaks of binding for each mark and want to quickly determine if the peak locations make biological sense by determining, for each mark, the fraction of peaks that overlap genes versus repetitive elements.

3A. You identified repetitive elements in the D. melanogaster genome using RepeatMasker, however this program isn't very good at identifying full-length elements and will often report multiple segments of a given element that are directly adjacent or within a few hundred basepairs of each other in the genome as separate elements. Find a BEDtools utility that you can use to merge features that are within a specified distance of each other into a single feature and save the output as a new BED file.

The un-merged output of RepeatMasker is here:
/global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_repeatmasker.sorted.gff

3B. Use the merged RepeatMasker output, along with BEDtools, to determine, for each mark, the fraction of peaks that overlap genes versus repetitive elements.

The other files you will need are here:
Dmel gene coordinates: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_genes.gff
H3K9me3 peak locations: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K9me3_E0-12h.gff
H3K36me3 peak locations: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K36me3_E0-12h.gff

Bonus Exercises

1. GFF files for gene annotations give you the coordinates of the entire gene as well as the coordinates of the exons of each gene. Given such a file, can you use BEDtools to find the coordinates of the introns?
Use this file: /global/courses/spr2012/SAMtoolsworkshop/fly.gff

2. Download and install the read alignment visualization tool Tablet (link is at the top of this page). Use it to look at the alignments from exercise #1.

3. If you finish early and are still excited for more, figure out how to run one of the Picard utilities on one of the SAM/BAM files you used previously.


Solutions


1. To get your files ready to input into tview, you should have converted the SAM file to BAM, sorted the BAM file, and then indexed the sorted BAM file:

$ samtools view -bS /global/courses/spr2012/SAMtoolsworkshop/S288C_mutant_chromV.sam > S288C_mutant_chromV.bam
$ samtools sort S288C_mutant_chromV.bam S288C_mutant_chromV.sorted
$ samtools index S288C_mutant_chromV.sorted.bam

You will also need to index the reference FASTA file:

$ cp /global/courses/spr2012/SAMtoolsworkshop/S288C_chromosome_V.fa .
$ samtools faidx S288C_chromosome_V.fa

Now you can run the tview command:
$ samtools tview S288C_mutant_chromV.sorted.bam S288C_chromosome_V.fa

Remember, typing the question mark will let you see a menu of options which you can use to figure out that typing 'g' lets you move to a specific location on the chromosome.
Enter 'chromosome_V:505394' into the box after pressing 'g' and you will jump to the location of the mutation. Pressing the period once will allow you to see nucleotides for the reads instead of dots and pressing 'n' colors the nucleotides.

The mutation should be G->A and should be at position 505394 of chromosome V.

2. We have to start this exercise the same way as the last one: convert to BAM, sort BAM, index BAM

$ cp /global/courses/spr2012/SAMtoolsworkshop/S288C_evolved.sam .
$ samtools view -bS S288C_evolved.sam > S288C_evolved.bam
$ samtools sort S288C_evolved.bam S288C_evolved.sorted
$ samtools index S288C_evolved.sorted.bam

The samtools utility called idxstats will tell you the length of each sequence that is present in the reference FASTA file.

$ samtools idxstats S288C_evolved.sorted.bam

chromosome_I 230218 51928 0
* 0 0 1458

Although the 'bedtools coverage' will also work for this exercise, 'bedtools genomecov' is a bit more straightforward for what we want to do in this case. bedtools genomecov requires a genome file which is simply a tab-delimited text file that contains the ID of each sequence in the reference FASTA followed by its length.

You can just use your favorite text editor to create the genome file. In this case I named mine genome.bed:
$ cat genome.bed
chromosome_I 230218

You have to include the -d argument when you run genomeCoverageBed to get it to print out the read depth per basepair. In this case, I capture the output in a file named S288C_evolved.cov.

$ bedtools genomecov -ibam S288C_evolved.sorted.bam -d -g genome.bed > S288C_evolved.cov


To use the R script that is on the CGRL server to plot the coverage data, we have to first rename the file R.in:
$ cp S288C_evolved.cov R.in
$ R --no-save < /global/courses/spr2012/SAMtoolsworkshop/plotCoverage.R

This will create a PDF file named genomeCoverageBed.Rplot.pdf. Move this file back onto your laptop to look at it:
$ scp cellison@poset.cgrl.berkeley.edu:genomeCoverageBed.Rplot.pdf Documents/.



It looks like there was one large deletion and one large duplication!

3. First we should copy all the files we need to our home directory:
$ cp /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_genes.gff .
$ cp /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K36me3_E0-12h.gff .
$ cp /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K9me3_E0-12h.gff .
$ cp /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_repeatmasker.sorted.gff .

We can use 'bedtools intersect' to print out all the ChIP-seq peaks that overlap at least one gene in the GFF file. The -u argument makes it so that bedtools prints out each feature in set A only once even if that feature overlaps more than one feature in set B. We want to know the total number of peaks that overlap one or more features so we simply count the number of lines output by the intersect command:

$ bedtools intersect -u -a dmel_r5.6_H3K36me3_E0-12h.gff -b dmel_r5.6_genes.gff | wc -l
6074

We also count the total number of peaks:
$ wc -l dmel_r5.6_H3K36me3_E0-12h.gff
6846 dmel_r5.6_H3K36me3_E0-12h.gff

Same thing for the H3K9 mark:
$ bedtools intersect -u -a dmel_r5.6_H3K9me3_E0-12h.gff -b dmel_r5.6_genes.gff | wc -l
1898
$ wc -l dmel_r5.6_H3K9me3_E0-12h.gff
3543 dmel_r5.6_H3K9me3_E0-12h.gff

Now we repeat everything to find the overlap between the ChIP-seq peaks and repetitive elements, but first we merge repetitive elements that are within 100bp from each other in the genome:

$ bedtools merge -i dmel_r5.6_repeatmasker.sorted.gff -d 100 > dmel_r5.6_repeatmasker.merged.bed
$ bedtools intersect -u -a dmel_r5.6_H3K9me3_E0-12h.gff -b dmel_r5.6_repeatmasker.merged.bed | wc -l
1303
$ intersectBed -u -a dmel_r5.6_H3K36me3_E0-12h.gff -b dmel_r5.6_repeatmasker.merged.bed | wc -l
512

Fraction of peaks overlapping genes: H3K9 1898/3543 (54%) H3K36 6074/6846 (89%)
Fraction of peaks overlapping TEs : H3K9 1303/3543 (37%) H3K36 512/6846 (7%)

The H3K36 mark makes perfect sense: it's a mark associated with active transcription and the vast majority of peaks overlap genes while only a small percentage overlap repetitive elements.
The H3K9 mark makes less: it's a mark associated with silenced regions of the genome but more peaks overlap genes than repetitive elements. If this was my experiment I'd say the H3K36 data seems good but the H3K9 data is questionable.