SAMToolsSpring2013

=BEDTools / SAMTools for NGS Data Analysis=

Chris Ellison
March 4, 2013

**Here are the files from Excercise #2 if you would like to try this on your own laptop (but you will first need to install samtools and BEDtools)**
(this is the actual sequence of chromosome I, which isn't necessary for excercise #2, but useful if you want to play around with other tools such as Tablet)

Introduction

 * Learning to program is essential for performing sophisticated/customized analyses but one can accomplish an impressive amount of computational tasks by simply taking advantage of pre-existing software, without ever having to write your own script.
 * The types of analyses that are possible are increasing rapidly as the field matures and the community adopts standardized file formats.
 * One of the most common types of file formats related to next-gen sequence is called SAM and is used to store information about how and where short reads align to a reference sequence. Often, in conjunction with SAM, you will see BAM files mentioned. BAM just stands for Binary SAM. They store the same exact information except that BAM files are compressed so that they take up less disk space. Two other file formats that are commonly used in genomics that we will also talk a bit about are GFF and BED, which are ways of storing the coordinates of genome features.
 * For basically all types of analyses related to next-gen sequence, you will at some point end up having to align your reads to a reference sequence. Back in the early days, every piece of software for aligning short reads output the alignment information in a slightly different format which meant that you either had to write your own scripts to convert alignments between formats, or hope to find a script on the internet. These days, the vast majority of short read aligners output alignments in SAM format. This has made it easier to develop suites of utilities for downstream processing of short read alignments. The two that I'm going to cover in detail today are SAMtools and BEDtools
 * The goal of this workshop is introduce you to these two suites of utilities and some of the fairly sophisticated computation that they will allow you to perform, without ever having to write your own script.

Links of Interest

 * SAMtools homepage
 * SAMtools manual
 * SAM format specification
 * Fugu download (software for transferring files)
 * Tablet homepage (software for visualizing BAM alignments)
 * BEDtools homepage
 * BEDtools (old) manual
 * BEDtools (new) online manual (in-progress)
 * BED file format
 * GFF file format
 * Picard homepage (another suite of utilities for SAM/BAM files)

A note about the format of this wiki
Examples of input and output to and from the command line are included below. Commands that you will type into the command line are shown in **Bold Lucida** font and are preceded by a dollar sign. You should not actually type the dollar sign into the command line. Output is shown in //Italic Lucida.// Example:

**$ input command** //output from command//

If you are working on one of the CGRL servers
You will need to type in the following commands to make the software we will use today available to you:

**$ module load samtools/0.1.18** **$ module load bedtools/2.17.0** **$ module load R**

If you logout and log back in, you will need to type these commands in again. You most likely will not need to do this if you have these utilities installed on your own machine or are working on a different server.

SAMtools
We're going to start with SAMtools, which by itself actually doesn't allow you to do anything really interesting, but you will still end up using it all the time to convert between SAM and BAM formats. Also, many downstream tools, such as BEDtools and many SNP callers, require that BAM files be indexed and sorted, which is easily accomplished with SAMtools.

The SAM format, while not exactly easy on the eyes, encodes almost every piece of information you would ever want to know about how and where a read aligns to a reference sequence.

Let's take a look:

** $ less example.sam ** @HD VN:1.0 SO:unsorted @SQ SN:chromosome_I LN:230218 @PG ID:Bowtie VN:0.12.7 CL:"bowtie -v 1 -m1 -S chromosome_I example.fastq example.sam" Frag_1 0 chromosome_I 128138 255 33M * 0 0 ATACAAATATATGATGAATCATTAAAGAGGAGG AIGH?IFIIIF+GGGHIHII+=FD?FIFH;IHF XA:i:0 MD:Z:101 NM:i:0 Frag_2 16 chromosome_I 54594 255 33M * 0 0 AATCTGAAGTGTCTATAAGTACTGTATTCTTAG ICCGIHIFHIIIHI=EIIFFG@?IICEEDHI>H XA:i:0 MD:Z:101 NM:i:0 Frag_3 0 chromosome_I 181175 255 33M * 0 0 CAAGAAATATCTTGACCGCAGTGAACTGTGGGA HCIIIHIH?HIG:IIHGDHFBHIIIGI2HIII@ XA:i:0 MD:Z:101 NM:i:0 Frag_4 16 chromosome_I 63331 255 33M * 0 0 GTTTCAACAAGGTGCGCCCCTCTGTATACTTTT HEIIGIBIBGH=2GFIIIHGIHIIH7HGI?HHI XA:i:0 MD:Z:101 NM:i:0 <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">Frag_5 0 chromosome_I 141473 255 33M * 0 0 TGATTACTTTTCTTTTGATGTGCTTATCTTACA IHI6IIIIIHEEIFIIHEIFIHIIHICHIII%C XA:i:0 MD:Z:101 NM:i:0 <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">...

Every SAM files starts with a number of lines that begin with the '@' symbol. The amount of information included in this section varies depending on the short read aligner that was used but all will include at least the '@SQ' lines which list the IDs and lengths of the all sequences present within the reference file that the short reads were aligned to. In this example, two other lines are included. The '@HD' line lists the SAM format version that was used and whether or not the file has been sorted. The @PG line lists the aligner name and version that was used to align the reads as well as the actual command line.

Below these lines, the actual read alignment information begins.

Column:
 * 1) Read ID
 * 2) Bitwise flag (encodes additional alignment information such as whether mate of current was mapped)
 * 3) ID of reference sequence that read aligned to
 * 4) Coordinate on reference sequence of left-most position of first matching base
 * 5) Mapping quality
 * 6) CIGAR string (encodes how each base of read aligned, e.g. match, mismatch, insertion, deletion, etc.)
 * 7) Reference ID where mate aligned (if paired-end)
 * 8) Reference position where mate aligned (if paired-end)
 * 9) Fragment length (if paired-end)
 * 10) Read sequence
 * 11) Read quality scores
 * 12) Alignment section (see SAM format specification and aligner manual for more details)

Let's go over some of the basic things you will want to do with SAMtools. First let's see what tools are available within this suite:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools **

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Program: samtools (Tools for alignments in the SAM format) // //<span style="font-family: 'Lucida Console',Monaco,monospace;">Version: 0.1.16 (r963:234) //

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Usage: samtools [options] //

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Command: view SAM<->BAM conversion // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> sort sort alignment file // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> pileup generate pileup output // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> mpileup multi-way pileup // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> depth compute the depth // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> faidx index/extract FASTA // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> index index alignment // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> idxstats BAM index stats (r595 or later) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> fixmate fix mate information // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> glfview print GLFv3 file // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> flagstat simple stats // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> calmd recalculate MD/NM tags and '=' bases // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> merge merge sorted alignments // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> rmdup remove PCR duplicates // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> reheader replace BAM header // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> cat concatenate BAMs // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> targetcut cut fosmid regions (for fosmid pool only) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> phase phase heterozygotes //

The most common thing you are likely to be using SAMtools for is converting between SAM and BAM files. Let's figure out how to do that:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools view **

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Usage: samtools view [options] <in.bam>|<in.sam> [region1 [...]] //

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Options: // //<span style="font-family: 'Lucida Console',Monaco,monospace;">-b output BAM // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -h print header for the SAM output // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -H print header only (no alignments) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -S input is SAM // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -u uncompressed BAM output (force -b) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -1 fast compression (force -b) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -x output FLAG in HEX (samtools-C specific) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -X output FLAG in string (samtools-C specific) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -c print only the count of matching records // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -L FILE output alignments overlapping the input BED FILE [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -t FILE list of reference names and lengths (force -S) [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -T FILE reference sequence file (force -S) [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -o FILE output file name [stdout] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -R FILE list of read groups to be outputted [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -f INT required flag, 0 for unset [0] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -F INT filtering flag, 0 for unset [0] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -q INT minimum mapping quality [0] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -l STR only output reads in library STR [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -r STR only output reads in read group STR [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -? longer help //

If we want to go from SAM to BAM, it looks like we have to specify that the input file is in SAM format (-S) and that we want samtools to output a BAM file (-b) and we have to capture the output in a new file:

//<span style="font-family: 'Lucida Console',Monaco,monospace;">[samopen] SAM header is present: 1 sequences. //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools view -bS example.sam > example.bam **

The opposite conversion doesn't require any arguments:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools view example.bam > example.sam **

The other two most commonly used samtools utilties are sort and index.

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Usage: samtools sort [-on] [-m <maxMem>] <in.bam> <out.prefix> //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools sort **

It looks like we need specify the BAM file to sort as well as the prefix of the output file:

This should create a new sorted file called example.sorted.bam.
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools sort example.bam example.sorted **

Often, after you have sorted a BAM file you will also need to index it:

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Usage: samtools index <in.bam> [out.index] //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools index **

This time, the name of the output file is optional. If you don't include anything, if will create an index with the same name as the BAM file ending with ".bai"


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools index example.sorted.bam **

This should create a file called example.sorted.bam.bai

If you want to visualize short read alignments, you'll also need to index the reference fasta file:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools faidx reference.fasta **

Now that we have all the required files, we can explore a couple of ways of visualizing short read alignments. SAMtools comes with its own utility for visualizing read alignments called tview. You don't have to specify the BAM and reference indexes, but they need to be in the same directory as the BAM and reference FASTA files.


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools tview example.sorted.bam reference.fasta **

If you want to see the menu that lists the various options for displaying the alignments and moving around type '?'. If you want to jump to a specific location along the chromosome, type 'g' followed by the location in the format referenceID:coordinate e.g. chromosome_I:150000.

SAMtools tview is fast and runs directly in the terminal window which makes it nice if you want to quickly look at read alignments at a specific location in the reference sequence. For browsing read alignments, I prefer the standalone program Tablet which is more feature-rich and easier to navigate.

Moving files off of the server onto your laptop
You are going to run Tablet locally on your own laptop which means you need to move the appropriate files from the server onto your laptop. The easiest way to do this if you have a Mac is with the program Fugu otherwise you can use the command 'scp'. First open a new terminal window then type:

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Password: // //<span style="font-family: 'Lucida Console',Monaco,monospace;">example.bam 100% 5448KB 454.0KB/s 00:12 //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ scp <span class="wiki_link_ext">username@poset.cgrl.berkeley.edu:example.bam . **

Before transferring the file it asks you for a password. Type in the password for your account on the remote server. Also, don't forget to include the period at the end of the command and to replace 'username' in the above example with your actual username. Repeat this command for all the files you need:


 * sorted BAM file (e.g. example.sorted.bam)
 * index for sorted BAM file (e.g. example.sorted.bam.bai)
 * reference FASTA file (e.g. chromosome_I.fasta)
 * reference FASTA index (e.g. chromosome_I.fasta.fai)

Tablet
After downloading and installing Tablet, start the program and click on 'Import an assembly into Tablet'. The only tricky thing about using Tablet is figuring out which files need to be imported into which box. You should click on the 'Browse' button and locate the appropriate files so that the sorted BAM file goes into the box labeled 'Primary assembly file' and the reference FASTA file goes into the Reference/consensus box.

BEDtools
BEDtools is a great suite of tools that can be used for comparing large sets of genome features. A genome feature is basically anything that can represented by a start and end coordinate within the genome sequence space. Examples include genes, transposable elements, short read alignments, etc.

GFF
- 1-based (coordinate system starts at 1) - start and end positions are inclusive - always has 9 columns - brief description of columns can be found here

BED
- 0-based (coordinate system starts at 0) - start is inclusive but end is not - basic version has 3 columns with 9 additional optional columns - description of format can be found here

BEDtools will accept GFF, BED, and BAM formats as input but only outputs BED files. It is aware of the different coordinate representations in the different file formats (e.g. GFF vs BED) so you don't have to worry about that.

Most tools within the BEDtools suite perform comparisons between two different sets of genome features. The filename that contains the first set (Set A) is given after the -a flag while the second (Set B) is given after the -b flag. If one of the feature sets is a BAM file, it should be given as Set A (Set B is read into memory and since BAM files can have millions of lines, avoiding using it as Set B will save memory) and the -a flag has to be changed to -abam. All genome feature files need to be tab-delimited. A quick example for two BED files:

Set_A: chromosome_1 10 200 chromosome_2 500 1000 chromosome_3 2000 3000

Set_B: chromosome_1 190 250 chromosome_2 400 600 chromosome_3 3001 4000

The most basic BEDtools utility is called intersect and returns the coordinates of the portion of each feature in Set_A that overlaps with a feature in Set_B

//<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_1 190 200 // //<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_2 500 600 //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ bedtools intersect -a Set_A -b Set_B **

Typing the name of the utility without any arguments will return a list of all the possible arguments you can use to control what the utility reports. For example, adding the -wo argument tells intersect to report the full coordinates of both features that overlap plus the number of basepairs that lie within the overlap:

//<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_1 10 200 chromosome_1 190 250 10 // //<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_2 500 1000 chromosome_2 400 600 100 //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ bedtools intersect -a Set_A -b Set_B -wo **

How could we get back the coordinates of the features in Set_A that do NOT overlap any of the features in Set_B?

The BEDtools genome file
Some of the BEDtools utilities require a genome file which is simply a tab-delimited list of each sequence ID found in the reference FASTA file, followed by the length of the sequence.

You will have a chance to explore more BEDtools utilities in the excercises.

=Exercises=

1. You will recognize the premise of the first exercise if you took the Intro to Unix workshop:

Another lab in the department has been studying a mutant in //S. cerevisiae// S288C and have successfully mapped the single point mutation to a 20kb region of the genome. They PCR amplified this region in overlapping segments, pooled the segments and sequenced them on an Illumina machine, 101 basepair single-end reads.

In the Unix workshop, the bonus exercise involved an approach to find the mutation using only built-in Unix utilities like grep, the results of which suggest the mutation is located at position 505394 of chromosome_V. Now you just want to use tview to visually confirm that the location is in fact correct.

You can find the read alignments here: /global/courses/spr2012/SAMtoolsworkshop/S288C_mutant_chromV.sam Reference file: /global/courses/spr2012/SAMtoolsworkshop/S288C_chromosome_V.fa

You will have to use samtools to generate the appropriate index files required by tview.

2. The premise of this exercise is that you evolved a strain of yeast under high salt conditions and generated iIlumina sequence from the genome of the evolved strain. You've aligned the reads back to the reference genome and would like to determine if any deletions or duplications of genomic regions occurred in the evolved strain by examining read coverage along the chromosomes.

You can find the read alignments here (only chromosome #1 is included to reduce the file size): /global/courses/spr2012/SAMtoolsworkshop/S288C_evolved.sam

2A. Read the BEDtools manual to find a utility that you can use to calculate read coverage along the genome. Run this utility to calculate per-base read depth coverage along chromosome #1. Hint: you will need to know the length of chromosome #1 to make the BEDtools genome file. Luckily there is a samtools utility that will do this. Look at the samtools website to figure out which one.

2B. Create a plot of read coverage along chrom #1 using the output from 2A. You can either move the file to your laptop and use your graphing method of choice, do it yourself in R, or use this R script (rename the coverage file R.in first):


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ R --no-save < /global/courses/spr2012/SAMtoolsworkshop/plotCoverage.R **

3. The premise of this exercise is that you have ChIP-Seq data from two different histone modifications in Drosophila, a "silencing" mark associated with constitutive heterochromatin (H3K9me3) and an "active" mark associated with transcribed genes (H3K36me3). You have already identified the genomic locations of peaks of binding for each mark and want to quickly determine if the peak locations make biological sense by determining, for each mark, the fraction of peaks that overlap genes versus repetitive elements.

3A. You identified repetitive elements in the D. melanogaster genome using RepeatMasker, however this program isn't very good at identifying full-length elements and will often report multiple segments of a given element that are directly adjacent or within a few hundred basepairs of each other in the genome as separate elements. Find a BEDtools utility that you can use to merge features that are within a specified distance of each other into a single feature and save the output as a new BED file.

The un-merged output of RepeatMasker is here: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_repeatmasker.sorted.gff

3B. Use the merged RepeatMasker output, along with BEDtools, to determine, for each mark, the fraction of peaks that overlap genes versus repetitive elements.

The other files you will need are here: Dmel gene coordinates: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_genes.gff H3K9me3 peak locations: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K9me3_E0-12h.gff H3K36me3 peak locations: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K36me3_E0-12h.gff

Bonus Exercises
1. GFF files for gene annotations give you the coordinates of the entire gene as well as the coordinates of the exons of each gene. Given such a file, can you use BEDtools to find the coordinates of the introns? Use this file: /global/courses/spr2012/SAMtoolsworkshop/fly.gff

2. Download and install the read alignment visualization tool Tablet (link is at the top of this page). Use it to look at the alignments from exercise #1.

3. If you finish early and are still excited for more, figure out how to run one of the Picard utilities on one of the SAM/BAM files you used previously.

=Solutions=

1. To get your files ready to input into tview, you should have converted the SAM file to BAM, sorted the BAM file, and then indexed the sorted BAM file:

<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 70%;">**$ samtools view -bS /global/courses/spr2012/SAMtoolsworkshop/S288C_mutant_chromV.sam > S288C_mutant_chromV.bam**
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools sort S288C_mutant_chromV.bam S288C_mutant_chromV.sorted **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools index S288C_mutant_chromV.sorted.bam **

You will also need to index the reference FASTA file:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp /global/courses/spr2012/SAMtoolsworkshop/S288C_chromosome_V.fa . **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools faidx S288C_chromosome_V.fa **

Now you can run the tview command:
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools tview S288C_mutant_chromV.sorted.bam S288C_chromosome_V.fa **

Remember, typing the question mark will let you see a menu of options which you can use to figure out that typing 'g' lets you move to a specific location on the chromosome. Enter 'chromosome_V:505394' into the box after pressing 'g' and you will jump to the location of the mutation. Pressing the period once will allow you to see nucleotides for the reads instead of dots and pressing 'n' colors the nucleotides.

The mutation should be G->A and should be at position <span style="font-family: Arial,Helvetica,sans-serif; font-size: 12.9667px;">505394 of chromosome V.

<span style="font-family: Arial,Helvetica,sans-serif; font-size: 12.9667px;">2. We have to start this exercise the same way as the last one: convert to BAM, sort BAM, index BAM


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp /global/courses/spr2012/SAMtoolsworkshop/S288C_evolved.sam . **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools view -bS S288C_evolved.sam > S288C_evolved.bam **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools sort S288C_evolved.bam S288C_evolved.sorted **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools index S288C_evolved.sorted.bam **

<span style="font-family: Arial,Helvetica,sans-serif; font-size: 12.9667px;">The samtools utility called idxstats will tell you the length of each sequence that is present in the reference FASTA file.


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools idxstats S288C_evolved.sorted.bam **

//<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_I 230218 51928 0 // //<span style="font-family: 'Lucida Console',Monaco,monospace;">* 0 0 1458 //

Although the 'bedtools coverage' will also work for this exercise, 'bedtools genomecov' is a bit more straightforward for what we want to do in this case. bedtools genomecov requires a genome file which is simply a tab-delimited text file that contains the ID of each sequence in the reference FASTA followed by its length.

You can just use your favorite text editor to create the genome file. In this case I named mine genome.bed: //<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_I 230218 //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ cat genome.bed **

You have to include the -d argument when you run genomeCoverageBed to get it to print out the read depth per basepair. In this case, I capture the output in a file named S288C_evolved.cov.

<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ bedtools genomecov -ibam S288C_evolved.sorted.bam -d -g genome.bed > S288C_evolved.cov**

To use the R script that is on the CGRL server to plot the coverage data, we have to first rename the file R.in:
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp S288C_evolved.cov R.in **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ R --no-save < /global/courses/spr2012/SAMtoolsworkshop/plotCoverage.R **

This will create a PDF file named genomeCoverageBed.Rplot.pdf. Move this file back onto your laptop to look at it: <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ scp cellison@poset.cgrl.berkeley.edu:genomeCoverageBed.Rplot.pdf Documents/.**



It looks like there was one large deletion and one large duplication!

3. First we should copy all the files we need to our home directory: **<span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_genes.gff. ** **<span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K36me3_E0-12h.gff. ** **<span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K9me3_E0-12h.gff. ** **<span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_repeatmasker.sorted.gff. **

We can use 'bedtools intersect' to print out all the ChIP-seq peaks that overlap at least one gene in the GFF file. The -u argument makes it so that bedtools prints out each feature in set A only once even if that feature overlaps more than one feature in set B. We want to know the total number of peaks that overlap one or more features so we simply count the number of lines output by the intersect command:

**<span style="font-family: 'Lucida Console',Monaco,monospace;">$ bedtools intersect -u -a dmel_r5.6_H3K36me3_E0-12h.gff -b dmel_r5.6_genes.gff | wc -l ** //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">6074 //

We also count the total number of peaks: //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">6846 dmel_r5.6_H3K36me3_E0-12h.gff //
 * <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">$ wc -l dmel_r5.6_H3K36me3_E0-12h.gff **

Same thing for the H3K9 mark: //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">1898 // //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">3543 dmel_r5.6_H3K9me3_E0-12h.gff //
 * <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">$ bedtools intersect -u -a dmel_r5.6_H3K9me3_E0-12h.gff -b dmel_r5.6_genes.gff | wc -l **
 * <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">$ wc -l dmel_r5.6_H3K9me3_E0-12h.gff **

Now we repeat everything to find the overlap between the ChIP-seq peaks and repetitive elements, but first we merge repetitive elements that are within 100bp from each other in the genome:

<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ bedtools merge -i dmel_r5.6_repeatmasker.sorted.gff -d 100 > dmel_r5.6_repeatmasker.merged.bed** <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ bedtools intersect -u -a dmel_r5.6_H3K9me3_E0-12h.gff -b dmel_r5.6_repeatmasker.merged.bed | wc -l** //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">1303 // <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ intersectBed -u -a dmel_r5.6_H3K36me3_E0-12h.gff -b dmel_r5.6_repeatmasker.merged.bed | wc -l** //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">512 //

Fraction of peaks overlapping genes: H3K9 1898/3543 (54%) H3K36 6074/6846 (89%) Fraction of peaks overlapping TEs : H3K9 1303/3543 (37%) H3K36 512/6846 (7%)

The H3K36 mark makes perfect sense: it's a mark associated with active transcription and the vast majority of peaks overlap genes while only a small percentage overlap repetitive elements. The H3K9 mark makes less: it's a mark associated with silenced regions of the genome but more peaks overlap genes than repetitive elements. If this was my experiment I'd say the H3K36 data seems good but the H3K9 data is questionable.