SAMtoolsSpring2012

= =

SAMtools and BEDtools workshop

Introduction

 * Learning to program is essential for performing sophisticated/customized analyses but one can accomplish an impressive amount of computational tasks by simply taking advantage of pre-existing software, without ever having to write your own script.
 * The types of analyses that are possible are increasing rapidly as the field matures and the community adopts standardized file formats.
 * One of the most common types of file formats related to next-gen sequence is called SAM and is used to store information about how and where short reads align to a reference sequence. Often, in conjunction with SAM, you will see BAM files mentioned. BAM just stands for Binary SAM. They store the same exact information except that BAM files are compressed so that they take up less disk space. Two other file formats that are commonly used in genomics that we will also talk a bit about are GFF and BED, which are ways of storing the coordinates of genome features.
 * For basically all types of analyses related to next-gen sequence, you will at some point end up having to align your reads to a reference sequence. Back in the early days, every piece of software for aligning short reads output the alignment information in a slightly different format which meant that you either had to write your own scripts to convert alignments between formats, or hope to find a script on the internet. These days, the vast majority of short read aligners output alignments in SAM format. This has made it easier to develop suites of utilities for downstream processing of short read alignments. The two that I'm going to cover in detail today are SAMtools and BEDtools
 * The goal of this workshop is introduce you to these two suites of utilities and some of the fairly sophisticated computation that they will allow you to perform, without ever having to write your own script.

Links of Interest

 * SAMtools homepage
 * SAMtools manual
 * SAM format specification
 * Fugu download (software for transferring files)
 * Tablet homepage (software for visualizing BAM alignments)
 * BEDtools homepage
 * BEDtools manual
 * BED file format
 * GFF file format
 * Picard homepage (another suite of utilities for SAM/BAM files)

A note about the format of this wiki
Examples of input and output to and from the command line are included below. Commands that you will type into the command line are shown in **Bold Lucida** font and are preceded by a dollar sign. You should not actually type the dollar sign into the command line. Output is shown in //Italic Lucida.// Example:

**$ input command** //output from command//

If you are working on one of the CGRL servers
You will need to type in the following commands to make the software we will use today available to you:

**$ module load samtools/0.1.18** **$ module load bedtools** **$ module load picard-tools** **$ module load R**

If you logout and log back in, you will need to type these commands in again. You most likely will not need to do this if you have these utilities installed on your own machine or are working on a different server.

SAMtools
We're going to start with SAMtools, which by itself actually doesn't allow you to do anything really interesting, but you will still end up using it all the time to convert between SAM and BAM formats. Also, many downstream tools, such as BEDtools and many SNP callers, require that BAM files be indexed and sorted, which is easily accomplished with SAMtools.

The SAM format, while not exactly easy on the eyes, encodes almost every piece of information you would ever want to know about how and where a read aligns to a reference sequence.

Let's take a look:

** $ less example.sam ** @HD VN:1.0 SO:unsorted @SQ SN:chromosome_I LN:230218 @PG ID:Bowtie VN:0.12.7 CL:"bowtie -v 1 -m1 -S chromosome_I example.fastq example.sam" Frag_1 0 chromosome_I 128138 255 33M * 0 0 ATACAAATATATGATGAATCATTAAAGAGGAGG AIGH?IFIIIF+GGGHIHII+=FD?FIFH;IHF XA:i:0 MD:Z:101 NM:i:0 Frag_2 16 chromosome_I 54594 255 33M * 0 0 AATCTGAAGTGTCTATAAGTACTGTATTCTTAG ICCGIHIFHIIIHI=EIIFFG@?IICEEDHI>H XA:i:0 MD:Z:101 NM:i:0 Frag_3 0 chromosome_I 181175 255 33M * 0 0 CAAGAAATATCTTGACCGCAGTGAACTGTGGGA HCIIIHIH?HIG:IIHGDHFBHIIIGI2HIII@ XA:i:0 MD:Z:101 NM:i:0 <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">Frag_4 16 chromosome_I 63331 255 33M * 0 0 GTTTCAACAAGGTGCGCCCCTCTGTATACTTTT HEIIGIBIBGH=2GFIIIHGIHIIH7HGI?HHI XA:i:0 MD:Z:101 NM:i:0 <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">Frag_5 0 chromosome_I 141473 255 33M * 0 0 TGATTACTTTTCTTTTGATGTGCTTATCTTACA IHI6IIIIIHEEIFIIHEIFIHIIHICHIII%C XA:i:0 MD:Z:101 NM:i:0 <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">...

Every SAM files starts with a number of lines that begin with the '@' symbol. The amount of information included in this section varies depending on the short read aligner that was used but all will include at least the '@SQ' lines which list the IDs and lengths of the all sequences present within the reference file that the short reads were aligned to. In this example, two other lines are included. The '@HD' line lists the SAM format version that was used and whether or not the file has been sorted. The @PG line lists the aligner name and version that was used to align the reads as well as the actual command line.

Below these lines, the actual read alignment information begins.

Column:
 * 1) Read ID
 * 2) Bitwise flag (encodes additional alignment information such as whether mate of current was mapped)
 * 3) ID of reference sequence that read aligned to
 * 4) Coordinate on reference sequence of left-most position of first matching base
 * 5) Mapping quality
 * 6) CIGAR string (encodes how each base of read aligned, e.g. match, mismatch, insertion, deletion, etc.)
 * 7) Reference ID where mate aligned (if paired-end)
 * 8) Reference position where mate aligned (if paired-end)
 * 9) Fragment length (if paired-end)
 * 10) Read sequence
 * 11) Read quality scores
 * 12) Alignment section (see SAM format specification and aligner manual for more details)

Let's go over some of the basic things you will want to do with SAMtools. First let's see what tools are available within this suite:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools **

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Program: samtools (Tools for alignments in the SAM format) // //<span style="font-family: 'Lucida Console',Monaco,monospace;">Version: 0.1.16 (r963:234) //

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Usage: samtools [options] //

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Command: view SAM<->BAM conversion // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> sort sort alignment file // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> pileup generate pileup output // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> mpileup multi-way pileup // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> depth compute the depth // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> faidx index/extract FASTA // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> index index alignment // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> idxstats BAM index stats (r595 or later) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> fixmate fix mate information // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> glfview print GLFv3 file // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> flagstat simple stats // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> calmd recalculate MD/NM tags and '=' bases // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> merge merge sorted alignments // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> rmdup remove PCR duplicates // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> reheader replace BAM header // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> cat concatenate BAMs // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> targetcut cut fosmid regions (for fosmid pool only) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> phase phase heterozygotes //

The most common thing you are likely to be using SAMtools for is converting between SAM and BAM files. Let's figure out how to do that:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools view **

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Usage: samtools view [options] <in.bam>|<in.sam> [region1 [...]] //

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Options: // //<span style="font-family: 'Lucida Console',Monaco,monospace;">-b output BAM // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -h print header for the SAM output // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -H print header only (no alignments) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -S input is SAM // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -u uncompressed BAM output (force -b) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -1 fast compression (force -b) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -x output FLAG in HEX (samtools-C specific) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -X output FLAG in string (samtools-C specific) // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -c print only the count of matching records // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -L FILE output alignments overlapping the input BED FILE [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -t FILE list of reference names and lengths (force -S) [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -T FILE reference sequence file (force -S) [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -o FILE output file name [stdout] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -R FILE list of read groups to be outputted [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -f INT required flag, 0 for unset [0] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -F INT filtering flag, 0 for unset [0] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -q INT minimum mapping quality [0] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -l STR only output reads in library STR [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -r STR only output reads in read group STR [null] // //<span style="font-family: 'Lucida Console',Monaco,monospace;"> -? longer help //

If we want to go from SAM to BAM, it looks like we have to specify that the input file is in SAM format (-S) and that we want samtools to output a BAM file (-b) and we have to capture the output in a new file:

//<span style="font-family: 'Lucida Console',Monaco,monospace;">[samopen] SAM header is present: 1 sequences. //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools view -bS example.sam > example.bam **

The opposite conversion doesn't require any arguments:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools view example.bam > example.sam **

The other two most commonly used samtools utilties are sort and index.

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Usage: samtools sort [-on] [-m <maxMem>] <in.bam> <out.prefix> //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools sort **

It looks like we need specify the BAM file to sort as well as the prefix of the output file:

This should create a new sorted file called example.sorted.bam.
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools sort example.bam example.sorted **

Often, after you have sorted a BAM file you will also need to index it:

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Usage: samtools index <in.bam> [out.index] //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools index **

This time, the name of the output file is optional. If you don't include anything, if will create an index with the same name as the BAM file ending with ".bai"


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools index example.sorted.bam **

This should create a file called example.sorted.bam.bai

If you want to visualize short read alignments, you'll also need to index the reference fasta file:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools faidx reference.fasta **

Now that we have all the required files, we can explore a couple of ways of visualizing short read alignments. SAMtools comes with its own utility for visualizing read alignments called tview. You don't have to specify the BAM and reference indexes, but they need to be in the same directory as the BAM and reference FASTA files.


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools tview example.sorted.bam reference.fasta **

If you want to see the menu that lists the various options for displaying the alignments and moving around type '?'. If you want to jump to a specific location along the chromosome, type 'g' followed by the location in the format referenceID:coordinate e.g. chromosome_I:150000.

SAMtools tview is fast and runs directly in the terminal window which makes it nice if you want to quickly look at read alignments at a specific location in the reference sequence. For browsing read alignments, I prefer the standalone program Tablet which is more feature-rich and easier to navigate.

Moving files off of the server onto your laptop
You are going to run Tablet locally on your own laptop which means you need to move the appropriate files from the server onto your laptop. The easiest way to do this if you have a Mac is with the program Fugu otherwise you can use the command 'scp'. First open a new terminal window then type:

//<span style="font-family: 'Lucida Console',Monaco,monospace;">Password: // //<span style="font-family: 'Lucida Console',Monaco,monospace;">example.bam 100% 5448KB 454.0KB/s 00:12 //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ scp <span class="wiki_link_ext">username@poset.cgrl.berkeley.edu:example.bam . **

Before transferring the file it asks you for a password. Type in the password for your account on the remote server. Also, don't forget to include the period at the end of the command and to replace 'username' in the above example with your actual username. Repeat this command for all the files you need:


 * sorted BAM file (e.g. example.sorted.bam)
 * index for sorted BAM file (e.g. example.sorted.bam.bai)
 * reference FASTA file (e.g. chromosome_I.fasta)
 * reference FASTA index (e.g. chromosome_I.fasta.fai)

Tablet
After downloading and installing Tablet, start the program and click on 'Import an assembly into Tablet'. The only tricky thing about using Tablet is figuring out which files need to be imported into which box. You should click on the 'Browse' button and locate the appropriate files so that the sorted BAM file goes into the box labeled 'Primary assembly file' and the reference FASTA file goes into the Reference/consensus box.

BEDtools
BEDtools is a great suite of tools that can be used for comparing large sets of genome features. A genome feature is basically anything that can represented by a start and end coordinate within the genome sequence space. Examples include genes, transposable elements, short read alignments, etc.

GFF
- 1-based (coordinate system starts at 1) - start and end positions are inclusive - always has 9 columns - brief description of columns can be found here

BED
- 0-based (coordinate system starts at 0) - start is inclusive but end is not - basic version has 3 columns with 9 additional optional columns - description of format can be found here

BEDtools will accept GFF, BED, and BAM formats as input but only outputs BED files. It is aware of the different coordinate representations in the different file formats (e.g. GFF vs BED) so you don't have to worry about that.

Most tools within the BEDtools suite perform comparisons between two different sets of genome features. The filename that contains the first set (Set A) is given after the -a flag while the second (Set B) is given after the -b flag. If one of the feature sets is a BAM file, it should be given as Set A (Set B is read into memory and since BAM files can have millions of lines, avoiding using it as Set B will save memory) and the -a flag has to be changed to -abam. All genome feature files need to be tab-delimited. A quick example for two BED files:

Set_A: chromosome_1 10 200 chromosome_2 500 1000 chromosome_3 2000 3000

Set_B: chromosome_1 190 250 chromosome_2 400 600 chromosome_3 3001 4000

The most basic BEDtools utility is called intersectBed and returns the coordinates of the portion of each feature in Set_A that overlaps with a feature in Set_B

//<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_1 190 200 // //<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_2 500 600 //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ intersectBed -a Set_A -b Set_B **

Typing the name of the utility without any arguments will return a list of all the possible arguments you can use to control what the utility reports. For example, adding the -wo argument tells intersectBed to report the full coordinates of both features that overlap plus the number of basepairs that lie within the overlap:

//<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_1 10 200 chromosome_1 190 250 10 // //<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_2 500 1000 chromosome_2 400 600 100 //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ intersectBed -a Set_A -b Set_B -wo **

The BEDtools genome file
Some of the BEDtools utilities require a genome file which is simply a tab-delimited list of each sequence ID found in the reference FASTA file, followed by the length of the sequence.

You will have a chance to explore more BEDtools utilities in the excercises.

=Exercises=

1. You will recognize the premise of the first exercise if you took the Intro to Unix workshop:

Another lab in the department has been studying a mutant in //S. cerevisiae// S288C and have successfully mapped the single point mutation to a 20kb region of the genome. They PCR amplified this region in overlapping segments, pooled the segments and sequenced them on an Illumina machine, 101 basepair single-end reads.

In the Unix workshop, the bonus exercise involved a fairly convoluted approach to find the mutation using only built-in Unix utilities like grep. This time, examine the read alignments in Tablet to see if you can simply find the mutation by eye.

You can find the read alignments here: /global/courses/spr2012/SAMtoolsworkshop/S288C_mutant_chromV.sam Reference file: /global/courses/spr2012/SAMtoolsworkshop/S288C_chromosome_V.fa

You will have to download and install Tablet on your laptop and use samtools to generate the appropriate files for Tablet. You may want to play around with how Tablet displays the read alignments to make it easier to spot read bases that differ from the reference.

Once you have identified the coordinate of the mutation, use samtools tview to visualize the read alignments. Jump to the location of the mutation and make sure tview displays it as well.

2. The premise of this exercise is that you evolved a strain of yeast under high salt conditions and generated iIlumina sequence from the genome of the evolved strain. You've aligned the reads back to the reference genome and would like to determine if any deletions or duplications of genomic regions occurred in the evolved strain by examining read coverage along the chromosomes.

You can find the read alignments here (only chromosome #1 is included to reduce the file size): /global/courses/spr2012/SAMtoolsworkshop/S288C_evolved.sam

2A. Read the BEDtools manual to find a utility that you can use to calculate read coverage along the genome. Run this utility to calculate per-base read depth coverage along chromosome #1. Hint: you will need to know the length of chromosome #1 to make the BEDtools genome file. Luckily there is a samtools utility that will do this. Look at the samtools website to figure out which one.

2B. Create a plot of read coverage along chrom #1 using the output from 2A. You can either move the file to your laptop and use your graphing method of choice, do it yourself in R, or use this R script (rename the coverage file R.in first):


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ R --no-save < /global/courses/spr2012/SAMtoolsworkshop/plotCoverage.R **

3. The premise of this exercise is that you have ChIP-Seq data from two different histone modifications in Drosophila, a "silencing" mark associated with constitutive heterochromatin (H3K9me3) and an "active" mark associated with transcribed genes (H3K36me3). You have already identified the genomic locations of peaks of binding for each mark and want to quickly determine if the peak locations make biological sense by determining, for each mark, the fraction of peaks that overlap genes versus repetitive elements.

3A. You identified repetitive elements in the D. melanogaster genome using RepeatMasker, however this program isn't very good at identifying full-length elements and will often report multiple segments of a given element that are directly adjacent or within a few hundred basepairs of each other in the genome as separate elements. Find a BEDtools utility that you can use to merge features that are within a specified distance of each other into a single feature and save the output as a new BED file.

The un-merged output of RepeatMasker is here: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_repeatmasker.gff

3B. Use the merged RepeatMasker output, along with BEDtools, to determine, for each mark, the fraction of peaks that overlap genes versus repetitive elements.

The other files you will need are here: Dmel gene coordinates: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_genes.gff H3K9me3 peak locations: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K9me3_E0-12h.gff H3K36me3 peak locations: /global/courses/spr2012/SAMtoolsworkshop/dmel_r5.6_H3K36me3_E0-12h.gff

Bonus Exercises
1. There is a utility that comes with SAMtools that can be used to identify sequence variants from short read alignments. Figure out which tool and run it on the S288C_mutant_chromV.sam alignments to see if it identifies the same mutation that you found in Tablet and/or tview.

2. Use BEDtools to create 10 random permutations of the locations of the ChIP-seq peaks. Use these permutations to get a rough idea of whether the overlap that you observed between the peaks and genes and/or repetitive elements is more than you would expect by chance.

3. If you finish early and are still excited for more, figure out how to run one of the Picard utilities on one of the SAM/BAM files you used previously.

=Solutions=

1. To get your files ready to input into Tablet, you should have converted the SAM file to BAM, sorted the BAM file, and then indexed the sorted BAM file:

<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 70%;">**$ samtools view -bS /global/courses/spr2012/SAMtoolsworkshop/S288C_mutant_chromV.sam > S288C_mutant_chromV.bam**
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools sort S288C_mutant_chromV.bam S288C_mutant_chromV.sorted **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools index S288C_mutant_chromV.sorted.bam **

You will also need to index the reference FASTA file:


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp /global/courses/spr2012/SAMtoolsworkshop/S288C_chromosome_V.fa . **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools faidx S288C_chromosome_V.fa **

Now you will need to move the BAM file and the reference FASTA file to your local machine, along with their respective indexes.

Open a new terminal window so that you are working from your own machine, rather than on the CGRL server

<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ scp cellison@poset.cgrl.berkeley.edu:S288C_mutant_chromV.sorted.bam Documents/.** <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ scp cellison@poset.cgrl.berkeley.edu:S288C_mutant_chromV.sorted.bam.bai Documents/.** <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ scp cellison@poset.cgrl.berkeley.edu:S288C_chromosome_V.fasta Documents/.** <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ scp cellison@poset.cgrl.berkeley.edu:S288C_chromosome_V.fa Documents/.** <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ scp cellison@poset.cgrl.berkeley.edu:S288C_chromosome_V.fa.fai Documents/.**

Open Tablet and load the BAM and reference FASTA files as described in the wiki above. You won't see any reads immediately but that is because you only sequenced PCR product from a 20kb region of this chromosome. The easiest way to figure out which region of chromosome V you should scroll to is to look at the first few lines within the SAM file to see which coordinates they are mapping to.

You can control how easy it is to see the variants by sliding the 'Variant' scrollbar or by clicking on 'Read Colors' and selecting 'Variants'.

The mutation should be G->A and should be at position <span style="font-family: Arial,Helvetica,sans-serif; font-size: 12.9667px;">505393 of chromosome V.

<span style="font-family: Arial,Helvetica,sans-serif; font-size: 12.9667px;">2. We have to start this exercise the same way as the last one: convert to BAM, sort BAM, index BAM


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp /global/courses/spr2012/SAMtoolsworkshop/S288C_evolved.sam . **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools view -bS S288C_evolved.sam > S288C_evolved.bam **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools sort S288C_evolved.bam S288C_evolved.sorted **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools index S288C_evolved.sorted.bam **

<span style="font-family: Arial,Helvetica,sans-serif; font-size: 12.9667px;">The samtools utility called idxstats will tell you the length of each sequence that is present in the reference FASTA file.


 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ samtools idxstats S288C_evolved.sorted.bam **

//<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_I 230218 51928 0 // //<span style="font-family: 'Lucida Console',Monaco,monospace;">* 0 0 1458 //

Although the BEDtools utility coverageBed will also work for this exercise, genomeCoverageBed is a bit more straightforward for what we want to do in this case. genomeCoverageBed requires a genome file which is simply a tab-delimited text file that contains the ID of each sequence in the reference FASTA followed by its length.

You can just use your favorite text editor to create the genome file. In this case I named mine genome.bed: //<span style="font-family: 'Lucida Console',Monaco,monospace;">chromosome_I 230218 //
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ cat genome.bed **

You have to include the -d argument when you run genomeCoverageBed to get it to print out the read depth per basepair. In this case, I capture the output in a file named S288C_evolved.cov.

<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ genomeCoverageBed -ibam S288C_evolved.sorted.bam -d -g genome.bed > S288C_evolved.cov**

To use the R script that is on the CGRL server to plot the coverage data, we have to first rename the file R.in:
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ cp S288C_evolved.cov R.in **
 * <span style="font-family: 'Lucida Console',Monaco,monospace;">$ R --no-save < /global/courses/spr2012/SAMtoolsworkshop/plotCoverage.R **

This will create a PDF file named genomeCoverageBed.Rplot.pdf. Move this file back onto your laptop to look at it: <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ scp cellison@poset.cgrl.berkeley.edu:genomeCoverageBed.Rplot.pdf Documents/.**



It looks like there was one large deletion and one large duplication!

3. We can use intersectBed to print out all the ChIP-seq peaks that overlap at least one gene in the GFF file. The -u argument makes it so that intersectBed prints out each feature in set A only once even if that feature overlaps more than one feature in set B. We want to know the total number of peaks that overlap one or more features so we simply count the number of lines output by intersectBed:

**<span style="font-family: 'Lucida Console',Monaco,monospace;">$ intersectBed -u -a 620_MAT_H3K36me3_E0-12h.bed_dcc.gff -b dmel-genes-r5.6.gff | wc -l ** //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">6074 //

We also count the total number of peaks: //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">6846 620_MAT_H3K36me3_E0-12h.bed_dcc.gff //
 * <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">$ wc -l 620_MAT_H3K36me3_E0-12h.bed_dcc.gff **

Same thing for the H3K9 mark: //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">1898 // //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">3543 621_MAT_H3K9me3_E0-12h.bed_dcc.gff //
 * <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">$ intersectBed -u -a 621_MAT_H3K9me3_E0-12h.bed_dcc.gff -b dmel-genes-r5.6.gff | wc -l **
 * <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 90%;">$ wc -l 621_MAT_H3K9me3_E0-12h.bed_dcc.gff **

Now we repeat everything to find the overlap between the ChIP-seq peaks and repetitive elements, but first we merge repetitive elements that are within 100bp from each other in the genome: <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;"> <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ mergeBed -i dmel_r5.6_repeatmasker.gff -d 100 > dmel_r5.6_repeatmasker.merged.bed** <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ intersectBed -u -a 621_MAT_H3K9me3_E0-12h.bed_dcc.gff -b dmel_r5.6_repeatmasker.merged.bed | wc -l** //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">1303 // <span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">**$ intersectBed -u -a 620_MAT_H3K36me3_E0-12h.bed_dcc.gff -b dmel_r5.6_repeatmasker.merged.bed | wc -l** //<span style="font-family: 'Lucida Console',Monaco,monospace; font-size: 80%;">512 //

Fraction of peaks overlapping genes: H3K9 1898/3543 (54%) H3K36 6074/6846 (89%) Fraction of peaks overlapping TEs : H3K9 1303/3543 (37%) H3K36 512/6846 (7%)

The H3K36 mark makes perfect sense: it's a mark associated with active transcription and the vast majority of peaks overlap genes while only a small percentage overlap repetitive elements. The H3K9 mark makes less: it's a mark associated with silenced regions of the genome but more peaks overlap genes than repetitive elements. If this was my experiment I'd say the H3K36 data seems good but the H3K9 data is questionable.