UNIX_session2_Fall2012

= = =Genomic Analysis at the UNIX Command Line (Afternoon session)=

Your instructors are: Chris Ellison and Madhavan Ganesh.

Installing and running software on the command line: an example using Bowtie
Assuming you are working on an organism whose genome has already been sequenced, the first step for most types of experiments that involve next-gen sequence is to align your sequence reads to a reference genome.

One popular, fast, and easy to use short-read aligner is Bowtie

Unfortunately, there is not enough time in this module to get into the nitty-gritty of aligning short reads to a genome assembly but if you are familiar with BLAST, Bowtie takes a similar approach with a few tweaks thrown in including: speed-ups to make possible the alignment of millions of query sequences in a reasonable amount of time, optimizing the aligning process to specifically deal with short query sequences, and taking advantage of the quality scores that accompany the sequence reads. For more information, check out the Bowtie paper.

Downloading and installing software
Before we can run Bowtie, we need to install it and before we can install it, we need to download it. To download Bowtie, follow the link on the Bowtie homepage, which takes you to sourceforge.

It looks like there are several options depending on which operating system you are running.

Digression: source code versus pre-compiled binary
There are generally two formats that software will be available in for download: source code and binaries. The source code is the human-readable code that the developer(s) wrote which needs to be compiled into a binary before it can be read by your machine. In many cases developers will have already compiled their source code for a handful of different operating systems and will make those "pre-compiled binaries" available in addition to the source code. This saves you the step of having to compile the source code yourself, a more advanced procedure that we won't be able to cover in this workshop. Usually the source code will be labeled with "source" or "src" while the pre-compiled binaries will have the name of the operating system that they are intended for in their label.

So, it looks like the developers of bowtie have already compiled it for both macOS and linux, making it easy for us. However, we still have one more decision to make. You can see that there are two binaries available for both macOS and linux, one labeled i386 and the other x86_64.

Another digression: chip architecture
When downloading software, you will often find binaries labeled with either i386 or x86_64. Without going too deep into details, these numbers refer to the type of processor the binaries were compiled for. Luckily, there is an easy way to figure out which type of processor is on the machine you are installing to. If you are installing to your own local machine (mac or linux) open a new terminal window. If you are installing to a remote server, make sure you are logged on to the server and in the same window, type:

$ **uname -a**

//Linux poset.cgrl.berkeley.edu 2.6.18-238.12.1.el5 #1 SMP Tue May 31 13:22:04 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux//

Look through the output for either i386 or x86_64. You can see that in this case, the server we are all logged into has the x86_64 architecture and thus, the binary with that label is the one we should download.

Ok, now that we know which binary we need, what is the easiest way to download the file onto the server?

Downloading bowtie using wget
$ **wget -O bowtie-0.12.7-linux-x86_64.zip http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip/download** //Resolving surfnet.dl.sourceforge.net... 130.59.138.21, 2001:620:0:1b::21// //Connecting to surfnet.dl.sourceforge.net|130.59.138.21|:80... connected.// //HTTP request sent, awaiting response... 200 OK// //Length: 10531770 (10M) [application/zip]// //Saving to: `bowtie-0.12.7-linux-x86_64.zip'//

//100%[===============================================================================================================================================================>] 10,531,770 2.76M/s in 4.5s//

//2011-09-06 12:10:29 (2.23 MB/s) - `bowtie-0.12.7-linux-x86_64.zip' saved [10531770/10531770]//

wget reports the download progress and a bunch of other stats about the download. Type 'ls' and you should now see a file named 'bowtie-0.12.7-linux-x86_64.zip' in your directory.

Unzipping the downloaded file
We can see that the file we downloaded ends in .zip which tells us that this file (or group of files) has been compressed using a utility called zip. To uncompress it type:

$ **unzip bowtie-0.12.7-linux-x86_64.zip** //Archive: bowtie-0.12.7-linux-x86_64.zip//  //creating: bowtie-0.12.7///  //creating: bowtie-0.12.7/scripts/// <span style="font-family: Arial,Helvetica,sans-serif;"> //inflating: bowtie-0.12.7/scripts/build_test.sh// <span style="font-family: Arial,Helvetica,sans-serif;"> //inflating: bowtie-0.12.7/scripts/make_a_thaliana_tair.sh// <span style="font-family: Arial,Helvetica,sans-serif;"> //...// <span style="font-family: Arial,Helvetica,sans-serif;"> //...//

<span style="font-family: Arial,Helvetica,sans-serif;">Let's take a look inside the directory:

<span style="font-family: Arial,Helvetica,sans-serif;">$ **ls -l bowtie-0.12.7/** <span style="font-family: Arial,Helvetica,sans-serif;">//total 12420// <span style="font-family: 'Courier New',Courier,monospace;">//-rw-r--r-- 1 cellison cellison 703 2009-12-12 06:29 AUTHORS// <span style="font-family: 'Courier New',Courier,monospace;">//-rwxr-xr-x 1 cellison cellison 773677 2010-09-07 14:20 bowtie// <span style="font-family: 'Courier New',Courier,monospace;">//-rwxr-xr-x 1 cellison cellison 327545 2010-09-07 14:19 bowtie-build// <span style="font-family: 'Courier New',Courier,monospace;">//-rwxr-xr-x 1 cellison cellison 2683055 2010-09-07 14:19 bowtie-build-debug// <span style="font-family: 'Courier New',Courier,monospace;">//-rwxr-xr-x 1 cellison cellison 6930626 2010-09-07 14:19 bowtie-debug// <span style="font-family: 'Courier New',Courier,monospace;">//-rwxr-xr-x 1 cellison cellison 238105 2010-09-07 14:19 bowtie-inspect// <span style="font-family: 'Courier New',Courier,monospace;">//-rwxr-xr-x 1 cellison cellison 1521358 2010-09-07 14:19 bowtie-inspect-debug// <span style="font-family: 'Courier New',Courier,monospace;">//-rw-r--r-- 1 cellison cellison 5207 2008-08-13 09:00 COPYING// <span style="font-family: 'Courier New',Courier,monospace;">//drwxr-xr-x 2 cellison cellison 4096 2010-09-07 14:20 doc// <span style="font-family: 'Courier New',Courier,monospace;">//drwxr-xr-x 2 cellison cellison 4096 2010-09-07 14:20 genomes// <span style="font-family: 'Courier New',Courier,monospace;">//drwxr-xr-x 2 cellison cellison 4096 2011-09-06 10:16 indexes// <span style="font-family: 'Courier New',Courier,monospace;">//-rw-r--r-- 1 cellison cellison 69183 2010-09-07 13:54 MANUAL// <span style="font-family: 'Courier New',Courier,monospace;">//-rw-r--r-- 1 cellison cellison 80488 2010-09-07 13:47 MANUAL.markdown// <span style="font-family: 'Courier New',Courier,monospace;">//-rw-r--r-- 1 cellison cellison 29755 2010-09-07 13:47 NEWS// <span style="font-family: 'Courier New',Courier,monospace;">//drwxr-xr-x 2 cellison cellison 4096 2010-09-07 14:20 reads// <span style="font-family: 'Courier New',Courier,monospace;">//drwxr-xr-x 3 cellison cellison 4096 2010-09-07 14:20 scripts// <span style="font-family: 'Courier New',Courier,monospace;">//-rw-r--r-- 1 cellison cellison 6258 2009-10-05 14:55 TUTORIAL// <span style="font-family: 'Courier New',Courier,monospace;">//-rw-r--r-- 1 cellison cellison 6 2010-09-07 13:47 VERSION//

**Digression: Cheatsheet showing various methods for compressing/uncompressing files and packaging/unpackaging directories**
Although the bowtie binary happens to be stored in a zipped directory, it's actually more common to find downloads (data such as genome sequences or other software) that have been packaged and compressed using a utility called tar. Consult the chart below for the command to unpackage such a file.


 * ~ Goal ||~ Command name ||~ Syntax ||~ Extension ||
 * compress file with zip (fast, less efficient compression) ||< zip || zip output-filename.zip input-filename || .zip ||
 * uncompress with zip ||< unzip || unzip filename.zip || .zip ||
 * compress file with gzip (slower, more efficient compression) ||< gzip || gzip filename || .gz ||
 * uncompress with gzip ||< gunzip || gunzip filename || .gz ||
 * compress file with bzip2 (slowest, most efficient compression) ||< bzip2 || bzip2 filename || .bz2 ||
 * uncompress with bzip2 ||< bunzip2 || bunzip2 filename || .bz2 ||
 * archive a directory of files and compress with gzip ||< tar || tar -czf output-filename.tar.gz input-directory || .tar.gz ||
 * unpack a directory of files that is compressed with gzip ||< tar || tar -xzf filename.tar.gz || .tar.gz ||
 * archive a directory of files and compress with bzip2 ||< tar || tar -cjf output-filename.tar.bz2 input-directory || .tar.bz2 ||
 * unpack a directory of files that is compressed with bzip2 ||< tar || tar -xjf filename.tar.bz2 || .tar.bz2 ||

Okay, we downloaded and uncompressed bowtie. Will it work now?

<span style="font-family: Arial,Helvetica,sans-serif;">$ **bowtie** <span style="font-family: Arial,Helvetica,sans-serif;">//bowtie: command not found//

Why doesn't it work?

Modifying PATH
The bowtie binary is located inside the directory bowtie-0.12.7/ Is this directory in our PATH? From this morning:

<span style="font-family: Arial,Helvetica,sans-serif;">$ **env | grep PATH** //PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/global/home/cellison/bin// //MODULEPATH=...*//IGNORE THIS LINE FOR NOW*

So, bowtie is not in your path. That explains it. The bowtie binary is now on the server, but the machine doesn't know where to find it. We need to add the bowtie folder to PATH. There is a special file located in your home directory called .bash_profile that specifies PATH. To add to PATH, we can use the text editor emacs to edit this file.

<span style="font-family: Arial,Helvetica,sans-serif;">$ **emacs** **~/.bash_profile**

//# .bash_profile// //# Get the aliases and functions//

//if [ -f ~/.bashrc ]; then//

//. ~/.bashrc//

//fi// //# User specific environment and startup programs// //PATH=$PATH:$HOME/bin// //export PATH//

Scroll down to the line shown in red above and modify it so that it looks like this: //PATH=$PATH:$HOME/bin:$HOME///bowtie-0.12.7

Do you remember from this morning's session how to save and exit from emacs?

After exiting emacs, reload the modified profile:

<span style="font-family: Arial,Helvetica,sans-serif;">$ **source ~/.bash_profile**

Now we should be able to run bowtie.

Running a program on the command line
When you used grep this morning, you had to type three things into the command: the name of the executable itself, what you wanted to search for, and the name of the file you wanted to search.

How do you know how to structure your command for a program you've never used before, like say, Bowtie?

You have a couple of options. You could try to find documentation on the website or inside the package that you downloaded but it's often easiest to just start by simply typing in the name of the executable:

$ **bowtie** //No index, query, or output file specified!// //Usage:// //bowtie [options]* {-1 <m1> -2 <m2> | --12 <r> | } [ ]//

//<m1> Comma-separated list of files containing upstream mates (or the// //sequences themselves, if -c is set) paired with mates in <m2>// //<m2> Comma-separated list of files containing downstream mates (or the// //sequences themselves if -c is set) paired with mates in <m1>// //<r> Comma-separated list of files containing Crossbow-style reads. Can be// //a mixture of paired and unpaired. Specify "-" for stdin.// // Comma-separated list of files containing unpaired reads, or the// //sequences themselves, if -c is set. Specify "-" for stdin.// // File to write hits to (default: stdout)// //...// //...// //...//

Bowtie gets angry because we tried to run it without giving it all the information it needs to run, but it also conveniently tells us how to structure our command and lists a whole slew of options we can use to control how it will go about aligning our reads.

Bowtie cannot read your mind. Neither can any other unix program. At least not yet...(queue evil laughter)
But seriously, most bioinformatics tools are setup such that you have a great degree of control over exactly how they will go about doing what it is they were created to do. This is great but it is important to be aware that the only way such a program is going to do exactly what you want it to is if you specify all those details. However, to make it easy on you, most of those options come preset with default values that seemed reasonable for X, where X can be the human genome, vertebrates, E. coli or whatever random number the developers had in mind when they wrote the code. This can be very dangerous and result in wacky results if you don't take the time to figure out what the different options available to you mean and what values for those options are appropriate for your data.

Okay, so back to running Bowtie. Let's do exactly what the above digression told you not to do, that is, run bowtie with its default settings.

So, from the usage statement it looks like you run bowtie by typing 'bowtie', followed by whichever options you want to specify, followed by ebwt, then your sequence datafiles (specified differently depending upon how they are organized, i.e. mate pairs split into different files, mate pairs in the same file, etc) then the name of the file to print the alignments to.

Errr, wait a second, what is ebwt? The long list of options bowtie provides doesn't really say anything about what ebwt means. Hmmm, looks like we've exhausted the utility of bowtie's default help. Let's check the website. Looks like there is a link on the homepage to another page titled "Getting started". If you look for ebwt on this page, you'll find it under the sections about making an index for the genome you are mapping against. Aha! ebwt must the name of the genome index and it looks like we need to build an index for our genome of interest before we can use bowtie.

Indexing the reference genome using bowtie-build
The "Getting started" webpage mentions using the command bowtie-build to make the index. Let's type this in without any arguments:

$ **bowtie-build** //No input sequence or sequence file specified!// //Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base>// //reference_in comma-separated list of files with ref sequences// //ebwt_outfile_base write Ebwt data to files with this dir/basename//

Okay this makes much more sense. To build the genome index we type the command 'bowtie-build' followed by a list of optional arguments, the filename of the reference genome, and the basename of the files to write the index to. Going back to the "Getting started" webpage, we can see that this command will create four files, all of which start with the basename listed in the bowtie-build command.

We have Illumina data from //S. cerevisiae,// which can be found on the server here: **/global/courses/fall2011/unixworkshop/S288C_mutant.fastq** So, obviously we need to make the index from the //S. cerevisiae// reference genome, found on the server here:
 * /global/courses/fall2011/unixworkshop/S288C_reference_sequence_R64-1-1_20110203.fa**

Putting the command together and deciding to name our index 'S288C', we have: $ **bowtie-build** **/global/courses/fall2011/unixworkshop/S288C_reference_sequence_R64-1-1_20110203.fa** **S288C** //Settings:// //Output files: "S288C.*.ebwt"// //Line rate: 6 (line is 64 bytes)// //Lines per side: 1 (side is 64 bytes)// //Offset rate: 5 (one in 32)// //...// //...//

Note: the additional optional arguments for bowtie-build are mainly used to change the balance between memory usage and speed when building the index. See here.

Last thing to do before actually running bowtie is to move the newly created index files to the 'indexes' subdirectory within the bowtie package folder**:** $ **mv *.ebwt bowtie-0.12.7/indexes/**

IMPORTANT NOTE
If you are working through these examples on your own machine, you can download the //S. cerevisiae// reference sequence here: http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz

**Running bowtie with the default settings**
As we discovered previously, bowtie needs, at a minimum, the basename of the index, the full path to the file containing your sequence reads, and the name of an output file to write the alignments to, in that order. So let's give it a whirl:

$ **bowtie** **bowtie-0.12.7/indexes/S288C /global/courses/fall2011/unixworkshop/S288C_mutant.fastq** **S288C_mutant.bowtie.defaults** //# reads processed: 19803// //...// //...//

As you can see, bowtie outputs a nice summary of the mapping results which you can use to your advantage during the excercises.

Interpreting the bowtie output
Let's take a look at the output file:

$ **less S288C_mutant.bowtie.defaults** <span style="font-family: 'Courier New',Courier,monospace; font-size: 90%;">Frag_2 - chromosome_V 517225 ACTTCACCCAAGGAAACACAAGATCTTCTTCAATCGCCCCAATTT N[SFM<UP[ONMVN[YALCVJWSFSTSV[9XKDQ?QPVAM@P[YIXSZ[HXP 0 <span style="font-family: 'Courier New',Courier,monospace; font-size: 90%;">Frag_3 - chromosome_V 504829 TCCTGATATCATTACTGATATAAGATTATGTCAATTATTGATACC 1>C<H8.RGSEKOVDK3YLJ=EXC3SPXG?UQP'TMWJH7QA2T[QWVTZZU 0 <span style="font-family: 'Courier New',Courier,monospace; font-size: 90%;">Frag_4 + chromosome_V 503427 ACCCGTTTTGTTCTCTTACGTTCAAATACTCTCTAACCTCTTGAG IFH;NEU?<OD9MYWVCEYYQHXSYKYMLD[WZ[TJKVSEHMXOMKWX[YLO 0 <span style="font-family: 'Courier New',Courier,monospace; font-size: 90%;">Frag_8 - chromosome_V 500404 TATATTCAAATTACACTAGTTATTTTATTGTATTCATATATTAAT >07.A;(KEGF>SHKRCM2C@KPLJPV=GFTGABTBFRPV5N;OMTDLGKLR 0 97:T>A <span style="font-family: 'Courier New',Courier,monospace; font-size: 90%;">... <span style="font-family: 'Courier New',Courier,monospace; font-size: 90%;">...

<span style="font-family: Arial,Helvetica,sans-serif;">The bowtie default output has 8 columns: <span style="font-family: Arial,Helvetica,sans-serif;">1) read ID <span style="font-family: Arial,Helvetica,sans-serif;">2) strand <span style="font-family: Arial,Helvetica,sans-serif;">3) scaffold/chromosome <span style="font-family: Arial,Helvetica,sans-serif;">4) zero-based coordinate of location of left-most character of aligned read <span style="font-family: Arial,Helvetica,sans-serif;">5) read sequence <span style="font-family: Arial,Helvetica,sans-serif;">6) read quality scores <span style="font-family: Arial,Helvetica,sans-serif;">7) optional <span style="font-family: Arial,Helvetica,sans-serif;">8) mismatch (if present)

See here for more details.

**Excercises**
Another lab in the department has been studying a mutant in //S. cerevisiae// S288C and have successfully mapped the single point mutation to a 20kb region of the genome. They PCR amplified this region in overlapping segments, pooled the segments and sequenced them on an Illumina machine, 101 basepair single-end reads. Now that you know how to run bowtie and navigate the unix command line, this lab has asked you to align their sequence data back to the reference genome.
 * Premise:**

1. A different option for running bowtie ignores quality scores and requires reads to align end-to-end with no more than a user-specified number of mismatches. Use the command-line help and/or the bowtie manual (on the bowtie website) to figure out how to run bowtie with this option. What might be a good number of mismatches to allow under the scenario outlined above? Rerun bowtie with these options and save the output in a new file.

2. Using the new bowtie output, use one of the unix utilities you learned about this morning to figure out which chromosome this 20 kb region resides on.

3. Using a different utility, or a combination of several utilities to estimate the coordinates of this region on this chromosome.

4. The group that developed Bowtie recently released a new program called Bowtie2. Go to the Bowtie2 webpage and read a bit about it. How is it different from Bowtie? What are the two major approaches that Bowtie2 can use to align your reads?

Download and install this program in your home directory. Run Bowtie2 using both alignment approaches to map the same reads that you previously mapped with Bowtie. Note that you have to build a new index for Bowtie2. How does the total number of reads mapped compare between the two different programs and the two different alignment methods? What might account for the differences?

Bonus question
What is the mutation (i.e. a change from which base to which base)? Hint #1: you can infer this by using the bowtie output from Excercise #1. Hint #2: The vast majority of mismatches between the illumina reads and our reference will be due to either sequencing errors or, for the reads that overlap the mutated site, due to a true sequence difference between the reads and the reference. Sequencing errors will involve all possible types of mismatches. Some types of errors are more common than others due to technical reasons related to illumina library prep and sequencing. In this exercise, all possible types of errors are listed below in order from most common to least common:

<span style="font-family: 'Courier New',Courier,monospace;">T->G ~ A->C MOST COMMON <span style="font-family: 'Courier New',Courier,monospace;"> G->T ~ C->A <span style="font-family: 'Courier New',Courier,monospace;"> A->T ~ T->A <span style="font-family: 'Courier New',Courier,monospace;"> A->G ~ T->C <span style="font-family: 'Courier New',Courier,monospace;"> T->C ~ A->G <span style="font-family: 'Courier New',Courier,monospace;"> G->A ~ C->T <span style="font-family: 'Courier New',Courier,monospace;"> G->C ~ C->G LEAST COMMON

What is the coordinate of the mutation? Hint: it's easier to figure this out if you already know the mutation, but not necessary.

Solutions