Genomic Analysis at the UNIX Command Line (Afternoon session)
Your instructors are: Chris Ellison and Madhavan Ganesh.
Installing and running software on the command line: an example using Bowtie
Assuming you are working on an organism whose genome has already been sequenced, the first step for most types of experiments that involve next-gen sequence is to align your sequence reads to a reference genome.
One popular, fast, and easy to use short-read aligner is Bowtie
Unfortunately, there is not enough time in this module to get into the nitty-gritty of aligning short reads to a genome assembly but if you are familiar with BLAST, Bowtie takes a similar approach with a few tweaks thrown in including: speed-ups to make possible the alignment of millions of query sequences in a reasonable amount of time, optimizing the aligning process to specifically deal with short query sequences, and taking advantage of the quality scores that accompany the sequence reads. For more information, check out the Bowtie paper.
Downloading and installing software
Before we can run Bowtie, we need to install it and before we can install it, we need to download it. To download Bowtie, follow the link on the Bowtie homepage, which takes you to sourceforge.
It looks like there are several options depending on which operating system you are running.
Digression: source code versus pre-compiled binary
There are generally two formats that software will be available in for download: source code and binaries. The source code is the human-readable code that the developer(s) wrote which needs to be compiled into a binary before it can be read by your machine. In many cases developers will have already compiled their source code for a handful of different operating systems and will make those "pre-compiled binaries" available in addition to the source code. This saves you the step of having to compile the source code yourself, a more advanced procedure that we won't be able to cover in this workshop. Usually the source code will be labeled with "source" or "src" while the pre-compiled binaries will have the name of the operating system that they are intended for in their label.
So, it looks like the developers of bowtie have already compiled it for both macOS and linux, making it easy for us. However, we still have one more decision to make. You can see that there are two binaries available for both macOS and linux, one labeled i386 and the other x86_64.
Another digression: chip architecture
When downloading software, you will often find binaries labeled with either i386 or x86_64. Without going too deep into details, these numbers refer to the type of processor the binaries were compiled for. Luckily, there is an easy way to figure out which type of processor is on the machine you are installing to. If you are installing to your own local machine (mac or linux) open a new terminal window. If you are installing to a remote server, make sure you are logged on to the server and in the same window, type:
$ uname -a
Linux poset.cgrl.berkeley.edu 2.6.18-238.12.1.el5 #1 SMP Tue May 31 13:22:04 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
Look through the output for either i386 or x86_64. You can see that in this case, the server we are all logged into has the x86_64 architecture and thus, the binary with that label is the one we should download.
Ok, now that we know which binary we need, what is the easiest way to download the file onto the server?
Downloading bowtie using wget
$ wget -O bowtie-0.12.7-linux-x86_64.ziphttp://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip/download Resolving surfnet.dl.sourceforge.net... 130.59.138.21, 2001:620:0:1b::21 Connecting to surfnet.dl.sourceforge.net|130.59.138.21|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 10531770 (10M) [application/zip] Saving to: `bowtie-0.12.7-linux-x86_64.zip'
100%[===============================================================================================================================================================>] 10,531,770 2.76M/s in 4.5s
wget reports the download progress and a bunch of other stats about the download. Type 'ls' and you should now see a file named 'bowtie-0.12.7-linux-x86_64.zip' in your directory.
Unzipping the downloaded file
We can see that the file we downloaded ends in .zip which tells us that this file (or group of files) has been compressed using a utility called zip. To uncompress it type:
Digression: Cheatsheet showing various methods for compressing/uncompressing files and packaging/unpackaging directories
Although the bowtie binary happens to be stored in a zipped directory, it's actually more common to find downloads (data such as genome sequences or other software) that have been packaged and compressed using a utility called tar. Consult the chart below for the command to unpackage such a file.
Goal
Command name
Syntax
Extension
compress file with zip (fast, less efficient compression)
zip
zip output-filename.zip input-filename
.zip
uncompress with zip
unzip
unzip filename.zip
.zip
compress file with gzip (slower, more efficient compression)
gzip
gzip filename
.gz
uncompress with gzip
gunzip
gunzip filename
.gz
compress file with bzip2 (slowest, most efficient compression)
bzip2
bzip2 filename
.bz2
uncompress with bzip2
bunzip2
bunzip2 filename
.bz2
archive a directory of files and compress with gzip
tar
tar -czf output-filename.tar.gz input-directory
.tar.gz
unpack a directory of files that is compressed with gzip
tar
tar -xzf filename.tar.gz
.tar.gz
archive a directory of files and compress with bzip2
tar
tar -cjf output-filename.tar.bz2 input-directory
.tar.bz2
unpack a directory of files that is compressed with bzip2
tar
tar -xjf filename.tar.bz2
.tar.bz2
Okay, we downloaded and uncompressed bowtie. Will it work now?
$ bowtie bowtie: command not found
Why doesn't it work?
Modifying PATH
The bowtie binary is located inside the directory bowtie-0.12.7/ Is this directory in our PATH? From this morning:
$ env | grep PATH PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/global/home/cellison/bin MODULEPATH=...*IGNORE THIS LINE FOR NOW*
So, bowtie is not in your path. That explains it. The bowtie binary is now on the server, but the machine doesn't know where to find it. We need to add the bowtie folder to PATH. There is a special file located in your home directory called .bash_profile that specifies PATH. To add to PATH, we can use the text editor emacs to edit this file.
$ emacs~/.bash_profile
# .bash_profile # Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi # User specific environment and startup programs PATH=$PATH:$HOME/bin export PATH
Scroll down to the line shown in red above and modify it so that it looks like this: PATH=$PATH:$HOME/bin:$HOME/bowtie-0.12.7
Do you remember from this morning's session how to save and exit from emacs?
After exiting emacs, reload the modified profile:
$ source ~/.bash_profile
Now we should be able to run bowtie.
Running a program on the command line
When you used grep this morning, you had to type three things into the command: the name of the executable itself, what you wanted to search for, and the name of the file you wanted to search.
How do you know how to structure your command for a program you've never used before, like say, Bowtie?
You have a couple of options. You could try to find documentation on the website or inside the package that you downloaded but it's often easiest to just start by simply typing in the name of the executable:
<m1> Comma-separated list of files containing upstream mates (or the sequences themselves, if -c is set) paired with mates in <m2> <m2> Comma-separated list of files containing downstream mates (or the sequences themselves if -c is set) paired with mates in <m1> <r> Comma-separated list of files containing Crossbow-style reads. Can be a mixture of paired and unpaired. Specify "-" for stdin. <s> Comma-separated list of files containing unpaired reads, or the sequences themselves, if -c is set. Specify "-" for stdin. <hit> File to write hits to (default: stdout) ... ... ...
Bowtie gets angry because we tried to run it without giving it all the information it needs to run, but it also conveniently tells us how to structure our command and lists a whole slew of options we can use to control how it will go about aligning our reads.
Important Digression:
Bowtie cannot read your mind. Neither can any other unix program. At least not yet...(queue evil laughter)
But seriously, most bioinformatics tools are setup such that you have a great degree of control over exactly how they will go about doing what it is they were created to do. This is great but it is important to be aware that the only way such a program is going to do exactly what you want it to is if you specify all those details.
However, to make it easy on you, most of those options come preset with default values that seemed reasonable for X, where X can be the human genome, vertebrates, E. coli or whatever random number the developers had in mind when they wrote the code. This can be very dangerous and result in wacky results if you don't take the time to figure out what the different options available to you mean and what values for those options are appropriate for your data.
Okay, so back to running Bowtie. Let's do exactly what the above digression told you not to do, that is, run bowtie with its default settings.
So, from the usage statement it looks like you run bowtie by typing 'bowtie', followed by whichever options you want to specify, followed by ebwt, then your sequence datafiles (specified differently depending upon how they are organized, i.e. mate pairs split into different files, mate pairs in the same file, etc) then the name of the file to print the alignments to.
Errr, wait a second, what is ebwt? The long list of options bowtie provides doesn't really say anything about what ebwt means. Hmmm, looks like we've exhausted the utility of bowtie's default help. Let's check the website. Looks like there is a link on the homepage to another page titled "Getting started". If you look for ebwt on this page, you'll find it under the sections about making an index for the genome you are mapping against. Aha! ebwt must the name of the genome index and it looks like we need to build an index for our genome of interest before we can use bowtie.
Indexing the reference genome using bowtie-build
The "Getting started" webpage mentions using the command bowtie-build to make the index. Let's type this in without any arguments:
$ bowtie-build No input sequence or sequence file specified! Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base> reference_in comma-separated list of files with ref sequences ebwt_outfile_base write Ebwt data to files with this dir/basename
Okay this makes much more sense. To build the genome index we type the command 'bowtie-build' followed by a list of optional arguments, the filename of the reference genome, and the basename of the files to write the index to. Going back to the "Getting started" webpage, we can see that this command will create four files, all of which start with the basename listed in the bowtie-build command.
We have Illumina data from S. cerevisiae,which can be found on the server here: /global/courses/fall2011/unixworkshop/S288C_mutant.fastq
So, obviously we need to make the index from the S. cerevisiae reference genome, found on the server here: /global/courses/fall2011/unixworkshop/S288C_reference_sequence_R64-1-1_20110203.fa
Putting the command together and deciding to name our index 'S288C', we have:
$ bowtie-build/global/courses/fall2011/unixworkshop/S288C_reference_sequence_R64-1-1_20110203.faS288C Settings: Output files: "S288C.*.ebwt" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 5 (one in 32) ... ...
Note: the additional optional arguments for bowtie-build are mainly used to change the balance between memory usage and speed when building the index. See here.
Last thing to do before actually running bowtie is to move the newly created index files to the 'indexes' subdirectory within the bowtie package folder:
$ mv *.ebwt bowtie-0.12.7/indexes/
As we discovered previously, bowtie needs, at a minimum, the basename of the index, the full path to the file containing your sequence reads, and the name of an output file to write the alignments to, in that order. So let's give it a whirl:
Discussion: Mapping short reads to a reference genome
Excercises
Premise:
Another lab in the department has been studying a mutant in S. cerevisiae S288C and have successfully mapped the single point mutation to a 20kb region of the genome. They PCR amplified this region in overlapping segments, pooled the segments and sequenced them on an Illumina machine, 101 basepair single-end reads. Now that you know how to run bowtie and navigate the unix command line, this lab has asked you to align their sequence data back to the reference genome.
1. A different option for running bowtie ignores quality scores and requires reads to align end-to-end with no more than a user-specified number of mismatches. Use the command-line help and/or the bowtie manual (on the bowtie website) to figure out how to run bowtie with this option. What might be a good number of mismatches to allow under the scenario outlined above? Rerun bowtie with these options and save the output in a new file.
2. Using the new bowtie output, use one of the unix utilities you learned about this morning to figure out which chromosome this 20 kb region resides on.
3. Using a different utility, or a combination of several utilities to estimate the coordinates of this region on this chromosome.
4. The group that developed Bowtie recently released a new program called Bowtie2. Go to the Bowtie2 webpage and read a bit about it. How is it different from Bowtie? What are the two major approaches that Bowtie2 can use to align your reads?
Download and install this program in your home directory. Run Bowtie2 using both alignment approaches to map the same reads that you previously mapped with Bowtie. Note that you have to build a new index for Bowtie2. How does the total number of reads mapped compare between the two different programs and the two different alignment methods? What might account for the differences?
Bonus question
What is the mutation (i.e. a change from which base to which base)?
Hint #1: you can infer this by using the bowtie output from Excercise #1.
Hint #2: The vast majority of mismatches between the illumina reads and our reference will be due to either sequencing errors or, for the reads that overlap the mutated site, due to a true sequence difference between the reads and the reference. Sequencing errors will involve all possible types of mismatches. Some types of errors are more common than others due to technical reasons related to illumina library prep and sequencing. In this exercise, all possible types of errors are listed below in order from most common to least common:
T->G ~ A->C MOST COMMON G->T ~ C->A A->T ~ T->A A->G ~ T->C T->C ~ A->G G->A ~ C->T G->C ~ C->G LEAST COMMON
What is the coordinate of the mutation? Hint: it's easier to figure this out if you already know the mutation, but not necessary.
Genomic Analysis at the UNIX Command Line (Afternoon session)
Your instructors are: Chris Ellison and Madhavan Ganesh.
Installing and running software on the command line: an example using Bowtie
Assuming you are working on an organism whose genome has already been sequenced, the first step for most types of experiments that involve next-gen sequence is to align your sequence reads to a reference genome.
One popular, fast, and easy to use short-read aligner is Bowtie
Unfortunately, there is not enough time in this module to get into the nitty-gritty of aligning short reads to a genome assembly but if you are familiar with BLAST, Bowtie takes a similar approach with a few tweaks thrown in including: speed-ups to make possible the alignment of millions of query sequences in a reasonable amount of time, optimizing the aligning process to specifically deal with short query sequences, and taking advantage of the quality scores that accompany the sequence reads. For more information, check out the Bowtie paper.
Downloading and installing software
Before we can run Bowtie, we need to install it and before we can install it, we need to download it. To download Bowtie, follow the link on the Bowtie homepage, which takes you to sourceforge.It looks like there are several options depending on which operating system you are running.
Digression: source code versus pre-compiled binary
There are generally two formats that software will be available in for download: source code and binaries. The source code is the human-readable code that the developer(s) wrote which needs to be compiled into a binary before it can be read by your machine. In many cases developers will have already compiled their source code for a handful of different operating systems and will make those "pre-compiled binaries" available in addition to the source code. This saves you the step of having to compile the source code yourself, a more advanced procedure that we won't be able to cover in this workshop. Usually the source code will be labeled with "source" or "src" while the pre-compiled binaries will have the name of the operating system that they are intended for in their label.So, it looks like the developers of bowtie have already compiled it for both macOS and linux, making it easy for us. However, we still have one more decision to make. You can see that there are two binaries available for both macOS and linux, one labeled i386 and the other x86_64.
Another digression: chip architecture
When downloading software, you will often find binaries labeled with either i386 or x86_64. Without going too deep into details, these numbers refer to the type of processor the binaries were compiled for. Luckily, there is an easy way to figure out which type of processor is on the machine you are installing to. If you are installing to your own local machine (mac or linux) open a new terminal window. If you are installing to a remote server, make sure you are logged on to the server and in the same window, type:$ uname -a
Linux poset.cgrl.berkeley.edu 2.6.18-238.12.1.el5 #1 SMP Tue May 31 13:22:04 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
Look through the output for either i386 or x86_64. You can see that in this case, the server we are all logged into has the x86_64 architecture and thus, the binary with that label is the one we should download.
Ok, now that we know which binary we need, what is the easiest way to download the file onto the server?
Downloading bowtie using wget
$ wget -O bowtie-0.12.7-linux-x86_64.zip http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip/download
Resolving surfnet.dl.sourceforge.net... 130.59.138.21, 2001:620:0:1b::21
Connecting to surfnet.dl.sourceforge.net|130.59.138.21|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10531770 (10M) [application/zip]
Saving to: `bowtie-0.12.7-linux-x86_64.zip'
100%[===============================================================================================================================================================>] 10,531,770 2.76M/s in 4.5s
2011-09-06 12:10:29 (2.23 MB/s) - `bowtie-0.12.7-linux-x86_64.zip' saved [10531770/10531770]
wget reports the download progress and a bunch of other stats about the download. Type 'ls' and you should now see a file named 'bowtie-0.12.7-linux-x86_64.zip' in your directory.
Unzipping the downloaded file
We can see that the file we downloaded ends in .zip which tells us that this file (or group of files) has been compressed using a utility called zip. To uncompress it type:
$ unzip bowtie-0.12.7-linux-x86_64.zip
Archive: bowtie-0.12.7-linux-x86_64.zip
creating: bowtie-0.12.7/
creating: bowtie-0.12.7/scripts/
inflating: bowtie-0.12.7/scripts/build_test.sh
inflating: bowtie-0.12.7/scripts/make_a_thaliana_tair.sh
...
...
Let's take a look inside the directory:
$ ls -l bowtie-0.12.7/
total 12420
-rw-r--r-- 1 cellison cellison 703 2009-12-12 06:29 AUTHORS
-rwxr-xr-x 1 cellison cellison 773677 2010-09-07 14:20 bowtie
-rwxr-xr-x 1 cellison cellison 327545 2010-09-07 14:19 bowtie-build
-rwxr-xr-x 1 cellison cellison 2683055 2010-09-07 14:19 bowtie-build-debug
-rwxr-xr-x 1 cellison cellison 6930626 2010-09-07 14:19 bowtie-debug
-rwxr-xr-x 1 cellison cellison 238105 2010-09-07 14:19 bowtie-inspect
-rwxr-xr-x 1 cellison cellison 1521358 2010-09-07 14:19 bowtie-inspect-debug
-rw-r--r-- 1 cellison cellison 5207 2008-08-13 09:00 COPYING
drwxr-xr-x 2 cellison cellison 4096 2010-09-07 14:20 doc
drwxr-xr-x 2 cellison cellison 4096 2010-09-07 14:20 genomes
drwxr-xr-x 2 cellison cellison 4096 2011-09-06 10:16 indexes
-rw-r--r-- 1 cellison cellison 69183 2010-09-07 13:54 MANUAL
-rw-r--r-- 1 cellison cellison 80488 2010-09-07 13:47 MANUAL.markdown
-rw-r--r-- 1 cellison cellison 29755 2010-09-07 13:47 NEWS
drwxr-xr-x 2 cellison cellison 4096 2010-09-07 14:20 reads
drwxr-xr-x 3 cellison cellison 4096 2010-09-07 14:20 scripts
-rw-r--r-- 1 cellison cellison 6258 2009-10-05 14:55 TUTORIAL
-rw-r--r-- 1 cellison cellison 6 2010-09-07 13:47 VERSION
Digression: Cheatsheet showing various methods for compressing/uncompressing files and packaging/unpackaging directories
Although the bowtie binary happens to be stored in a zipped directory, it's actually more common to find downloads (data such as genome sequences or other software) that have been packaged and compressed using a utility called tar. Consult the chart below for the command to unpackage such a file.
Okay, we downloaded and uncompressed bowtie. Will it work now?
$ bowtie
bowtie: command not found
Why doesn't it work?
Modifying PATH
The bowtie binary is located inside the directory bowtie-0.12.7/ Is this directory in our PATH? From this morning:$ env | grep PATH
PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/global/home/cellison/bin
MODULEPATH=...*IGNORE THIS LINE FOR NOW*
So, bowtie is not in your path. That explains it. The bowtie binary is now on the server, but the machine doesn't know where to find it. We need to add the bowtie folder to PATH. There is a special file located in your home directory called .bash_profile that specifies PATH. To add to PATH, we can use the text editor emacs to edit this file.
$ emacs ~/.bash_profile
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH
Scroll down to the line shown in red above and modify it so that it looks like this:
PATH=$PATH:$HOME/bin:$HOME/bowtie-0.12.7
Do you remember from this morning's session how to save and exit from emacs?
After exiting emacs, reload the modified profile:
$ source ~/.bash_profile
Now we should be able to run bowtie.
Running a program on the command line
When you used grep this morning, you had to type three things into the command: the name of the executable itself, what you wanted to search for, and the name of the file you wanted to search.How do you know how to structure your command for a program you've never used before, like say, Bowtie?
You have a couple of options. You could try to find documentation on the website or inside the package that you downloaded but it's often easiest to just start by simply typing in the name of the executable:
$ bowtie
No index, query, or output file specified!
Usage:
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]
<m1> Comma-separated list of files containing upstream mates (or the
sequences themselves, if -c is set) paired with mates in <m2>
<m2> Comma-separated list of files containing downstream mates (or the
sequences themselves if -c is set) paired with mates in <m1>
<r> Comma-separated list of files containing Crossbow-style reads. Can be
a mixture of paired and unpaired. Specify "-" for stdin.
<s> Comma-separated list of files containing unpaired reads, or the
sequences themselves, if -c is set. Specify "-" for stdin.
<hit> File to write hits to (default: stdout)
...
...
...
Bowtie gets angry because we tried to run it without giving it all the information it needs to run, but it also conveniently tells us how to structure our command and lists a whole slew of options we can use to control how it will go about aligning our reads.
Important Digression:
Bowtie cannot read your mind. Neither can any other unix program. At least not yet...(queue evil laughter)
But seriously, most bioinformatics tools are setup such that you have a great degree of control over exactly how they will go about doing what it is they were created to do. This is great but it is important to be aware that the only way such a program is going to do exactly what you want it to is if you specify all those details.
However, to make it easy on you, most of those options come preset with default values that seemed reasonable for X, where X can be the human genome, vertebrates, E. coli or whatever random number the developers had in mind when they wrote the code. This can be very dangerous and result in wacky results if you don't take the time to figure out what the different options available to you mean and what values for those options are appropriate for your data.
Okay, so back to running Bowtie. Let's do exactly what the above digression told you not to do, that is, run bowtie with its default settings.
So, from the usage statement it looks like you run bowtie by typing 'bowtie', followed by whichever options you want to specify, followed by ebwt, then your sequence datafiles (specified differently depending upon how they are organized, i.e. mate pairs split into different files, mate pairs in the same file, etc) then the name of the file to print the alignments to.
Errr, wait a second, what is ebwt? The long list of options bowtie provides doesn't really say anything about what ebwt means. Hmmm, looks like we've exhausted the utility of bowtie's default help. Let's check the website. Looks like there is a link on the homepage to another page titled "Getting started". If you look for ebwt on this page, you'll find it under the sections about making an index for the genome you are mapping against. Aha! ebwt must the name of the genome index and it looks like we need to build an index for our genome of interest before we can use bowtie.
Indexing the reference genome using bowtie-build
The "Getting started" webpage mentions using the command bowtie-build to make the index. Let's type this in without any arguments:$ bowtie-build
No input sequence or sequence file specified!
Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base>
reference_in comma-separated list of files with ref sequences
ebwt_outfile_base write Ebwt data to files with this dir/basename
Okay this makes much more sense. To build the genome index we type the command 'bowtie-build' followed by a list of optional arguments, the filename of the reference genome, and the basename of the files to write the index to. Going back to the "Getting started" webpage, we can see that this command will create four files, all of which start with the basename listed in the bowtie-build command.
We have Illumina data from S. cerevisiae, which can be found on the server here: /global/courses/fall2011/unixworkshop/S288C_mutant.fastq
So, obviously we need to make the index from the S. cerevisiae reference genome, found on the server here:
/global/courses/fall2011/unixworkshop/S288C_reference_sequence_R64-1-1_20110203.fa
Putting the command together and deciding to name our index 'S288C', we have:
$ bowtie-build /global/courses/fall2011/unixworkshop/S288C_reference_sequence_R64-1-1_20110203.fa S288C
Settings:
Output files: "S288C.*.ebwt"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 5 (one in 32)
...
...
Note: the additional optional arguments for bowtie-build are mainly used to change the balance between memory usage and speed when building the index. See here.
Last thing to do before actually running bowtie is to move the newly created index files to the 'indexes' subdirectory within the bowtie package folder:
$ mv *.ebwt bowtie-0.12.7/indexes/
IMPORTANT NOTE
If you are working through these examples on your own machine, you can download the S. cerevisiae reference sequence here:http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_R64-1-1_20110203.tgz
Running bowtie with the default settings
As we discovered previously, bowtie needs, at a minimum, the basename of the index, the full path to the file containing your sequence reads, and the name of an output file to write the alignments to, in that order. So let's give it a whirl:$ bowtie bowtie-0.12.7/indexes/S288C /global/courses/fall2011/unixworkshop/S288C_mutant.fastq S288C_mutant.bowtie.defaults
# reads processed: 19803
...
...
As you can see, bowtie outputs a nice summary of the mapping results which you can use to your advantage during the excercises.
Interpreting the bowtie output
Let's take a look at the output file:
$ less S288C_mutant.bowtie.defaults
Frag_2 - chromosome_V 517225 ACTTCACCCAAGGAAACACAAGATCTTCTTCAATCGCCCCAATTT N[SFM<UP[ONMVN[YALCVJWSFSTSV[9XKDQ?QPVAM@P[YIXSZ[HXP 0
Frag_3 - chromosome_V 504829 TCCTGATATCATTACTGATATAAGATTATGTCAATTATTGATACC 1>C<H8.RGSEKOVDK3YLJ=EXC3SPXG?UQP'TMWJH7QA2T[QWVTZZU 0
Frag_4 + chromosome_V 503427 ACCCGTTTTGTTCTCTTACGTTCAAATACTCTCTAACCTCTTGAG IFH;NEU?<OD9MYWVCEYYQHXSYKYMLD[WZ[TJKVSEHMXOMKWX[YLO 0
Frag_8 - chromosome_V 500404 TATATTCAAATTACACTAGTTATTTTATTGTATTCATATATTAAT >07.A;(KEGF>SHKRCM2C@KPLJPV=GFTGABTBFRPV5N;OMTDLGKLR 0 97:T>A
...
...
The bowtie default output has 8 columns:
1) read ID
2) strand
3) scaffold/chromosome
4) zero-based coordinate of location of left-most character of aligned read
5) read sequence
6) read quality scores
7) optional
8) mismatch (if present)
See here for more details.
Discussion: Mapping short reads to a reference genome
Excercises
Premise:
Another lab in the department has been studying a mutant in S. cerevisiae S288C and have successfully mapped the single point mutation to a 20kb region of the genome. They PCR amplified this region in overlapping segments, pooled the segments and sequenced them on an Illumina machine, 101 basepair single-end reads. Now that you know how to run bowtie and navigate the unix command line, this lab has asked you to align their sequence data back to the reference genome.
1. A different option for running bowtie ignores quality scores and requires reads to align end-to-end with no more than a user-specified number of mismatches. Use the command-line help and/or the bowtie manual (on the bowtie website) to figure out how to run bowtie with this option. What might be a good number of mismatches to allow under the scenario outlined above? Rerun bowtie with these options and save the output in a new file.
2. Using the new bowtie output, use one of the unix utilities you learned about this morning to figure out which chromosome this 20 kb region resides on.
3. Using a different utility, or a combination of several utilities to estimate the coordinates of this region on this chromosome.
4. The group that developed Bowtie recently released a new program called Bowtie2. Go to the Bowtie2 webpage and read a bit about it. How is it different from Bowtie? What are the two major approaches that Bowtie2 can use to align your reads?
Download and install this program in your home directory. Run Bowtie2 using both alignment approaches to map the same reads that you previously mapped with Bowtie. Note that you have to build a new index for Bowtie2. How does the total number of reads mapped compare between the two different programs and the two different alignment methods? What might account for the differences?
Bonus question
What is the mutation (i.e. a change from which base to which base)?Hint #1: you can infer this by using the bowtie output from Excercise #1.
Hint #2: The vast majority of mismatches between the illumina reads and our reference will be due to either sequencing errors or, for the reads that overlap the mutated site, due to a true sequence difference between the reads and the reference. Sequencing errors will involve all possible types of mismatches. Some types of errors are more common than others due to technical reasons related to illumina library prep and sequencing. In this exercise, all possible types of errors are listed below in order from most common to least common:
T->G ~ A->C MOST COMMON
G->T ~ C->A
A->T ~ T->A
A->G ~ T->C
T->C ~ A->G
G->A ~ C->T
G->C ~ C->G LEAST COMMON
What is the coordinate of the mutation? Hint: it's easier to figure this out if you already know the mutation, but not necessary.
Solutions