This is additional material that we will cover if time permits.


Installing and running software on the command line: an example using Bowtie


Assuming you are working on an organism whose genome has already been sequenced, the first step for most types of experiments that involve next-gen sequencing is to align your sequence reads to a reference genome.

One popular, fast, and easy-to-use short-read aligner is Bowtie.

Unfortunately, there is not enough time in this module to get into the nitty-gritty of aligning short reads to a genome assembly. If you are familiar with BLAST, though, Bowtie takes a broadly similar approach with a few tweaks thrown in: speed-ups that make it possible to align millions of query sequences in a reasonable amount of time, an alignment process optimized specifically for short query sequences, and use of the quality scores that accompany the sequence reads. For more information, check out the Bowtie paper.

Downloading and installing software

Before we can run Bowtie, we need to install it, and before we can install it, we need to download it. To download Bowtie, follow the link on the Bowtie homepage, which takes you to SourceForge.

It looks like there are several options depending on which operating system you are running.


Digression: source code versus pre-compiled binary

Software is generally available for download in two formats: source code and binaries. The source code is the human-readable code that the developer(s) wrote, which needs to be compiled into a binary before it can be run by your machine. In many cases developers will have already compiled their source code for a handful of different operating systems and will make those "pre-compiled binaries" available in addition to the source code. This saves you the step of having to compile the source code yourself, a more advanced procedure that we won't be able to cover in this workshop. Usually the source code will be labeled with "source" or "src" while the pre-compiled binaries will have the name of the operating system that they are intended for in their label.


So, it looks like the developers of bowtie have already compiled it for both macOS and linux, making it easy for us. However, we still have one more decision to make. You can see that there are two binaries available for both macOS and linux, one labeled i386 and the other x86_64.


Another digression: chip architecture

When downloading software, you will often find binaries labeled with either i386 or x86_64. Without going too deep into details, these labels refer to the type of processor the binaries were compiled for: i386 means 32-bit Intel-compatible processors and x86_64 means 64-bit ones. Luckily, there is an easy way to figure out which type of processor is in the machine you are installing to. If you are installing to your own local machine (Mac or Linux), open a new terminal window. If you are installing to a remote server, make sure you are logged on to the server. Then, in that window, type:

$ uname -a

Linux poset.cgrl.berkeley.edu 2.6.18-238.12.1.el5 #1 SMP Tue May 31 13:22:04 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux


Look through the output for either i386 or x86_64. You can see that in this case, the server we are all logged into has the x86_64 architecture and thus, the binary with that label is the one we should download.
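If you would rather not pick the architecture out of the full uname -a line by eye, uname -m prints just the machine hardware name. The sketch below maps that name onto the two labels used on the download page. The variable names arch and label are only for illustration, and any machine type other than the two discussed here is reported as "unknown".

```shell
# Print only the machine hardware name (for example x86_64).
arch=$(uname -m)

# Map the hardware name onto the labels used for the binaries.
case "$arch" in
    x86_64) label="x86_64" ;;   # 64-bit
    i?86)   label="i386" ;;     # 32-bit (i386, i486, i586, i686)
    *)      label="unknown" ;;  # anything else: check the download page by hand
esac

echo "machine: $arch, binary label: $label"
```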


Ok, now that we know which binary we need, what is the easiest way to download the file onto the server?

$ wget -O bowtie-0.12.9-linux-x86_64.zip http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/bowtie-0.12.9-linux-x86_64.zip/download
--2013-02-11 06:23:59-- http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/bowtie-0.12.9-linux-x86_64.zip/download
Resolving sourceforge.net... 216.34.181.60
Connecting to sourceforge.net|216.34.181.60|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://downloads.sourceforge.net/project/bowtie-bio/bowtie/0.12.9/bowtie-0.12.9-linux-x86_64.zip?r=&ts=1360592640&use_mirror=voxel [following]
--2013-02-11 06:24:00-- http://downloads.sourceforge.net/project/bowtie-bio/bowtie/0.12.9/bowtie-0.12.9-linux-x86_64.zip?r=&ts=1360592640&use_mirror=voxel
Resolving downloads.sourceforge.net... 216.34.181.59
Connecting to downloads.sourceforge.net|216.34.181.59|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://voxel.dl.sourceforge.net/project/bowtie-bio/bowtie/0.12.9/bowtie-0.12.9-linux-x86_64.zip [following]
--2013-02-11 06:24:00-- http://voxel.dl.sourceforge.net/project/bowtie-bio/bowtie/0.12.9/bowtie-0.12.9-linux-x86_64.zip
Resolving voxel.dl.sourceforge.net... 107.6.88.167, 107.6.92.102, 107.6.92.101
Connecting to voxel.dl.sourceforge.net|107.6.88.167|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10432225 (9.9M) [application/zip]
Saving to: `bowtie-0.12.9-linux-x86_64.zip'
100%[======================================>] 10,432,225 3.42M/s in 2.9s
2013-02-11 06:24:03 (3.42 MB/s) - `bowtie-0.12.9-linux-x86_64.zip' saved [10432225/10432225]

wget reports the download progress and a bunch of other stats about the download. Type 'ls' and you should now see a file named 'bowtie-0.12.9-linux-x86_64.zip' in your directory.

Unzipping the downloaded file


We can see that the file we downloaded ends in .zip, which tells us that this file (or group of files) has been compressed using a utility called zip. To uncompress it, type:

$ unzip bowtie-0.12.9-linux-x86_64.zip
Archive: bowtie-0.12.9-linux-x86_64.zip
creating: bowtie-0.12.9/
creating: bowtie-0.12.9/scripts/
inflating: bowtie-0.12.9/scripts/build_test.sh
inflating: bowtie-0.12.9/scripts/make_a_thaliana_tair.sh
...
...

Let's take a look inside the directory:


$ ls -l bowtie-0.12.9

total 12004
-rw-r--r-- 1 mganesh cgrl 703 Dec 12 2009 AUTHORS
-rw-r--r-- 1 mganesh cgrl 5207 Aug 13 2008 COPYING
-rw-r--r-- 1 mganesh cgrl 69556 Dec 15 19:11 MANUAL
-rw-r--r-- 1 mganesh cgrl 80863 Dec 15 19:11 MANUAL.markdown
-rw-r--r-- 1 mganesh cgrl 30715 Dec 15 19:11 NEWS
-rw-r--r-- 1 mganesh cgrl 6258 Oct 5 2009 TUTORIAL
-rw-r--r-- 1 mganesh cgrl 6 Dec 15 19:11 VERSION
-rwxr-xr-x 1 mganesh cgrl 744331 Dec 16 11:37 bowtie
-rwxr-xr-x 1 mganesh cgrl 327131 Dec 16 11:36 bowtie-build
-rwxr-xr-x 1 mganesh cgrl 2661665 Dec 16 11:36 bowtie-build-debug
-rwxr-xr-x 1 mganesh cgrl 6570169 Dec 16 11:37 bowtie-debug
-rwxr-xr-x 1 mganesh cgrl 238154 Dec 16 11:36 bowtie-inspect
-rwxr-xr-x 1 mganesh cgrl 1520824 Dec 16 11:36 bowtie-inspect-debug
drwxr-xr-x 2 mganesh cgrl 53 Dec 16 11:37 doc
drwxr-xr-x 2 mganesh cgrl 26 Dec 16 11:37 genomes
drwxr-xr-x 2 mganesh cgrl 154 Dec 16 11:37 indexes
drwxr-xr-x 2 mganesh cgrl 4096 Dec 16 11:37 reads
drwxr-xr-x 3 mganesh cgrl 4096 Dec 16 11:37 scripts




Digression: Cheatsheet showing various methods for compressing/uncompressing files and packaging/unpackaging directories


Although the bowtie binary happens to be stored in a zipped directory, it's actually more common to find downloads (data such as genome sequences or other software) that have been packaged using a utility called tar and compressed with gzip or bzip2. Consult the chart below for the command to unpackage such a file.

| Goal | Command name | Syntax | Extension |
|---|---|---|---|
| compress file with zip (fast, less efficient compression) | zip | zip output-filename.zip input-filename | .zip |
| uncompress with zip | unzip | unzip filename.zip | .zip |
| compress file with gzip (slower, more efficient compression) | gzip | gzip filename | .gz |
| uncompress with gzip | gunzip | gunzip filename | .gz |
| compress file with bzip2 (slowest, most efficient compression) | bzip2 | bzip2 filename | .bz2 |
| uncompress with bzip2 | bunzip2 | bunzip2 filename | .bz2 |
| archive a directory of files and compress with gzip | tar | tar -czf output-filename.tar.gz input-directory | .tar.gz |
| unpack a directory of files that is compressed with gzip | tar | tar -xzf filename.tar.gz | .tar.gz |
| archive a directory of files and compress with bzip2 | tar | tar -cjf output-filename.tar.bz2 input-directory | .tar.bz2 |
| unpack a directory of files that is compressed with bzip2 | tar | tar -xjf filename.tar.bz2 | .tar.bz2 |
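To see the tar commands from the chart in action without touching any real data, here is a small round trip in a scratch directory. The names demo-dir and seq.txt are made up for this example. The -t option, which lists the contents of an archive without unpacking it, is not in the chart but is handy for peeking inside a download first.

```shell
# Work in a throwaway directory so nothing important gets overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Make a small directory with one file in it.
mkdir demo-dir
echo "ACGT" > demo-dir/seq.txt

# Archive the directory and compress it with gzip.
tar -czf demo-dir.tar.gz demo-dir

# List what is inside the archive without unpacking it.
tar -tzf demo-dir.tar.gz

# Remove the original directory, then get it back from the archive.
rm -r demo-dir
tar -xzf demo-dir.tar.gz
cat demo-dir/seq.txt
```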




Okay, we downloaded and uncompressed bowtie. Will it work now?

$ bowtie

bowtie: command not found

Why doesn't it work?

Modifying PATH


The bowtie binary is located inside the directory bowtie-0.12.9/. Is this directory in our PATH? From this morning:


$ env | grep PATH

PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/global/home/mganesh/bin
MODULEPATH=...*IGNORE THIS LINE FOR NOW*

So, bowtie is not in your path. That explains it. The bowtie binary is now on the server, but the machine doesn't know where to find it. We need to add the bowtie folder to PATH. There is a special file located in your home directory called .bash_profile that specifies PATH. To add to PATH, we can use the text editor emacs to edit this file.

$ emacs ~/.bash_profile

# .bash_profile
# Get the aliases and functions

if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH

Scroll down to the line that begins with PATH= and modify it so that it looks like this (assuming you unzipped Bowtie in your home directory):

PATH=$PATH:$HOME/bin:$HOME/bowtie-0.12.9

Do you remember from this morning's session how to save and exit from emacs?

After exiting emacs, reload the modified profile:

$ source ~/.bash_profile

Now we should be able to run bowtie.
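If you want to watch the PATH lookup happen on something harmless first, the sketch below creates a tiny script in a scratch directory and shows that the shell can only find it once that directory has been added to PATH. The name hello-tool is made up for this example. command -v prints where the shell would find a command, and prints nothing if it can't find it, so you can use the same check on bowtie.

```shell
# Create a scratch directory containing one small executable script.
tooldir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from hello-tool"\n' > "$tooldir/hello-tool"
chmod +x "$tooldir/hello-tool"

# The directory is not in PATH yet, so the shell cannot find the script.
command -v hello-tool || echo "hello-tool: not found in PATH yet"

# Append the directory to PATH, just like the edit to .bash_profile.
PATH=$PATH:$tooldir
export PATH

# Now the lookup succeeds and the script runs by name alone.
command -v hello-tool
hello-tool
```

Running command -v bowtie after sourcing your profile should likewise print the full path to the bowtie binary.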

Running a program on the command line


When you used grep this morning, you had to type three things into the command: the name of the executable itself, what you wanted to search for, and the name of the file you wanted to search.
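As a reminder of that three-part structure, here is a small example you can run anywhere. The file demo_reads.txt and its contents are made up for this example.

```shell
# Make a small file to search.
printf 'ACGTACGT\nTTTTGGGG\nACGTTTTT\n' > demo_reads.txt

# executable   what to search for   file to search
grep           ACGT                 demo_reads.txt

# Options go between the executable and its arguments;
# -c counts the matching lines instead of printing them.
grep -c ACGT demo_reads.txt
```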

How do you know how to structure your command for a program you've never used before, like say, Bowtie?

You have a couple of options. You could try to find documentation on the website or inside the package that you downloaded, but it's often easiest to start by simply typing in the name of the executable:

$ bowtie

No index, query, or output file specified!
Usage:
  bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]

  <m1>    Comma-separated list of files containing upstream mates (or the
          sequences themselves, if -c is set) paired with mates in <m2>
  <m2>    Comma-separated list of files containing downstream mates (or the
          sequences themselves if -c is set) paired with mates in <m1>
  <r>     Comma-separated list of files containing Crossbow-style reads. Can be
          a mixture of paired and unpaired. Specify "-" for stdin.
  <s>     Comma-separated list of files containing unpaired reads, or the
          sequences themselves, if -c is set. Specify "-" for stdin.
  <hit>   File to write hits to (default: stdout)
...
...
...


Bowtie gets angry because we tried to run it without giving it all the information it needs to run, but it also conveniently tells us how to structure our command and lists a whole slew of options we can use to control how it will go about aligning our reads.
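To make the usage line a little more concrete, here is a rough sketch of how the pieces for unpaired reads fit together: an index name for <ebwt>, a reads file for <s>, and an output file for <hit>. The specific names are assumptions. e_coli and reads/e_coli_1000.fq are guesses at what the indexes/ and reads/ directories of the package contain (run ls on them to see what is actually there), and e_coli.map is just a made-up output file name. The command is only run if bowtie and the reads file can be found; otherwise it is printed.

```shell
# Fill in the placeholders from the usage line (all three names are assumptions).
index="e_coli"                # <ebwt>: basename of a pre-built index
reads="reads/e_coli_1000.fq"  # <s>: file of unpaired reads
hits="e_coli.map"             # <hit>: where to write the alignments

cmd="bowtie $index $reads $hits"

# Only run the alignment if bowtie is in PATH and the reads file exists here.
if command -v bowtie >/dev/null 2>&1 && [ -f "$reads" ]; then
    bowtie "$index" "$reads" "$hits"
else
    echo "would run: $cmd"
fi
```

Leaving off the last argument would send the hits to the screen instead, since the usage message says <hit> defaults to stdout.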