Your instructors are: Chris Ellison and Madhavan Ganesh.
IMPORTANT NOTE
The material below is intended for the in-class instruction. Others trying to learn this material offline can follow along on any Unix machine, however, please be aware the file names and path names will be different on your machine.
Topics covered in this module:
Why familiarity with UNIX is necessary for genomic analyses
Navigating the UNIX environment
Basic UNIX utilities
Text editors
Basic data manipulation with UNIX
Introduction and Requirements
Welcome to the CGRL module for Genomic Analysis at the UNIX command line. This module is designed for scientists hoping to analyze genomic datasets, but who have no experience with the UNIX computing environment or common genomic analysis tools.
The module requires that you use a recent MacOS or Linux based system. Future versions of the module may accomodate Microsoft Windows users.
What are you going to learn?
After this morning's session, you will be able to navigate the UNIX environment, allowing you to create, manipulate, and store large data files efficiently. You will find that in many ways, the UNIX environment is designed for speed and simplicity in using large files in a way that is unavailable in more "user-friendly" environments like Microsoft Windows or Mac OS. We will conclude this morning's session with a set of exercises that demonstrate the power of the UNIX environment to generate a basic description of the Saccharomyces cerevisiae genome.
In this afternoon's session, we will extend this morning's lessons to cover a common utility used in high-throughput sequencing, the Bowtie short-read aligner. After this afternoon, you should be able to generate proper output with the Bowtie aligner, and generate a description similar to the one we'll see in this morning's exercise.
What are you not going to learn?
This module is intended to teach very basic skills for the UNIX environment, allowing students to pursue more instruction with other modules that will guide you through the complications and considerations for proper genomic analysis. Accordingly, we will demonstrate the very basics of using one genomic utility (Bowtie), but we will not venture into the algorithms, the considerations, or the statistics behind a proper analysis. We hope that you are interested enough after finishing this module to pursue those questions with our other instructors.
Schedule for today
We are dividing today's module into two segments, each with time dedicated to practical exercises, and with an hour lunch break in between. Given the limited time allotted to cover today's material, it is important that we stay on schedule are return promptly from our break.
Warning: Learning curve ahead!
If you're unfamiliar with a text-based computing environment (some of us are old enough to have started with one as a kid), things may seem a bit archaic at first. That's okay, because UNIX is more than a bit archaic. It's not your fault. So if things seem opaque or difficult to remember, don't fret, and please feel free to ask us questions. In addition, you may want to queue up a couple of websites for quick reference. For example, this Unix Quick Reference might be of use if you want to search for key words to remind you which command does what.
In the longer term, you may appreciate the following text available to us via the UC Berkeley Library: Linux Pocket Guide
For the lecture portion of this course, I would actually recommend something a bit more old-fashioned. You might find it easiest to just have a pen and paper handy and jot down each command we go over along with a brief explanation of what it does. This will come in very useful during the exercises and you won't have to flip back and forth between your terminal screen and your web browser.
Using the UNIX shell
Your time in the UNIX environment will be spent in the shell, which is lingo for the Terminal window, if you're using Mac OS. The shell allows you to move and copy files, run programs, and more. We will also touch on the basic usage of the the text editor emacs.
Informative Interlude: Some notes on the formatting of the lessons for this course
Periodically, we may stop with an informative interlude outlined with a horizontal line above and below (like the one two lines up!). In this case, we're taking a quick break to discuss this and other aspects of the formatting.
For this and all further examples, a $ represents your shell prompt, and boldface indicates the commands to type at the prompt. Italics will be used for output you should see when you take the described action.
This concludes our first informative interlude.
How do I open my terminal window?
In Mac OS, you can open a new terminal window by opening the Finder, selecting the Applications folder, then the Utilities sub-folder, and then double-clicking on the Terminal application. You may want to drag the Terminal icon to your Application Dock for ready access going forward.
Alternatively, you can just use Spotlight to search for "Terminal" each time, and open the application after it is listed in the Spotlight results list.
In Linux environments, your terminal program will depend on the version of Linux you have installed. If you are using Linux and you don't know how to open a terminal, raise your hand and let us know now.
Accessing your CGRL account
The next step is to login to the CGRL system (poset.cgrl.berkeley.edu) with your user account and password. You can do this with the ssh command.
$ ssh cellison@poset.cgrl.berkeley.edu
Whoa, where am I?
When you first log into a system, you should be placed in your home directory. This is a special directory on the system meant just for you.
You can use the pwd command to see the location of the directory you're in, whether it's your home directory or any other directory you end up navigating.
$ pwd
/global/home/cellison
And you can use the cd command to change to another directory.
$ cd /global
$ pwd /global
To return to our home directory, we can simply change back.
$ cd /global/home/cellison
$ pwd /global/home/cellison
Although there is a shortcut for returning to your home directory from any other directory, which is to simply issue the cd command with no directory argument:
$ cd
$ pwd /global/home/cellison
Another useful shorthand for cd is to use two dots after the command. This will take you one directory "up" in the tree. For example, if we want to go to /global/home from our home directory, we can just issue:
$ cd ..
$ pwd /global/home
And we can go all the way the to the top, or root, of the file system tree if we issue this command again.
$ cd ..
$ cd ..
$ pwd
/
But being that far from home might make us nervous, so with one quick command we can go back to our home directory.
$ cd
$ pwd /global/home/cellison
Whew. That's better.
An aside on directories...
Directories in UNIX are set up the same way as any other computer. Just as you would open up a window into your directories and click to open up folders, here you use cd to go through the directories. You are just typing the command instead of clicking!
And if we want to find out what files are available in each folder, we have the command ls.
$ ls
Of course, we have to have files if we want to see them.
ls has many options. Here are some of the more useful ones to know.
The -l argument shows the detailed list of the files. This is similar to the options you're likely familiar with in MacOS or Windows.
$ ls -l
The -t argument will sort the files by their modification time, so you can see which file is the newest (first) or oldest (last) in the directory list.
$ ls -lt
What does the -r argument do?
$ ls -ltr
The same directory short hand for the directory above us in cd can be used for ls:
We can make new directories with the command mkdir.
$ mkdir myFirstDirectory
And then, of course, we can change to that directory.
$ cd myFirstDirectory
Text Editor Interlude
What if we want to create a new text file? The way to do this that is most similar to what you are probably used to doing in a Mac or Windows operating system is to use a text editor.
The downside to this is that unix text editors tend to have quite unintuitive commands, as you will soon see. I prefer the text editor called emacs and I would now like to create
a new file called myFirstFile.
$ emacs myFirstFile
This command let you enter the emacs text editor and you can immediately begin adding text to the file you created.
I will add this to my text file: Hello, from inside emacs!
Now for the tricky part. To save this text file, you have to hold down the control key and then press x followed by s. If you did this correctly, on the bottom left of your screen, you should see this:
Wrote /global/home/cellison/myFirstFile
And now we want to exit emacs to go back to our command line. Again, hold down the control key but this time press x followed by c
Fantastic! You survived the dreaded unix text editor. You will be relieved to learn that unix has a great set of built-in utilities which make it so that you rarely have to use a text editor. This makes sense given that genome sequences, for example, are essentially text files with millions of characters. You definitely aren't going to be editing those by hand!
In unix, there are many different ways to add text to a file as well as look at the current contents of a file.
The command cat, for example, dumps the entire contents of a file to your screen. Lets dump the contents of the file we just made:
$cat myFirstFile Hello, from inside emacs!
The command echo repeats whatever we tell it to (make sure to use single quotation marks).
$ echo 'Hello, World!' Hello, World!
And we can use it to change the contents of our new file if we redirect the output with the > arrow.
$ echo 'Hello, World!' > myFirstFile
Let's see what text is now contained inside myFirstFile.
$ cat myFirstFile Hello, World!
This is a good time to bring up a few salient points:
1. If you redirect output from the command line into a filename that doesn't exist, that file will automatically be created.
2. If you redirect output from the command line into a filename that already exists, that file will be overwritten (as you saw above).
3. WARNING: Unix is very trusting in the sense that, in many cases, it does whatever you tell it to do without warning you when you are about to overwrite a currently existing file. This applies for the redirection arrow as well as the copy and move commands we will learn next.
"...to reproduce is human."
Often, we will merely want to copy or move existing files.
Or, maybe we just want to move (i.e. rename) the file with the mv command.
$ mv myFirstFile myRenamedFile
But no worries if you change your mind, you can always move it back.
$ mv myRenamedFile myFirstFile
Peeking inside files
For the following examples, we're going to need a bit more text in our file. Since we're genomicists, we may as well use a FASTA file.
> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
> seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
> seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
Let's start by copying this text by highlighting.
Then we'll echo the entire block to a new file. We have to add the first and last quotation marks, the > arrow, and the name of our new file we're echoing the text to, test.FASTA
$ echo '> seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC >seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG >seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG >seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT ' > test.FASTA
Now we can check to make sure the file looks we want it to with the cat command.
$ cat test.FASTA > seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC > seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG > seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG > seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT
We've seen how to use the cat command to dump the contents of a file to the screen, but there is a slightly more sophisticated utility, less.
$ less test.FASTA > seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC > seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG > seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG > seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT test.FASTA (END)
Here are some useful navigational tips for less:
You can use the arrow keys to move up or down a line in the text.
The spacebar will advance an entire page.
You can search for a word by typing a slash (e.g. /) followed by the search word.
To quit, type q.
To see the full help screen, type h.
Informative Interlude: UNIX names tend to be overly clever
As you've seen with the basic commands thus far, the names are generally descriptive abbreviations of the program's function. For example, mkdir is for making a directory, ls is for listing the contents of a directory, etc. However, programmers, especially UNIX programmers, tend to get increasingly clever as things progress. Unaware of the fact that this practice makes things opaque, the typically programmer cries out for attention by making program names self-referentially clever. less is a good example of this. In the olden days, the most basic ways to view a text file could not divide files into individual pages, thus a multipage document would scroll off the screen before the first page could be read. As a solution, a program called more was written, which paused at the bottom of each page and prompted the user to press the spacebar for "more." The program name here is reasonably descriptive, but more had some noticeable feature deficiencies: you could neither advance the text one line at a time nor navigate backward in the document without reloading the whole file. The program written to accommodate these features is less. The cleverness of the name is revealed by the paradoxical adage "less is more ."
By default, the head command prints the top 10 lines of the input file. To print a different number, say 5, lines:
$ head test.FASTA > seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC > seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG > seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
$ head -5 test.FASTA > seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC > seq2 This is a description of my second sequence.
And if you should want to see the end of the file, you can use the tail command.
$ tailtest.FASTA GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG > seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG > seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT
As I alluded to earlier, the cat command stands for concatenate, and though we were using it to view the contents of a single file, it was originally intended to do just what its name suggests: put multiple files together.
$ catmySecondFile test.FASTA Hello, World! > seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC > seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG > seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG > seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT
More Utilities
At this point, you may be thinking to yourself, "Hrmmm.... This is all well and good, Mr. Computerpants, but it sure seems like a lot of trouble to go through just to read and write some sequences to a text file. I could have done this at home!" And if so, I say it's hard to blame you. But that's because we're just getting to the good part.
Maybe you were wondering, "Gosh, I wish I could search one of these sequence files for a particular sequence without having to open up a Word document or spreadsheet!" Never fear, grep is here.
grep searches for the a string of text in a file and prints out all lines where it finds the specified text.
$ grep seq test.FASTA > seq1 This is the description of my first sequence. > seq2 This is a description of my second sequence. > seq3 and so on... > seq4
If you wanted to see how many lines contain your query, you can give grep the -c argument.
$ grep -c seq test.FASTA 4
The -v argument inverts the search (i.e. prints lines that do not contain your search string).
Some of the basic functions you would use a complicated speadsheet for can be acheive with the utility sort, which true to its name, sorts the line in a file.
$ sort test.FASTA > seq1 This is the description of my first sequence. > seq2 This is a description of my second sequence. > seq3 and so on... > seq4 AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG CATCGTCAGTTACTGCATGCTCG CGACGTAGATGCTAGCTGACTCGATGC CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
The next utilities we will cover only with a description, but keep them in mind, because it will help you in the exercises later this morning.
Imagine that you have a spreadsheet worth of data, and that you only want to consider the third column of the spreadsheet. Instead of opening Excel and copying and pasting your column to a new file, you could use the cut utility.
$ cut -f 3 filename
This command will return only the third column from a tab-delimited file.
Wildcard Matching with the asterisk ( * )
You may want to use a utility on multiple files, which is possible using the * wildcard character. The simplest example of this is using ls to list multiple files.
$ ls myFirstFile mySecondFile test.FASTA
$ ls *File myFirstFile mySecondFile
$ ls myF*File myFirstFile
In these examples, we can see that the asterisk will match any set of intervening characters.
Controlling Information Flow with the Pipe
UNIX has several features that have no good analogous function in other operating systems, and one such commonly used feature is called the Unix pipe. Often, you will want to use several utilities serially in one process. The pipe allows you to send the output of one program as the input to the next program automatically.
The pipe character is the | located above the backslash key.
If we wanted to search the a list of files, we could send the output of ls to grep with a pipe.
$ cat myCombinedFile Hello, World! > seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC > seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG > seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG > seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT Hello, World!
We can see from this example that the redirection arrow tells UNIX to send the output of the command to a file, instead of the screen.
Downloading Files
Instead of using your web browser to download files from a data resource like NCBI, in UNIX, we have efficient command line utilities for retrieving files. There are many tools for this purpose, but we'll use one common one, wget to retrieve files containing the sequences of the yeast genome.
First, let's make a new directory, called fasta_files.
$ mkdir fasta_files
And then we'll change to the new directory.
$ cd fasta_files
Now we can use the wget command to retrieve the files from the UC Santa Cruz sequence archive.
All but two of these files contain the sequence from a chromosome. The README.txt file contains information about the files, and the md5sum.txt contains information that can be used to verify the integrity of the files (but we won't be concerned with this).
You'll notice that all of the other files end with a double file extension, .fa.gz. File extensions tell you what to expect to be in the file, and in this case they're telling you that the file contains fasta sequence (the .fa part), and that the file has been compressed with the gzip utility, which saves disk space. gzip is only of several utilities available for compression. Others you may encounter include tar and bzip, though there are more.
In order to use the files, however, we need to uncompress them with the utility gunzip. We'll do this once here, and the again in this morning's exercises.
There are a lot of things going on behind the scenes of your UNIX shell. For the most part, Mac OS and modern Linux distributions will shield you from these mechanics, but we want to make you aware of two important concepts: the UNIX Path and your environment variables.
If we issue the env command, we will get something similar to the following output.
What we see here are the values assigned to many environment variables. These are variables defined in your shell to make it easier for UNIX to do what you ask. For example, you can see that the variable USER is defined as cellison. UNIX knows what my name is :-) And that is actually required for me to do anything useful as a user. Similarly, you can see HOME is defined as /global/home/cellison. This is what tells UNIX where my home directory is located. If we change this variable to another directory, and then I issue the cd command to take me home, it will take me to the newly specified
directory instead of my original home directory.
Sometimes this list can be very long and difficult to read, but if we know what we're looking for, we can use other utilities to help us.
$ env | grep home HOME=/global/home/cellison MODULESHOME=/usr/Modules
This afternoon we will modify an important environment variable called PATH. The PATH variable tells UNIX a list of directories that it should put on your favorites list. That is, a list of directories that you don't have to specify the whole name of in order to use the files within them. For example, when we issue the command ls, a the program that runs is actually located in a directory called /bin/, so the program that we're actually running is /bin/ls. However, since /bin/ is in our PATH by default, UNIX knows that when we type ls, we really meant /bin/ls.
In fact, if we use the env command, we can see the list of directories that are in our PATH, and they are separated by colons.
Here we see 6 directories in my PATH, separated by colons in the list.
Help, I'm stuck!
Most commands have many useful flags beyond what we've shown you. For information on a particular command, look at the manual pages with man.
bash-3.2$ man ls
LS(1) BSD General Commands Manual LS(1)
NAME
ls -- list directory contents
SYNOPSIS
ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]
DESCRIPTION
For each operand that names a file of a type other than directory, ls
displays its name as well as any requested, associated information. For
each operand that names a file of type directory, ls displays the names
of files contained within that directory, as well as any requested, asso-
ciated information.
If no operands are given, the contents of the current directory are dis-
played. If more than one operand is given, non-directory operands are
displayed first; directory and non-directory operands are sorted sepa-
rately and in lexicographical order.
The following options are available:
-@ Display extended attribute keys and sizes in long (-l) output.
-1 (The numeric digit ``one''.) Force output to be one entry per
line. This is the default when output is not to a terminal.
-A List all entries except for . and ... Always set for the super-
user.
-a Include directory entries whose names begin with a dot (.).
-B Force printing of non-printable characters (as defined by
ctype(3) and current locale settings) in file names as \xxx,
where xxx is the numeric value of the character in octal.
-b As -B, but use C escape codes whenever possible.
-C Force multi-column output; this is the default when output is to
a terminal.
-c Use time when file status was last changed for sorting (-t) or
long printing (-l).
...
The man reader is essentially less, so you already know how to move up and down, and page by page!
Exercises
1) Cerevisiae chromosomes
a) Change into your fasta_files directory, and unzip the remainder of the fasta files using only a one-line command.
b) Combine the sequence files to make a single whole genome file called "cerevisiae_genome.fasta"
d) Count the chromosomes in the whole genome file using commands from the lecture. (HINT: Each of the original FASTA files contains a single chromosome).
e) Look up the command wc and find out what it does. Get size of total genome. (HINT: The size of the genome can be determined by counting the number of bases).
2) Cerevisiae genomic features
a) Get the list of cerevisiae chromosome features from this address:
1. Primary Stanford Gene Database ID (SGDID) (mandatory)
2. Feature type (mandatory)
3. Feature qualifier (optional)
4. Feature name (optional)
5. Standard gene name (optional)
6. Alias (optional, multiples separated by |)
7. Parent feature name (optional)
8. Secondary SGDID (optional, multiples separated by |)
9. Chromosome (optional)
10. Start_coordinate (optional)
11. Stop_coordinate (optional)
12. Strand (optional)
13. Genetic position (optional)
14. Coordinate version (optional)
15. Sequence version (optional)
16. Description (optional)
b) Next, count the total number of ORFs in the features file.
c) Now count the ORFs listed as Verified, and separately count those listed as Dubious. If you do the math and add up these two groups, why don't they make up all the ORF matches from grep?
d) Use the cut utility with grep to count the number of ORFs more accurately.
e) Use a combination of the the commands cut, sort, and uniq generate a list of all the genomic features (column 2) in this file. Don't forget that you can use the man pages to learn how to use new tools.
3) In depth with grep
Is there a way to tell grep to ONLY print the part of the line that matches the pattern you give it?
What does egrep allow you to do?
Use your newfound knowledge of grep from the previous two questions, combined with some of the other unix utilities you learned about earlier, to produce a table that lists every length of GA microsatellite repeat in the yeast genome, along with the number of occurrences of each. You can do this with a single command, piping output between the different unix utilities.
HINT #1: You will need to use an extra argument with the uniq command, one that we haven't covered yet.
HINT #2: Regular expressions provide the flexibility to match patterns rather than a specific string. You can use special characters to construct a regular expression, for example, the plus sign means match one or more times, so the pattern 'H+' will match the H in HAT but also the HHHHHHHH in AHHHHHHHH!!!!!!! The period means match any character so, from the previous example, 'H.+' would match all of HAT as well as HHHHHHHH!!!!!!!. You can also group characters together with parentheses. For example, '(I..)+' would match ISSISSIPP in MISSISSIPPI. Remember to enclose your pattern in 'tick marks' just like with echo.
CAVEAT: Assume that it is safe to ignore the fact that a repeat could possibly span multiple lines in your FASTA file
4) Save your table as a new file. Open the file in emacs and add column headers and/or anything else you can think of to make it pretty. Save the file and exit.
Genomic Analysis at the UNIX Command Line
Your instructors are: Chris Ellison and Madhavan Ganesh.
IMPORTANT NOTE
The material below is intended for the in-class instruction. Others trying to learn this material offline can follow along on any Unix machine, however, please be aware the file names and path names will be different on your machine.Topics covered in this module:
Introduction and Requirements
Welcome to the CGRL module for Genomic Analysis at the UNIX command line. This module is designed for scientists hoping to analyze genomic datasets, but who have no experience with the UNIX computing environment or common genomic analysis tools.
The module requires that you use a recent MacOS or Linux based system. Future versions of the module may accomodate Microsoft Windows users.
What are you going to learn?
After this morning's session, you will be able to navigate the UNIX environment, allowing you to create, manipulate, and store large data files efficiently. You will find that in many ways, the UNIX environment is designed for speed and simplicity in using large files in a way that is unavailable in more "user-friendly" environments like Microsoft Windows or Mac OS. We will conclude this morning's session with a set of exercises that demonstrate the power of the UNIX environment to generate a basic description of the Saccharomyces cerevisiae genome.
In this afternoon's session, we will extend this morning's lessons to cover a common utility used in high-throughput sequencing, the Bowtie short-read aligner. After this afternoon, you should be able to generate proper output with the Bowtie aligner, and generate a description similar to the one we'll see in this morning's exercise.
What are you not going to learn?
This module is intended to teach very basic skills for the UNIX environment, allowing students to pursue more instruction with other modules that will guide you through the complications and considerations for proper genomic analysis. Accordingly, we will demonstrate the very basics of using one genomic utility (Bowtie), but we will not venture into the algorithms, the considerations, or the statistics behind a proper analysis. We hope that you are interested enough after finishing this module to pursue those questions with our other instructors.
Schedule for today
We are dividing today's module into two segments, each with time dedicated to practical exercises, and with an hour lunch break in between. Given the limited time allotted to cover today's material, it is important that we stay on schedule are return promptly from our break.
Warning: Learning curve ahead!
If you're unfamiliar with a text-based computing environment (some of us are old enough to have started with one as a kid), things may seem a bit archaic at first. That's okay, because UNIX is more than a bit archaic. It's not your fault. So if things seem opaque or difficult to remember, don't fret, and please feel free to ask us questions. In addition, you may want to queue up a couple of websites for quick reference. For example, this Unix Quick Reference might be of use if you want to search for key words to remind you which command does what.
In the longer term, you may appreciate the following text available to us via the UC Berkeley Library: Linux Pocket Guide
For the lecture portion of this course, I would actually recommend something a bit more old-fashioned. You might find it easiest to just have a pen and paper handy and jot down each command we go over along with a brief explanation of what it does. This will come in very useful during the exercises and you won't have to flip back and forth between your terminal screen and your web browser.
Using the UNIX shell
Your time in the UNIX environment will be spent in the shell, which is lingo for the Terminal window, if you're using Mac OS. The shell allows you to move and copy files, run programs, and more. We will also touch on the basic usage of the the text editor emacs.
Informative Interlude: Some notes on the formatting of the lessons for this course
Periodically, we may stop with an informative interlude outlined with a horizontal line above and below (like the one two lines up!). In this case, we're taking a quick break to discuss this and other aspects of the formatting.
For this and all further examples, a $ represents your shell prompt, and boldface indicates the commands to type at the prompt. Italics will be used for output you should see when you take the described action.
This concludes our first informative interlude.
How do I open my terminal window?
In Mac OS, you can open a new terminal window by opening the Finder, selecting the Applications folder, then the Utilities sub-folder, and then double-clicking on the Terminal application. You may want to drag the Terminal icon to your Application Dock for ready access going forward.
Alternatively, you can just use Spotlight to search for "Terminal" each time, and open the application after it is listed in the Spotlight results list.
In Linux environments, your terminal program will depend on the version of Linux you have installed. If you are using Linux and you don't know how to open a terminal, raise your hand and let us know now.
Accessing your CGRL account
The next step is to login to the CGRL system (poset.cgrl.berkeley.edu) with your user account and password. You can do this with the ssh command.
$ ssh cellison@poset.cgrl.berkeley.edu
Whoa, where am I?
When you first log into a system, you should be placed in your home directory. This is a special directory on the system meant just for you.
You can use the pwd command to see the location of the directory you're in, whether it's your home directory or any other directory you end up navigating.
$ pwd
/global/home/cellison
And you can use the cd command to change to another directory.
$ cd /global
$ pwd
/global
To return to our home directory, we can simply change back.
$ cd /global/home/cellison
$ pwd
/global/home/cellison
Although there is a shortcut for returning to your home directory from any other directory, which is to simply issue the cd command with no directory argument:
$ cd
$ pwd
/global/home/cellison
Another useful shorthand for cd is to use two dots after the command. This will take you one directory "up" in the tree. For example, if we want to go to /global/home from our home directory, we can just issue:
$ cd ..
$ pwd
/global/home
And we can go all the way the to the top, or root, of the file system tree if we issue this command again.
$ cd ..
$ cd ..
$ pwd
/
But being that far from home might make us nervous, so with one quick command we can go back to our home directory.
$ cd
$ pwd
/global/home/cellison
Whew. That's better.
An aside on directories...
Directories in UNIX are set up the same way as any other computer. Just as you would open up a window into your directories and click to open up folders, here you use cd to go through the directories. You are just typing the command instead of clicking!
And if we want to find out what files are available in each folder, we have the command ls.
$ ls
Of course, we have to have files if we want to see them.
$ ls /global/home
aalmeida aminoda cbendix dhuang fzapata hrichard jtaylor khayden ktakahas mguising nkong sbranco tlee ylin
adarob apfeiffe cellison dkhendrix ganeshm jaspden jtepperm khockett lpu mstrzele pimentel scolmena tlevin yzhang
aihardin bdebebe cschartn epurdom gergana jcraig jyoung kkim mcohn msuarez rbart sruddy tyamaguc zsung
akassen bwheeler dchandra ewhiston hgoodman jstillma kbenton kmuriki mdavis mtaliafe rthomas tkaplan xfeng
ls has many options. Here are some of the more useful ones to know.
The -l argument shows the detailed list of the files. This is similar to the options you're likely familiar with in MacOS or Windows.
$ ls -l
The -t argument will sort the files by their modification time, so you can see which file is the newest (first) or oldest (last) in the directory list.
$ ls -lt
What does the -r argument do?
$ ls -ltr
The same directory short hand for the directory above us in cd can be used for ls:
$ ls ..
aalmeida aminoda cbendix dhuang fzapata hrichard jtaylor khayden ktakahas mguising nkong sbranco tlee ylin
adarob apfeiffe cellison dkhendrix ganeshm jaspden jtepperm khockett lpu mstrzele pimentel scolmena tlevin yzhang
aihardin bdebebe cschartn epurdom gergana jcraig jyoung kkim mcohn msuarez rbart sruddy tyamaguc zsung
akassen bwheeler dchandra ewhiston hgoodman jstillma kbenton kmuriki mdavis mtaliafe rthomas tkaplan xfeng
"To create is divine..."
We can make new directories with the command mkdir.
$ mkdir myFirstDirectory
And then, of course, we can change to that directory.
$ cd myFirstDirectory
Text Editor Interlude
What if we want to create a new text file? The way to do this that is most similar to what you are probably used to doing in a Mac or Windows operating system is to use a text editor.The downside to this is that unix text editors tend to have quite unintuitive commands, as you will soon see. I prefer the text editor called emacs and I would now like to create
a new file called myFirstFile.
$ emacs myFirstFile
This command let you enter the emacs text editor and you can immediately begin adding text to the file you created.
I will add this to my text file: Hello, from inside emacs!
Now for the tricky part. To save this text file, you have to hold down the control key and then press x followed by s. If you did this correctly, on the bottom left of your screen, you should see this:
Wrote /global/home/cellison/myFirstFile
And now we want to exit emacs to go back to our command line. Again, hold down the control key but this time press x followed by c
Fantastic! You survived the dreaded unix text editor. You will be relieved to learn that unix has a great set of built-in utilities which make it so that you rarely have to use a text editor. This makes sense given that genome sequences, for example, are essentially text files with millions of characters. You definitely aren't going to be editing those by hand!
In unix, there are many different ways to add text to a file as well as look at the current contents of a file.
The command cat, for example, dumps the entire contents of a file to your screen. Lets dump the contents of the file we just made:
$cat myFirstFile
Hello, from inside emacs!
The command echo repeats whatever we tell it to (make sure to use single quotation marks).
$ echo 'Hello, World!'
Hello, World!
And we can use it to change the contents of our new file if we redirect the output with the > arrow.
$ echo 'Hello, World!' > myFirstFile
Let's see what text is now contained inside myFirstFile.
$ cat myFirstFile
Hello, World!
This is a good time to bring up a few salient points:
1. If you redirect output from the command line into a filename that doesn't exist, that file will automatically be created.
2. If you redirect output from the command line into a filename that already exists, that file will be overwritten (as you saw above).
3. WARNING: Unix is very trusting in the sense that, in many cases, it does whatever you tell it to do without warning you when you are about to overwrite a currently existing file. This applies for the redirection arrow as well as the copy and move commands we will learn next.
"...to reproduce is human."
Often, we will merely want to copy or move existing files.
To copy, we use the command cp.
$ cp myFirstFile mySecondFile
$ ls -lt
-rw-r--r-- 1 cellison cellison 14 Sep 5 14:22 mySecondFile
-rw-r--r-- 1 cellison cellison 14 Sep 5 14:16 myFirstFile
Or, maybe we just want to move (i.e. rename) the file with the mv command.
$ mv myFirstFile myRenamedFile
But no worries if you change your mind, you can always move it back.
$ mv myRenamedFile myFirstFile
Peeking inside files
For the following examples, we're going to need a bit more text in our file. Since we're genomicists, we may as well use a FASTA file.
Let's start by copying this text by highlighting.
Then we'll echo the entire block to a new file. We have to add the first and last quotation marks, the > arrow, and the name of our new file we're echoing the text to, test.FASTA
$ echo '> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
' > test.FASTA
Now we can check to make sure the file looks we want it to with the cat command.
$ cat test.FASTA
> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
> seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
> seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
We've seen how to use the cat command to dump the contents of a file to the screen, but there is a slightly more sophisticated utility, less.
$ less test.FASTA
> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
> seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
> seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
test.FASTA (END)
Here are some useful navigational tips for less:
Informative Interlude: UNIX names tend to be overly clever
As you've seen with the basic commands thus far, the names are generally descriptive abbreviations of the program's function. For example, mkdir is for making a directory, ls is for listing the contents of a directory, etc. However, programmers, especially UNIX programmers, tend to get increasingly clever as things progress. Unaware of the fact that this practice makes things opaque, the typically programmer cries out for attention by making program names self-referentially clever. less is a good example of this. In the olden days, the most basic ways to view a text file could not divide files into individual pages, thus a multipage document would scroll off the screen before the first page could be read. As a solution, a program called more was written, which paused at the bottom of each page and prompted the user to press the spacebar for "more." The program name here is reasonably descriptive, but more had some noticeable feature deficiencies: you could neither advance the text one line at a time nor navigate backward in the document without reloading the whole file. The program written to accommodate these features is less. The cleverness of the name is revealed by the paradoxical adage "less is more ."By default, the head command prints the top 10 lines of the input file. To print a different number, say 5, lines:
$ head test.FASTA
> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
> seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
$ head -5 test.FASTA
> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is a description of my second sequence.
And if you should want to see the end of the file, you can use the tail command.
$ tail test.FASTA
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
> seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
> seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
As I alluded to earlier, the cat command stands for concatenate, and though we were using it to view the contents of a single file, it was originally intended to do just what its name suggests: put multiple files together.
$ cat mySecondFile test.FASTA
Hello, World!
> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
> seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
> seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
More Utilities
At this point, you may be thinking to yourself, "Hrmmm.... This is all well and good, Mr. Computerpants, but it sure seems like a lot of trouble to go through just to read and write some sequences to a text file. I could have done this at home!" And if so, I say it's hard to blame you. But that's because we're just getting to the good part.
Maybe you were wondering, "Gosh, I wish I could search one of these sequence files for a particular sequence without having to open up a Word document or spreadsheet!" Never fear, grep is here.
grep searches for the a string of text in a file and prints out all lines where it finds the specified text.
$ grep seq test.FASTA
> seq1 This is the description of my first sequence.
> seq2 This is a description of my second sequence.
> seq3 and so on...
> seq4
If you wanted to see how many lines contain your query, you can give grep the -c argument.
$ grep -c seq test.FASTA
4
The -v argument inverts the search (i.e. prints lines that do not contain your search string).
$ grep -v seq test.FASTA
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
Some of the basic functions you would use a complicated speadsheet for can be acheive with the utility sort, which true to its name, sorts the line in a file.
$ sort test.FASTA
> seq1 This is the description of my first sequence.
> seq2 This is a description of my second sequence.
> seq3 and so on...
> seq4
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
CATCGTCAGTTACTGCATGCTCG
CGACGTAGATGCTAGCTGACTCGATGC
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
The next utilities we will cover only with a description, but keep them in mind, because it will help you in the exercises later this morning.
Imagine that you have a spreadsheet worth of data, and that you only want to consider the third column of the spreadsheet. Instead of opening Excel and copying and pasting your column to a new file, you could use the cut utility.
$ cut -f 3 filename
This command will return only the third column from a tab-delimited file.
Wildcard Matching with the asterisk ( * )
You may want to use a utility on multiple files, which is possible using the * wildcard character. The simplest example of this is using ls to list multiple files.
$ ls
myFirstFile mySecondFile test.FASTA
$ ls *File
myFirstFile mySecondFile
$ ls myF*File
myFirstFile
In these examples, we can see that the asterisk will match any set of intervening characters.
Controlling Information Flow with the Pipe
UNIX has several features that have no good analogous function in other operating systems, and one such commonly used feature is called the Unix pipe. Often, you will want to use several utilities serially in one process. The pipe allows you to send the output of one program as the input to the next program automatically.
The pipe character is the | located above the backslash key.
If we wanted to search the a list of files, we could send the output of ls to grep with a pipe.
$ ls -lt | grep my
-rw-r--r-- 1 cellison cellison 14 Sep 5 14:22 mySecondFile
-rw-r--r-- 1 cellison cellison 14 Sep 5 14:16 myFirstFile
Here, we see the two files in our directory listing that contain the search term "my" returned by grep.
Redirection
Another built-in UNIX tool is called redirection, and we've already seen it once today (I just didn't bother to explain it earlier!).
When we copied and pasted our sequences to test.FASTA, we used redirection with the > arrow.
$ cat myFirstFile test.FASTA mySecondFile > myCombinedFile
$ cat myCombinedFile
Hello, World!
> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
> seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
> seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
Hello, World!
We can see from this example that the redirection arrow tells UNIX to send the output of the command to a file, instead of the screen.
Downloading Files
Instead of using your web browser to download files from a data resource like NCBI, in UNIX, we have efficient command line utilities for retrieving files. There are many tools for this purpose, but we'll use one common one, wget to retrieve files containing the sequences of the yeast genome.
First, let's make a new directory, called fasta_files.
$ mkdir fasta_files
And then we'll change to the new directory.
$ cd fasta_files
Now we can use the wget command to retrieve the files from the UC Santa Cruz sequence archive.
$ wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer2/chromosomes/*'
--12:43:53-- ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer2/chromosomes/*
=> `.listing'
Resolving hgdownload.cse.ucsc.edu... 128.114.119.148
Connecting to hgdownload.cse.ucsc.edu|128.114.119.148|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD /goldenPath/sacCer2/chromosomes ... done.
==> PASV ... done. ==> LIST ... done.
[ <=> ] 1,482 --.--K/s
12:43:53 (214.76 KB/s) - `.listing' saved [1482]
Removed `.listing'.
--12:43:53-- ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer2/chromosomes/2micron.fa.gz
=> `2micron.fa.gz'
...
Now, when we look at what's happened to our directory, we'll see a list of files:
$ ls
2micron.fa.gz chrII.fa.gz chrIX.fa.gz chrVI.fa.gz chrX.fa.gz chrXIII.fa.gz chrXVI.fa.gz
README.txt chrIII.fa.gz chrM.fa.gz chrVII.fa.gz chrXI.fa.gz chrXIV.fa.gz md5sum.txt
chrI.fa.gz chrIV.fa.gz chrV.fa.gz chrVIII.fa.gz chrXII.fa.gz chrXV.fa.gz
All but two of these files contain the sequence from a chromosome. The README.txt file contains information about the files, and the md5sum.txt contains information that can be used to verify the integrity of the files (but we won't be concerned with this).
You'll notice that all of the other files end with a double file extension, .fa.gz. File extensions tell you what to expect to be in the file, and in this case they're telling you that the file contains fasta sequence (the .fa part), and that the file has been compressed with the gzip utility, which saves disk space. gzip is only of several utilities available for compression. Others you may encounter include tar and bzip, though there are more.
In order to use the files, however, we need to uncompress them with the utility gunzip. We'll do this once here, and the again in this morning's exercises.
$ gunzip 2micron.fa.gz
$ ls
2micron.fa chrII.fa.gz chrIX.fa.gz chrVI.fa.gz chrX.fa.gz chrXIII.fa.gz chrXVI.fa.gz
README.txt chrIII.fa.gz chrM.fa.gz chrVII.fa.gz chrXI.fa.gz chrXIV.fa.gz md5sum.txt
chrI.fa.gz chrIV.fa.gz chrV.fa.gz chrVIII.fa.gz chrXII.fa.gz chrXV.fa.gz
Notice that the only change is the name of the file we uncompressed, 2micron.fa.gz has become 2micron.fa
And if we take a peak at the beginning of the file with head, we can see that indeed looks like a fasta file:
$ head 2micron.fa
ACCTGCGGGCCGTCTAAAAATTAAGGAAAAGCAGCAAAGGTGCATTTTTA
AAATATGAAATGAAGATACCGCAGTACCAATTATTTTCGCAGTACAAATA
ATGCGCGGCCGGTGCATTTTTCGAAAGAACGCGAGACAAACAGGACAATT
AAAGTTAGTTTTTCGAGTTAGCGTGTTTGAATACTGCAAGATACAAGATA
AATAGAGTAGTTGAAACTAGATATCAATTGCACACAAGATCGGCGCTAAG
CATGCCACAATTTGGTATATTATGTAAAACACCACCTAAGGTGCTTGTTC
GTCAGTTTGTGGAAAGGTTTGAAAGACCTTCAGGTGAGAAAATAGCATTA
TGTGCTGCTGAACTAACCTATTTATGTTGGATGATTACACATAACGGAAC
AGCAATCAAGAGAGCCACATTCATGAGCTATAATACTATCATAAGCAATT
The UNIX Path and Environment Variables
There are a lot of things going on behind the scenes of your UNIX shell. For the most part, Mac OS and modern Linux distributions will shield you from these mechanics, but we want to make you aware of two important concepts: the UNIX Path and your environment variables.
If we issue the env command, we will get something similar to the following output.
$ env
HOSTNAME=poset.cgrl.berkeley.edu
TERM=xterm-color
SHELL=/bin/bash
HISTSIZE=1000
SSH_CLIENT=169.229.195.27 55475 22
SSH_TTY=/dev/pts/0
USER=cellison
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:
MAIL=/var/spool/mail/cellison
PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/global/home/cellison/bin
INPUTRC=/etc/inputrc
PWD=/global/home/cellison/myFirstDirectory/fasta_files
LANG=C
MODULEPATH=/usr/Modules/modulefiles:/usr/cports/modulefiles/redhat-5.x86_64:/etc/modulefiles:/global/software/centos-5.x86_64/modules/modfiles
LOADEDMODULES=
SHLVL=1
HOME=/global/home/cellison
LOGNAME=cellison
SSH_CONNECTION=169.229.195.27 55475 169.229.192.160 22
MODULESHOME=/usr/Modules
LESSOPEN=|/usr/bin/lesspipe.sh %s
G_BROKEN_FILENAMES=1
module=() { eval `/usr/Modules/bin/modulecmd bash $*`
}
_=/bin/env
OLDPWD=/global/home/aihardin/myFirstDirectory
What we see here are the values assigned to many environment variables. These are variables defined in your shell to make it easier for UNIX to do what you ask. For example, you can see that the variable USER is defined as cellison. UNIX knows what my name is :-) And that is actually required for me to do anything useful as a user. Similarly, you can see HOME is defined as /global/home/cellison. This is what tells UNIX where my home directory is located. If we change this variable to another directory, and then I issue the cd command to take me home, it will take me to the newly specified
directory instead of my original home directory.
Sometimes this list can be very long and difficult to read, but if we know what we're looking for, we can use other utilities to help us.
$ env | grep home
HOME=/global/home/cellison
MODULESHOME=/usr/Modules
This afternoon we will modify an important environment variable called PATH. The PATH variable tells UNIX a list of directories that it should put on your favorites list. That is, a list of directories that you don't have to specify the whole name of in order to use the files within them. For example, when we issue the command ls, a the program that runs is actually located in a directory called /bin/, so the program that we're actually running is /bin/ls. However, since /bin/ is in our PATH by default, UNIX knows that when we type ls, we really meant /bin/ls.
In fact, if we use the env command, we can see the list of directories that are in our PATH, and they are separated by colons.
$ env | grep PATH
PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/global/home/cellison/bin
MODULEPATH=/usr/Modules/modulefiles:/usr/cports/modulefiles/redhat-5.x86_64:/etc/modulefiles:/global/software/centos-5.x86_64/modules/modfiles
Here we see 6 directories in my PATH, separated by colons in the list.
Help, I'm stuck!
Most commands have many useful flags beyond what we've shown you. For information on a particular command, look at the manual pages with man.
The man reader is essentially less, so you already know how to move up and down, and page by page!
Exercises
1) Cerevisiae chromosomes
a) Change into your fasta_files directory, and unzip the remainder of the fasta files using only a one-line command.
b) Combine the sequence files to make a single whole genome file called "cerevisiae_genome.fasta"
d) Count the chromosomes in the whole genome file using commands from the lecture. (HINT: Each of the original FASTA files contains a single chromosome).
e) Look up the command wc and find out what it does. Get size of total genome. (HINT: The size of the genome can be determined by counting the number of bases).
2) Cerevisiae genomic features
a) Get the list of cerevisiae chromosome features from this address:
http://gonzo.qb3.berkeley.edu/~matt/SGD_features.tab
Columns within SGD_features.tab:
1. Primary Stanford Gene Database ID (SGDID) (mandatory)
2. Feature type (mandatory)
3. Feature qualifier (optional)
4. Feature name (optional)
5. Standard gene name (optional)
6. Alias (optional, multiples separated by |)
7. Parent feature name (optional)
8. Secondary SGDID (optional, multiples separated by |)
9. Chromosome (optional)
10. Start_coordinate (optional)
11. Stop_coordinate (optional)
12. Strand (optional)
13. Genetic position (optional)
14. Coordinate version (optional)
15. Sequence version (optional)
16. Description (optional)
b) Next, count the total number of ORFs in the features file.
c) Now count the ORFs listed as Verified, and separately count those listed as Dubious. If you do the math and add up these two groups, why don't they make up all the ORF matches from grep?
d) Use the cut utility with grep to count the number of ORFs more accurately.
e) Use a combination of the the commands cut, sort, and uniq generate a list of all the genomic features (column 2) in this file. Don't forget that you can use the man pages to learn how to use new tools.
3) In depth with grep
Is there a way to tell grep to ONLY print the part of the line that matches the pattern you give it?
What does egrep allow you to do?
Use your newfound knowledge of grep from the previous two questions, combined with some of the other unix utilities you learned about earlier, to produce a table that lists every length of GA microsatellite repeat in the yeast genome, along with the number of occurrences of each. You can do this with a single command, piping output between the different unix utilities.
Example:
5000 GA
3000 GAGA
1000 GAGAGA
100 GAGAGAGA
10 GAGAGAGAGA
...
...
HINT #1: You will need to use an extra argument with the uniq command, one that we haven't covered yet.
HINT #2: Regular expressions provide the flexibility to match patterns rather than a specific string. You can use special characters to construct a regular expression, for example, the plus sign means match one or more times, so the pattern 'H+' will match the H in HAT but also the HHHHHHHH in AHHHHHHHH!!!!!!! The period means match any character so, from the previous example, 'H.+' would match all of HAT as well as HHHHHHHH!!!!!!!. You can also group characters together with parentheses. For example, '(I..)+' would match ISSISSIPP in MISSISSIPPI. Remember to enclose your pattern in 'tick marks' just like with echo.
CAVEAT: Assume that it is safe to ignore the fact that a repeat could possibly span multiple lines in your FASTA file
4) Save your table as a new file. Open the file in emacs and add column headers and/or anything else you can think of to make it pretty. Save the file and exit.
Solutions