The material below is intended for the in-class instruction. Others trying to learn this material offline can follow along on any Unix (like) machine, however, please be aware the file names and path names will be different on your machine. The material here is adapted to a large extent from prior workshops at CGRL and presented by other instructors.
Topics covered in this module:
Why familiarity with UNIX is necessary for genomic analyses
Navigating the UNIX environment
Basic UNIX utilities
Text editors
Basic data manipulation with UNIX
Introduction and Requirements
Welcome to the CGRL module for Genomic Analysis at the UNIX command line. This module is designed for scientists hoping to analyze genomic datasets, but who have no experience with the UNIX computing environment or common genomic analysis tools.
The module requires that you use a recent MacOS or Linux based system. Future versions of the module may accomodate Microsoft Windows users.
What are you going to learn?
After this morning's session, you will be able to navigate the UNIX environment, allowing you to create, manipulate, and store large data files efficiently. You will find that in many ways, the UNIX environment is designed for speed and simplicity in using large files in a way that is unavailable in more "user-friendly" environments like Microsoft Windows or Mac OS. We will conclude this morning's session with a set of exercises that demonstrate the power of the UNIX environment to generate a basic description of the Saccharomyces cerevisiae genome. Further, if time permits we will go through the process of downloading and installing a genomic utility (Bowtie) to familiarize users with the execution environment.
What are you not going to learn?
This module is intended to teach very basic skills for the UNIX environment, allowing students to pursue more instruction with other modules that will guide you through the complications and considerations for proper genomic analysis. Accordingly, we will try to demonstrate the very basics of using one genomic utility (Bowtie), but we will not venture into the algorithms, the considerations, or the statistics behind a proper analysis. We hope that you are interested enough after finishing this module to pursue those questions with our other instructors.
Schedule for today
We will go along with a interleaving of description of the topic and a few hands-on exercises that attendees are expected follow along. To derive benefit from this workshop please try to keep up with the topic being discussed. During the hands-on exercises we will try to make sure that every on is in sync.
Warning: Learning curve ahead!
If you're unfamiliar with a text-based computing environment (some of us are old enough to have started with one as a kid), things may seem a bit archaic at first. That's okay, because UNIX is more than a bit archaic. It's not your fault. So if things seem opaque or difficult to remember, don't fret, and please feel free to ask us questions. In addition, you may want to queue up a couple of websites for quick reference. For example, this Unix Quick Reference might be of use if you want to search for key words to remind you which command does what.
In the longer term, you may appreciate the following text available to us via the UC Berkeley Library: Linux Pocket Guide
For the lecture portion of this course, I would actually recommend something a bit more old-fashioned. You might find it easiest to just have a pen and paper handy and jot down each command we go over along with a brief explanation of what it does. This will come in very useful during the exercises and you won't have to flip back and forth between your terminal screen and your web browser.
Using the UNIX shell
Your time in the UNIX environment will be spent in the shell, which is lingo for the Terminal window, if you're using Mac OS. The shell allows you to move and copy files, run programs, and more. We will also touch on the basic usage of the the text editor emacs.
Informative Interlude: Some notes on the formatting of the lessons for this course
Periodically, we may stop with an informative interlude outlined with a horizontal line above and below (like the one two lines up!). In this case, we're taking a quick break to discuss this and other aspects of the formatting.
For this and all further examples, a $ represents your shell prompt, and boldface indicates the commands to type at the prompt. Italics will be used for output you should see when you take the described action.
This concludes our first informative interlude.
How do I open my terminal window?
In Mac OS, you can open a new terminal window by opening the Finder, selecting the Applications folder, then the Utilities sub-folder, and then double-clicking on the Terminal application. You may want to drag the Terminal icon to your Application Dock for ready access going forward.
Alternatively, you can just use Spotlight to search for "Terminal" each time, and open the application after it is listed in the Spotlight results list.
In Linux environments, your terminal program will depend on the version of Linux you have installed. If you are using Linux and you don't know how to open a terminal, raise your hand and let us know now.
Accessing your CGRL account
The next step is to login to the CGRL system (poset.cgrl.berkeley.edu) with your user account and password. You can do this with the ssh command.
@ravi ~$ ssh ralla@poset.cgrl.berkeley.edu Password: Last login: Tue Mar 18 18:10:47 2014 from 128.32.218.204
Navigation in the terminal
ralla@poset ~]$ whoami ralla
[ralla@poset ~]$ id uid=694(ralla) gid=500(cgrl) groups=500(cgrl)
[ralla@poset ~]$ pwd /global/home/ralla
[ralla@poset ~]$ cd
[ralla@poset ~]$ pwd /global/home/ralla
[ralla@poset ~]$ cd ..
[ralla@poset home]$ pwd /global/home
[ralla@poset home]$ cd ..
[ralla@poset global]$ pwd /global
[ralla@poset global]$ cd ..
[ralla@poset /]$ pwd /
[ralla@poset /]$ cd
[ralla@poset ~]$ pwd /global/home/ralla
[ralla@poset ~]$ cd /global/
[ralla@poset global]$ pwd /global
[ralla@poset global]$ cd ~
[ralla@poset ~]$ pwd /global/home/ralla
Paths can be absolute as in /global/ or relative as in ../ or ./
Looking at files and directories
[ralla@poset ~]$ ls fasta_files galaxy-dist galaxy-python
[ralla@poset ~]$ ls /global/ courses galaxy home scratch software
You can use tab key to autocomplete after you type the first few letters of a file or a directory or even a command. Hitting tab twice will list all the possibilities. Beware this might fill up your screen very fast if there are a lot of possibilities.
You can also use the up and down arrow keys to cycled through your previously typed commands. This saves you a lot of re-typing and reduces errors.
Another useful command is the history command. This prints to screen all your previous commands and their line numbers.
[ralla@poset ~]$ history .... .... 1037 ls 1038 cd .. 1039 clear 1040 ls -R fasta_files/ 1041 ls -Sr 1042 ls -S 1043 ls -s 1044 ls .. 1045 pwd 1046 ls /global/ 1047 ls -l /global/ 1048 clear 1049 history
Including the last command which was history itself.
You can re run any command in the history by using the command's line number in history
[ralla@poset ~]$ !1046 ls /global/ courses galaxy home scratch software
clear is a command that clears your working space so you can have a clean looking terminal
Creating files and directories
[ralla@poset ~]$ mkdir unix_playground
[ralla@poset ~]$ ls -l total 4 drwxr-xr-x 3 ralla cgrl 113 Mar 18 18:35 fasta_files drwxr-xr-x 21 ralla cgrl 4096 Feb 26 10:13 galaxy-dist drwxr-xr-x 2 ralla cgrl 19 Jan 16 12:05 galaxy-python drwxr-xr-x 2 ralla cgrl 6 Mar 18 18:54 unix_playground
Notice the date and time the new directory was created.
[ralla@poset unix_playground]$ ls -l total 0 -rw-r--r-- 1 ralla cgrl 0 Mar 18 18:55 My_First_file -rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file
Notice how touch recreated a new file with an updated time stamp. This is the behavior if file already exists, so be careful not to overwrite an existing file.
A few important things to learn from this, file and directory names are case sensitive in linux. Always name your files and directories something short and informative and use extensions (such as .fasta or .fastq to let you know what kind of file it is). Avoid using spaces in your file and directory names. We will see later why.
Modifying files and directories
[ralla@poset unix_playground]$ rm My_First_file
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file
[ralla@poset unix_playground]$ mkdir temp1 temp2
[ralla@poset unix_playground]$ ls -l total 0 -rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file drwxr-xr-x 2 ralla cgrl 6 Mar 18 19:05 temp1 drwxr-xr-x 2 ralla cgrl 6 Mar 18 19:05 temp2
[ralla@poset unix_playground]$ rmdir temp1/
[ralla@poset unix_playground]$ ls -l total 0 -rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file drwxr-xr-x 2 ralla cgrl 6 Mar 18 19:05 temp2
[ralla@poset unix_playground]$ cd temp2
[ralla@poset temp2]$ touch test.txt
[ralla@poset temp2]$ cd ..
[ralla@poset unix_playground]$ rmdir temp2/ rmdir: temp2/: Directory not empty
[ralla@poset unix_playground]$ rm -r temp2/
[ralla@poset unix_playground]$ ls -l total 0 -rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file
[ralla@poset unix_playground]$ ls -l total 0 drwxr-xr-x 2 ralla cgrl 30 Mar 18 19:08 move_dir -rw-r--r-- 1 ralla cgrl 0 Mar 18 19:09 my_first_copy.txt
Be careful with mv and cp they will overwrite existing copies of files without warning. It is a good idea to use rm,mv and cp with -i option to alert before making changes.
Adding text to a file, the dreaded terminal text editor.
Copy the following and paste into the txt file and then ctrl+o to write and ctrl+x to exit
>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
Peeking inside text based files
[ralla@poset unix_playground]$ cat my_first_fasta.fasta >seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC >seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG >seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG >seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT
[ralla@poset unix_playground]$ less my_first_fasta.fasta >seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC >seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG >seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG >seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT my_first_fasta.fasta (END)
Press q to exit out of this
[ralla@poset unix_playground]$ head my_first_fasta.fasta >seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC >seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG >seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
[ralla@poset unix_playground]$ tail my_first_fasta.fasta GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG >seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG >seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT
[ralla@poset unix_playground]$ head -3 my_first_fasta.fasta >seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
If you need help about any command or its options use man <command_name>. This is very useful when you learning command line. We will be using this in the coming sections.
[ralla@poset unix_playground]$ cat second_file.txt my_first_fasta.fasta This is my second file! I am getting good at this. >seq1 This is the description of my first sequence. AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC >seq2 This is a description of my second sequence. CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG >seq3 and so on... GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG >seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT
100%[=========================================================================================================================>] 4,173,755 4.11M/s in 1.0s
This > sign is used to redirect the output of head into a new file called gene_exp_first_100_lines.diff
This will overwrite whatever is in the recipient file name. To append use >> sign
[ralla@poset unix_playground]$ echo "I am adding a new line" > gene_exp_first_100_lines.diff
[ralla@poset unix_playground]$ head gene_exp_first_100_lines.diff I am adding a new line
[ralla@poset unix_playground]$ head -100 gene_exp.diff > gene_exp_first_100_lines.diff
[ralla@poset unix_playground]$ echo "I am adding a new line" >> gene_exp_first_100_lines.diff
[ralla@poset unix_playground]$ tail -5 gene_exp_first_100_lines.diff XLOC_000096XLOC_000096PLOD1chr1:11994723-12035599MockOGTOK4.9559657.11413.526610.8052650.082050.288638no XLOC_000097XLOC_000097MFN2chr1:12040237-12073572MockOGTOK22.837323.35650.03243250.01837010.97060.998471no XLOC_000098XLOC_000098MIIPchr1:12079298-12092106MockOGTOK10.270610.80530.07321580.06297720.8970.998471no XLOC_000099XLOC_000099TNFRSF8chr1:12123433-12204264MockOGTOK0.1032310.4866612.237040.6853430.02570.224628no I am adding a new line
Let's say we want to get the gene name column (column 3) and then sort the output. How would we do this??
This is where we use pipes, pipes take the output of one command and use it as input for the command after the pipe. Useful for chaining commands which have one output and accept one input
How would you use pipes to give me a list of only unique gene names in this file?
Wildcards
A lot of commands can take multiple inputs, like touch, mkdir, rm, cp, mv, wc etc
So if we are able to pass multiple files (or dir) to these commands, they will work on all of them. We could do this file by file or we could use an expression to match these multiple files. This is where wildcards come in. The following are a few wildcards we will talk about
. is a wildcard that can match zero or more characters of any kind
? is a wildcard that can match 1 occurrence of any character
[ABC] can match for any of the characters occurring inside them, A or B or C
! negates match expressions [!ABC], any character except A or B or C
{A,B,C} expands all the elements within the braces
{1..9} expands a range of elements within the braces
[ralla@poset unix_playground]$ mkdir wildcards
[ralla@poset unix_playground]$ cd wildcards/
[ralla@poset wildcards]$ touch file{1..6}.txt
[ralla@poset wildcards]$ ls file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt
Here we are using the output of find by placing its output in parenthesis with a $ and passing it to wc. This is another way of passing arguments to commands.
[ralla@poset unix_playground]$ grep '>' my_first_fasta.fasta >seq1 This is the description of my first sequence. >seq2 This is a description of my second sequence. >seq3 and so on... >seq4
Do not forget the ' ' around the > or you will overwrite your fasta file
[ralla@poset unix_playground]$ grep -n '>' my_first_fasta.fasta 1:>seq1 This is the description of my first sequence. 5:>seq2 This is a description of my second sequence. 9:>seq3 and so on... 13:>seq4
How would you get the header file in the gene_exp.diff using grep?
You can use egrep which is a more powerful form of grep and you can specify wildcards and regular expression to search for patterns. Regular expressions are a whole other beast and we won't be talking about them today. The internet is your friend in this case.
as an example we could use egrep to find fasta files which end in 'T' for example
a) Unzip the downloaded file (HINT: use gunzip, look at its documentation for help) b) Change into your fasta_files directory, and unzip the remainder of the fasta files using only a one-line command. c) Combine the sequence files to make a single whole genome file called "cerevisiae_genome.fasta" d) Count the chromosomes in the whole genome file using commands from the lecture. (HINT: Each of the original FASTA files contains a single chromosome). e) Get size of total genome. (HINT: The size of the genome can be determined by counting the number of bases).
Exercise 2
a) Get the list of cerevisiae chromosome features from this address:
1. Primary Stanford Gene Database ID (SGDID) (mandatory) 2. Feature type (mandatory) 3. Feature qualifier (optional) 4. Feature name (optional) 5. Standard gene name (optional) 6. Alias (optional, multiples separated by |) 7. Parent feature name (optional) 8. Secondary SGDID (optional, multiples separated by |) 9. Chromosome (optional) 10. Start_coordinate (optional) 11. Stop_coordinate (optional) 12. Strand (optional) 13. Genetic position (optional) 14. Coordinate version (optional) 15. Sequence version (optional) 16. Description (optional)
b) Next, count the total number of ORFs in the features file. c) Now count the ORFs listed as Verified, and separately count those listed as Dubious. If you do the math and add up these two groups, why don't they make up all the ORF matches fromgrep? d) Use the cut utility with grep to count the number of ORFs more accurately. e) Use a combination of the the commands cut, sort, and uniq generate a list of all the genomic features (column 2) in this file. Don't forget that you can use the man pages to learn how to use new tools.
Exercise 3
a) Is there a way to tell grep to ONLY print the part of the line that matches the pattern you give it?
b) Use your new found knowledge of grep and egrep, combined with some of the other unix utilities you learned about earlier, to produce a table that lists every length of GA microsatellite repeat in the yeast genome, along with the number of occurrences of each. You can do this with a single command, piping output between the different unix utilities.
HINT #1: You will need to use an extra argument with the uniq command, one that we haven't covered yet.
HINT #2: Regular expressions provide the flexibility to match patterns rather than a specific string. You can use special characters to construct a regular expression, for example, the plus sign means match one or more times, so the pattern 'H+' will match the H in HAT but also theHHHHHHHH in AHHHHHHHH!!!!!!! The period meansmatch any character so, from the previous example, 'H.+' would match all of HAT as well as HHHHHHHH!!!!!!!. You can also group characters together with parentheses. For example, '(I..)+' would match ISSISSIPP in MISSISSIPPI. Remember to enclose your pattern in 'tick marks' just like with echo.
CAVEAT: Assume that it is safe to ignore the fact that a repeat could possibly span multiple lines in your FASTA file
c) Save your table as a new file. Open the file in emacs and add column headers and/or anything else you can think of to make it pretty. Save the file and exit.
Shell Scripting
We can write scripts that run in the bash shell to automate a lot of our commands. We can use this to run a set of commands repeatedly over a bunch of different files. We could use wild cards but not always. For example to get lines 15-20 of every fasta file in fasta_files directory.
[ralla@poset fasta_files]$ head -20 chrI.fa | tail -5
Great our first bash script runs as expected. But we have hard coded it to print only lines 15-20 for chrI.fa. What if instead we wanted to pass it any fasta file?
$1 is the variable that stores the first argument we passed to the script. We could use $2 for a second variable and $3 for a third variable. Let's try that change your subset.sh to the following
head -${2} $1 | tail -${3}
Now when you run it again we can give it arguments for subsetting any number of lines.
Great how about if we wanted to pass this script all the files in the folder, we wouldn't know how many variables to use inside our script, and we would have to loop through our list of variables. A full list of arguments passed is captured in a special variable $*. Let us loop through all elements of this special variable and print out lines 15-20
Change subset.sh to this
for fasta in $*
do
echo "lines 15-20 of" $fasta
head -20 $fasta | tail -5
done
[ralla@poset fasta_files]$ bash subset.sh chr*
This is an example of for loops in a shell script, we are looping over each element of $* and assigning to a variable fasta and we are accessing that fasta file through $fasta and getting lines 15-20 and printing out. We could pipe the output and save it. There are other loops you can run in bash script as well. For example you could run Let us make it such that we don't have to type bash every time.
Add the following lines to the top of your shell script #!/bin/bash
This tells the terminal to use bash to execute the script. But now we also need to make the script executable
[ralla@poset fasta_files]$ ls -l subset.sh
-rw-r--r-- 1 ralla cgrl 79 Mar 18 21:43 subset.sh
Notice how the permissions are not setup with an x to execute this script. So we will change permissions to allow execution
[ralla@poset fasta_files]$ chmod +x subset.sh
[ralla@poset fasta_files]$ ls -l subset.sh
-rwxr-xr-x 1 ralla cgrl 79 Mar 18 21:43 subset.sh
[ralla@poset fasta_files]$ ./subset.sh chr*
We still need to type ./ because terminal is not searching the current directory for executables, for example when you type ls or cd or mv these are actually present in the PATH.
[ralla@poset fasta_files]$ which ls
/bin/ls
So /bin/ must be in the path because we are able to type ls. To check this we could do
Indeed it is. So to get our subset.sh to run like ls or any other commands that are in the PATH, we need to add our current directory to the PATH, how do we do this?
Genomic Analysis at the UNIX Command Line
Your instructor for today is Ravi Alla.
IMPORTANT NOTES
The material below is intended for the in-class instruction. Others trying to learn this material offline can follow along on any Unix (like) machine, however, please be aware the file names and path names will be different on your machine. The material here is adapted to a large extent from prior workshops at CGRL and presented by other instructors.Topics covered in this module:
Introduction and Requirements
Welcome to the CGRL module for Genomic Analysis at the UNIX command line. This module is designed for scientists hoping to analyze genomic datasets, but who have no experience with the UNIX computing environment or common genomic analysis tools.
The module requires that you use a recent MacOS or Linux based system. Future versions of the module may accomodate Microsoft Windows users.
What are you going to learn?
After this morning's session, you will be able to navigate the UNIX environment, allowing you to create, manipulate, and store large data files efficiently. You will find that in many ways, the UNIX environment is designed for speed and simplicity in using large files in a way that is unavailable in more "user-friendly" environments like Microsoft Windows or Mac OS. We will conclude this morning's session with a set of exercises that demonstrate the power of the UNIX environment to generate a basic description of the Saccharomyces cerevisiae genome. Further, if time permits we will go through the process of downloading and installing a genomic utility (Bowtie) to familiarize users with the execution environment.
What are you not going to learn?
This module is intended to teach very basic skills for the UNIX environment, allowing students to pursue more instruction with other modules that will guide you through the complications and considerations for proper genomic analysis. Accordingly, we will try to demonstrate the very basics of using one genomic utility (Bowtie), but we will not venture into the algorithms, the considerations, or the statistics behind a proper analysis. We hope that you are interested enough after finishing this module to pursue those questions with our other instructors.
Schedule for today
We will go along with a interleaving of description of the topic and a few hands-on exercises that attendees are expected follow along. To derive benefit from this workshop please try to keep up with the topic being discussed. During the hands-on exercises we will try to make sure that every on is in sync.
Warning: Learning curve ahead!
If you're unfamiliar with a text-based computing environment (some of us are old enough to have started with one as a kid), things may seem a bit archaic at first. That's okay, because UNIX is more than a bit archaic. It's not your fault. So if things seem opaque or difficult to remember, don't fret, and please feel free to ask us questions. In addition, you may want to queue up a couple of websites for quick reference. For example, this Unix Quick Reference might be of use if you want to search for key words to remind you which command does what.
In the longer term, you may appreciate the following text available to us via the UC Berkeley Library: Linux Pocket Guide
For the lecture portion of this course, I would actually recommend something a bit more old-fashioned. You might find it easiest to just have a pen and paper handy and jot down each command we go over along with a brief explanation of what it does. This will come in very useful during the exercises and you won't have to flip back and forth between your terminal screen and your web browser.
Using the UNIX shell
Your time in the UNIX environment will be spent in the shell, which is lingo for the Terminal window, if you're using Mac OS. The shell allows you to move and copy files, run programs, and more. We will also touch on the basic usage of the the text editor emacs.
Informative Interlude: Some notes on the formatting of the lessons for this course
Periodically, we may stop with an informative interlude outlined with a horizontal line above and below (like the one two lines up!). In this case, we're taking a quick break to discuss this and other aspects of the formatting.
For this and all further examples, a $ represents your shell prompt, and boldface indicates the commands to type at the prompt. Italics will be used for output you should see when you take the described action.
This concludes our first informative interlude.
How do I open my terminal window?
In Mac OS, you can open a new terminal window by opening the Finder, selecting the Applications folder, then the Utilities sub-folder, and then double-clicking on the Terminal application. You may want to drag the Terminal icon to your Application Dock for ready access going forward.
Alternatively, you can just use Spotlight to search for "Terminal" each time, and open the application after it is listed in the Spotlight results list.
In Linux environments, your terminal program will depend on the version of Linux you have installed. If you are using Linux and you don't know how to open a terminal, raise your hand and let us know now.
Accessing your CGRL account
The next step is to login to the CGRL system (poset.cgrl.berkeley.edu) with your user account and password. You can do this with the ssh command.
@ravi ~$ ssh ralla@poset.cgrl.berkeley.edu
Password:
Last login: Tue Mar 18 18:10:47 2014 from 128.32.218.204
Navigation in the terminal
ralla@poset ~]$ whoami
ralla
[ralla@poset ~]$ id
uid=694(ralla) gid=500(cgrl) groups=500(cgrl)
[ralla@poset ~]$ pwd
/global/home/ralla
[ralla@poset ~]$ cd
[ralla@poset ~]$ pwd
/global/home/ralla
[ralla@poset ~]$ cd ..
[ralla@poset home]$ pwd
/global/home
[ralla@poset home]$ cd ..
[ralla@poset global]$ pwd
/global
[ralla@poset global]$ cd ..
[ralla@poset /]$ pwd
/
[ralla@poset /]$ cd
[ralla@poset ~]$ pwd
/global/home/ralla
[ralla@poset ~]$ cd /global/
[ralla@poset global]$ pwd
/global
[ralla@poset global]$ cd ~
[ralla@poset ~]$ pwd
/global/home/ralla
Paths can be absolute as in /global/ or relative as in ../ or ./
Looking at files and directories
[ralla@poset ~]$ ls
fasta_files galaxy-dist galaxy-python
[ralla@poset ~]$ ls /global/
courses galaxy home scratch software
[ralla@poset ~]$ ls ..
acarver cellison dportik galaxy jefdaj jrefsnid jzhaolfalcon mjohnson polina rharris sbranco smckay tpeixoto
aihardin ckotwali drisso ganeshm jguevara jswenson kclemenslpu msyed postgres rsung scolmena ssinghal ylee
alam cmarshal earmstrong gsluser jheller jtaylor kgoodmanmberg mvandam qshi rumachik shepp test1 zhouqi
astrom dcotoras efarrer gwang jkarijol jvillaro kmackmchung pbrowne ralla rwelch shykin test2
ccattogl dkhendrix epurdom jchoi jmchoi jxu kmurikimganesh phsieh rghosh sbouzid smajumda test3
[ralla@poset ~]$ ls -l
total 8
drwxr-xr-x 2 ralla cgrl 4096 Mar 11 14:45 fasta_files
drwxr-xr-x 21 ralla cgrl 4096 Feb 26 10:13 galaxy-dist
drwxr-xr-x 2 ralla cgrl 19 Jan 16 12:05 galaxy-python
[ralla@poset ~]$ ls -al
total 88
drwx------ 11 ralla cgrl 4096 Mar 11 14:04 .
drwxr-xr-x 70 root root 4096 Feb 19 18:28 ..
-rw-r--r-- 1 ralla cgrl 8891 Feb 19 15:58 .RData
-rw------- 1 ralla cgrl 154 Feb 19 15:58 .Rhistory
-rw------- 1 ralla cgrl 21333 Mar 18 18:24 .bash_history
-rw-r--r-- 1 ralla cgrl 33 Jan 7 09:57 .bash_logout
-rw-r--r-- 1 ralla cgrl 203 Jan 28 12:52 .bash_profile
-rw-r--r-- 1 ralla cgrl 124 Jan 7 09:57 .bashrc
-rw-r--r-- 1 ralla cgrl 515 Jan 7 09:57 .emacs
drwx------ 2 ralla cgrl 6 Feb 19 16:22 .gconf
drwx------ 2 ralla cgrl 24 Feb 19 16:22 .gconfd
drwx------ 3 ralla cgrl 16 Feb 19 17:12 .local
-rw------- 1 ralla cgrl 6 Jan 29 15:06 .psql_history
drwxr-xr-x 7 ralla cgrl 4096 Jan 7 12:25 .python-eggs
drwx------ 2 ralla cgrl 24 Jan 28 10:35 .ssh
drwxr-xr-x 3 root root 61 Feb 19 17:51 .subversion
-rw------- 1 ralla cgrl 7263 Feb 19 17:38 .viminfo
drwxr-xr-x 2 ralla cgrl 4096 Mar 11 14:45 fasta_files
drwxr-xr-x 21 ralla cgrl 4096 Feb 26 10:13 galaxy-dist
drwxr-xr-x 2 ralla cgrl 19 Jan 16 12:05 galaxy-python
[ralla@poset ~]$ ls -lt
total 8
drwxr-xr-x 2 ralla cgrl 4096 Mar 11 14:45 fasta_files
drwxr-xr-x 21 ralla cgrl 4096 Feb 26 10:13 galaxy-dist
drwxr-xr-x 2 ralla cgrl 19 Jan 16 12:05 galaxy-python
[ralla@poset ~]$ ls -aF
./ .RData.bash_history .bash_profile .emacs .gconfd/ .psql_history.ssh/ .viminfo galaxy-dist/
../ .Rhistory.bash_logout .bashrc .gconf/ .local/ .python-eggs/.subversion/ fasta_files/ galaxy-python/
[ralla@poset ~]$ ls -R fasta_files/
fasta_files/:
GA_repeats.txtSGD_features.tab cerevisiae_genome.fasta fasta_dir md5sum.txt
fasta_files/fasta_dir:
chrI.fa chrIII.fa chrIX.fa chrV.fa chrVII.fa chrX.fa chrXII.fa chrXIV.fa chrXVI.fa
chrII.fa chrIV.fa chrM.fa chrVI.fa chrVIII.fa chrXI.fa chrXIII.fa chrXV.fa
You can use tab key to autocomplete after you type the first few letters of a file or a directory or even a command. Hitting tab twice will list all the possibilities. Beware this might fill up your screen very fast if there are a lot of possibilities.
You can also use the up and down arrow keys to cycled through your previously typed commands. This saves you a lot of re-typing and reduces errors.
Another useful command is the history command. This prints to screen all your previous commands and their line numbers.
[ralla@poset ~]$ history
....
....
1037 ls
1038 cd ..
1039 clear
1040 ls -R fasta_files/
1041 ls -Sr
1042 ls -S
1043 ls -s
1044 ls ..
1045 pwd
1046 ls /global/
1047 ls -l /global/
1048 clear
1049 history
Including the last command which was history itself.
You can re run any command in the history by using the command's line number in history
[ralla@poset ~]$ !1046
ls /global/
courses galaxy home scratch software
clear is a command that clears your working space so you can have a clean looking terminal
Creating files and directories
[ralla@poset ~]$ mkdir unix_playground
[ralla@poset ~]$ ls -l
total 4
drwxr-xr-x 3 ralla cgrl 113 Mar 18 18:35 fasta_files
drwxr-xr-x 21 ralla cgrl 4096 Feb 26 10:13 galaxy-dist
drwxr-xr-x 2 ralla cgrl 19 Jan 16 12:05 galaxy-python
drwxr-xr-x 2 ralla cgrl 6 Mar 18 18:54 unix_playground
Notice the date and time the new directory was created.
[ralla@poset ~]$ cd unix_playground/
[ralla@poset unix_playground]$ touch my_first_file
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:55 my_first_file
[ralla@poset unix_playground]$ touch My_First_file
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:55 My_First_file
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:55 my_first_file
[ralla@poset unix_playground]$ touch my_first_file
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:55 My_First_file
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file
Notice how touch recreated a new file with an updated time stamp. This is the behavior if file already exists, so be careful not to overwrite an existing file.
A few important things to learn from this, file and directory names are case sensitive in linux. Always name your files and directories something short and informative and use extensions (such as .fasta or .fastq to let you know what kind of file it is). Avoid using spaces in your file and directory names. We will see later why.
Modifying files and directories
[ralla@poset unix_playground]$ rm My_First_file
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file
[ralla@poset unix_playground]$ mkdir temp1 temp2
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file
drwxr-xr-x 2 ralla cgrl 6 Mar 18 19:05 temp1
drwxr-xr-x 2 ralla cgrl 6 Mar 18 19:05 temp2
[ralla@poset unix_playground]$ rmdir temp1/
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file
drwxr-xr-x 2 ralla cgrl 6 Mar 18 19:05 temp2
[ralla@poset unix_playground]$ cd temp2
[ralla@poset temp2]$ touch test.txt
[ralla@poset temp2]$ cd ..
[ralla@poset unix_playground]$ rmdir temp2/
rmdir: temp2/: Directory not empty
[ralla@poset unix_playground]$ rm -r temp2/
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_file
[ralla@poset unix_playground]$ mv my_first_file my_first_move.txt
[ralla@poset unix_playground]$ ls -l
total 0
-rw-r--r-- 1 ralla cgrl 0 Mar 18 18:56 my_first_move.txt
[ralla@poset unix_playground]$ mkdir move_dir
[ralla@poset unix_playground]$ mv my_first_move.txt move_dir/
[ralla@poset unix_playground]$ ls move_dir/
my_first_move.txt
[ralla@poset unix_playground]$ cp move_dir/my_first_move.txt my_first_copy.txt
[ralla@poset unix_playground]$ ls -l
total 0
drwxr-xr-x 2 ralla cgrl 30 Mar 18 19:08 move_dir
-rw-r--r-- 1 ralla cgrl 0 Mar 18 19:09 my_first_copy.txt
Be careful with mv and cp they will overwrite existing copies of files without warning. It is a good idea to use rm, mv and cp with -i option to alert before making changes.
Adding text to a file, the dreaded terminal text editor.
[ralla@poset unix_playground]$ nano my_first_fasta.fasta
Copy the following and paste into the txt file and then ctrl+o to write and ctrl+x to exit
>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
Peeking inside text based files
[ralla@poset unix_playground]$ cat my_first_fasta.fasta
>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
[ralla@poset unix_playground]$ less my_first_fasta.fasta
>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
my_first_fasta.fasta (END)
Press q to exit out of this
[ralla@poset unix_playground]$ head my_first_fasta.fasta
>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
[ralla@poset unix_playground]$ tail my_first_fasta.fasta
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
[ralla@poset unix_playground]$ head -3 my_first_fasta.fasta
>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
[ralla@poset unix_playground]$ tail -3 my_first_fasta.fasta
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
If you need help about any command or its options use man <command_name>. This is very useful when you learning command line. We will be using this in the coming sections.
A few more handy utilities
[ralla@poset unix_playground]$ nano second_file.txt[ralla@poset unix_playground]$ cat second_file.txt my_first_fasta.fasta
This is my second file! I am getting good at this.
>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3 and so on...
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
[ralla@poset unix_playground]$ wc my_first_fasta.fasta
16 35 598 my_first_fasta.fasta
[ralla@poset unix_playground]$ wc -l my_first_fasta.fasta
16 my_first_fasta.fasta
[ralla@poset unix_playground]$ wc -c my_first_fasta.fasta
598 my_first_fasta.fasta
Download this tab_delimited_file
Right click and copy link and go to unix_playground and use wget command
[ralla@poset unix_playground]$ wget http://cgrlucb.wikispaces.com/file/view/gene_exp.diff
--2014-03-18 19:41:41-- http://cgrlucb.wikispaces.com/file/view/gene_exp.diff
Resolving cgrlucb.wikispaces.com... 208.43.192.33, 75.126.104.177
Connecting to cgrlucb.wikispaces.com|208.43.192.33|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4173755 (4.0M) [text/plain]
Saving to: `gene_exp.diff'
100%[=========================================================================================================================>] 4,173,755 4.11M/s in 1.0s
2014-03-18 19:41:42 (4.11 MB/s) - `gene_exp.diff' saved [4173755/4173755]
[ralla@poset unix_playground]$ ls
gene_exp.diff my_first_fasta.fasta second_file.txt
Cut can be used to extract columns from a delimited file.
[ralla@poset unix_playground]$ cut -f 3 gene_exp.diff
...
...
TTTY14
CD24
BCORP1
KDM5D
TTTY10
RBMY2EP
TTTY13
RBMY1B
RBMY1B
PRY
RBMY1J
TTTY5
TTTY6,TTTY6B
DAZ1,DAZ4
TTTY3B
CDY1
CSPG4P1Y
GOLGA2P2Y
DAZ3,DAZ4
BPY2C
TTTY4B
TTTY17B
How would you extract columns if the file was a csv??
Sort can be used to sort based on any given column
[ralla@poset unix_playground]$ sort -k3 gene_exp.diff
LOC_001608XLOC_001608ZYG11Bchr1:53192130-53293013MockOGTOK05790.27infnan0.136850.42102no
XLOC_000409XLOC_000409ZYG11Bchr1:53192130-53293013MockOGTOK11.941513.21850.1465790.1444550.82990.998471no
XLOC_031804XLOC_031804ZYXchr7:143078359-143220540MockOGTOK5.088096.067640.254010.1167770.716950.996395no
XLOC_014317XLOC_014317ZZEF1chr17:3907738-4046253MockOGTOK2.743513.341760.2845830.1992790.658150.977286no
XLOC_001676XLOC_001676ZZZ3chr1:78030189-78148343MockOGTOK42.826824.3581-0.814114-0.6450210.29120.682889no
test_idgene_idgenelocussample_1sample_2statusvalue_1value_2log2(fold_change)test_statp_valueq_valuesignificant
How would you sort this in a reverse order? Use man sort
Sort the gene_exp.diff file by column 4 with -n option and without -n option, what is the difference?
What does the uniq command do?
Redirection and piping
Let us get the first 100 lines of gene_exp.diff and save it to a new file. How do we do this?
[ralla@poset unix_playground]$ head -100 gene_exp.diff > gene_exp_first_100_lines.diff
[ralla@poset unix_playground]$ wc -l gene_exp.diff gene_exp_first_100_lines.diff
37605 gene_exp.diff
100 gene_exp_first_100_lines.diff
37705 total
This > sign is used to redirect the output of head into a new file called gene_exp_first_100_lines.diff
This will overwrite whatever is in the recipient file name. To append use >> sign
[ralla@poset unix_playground]$ echo "I am adding a new line" > gene_exp_first_100_lines.diff
[ralla@poset unix_playground]$ head gene_exp_first_100_lines.diff
I am adding a new line
[ralla@poset unix_playground]$ head -100 gene_exp.diff > gene_exp_first_100_lines.diff
[ralla@poset unix_playground]$ echo "I am adding a new line" >> gene_exp_first_100_lines.diff
[ralla@poset unix_playground]$ tail -5 gene_exp_first_100_lines.diff
XLOC_000096XLOC_000096PLOD1chr1:11994723-12035599MockOGTOK4.9559657.11413.526610.8052650.082050.288638no
XLOC_000097XLOC_000097MFN2chr1:12040237-12073572MockOGTOK22.837323.35650.03243250.01837010.97060.998471no
XLOC_000098XLOC_000098MIIPchr1:12079298-12092106MockOGTOK10.270610.80530.07321580.06297720.8970.998471no
XLOC_000099XLOC_000099TNFRSF8chr1:12123433-12204264MockOGTOK0.1032310.4866612.237040.6853430.02570.224628no
I am adding a new line
Let's say we want to get the gene name column (column 3) and then sort the output. How would we do this??
This is where we use pipes, pipes take the output of one command and use it as input for the command after the pipe. Useful for chaining commands which have one output and accept one input
[ralla@poset unix_playground]$ cut -f 3 gene_exp_first_100_lines.diff | sort | head -5
-
-
-
ACTRT2
AGRN
How would you use pipes to give me a list of only unique gene names in this file?
Wildcards
A lot of commands can take multiple inputs, like touch, mkdir, rm, cp, mv, wc etc
So if we are able to pass multiple files (or dir) to these commands, they will work on all of them. We could do this file by file or we could use an expression to match these multiple files. This is where wildcards come in. The following are a few wildcards we will talk about
. is a wildcard that can match zero or more characters of any kind
? is a wildcard that can match 1 occurrence of any character
[ABC] can match for any of the characters occurring inside them, A or B or C
! negates match expressions [!ABC], any character except A or B or C
{A,B,C} expands all the elements within the braces
{1..9} expands a range of elements within the braces
[ralla@poset unix_playground]$ mkdir wildcards
[ralla@poset unix_playground]$ cd wildcards/
[ralla@poset wildcards]$ touch file{1..6}.txt
[ralla@poset wildcards]$ ls
file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt
[ralla@poset wildcards]$ touch file_{a..e}.log
[ralla@poset wildcards]$ ls
file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt file_a.log file_b.log file_c.log file_d.log file_e.log
[ralla@poset wildcards]$ ls *.txt
file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt
[ralla@poset wildcards]$ ls file*
file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt file_a.log file_b.log file_c.log file_d.log file_e.log
[ralla@poset wildcards]$ ls *[23].txt
????
[ralla@poset wildcards]$ ls ?[46].txt
[ralla@poset wildcards]$ ls file?.*
????
[ralla@poset wildcards]$ ls file{4,_?}*
???
[ralla@poset wildcards]$ rm *[!46].txt
[ralla@poset wildcards]$ ls
???
[ralla@poset wildcards]$ touch [!46].txt
[ralla@poset wildcards]$ ls
[!46].txt file4.txt file6.txt file_a.log file_b.log file_c.log file_d.log file_e.log
How do you remove this file [!46].txt? Hint: You would have to escape the [ and ! and ] to make them not special.
Remove all the files in this directory. And then navigate out of the directory and remove the directory named wildcards
Searching for files and their contents
Look up grep and find.
Find is useful for finding file names that match certain criteria
[ralla@poset ~]$ find unix_playground/ -type f
unix_playground/my_first_fasta.fasta
unix_playground/second_file.txt
unix_playground/gene_exp.diff
unix_playground/gene_exp_first_100_lines.diff
[ralla@poset ~]$ find unix_playground/ -type d
unix_playground/
[ralla@poset ~]$ find unix_playground/ -name "*.txt"
unix_playground/second_file.txt
[ralla@poset ~]$ find unix_playground/ -name "*.fasta"
unix_playground/my_first_fasta.fasta
[ralla@poset ~]$ find unix_playground/ -name "*.fasta" -type d
???
[ralla@poset ~]$ wc -l $(find unix_playground/ -name "*" -type f)
16 unix_playground/my_first_fasta.fasta
1 unix_playground/second_file.txt
37605 unix_playground/gene_exp.diff
101 unix_playground/gene_exp_first_100_lines.diff
37723 total
Here we are using the output of find by placing its output in parenthesis with a $ and passing it to wc. This is another way of passing arguments to commands.
[ralla@poset unix_playground]$ grep '>' my_first_fasta.fasta
>seq1 This is the description of my first sequence.
>seq2 This is a description of my second sequence.
>seq3 and so on...
>seq4
Do not forget the ' ' around the > or you will overwrite your fasta file
[ralla@poset unix_playground]$ grep -n '>' my_first_fasta.fasta
1:>seq1 This is the description of my first sequence.
5:>seq2 This is a description of my second sequence.
9:>seq3 and so on...
13:>seq4
[ralla@poset unix_playground]$ grep -no '>' my_first_fasta.fasta
1:>
5:>
9:>
13:>
[ralla@poset unix_playground]$ grep -nv '>' my_first_fasta.fasta
2:AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
3:TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
4:CGACGTAGATGCTAGCTGACTCGATGC
6:CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
7:GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
8:CATCGTCAGTTACTGCATGCTCG
10:GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
12:AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
14:CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
15:GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
16:GTAGATGCTAGCTGACTCGAT
[ralla@poset unix_playground]$ grep -n TMEM201 gene_exp.diff
77:XLOC_000076XLOC_000076TMEM201chr1:9648931-9674935MockOGTOK1.615331.53688-0.0718273-0.05297790.900450.998471no
How would you get the header file in the gene_exp.diff using grep?
You can use egrep which is a more powerful form of grep and you can specify wildcards and regular expression to search for patterns. Regular expressions are a whole other beast and we won't be talking about them today. The internet is your friend in this case.
as an example we could use egrep to find fasta files which end in 'T' for example
[ralla@poset unix_playground]$ egrep -n 'T$' my_first_fasta.fasta
3:TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
10:GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
16:GTAGATGCTAGCTGACTCGAT
We could find entries that begin with C and end with T
[ralla@poset unix_playground]$ egrep -n '^C.+T$' my_first_fasta.fasta
11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
Time for some Exercises
Exercise 1
make a new directory called fasta_files and download files from ftp://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/* and place them in the new directory
Then do the following:
a) Unzip the downloaded file (HINT: use gunzip, look at its documentation for help)
b) Change into your fasta_files directory, and unzip the remainder of the fasta files using only a one-line command.
c) Combine the sequence files to make a single whole genome file called "cerevisiae_genome.fasta"
d) Count the chromosomes in the whole genome file using commands from the lecture. (HINT: Each of the original FASTA files contains a single chromosome).
e) Get size of total genome. (HINT: The size of the genome can be determined by counting the number of bases).
Exercise 2
a) Get the list of cerevisiae chromosome features from this address:
http://downloads.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab
Columns within SGD_features.tab:
1. Primary Stanford Gene Database ID (SGDID) (mandatory)
2. Feature type (mandatory)
3. Feature qualifier (optional)
4. Feature name (optional)
5. Standard gene name (optional)
6. Alias (optional, multiples separated by |)
7. Parent feature name (optional)
8. Secondary SGDID (optional, multiples separated by |)
9. Chromosome (optional)
10. Start_coordinate (optional)
11. Stop_coordinate (optional)
12. Strand (optional)
13. Genetic position (optional)
14. Coordinate version (optional)
15. Sequence version (optional)
16. Description (optional)
b) Next, count the total number of ORFs in the features file.
c) Now count the ORFs listed as Verified, and separately count those listed as Dubious. If you do the math and add up these two groups, why don't they make up all the ORF matches fromgrep?
d) Use the cut utility with grep to count the number of ORFs more accurately.
e) Use a combination of the the commands cut, sort, and uniq generate a list of all the genomic features (column 2) in this file. Don't forget that you can use the man pages to learn how to use new tools.
Exercise 3
a) Is there a way to tell grep to ONLY print the part of the line that matches the pattern you give it?
b) Use your new found knowledge of grep and egrep, combined with some of the other unix utilities you learned about earlier, to produce a table that lists every length of GA microsatellite repeat in the yeast genome, along with the number of occurrences of each. You can do this with a single command, piping output between the different unix utilities.
Example:
5000 GA
3000 GAGA
1000 GAGAGA
100 GAGAGAGA
10 GAGAGAGAGA
...
...
HINT #1: You will need to use an extra argument with the uniq command, one that we haven't covered yet.
HINT #2: Regular expressions provide the flexibility to match patterns rather than a specific string. You can use special characters to construct a regular expression, for example, the plus sign means match one or more times, so the pattern 'H+' will match the H in HAT but also theHHHHHHHH in AHHHHHHHH!!!!!!! The period meansmatch any character so, from the previous example, 'H.+' would match all of HAT as well as HHHHHHHH!!!!!!!. You can also group characters together with parentheses. For example, '(I..)+' would match ISSISSIPP in MISSISSIPPI. Remember to enclose your pattern in 'tick marks' just like with echo.
CAVEAT: Assume that it is safe to ignore the fact that a repeat could possibly span multiple lines in your FASTA file
c) Save your table as a new file. Open the file in emacs and add column headers and/or anything else you can think of to make it pretty. Save the file and exit.
Shell Scripting
We can write scripts that run in the bash shell to automate a lot of our commands. We can use this to run a set of commands repeatedly over a bunch of different files. We could use wild cards but not always. For example to get lines 15-20 of every fasta file in fasta_files directory.
[ralla@poset fasta_files]$ head -20 chrI.fa | tail -5
CATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCA
CCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTG
GCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAA
TATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACA
CAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTA
Lets make a bash script to do this for all the files, building our code one step at a time
[ralla@poset fasta_files]$ nano subset.sh
And paste
head -20 chrI.fa | tail -5
into the file and write and exit.
[ralla@poset fasta_files]$ bash subset.sh
CATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAATCA
CCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTG
GCTTGTGGTAGCAACACTATCATGGTATCACTAACGTAAAAGTTCCTCAA
TATTGCAATTTGCTTGAACGGATGCTATTTCAGAATATTTCGTACTTACA
CAGGCCATACATTAGAATAATATGTCACATCACTGTCGTAACACTCTTTA
Great our first bash script runs as expected. But we have hard coded it to print only lines 15-20 for chrI.fa. What if instead we wanted to pass it any fasta file?
Change your subset.sh to the following
head -20 $1 | tail -5
Save the file and rerun
[ralla@poset fasta_files]$ bash subset.sh chrVI.fa
CCGAAATCCAAGAAACTGTAAGACATTCATATTCTCGGAAGTATTGGGAA
ATTGTGCTTTCAGTTTCTTTCTCTCTAGCAAAACCATTTGACTCCCTTTC
CGCTTATACGACTCTTTGTTAATGTCGGTGACTGGATGGAATCTATTATC
CTCAGCATTGCCATCTTTATTGGCGTCCTCCTTGGCACTAGCGTTGGTGC
TGGCAGTGGTAGTAGCATTAGTCCTGACGTTGATGCTGGCAGTGGTAGTC
$1 is the variable that stores the first argument we passed to the script. We could use $2 for a second variable and $3 for a third variable. Let's try that
change your subset.sh to the following
head -${2} $1 | tail -${3}
Now when you run it again we can give it arguments for subsetting any number of lines.
[ralla@poset fasta_files]$ bash subset.sh chrVI.fa 50 10
TTCTTCCAAATTTTATGAACATGCCCTAAAGGCACCTCGGATTTCTCCTT
GATTAGATTAAACATCCGTGTTGGATAGCTGGATAGACCTCTGCTGAGAT
CTTCCGACCGTTTGAGCTCGTTGATGTCCATCGACTTCTTGGCCAGTCCC
GTAAGCCCAATACGCTGCAACGCAGCATCAGCTACAGCCTCAGGGGCTGT
GCCGCTCAAAAAGATTGCTTTCTCAAAAGCGTCAAAATCAAGGTTAGTTA
TGCCCCCAAATTGCGACTGCCGGTAGACCTCCGTTTCAAAGTTGTGAAAC
TCATCTACAATGAGGTAACCCAATTTTACGTTGTTGGTCCTAAAGGTGCA
CTCAACAATATTCTCCCACGCAGCTATCCTGTCTGTGAAATTAGTGCTAG
CAAGATCATCGTAGATCCCCACGTATAAATCAGTAACGCCATCGTAACCT
TCTTCAATAAAGTTTCTTACAGGGGCCACATTCAAGCAACCGCGTCGGCC
Great how about if we wanted to pass this script all the files in the folder, we wouldn't know how many variables to use inside our script, and we would have to loop through our list of variables. A full list of arguments passed is captured in a special variable $*. Let us loop through all elements of this special variable and print out lines 15-20
Change subset.sh to this
for fasta in $*
do
echo "lines 15-20 of" $fasta
head -20 $fasta | tail -5
done
[ralla@poset fasta_files]$ bash subset.sh chr*
This is an example of for loops in a shell script, we are looping over each element of $* and assigning to a variable fasta and we are accessing that fasta file through $fasta and getting lines 15-20 and printing out. We could pipe the output and save it. There are other loops you can run in bash script as well. For example you could run
Let us make it such that we don't have to type bash every time.
Add the following lines to the top of your shell script
#!/bin/bash
This tells the terminal to use bash to execute the script. But now we also need to make the script executable
[ralla@poset fasta_files]$ ls -l subset.sh
-rw-r--r-- 1 ralla cgrl 79 Mar 18 21:43 subset.sh
Notice how the permissions are not setup with an x to execute this script. So we will change permissions to allow execution
[ralla@poset fasta_files]$ chmod +x subset.sh
[ralla@poset fasta_files]$ ls -l subset.sh
-rwxr-xr-x 1 ralla cgrl 79 Mar 18 21:43 subset.sh
[ralla@poset fasta_files]$ ./subset.sh chr*
We still need to type ./ because terminal is not searching the current directory for executables, for example when you type ls or cd or mv these are actually present in the PATH.
[ralla@poset fasta_files]$ which ls
/bin/ls
So /bin/ must be in the path because we are able to type ls. To check this we could do
[ralla@poset fasta_files]$ echo $PATH
/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/global/home/ralla/bin
Indeed it is. So to get our subset.sh to run like ls or any other commands that are in the PATH, we need to add our current directory to the PATH, how do we do this?
[ralla@poset fasta_files]$ pwd
/global/home/ralla/unix_playground/fasta_files
[ralla@poset fasta_files]$ export PATH=/global/home/ralla/unix_playground/fasta_files:$PATH
[ralla@poset fasta_files]$ echo $PATH
/global/home/ralla/unix_playground/fasta_files:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin:/global/home/ralla/bin
[ralla@poset fasta_files]$ subset.sh chr*
See how my pwd is now in the path so I can start typing sub and use tab to file in subset.sh
Exercise bash scripting
Rename all the fasta files to so chr is now chromosome
for example chrV.fa is now chromosomeV.fa
HINT: ${variable/<search_tem>/<replace_term>} will return a variable with search_term replaced with replace_term
Software installation (if time permits)
UnixSpring2013_Part2