UNIX data tools workshop


Your instructor is Ke Bi.


IMPORTANT NOTES

1. This workshop material is mainly based on “UNIX Tutorial for Beginners” (http://www.ee.surrey.ac.uk/Teaching/Unix/) with modifications. “UNIX Tutorial for Beginners” is covered by a Creative Commons licence. If you need to download a copy of the original tutorial, please go to http://www.ee.surrey.ac.uk/Teaching/Unix/download.html

2. Some of the workshop lectures are directly copied from Chapter 7 (Unix Data Tools) of Bioinformatics Data Skills by Vince Buffalo (especially the sections on advanced grep, sed and awk). Bioinformatics Data Skills introduces various fundamental but extremely useful tools that are commonly used in modern bioinformatics data processing. It is worth a read!

Bioinformatics Data Skills
Reproducible and Robust Research with Open Source Tools
By Vince Buffalo
Publisher: O'Reilly Media
Final Release Date: July 2015
Pages: 538

3. The material below is intended for in-class instruction. Others trying to learn this material offline can follow along on any Unix(-like) machine; however, please be aware that the file names and path names will be different on your own machine.

Requirements

The module requires that you use a recent MacOS or Linux based system. Future versions of the module may accommodate Microsoft Windows users.


Using the UNIX shell

Your time in the UNIX environment will be spent in the shell, which is the UNIX command-line interpreter. The shell runs in a terminal window; if you're using Mac OS, you'll find the Terminal under Applications -> Utilities -> Terminal. The shell allows you to move and copy files, run programs, and more. On Windows, you can install Cygwin to get a UNIX environment on your Windows machine.



How do I open my terminal window?

In Mac OS, you can open a new terminal window by opening the Finder, selecting the Applications folder, then the Utilities sub-folder, and then double-clicking on the Terminal application. You may want to drag the Terminal icon to your Application Dock for ready access going forward.

Alternatively, you can just use Spotlight to search for "Terminal" each time, and open the application after it is listed in the Spotlight results list.

In Linux environments, your terminal program will depend on the version of Linux you have installed. If you are using Linux and you don't know how to open a terminal, raise your hand and let us know now.

Accessing your temporary CGRL account

The next step is to login to the CGRL server (poset.cgrl.berkeley.edu) with your user account and password. You can do this with the ssh command.
After login, you should see a shell prompt that contains your user name and the name of the machine followed by a dollar sign.

Kes-MacBook-Pro:~ kebi$ ssh cgrlunix@poset.cgrl.berkeley.edu
Password:
[cgrlunix@poset ~]$

1.1 Listing files and directories

ls (list)
When you first log in, your current working directory is your home directory. Your home directory has the same name as your username, for example, cgrlunix, and it is where your personal files and subdirectories are saved.
To find out what is in your home directory, type
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file1.fasta
The ls command lists the contents of your current working directory. In fact, ls does not list all the files in your home directory, only those whose names do not begin with a dot (.). Files beginning with a dot (.) are known as hidden files and usually contain important program configuration information. They are hidden because you should not change them unless you are very familiar with Unix!

To list all files in your home directory including those whose names begin with a dot, type
[cgrlunix@poset ~]$ ls -a
.  .. .bash_logout  .bashrc  .cache  data.fastq  file1.fasta  Homo_sapiens.GRCh38.81.gtf  .profile
ls is an example of a command which can take options: -a is an example of an option which means do not ignore entries starting with dot (.).

Another bit of information we usually want about a file is its size.
[cgrlunix@poset ~]$ ls -l
total 6568
-rw-r--r-- 1 cgrlunix cgrlunix  636433 Jan  7 12:43 data.fastq
-rw-r--r-- 1 cgrlunix cgrlunix    9193 Jan  7 12:43 file1.fasta
-rw-r--r-- 1 cgrlunix cgrlunix 6073448 Jan  7 12:43 Homo_sapiens.GRCh38.81.gtf
A brief explanation of the information displayed by "ls -l"
Column 1: File type and permissions
1 2 3 4
- rw- r-- r--
1. The leftmost field is the file type: "-" means a regular file and "d" means a directory.
2. The owner (cgrlunix) has read (r) and write (w) permission on this file
3. The group has read permission on this file
4. Everyone else has read permission on this file
Note: if the item is executable, then the last field of the relevant section is marked "x" instead of "-". For instance: rwx

Column 2: Links field - the number of hard links to this item
Column 3: Name of owner
Column 4: Name of group
Column 5: Size in bytes
Column 6: Date/time when the item was last modified
Column 7: Name of the item



The fifth column (the one just before the modification date) of ls -l reports file sizes in bytes. If we wish to see human-readable sizes, we can use ls -lh:
[cgrlunix@poset ~]$ ls -lh
total 6.5M
-rw-r--r-- 1 cgrlunix cgrlunix 622K Jan  7 12:43 data.fastq
-rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan  7 12:43 file1.fasta
-rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan  7 12:43 Homo_sapiens.GRCh38.81.gtf

If you want to sort the files by size, you can use ls -lhS:
[cgrlunix@poset ~]$ ls -lhS
total 6.5M
-rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan  7 12:43 Homo_sapiens.GRCh38.81.gtf
-rw-r--r-- 1 cgrlunix cgrlunix 622K Jan  7 12:43 data.fastq
-rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan  7 12:43 file1.fasta


If you want to sort the files by size in reverse order, you can use ls -lhSr:
[cgrlunix@poset ~]$ ls -lhSr
total 6.5M
-rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan  7 12:43 file1.fasta
-rw-r--r-- 1 cgrlunix cgrlunix 622K Jan  7 12:43 data.fastq
-rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan  7 12:43 Homo_sapiens.GRCh38.81.gtf

These options change the behavior of the command. There are online manual pages that tell you which options a particular command can take, and how each option modifies the behavior of the command. (See later in this tutorial)



1.2 Making Directories
mkdir (make directory)
To make a subdirectory called unixstuff in your current working directory type
[cgrlunix@poset ~]$ mkdir unixstuff
To see the directory you have just created, type
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file1.fasta unixstuff


1.3 Changing to a different directory
cd (change directory)
The command cd directory means change your working directory from the current one to a different directory. The current working directory may be thought of as the directory you are in, i.e. your current position in the file-system tree.

To change to the directory you have just made, type
[cgrlunix@poset ~]$ cd unixstuff
Type ls to see the contents (which should be empty)
[cgrlunix@poset unixstuff]$ ls
Now create a new directory backups under unixstuff
[cgrlunix@poset unixstuff]$ mkdir backups


1.4 The directories . and ..
Still in the unixstuff directory, type
[cgrlunix@poset unixstuff]$ ls -a
. .. backups
As you can see, in the unixstuff directory (and in all other directories), there are two special directories called (.) and (..)
In UNIX, (.) means the current directory, so typing
[cgrlunix@poset unixstuff]$ cd .
means stay where you are (the unixstuff directory). This may not seem very useful at first, but using (.) as the name of the current directory will save a lot of typing, as we shall see later in the tutorial.

(..) means the parent of the current directory, so typing
[cgrlunix@poset unixstuff]$ cd ..
will take you one directory up the hierarchy (back to your home directory).

Note: typing cd with no argument always returns you to your home directory. This is very useful if you are lost in the file system.




1.5 Pathnames
pwd (print working directory)
Pathnames enable you to work out where you are in relation to the whole file-system. For example, to find out the absolute pathname of your working directory:
[cgrlunix@poset ~]$ pwd
/home/cgrlunix



1.6 More about pathnames
Understanding pathnames

First, make sure you are in your home directory, then type the following to list the contents of your unixstuff directory
[cgrlunix@poset ~]$ ls unixstuff
backups
Now type
[cgrlunix@poset ~]$ ls backups
ls: backups: No such file or directory
The reason is, backups is not in your current working directory. To use a command on a file (or directory) not in the current working directory (the directory you are currently in), you must either cd to the correct directory, or specify its pathname. To list the contents of your backups directory, you must type
[cgrlunix@poset ~]$ ls unixstuff/backups



2.1 Copying files
cp (copy)
cp file1 file2 is the command which makes a copy of file1 in the current working directory and calls it file2

First, cd to your home directory and type ls.
[cgrlunix@poset ~]$ cd
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file1.fasta  unixstuff
now make a copy of file1.fasta and call the copy file2.fasta
[cgrlunix@poset ~]$ cp file1.fasta file2.fasta
now list all the contents in your home directory
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file1.fasta  file2.fasta  unixstuff
Now try something fancier: make a copy of file1.fasta in the directory unixstuff and name this copy file3.fasta
[cgrlunix@poset ~]$ cp file1.fasta unixstuff/file3.fasta
Check the content in unixstuff
[cgrlunix@poset ~]$ ls unixstuff
backups file3.fasta
Now cd to unixstuff
[cgrlunix@poset ~]$ cd unixstuff
Now make a copy of file1.fasta in the current directory and name the copy file4.fasta
[cgrlunix@poset unixstuff]$ cp ../file1.fasta file4.fasta
Now make a copy of file1.fasta to the current directory without renaming it
[cgrlunix@poset unixstuff]$ cp ../file1.fasta .
Don't forget the dot (.) at the end. Remember, in unix, the dot means the current directory.
NOTE: Copying directories follows the same rules, except that we need to add the -r (recursive) option to make cp work with directories.
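As a minimal sketch of cp -r (the directory names here are illustrative, not files from the workshop account):

```shell
# Copy a whole directory tree with -r; without -r, cp refuses to copy a directory.
mkdir -p mydir/backups        # a small directory to copy (name is illustrative)
cp -r mydir mydir_copy        # recursive copy: the directory plus its contents
ls mydir_copy                 # shows: backups
```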


2.2 Moving files
mv (move)
mv file1 file2 moves (or renames) file1 to file2
To move a file from one place to another, use the mv command. This has the effect of moving rather than copying the file, so you end up with only one file rather than two. It can also be used to rename a file, by moving the file to the same directory, but giving it a different name. It can also rename directories or move directories to other directories.

In this lecture we will learn how to move files. Moving directories follows the same rules.
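A quick sketch of moving a directory (the directory names are illustrative):

```shell
# Moving a directory works just like moving a file: no extra option is needed.
mkdir -p olddir/sub newparent   # illustrative directories
mv olddir newparent/            # olddir is now newparent/olddir
ls newparent                    # shows: olddir
```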

We are now going to move file1.fasta to your backup directory. This can be done in many different ways. Here I just list one example:
cd to your home directory
[cgrlunix@poset unixstuff]$ cd
display all the content in home directory
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file1.fasta  file2.fasta  unixstuff
now move file1.fasta to the directory backups
[cgrlunix@poset ~]$ mv file1.fasta unixstuff/backups
Now check the content in the home directory again by using ls
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file2.fasta  unixstuff
You can see file1.fasta is gone. Now check the content in the directory backups
[cgrlunix@poset ~]$ ls unixstuff/backups
file1.fasta
mv can sometimes be "dangerous" to use for renaming files unless you clearly understand what you are trying to do: if the destination file already exists, mv silently overwrites it. Try the following
[cgrlunix@poset ~]$ cp file2.fasta test.fasta
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file2.fasta  test.fasta  unixstuff
[cgrlunix@poset ~]$ mv file2.fasta test.fasta
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf data.fastq test.fasta  unixstuff
You can see that file2.fasta is gone: it was renamed to test.fasta, silently overwriting the original test.fasta.

Let's rename test.fasta to be file2.fasta
[cgrlunix@poset ~]$ mv test.fasta file2.fasta



2.3 Removing files and directories
rm (remove), rmdir (remove empty directory), rm -r (remove directories and their contents recursively)

To delete (remove) a file, use the rm command.
Now cd to your unixstuff directory and list all the content, type
[cgrlunix@poset ~]$ cd unixstuff
[cgrlunix@poset unixstuff]$ ls
backups  file1.fasta  file3.fasta  file4.fasta
Now remove (permanently!) file1.fasta, file3.fasta and file4.fasta from unixstuff, and then list the content
[cgrlunix@poset unixstuff]$ rm file1.fasta file3.fasta file4.fasta
[cgrlunix@poset unixstuff]$ ls
backups
Now make a new directory called test under unixstuff and use rmdir to remove it. Make sure to check the contents of unixstuff before and after running rmdir. You can also use rm -r to remove a directory and its contents recursively:
[cgrlunix@poset unixstuff]$ mkdir test
[cgrlunix@poset unixstuff]$ ls
backups test
[cgrlunix@poset unixstuff]$ rmdir test
[cgrlunix@poset unixstuff]$ ls
backups
 
[cgrlunix@poset unixstuff]$ mkdir test
[cgrlunix@poset unixstuff]$ cp ../file2.fasta test/
[cgrlunix@poset unixstuff]$ rmdir test/
rmdir: failed to remove ‘test/’: Directory not empty
[cgrlunix@poset unixstuff]$ rm -r test/




2.4 Displaying the contents of a file on the screen
clear (clear screen)
Before you start the next section, you may like to clear the terminal window of the previous commands so the output of the following commands can be clearly demonstrated.

Now cd to your home directory
[cgrlunix@poset unixstuff]$ cd
Type clear in the shell
[cgrlunix@poset ~]$ clear
This will clear all text and leave you with the $ prompt at the top of the window.

Now list the content by using ls
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file2.fasta  unixstuff
file2.fasta is a copy of file1.fasta, but we have moved file1.fasta to the directory backups. Now we can move file1.fasta back. Then you can type ls to check the content
[cgrlunix@poset ~]$ mv unixstuff/backups/file1.fasta .
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  data.fastq file1.fasta  file2.fasta  unixstuff

cat
The command cat can be used to display the contents of a file on the screen. Type:
[cgrlunix@poset ~]$ cat file1.fasta
>Transcript1
CAGGCTTAGCG...
>Transcript2
CAGAGCCTCCA…
…
As you can see, the file is longer than the size of the window, so it scrolls past making it unreadable.

less
The command less writes the contents of a file onto the screen a page at a time. Type
[cgrlunix@poset ~]$ less file1.fasta
Press the [space-bar] if you want to see another page, type [b] to go up a page, type [q] if you want to quit reading. As you can see, less is used in preference to cat for long files.

Now let's use less to view file "data.fastq"

[cgrlunix@poset ~]$ less data.fastq
One of the most useful features of less is that it allows you to search text and highlights matches. Visually highlighting matches can be an extremely useful way to find potential problems in data. For example, let's use less to do a quick search for 3' adapter contaminants in the data.fastq file. In this case, we'll look for AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC, a known adapter from the Illumina TruSeq kit. Our goal isn't to do an exhaustive test or to remove these adapters; we just want to check if there's any indication there could be contamination.

First, open data.fastq in less, then press / and enter AGATCGGAAGA (or a longer piece of the adapter if you want to be more stringent).
[cgrlunix@poset ~]$ less data.fastq


head
The head command writes the first few lines of a file to the screen; by default it displays the first 10 lines.

First clear the screen then type head file1.fasta
[cgrlunix@poset ~]$ clear
[cgrlunix@poset ~]$ head file1.fasta
>Transcript1
CAGGCTTAGCG...
>Transcript2
CAGAGCCTCCA…
…
Then type
[cgrlunix@poset ~]$ head -n 5 file1.fasta
>Transcript1
CAGGCTTAGCG...
>Transcript2
CAGAGCCTCCA…
>Transcript3
The above shows the first 5 lines in file1.fasta

tail
The tail command writes the last ten lines of a file to the screen. Its function is opposite to head. Please give it a try. In the shell, type clear, then tail file1.fasta, then tail -n 5 file1.fasta


We can also use tail to remove the header of a file. Normally the -n argument specifies how many of the last lines of a file to display, but if -n is given a number x preceded with a + sign (e.g. +x), tail will start from the xth line. So to chop off a header (the first line in our case), we start from the second line with -n +2.
[cgrlunix@poset ~]$ tail -n +2 file1.fasta
CAGGCTTAGCG...
>Transcript2
CAGAGCCTCCA…
>Transcript3



2.5 Searching the contents of a file
grep
grep is one of many standard UNIX utilities. It searches files for specified words or patterns. First clear the screen, then type
[cgrlunix@poset ~]$ grep "Transcript" file1.fasta
>Transcript1
>Transcript2
>Transcript3
…
As you can see, grep has printed out each line containing the word “Transcript”.

Now try typing
[cgrlunix@poset ~]$ grep "transcript" file1.fasta
The grep command is case sensitive; it distinguishes between “Transcript” and “transcript”.

To ignore upper/lower case distinctions, use the -i option, i.e. type
[cgrlunix@poset ~]$ grep -i "transcript" file1.fasta
>Transcript1
>Transcript2
>Transcript3
…
Other common and useful options for grep include:
-v display those lines that do NOT match
-n precede each matching line with the line number
-c print only the total count of matched lines

Try some of them and see the different results. Don't forget, you can use more than one option at a time. For example, to count the number of lines that contain neither "Transcript" nor "transcript", type
[cgrlunix@poset ~]$ grep -ivc "Transcript" file1.fasta
11
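If you want to try the other options individually, here is a sketch using a tiny made-up FASTA file (demo.fasta is not part of the workshop data):

```shell
# Build a tiny FASTA file so the examples are self-contained (illustrative data)
printf '>Transcript1\nCAGG\n>Transcript2\nCAGA\n' > demo.fasta
grep -n "Transcript" demo.fasta   # each matching line preceded by its line number
grep -c "Transcript" demo.fasta   # prints only the count of matching lines: 2
grep -v "Transcript" demo.fasta   # prints the lines that do NOT match
```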


wc (word count)
A handy little utility is the wc command, short for word count. By default, wc outputs the number of lines, words, and characters in the supplied file.

[cgrlunix@poset ~]$ wc file1.fasta
22   22 9193 file1.fasta

To find out how many lines file1.fasta has, type
[cgrlunix@poset ~]$ wc -l file1.fasta
22 file1.fasta
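wc can also report a single statistic at a time; the -w and -c flags below are standard, and w.txt is just an illustrative file:

```shell
printf 'hello world\n' > w.txt
wc -l w.txt    # line count only
wc -w w.txt    # word count only
wc -c w.txt    # byte (character) count only
```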
 


3.1 Redirecting the Output
Most processes initiated by UNIX commands take input from the standard input (that is, they read it from the keyboard) and write to the standard output (that is, they write to the terminal). There is also the standard error, where processes write their error messages, by default to the terminal screen. Sometimes you may wish to save the output in a file rather than just printing it to the terminal, and in this case we need to redirect the output.

echo (writes its arguments to standard output)

In a terminal window type
[cgrlunix@poset ~]$ echo hello!
hello!
We now direct the output hello! to a new file called test.txt, check the content of test.txt using cat
[cgrlunix@poset ~]$ echo hello! > test.txt
[cgrlunix@poset ~]$ cat test.txt
hello!
We use the > symbol to redirect the output of a command.

For example, to create a file called list1.txt containing a list of fruits, type
[cgrlunix@poset ~]$ cat > list1.txt
Then type in the names of some fruit. Press [return] after each one
pear
banana
apple
^d (Control + d to stop)

What happens is the cat command reads the standard input (what you typed via the keyboard) and the > redirects the output into a file called list1.txt

To read the contents of the file, type cat list1.txt or less list1.txt
[cgrlunix@poset ~]$ cat list1.txt
pear
banana
apple
The form >> appends standard output to a file. So to add more items to the file list1.txt, type
[cgrlunix@poset ~]$ cat >> list1.txt

Then type in the names of more fruits
peach
grape
orange
^d (Control + d to stop)

To read the contents of the file, type
[cgrlunix@poset ~]$ cat list1.txt
pear
banana
apple
peach
grape
orange
Now we can write some vegetables to a new file called list2.txt, then type in the names of some vegetables

[cgrlunix@poset ~]$ cat > list2.txt
lettuce
carrot
kale
^d (Control + d to stop)

You should now have two files. One contains six fruits, and the other contains three vegetables. We will now use the cat command to join (concatenate) list1.txt and list2.txt into a new file called biglist.txt. Type
[cgrlunix@poset ~]$ cat list1.txt list2.txt > biglist.txt
What this is doing is reading the contents of list1.txt and list2.txt in turn, then outputting the text to the file biglist.txt

To read the contents of the new file, type
[cgrlunix@poset ~]$ cat biglist.txt
pear
banana
apple
peach
grape
orange
lettuce
carrot
kale
The sort command sorts a list alphabetically or numerically. Type
[cgrlunix@poset ~]$ sort biglist.txt
apple
banana
carrot
grape
kale
lettuce
orange
peach
pear
To output the sorted list to a file called sortlist.txt, type the following
[cgrlunix@poset ~]$ sort biglist.txt > sortlist.txt
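Note that sorting numbers as plain text can be surprising; numeric order needs the -n flag. A small sketch with an illustrative file:

```shell
printf '10\n2\n1\n' > nums.txt
sort nums.txt      # text order: 1, 10, 2 ("10" sorts before "2")
sort -n nums.txt   # numeric order: 1, 2, 10
```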


3.2 Pipes

The | (pipe) takes the output of the command before the pipe and uses it as input for the command after the pipe. It is useful for chaining commands that have one output and accept one input.

An example: we want to sort the file biglist.txt alphabetically, then display the first 5 lines, then from those five lines grep any lines containing the letter "r" or "R", and then count the number of matched lines
[cgrlunix@poset ~]$ sort biglist.txt | head -5 | grep -ic "r"
2
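If the pipeline looks dense, you can build it up one stage at a time. A sketch using a stand-in copy of the list (biglist_demo.txt is illustrative, with the same nine items):

```shell
printf 'pear\nbanana\napple\npeach\ngrape\norange\nlettuce\ncarrot\nkale\n' > biglist_demo.txt
sort biglist_demo.txt | head -5                 # the first five sorted lines
sort biglist_demo.txt | head -5 | grep -i "r"   # of those, the lines containing r or R
sort biglist_demo.txt | head -5 | grep -ic "r"  # count them: 2
```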
 


tee writes output to stdout, and also to a file simultaneously

For example, we can use ls to list all files and directories under the current working directory and save the output in "all.txt"
[cgrlunix@poset ~]$ ls > all.txt
 
Now we add tee to save the output in "all.txt" and, at the same time, view the output in the shell
[cgrlunix@poset ~]$ ls | tee all.txt
By default tee overwrites the file. You can instruct tee to append to the file using the option -a as shown below
[cgrlunix@poset ~]$ ls | tee -a all.txt


4.1 Wildcards

The characters * and ?
The character * is called a wildcard, and will match zero or more characters in a file (or directory) name. For example, in your home directory, type
[cgrlunix@poset ~]$ ls list*
list1.txt list2.txt
This will list all files and subdirectories in the current directory starting with list (list…)

Try typing
[cgrlunix@poset ~]$ ls *list.txt
biglist.txt sortlist.txt
This will list all files and subdirectories in the current directory ending with list (…list)

Now typing
[cgrlunix@poset ~]$ ls *list*
biglist.txt list1.txt list2.txt sortlist.txt
This will list all files and subdirectories in the current directory containing list (…list…)

The character ? will match exactly one character. So ls ?ouse will match files like house and mouse, but not grouse.

Try typing
[cgrlunix@poset ~]$ ls ?list1.txt
ls: ?list1.txt: No such file or directory
Now try typing
[cgrlunix@poset ~]$ ls ?ist1.txt
list1.txt


4.2 Filename conventions

We should note here that a directory is merely a special type of file. So the rules and conventions for naming files apply also to directories.
In naming files, characters with special meanings such as / * & % , should be avoided. Also, avoid using spaces within names. The safest way to name a file is to use only alphanumeric characters, that is, letters and numbers, together with _ (underscore) and . (dot).
File names conventionally start with a lower-case letter, and may end with a dot followed by a group of letters indicating the contents of the file. For example, all files consisting of C code may be named with the ending .c, for example, prog1.c . Then in order to list all files containing C code in your home directory, you need only type ls *.c in that directory.
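To see the extension convention in action, here is a sketch with hypothetical file names:

```shell
touch prog1.c prog2.c notes.txt   # hypothetical files
ls *.c                            # lists only prog1.c and prog2.c
```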


4.3 Getting Help

man (on-line manuals) and whatis

There are on-line manual pages that give information about most commands. The manual pages tell you which options a particular command can take, and how each option modifies the behavior of the command. Type man command to read the manual page for a particular command.
For example, to find out more about the wc (word count) command, type
[cgrlunix@poset ~]$ man wc
Alternatively, type
[cgrlunix@poset ~]$ whatis wc
whatis gives a one-line description of the command, but omits any information about options etc.



5.1 Some more advanced UNIX commands commonly used in bioinformatics
df
The df command reports on the space left on the file system. For example, to find out how much space is left on the fileserver, type
[cgrlunix@poset ~]$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb               15T  9.4T  5.3T  65% /global
-h means that it prints sizes in human readable format (e.g., 1K 234M 2G)


nproc
nproc displays how many computation units (e.g. cores) are available on this computer
[cgrlunix@poset ~]$ nproc
24

lscpu
This command lists more detailed configuration information about the CPU architecture of this computer
[cgrlunix@poset ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               2666.806
BogoMIPS:              5333.59
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23

free
free shows the total amount of memory and how much is in use. free -h translates the numbers into human-readable sizes (GB, MB, etc.)
[cgrlunix@poset ~]$ free -h
             total       used       free     shared    buffers     cached
Mem:           47G        23G        23G       204K       389M        21G
-/+ buffers/cache:       905M        46G
Swap:          47G        41M        47G

top
The top command displays the processes consuming the most resources on the system and periodically updates this information
[cgrlunix@poset ~]$ top
type "q" to quit top

du
du summarizes the disk usage of each file, recursively for directories. To summarize the total size of the home directory (~), type
[cgrlunix@poset ~]$ du -sh ~
6.6M     /home/cgrlunix
-s displays only a total size

gzip
This command compresses a file. For example, to compress file1.fasta, type
[cgrlunix@poset ~]$ gzip file1.fasta
Now list the content in home directory
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  biglist.txt  data.fastq  file1.fasta.gz  file2.fasta  list1.txt  list2.txt test.txt  unixstuff
Now file1.fasta is zipped and its name ends with .gz

To unzip the file, use the gunzip command
[cgrlunix@poset ~]$ gunzip file1.fasta.gz
Now list the content in home directory
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf  biglist.txt  data.fastq  file1.fasta  file2.fasta  list1.txt  list2.txt  test.txt  unixstuff
If you want to gzip a file and keep the original file, use -c (write output on standard output; keep original files unchanged):
[cgrlunix@poset ~]$ gzip -c file1.fasta > testfile.gz

tar
tar means tape archive: we use tar to combine multiple files from one or more folders into a single file for easy storage and distribution. tar and gzip can be used together to compress directories.
For example, we can combine all files from unixstuff (and all of its subdirectories) and gzip them into one file: archive.tar.gz.
[cgrlunix@poset ~]$ tar -zvcf archive.tar.gz unixstuff
 
-z: Compress the archive using gzip
-c: Create a new archive
-v: Verbose, i.e. display progress while creating the archive
-f: Use the given archive file name

To extract all files/directories from archive.tar.gz
[cgrlunix@poset ~]$ tar -vxf archive.tar.gz
-x: extract files

history
The shell keeps an ordered list of all the commands that you have entered. Each command is given a number according to the order it was entered.
[cgrlunix@poset ~]$ history
…lots of commands…
You can write the history into another file using >, for example:
[cgrlunix@poset ~]$ history > history.txt

nano (a small and friendly text editor)
Create an empty file sequence.fasta
[cgrlunix@poset ~]$ nano sequence.fasta
and copy the following fasta sequences to this file
>seq1
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT

Type Ctrl+X, then Y, and then Return to save and quit.
Now peek inside this file by using cat or less


wget (retrieve files from web servers)
[cgrlunix@poset ~]$ wget http://cgrlucb.wikispaces.com/file/view/gene_exp.diff
 

cut (extract sections from each line of input)
To read the contents of gene_exp.diff, type
[cgrlunix@poset ~]$ less -S gene_exp.diff
-S causes lines longer than the screen width to be chopped rather than folded

Extract the third column from gene_exp.diff and display it using less
[cgrlunix@poset ~]$ cut -f 3 gene_exp.diff | less

The -f argument is how we specify which columns to keep. It also allows us to specify ranges of columns (e.g. -f 3-8) and sets of columns (e.g. -f 3,5,8). Note that it's not possible to reorder columns using cut (e.g. -f 6,5,4,3 will not work). To reorder columns, you'll need awk, which is discussed later.
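A sketch of ranges and sets, using a tiny tab-separated line (cols.txt is an illustrative file, not part of the workshop data):

```shell
printf 'a\tb\tc\td\n' > cols.txt
cut -f 2-3 cols.txt   # columns 2 through 3
cut -f 1,4 cols.txt   # columns 1 and 4
cut -f 4,1 cols.txt   # same output as -f 1,4: cut cannot reorder columns
```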

Extract the third column from gene_exp.diff, exclude lines containing “-“, sort the names alphabetically and then display it using less
[cgrlunix@poset ~]$ cut -f 3 gene_exp.diff | grep -v "-" | sort | less

Next let's work on the gtf file (Homo_sapiens.GRCh38.81.gtf ) using cut. First of all let's peek into the gtf file using less -S
[cgrlunix@poset ~]$ less -S Homo_sapiens.GRCh38.81.gtf

The first five lines, which start with "#", are headers.
#!genome-build GRCh38.p3
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.18
#!genebuild-last-updated 2015-06
We want to extract the chromosome (the first column), start position (4th column) and end position (5th column) of each exon from the GTF, and then store the results in a new file called "gene.bed"; this is a typical BED format.
[cgrlunix@poset ~]$ tail -n +6 Homo_sapiens.GRCh38.81.gtf | grep "\bexon\b" | cut -f1,4,5 > gene.bed
\b matches the empty string at the edge of a word. It sets a boundary to the matches.
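To see what \b does on its own, here is a sketch on a small made-up file (this relies on \b being supported, as it is in GNU grep on Linux servers):

```shell
printf 'exon\nexonic\npseudoexon\n' > features.txt   # illustrative data
grep "\bexon\b" features.txt                         # prints only: exon
```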


uniq (when fed a text file, outputs the file with consecutive identical lines collapsed to one)
First, create a new file uniq_test.txt using nano and copy the following content into this file

gene1 A
gene1 A
gene6 F
gene6 F
gene6 F
gene1 A
gene2 B
gene2 B
gene4 D
gene5 E
gene6 F
gene6 F
gene6 F
gene1 A
gene1 A
gene2 B
gene2 B
gene2 B
gene3 C
gene4 D
gene4 D

First, let's try running uniq on the list directly
[cgrlunix@poset ~]$ uniq uniq_test.txt
gene1 A
gene6 F
gene1 A
gene2 B
gene4 D
gene5 E
gene6 F
gene1 A
gene2 B
gene3 C
gene4 D

As you can see, uniq does not return the unique values — it only removes consecutive duplicate lines (keeping one). If instead we did want to find all unique lines in a file, we would first sort all lines using sort so that all identical lines are grouped next to each other, and then run uniq. For example:
[cgrlunix@poset ~]$ sort uniq_test.txt | uniq
gene1 A
gene2 B
gene3 C
gene4 D
gene5 E
gene6 F
Print only those lines which are not repeated (unique) in the input.
[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -u
gene3 C
gene5 E
Print only those lines which are repeated in the input.
[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -d
gene1 A
gene2 B
gene4 D
gene6 F
You can also get some statistics out of uniq with the -c option
[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -c
5 gene1 A
5 gene2 B
1 gene3 C
3 gene4 D
1 gene5 E
6 gene6 F
 

Both sort | uniq and sort | uniq -c are frequently used shell idioms in bioinformatics and worth memorizing. Combined with other Unix tools like grep and cut, sort and uniq can be used to summarize columns of tabular data. Now we want to count the number of records for each feature present in the gtf file:
[cgrlunix@poset ~]$ tail -n +6 Homo_sapiens.GRCh38.81.gtf | cut -f3 | sort | uniq -c
2931 CDS
1083 UTR
5214 exon
250 gene
328 start_codon
296 stop_codon
893 transcript
note: the 3rd column contains the features of the gtf

Since sort and uniq are line-based, we can create lines from multiple columns to count combinations. In this example we would like to count how many of each feature (column 3 in this GTF) occur on each strand (column 7):
[cgrlunix@poset ~]$ tail -n +6 Homo_sapiens.GRCh38.81.gtf | cut -f3,7 | sort | uniq -c
1303 CDS +
1628 CDS -
407 UTR +
676 UTR -
2158 exon +
3056 exon -
126 gene +
124 gene -
145 start_codon +
183 start_codon -
113 stop_codon +
183 stop_codon -
389 transcript +
504 transcript -
Or, if you want to see the number of features belonging to a particular gene identifier:
[cgrlunix@poset ~]$ grep "ENSG00000223972"  Homo_sapiens.GRCh38.81.gtf | cut -f3 | sort | uniq -c
9 exon
1 gene
2 transcript

{} (Brace expansion is a mechanism by which arbitrary strings may be generated)
{A,B,C} expands all the elements within the braces
{1..9} expands a range of elements within the braces
[cgrlunix@poset ~]$ echo hello {A,B,C}
hello A B C
[cgrlunix@poset ~]$ echo hello{A,B,C}
helloA helloB helloC
Create several folders with names starting with "folder" under the home directory
[cgrlunix@poset ~]$ mkdir folder_{A,B,C}
[cgrlunix@poset ~]$ ls folder*
folder_A:
folder_B:
folder_C:
Create several new empty .txt files with names beginning with "text" using touch under the home directory
[cgrlunix@poset ~]$ touch text{1..5}.txt
[cgrlunix@poset ~]$ ls text*
text1.txt text2.txt text3.txt text4.txt text5.txt
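Brace expansions can also be combined: each element of the first expansion is paired with each element of the second. A sketch with made-up sample names (note that brace expansion is a bash/zsh feature, not part of POSIX sh):

```shell
# Generates one name per combination: 3 samples x 2 read files
echo sample{1..3}_{R1,R2}.fastq
```

This prints sample1_R1.fastq sample1_R2.fastq sample2_R1.fastq sample2_R2.fastq sample3_R1.fastq sample3_R2.fastq.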
 

find
find is useful for finding file or directory names that match certain criteria.
Search the home directory and print any files or directories with names beginning with "list"
[cgrlunix@poset ~]$ find ~ -name "list*"
/home/cgrlunix/list1.txt
/home/cgrlunix/list2.txt
Search the home directory to print all subdirectories
[cgrlunix@poset ~]$ find ~ -type d
/home/cgrlunix
/home/cgrlunix/.cache
/home/cgrlunix/unixstuff
/home/cgrlunix/unixstuff/backups
/home/cgrlunix/.config
/home/cgrlunix/.config/htop
/home/cgrlunix/folder_A
/home/cgrlunix/folder_C
/home/cgrlunix/folder_B
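find tests can be combined on one command line, and all of them must match. For example, to restrict the search to regular files (-type f) whose names end in ".txt" (a sketch; the starting directory will differ on your own machine):

```shell
# Only regular files matching the name pattern are printed;
# a directory named something.txt would be excluded by -type f.
find ~ -type f -name "*.txt"
```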


More about grep
grep is one of the most powerful Unix data tools and is very commonly used in bioinformatics data processing.
[workshop@poset ~]$ grep -n "T$" sequence.fasta
3:TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
10:GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
16:GTAGATGCTAGCTGACTCGAT
The above command greps each line ending with a "T" in sequence.fasta and prints its line number (the -n option). $ matches the end of a line.
[workshop@poset ~]$ grep -n "^C" sequence.fasta
4:CGACGTAGATGCTAGCTGACTCGATGC
6:CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
8:CATCGTCAGTTACTGCATGCTCG
11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
14:CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
The above command greps each line starting with a "C" and prints its line number. ^ matches the beginning of a line.
[workshop@poset ~]$ grep -n "^C.*T$" sequence.fasta
11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
The above command greps each line starting with "C" and ending with "T", with any characters in between, and prints its line number. . matches any single character (except newline); * means the preceding item is matched zero or more times.

Now let's go back to work on file1.fasta. It is a fasta file that contains 11 transcript sequences. We want to print the sequence of "Transcript10".
[workshop@poset ~]$ grep -A 1 ">Transcript10" file1.fasta
-A N (N an integer) prints N lines of trailing context after each matching line.
[workshop@poset ~]$ grep -B 1 ">Transcript10" file1.fasta
-B N (N an integer) prints N lines of leading context before each matching line.

The above two command lines are useful for displaying sequences of a fasta file.
[workshop@poset ~]$ grep -A 1 ">Transcript1" file1.fasta
What happened? It matches Transcript1, Transcript10 and Transcript11, since all three contain “Transcript1”, the pattern we were searching for. We just want the sequence after Transcript1 to be printed, so we need to add more conditions to the search. There are several ways to do this:
[workshop@poset ~]$ grep -A 1 ">Transcript1$" file1.fasta
[workshop@poset ~]$ grep -A 1 ">Transcript1\b" file1.fasta
Now let's work on the gtf file.
Display content without header lines:
[cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | less -S
Next let's use grep to extract the record of a transcript called "PLCH2-001"
[cgrlunix@poset ~]$ grep "\btranscript\b" Homo_sapiens.GRCh38.81.gtf | grep  "PLCH2-001" |  less -S
Next let's use grep to extract the records of transcripts "PLCH2-001" and "PLCH2-002"
[cgrlunix@poset ~]$ grep "\btranscript\b" Homo_sapiens.GRCh38.81.gtf | grep -E  "PLCH2-001|PLCH2-002" |  less -S
-E means Extended Regular Expressions (ERE). EREs allow us to use alternation (regular expression jargon for matching one of several possible patterns) to match either “PLCH2-001” or “PLCH2-002”. The syntax uses a pipe symbol, |.
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). The above command can be expressed using the basic grep regular expression:
[cgrlunix@poset ~]$ grep "\btranscript\b" Homo_sapiens.GRCh38.81.gtf | grep  "PLCH2-001\|PLCH2-002" |  less -S

By default, grep outputs the entire matching line if it contains the pattern we search for. To output only the matched part of the pattern, we use grep's -o option. Now we want to extract all values of the “gene_id” field from the last column of the gtf file. This can be done using grep -o:
[cgrlunix@poset ~]$ grep -E -o "ENSG\w+" Homo_sapiens.GRCh38.81.gtf | less -S
\w matches a word character (letters, digits, and the underscore). + means the preceding item is matched one or more times, so \w+ matches one or more word characters.

Here, we’re using extended regular expressions (ERE) to capture all gene names in the field. However, as you can see there’s a great deal of redundancy: our GTF file has multiple features (transcripts, exons, start codons, etc.) that all have the same gene name. We could quickly convert this messy output from grep to a list of unique, sorted gene names:
[cgrlunix@poset ~]$ grep -E -o "ENSG\w+" Homo_sapiens.GRCh38.81.gtf | sort | uniq > unique_gene.txt
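A natural follow-up question is how many distinct genes the file contains; piping the deduplicated list into wc -l counts them (the file name follows the workshop example, but the same idiom works on any GTF):

```shell
# Count distinct gene IDs: extract them, deduplicate, count lines
grep -E -o "ENSG\w+" Homo_sapiens.GRCh38.81.gtf | sort | uniq | wc -l
```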


sed (a stream editor)
A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). It reads data from a file or standard input and edits it one line at a time.

Here we use sed to find and replace strings in a file. For example, we want to find "Transcript10" in file1.fasta and replace it with "gene10":
[cgrlunix@poset ~]$ sed 's/Transcript10/gene10/g' file1.fasta
....
>gene10
GTCCTGGCCTGCTCTTGCTGCTCCTCAGTGGCCTCCAAGCTGTTATTCAAGAAAAAG
....
In the above example, s/A/B/g means replace A with B globally: every occurrence on each line is replaced, not just the first.
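The difference the g flag makes can be sketched with echo on a made-up input:

```shell
# Without /g, only the first match on each line is replaced:
echo "AxAxA" | sed 's/A/B/'      # BxAxA
# With /g, every match on the line is replaced:
echo "AxAxA" | sed 's/A/B/g'     # BxBxB
```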

Next, we want to find the sequence of Transcript1 and then transform all "A"s to "a"s
[cgrlunix@poset ~]$ grep -A 1 ">Transcript1$" file1.fasta | sed 's/A/a/g'
>Transcript1
CaGGCTTaGCGGTTTaTTGaTGTCTCCCaaGCCaGaGGaaGTCaCCaaaGGGaaGaGTCaCTTGCaGaaaaTGGTTTCCaTTTCaGaCaTTaCaTC
CCTGTaCaGCaaCCTGCTCCTGTaCaCGTGTGGGTCCTCTGCCGaaGCCaCCaGGGCTGTCaTGaGaCaCCTTGCCGCTGTGTaTCaaCaTGGCaG
CCTGCTaGGGCTTTCTGTTGCCaaGaGGCCTCTCTGGaGaCaGGCaTCTaTGCaaaGTGGGaaGGaCaCCaCTGaGCaaGaaaTTCTGaaaGCTaT
CaaCaTCaaTTCCTTTGCaGaGTGTGGCaTCaaTTTaTTCCaTGaGaGTGTaTCTaaaTCaGCCCTGaGCCaaGaaTTCGaaGCTTTCTTTCGT
 

Substitutions make up the majority of sed’s use cases, but this is just scratching the surface of sed’s capabilities. It’s also possible to select and print certain ranges of lines with sed. In this case, we’re not doing pattern matching, so we don’t need slashes. To print lines 5 through 8 of the fastq file, we use:
[cgrlunix@poset ~]$ sed -n '5,8p' data.fastq
@HS2:308:C14K4ACXX:6:1101:1540:2065 1:Y:0:CGGTGTA
NTATTGATCTGAAGGAGGTCTGGTACAAGACAGTGGACATAATCAATCTGGGTGTGGTACTTAGGAGGGGGTTCCAACTGTAGGATATCATCATCCATTT
+
#1:=?<4?BBB,,2?A;?7?<?A?+++2<2:77@B**1?*:*?B<>**0>A>/70<AA4>).8=>7)''''9:(,(59>?5(:;?????:::(8;?;;?=
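The -n flag matters here: it suppresses sed's default behavior of printing every input line, so only the 5,8p range appears. A self-contained sketch, with seq standing in for a real file:

```shell
# Prints lines 5 through 8 of seq's output (the numbers 5, 6, 7, 8)
seq 10 | sed -n '5,8p'
```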

sed can also be used to remove lines from a file
remove the 2nd line from a file
[cgrlunix@poset ~]$ sed '2d' Homo_sapiens.GRCh38.81.gtf | less -S

remove lines 1-5 from a file
[cgrlunix@poset ~]$ sed '1,5d' Homo_sapiens.GRCh38.81.gtf | less -S
To figure out which lines to remove, you can use grep -n <pattern>.

remove the last line
[cgrlunix@poset ~]$ sed '$d' Homo_sapiens.GRCh38.81.gtf | less -S
remove the first and last lines
[cgrlunix@poset ~]$ sed '1d;$d' Homo_sapiens.GRCh38.81.gtf | less -S
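These deletions are easy to check on a toy input generated with seq:

```shell
# '1d;$d' drops the first and last lines, leaving 2, 3 and 4
seq 5 | sed '1d;$d'
```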

tr (translate or transliterate)
tr reads from standard input and replaces, removes, or compresses specific characters; to run it on a file, pipe or redirect the file into it.

1. replacement:
For example, we can use tr to transform a single lowercase letter to uppercase
[cgrlunix@poset ~]$ echo "hello world" | tr 'h' 'H'
use tr to transform all lowercase to uppercase
[cgrlunix@poset ~]$ echo "hello world" | tr 'a-z' 'A-Z'
use tr for searching and replacing specific characters
[cgrlunix@poset ~]$ echo "hello world" | tr 'eo' '12'
2. delete:
we can use tr -d to search for characters and delete them
[cgrlunix@poset ~]$ echo "hello world" | tr -d 'o'
3. compress:
tr -s can be used to compress consecutive identical characters to one
[cgrlunix@poset ~]$ echo "hello world" | tr -s 'l'
[cgrlunix@poset ~]$ echo "hello         world" | tr -s ' '
[cgrlunix@poset ~]$ echo "hello         world" | tr -s ' ' ','
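A classic bioinformatics use of tr is complementing DNA: each base is mapped to its pairing base. Combined with rev (which reverses each line), this gives a quick reverse complement. A sketch with a made-up sequence; this simple version ignores lowercase letters and ambiguity codes:

```shell
# Complement: A<->T, C<->G
echo "ATGCCA" | tr 'ACGT' 'TGCA'          # TACGGT
# Reverse complement: reverse the sequence, then complement it
echo "ATGCCA" | rev | tr 'ACGT' 'TGCA'    # TGGCAT
```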

awk:
Awk is a specialized language that allows you to do a variety of text processing tasks with ease. It mainly helps us extract data from and manipulate tabular plaintext files. Throughout the workshop, we’ve seen how we can use simple Unix tools like grep, cut, and sort to inspect and manipulate plaintext tabular data in the shell. For many trivial bioinformatics tasks, these tools allow us to get the job done quickly and easily (and often very efficiently). Still, some tasks are slightly more complex and require a more expressive and powerful tool. This is where the language and tool Awk excels.
We’ll introduce the basics of Awk in this section — enough to get you started with using awk in bioinformatics. While Awk is a fully-fledged programming language, it’s a lot less expressive and powerful than Python and Perl. If you need to implement something complex, it’s likely better (and easier) to do so in Python/Perl. The key to using Awk effectively is to use it for the subset of tasks it’s best at: quick data processing tasks on tabular data.

To learn Awk we’ll cover two key parts of the Awk language: how Awk processes records, and pattern-action pairs. After understanding these two key parts the rest of the language is quite simple.

First, Awk processes input data a record at a time. Each record is composed of fields, separate chunks that Awk automatically separates. Since Awk was designed to work with tabular data, each record is a line, and each field is a column’s entry for that record. The clever part about Awk is that it automatically assigns the entire record to the variable $0, and field one’s value is assigned to $1, field two’s value is assigned to $2, field three’s value is assigned to $3, and so forth.
Second, we build Awk programs using one or more of the following structures: pattern { action }. Each pattern is an expression or regular expression pattern. Patterns are a lot like if statements in other languages: if the pattern’s expression evaluates to true or the regular expression matches, the statements inside action are run. In Awk lingo, these are pattern-action pairs and we can chain multiple pattern-action pairs together (separated by semicolons).

If we omit the pattern, Awk will run the action on all records. If we omit the action but specify a pattern, Awk will print all records that match the pattern. This simple structure makes Awk an excellent choice for quick text processing tasks. This is a lot to take in, but these two basic concepts --records and fields, and pattern-action pairs-- are the foundation of writing text processing programs with Awk. Let’s see some examples.

- Action only
First, we can simply mimic cat by omitting a pattern and printing an entire record with the variable $0:
[cgrlunix@poset ~]$ awk '{print $0}' gene.bed

Awk can also mimic cut:
[cgrlunix@poset ~]$ awk '{print $2"\t"$3}' gene.bed
Here, we’re making use of Awk’s string concatenation. Two strings are concatenated when they are placed next to each other, with no operator between them. So for each record, $2"\t"$3 concatenates the second field, a tab character, and the third field.

- Pattern only
Let’s now look at how we can incorporate simple pattern matching. Suppose we want to write a filter that outputs only lines where the length of the feature (end position - start position + 1) is greater than 1,000. Awk supports arithmetic with the standard operators +, -, *, /, % (remainder), and ^ (exponentiation). We can subtract within a pattern to calculate the length of a feature, and filter on that expression:
[cgrlunix@poset ~]$ awk '$3 - $2 + 1 > 1000' gene.bed
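If gene.bed is not at hand, the same filter can be tried on made-up coordinates fed in with printf; only the 1,500-bp feature passes:

```shell
# Feature lengths here are 1500 and 301; only the first exceeds 1000
printf '1\t100\t1599\n1\t100\t400\n' | awk '$3 - $2 + 1 > 1000'
```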

We can also chain patterns using the logical operators && (and), || (or), and ! (not). For example, to extract all records derived from chromosome 1 whose length is greater than 1,000:
[cgrlunix@poset ~]$ awk '$1 ~ /^1$/ && $3 - $2 + 1 > 1000' gene.bed
The first pattern, $1 ~ /^1$/, is how we specify a regular expression. Regular expressions are enclosed in slashes. Here, we’re matching the first field, $1, against the regular expression ^1$, which matches only the exact string "1". The tilde, ~, means match; to not match the regular expression we would use !~ (or !($1 ~ /^1$/)).
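The anchors in /^1$/ matter: without them, a bare /1/ would also match chromosomes 10 through 19 and 21. A quick sketch with made-up chromosome names:

```shell
# Only the exact name "1" matches /^1$/; "10" and "MT" do not
printf '1\n10\nMT\n' | awk '$1 ~ /^1$/'     # prints only: 1
```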

- Pattern and action
We can combine patterns and more complex actions than just printing the entire record. For example, if we want to add a column with the length of the feature (end position - start position + 1) for mitochondrial features only, we could use:
[cgrlunix@poset ~]$  awk '$1 ~ /MT/ { print $0 "\t" $3 - $2 + 1}' gene.bed

So far, these exercises illustrate how Awk is useful in (1) filtering data using rules that can combine regular expressions and arithmetic, and (2) reformatting the columns of data using arithmetic. These two applications alone make Awk an extremely handy tool in bioinformatics, and a huge time saver.
Now let’s look at some slightly more advanced use cases. We’ll start by introducing two special patterns: BEGIN and END.

The BEGIN pattern specifies what to do before the first record is read in, and END specifies what to do after the last record’s processing is complete. BEGIN is useful to initialize and set up variables, and END is useful to print data summaries at the end of file processing. For example, suppose we want to calculate the mean length of all the records in gene.bed. We would have to take the sum of all lengths, and then divide the sum by the total number of records. We can do this with:

[cgrlunix@poset ~]$ awk 'BEGIN { s = 0 }; { s += ($3 - $2 + 1) }; END { print "mean: " s/NR }' gene.bed
mean: 224.604
NR is the current record number, so on the last record NR is set to the total number of records processed. In the example above, we’ve initialized a variable s to 0 in BEGIN (variables you define do not need a dollar sign). Then, for each record we increment s by the length of the feature. At the end of the records, we print this sum s divided by the number of records NR, giving the mean.
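The same BEGIN/END program can be checked on a tiny made-up input where the answer is obvious: three features of lengths 100, 200 and 300 should give a mean of 200.

```shell
printf '1\t1\t100\n1\t1\t200\n1\t1\t300\n' | \
    awk 'BEGIN { s = 0 }; { s += ($3 - $2 + 1) }; END { print "mean: " s/NR }'
# mean: 200
```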

While Awk is designed to work with whitespace-separated tabular data, it’s easy to set a different field separator: simply specify which separator to use with the -F argument. For example, we could work with a CSV file in Awk by starting with awk -F",".
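A quick sketch of -F on a made-up CSV line:

```shell
# Split on commas instead of whitespace; $2 is now the second CSV field
echo "gene1,chr1,active" | awk -F"," '{ print $2 }'     # chr1
```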

We can use NR to extract ranges of lines too; for example, if we want to extract all lines between 3 and 5 (inclusive):
[cgrlunix@poset ~]$ awk 'NR >= 3 && NR <= 5' gene.bed
1 11869 12227
1 12613 12721
1 13221 14409

Awk makes it easy to convert from GTF to BED (BED start positions are 0-based, so we subtract 1 from the GTF start column):
[cgrlunix@poset ~]$ awk '$1 !~ /^#/ { print $1 "\t" $4-1 "\t" $5 }' Homo_sapiens.GRCh38.81.gtf | sort | uniq | head -n 5
1 11868 12227
1 11868 14409
1 12009 12057
1 12009 13670
1 12178 12227