Your instructor is Ke Bi.
IMPORTANT NOTES
1. This workshop material is mainly based on "UNIX Tutorial for Beginners" (http://www.ee.surrey.ac.uk/Teaching/Unix/) with modifications. "UNIX Tutorial for Beginners" is covered by a Creative Commons licence. If you need to download a copy of the original tutorial please go to http://www.ee.surrey.ac.uk/Teaching/Unix/download.html
2. Some of the workshop lectures are directly copied from Chapter 7, Unix Data Tools, of Bioinformatics Data Skills by Vince Buffalo (especially the sections about advanced grep, sed and awk). Bioinformatics Data Skills introduces various fundamental but extremely useful tools that are commonly used in modern bioinformatics data processing. It is worth a read!
Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools
By Vince Buffalo
Publisher: O'Reilly Media
Final Release Date: July 2015
Pages: 538
3. The material below is intended for in-class instruction. Others trying to learn this material offline can follow along on any Unix-like machine; however, please be aware that the file names and path names will be different on your own machine.
Requirements
The module requires that you use a recent Mac OS or Linux-based system. Future versions of the module may accommodate Microsoft Windows users.
Using the UNIX shell
Your time in the UNIX environment will be spent in the shell, which is the UNIX command line interpreter. The shell runs in a terminal window; if you're using Mac OS, you'll find the Terminal under Applications -> Utilities -> Terminal. The shell allows you to move and copy files, run programs, and more. On Windows, you could install Cygwin and get a UNIX environment on your Windows machine.
How do I open my terminal window?
In Mac OS, you can open a new terminal window by opening the Finder, selecting the Applications folder, then the Utilities sub-folder, and then double-clicking on the Terminal application. You may want to drag the Terminal icon to your Application Dock for ready access going forward.
Alternatively, you can just use Spotlight to search for "Terminal" each time, and open the application after it is listed in the Spotlight results list.
In Linux environments, your terminal program will depend on the version of Linux you have installed. If you are using Linux and you don't know how to open a terminal, raise your hand and let us know now.
Accessing your temporary CGRL account
The next step is to log in to the CGRL server (poset.cgrl.berkeley.edu) with your user account and password. You can do this with the ssh command. After login, you should see a shell prompt that contains your user name and the name of the machine followed by a dollar sign.
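For example (replacing cgrlunix with the user name you were given):
$ ssh cgrlunix@poset.cgrl.berkeley.edu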
1.1 Listing files and directories
ls (list)
When you first log in, your current working directory is your home directory. Your home directory has the same name as your user-name, for example, cgrlunix, and it is where your personal files and subdirectories are saved.
To find out what is in your home directory, type
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta
The ls command lists the contents of your current working directory. ls does not, in fact, cause all the files in your home directory to be listed, but only those ones whose name does not begin with a dot (.). Files beginning with a dot (.) are known as hidden files and usually contain important program configuration information. They are hidden because you should not change them unless you are very familiar with Unix!!!
To list all files in your home directory including those whose names begin with a dot, type
[cgrlunix@poset ~]$ ls -a
. .. .bash_logout .bashrc .cache data.fastq file1.fasta Homo_sapiens.GRCh38.81.gtf .profile
ls is an example of a command which can take options: -a is an example of an option which means do not ignore entries starting with dot (.).
Another bit of information we usually want about a file is its size.
[cgrlunix@poset ~]$ ls -l
total 6568
-rw-r--r-- 1 cgrlunix cgrlunix 636433 Jan 7 12:43 data.fastq
-rw-r--r-- 1 cgrlunix cgrlunix 9193 Jan 7 12:43 file1.fasta
-rw-r--r-- 1 cgrlunix cgrlunix 6073448 Jan 7 12:43 Homo_sapiens.GRCh38.81.gtf
A brief explanation of the information displayed in "ls -l"
Column 1: File permissions and file types
1 2 3 4
- rw- r-- r--
1. The leftmost field is the file type: "-" means a file and "d" means a directory.
2. The owner (cgrlunix) has read (r) and write (w) permission to this file
3. The group has read permission to this file
4. Everyone else has read permission to this file
Note: if the item is executable, then the last field of each section is marked "x" instead of "-". For instance: rwx
Column 2: Links field - the number of links (alternative names, e.g. from other directories) that refer to this item.
Column 3: Name of owner
Column 4: Name of group
Column 5: Size
Column 6: Date/time when the item was created or last modified
Column 7: Name of the item
In the fifth column (the one before the modification date), ls -l reports file sizes in bytes. If we wish to see human-readable sizes, we can use ls -lh:
[cgrlunix@poset ~]$ ls -lh
total 6.5M
-rw-r--r-- 1 cgrlunix cgrlunix 622K Jan 7 12:43 data.fastq
-rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan 7 12:43 file1.fasta
-rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan 7 12:43 Homo_sapiens.GRCh38.81.gtf
If you want to sort the files by size, use ls -lhS:
[cgrlunix@poset ~]$ ls -lhS
total 6.5M
-rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan 7 12:43 Homo_sapiens.GRCh38.81.gtf
-rw-r--r-- 1 cgrlunix cgrlunix 622K Jan 7 12:43 data.fastq
-rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan 7 12:43 file1.fasta
If you want to sort the files by size in reverse order, use ls -lhSr:
[cgrlunix@poset ~]$ ls -lhSr
total 6.5M
-rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan 7 12:43 file1.fasta
-rw-r--r-- 1 cgrlunix cgrlunix 622K Jan 7 12:43 data.fastq
-rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan 7 12:43 Homo_sapiens.GRCh38.81.gtf
These options change the behavior of the command. There are online manual pages that tell you which options a particular command can take, and how each option modifies the behavior of the command. (See later in this tutorial)
1.2 Making Directories
mkdir (make directory)
To make a subdirectory called unixstuff in your current working directory, type
[cgrlunix@poset ~]$ mkdir unixstuff
To see the directory you have just created, type
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta unixstuff
1.3 Changing to a different directory
cd (change directory)
The command cd directory means change your working directory from the current one to a different directory. The current working directory may be thought of as the directory you are in, i.e. your current position in the file-system tree.
To change to the directory you have just made, type
[cgrlunix@poset ~]$ cd unixstuff
Type ls to see the contents (which should be empty)
[cgrlunix@poset unixstuff]$ ls
Now create a new directory backups under unixstuff
[cgrlunix@poset unixstuff]$ mkdir backups
1.4 The directories . and ..
Still in the unixstuff directory, type
[cgrlunix@poset unixstuff]$ ls -a
. .. backups
As you can see, in the unixstuff directory (and in all other directories), there are two special directories called (.) and (..). In UNIX, (.) means the current directory, so typing
[cgrlunix@poset unixstuff]$ cd .
means stay where you are (the unixstuff directory). This may not seem very useful at first, but using (.) as the name of the current directory will save a lot of typing, as we shall see later in the tutorial.
(..) means the parent of the current directory, so typing
[cgrlunix@poset unixstuff]$ cd ..
will take you one directory up the hierarchy (back to your home directory).
Note: typing cd with no argument always returns you to your home directory. This is very useful if you are lost in the file system.
1.5 Pathnames
pwd (print working directory)
Pathnames enable you to work out where you are in relation to the whole file-system. For example, to find out the absolute pathname of your working directory:
[cgrlunix@poset ~]$ pwd
/home/cgrlunix
1.6 More about pathnames
Understanding pathnames
First cd back to your home directory, then list the contents of your unixstuff directory by typing
[cgrlunix@poset ~]$ ls unixstuff
backups
Now type
[cgrlunix@poset ~]$ ls backups
ls: backups: No such file or directory
The reason is, backups is not in your current working directory. To use a command on a file (or directory) not in the current working directory (the directory you are currently in), you must either cd to the correct directory, or specify its pathname. To list the contents of your backups directory, you must type
[cgrlunix@poset ~]$ ls unixstuff/backups
2.1 Copying files
cp (copy)
cp file1 file2 is the command which makes a copy of file1 in the current working directory and calls it file2
First, cd to your home directory and type ls.
[cgrlunix@poset ~]$ cd
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta unixstuff
now make a copy of file1.fasta and call the copy file2.fasta
[cgrlunix@poset ~]$ cp file1.fasta file2.fasta
now list all the contents in your home directory
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta file2.fasta unixstuff
Now try something fancier: make a copy of file1.fasta to the directory unixstuff and name this copy file3.fasta
Check the content in unixstuff
Now cd to unixstuff
Now make a copy of file1.fasta to the current directory and name the copy file4.fasta
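If you want to check your commands, one possible sequence is:
[cgrlunix@poset ~]$ cp file1.fasta unixstuff/file3.fasta
[cgrlunix@poset ~]$ ls unixstuff
[cgrlunix@poset ~]$ cd unixstuff
[cgrlunix@poset unixstuff]$ cp ../file1.fasta file4.fasta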
Now make a copy of file1.fasta to the current directory without renaming it
[cgrlunix@poset unixstuff]$ cp ../file1.fasta .
Don't forget the dot (.) at the end. Remember, in unix, the dot means the current directory.
NOTE: Copying directories follows the same rules, except we need to add the option -r to make cp work with directories.
2.2 Moving files
mv (move)
mv file1 file2 moves (or renames) file1 to file2
To move a file from one place to another, use the mv command. This has the effect of moving rather than copying the file, so you end up with only one file rather than two. It can also be used to rename a file, by moving the file to the same directory, but giving it a different name. It can also rename directories or move directories to other directories.
In this lecture we will learn how to move files. Moving directories follows the same rules.
We are now going to move file1.fasta to your backups directory. This can be done in many different ways; here is just one example.
cd to your home directory
[cgrlunix@poset unixstuff]$ cd
display all the content in home directory
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta file2.fasta unixstuff
now move file1.fasta to the directory backups (e.g., mv file1.fasta unixstuff/backups/)
Now check the content in the home directory again by using ls
You can see file1.fasta is gone. Now check the content in the directory backups
mv can sometimes be "dangerous" to use for renaming files unless you clearly understand what you are trying to do. Try the following: mv file2.fasta test.fasta
You can see that the original copy of file2.fasta is gone: test.fasta is overwritten by file2.fasta (if a file called test.fasta already existed, its contents are silently replaced; mv does not warn you).
Let's rename test.fasta to be file2.fasta
[cgrlunix@poset ~]$ mv test.fasta file2.fasta
2.3 Removing files and directories
rm (remove), rmdir (remove empty directory), rm -r (remove directories and their contents recursively)
To delete (remove) a file, use the rm command.
Now cd to your unixstuff directory and list all the content, type
[cgrlunix@poset ~]$ cd unixstuff
[cgrlunix@poset unixstuff]$ ls
backups file1.fasta file3.fasta file4.fasta
Now remove (permanently!) file1.fasta, file3.fasta and file4.fasta from unixstuff, and then list the content
[cgrlunix@poset unixstuff]$ rm file1.fasta file3.fasta file4.fasta
[cgrlunix@poset unixstuff]$ ls
backups
Now make a new directory test under unixstuff and use rmdir to remove it. Make sure to check the content of unixstuff before and after running rmdir. You can also use rm -r to remove a directory and its contents recursively
[cgrlunix@poset unixstuff]$ mkdir test
[cgrlunix@poset unixstuff]$ ls
backups test
[cgrlunix@poset unixstuff]$ rmdir test
[cgrlunix@poset unixstuff]$ ls
backups
[cgrlunix@poset unixstuff]$ mkdir test
[cgrlunix@poset unixstuff]$ cp ../file2.fasta test/
[cgrlunix@poset unixstuff]$ rmdir test/
rmdir: failed to remove ‘test/’: Directory not empty
[cgrlunix@poset unixstuff]$ rm -r test/
2.4 Displaying the contents of a file on the screen
clear (clear screen)
Before you start the next section, you may like to clear the terminal window of the previous commands so the output of the following commands can be clearly demonstrated.
Now cd to your home directory
[cgrlunix@poset unixstuff]$ cd
Type clear in the shell
[cgrlunix@poset ~]$ clear
This will clear all text and leave you with the $ prompt at the top of the window.
Now list the content by using ls
[cgrlunix@poset ~]$ ls
Homo_sapiens.GRCh38.81.gtf data.fastq file2.fasta unixstuff
file2.fasta is a copy of file1.fasta, but we have moved file1.fasta to the directory backups. Now we can move file1.fasta back (e.g., mv unixstuff/backups/file1.fasta .). Then you can type ls to check the content.
cat
The command cat can be used to display the contents of a file on the screen. Type:
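For example, try it on data.fastq (any long file will do):
[cgrlunix@poset ~]$ cat data.fastq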
As you can see, the file is longer than the size of the window, so it scrolls past making it unreadable.
less
The command less writes the contents of a file onto the screen a page at a time. Type
[cgrlunix@poset ~]$ less file1.fasta
Press the [space-bar] if you want to see another page, type [b] to go up a page, type [q] if you want to quit reading. As you can see, less is used in preference to cat for long files.
Now let's use less to view file "data.fastq"
[cgrlunix@poset ~]$ less data.fastq
One of the most useful features of less is that it allows you to search text and highlights matches. Visually highlighting matches can be an extremely useful way to find potential problems in data. For example, let’s use less to do a quick search of whether there are 3’ adapter contaminants in the data.fastq file. In this case, we’ll look for AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC — a known adapter from the Illumina TruSeq kit. Our goal isn’t to do an exhaustive test or remove these adapters — we just want to check if there’s any indication there could be contamination.
First, we open data.fastq in less, then press / and enter AGATCGGAAGA (or a longer prefix if you want to be more stringent).
[cgrlunix@poset ~]$ less data.fastq
head
The head command writes the first few lines of a file to the screen. The default is to display 10 lines.
First clear the screen, then type head file1.fasta
Then type head -n 5 file1.fasta
The above shows the first 5 lines in file1.fasta
tail
The tail command writes the last ten lines of a file to the screen. Its function is the opposite of head. Please give it a try: in the shell, type clear, then tail file1.fasta, then tail -n 5 file1.fasta
We can also use tail to remove the header of a file. Normally the -n argument specifies how many of the last lines of a file to display, but if -n is given a number x preceded with a + sign (e.g. +x), tail will start from the xth line. So to chop off a header (the first line in our case), we start from the second line with -n +2.
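For example, assuming a file whose first line is a header (the file name here is just a stand-in):
[cgrlunix@poset ~]$ tail -n +2 some_table.txt | less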
2.5 Searching the contents of a file
grep
grep is one of many standard UNIX utilities. It searches files for specified words or patterns. First clear the screen, then type grep Transcript file1.fasta
As you can see, grep has printed out each line containing the word "Transcript".
Now try typing grep transcript file1.fasta
The grep command is case sensitive; it distinguishes between "Transcript" and "transcript".
To ignore upper/lower case distinctions, use the -i option, i.e. type grep -i transcript file1.fasta
Other common and useful options for grep include:
-v display those lines that do NOT match
-n precede each matching line with the line number
-c print only the total count of matched lines
Try some of them and see the different results. Don't forget, you can use more than one option at a time. For example, the number of lines without the words "Transcript" or "transcript" is given by:
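One way to write this, combining -i, -v and -c:
[cgrlunix@poset ~]$ grep -ivc transcript file1.fasta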
wc (word count)
A handy little utility is the wc command, short for word count. By default, wc outputs the number of lines, words, and characters of the supplied file.
To find out how many lines file1.fasta has, type
[cgrlunix@poset ~]$ wc -l file1.fasta
22
3.1 Redirecting the Output
Most processes initiated by UNIX commands take input from the standard input (that is, they read it from the keyboard) and write to the standard output (that is, they write to the terminal). There is also the standard error, where processes write their error messages, by default, to the terminal screen. Sometimes you may wish to save the output in a file rather than just printing it in the terminal, and in this case we need to redirect the output.
echo (writes its arguments to standard output)
In a terminal window type
[cgrlunix@poset ~]$ echo hello!
hello!
We now direct the output hello! to a new file called test.txt, and check the content of test.txt using cat:
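For example:
[cgrlunix@poset ~]$ echo hello! > test.txt
[cgrlunix@poset ~]$ cat test.txt
hello!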
We use the > symbol to redirect the output of a command.
For example, to create a file called list1.txt containing a list of fruits, type
[cgrlunix@poset ~]$ cat > list1.txt
Then type in the names of some fruit. Press [return] after each one
pear
banana
apple
^d (Control + d to stop)
What happens is the cat command reads the standard input (what you typed via the keyboard) and the > redirects the output into a file called list1.txt
To read the contents of the file, type cat list1.txt or less list1.txt
[cgrlunix@poset ~]$ cat list1.txt
pear
banana
apple
The form >> appends standard output to a file. So to add more items to the file list1.txt, type
[cgrlunix@poset ~]$ cat >> list1.txt
Then type in the names of more fruits
peach
grape
orange
^d (Control + d to stop)
To read the contents of the file, type
[cgrlunix@poset ~]$ cat list1.txt
pear
banana
apple
peach
grape
orange
Now we can write some vegetables to a new file called list2.txt, then type in the names of some vegetables
[cgrlunix@poset ~]$ cat > list2.txt
lettuce
carrot
kale
^d (Control + d to stop)
You should now have two files. One contains six fruits, and the other contains three vegetables. We will now use the cat command to join (concatenate) list1.txt and list2.txt into a new file called biglist.txt. Type
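One way to write it:
[cgrlunix@poset ~]$ cat list1.txt list2.txt > biglist.txt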
What this is doing is reading the contents of list1.txt and list2.txt in turn, then outputting the text to the file biglist.txt
To read the contents of the new file, type cat biglist.txt or less biglist.txt
sort
The command sort alphabetically or numerically sorts a list. Type sort biglist.txt
To output the sorted list to a file called sortlist.txt, type sort biglist.txt > sortlist.txt
3.2 Pipes
| (pipe) takes the output of the command before the pipe and uses it as input for the command after the pipe. Useful for chaining commands which have one output and accept one input
An example: we want to sort the file biglist.txt alphabetically, then display the first 5 lines, and then from these five lines grep any lines containing the letter "r" or "R" and count the number of matched lines
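One way to chain these commands (a sketch):
[cgrlunix@poset ~]$ sort biglist.txt | head -n 5 | grep -i r | wc -l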
tee writes output to stdout, and also to a file simultaneously
For example, we can use ls to list all files and directories under the current working directory and save the output in "all.txt"
[cgrlunix@poset ~]$ ls > all.txt
Now we add tee to save the output in "all.txt" and at the same time, view the output as stdout in shell
[cgrlunix@poset ~]$ ls | tee all.txt
By default tee overwrites the file. You can instruct tee to append to the file using the option -a as shown below
[cgrlunix@poset ~]$ ls | tee -a all.txt
4.1 Wildcards
The characters * and ?
The character * is called a wildcard, and will match against none or more character(s) in a file (or directory) name. For example, in your home directory, type
[cgrlunix@poset ~]$ ls list*
list1.txt list2.txt
This will list all files and subdirectories in the current directory starting with list (list…)
Try typing
[cgrlunix@poset ~]$ ls *list.txt
biglist.txt sortlist.txt
This will list all files and subdirectories in the current directory ending with list (…list)
Now typing
[cgrlunix@poset ~]$ ls *list*
biglist.txt list1.txt list2.txt sortlist.txt
This will list all files and subdirectories in the current directory containing list (…list…)
The character ? will match exactly one character. So ls ?ouse will match files like house and mouse, but not grouse.
Try typing
[cgrlunix@poset ~]$ ls ?list1.txt
ls: ?list1.txt: No such file or directory
Now try typing
[cgrlunix@poset ~]$ ls ?ist1.txt
list1.txt
4.2 Filename conventions
We should note here that a directory is merely a special type of file. So the rules and conventions for naming files apply also to directories.
In naming files, characters with special meanings such as / * & % , should be avoided. Also, avoid using spaces within names. The safest way to name a file is to use only alphanumeric characters, that is, letters and numbers, together with _ (underscore) and . (dot).
File names conventionally start with a lower-case letter, and may end with a dot followed by a group of letters indicating the contents of the file. For example, all files consisting of C code may be named with the ending .c, for example, prog1.c . Then in order to list all files containing C code in your home directory, you need only type ls *.c in that directory.
4.3 Getting Help
man (on-line manuals) and whatis
There are on-line manuals which give information about most commands. The manual pages tell you which options a particular command can take, and how each option modifies the behavior of the command. Type man command to read the manual page for a particular command. For example, to find out more about the wc (word count) command, type
[cgrlunix@poset ~]$ man wc
Alternatively, type
[cgrlunix@poset ~]$ whatis wc
whatis gives a one-line description of the command, but omits any information about options etc.
5.1 Some more advanced UNIX commands commonly used in bioinformatics
df
The df command reports on the space left on the file system. For example, to find out how much space is left on the fileserver, type
[cgrlunix@poset ~]$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 15T 9.4T 5.3T 65% /global
-h means that it prints sizes in human readable format (e.g., 1K 234M 2G)
nproc
nproc displays how many computation units (e.g. cores) are in this computer
[cgrlunix@poset ~]$ nproc
24
lscpu
This command lists more configuration information about the CPU architecture of this computer
[cgrlunix@poset ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2666.806
BogoMIPS: 5333.59
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
free
free shows the total and used amounts of memory. free -h translates the numbers into human-readable sizes (Gb, Mb, etc.)
[cgrlunix@poset ~]$ free -h
             total       used       free     shared    buffers     cached
Mem:           47G        23G        23G       204K       389M        21G
-/+ buffers/cache:       905M        46G
Swap:          47G        41M        47G
top
The top command displays the top 30 processes on the system and periodically updates this information. Type "q" to quit top.
du
du summarizes disk usage of each file, recursively for directories. To summarize the total size of the home directory (~), type du -sh ~
-s displays only a total size
gzip
This command compresses a file. For example, to compress file1.fasta, type gzip file1.fasta
Now list the content in the home directory: file1.fasta is zipped and now ends with .gz
To unzip the file, use the gunzip command: gunzip file1.fasta.gz
If you want to gzip a file and keep the original file, use -c (write output on standard output; keep original files unchanged), e.g. gzip -c file1.fasta > file1.fasta.gz
tar
tar means tape archive: we use tar to combine files from one or more folders into a single file for easy storage and distribution. tar and gzip can be used together to compress directories.
For example, we combine all files from unixstuff (and all of its subdirectories) and then gzip them into one file: archive.tar.gz.
[cgrlunix@poset ~]$ tar -zvcf archive.tar.gz unixstuff
-z: Compress archive using gzip program
-c: Create a new archive
-v: Verbose, i.e. display progress while creating the archive
-f: use archive file name
To extract all files/directories from archive.tar.gz
[cgrlunix@poset ~]$ tar -vxf archive.tar.gz
-x: extract files
history
The shell keeps an ordered list of all the commands that you have entered. Each command is given a number according to the order it was entered.
[cgrlunix@poset ~]$ history
…lots of commands…
You can write the history into another file using >.
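For example (the output file name is up to you):
[cgrlunix@poset ~]$ history > history.txt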
nano (a small and friendly text editor)
Create an empty file sequence.fasta
[cgrlunix@poset ~]$ nano sequence.fasta
and copy the following fasta sequences to this file
>seq1
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
Type ctrl+X, Y, and then return to quit.
Now peek inside this file by using cat or less
wget (retrieve files from web servers)
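A typical invocation looks like this (the URL is only an illustration):
[cgrlunix@poset ~]$ wget http://example.com/some_file.txt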
cut (extract sections from each line of input)
Read the contents of gene_exp.diff, type
[cgrlunix@poset ~]$ less -S gene_exp.diff
-S causes lines longer than the screen width to be chopped rather than folded
Extract the third column from gene_exp.diff and display it using less
[cgrlunix@poset ~]$ cut -f 3 gene_exp.diff | less
The -f argument is how we specify which columns to keep. The argument -f also allows us to specify ranges of columns (e.g. -f 3-8) and sets of columns (e.g. -f 3,5,8). Note that it’s not possible to reorder columns using cut, e.g. -f 6,5,4,3 will not work. To reorder columns, you’ll need to use awk, which is discussed later.
Extract the third column from gene_exp.diff, exclude lines containing “-“, sort the names alphabetically and then display it using less
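For example:
[cgrlunix@poset ~]$ cut -f 3 gene_exp.diff | grep -v "-" | sort | less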
Next let's work on the gtf file (Homo_sapiens.GRCh38.81.gtf) using cut. First of all, let's peek into the gtf file using less -S. The first five lines, which start with "#", are headers.
We want to extract the chromosome (the first column), start position (4th column) and end position (5th column) of each exon from the gtf, and then store the results in a new file called "gene.bed" - this is a typical bed format.
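One possible pipeline (a sketch: drop the header lines, keep rows whose feature column is exon, then cut the three columns):
[cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | grep "\bexon\b" | cut -f 1,4,5 > gene.bed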
\b matches the empty string at the edge of a word. It sets a boundary to the matches.
uniq (when fed a text file, outputs the file with consecutive identical lines collapsed to one)
First create a new file uniq_test.txt using nano and copy the following content to this file
gene1 A
gene1 A
gene6 F
gene6 F
gene6 F
gene1 A
gene2 B
gene2 B
gene4 D
gene5 E
gene6 F
gene6 F
gene6 F
gene1 A
gene1 A
gene2 B
gene2 B
gene2 B
gene3 C
gene4 D
gene4 D
First let's try running uniq on the list directly
[cgrlunix@poset ~]$ uniq uniq_test.txt
gene1 A
gene6 F
gene1 A
gene2 B
gene4 D
gene5 E
gene6 F
gene1 A
gene2 B
gene3 C
gene4 D
As you can see, uniq does not return the unique values — it only removes consecutive duplicate lines (keeping one). If instead we did want to find all unique lines in a file, we would first sort all lines using sort so that all identical lines are grouped next to each other, and then run uniq. For example:
[cgrlunix@poset ~]$ sort uniq_test.txt | uniq
gene1 A
gene2 B
gene3 C
gene4 D
gene5 E
gene6 F
Print only those lines which are not repeated (unique) in the input.
[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -u
gene3 C
gene5 E
Print only those lines which are repeated in the input.
[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -d
gene1 A
gene2 B
gene4 D
gene6 F
You can also get some statistics out of uniq with the -c option
[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -c
5 gene1 A
5 gene2 B
1 gene3 C
3 gene4 D
1 gene5 E
6 gene6 F
Both sort | uniq and sort | uniq -c are frequently used shell idioms in bioinformatics and worth memorizing. Combined with other Unix tools like grep and cut, sort and uniq can be used to summarize columns of tabular data. Now we want to count the number of records for each feature present in the gtf file:
note: the 3rd column contains the features of the gtf
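A pipeline like this does it (dropping the header lines first):
[cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | cut -f 3 | sort | uniq -c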
Since sort and uniq are line-based, we can create lines from multiple columns to count combinations. In this example we would like to count how many of each feature (column 3 in this GTF) are on each strand (column 7):
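A sketch, cutting both columns so each line is a feature-strand pair:
[cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | cut -f 3,7 | sort | uniq -c
Or, if you want to see the number of features belonging to a particular gene identifier, grep for it first (the identifier below is only an illustration):
[cgrlunix@poset ~]$ grep "ENSG00000187634" Homo_sapiens.GRCh38.81.gtf | cut -f 3 | sort | uniq -c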
{} (Brace expansion is a mechanism by which arbitrary strings may be generated)
{A,B,C} expands all the elements within the braces
{1..9} expands a range of elements within the braces
[cgrlunix@poset ~]$ echo hello {A,B,C}
hello A B C
[cgrlunix@poset ~]$ echo hello{A,B,C}
helloA helloB helloC
Create several folders with names starting with "folder" under the home directory
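Brace expansion makes this a one-liner (three folders here, as an example):
[cgrlunix@poset ~]$ mkdir ~/folder{1..3}
[cgrlunix@poset ~]$ ls -d ~/folder*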
find
find is useful for finding file or directory names that match certain criteria. Search the home directory and print any files or directories beginning with list
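One way to do it (find's -name takes a wildcard pattern):
[cgrlunix@poset ~]$ find ~ -name "list*" -print
grep can also search with regular expressions rather than fixed words. For example, to print, with line numbers, the lines of file1.fasta that start with "C" and end with "T" (the file name here is an assumption; any text file works):
[cgrlunix@poset ~]$ grep -n "^C.*T$" file1.fasta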
The above command greps for each line starting with "C" and ending with "T", with any characters in between, and then prints out the line number of that line. . means any single character (except newline), and * means that the preceding item will be matched zero or more times.
Now let's go back to work on file1.fasta. It is a fasta file that contains 11 transcript sequences. We want to print the sequence of "Transcript10".
Transcript10"">
[cgrlunix@poset ~]$ grep -A 1 ">Transcript10" file1.fasta
-A N[int] means that it will print N lines of trailing context after matching lines.
-B N[int] means that it will print N lines of leading context before matching lines.
The above two command lines are useful for displaying sequences of a fasta file.
Transcript1"">
[cgrlunix@poset ~]$ grep -A 1 ">Transcript1" file1.fasta
What happened? It matches Transcript1, Transcript10 and Transcript11, since all three contain "Transcript1", the pattern we were searching for. We just want the sequence after Transcript1 to be printed, so we need to add more conditions to the search. There are several ways to do this:
Transcript1$"">
[cgrlunix@poset ~]$ grep -A 1 ">Transcript1$" file1.fasta
[cgrlunix@poset ~]$ grep -A 1 ">Transcript1\b" file1.fasta
Now let's work on the gtf file. Display content without header lines:
[cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | less -S
Next let's use grep to extract the record of a transcript called "PLCH2-001":
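For example:
[cgrlunix@poset ~]$ grep "PLCH2-001" Homo_sapiens.GRCh38.81.gtf | less -S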
-E means Extended Regular Expressions (ERE). EREs allow us to use alternation (regular expression jargon for matching one of several possible patterns) to match either "PLCH2-001" or "PLCH2-002". The syntax uses a pipe symbol, |. In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). Both forms are sketched below.
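Sketches of both forms (file and transcript names as above):
[cgrlunix@poset ~]$ grep -E "PLCH2-001|PLCH2-002" Homo_sapiens.GRCh38.81.gtf | less -S
The same search written as a basic regular expression:
[cgrlunix@poset ~]$ grep "PLCH2-001\|PLCH2-002" Homo_sapiens.GRCh38.81.gtf | less -S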
By default, grep outputs the entire matching line if the line contains the pattern we search for. To output only the matched part of the pattern, we use the -o option. Now we want to extract all values of the "gene_id" field from the last column of the gtf file. This can be done using grep -o:
[cgrlunix@poset ~]$ grep -E -o "ENSG\w+" Homo_sapiens.GRCh38.81.gtf | less -S
\w means a word character (a-z, A-Z, 0-9...). + means the preceding item is matched one or more times. \w+ means one or more word characters
Here, we’re using extended regular expressions (ERE) to capture all gene names in the field. However, as you can see there’s a great deal of redundancy: our GTF file has multiple features (transcripts, exons, start codons, etc.) that all have the same gene name. We could quickly convert this messy output from grep to a list of unique, sorted gene names:
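For example, piping grep's output through sort and uniq, as introduced earlier:
[cgrlunix@poset ~]$ grep -E -o "ENSG\w+" Homo_sapiens.GRCh38.81.gtf | sort | uniq | less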
sed (a stream editor)
A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). It reads data from a file or standard input and can edit a line at a time.
Here we use sed to find and replace strings in a file. For example, we want to find "Transcript10" in the file file1.fasta and replace it with "gene10"
[cgrlunix@poset ~]$ sed 's/Transcript10/gene10/g' file1.fasta
....
>gene10
GTCCTGGCCTGCTCTTGCTGCTCCTCAGTGGCCTCCAAGCTGTTATTCAAGAAAAAG
....
In the above example, s/A/B/g means: substitute A with B, globally (every occurrence on each line, not just the first).
Next, we want to find the sequence of Transcript1 and then transform all "A"s to "a"s
[cgrlunix@poset ~]$ grep -A 1 ">Transcript1$" file1.fasta | sed 's/A/a/g'
>Transcript1
CaGGCTTaGCGGTTTaTTGaTGTCTCCCaaGCCaGaGGaaGTCaCCaaaGGGaaGaGTCaCTTGCaGaaaaTGGTTTCCaTTTCaGaCaTTaCaTC
CCTGTaCaGCaaCCTGCTCCTGTaCaCGTGTGGGTCCTCTGCCGaaGCCaCCaGGGCTGTCaTGaGaCaCCTTGCCGCTGTGTaTCaaCaTGGCaG
CCTGCTaGGGCTTTCTGTTGCCaaGaGGCCTCTCTGGaGaCaGGCaTCTaTGCaaaGTGGGaaGGaCaCCaCTGaGCaaGaaaTTCTGaaaGCTaT
CaaCaTCaaTTCCTTTGCaGaGTGTGGCaTCaaTTTaTTCCaTGaGaGTGTaTCTaaaTCaGCCCTGaGCCaaGaaTTCGaaGCTTTCTTTCGT
Substitutions make up the majority of sed’s usage cases, but this is just scratching the surface of sed’s capabilities. It’s also possible to select and print certain ranges of lines with sed. In this case, we’re not doing pattern matching, so we don’t need slashes. To print the lines 5 through 8 of the fastq file, we use:
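With sed, that is (-n suppresses sed's default printing of every line, and 5,8p prints the range):
[cgrlunix@poset ~]$ sed -n '5,8p' data.fastq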
awk
Awk is a specialized language that allows you to do a variety of text processing tasks with ease. It mainly helps us extract data from and manipulate tabular plaintext files. Throughout the workshop, we’ve seen how we can use simple Unix tools like grep, cut, and sort to inspect and manipulate plaintext tabular data in the shell. For many trivial bioinformatics tasks, these tools allow us to get the job done quickly and easily (and often very efficiently). Still, some tasks are slightly more complex and require a more expressive and powerful tool. This is where the language and tool Awk excels. We’ll introduce the basics of Awk in this section — enough to get you started with using awk in bioinformatics. While Awk is a fully-fledged programming language, it’s a lot less expressive and powerful than Python and Perl. If you need to implement something complex, it’s likely better (and easier) to do so in Python/Perl. The key to using Awk effectively is to use it for the subset of tasks it’s best at: quick data processing tasks on tabular data.
To learn Awk we’ll cover two key parts of the Awk language: how Awk processes records, and pattern-action pairs. After understanding these two key parts the rest of the language is quite simple.
First, Awk processes input data a record at a time. Each record is composed of fields, separate chunks that Awk automatically separates. Since Awk was designed to work with tabular data, each record is a line, and each field is a column’s entry for that record. The clever part about Awk is that it automatically assigns the entire record to the variable $0, and field one’s value is assigned to $1, field two’s value is assigned to $2, field three’s value is assigned to $3, and so forth. Second, we build Awk programs using one or more of the following structures: pattern { action }. Each pattern is an expression or regular expression pattern. Patterns are a lot like if statements in other languages: if the pattern’s expression evaluates to true or the regular expression matches, the statements inside action are run. In Awk lingo, these are pattern-action pairs and we can chain multiple pattern-action pairs together (separated by semicolons).
If we omit the pattern, Awk will run the action on all records. If we omit the action but specify a pattern, Awk will print all records that match the pattern. This simple structure makes Awk an excellent choice for quick text processing tasks. This is a lot to take in, but these two basic concepts --records and fields, and pattern-action pairs-- are the foundation of writing text processing programs with Awk. Let’s see some examples.
- Action only
First, we can simply mimic cat by omitting a pattern and printing an entire record with the variable $0:
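A minimal sketch, run here on the gene.bed file we made earlier (any tab-separated file works):
[cgrlunix@poset ~]$ awk '{ print $0 }' gene.bed
Similarly, we can print selected fields, for example the second and third columns joined by a tab:
[cgrlunix@poset ~]$ awk '{ print $2 "\t" $3 }' gene.bed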
Here, we’re making use of Awk’s string concatenation: two strings are concatenated if they are placed next to each other with nothing in between. So for each record, $2"\t"$3 concatenates the second field, a tab character, and the third field.
- Pattern only
Let’s now look at how we can incorporate simple pattern matching. Suppose we wanted to write a filter that only output lines where the length of the feature (end position - start position) is greater than 1,000. Awk supports arithmetic with the standard operators +, -, *, /, % (remainder), and ^ (exponentiation). We can subtract within a pattern to calculate the length of a feature, and filter on that expression:
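A sketch on gene.bed, where $2 is the start and $3 is the end position:
[cgrlunix@poset ~]$ awk '$3 - $2 > 1000' gene.bed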
We can also chain patterns, by using logical operators && (and), || (or), and ! (not). For example, if we want to extract all records derived from chromosome 1 if the length is greater than 1000:
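One way to write the chained version:
[cgrlunix@poset ~]$ awk '$1 ~ /^1$/ && $3 - $2 > 1000' gene.bed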
The first pattern, $1 ~ /^1$/, is how we specify a regular expression; regular expressions are enclosed in slashes. Here, we’re matching the first field, $1, against the regular expression ^1$. The tilde, ~, means match; to not match the regular expression we would use !~ (or !($1 ~ /^1$/)).
- Pattern and action
We can combine patterns with more complex actions than just printing the entire record. For example, if we want to add a column with the length of the feature (end position - start position) for mitochondrial features only, we could use:
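A sketch, assuming the mitochondrial chromosome is named "MT" in the first column (as in this Ensembl GTF):
[cgrlunix@poset ~]$ awk '$1 ~ /^MT$/ { print $0 "\t" $3 - $2 }' gene.bed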
So far, these exercises illustrate how Awk is useful in (1) filtering data using rules that can combine regular expressions and arithmetic, and (2) reformatting the columns of data using arithmetic. These two applications alone make Awk an extremely handy tool in bioinformatics, and a huge time saver.
Now let’s look at some slightly more advanced use cases. We’ll start by introducing two special patterns: BEGIN and END.
The BEGIN pattern specifies what to do before the first record is read in, and END specifies what to do after the last record’s processing is complete. BEGIN is useful to initialize and set up variables, and END is useful to print data summaries at the end of file processing. For example, we want to calculate the mean length of all the records in gene.bed. We would have to take the sum of all lengths, and then divide the sum by the total number of records. We can do this with:
[cgrlunix@poset ~]$ awk 'BEGIN { s = 0 }; { s += ($3 - $2 + 1) }; END { print "mean: " s/NR }' gene.bed
mean: 224.604
NR is the current record number, so on the last record NR is set to the total number of records processed. In the example above, we’ve initialized a variable s to 0 in BEGIN (variables you define do not need a dollar sign). Then, for each record we increment s by the length of the feature. At the end of the records, we print this sum s divided by the number of records NR, giving the mean.
While Awk is designed to work with whitespace-separated tabular data, it’s easy to set a different field separator: simply specify which separator to use with the -F argument. For example, we could work with a CSV file in Awk by starting with awk -F",".
We can use NR to extract ranges of lines too; for example, if we want to extract all lines between 3 and 5 (inclusive):
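Using NR in a pattern:
[cgrlunix@poset ~]$ awk 'NR >= 3 && NR <= 5' gene.bed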
Your instructor is Ke Bi.
IMPORTANT NOTES
1.This workshop material is mainly based on “UNIX Tutorial for Beginners” (http://www.ee.surrey.ac.uk/Teaching/Unix/) with modifications. “UNIX Tutorial for Beginners” is covered by a Creative Commons licence. If you need to download a copy of the original tutorial please go to http://www.ee.surrey.ac.uk/Teaching/Unix/download.html
2. Some of the workshop lectures are directory copied from Chapter 7-Unix Data Tools from Bioinformatics Data Skills authored by Vince Buffalo (especially in sessions about advanced grep, sed and awk). Bioinformatics Data Skills introduces various fundamental but extremely useful tools that are commonly used in modern bioinformatics data processing. It is worth a read!
Bioinformatics Data Skills
Reproducible and Robust Research with Open Source Tools By Vince Buffalo
Publisher: O'Reilly Media
Final Release Date: July 2015
Pages: 538
3. The material below is intended for the in-class instruction. Others trying to learn this material offline can follow along on any Unix (like) machine, however, please be aware the file names and path names will be different on your own machine.
Requirements
The module requires that you use a recent MacOS or Linux based system. Future versions of the module may accommodate Microsoft Windows users.
Using the UNIX shell
Your time in the UNIX environment will be spent in the shell, which is UNIX command line interpreter. Shell runs in the Terminal window and if you're using Mac OS (you'll find the Terminal under Applications -> Utilities -> Terminal). The shell allows you to move and copy files, run programs, and more. On Windows, you could install Cygwin and get a UNIX environment on your Windows machine.
How do I open my terminal window?
In Mac OS, you can open a new terminal window by opening the Finder, selecting the Applications folder, then the Utilities sub-folder, and then double-clicking on the Terminal application. You may want to drag the Terminal icon to your Application Dock for ready access going forward.
Alternatively, you can just use Spotlight to search for "Terminal" each time, and open the application after it is listed in the Spotlight results list.
In Linux environments, your terminal program will depend on the version of Linux you have installed. If you are using Linux and you don't know how to open a terminal, raise your hand and let us know now.
Accessing your temporary CGRL account
The next step is to login to the CGRL server (poset.cgrl.berkeley.edu) with your user account and password. You can do this with the ssh command.
After login, you should see a shell prompt that contains your user name and the name of the machine followed by a dollar sign.
1.1 Listing files and directories
ls (list)
When you first login, your current working directory is your home directory. Your home directory has the same name as your user-name, for example, cgrlunix, and it is where your personal files and subdirectories are saved.
To find out what is in your home directory, type
The ls command lists the contents of your current working directory. ls does not, in fact, cause all the files in your home directory to be listed, but only those ones whose name does not begin with a dot (.). Files beginning with a dot (.) are known as hidden files and usually contain important program configuration information. They are hidden because you should not change them unless you are very familiar with Unix!!!
To list all files in your home directory including those whose names begin with a dot, type
ls is an example of a command which can take options: -a is an example of an option which means do not ignore entries starting with dot (.).
Another bit of information we usually want about a file is its size.
A brief explanation of the information displayed in "ls -l"
Column 1: File permissions and file types
1 2 3 4
- rw- r-- r--
1. leftmost field is file type: "-" means a file and "d"means a directory.
2. The owner (cgrlunix) has read (r) and write (w) permission to this file
3. The group has read permission to this file
4. everyone else has read permission to this file
Note: if the item is an executable then the last field of each section is marked as "x" instead of a "-". For instance: rwx
Column 2: Links field - Number of short cuts to this file from other directories.
Column 3: Name of owner
Column 4: Name of group
Column 5: Size
Column 6: Date/time when the item was created or last modified
Column 7: Name of the item
The fourth column (the one before the creation data) ls -l reports file sizes in bytes. If we wish to use human readable sies, we can use ls -lh:
If you want to sort the files by size, we can use ls -lhS:
If you want to sort the files by size in a reverse order, we can use ls -lhSr:
These options change the behavior of the command. There are online manual pages that tell you which options a particular command can take, and how each option modifies the behavior of the command. (See later in this tutorial)
1.2 Making Directories
mkdir (make directory)
To make a subdirectory called unixstuff in your current working directory type
To see the directory you have just created, type
1.3 Changing to a different directory
cd (change directory)
The command cd directory means change your working directory from the current one to a different directory. The current working directory may be thought of as the directory you are in, i.e. your current position in the file-system tree.
To change to the directory you have just made, type
Type ls to see the contents (which should be empty)
Now create a new directory backups under unixstuff
1.4 The directories . and ..
Still in the unixstuff directory, type
As you can see, in the unixstuff directory (and in all other directories), there are two special directories called (.) and (..)
In UNIX, (.) means the current directory, so type
means stay where you are (the unixstuff directory). This may not seem very useful at first, but using (.) as the name of the current directory will save a lot of typing, as we shall see later in the tutorial.
(..) means the parent of the current directory, so typing
will take you one directory up the hierarchy (back to your home directory).
Note: typing cd with no argument always returns you to your home directory. This is very useful if you are lost in the file system.
1.5 Pathnames
pwd (print working directory)
Pathnames enable you to work out where you are in relation to the whole file-system. For example, to find out the absolute pathname of your working directory:
1.6 More about pathnames
Understanding pathnames
Then type ls to list the contents of your unixstuff directory
Now type
The reason is, backups is not in your current working directory. To use a command on a file (or directory) not in the current working directory (the directory you are currently in), you must either cd to the correct directory, or specify its pathname. To list the contents of your backups directory, you must type
2.1 Copying files
cp (copy)
cp file1 file2 is the command which makes a copy of file1 in the current working directory and calls it file2
First, cd to your home directory and type ls.
now make a copy of file1.fasta and call the copy file2.fasta
now list all the contents in your home directory
Now try something more fancier: make a copy of file1.fasta to the directory unixstuff and name this copy file3.fasta
Check the content in unixstuff
Now cd to unixstuff
Now make a copy of file1.fasta to the current directory and name the copy as file4.fasta
Now make a copy of file1.fasta to the current directory without renaming it
Don't forget the dot (.) at the end. Remember, in unix, the dot means the current directory.
NOTE: Copy directories follows the same rules except we need to add an option -r to make cp work with directories.
2.2 Moving files
mv (move)
mv file1 file2 moves (or renames) file1 to file2
To move a file from one place to another, use the mv command. This has the effect of moving rather than copying the file, so you end up with only one file rather than two. It can also be used to rename a file, by moving the file to the same directory, but giving it a different name. It can also rename directories or move directories to other directories.
In this lecture we will learn how to move files. Moving directories follow the same rules.
We are now going to move file1.fasta to your backup directory. This can be done in many different ways. Here I just list one example:
cd to your home directory
display all the content in home directory
now move file1.fasta to the directory backups
Now check the content in the home directory again by using ls
You can see file1.fasta is gone. Now check the content in the directory backups
mv can sometimes be "dangerous" to use for renaming files unless you clearly understand what you try to do . Try the following
You can see that the original copy of file2.fasta is gone. test.fasta is overwritten by file2.fasta.
Let's rename test.fasta to be file2.fasta
2.3 Removing files and directories
rm (remove), rmdir (remove empty directory), rm -r (remove directories and their contents recursively)
To delete (remove) a file, use the rm command.
Now cd to your unixstuff directory and list all the content, type
No remove (permanently!) file1.fasta, file3.fasta and file4.fasta from unixstuff, and then list the content
Now make a new directory test under unixstuff and use rmdir to remove it. Make sure to check the content of unixstuff before and after running rmdir. You can also use rm -r to remove a directory and its contents recursively
2.4 Displaying the contents of a file on the screen
clear (clear screen)
Before you start the next section, you may like to clear the terminal window of the previous commands so the output of the following commands can be clearly demonstrated.
Now cd to your home directory
Type clear in the shell
This will clear all text and leave you with the $ prompt at the top of the window.
Now list the content by using ls
file2.fasta is a copy of file1.fasta, but we have moved file1.fasta to the directory backups. Now we can move file1.fasta back. Then you can type ls to check the content
cat
The command cat can be used to display the contents of a file on the screen. Type:
As you can see, the file is longer than the size of the window, so it scrolls past making it unreadable.
less
The command less writes the contents of a file onto the screen a page at a time. Type
Press the [space-bar] if you want to see another page, type [b] to go up a page, type [q] if you want to quit reading. As you can see, less is used in preference to cat for long files.
Now let's use less to view file "data.fastq"
One of the most useful features of less is that it allows you to search text and highlights matches. Visually highlighting matches can be an extremely useful way to find potential problems in data. For example, let’s use less to do a quick search of whether there are 3’ adapter contaminants in the data.fastq file. In this case, we’ll look for AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC — a known adapter from the Illumina Tru‐Seq. kit. Our goal isn’t to do an exhaustive test or remove these adapters — we just want to check if there’s any indication there could be contamination.
First, we open data.fastq in less, the press / and enter AGATCGGAAGA (or longer if you want to be more stringent).
head
The head command writes the first few lines of a file to the screen. the default is to display 10 lines
First clear the screen then type head file1.fasta
Then type
The above shows the first 5 lines in file1.fasta
tail
The tail command writes the last ten lines of a file to the screen. Its function is opposite to head. Please give it a try. In the shell, type clear, then tail file1.fasta, then tail -n 5 file1.fasta
We can also use tail to remove the header of a file. Normally the -n argument specifies how many of the last lines of a file to display, but if -n is given a number x preceded with a + sign (e.g. +x), tail will start from the xth line. So to chop off a header (the first line in our case), we start from the second line with -n +2.
2.5 Searching the contents of a file
grep
grep is one of many standard UNIX utilities. It searches files for specified words or patterns. First clear the screen, then type
As you can see, grep has printed out each line containing the word “Transcript”.
Now try typing
The grep command is case sensitive; it distinguishes between “Transcript” and “transcript”.
To ignore upper/lower case distinctions, use the -i option, i.e. type
Other common and useful options for grep include:
-v display those lines that do NOT match
-n precede each matching line with the line number
-c print only the total count of matched lines
Try some of them and see the different results. Don't forget, you can use more than one option at a time, for example, the number of lines without the words “Transcript” or “transcript” is
wc (word count)
A handy little utility is the wc command, short for word count. By default, wc outputs the number of words, lines, and characters of the supplied file.
To find out how many lines of file1.fasta has, type
3.1 Redirecting the Output
Most processes initiated by UNIX commands take input from the standard input (that is, they read it from the keyboard) and write to the standard output (that is, they write to the terminal). There is also the standard error, where processes write their error messages, by default, to the terminal screen. Sometimes you may wish to save the output in a file rather than just print them in a terminal, and in this case we need to redirect the output.
echo (writes its arguments to standard output)
In a terminal window type
We now direct the output hello! to a new file called test.txt, check the content of test.txt using cat
We use the > symbol to redirect the output of a command.
For example, to create a file called list1.txt containing a list of fruits, type
Then type in the names of some fruit. Press [return] after each one
^d (Control + d to stop)
What happens is the cat command reads the standard input (what you typed via the keyboard) and the > redirects the output into a file called list1.txt
To read the contents of the file, type cat list1.txt or less list1.txt
The form >> appends standard output to a file. So to add more items to the file list1.txt, type
Then type in the names of more fruits
^d (Control + d to stop)
To read the contents of the file, type
Now we can write some vegetables to a new file called list2.txt, then type in the names of some vegetables
^d (Control + d to stop)
You should now have two files. One contains six fruits, and the other contains three vegetables. We will now use the cat command to join (concatenate) list1.txt and list2.txt into a new file called biglist.txt. Type
What this is doing is reading the contents of list1.txt and list2.txt in turn, then outputing the text to the file biglist.txt
To read the contents of the new file, type
The command sort alphabetically or numerically sorts a list. Type
To output the sorted list to a file called sortlist, type the following
3.2 Pipes
I (pipe) takes the output of one command before the pipe and uses it as input for the command after the pipe. Useful for chaining commands which have one output and accept one input
An example: we want to sort the file biglist alphabetically, then display the first 5 lines , then from these five lines we want to grep any lines containing letter “r” or “R” and then count the number of matched lines
tee writes output to stdout, and also to a file simultaneously
For example we can use ls to list all files and directories under the current working diretory and save the output in "all.txt"
Now we add tee to save the output in "all.txt" and at the same time, view the output as stdout in shell
By default tee overwrites the file. You can instruct tee to append to the file using the option -a as shown below
4.1 Wildcards
The characters * and ?
The character * is called a wildcard, and will match against none or more character(s) in a file (or directory) name. For example, in your home directory, type
This will list all files and subdirectories in the current directory starting with list (list…)
Try typing
This will list all files and subdirectories in the current directory ending with list (…list)
Now typing
This will list all files and subdirectories in the current directory containing list (…list…)
The character ? will match exactly one character. So ls ?ouse will match files like house and mouse, but not grouse.
Try typing
Now try typing
4.2 Filename conventions
We should note here that a directory is merely a special type of file. So the rules and conventions for naming files apply also to directories.
In naming files, characters with special meanings such as / * & % , should be avoided. Also, avoid using spaces within names. The safest way to name a file is to use only alphanumeric characters, that is, letters and numbers, together with _ (underscore) and . (dot).
File names conventionally start with a lower-case letter, and may end with a dot followed by a group of letters indicating the contents of the file. For example, all files consisting of C code may be named with the ending .c, for example, prog1.c . Then in order to list all files containing C code in your home directory, you need only type ls *.c in that directory.
4.3 Getting Help
man (on-line manuals) and whatis
There are on-line manuals which gives information about most commands. The manual pages tell you which options a particular command can take, and how each option modifies the behavior of the command. Type man command to read the manual page for a particular command.
For example, to find out more about the wc (word count) command, type
Alternatively, type
whatis gives a one-line description of the command, but omits any information about options etc.
5.1 Some more advanced UNIX commands commonly used in bioinformatics
df
The df command reports on the space left on the file system. For example, to find out how much space is left on the fileserver, type
-h means that it prints sizes in human readable format (e.g., 1K 234M 2G)
nproc
proc displays how many computation units (e.g. cores) are in this computer
lscpu
This command lists more confiuguration information about CPU architectures in this computer
free
free shows the amount of memory are in total and used. free -h translates the numbers into human-readable sizes (Gb, Mb, etc.)
[cgrlunix@poset ~]$ free -h total used free shared buffers cached Mem: 47G 23G 23G 204K 389M 21G -/+ buffers/cache: 905M 46G Swap: 47G 41M 47Gtop
The top command displays the top 30 processes on the system and periodically updates this information
type "q" to quit top
du
du summarizes disk usage of each file, recursively for directories. To summarize the total size of home directory (~)
-s displays only a total size
gzip
This command compresses a file. For example, to compress file1.fasta, type
Now list the content in home directory
Now file1.fasta is zipped and is ended with .gz
To unzip the file, use the gunzip command
Now list the content in home directory
If you want to gzip a file and keep the original file, us -c (Write output on standard output; keep original files unchanged.)
tar
tar means tape archive: we use tar to combine a few files from one or more than one folder into a single file for easy storage and distribution. tar and gzip can be used together to compress directories.
For example we combine all files from unixstuff (and from its all subdirectories) and then gzip them into one file: archive.tar.gz.
-z: Compress archive using gzip program
-c: Create a new archive
-v: Verbose i.e display progress while creating archive
-f: use archive file name
To extract all files/directories from archive.tar.gz
-x: extract files
history
The shell keeps an ordered list of all the commands that you have entered. Each command is given a number according to the order it was entered.
You can write the history into another file using >.
nano (a small and friendly text editor)
Create an empty file sequence.fasta
and copy the following fasta sequences to this file
>seq1
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA
TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT
CGACGTAGATGCTAGCTGACTCGATGC
>seq2
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
CATCGTCAGTTACTGCATGCTCG
>seq3
GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT
CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG
>seq4
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC
GTAGATGCTAGCTGACTCGAT
Type Ctrl+X, then Y, and then Return to save the file and quit nano.
Now peek inside this file by using cat or less
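[cgrlunix@poset ~]$ cat sequence.fasta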
wget (retrieve files from web servers)
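For example, to download a file from a web server into the current directory (the URL below is only a placeholder; substitute the address of the file you want):
[cgrlunix@poset ~]$ wget https://example.com/path/to/file.fasta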
cut (extract sections from each line of input)
Read the contents of gene_exp.diff, type
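[cgrlunix@poset ~]$ less -S gene_exp.diff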
-S causes lines longer than the screen width to be chopped rather than folded
Extract the third column from gene_exp.diff and display it using less
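[cgrlunix@poset ~]$ cut -f 3 gene_exp.diff | less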
The -f argument is how we specify which columns to keep. The argument -f also allows us to specify ranges of columns (e.g. -f 3-8) and sets of columns (e.g. -f 3,5,8). Note that it’s not possible to reorder columns using cut (e.g. -f 6,5,4,3 will not reorder the output). To reorder columns, you’ll need to use awk, which is discussed later.
Extract the third column from gene_exp.diff, exclude lines containing “-“, sort the names alphabetically and then display it using less
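One way to chain these steps together (a sketch):
[cgrlunix@poset ~]$ cut -f 3 gene_exp.diff | grep -v "-" | sort | less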
Next let's work on the gtf file (Homo_sapiens.GRCh38.81.gtf ) using cut. First of all let's peek into the gtf file using less -S
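[cgrlunix@poset ~]$ less -S Homo_sapiens.GRCh38.81.gtf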
The first five lines, which start with "#", are headers.
We want to extract chromosome (the first column), start position (4th column) and end position (5th column) of each exon from the gtf, and then store the results in a new file called "gene.bed" - this is a typical bed format.
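One plausible form of the command (a sketch; the word-boundary pattern keeps the match to the feature name exon):
[cgrlunix@poset ~]$ grep "\bexon\b" Homo_sapiens.GRCh38.81.gtf | cut -f 1,4,5 > gene.bed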
\b matches the empty string at the edge of a word. It sets a boundary to the matches.
uniq (when fed a text file, outputs the file with consecutive identical lines collapsed to one)
First create a new file uniq_test.txt using nano and copy the following content into this file:
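[cgrlunix@poset ~]$ nano uniq_test.txt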
gene1 A
gene1 A
gene6 F
gene6 F
gene6 F
gene1 A
gene2 B
gene2 B
gene4 D
gene5 E
gene6 F
gene6 F
gene6 F
gene1 A
gene1 A
gene2 B
gene2 B
gene2 B
gene3 C
gene4 D
gene4 D
First let's try running uniq on the list directly:
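[cgrlunix@poset ~]$ uniq uniq_test.txt
gene1 A
gene6 F
gene1 A
gene2 B
gene4 D
gene5 E
gene6 F
gene1 A
gene2 B
gene3 C
gene4 D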
As you can see, uniq does not return the unique values — it only removes consecutive duplicate lines (keeping one). If instead we did want to find all unique lines in a file, we would first sort all lines using sort so that all identical lines are grouped next to each other, and then run uniq. For example:
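[cgrlunix@poset ~]$ sort uniq_test.txt | uniq
gene1 A
gene2 B
gene3 C
gene4 D
gene5 E
gene6 F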
Print only those lines which are not repeated (unique) in the input.
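[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -u
gene3 C
gene5 E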
Print only those lines which are repeated in the input.
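[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -d
gene1 A
gene2 B
gene4 D
gene6 F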
You can also get some statistics out of uniq with the -c option
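[cgrlunix@poset ~]$ sort uniq_test.txt | uniq -c
      5 gene1 A
      5 gene2 B
      1 gene3 C
      3 gene4 D
      1 gene5 E
      6 gene6 F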
Both sort | uniq and sort | uniq -c are frequently used shell idioms in bioinformatics and worth memorizing. Combined with other Unix tools like grep and cut, sort and uniq can be used to summarize columns of tabular data. Now we want to count the number of records for each feature present in the gtf file:
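One way to do this (a sketch; grep -v "^#" drops the header lines first):
[cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | cut -f 3 | sort | uniq -c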
note: the 3rd column contains the features of the gtf
Since sort and uniq are line-based, we can create lines from multiple columns to count combinations. In this example, we would like to count how many of each feature (column 3 in this GTF) are on each strand (column 7):
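[cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | cut -f 3,7 | sort | uniq -c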
Or, if you want to see the number of features belonging to a particular gene identifier:
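(PLCH2 below is just an arbitrary example; substitute any gene identifier present in the file)
[cgrlunix@poset ~]$ grep "PLCH2" Homo_sapiens.GRCh38.81.gtf | cut -f 3 | sort | uniq -c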
{} (Brace expansion is a mechanism by which arbitrary strings may be generated)
{A,B,C} expands all the elements within the braces
{1..9} expands a range of elements within the braces
[cgrlunix@poset ~]$ echo hello {A,B,C}
hello A B C
[cgrlunix@poset ~]$ echo hello{A,B,C}
helloA helloB helloC
Create several folders with names starting with "folder" under the home directory
[cgrlunix@poset ~]$ mkdir folder_{A,B,C}
[cgrlunix@poset ~]$ ls folder*
folder_A:
folder_B:
folder_C:
Create several new empty .txt files with names beginning with "text" using touch under the home directory
[cgrlunix@poset ~]$ touch text{1..5}.txt
[cgrlunix@poset ~]$ ls text*
text1.txt text2.txt text3.txt text4.txt text5.txt
find
find is useful for finding file or directory names that match certain criteria.
Search the home directory and print any files or directories beginning with list
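[cgrlunix@poset ~]$ find ~ -name "list*"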
Search the home directory to print all subdirectories
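[cgrlunix@poset ~]$ find ~ -type d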
More about grep
grep is one of the most powerful Unix data tools and is very commonly used in bioinformatics data processing.
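For example, let's search sequence.fasta, the file we created earlier with nano:
[cgrlunix@poset ~]$ grep -n "T$" sequence.fasta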
The above command greps each line ending with a "T" from sequence.fasta and prints out the line number of that line. $ matches the end of a line
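[cgrlunix@poset ~]$ grep -n "^C" sequence.fasta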
The above command greps each line starting with a "C" and prints out the line number of that line. ^ matches the beginning of a line
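[cgrlunix@poset ~]$ grep -n "^C.*T$" sequence.fasta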
The above command greps each line starting with "C" and ending with "T", with any characters in between them, and prints out the line number of that line. . means any single character (except newline); * means that the preceding item will be matched zero or more times.
Now let's go back to work on file1.fasta. It is a fasta file that contains 11 transcript sequences. We want to print the sequence of "Transcript10".
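For example (a sketch: the count after -A assumes each sequence occupies a single line, and SOME_SEQUENCE below is a placeholder for any fragment of sequence you want to look up):
[cgrlunix@poset ~]$ grep -A 1 "Transcript10" file1.fasta
[cgrlunix@poset ~]$ grep -B 1 "SOME_SEQUENCE" file1.fasta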
-A N (N an integer) means that grep will print N lines of trailing context after each matching line.
-B N means that grep will print N lines of leading context before each matching line.
The above two command lines are useful for displaying sequences of a fasta file.
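Now try the same approach to print the sequence of Transcript1:
[cgrlunix@poset ~]$ grep -A 1 "Transcript1" file1.fasta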
What happened? It matches Transcript1, Transcript10 and Transcript11, since all three contain “Transcript1”, the pattern we were searching for. We just want the sequence after Transcript1 to be printed, so we need to add more conditions to the search. There are several ways to do this:
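For example (sketches: -w restricts matches to whole words, and \b marks a word boundary explicitly):
[cgrlunix@poset ~]$ grep -w -A 1 "Transcript1" file1.fasta
[cgrlunix@poset ~]$ grep -A 1 "Transcript1\b" file1.fasta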
Now let's work on the gtf file.
Display content without header lines:
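[cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | less -S
-v inverts the match, so only lines that do not start with "#" are printed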
Next let's use grep to extract the records of a transcript called "PLCH2-001"
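[cgrlunix@poset ~]$ grep "PLCH2-001" Homo_sapiens.GRCh38.81.gtf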
Next let's use grep to extract the records of the transcripts "PLCH2-001" and "PLCH2-002"
-E means Extended Regular Expressions (ERE). EREs allow us to use alternation (regular expression jargon for matching one of several possible patterns) to match either “PLCH2-001” or “PLCH2-002”. The syntax uses a pipe symbol, |:
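[cgrlunix@poset ~]$ grep -E "(PLCH2-001|PLCH2-002)" Homo_sapiens.GRCh38.81.gtf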
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). The above command can be expressed using the basic grep regular expression:
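[cgrlunix@poset ~]$ grep "\(PLCH2-001\|PLCH2-002\)" Homo_sapiens.GRCh38.81.gtf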
By default, grep outputs the entire matching line if the line contains the pattern we try to search. To only output the matched part of the pattern, we use -o for grep. Now we want to extract all values from the “gene_id” field from the last column of the gtf file. This can be done using grep -o:
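[cgrlunix@poset ~]$ grep -E -o 'gene_id "\w+"' Homo_sapiens.GRCh38.81.gtf | head -n 5
note: head -n 5 just limits how much output is printed to the screen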
\w means a word character (a-z, A-Z, 0-9...). + means the preceding item is matched one or more times. \w+ means one or more word characters
Here, we’re using extended regular expressions (ERE) to capture all gene names in the field. However, as you can see there’s a great deal of redundancy: our GTF file has multiple features (transcripts, exons, start codons, etc.) that all have the same gene name. We could quickly convert this messy output from grep to a list of unique, sorted gene names:
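One way to do this (a sketch; cut -d '"' -f 2 pulls out the value between the double quotes):
[cgrlunix@poset ~]$ grep -E -o 'gene_id "\w+"' Homo_sapiens.GRCh38.81.gtf | cut -d '"' -f 2 | sort | uniq | head -n 5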
sed (a stream editor)
A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). It reads data from a file or standard input and can edit a line at a time.
Here we use sed to find and replace a string in a file. For example, we want to find "Transcript10" in the file file1.fasta and then replace "Transcript10" with "gene10":
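[cgrlunix@poset ~]$ sed 's/Transcript10/gene10/g' file1.fasta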
In the above example, s/A/B/g means: substitute A with B; the trailing g makes the substitution global, replacing every occurrence on each line rather than only the first.
Next, we want to find the sequence of Transcript1 and then transform all "A"s to "a"s
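One way to do this is to combine grep and sed (a sketch, again assuming each sequence occupies one line):
[cgrlunix@poset ~]$ grep -w -A 1 "Transcript1" file1.fasta | sed 's/A/a/g'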
Substitutions make up the majority of sed’s usage cases, but this is just scratching the surface of sed’s capabilities. It’s also possible to select and print certain ranges of lines with sed. In this case, we’re not doing pattern matching, so we don’t need slashes. To print the lines 5 through 8 of the fastq file, we use:
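[cgrlunix@poset ~]$ sed -n '5,8p' data.fastq
-n suppresses sed's default behavior of printing every line; the p in 5,8p then prints only lines 5 through 8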
sed can also be used to remove lines from a file
remove the 2nd line from a file
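For example, using uniq_test.txt as the input (sed prints the edited result to the screen and leaves the file itself unchanged):
[cgrlunix@poset ~]$ sed '2d' uniq_test.txt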
remove lines 1-5 from a file
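[cgrlunix@poset ~]$ sed '1,5d' uniq_test.txt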
To figure out which lines to remove you can use grep -n <pattern>
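[cgrlunix@poset ~]$ grep -n "gene3" uniq_test.txt
19:gene3 C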
remove the last line
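[cgrlunix@poset ~]$ sed '$d' uniq_test.txt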
remove the first and last lines
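[cgrlunix@poset ~]$ sed '1d;$d' uniq_test.txt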
tr (translate or transliterate)
tr reads from stdin (pipe or redirect a file into it) and replaces, removes, or compresses specific characters
1. replacement:
For example, we can use tr to transform all lowercase letters to uppercase:
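[cgrlunix@poset ~]$ echo "acgt" | tr "a-z" "A-Z"
ACGT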
use tr for searching and replacing specific characters
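For example, transcribing DNA to RNA by replacing every T with a U:
[cgrlunix@poset ~]$ echo "ACGT" | tr "T" "U"
ACGU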
2. delete:
we can use tr -d to search for characters and delete them
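For example, stripping the commas out of a number:
[cgrlunix@poset ~]$ echo "1,000,000" | tr -d ","
1000000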
3. compress:
tr -s can be used to compress consecutive identical characters to one
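For example, squeezing a run of repeated As down to a single A:
[cgrlunix@poset ~]$ echo "AAACCCGGT" | tr -s "A"
ACCCGGT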
awk:
Awk is a specialized language that allows you to do a variety of text-processing tasks with ease. It mainly helps us extract data from and manipulate tabular plaintext files. Throughout the workshop, we’ve seen how we can use simple Unix tools like grep, cut, and sort to inspect and manipulate plaintext tabular data in the shell. For many trivial bioinformatics tasks, these tools allow us to get the job done quickly and easily (and often very efficiently). Still, some tasks are slightly more complex and require a more expressive and powerful tool. This is where the language and tool Awk excels.
We’ll introduce the basics of Awk in this section — enough to get you started with using awk in bioinformatics. While Awk is a fully-fledged programming language, it’s a lot less expressive and powerful than Python and Perl. If you need to implement something complex, it’s likely better (and easier) to do so in Python/Perl. The key to using Awk effectively is to use it for the subset of tasks it’s best at: quick data processing tasks on tabular data.
To learn Awk we’ll cover two key parts of the Awk language: how Awk processes records, and pattern-action pairs. After understanding these two key parts the rest of the language is quite simple.
First, Awk processes input data a record at a time. Each record is composed of fields, separate chunks that Awk automatically separates. Since Awk was designed to work with tabular data, each record is a line, and each field is a column’s entry for that record. The clever part about Awk is that it automatically assigns the entire record to the variable $0, and field one’s value is assigned to $1, field two’s value is assigned to $2, field three’s value is assigned to $3, and so forth.
Second, we build Awk programs using one or more of the following structures: pattern { action }. Each pattern is an expression or regular expression pattern. Patterns are a lot like if statements in other languages: if the pattern’s expression evaluates to true or the regular expression matches, the statements inside action are run. In Awk lingo, these are pattern-action pairs and we can chain multiple pattern-action pairs together (separated by semicolons).
If we omit the pattern, Awk will run the action on all records. If we omit the action but specify a pattern, Awk will print all records that match the pattern. This simple structure makes Awk an excellent choice for quick text processing tasks. This is a lot to take in, but these two basic concepts --records and fields, and pattern-action pairs-- are the foundation of writing text processing programs with Awk. Let’s see some examples.
- Action only
First, we can simply mimic cat by omitting a pattern and printing an entire record with the variable $0:
[cgrlunix@poset ~]$ awk '{print $0}' gene.bed
Awk can also mimic cut:
[cgrlunix@poset ~]$ awk '{print $2"\t"$3}' gene.bed
Here, we’re making use of Awk’s string concatenation. Two strings are concatenated if they are placed next to each other with no operator in between. So for each record, $2"\t"$3 concatenates the second field, a tab character, and the third field.
- Pattern only
Let’s now look at how we can incorporate simple pattern matching. Suppose we wanted to write a filter that only outputs lines where the length of the feature (end position - start position) is greater than 1,000. Awk supports arithmetic with the standard operators +, -, *, /, % (remainder), and ^ (exponentiation). We can subtract within a pattern to calculate the length of a feature, and filter on that expression:
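[cgrlunix@poset ~]$ awk '$3 - $2 > 1000' gene.bed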
We can also chain patterns by using logical operators && (and), || (or), and ! (not). For example, if we want to extract all records derived from chromosome 1 whose length is greater than 1000:
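[cgrlunix@poset ~]$ awk '$1 ~ /^1$/ && $3 - $2 > 1000' gene.bed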
The first pattern, $1 ~ /^1$/, is how we specify a regular expression. Regular expressions are enclosed in slashes. Here, we’re matching the first field, $1, against the regular expression ^1$; the anchors ^ and $ ensure that chromosomes like 10 or 12 do not also match. The tilde, ~, means match; to not match the regular expression we would use !~ (or !($1 ~ /^1$/)).
- Pattern and action
We can combine patterns with more complex actions than just printing the entire record. For example, if we want to add a column with the length of the feature (end position - start position) for mitochondrial (MT) records only, we could use:
[cgrlunix@poset ~]$ awk '$1 ~ /MT/ { print $0 "\t" $3 - $2 + 1}' gene.bed
So far, these exercises illustrate how Awk is useful in (1) filtering data using rules that can combine regular expressions and arithmetic, and (2) reformatting the columns of data using arithmetic. These two applications alone make Awk an extremely handy tool in bioinformatics, and a huge time saver.
Now let’s look at some slightly more advanced use cases. We’ll start by introducing two special patterns: BEGIN and END.
The BEGIN pattern specifies what to do before the first record is read in, and END specifies what to do after the last record’s processing is complete. BEGIN is useful to initialize and set up variables, and END is useful to print data summaries at the end of file processing. For example, we want to calculate the mean length of all the records in gene.bed. We would have to take the sum of all lengths, and then divide the sum by the total number of records. We can do this with:
[cgrlunix@poset ~]$ awk 'BEGIN { s = 0 }; { s += ($3 - $2 + 1) }; END { print "mean: " s/NR }' gene.bed
mean: 224.604
NR is the current record number, so on the last record NR is set to the total number of records processed. In the example above, we’ve initialized a variable s to 0 in BEGIN (variables you define do not need a dollar sign). Then, for each record we increment s by the length of the feature. At the end of the records, we print this sum s divided by the number of records NR, giving the mean.
While Awk is designed to work with whitespace-separated tabular data, it’s easy to set a different field separator: simply specify which separator to use with the -F argument. For example, we could work with a CSV file in Awk by starting with awk -F",".
We can use NR to extract ranges of lines too; for example, if we want to extract all lines between 3 and 5 (inclusive):
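[cgrlunix@poset ~]$ awk 'NR >= 3 && NR <= 5' gene.bed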
Awk makes it easy to convert from GTF to BED:
[cgrlunix@poset ~]$ awk '$1 !~ /^#/ {print $1 "\t" $4 "\t" $5 }' Homo_sapiens.GRCh38.81.gtf | sort | uniq | head -n 5
1    11868    12227
1    11868    14409
1    12009    12057
1    12009    13670
1    12178    12227