Introduction+to+basic+Unix+commands


 * UNIX data tools workshop **

Your instructor is Ke Bi.


 * IMPORTANT NOTES **

__** 1. **** This workshop material is mainly based on “UNIX Tutorial for Beginners” ([]) with modifications. “UNIX Tutorial for Beginners” is covered by a [|Creative Commons licence]. If you need to download a copy of the original tutorial please go to http://www.ee.surrey.ac.uk/Teaching/Unix/download.html **__

__** 2. Some of the workshop lectures are directory copied from Chapter 7-Unix Data Tools from  // Bioinformatics Data Skills // authored by Vince Buffalo (especially in sessions about advanced grep, sed and awk). **** //Bioinformatics Data Skills// **** introduces various fundamental but extremely useful tools that are commonly used in modern bioinformatics data processing. It is worth a read! **__

//** Bioinformatics Data Skills **// //** Reproducible and Robust Research with Open Source Tools **// **By [|Vince Buffalo] **
 * Publisher: O'Reilly Media**
 * Final Release Date: July 2015**
 * Pages: 538**
 * 3. The material below is intended for the in-class instruction. Others trying to learn this material offline can follow along on any Unix (like) machine, however, please be aware the file names and path names will be different on your own machine. **


 * Requirements **

The module requires that you use a recent MacOS or Linux based system. Future versions of the module may accommodate Microsoft Windows users.


 * Using the UNIX shell **

Your time in the UNIX environment will be spent in the **shell**, which is UNIX command line interpreter. Shell runs in the Terminal window and if you're using Mac OS (you'll find the Terminal under Applications -> Utilities -> Terminal). The shell allows you to move and copy files, run programs, and more. On Windows, you could install Cygwin and get a UNIX environment on your Windows machine.


 * How do I open my terminal window? **

In Mac OS, you can open a new terminal window by opening the Finder, selecting the Applications folder, then the Utilities sub-folder, and then double-clicking on the Terminal application. You may want to drag the Terminal icon to your Application Dock for ready access going forward.

Alternatively, you can just use Spotlight to search for "Terminal" each time, and open the application after it is listed in the Spotlight results list.

In Linux environments, your terminal program will depend on the version of Linux you have installed. If you are using Linux and you don't know how to open a terminal, raise your hand and let us know now.


 * Accessing your temporary CGRL account **

The next step is to login to the CGRL server (poset.cgrl.berkeley.edu) with your user account and password. You can do this with the ssh command. After login, you should see a shell prompt that contains your user name and the name of the machine followed by a dollar sign.

code Kes-MacBook-Pro:~ kebi$ ssh cgrlunix@poset.cgrl.berkeley.edu Password: [cgrlunix@poset ~]$ code When you first login, your current working directory is your home directory. Your home directory has the same name as your user-name, for example, cgrlunix, and it is where your personal files and subdirectories are saved. To find out what is in your home directory, type code [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta code The ** ls ** command lists the contents of your current working directory. ** ls ** does not, in fact, cause all the files in your home directory to be listed, but only those ones whose name does not begin with a dot (.). Files beginning with a dot (.) are known as hidden files and usually contain important program configuration information. They are hidden because you should not change them unless you are very familiar with Unix!!!
 * 1.1 Listing files and directories **
 * ls (list) **

To list all files in your home directory including those whose names begin with a dot, type code [cgrlunix@poset ~]$ ls -a . .. .bash_logout  .bashrc  .cache  data.fastq  file1.fasta  Homo_sapiens.GRCh38.81.gtf  .profile code
 * ls ** is an example of a command which can take options: ** -a ** is an example of an option which means do not ignore entries starting with dot (**.**).

Another bit of information we usually want about a file is its size. code [cgrlunix@poset ~]$ ls -l total 6568 -rw-r--r-- 1 cgrlunix cgrlunix 636433 Jan  7 12:43 data.fastq -rw-r--r-- 1 cgrlunix cgrlunix   9193 Jan  7 12:43 file1.fasta -rw-r--r-- 1 cgrlunix cgrlunix 6073448 Jan 7 12:43 Homo_sapiens.GRCh38.81.gtf code A brief explanation of the information displayed in "ls -l" Column 1: File permissions and file types //1 2 3 4// //- rw- r-- r--// //1. leftmost field is file type: "-" means a file and "d"means a directory.// //2. The owner (cgrlunix) has read (r) and write (w) permission to this file// //3. The group has read permission to this file// //4. everyone else has read permission to this file// //Note: if the item is an executable then the last field of each section is marked as "x" instead of a "-". For instance: rwx//

Column 2: Links field - Number of short cuts to this file from other directories. Column 3: Name of owner Column 4: Name of group Column 5: Size Column 6: Date/time when the item was created or last modified Column 7: Name of the item

The fourth column (the one before the creation data) **ls -l** reports file sizes in bytes. If we wish to use human readable sies, we can use **ls -lh**: code [cgrlunix@poset ~]$ ls -lh total 6.5M -rw-r--r-- 1 cgrlunix cgrlunix 622K Jan 7 12:43 data.fastq -rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan 7 12:43 file1.fasta -rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan 7 12:43 Homo_sapiens.GRCh38.81.gtf code

If you want to sort the files by size, we can use ** ls -lhS ** : code [cgrlunix@poset ~]$ ls -lhS total 6.5M -rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan 7 12:43 Homo_sapiens.GRCh38.81.gtf -rw-r--r-- 1 cgrlunix cgrlunix 622K Jan 7 12:43 data.fastq -rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan 7 12:43 file1.fasta code

If you want to sort the files by size in a reverse order, we can use ** ls -lhSr ** : code [cgrlunix@poset ~]$ ls -lhSr total 6.5M -rw-r--r-- 1 cgrlunix cgrlunix 9.0K Jan 7 12:43 file1.fasta -rw-r--r-- 1 cgrlunix cgrlunix 622K Jan 7 12:43 data.fastq -rw-r--r-- 1 cgrlunix cgrlunix 5.8M Jan 7 12:43 Homo_sapiens.GRCh38.81.gtf code

These options change the behavior of the command. There are online manual pages that tell you which options a particular command can take, and how each option modifies the behavior of the command. (See later in this tutorial)

** 1.2 Making Directories ** To make a subdirectory called ** unixstuff ** in your current working directory type code [cgrlunix@poset ~]$ mkdir unixstuff code To see the directory you have just created, type code [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta unixstuff code
 * mkdir (make directory) **

** 1.3 Changing to a different directory ** The command ** cd //directory// ** means change your working directory from the current one to a different directory. The current working directory may be thought of as the directory you are in, i.e. your current position in the file-system tree.
 * cd (change directory) **

To change to the directory you have just made, type code [cgrlunix@poset ~]$ cd unixstuff code Type ** ls ** to see the contents (which should be empty) code [cgrlunix@poset unixstuff]$ls code Now create a new directory **backups** under **unixstuff** code [cgrlunix@poset unixstuff]$ mkdir backups code

** 1.4 The directories. and .. ** Still in the **unixstuff** directory, type code [cgrlunix@poset unixstuff]$ls -a . .. backups code As you can see, in the **unixstuff** directory (and in all other directories), there are two special directories called (**.**) and (**..**) In UNIX, (**.**) means the current directory, so type code [cgrlunix@poset unixstuff]$ cd. code means stay where you are (the **unixstuff** directory). This may not seem very useful at first, but using (**.**) as the name of the current directory will save a lot of typing, as we shall see later in the tutorial.

(**..**) means the parent of the current directory, so typing code [cgrlunix@poset unixstuff]$ cd .. code will take you one directory up the hierarchy (back to your home directory).

Note: typing ** cd ** with no argument always returns you to your home directory. This is very useful if you are lost in the file system.

** 1.5 Pathnames ** Pathnames enable you to work out where you are in relation to the whole file-system. For example, to find out the absolute pathname of your working directory: code [unixcgrl@poset ~]$ pwd /home/cgrlunix code
 * pwd (print working directory) **

** 1.6 More about pathnames **
 * Understanding pathnames **

Then type **ls** to list the contents of your **unixstuff** directory code [cgrlunix@poset ~]$ ls unixstuff backups code Now type code [cgrlunix@poset ~]$ ls backups ls: backups: No such file or directory code The reason is, **backups** is not in your current working directory. To use a command on a file (or directory) not in the current working directory (the directory you are currently in), you must either ** cd ** to the correct directory, or specify its pathname. To list the contents of your backups directory, you must type code [cgrlunix@poset ~]$ ls unixstuff/backups code

**2.1 Copying files**
 * cp (copy) **
 * cp //file1 file2// ** is the command which makes a copy of **file1** in the current working directory and calls it **file2**

First, ** cd ** to your home directory and type **ls**. code [cgrlunix@poset ~]$ cd [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta  unixstuff code now make a copy of **file1.fasta** and call the copy **file2.fasta** code [cgrlunix@poset ~]$ cp file1.fasta file2.fasta code now list all the contents in your home directory code [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta  file2.fasta  unixstuff code Now try something more fancier: make a copy of **file1.fasta** to the directory **unixstuff** and name this copy **file3.fasta** code [cgrlunix@poset ~]$ cp file1.fasta unixstuff/file3.fasta code Check the content in **unixstuff** code [cgrlunix@poset ~]$ ls unixstuff backups file3.fasta code Now **cd** to **unixstuff** code [cgrlunix@poset ~]$ cd unixstuff code Now make a copy of **file1.fasta** to the current directory and name the copy as **file4.fasta** code [cgrlunix@poset unixstuff]$ cp ../file1.fasta file4.fasta code Now make a copy of **file1.fasta** to the current directory without renaming it code [cgrlunix@poset unixstuff]$ cp ../file1.fasta. code Don't forget the dot ( ** . ** ) at the end. Remember, in unix, the dot means the current directory. NOTE: Copy directories follows the same rules except we need to add an option -r to make cp work with directories.

** 2.2 Moving files ** To move a file from one place to another, use the ** mv ** command. This has the effect of moving rather than copying the file, so you end up with only one file rather than two. It can also be used to rename a file, by moving the file to the same directory, but giving it a different name. It can also rename directories or move directories to other directories.
 * mv (move) **
 * mv //file1 file2// ** moves (or renames) **file1** to **file2**

In this lecture we will learn how to move files. Moving directories follow the same rules.

We are now going to move **file1.fasta** to your backup directory. This can be done in many different ways. Here I just list one example: cd to your home directory code [unixcgrl@poset unixstuff]$ cd code display all the content in home directory code [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta  file2.fasta  unixstuff code now move **file1.fasta** to the directory **backups** code [cgrlunix@poset ~]$ mv file1.fasta unixstuff/backups code Now check the content in the home directory again by using **ls** code [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file2.fasta  unixstuff code You can see **file1.fasta** is gone. Now check the content in the directory backups code [cgrlunix@poset ~]$ ls unixstuff/backups file1.txt code mv can sometimes be "dangerous" to use for renaming files unless you clearly understand what you try to do. Try the following code [cgrlunix@poset ~]$ cp file2.fasta test.fasta [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file2.fasta  test.fasta  unixstuff code code [cgrlunix@poset ~]$ mv file2.fasta test.fasta [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq test.fasta unixstuff code You can see that the original copy of ** file2.fasta ** is gone. ** test.fasta **is overwritten by ** file2.fasta **.

Let's rename **test.fasta** to be **file2.fasta** code [cgrlunix@poset ~]$ mv test.fasta file2.fasta code

** 2.3 Removing files and directories **
 * rm (remove), rmdir (remove empty directory), rm -r (remove directories and their contents recursively) **

To delete (remove) a file, use the ** rm ** command. Now **cd** to your **unixstuff** directory and list all the content, type code [cgrlunix@poset ~]$cd unixstuff [cgrlunix@poset unixstuff]$ ls backups file1.fasta  file3.fasta  file4.fasta code No remove (permanently!) **file1.fasta**, **file3.fasta** and **file4.fasta** from **unixstuff**, and then list the content code [cgrlunix@poset ~]$ rm file1.fasta file3.fasta file4.fasta [cgrlunix@poset unixstuff]$ ls backups code Now make a new directory **test** under **unixstuff** and use **rmdir** to remove it. Make sure to check the content of **unixstuff** before and after running **rmdir. Y**ou can also use rm -r to remove a directory and its contents recursively code [cgrlunix@poset unixstuff]$ mkdir test [cgrlunix@poset unixstuff]$ ls backups test [cgrlunix@poset unixstuff]$ rmdir test [cgrlunix@poset unixstuff]$ ls backups

[cgrlunix@poset unixstuff]$ mkdir test [cgrlunix@poset unixstuff]$ cp ../file2.fasta test/ [cgrlunix@poset unixstuff]$ rmdir test/ rmdir: failed to remove ‘test/’: Directory not empty [cgrlunix@poset unixstuff]$ rm -r test/ code

** 2.4 Displaying the contents of a file on the screen ** Before you start the next section, you may like to clear the terminal window of the previous commands so the output of the following commands can be clearly demonstrated.
 * clear (clear screen) **

Now **cd** to your home directory code [cgrlunix@poset unixstuff]$ cd code Type **clear** in the shell code [cgrlunix@poset ~]$ clear code This will clear all text and leave you with the $ prompt at the top of the window.

Now list the content by using **ls** code [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file2.fasta  unixstuff code code [cgrlunix@poset ~]$ mv unixstuff/backups/file1.fasta. [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf data.fastq file1.fasta  file2.fasta  unixstuff code
 * file2.fasta ** is a copy of **file1.fasta**, but we have moved **file1.fasta** to the directory **backups**. Now we can move **file1.fasta** back. Then you can type **ls** to check the content

The command ** cat ** can be used to display the contents of a file on the screen. Type: code [cgrlunix@poset ~]$ cat file1.fasta >Transcript1 CAGGCTTAGCG... >Transcript2 CAGAGCCTCCA… … code As you can see, the file is longer than the size of the window, so it scrolls past making it unreadable.
 * cat **

The command ** less ** writes the contents of a file onto the screen a page at a time. Type code [cgrlunix@poset ~]$ less file1.fasta code Press the ** [space-bar] ** if you want to see another page, type **[b]** to go up a page, type ** [q] ** if you want to quit reading. As you can see, ** less ** is used in preference to ** cat ** for long files.
 * less **

Now let's use less to view file "data.fastq"

code [cgrlunix@poset ~]$ less data.fastq code One of the most useful features of **less** is that it allows you to search text and highlights matches. Visually highlighting matches can be an extremely useful way to find potential problems in data. For example, let’s use less to do a quick search of whether there are 3’ adapter contaminants in the **data.fastq** file. In this case, we’ll look for AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC — a known adapter from the Illumina Tru‐Seq. kit. Our goal isn’t to do an exhaustive test or remove these adapters — we just want to check if there’s any indication there could be contamination.

First, we open data.fastq in less , the press / and enter AGATCGGAAGA (or longer if you want to be more stringent) . code [cgrlunix@poset ~]$ less data.fastq code

The ** head ** command writes the first few lines of a file to the screen. the default is to display 10 lines
 * head **

First **clear** the screen then type **head file1.fasta** code [cgrlunix@poset ~]$ clear [cgrlunix@poset ~]$ head file1.fasta >Transcript1 CAGGCTTAGCG... >Transcript2 CAGAGCCTCCA… … code Then type code [cgrlunix@poset ~]$ head -n 5 file1.fasta >Transcript1 CAGGCTTAGCG... >Transcript2 CAGAGCCTCCA… >Transcript3 code The above shows the first 5 lines in **file1.fasta**

The ** tail ** command writes the last ten lines of a file to the screen. Its function is opposite to **head**. Please give it a try. In the shell, type **clear**, then **tail file1.fasta**, then **tail -n 5 file1.fasta**
 * tail **

We can also use **tail** to remove the header of a file. Normally the -n argument specifies how many of the last lines of a file to display, but if **-n** is given a number x preceded with a + sign (e.g. +x), tail will start from the xth line. So to chop off a header (the first line in our case), we start from the second line with -n +2. code [cgrlunix@poset ~]$ tail -n +2 file1.fasta CAGGCTTAGCG... >Transcript2 CAGAGCCTCCA… >Transcript3 code

** 2.5 Searching the contents of a file ** code [cgrlunix@poset ~]$ grep “Transcript” file1.fasta >Transcript1 >Transcript2 >Transcript3 … code As you can see, ** grep ** has printed out each line containing the word “Transcript”.
 * grep **
 * grep ** is one of many standard UNIX utilities. It searches files for specified words or patterns. First **clear** the screen, then type

Now try typing code [cgrlunix@poset ~]$ grep “transcript” file1.fasta code The ** grep ** command is case sensitive; it distinguishes between “Transcript” and “transcript”.

To ignore upper/lower case distinctions, use the **-i** option, i.e. type code [cgrlunix@poset ~]$ grep -i “transcript” file1.fasta >Transcript1 >Transcript2 >Transcript3 … code Other common and useful options for grep include: **-v** display those lines that do NOT match **-n** precede each matching line with the line number **-c** print only the total count of matched lines

Try some of them and see the different results. Don't forget, you can use more than one option at a time, for example, the number of lines without the words “Transcript” or “transcript” is code [cgrlunix@poset ~]$ grep -ivc "Transcript" file1.fasta 11 code

A handy little utility is the ** wc ** command, short for word count. By default, **wc** outputs the number of words, lines, and characters of the supplied file.
 * wc ** (word count)

code [cgrlunix@poset ~]$ wc file1.fasta 22  22 9193 file1.fasta code

To find out how many lines of **file1.fasta** has, type code [cgrlunix@poset ~]$ wc -l file1.fasta 22

code ** 3.1 Redirecting the Output ** Most processes initiated by UNIX commands take input from the standard input (that is, they read it from the keyboard) and write to the standard output (that is, they write to the terminal). There is also the standard error, where processes write their error messages, by default, to the terminal screen. Sometimes you may wish to save the output in a file rather than just print them in a terminal, and in this case we need to redirect the output.


 * echo ** (writes its arguments to standard output)

In a terminal window type code [cgrlunix@poset ~]$ echo hello! hello! code We now direct the output **hello!** to a new file called **test.txt**, check the content of **test.txt** using **cat** code [cgrlunix@poset ~]$ echo hello! > test.txt [cgrlunix@poset ~]$ cat test.txt hello! code We use the **>** symbol to redirect the output of a command.

For example, to create a file called **list1.txt** containing a list of fruits, type code [cgrlunix@poset ~]$ cat > list1.txt code Then type in the names of some fruit. Press ** [return] ** after each one code pear banana apple code
 * ^d (Control + d to stop) **

What happens is the ** cat ** command reads the standard input (what you typed via the keyboard) and the **>** redirects the output into a file called **list1.txt**

To read the contents of the file, type **cat list1.txt** or **less list1.txt** code [cgrlunix@poset ~]$ cat list1.txt pear banana apple code The form **>>** appends standard output to a file. So to add more items to the file **list1.txt**, type code [cgrlunix@poset ~]$ cat >> list1.txt code

Then type in the names of more fruits code peach grape orange code
 * ^d (Control + d to stop) **

To read the contents of the file, type code [cgrlunix@poset ~]$ cat list1.txt pear banana apple peach grape orange code Now we can write some vegetables to a new file called **list2.txt, t** hen type in the names of some vegetables

code [cgrlunix@poset ~]$ cat > list2.txt lettuce carrot kale code
 * ^d (Control + d to stop) **

You should now have two files. One contains six fruits, and the other contains three vegetables. We will now use the ** cat ** command to join (concatenate) **list1.txt** and **list2.txt** into a new file called **biglist.txt**. Type code [cgrlunix@poset ~]$ cat list1.txt list2.txt > biglist.txt code What this is doing is reading the contents of **list1.txt** and **list2.txt** in turn, then outputing the text to the file **biglist.txt**

To read the contents of the new file, type code [cgrlunix@poset ~]$ cat biglist.txt pear banana apple peach grape orange lettuce carrot kale code The command ** sort ** alphabetically or numerically sorts a list. Type code [cgrlunix@poset ~]$ sort biglist.txt apple banana carrot grape kale lettuce orange peach pear code To output the sorted list to a file called **sortlist**, type the following code [cgrlunix@poset ~]$ sort biglist.txt > sortlist.txt code

** 3.2 Pipes ** **I (pipe)** takes the output of one command before the **pipe** and uses it as input for the command after the **pipe**. Useful for chaining commands which have one output and accept one input

An example: we want to **sort** the file biglist alphabetically, then display the first 5 lines, then from these five lines we want to **grep** any lines containing letter “r” or “R” and then count the number of matched lines code [cgrlunix@poset ~]$ sort biglist.txt | head -5 | grep -ic “r” 2

code
 * tee** writes output to stdout, and also to a file simultaneously

For example we can use **ls** to list all files and directories under the current working diretory and save the output in "all.txt" code [cgrlunix@poset ~]$ ls > all.txt

code Now we add **tee** to save the output in "all.txt" and at the same time, view the output as stdout in shell code [cgrlunix@poset ~]$ ls | tee all.txt code By default **tee** overwrites the file. You can instruct **tee** to append to the file using the option **-a** as shown below code [cgrlunix@poset ~]$ ls | tee -a all.txt code

** 4.1 Wildcards ** The character * is called a wildcard, and will match against none or more character(s) in a file (or directory) name. For example, in your **home** directory, type code [cgrlunix@poset ~]$ ls list* list1.txt list2.txt code This will list all files and subdirectories in the current directory starting with **list (list…)**
 * The characters * and ? **

Try typing code [cgrlunix@poset ~]$ ls *list.txt biglist.txt sortlist.txt code This will list all files and subdirectories in the current directory ending with **list (…list)**

Now typing code [cgrlunix@poset ~]$ ls *list* biglist.txt list1.txt list2.txt sortlist.txt code This will list all files and subdirectories in the current directory containing **list (…list…)**

The character **?** will match exactly one character. So ** ls ?ouse ** will match files like **house** and **mouse**, but not **grouse**.

Try typing code [cgrlunix@poset ~]$ ls ?list1.txt ls: ?list1.txt: No such file or directory code Now try typing code [cgrlunix@poset ~]$ ls ?ist1.txt list1.txt code

** 4.2 Filename c ** **onventions** We should note here that a directory is merely a special type of file. So the rules and conventions for naming files apply also to directories. In naming files, characters with special meanings such as **/ * & %**, should be avoided. Also, avoid using spaces within names. The safest way to name a file is to use only alphanumeric characters, that is, letters and numbers, together with _ (underscore) and. (dot). File names conventionally start with a lower-case letter, and may end with a dot followed by a group of letters indicating the contents of the file. For example, all files consisting of C code may be named with the ending .c, for example, prog1.c. Then in order to list all files containing C code in your home directory, you need only type ** ls *.c ** in that directory.

** 4.3 Getting Help **


 * man (on-line manuals) and whatis **

There are on-line manuals which gives information about most commands. The manual pages tell you which options a particular command can take, and how each option modifies the behavior of the command. Type **man** command to read the manual page for a particular command. For example, to find out more about the ** wc ** (word count) command, type code [cgrlunix@poset ~]$ man wc code Alternatively, type code [cgrlunix@poset ~]$ whatis wc code
 * whatis ** gives a one-line description of the command, but omits any information about options etc.

** 5.1 Some more advanced UNIX commands commonly used in bioinformatics ** The df command reports on the space left on the file system. For example, to find out how much space is left on the fileserver, type code [cgrlunix@poset ~]$ df -h. Filesystem           Size  Used Avail Use% Mounted on /dev/sdb               15T  9.4T  5.3T  65% /global code
 * df**
 * -h** means that it prints sizes in human readable format (e.g., 1K 234M 2G)

proc displays how many computation units (e.g. cores) are in this computer code [cgrlunix@poset ~]$ nproc 24 code
 * nproc**

This command lists more confiuguration information about CPU architectures in this computer code [cgrlunix@poset ~]$ lscpu Architecture:         x86_64 CPU op-mode(s):       32-bit, 64-bit Byte Order:           Little Endian CPU(s):               24 On-line CPU(s) list:  0-23 Thread(s) per core:   2 Core(s) per socket:   6 Socket(s):            2 NUMA node(s):         2 Vendor ID:            GenuineIntel CPU family:           6 Model:                44 Stepping:             2 CPU MHz:              2666.806 BogoMIPS:             5333.59 Virtualization:       VT-x L1d cache:            32K L1i cache:            32K L2 cache:             256K L3 cache:             12288K NUMA node0 CPU(s):    0,2,4,6,8,10,12,14,16,18,20,22 NUMA node1 CPU(s):    1,3,5,7,9,11,13,15,17,19,21,23 code
 * lscpu**

code [cgrlunix@poset ~]$ free -h total      used       free     shared    buffers     cached Mem:          47G        23G        23G       204K       389M        21G -/+ buffers/cache:      905M        46G Swap:         47G        41M        47G code
 * free**
 * free** shows the amount of memory are in total and used. **free -h** translates the numbers into human-readable sizes (Gb, Mb, etc.)

The top command displays the top 30 processes on the system and periodically updates this information code [cgrlunix@poset ~]$ top code type "q" to quit top
 * top**

**du** summarizes disk usage of each file, recursively for directories. To summarize the total size of home directory (~) code [cgrlunix@poset ~]$ du -sh ~ 6.6M    /home/cgrlunix code
 * du**
 * -s** displays only a total size

This command compresses a file. For example, to compress **file1.fasta**, type code [cgrlunix@poset ~]$ gzip file1.fasta code Now list the content in home directory code [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf biglist.txt  data.fastq  file1.fasta.gz  file2.fasta  list1.txt  list2.txt test.txt  unixstuff code Now **file1.fasta** is zipped and is ended with **.gz**
 * gzip**

To unzip the file, use the **gunzip** command code [cgrlunix@poset ~]$ gunzip file1.fasta.gz code Now list the content in home directory code [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf biglist.txt  data.fastq  file1.fasta  file2.fasta  list1.txt  list2.txt  test.txt  unixstuff code If you want to gzip a file and keep the original file, us **-c** ( Write output on standard output; keep original files unchanged. ) code [cgrlunix@poset ~]$ gzip -c file1.fasta > testfile.gz code

For example we combine all files from unixstuff (and from its all subdirectories) and then gzip them into one file: archive.tar.gz. code [cgrlunix@poset ~]$ tar -zvcf archive.tar.gz unixstuff
 * tar**
 * tar** means tape archive: we use **tar** to combine a few files from one or more than one folder into a single file for easy storage and distribution. **tar** and **gzip** can be used together to compress directories.

code -z: Compress archive using gzip program -c: Create a new archive -v: Verbose i.e display progress while creating archive -f: use archive file name

To extract all files/directories from archive.tar.gz code [cgrlunix@poset ~]$ tar -vxf archive.tar.gz code -x: extract files

The shell keeps an ordered list of all the commands that you have entered. Each command is given a number according to the order it was entered. code [cgrlunix@poset ~]$ history …lots of commands… code You can write the history into another file using **>**.
 * history**

Create an empty file sequence.fasta code [cgrlunix@poset ~]$ nano sequence.fasta code and copy the following fasta sequences to this file >seq1 AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCAA TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT CGACGTAGATGCTAGCTGACTCGATGC >seq2 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC CATCGTCAGTTACTGCATGCTCG >seq3 GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT AGTACGTAGTAGCTGTAGCTGATCGTACGTAGCTGATCGTACG >seq4 CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG GTACGTGATCGTACGTCGACTGATCGTAGCTATCGTAGCTACC GTAGATGCTAGCTGACTCGAT
 * nano ** (a small and friendly text editor)

Type **ctrl+X**, **Y**, and then **return** to quit. Now peek inside this file by using **cat** or **less**

code [cgrlunix@poset ~]$ wget http://cgrlucb.wikispaces.com/file/view/gene_exp.diff
 * wget ** (retrieve files from web servers)

code

Read the contents of **gene_exp.diff**, type code [cgrlunix@poset ~]$ less -S gene_exp.diff code
 * cut ** (extract sections from each line of input)
 * -S ** causes lines longer than the screen width to be chopped rather than folded

Extract the third column from gene_exp.diff and display it using less code [cgrlunix@poset ~]$ cut -f 3 gene_exp.diff | less code

The **-f** argument is how we specify which columns to keep. The argument -f also allows us to specify ranges of columns (e.g. **-f 3-8**) and sets of columns (e.g. -f 3,5,8). Note that it’s not possible to reorder columns using using cut, e.g. **-f 6,5,4,3** will not work. To reorder columns, you’ll need to use **awk** which is discussed later.

Extract the third column from gene_exp.diff, exclude lines containing “-“, sort the names alphabetically and then display it using **less** code [cgrlunix@poset ~]$ cut -f 3 gene_exp.diff | grep –v "-" | sort | less code

Next let's work on the gtf file ( Homo_sapiens.GRCh38.81.gtf ) using cut. First of all let's peek into the gtf file using **less -S** code [cgrlunix@poset ~]$less -S Homo_sapiens.GRCh38.81.gtf code

The first five lines start with "**#**" are headers. code code We want to extract chromosome (the first column), start position (4th column) and end position (5th column) of each exon from the gtf, and then store the results in a new file called "gene.bed" - this is a typical bed format. code [cgrlunix@poset ~]$ tail -n +6 Homo_sapiens.GRCh38.81.gtf | grep "\bexon\b" | cut -f1,4,5 > gene.bed code
 * 1) !genome-build GRCh38.p3
 * 2) !genome-version GRCh38
 * 3) !genome-date 2013-12
 * 4) !genome-build-accession NCBI:GCA_000001405.18
 * 5) !genebuild-last-updated 2015-06
 * \b ** matches the empty string at the edge of a word. It sets a boundary to the matches.

Firstcreates a new file **uniq_test.txt** using nano and copy the following content to this file
 * uniq ** (when fed a text file, outputs the file with consecutive identical lines collapsed to one)

gene1 A gene1 A  gene6 F  gene6 F  gene6 F  gene1 A  gene2 B  gene2 B  gene4 D  gene5 E  gene6 F  gene6 F  gene6 F  gene1 A  gene1 A  gene2 B  gene2 B  gene2 B  gene3 C  gene4 D  gene4 D

First let's try **uniq** the list directly code [cgrlunix@poset ~]$ uniq uniq_test.txt gene1 A gene6 F gene1 A gene2 B gene4 D gene5 E gene6 F gene1 A gene2 B gene3 C gene4 D code

As you can see, **uniq** does not return the unique values — it only removes consecutive duplicate lines (keeping one). If instead we did want to find all unique lines in a file, we would first sort all lines using **sort** so that all identical lines are grouped next to each other, and then run **uniq**. For example: code [cgrlunix@poset ~]$ sort uniq_test.txt | uniq gene1 A gene2 B gene3 C gene4 D gene5 E gene6 F code Print only those lines which are not repeated (unique) in the input. code [cgrlunix@poset ~]$ sort uniq_test.txt | uniq -u gene3 C gene5 E code Print only those lines which are repeated in the input. code [cgrlunix@poset ~]$ sort uniq_test.txt | uniq -d gene1 A gene2 B gene4 D gene6 F code You can also get some statistics out of **uniq** with the **-c** option code [cgrlunix@poset ~]$ sort uniq_test.txt | uniq -c 5 gene1 A 5 gene2 B 1 gene3 C 3 gene4 D 1 gene5 E 6 gene6 F

code

Both **sort | uniq** and **sort | uniq -c** are frequently used shell idioms in bioinformatics and worth memorizing. Combined with other Unix tools like **grep and cut**, **sort and uniq** can be used to summarize columns of tabular data. Now we want to count the number of record for each feature present in the gtf file: code [cgrlunix@poset ~]$ tail -n +6 Homo_sapiens.GRCh38.81.gtf | cut -f3 | sort | uniq -c 2931 CDS 1083 UTR 5214 exon 250 gene 328 start_codon 296 stop_codon 893 transcript code note: the 3rd column contains the features of the gtf

Since **sort** and ** uniq ** are line-based, we can create lines from multiple columns to count combinations. In this example we would like to count how many each feature (column 3 in this example GTF) are on each strand (column 7): code [cgrlunix@poset ~]$ tail -n +6 Homo_sapiens.GRCh38.81.gtf | cut -f3,7 | sort | uniq -c 1303 CDS + 1628 CDS - 407 UTR + 676 UTR - 2158 exon + 3056 exon - 126 gene + 124 gene - 145 start_codon + 183 start_codon - 113 stop_codon + 183 stop_codon - 389 transcript + 504 transcript - code Or, if you want to see the number of features belonging to a particular gene identifier: code [cgrlunix@poset ~]$ grep "ENSG00000223972" Homo_sapiens.GRCh38.81.gtf | cut -f3 | sort | uniq -c 9 exon 1 gene 2 transcript code

{A,B,C} expands all the elements within the braces {1..9} expands a range of elements within the braces code [cgrlunix@poset ~]$ echo hello {A,B,C} hello A B C [cgrlunix@poset ~]$ echo hello{A,B,C} helloA helloB helloC code Create several folders with names starting with "**folder**" under the home directory code [cgrlunix@poset ~]$ mkdir folder_{A,B,C} [cgrlunix@poset ~]$ ls folder* folder_A: folder_B: folder_C: code Create new several empty .txt files with names beginning with "**text**" using **touch** under the home directory code [cgrlunix@poset ~]$ touch text{1..5}.txt [cgrlunix@poset ~]$ ls text* text1.txt text2.txt text3.txt text4.txt text5.txt
 * {} ** (Brace expansion is a mechanism by which arbitrary strings may be generated)

code

Search the home directory and print any files or directories beginning with **list** code [cgrlunix@poset ~]$ find ~ -name "list*" /home/cgrlunix/list1.txt /home/cgrlunix/list2.txt code Search the home directory to print all subdirectories code [cgrlunix@poset ~]$ find ~ -type d /home/cgrlunix /home/cgrlunix/.cache /home/cgrlunix/unixstuff /home/cgrlunix/unixstuff/backups /home/cgrlunix/.config /home/cgrlunix/.config/htop /home/cgrlunix/folder_A /home/cgrlunix/folder_C /home/cgrlunix/folder_B code
 * find **
 * find ** is useful for finding file or directory names that match certain criteria.

**grep** is one of the most powerful Unix data tools and very commonly used in bioinformatic data processing code [workshop@poset ~]$ grep -n "T$" sequence.fasta 3:TCGTACGTCGACTGATCGTAGCTACGTCGTACGTAGGTACGTT 10:GATCGTACGTGTAGCTGCTGTAGCTGATCGTACGGTAGCAGCT 11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT 16:GTAGATGCTAGCTGACTCGAT code The above command **grep** each line ending with a "T" from **sequence.fasta** and print out the line number of that line. $ matches the end of a line code [workshop@poset ~]$ grep -n "^C" sequence.fasta 4:CGACGTAGATGCTAGCTGACTCGATGC 6:CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG 8:CATCGTCAGTTACTGCATGCTCG 11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT 14:CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG code The above command **grep** each line starting with a "C" and print out the line number of that line. ^ matches the beginning of a line code [workshop@poset ~]$ grep -n "^C.*T$" sequence.fasta 11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT code The above command **grep** each line starting with "C", and ending with "T", with any letters in between them, and then print out the line number of that line. **.** means any single character (except newline), * means that the preceding item will be matched zero or more times.
 * More about grep **

Now let's go back to work on **file1.fasta**. It is a fasta file that contains 11 transcript sequences. We want to print the sequence of "Transcript10". code [workshop@poset ~]$ grep -A 1 ">Transcript10" file1.fasta code code [workshop@poset ~]$ grep -B 1 ">Transcript10" file1.fasta code
 * -A ** N[int] means that it will print N lines of trailing context after matching lines.
 * -B ** N[int] means that it will print N lines of leading context before matching lines.

The above two command lines are useful for displaying sequences of a fasta file. code [workshop@poset ~]$ grep -A 1 ">Transcript1" file1.fasta code What happened? It matches Transcript1, Transcript10 and Transcript11 since all there contain “Transcript1”, the pattern we were searching. We just want the sequence after Transcript1 to be printed. We need to add more conditions for the search. Several ways to do this: code [workshop@poset ~]$ grep -A 1 ">Transcript1$" file1.fasta [workshop@poset ~]$ grep -A 1 ">Transcript1\b" file1.fasta code Now let's work on the gtf file. Display content without header lines: code [cgrlunix@poset ~]$ grep -v "^#" Homo_sapiens.GRCh38.81.gtf | less -S code Next let's use **grep** to extract record of a transcript called "PLCH2-001" code [cgrlunix@poset ~]$ grep "\btranscript\b" Homo_sapiens.GRCh38.81.gtf | grep "PLCH2-001" |  less -S code Next let's use **grep** to extract record of transcripts "PLCH2-001" and "PLCH2-002" code [cgrlunix@poset ~]$ grep "\btranscript\b" Homo_sapiens.GRCh38.81.gtf | grep -E "PLCH2-001|PLCH2-002" |  less -S code **-E** means **Extended Regular Expressions (ERE)**. EREs allow us to alternation (regular expression jargon for matching one of several possible patterns) to match either “PLCH2-001” or “PLCH2-002”. The syntax uses a pipe symbol, **|**: In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). The above command can be expressed using the basic grep regular expression: code [cgrlunix@poset ~]$ grep "\btranscript\b" Homo_sapiens.GRCh38.81.gtf | grep "PLCH2-001\|PLCH2-002" |  less -S code

By default, **grep** outputs the entire matching line if the line contains the pattern we try to search. To only output the matched part of the pattern, we use **-o** for grep**.** Now we want to extract all values from the “gene_id” field from the last column of the gtf file. This can be done using grep **-o **: code [cgrlunix@poset ~]$ grep -E -o "ENSG\w+" Homo_sapiens.GRCh38.81.gtf | less -S code **\w** means a word character (a-z, A-Z, 0-9...). **+** means the preceding item is matched one or more times. **\w+** means one or more word characters

Here, we’re using extended regular expressions (ERE) to capture all gene names in the field. However, as you can see there’s a great deal of redundancy: our GTF file has multiple features (transcripts, exons, start codons, etc.) that all have the same gene name. We could quickly convert this messy output from grep to a list of unique, sorted gene names: code [cgrlunix@poset ~]$ grep -E -o "ENSG\w+" Homo_sapiens.GRCh38.81.gtf | sort | uniq > unique_gene.txt code

A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). It reads data from a file or standard input and can edit a line at a time.
 * sed ( ** a stream editor**)**

Here we use **sed** to find and replace complex string in a file. For example we want to find "Transcript10" from file **file1.fasta** and then replace "Transcript10" with "gene10" code [cgrlunix@poset ~]$ sed 's/Transcript10/gene10/g' file1.fasta .... >gene10 GTCCTGGCCTGCTCTTGCTGCTCCTCAGTGGCCTCCAAGCTGTTATTCAAGAAAAAG .... code In the above example, **s/A/B/g** means that it will replace A with B in global search.

Next, we want to find the sequence of Transcript1 and then transform all "A"s to "a"s code [cgrlunix@poset ~]$ grep -A 1 ">Transcript1$" file1.fasta | sed 's/A/a/g' >Transcript1 CaGGCTTaGCGGTTTaTTGaTGTCTCCCaaGCCaGaGGaaGTCaCCaaaGGGaaGaGTCaCTTGCaGaaaaTGGTTTCCaTTTCaGaCaTTaCaTC CCTGTaCaGCaaCCTGCTCCTGTaCaCGTGTGGGTCCTCTGCCGaaGCCaCCaGGGCTGTCaTGaGaCaCCTTGCCGCTGTGTaTCaaCaTGGCaG CCTGCTaGGGCTTTCTGTTGCCaaGaGGCCTCTCTGGaGaCaGGCaTCTaTGCaaaGTGGGaaGGaCaCCaCTGaGCaaGaaaTTCTGaaaGCTaT CaaCaTCaaTTCCTTTGCaGaGTGTGGCaTCaaTTTaTTCCaTGaGaGTGTaTCTaaaTCaGCCCTGaGCCaaGaaTTCGaaGCTTTCTTTCGT

code

Substitutions make up the majority of sed’s usage cases, but this is just scratching the surface of sed’s capabilities. It’s also possible to select and print certain ranges of lines with sed. In this case, we’re not doing pattern matching, so we don’t need slashes. To print the lines 5 through 8 of the fastq file, we use: code [cgrlunix@poset ~]$ sed -n '5,8p' data.fastq @HS2:308:C14K4ACXX:6:1101:1540:2065 1:Y:0:CGGTGTA NTATTGATCTGAAGGAGGTCTGGTACAAGACAGTGGACATAATCAATCTGGGTGTGGTACTTAGGAGGGGGTTCCAACTGTAGGATATCATCATCCATTT + code
 * 1:=?<4?BBB,,2?A;?7?<?A?+++2<2:77@B**1?*:*?B<>**0>A>/70).8=>7)9:(,(59>?5(:;?????:::(8;?;;?=

remove the 2nd line from a file code [cgrlunix@poset ~]$ sed '2d' Homo_sapiens.GRCh38.81.gtf | less -S code
 * sed** can also be used to remove lines from a file

remove lines 1-5 from a file code [cgrlunix@poset ~]$ sed '1,5d' Homo_sapiens.GRCh38.81.gtf | less -S code To figure out which lines to remove you can use grep -n

remove the last line code [cgrlunix@poset ~]$ sed '$d' Homo_sapiens.GRCh38.81.gtf | less -S code remove the first and last lines code [cgrlunix@poset ~]$ sed '1d;$d' Homo_sapiens.GRCh38.81.gtf | less -S code

**tr** ( translate or transliterate) **tr** reads from stdin or file (one line at a time) and r eplaces or removes or compresses specific characters

1. replacement: For example, we can use **tr** to transform lowercase to uppercase code [cgrlunix@poset ~]$ echo "hello world" | tr 'h' 'H' code use **tr** to transform all lowercase to uppercase code [cgrlunix@poset ~]$ echo "hello world" | tr 'a-z' 'A-Z' code use **tr** for searching and replacing specific characters code [cgrlunix@poset ~]$ echo "hello world" | tr 'eo' '12' code 2. delete: we can use **tr** **-d** to search for characters and delete them code [cgrlunix@poset ~]$ echo "hello world" | tr -d 'o' code 3. compress: code [cgrlunix@poset ~]$ echo "hello world" | tr -s 'l' code code [cgrlunix@poset ~]$ echo "hello        world" | tr -s ' ' code code [cgrlunix@poset ~]$ echo "hello        world" | tr -s ' ' ',' code
 * tr -s** can be used to compress consecutive identical characters to one

Awk is a specialized language that allows you to do a variety of text processing tasks with ease. It mainly help us extract data from and manipulate tabular plaintext files. Throughout the workshop, we’ve seen how we can use simple Unix tools like grep, cut, and sort to inspect and manipulate plaintext tabular data in the shell. For many trivial bioinformatics tasks, these tools allow us to get the job done quickly and easily (and often very efficiently). Still, some tasks are slightly more complex and require a more expressive and powerful tool. This is where the language and tool Awk excels. We’ll introduce the basics of Awk in this section — enough to get you started with using awk in bioinformatics. While Awk is a fully-fledged programming language, it’s a lot less expressive and powerful than Python and Perl. If you need to implement something complex, it’s likely better (and easier) to do so in Python/Perl. The key to using Awk effectively is to use it for the subset of tasks it’s best at: **quick data processing tasks on tabular data**.
 * awk: **

To learn Awk we’ll cover two key parts of the Awk language: how Awk processes records, and pattern-action pairs. After understanding these two key parts the rest of the language is quite simple.

First, Awk processes input data a record at a time. Each record is composed of fields, separate chunks that Awk automatically separates. Since Awk was designed to work with tabular data, each record is a line, and each field is a column’s entry for that record. The clever part about Awk is that it automatically assigns the entire record to the variable $0, and field one’s value is assigned to $1, field two’s value is assigned to $2, field three’s value is assigned to $3, and so forth. Second, we build Awk programs using one or more of the following structures: pattern { action }. Each pattern is an expression or regular expression pattern. Patterns are a lot like if statements in other languages: if the pattern’s expression evaluates to true or the regular expression matches, the statements inside action are run. In Awk lingo, these are pattern-action pairs and we can chain multiple pattern-action pairs together (separated by semicolons).

If we omit the pattern, Awk will run the action on all records. If we omit the action but specify a pattern, Awk will print all records that match the pattern. This simple structure makes Awk an excellent choice for quick text processing tasks. This is a lot to take in, but these two basic concepts --records and fields, and pattern-action pairs-- are the foundation of writing text processing programs with Awk. Let’s see some examples.

<span style="font-family: &#39;Times New Roman&#39;; font-size: 10.5pt;">First, we can simply mimic <span style="font-family: &#39;Times New Roman&#39;; font-size: 10pt;">cat <span style="font-family: &#39;Times New Roman&#39;; font-size: 10.5pt;">by omitting a pattern and printing an entire record with the variable <span style="font-family: &#39;Times New Roman&#39;; font-size: 10pt;">$0 <span style="font-family: &#39;Times New Roman&#39;; font-size: 10.5pt;">: code [cgrlunix@poset ~]$ awk '{print $0}' gene.bed code
 * - Action only**

Awk can also mimic cut : code [cgrlunix@poset ~]$ awk '{print $2"\t"$3}' gene.bed code Here, we’re making use of Awk’s string concatenation. Two strings are concatenated if they are placed next to each other with no argument. So for each record, $2"\t"$3 concatenates the second field, a tab character, and the third field.

Let’s now look at how we can incorporate simple pattern matching. Suppose we wanted to write a filter that only output lines where the length of the feature (end position - start position) is greater than 1,000. Awk supports arithmetic with the standard operators +, - , * , / , % (remainder), and ^ (exponentiation). We can subtract within a pattern to calculate the length of a feature, and filter on that expression: code [cgrlunix@poset ~]$ awk '$3 - $2 + 1 > 1000' gene.bed code
 * - Pattern only**

We can also chain patterns, by using logical operators && (and), || (or), and ! (not). For example, if we want to extract all records derived from chromosome 1 if the length is greater than 1000: code [cgrlunix@poset ~]$ awk '$1 ~ /^1$/ && $3 - $2 + 1 > 1000' gene.bed code The first pattern, $1 ~ /^1$/, is how we specify a regular expression. Regular expressions are in slashes. Here, we’re matching the first field, $ 1, against the regular expression 1. The tilde, ~ means match; to not much the regular expression we would use !~ (or !($1 ~ /^1$/) ).

We can combine patterns and more complex actions than just printing the entire record. For example, if we want to add a column with the length of this feature (end position - start position) for only mitochondrial, we could use: code [cgrlunix@poset ~]$ awk '$1 ~ /MT/ { print $0 "\t" $3 - $2 + 1}' gene.bed code
 * - Pattern and action **

So far, these exercises illustrate how Awk is useful in (1) filtering data using rules that can combine regular expressions and arithmetic, and (2) reformatting the columns of data using arithmetic. These two applications alone make Awk an extremely handy tool in bioinformatics, and a huge time saver. Now let’s look at some slightly more advanced use cases. We’ll start by introducing two special patterns: **BEGIN** and ** END **.

The ** BEGIN **pattern, specifies what to do before the first record is read in, and ** END **specifies what to do after the last record’s processing is complete. ** BEGIN **is useful to initialize and set up variables, and ** END **is useful to print data summaries at the end of file processing. For example, we want to calculate the mean length of all the record in gene.bed. We would have to take the sum of all lengths, and then divide the sum by the total number of records. We can do this with:

code [cgrlunix@poset ~]$ awk 'BEGIN { s = 0 }; { s += ($3 - $2 + 1) }; END { print "mean: " s/NR }' gene.bed mean: 224.604 code
 * NR **is the current record number, so on the last record ** NR **is set to the total number of records processed. In the example above, we’ve initialized a variable ** s **to 0 in BEGIN (variables you define do not need a dollar sign). Then, for each record we increment **s** by the length of the feature. At the end of the records, we print this sum ** s **divided by the number of records ** NR **, giving the mean.

While Awk is designed to work with whitespace-separated tabular data, it’s easy to set a different field separator: simply specify which separator to use with the -F argument. For example, we could work with a CSV file in Awk by starting with awk -F",".

We can use NR to extract ranges of lines too; for example, if we want to extract all lines between 3 and 5 (inclusive): code [cgrlunix@poset ~]$ awk 'NR >= 3 && NR <= 5' gene.bed 1 11869 12227 1 12613 12721 1 13221 14409 code

Awk makes it easy to convert from GTF to BED : code [cgrlunix@poset ~]$ awk '$1 ~! /^#/ {print $1 "\t" $4 "\t" $5 }' Homo_sapiens.GRCh38.81.gtf | sort | uniq | head -n 5 1 11868 12227 1 11868 14409 1 12009 12057 1 12009 13670 1 12178 12227 code