PythonSpring2013

=Introduction to Python for Genomic Data Analysis=

Basics
Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze genomic data sets, especially Next Generation Sequencing data sets. Our goal is to give people the impetus to start writing small programs and then explore further on their own. Since the expected audience has little to no programming experience, the material covered in this course will be at a beginner level. However, since programming requires some familiarity with the command line, we expect attendees to have an understanding of Unix or Unix-like environments (e.g. MacOS). You are also expected to know one of the text editors used to read and write text files in the Unix environment. Some of the common ones are emacs, vi, and nano.

Why Python
The Python programming language is steadily increasing in popularity. Several features of the language are helpful in tackling problems in genomic data analysis, and many of the newer programs available in this area are written in Python. Moreover, it is easier than some other languages for a novice programmer to get started with.

Python is an evolving language and several versions are available now. We will be using version 2.7.1 in this class. Many of the programs available for genomic data analysis are written for versions 2.6 or 2.7, and the differences between these two are not very significant. These versions are still the most popular ones. The material covered in this class will, for the most part, work well in both. The newest versions, Python 3.x, have major differences from 2.x, and some of the code covered in this class may not work with them.

Programming in general
INPUTS --> PROGRAMS ---> OUTPUTS

Programs are sets of instructions that tell the computer to carry out certain actions. Each program takes 0 or more inputs, carries out the instructions given, and produces 0 or more outputs. In general, the same result can be reached through different sets of actions, so there are many ways of writing a computer program to achieve the same outcome. However, some programs are more efficient and elegant than others, and some are easier to read and maintain over time. We will not place much emphasis on these points since this is a beginner level class, but it is good to learn those differences and incorporate them into your programs and programming style as you become more proficient.

Working environment
You are welcome to use the programming environment on your own laptop if you prefer. Having Python 2.6 or 2.7 already installed will be best. However, we will not be able to help much if there are any major difficulties using it. You also have the option to use Python on the CGRL server machine. The details about logging into the machine will be given in class.

Let's get our feet wet
Once you have logged into your account and set up the environment (PATH variable for Python), check out the version of Python you have.

$ **python -V**
//Python 2.7.1//

The traditional way to start programming is to print out the words "Hello, World!". To do this in Python you will have to start the Python interpreter. We do this by simply typing in the "python" command at the prompt.

$ **python**
//Python 2.7.1 (r271:86832, May 12 2011, 10:02:04)//
//[GCC 4.6.0] on linux2//
//Type "help", "copyright", "credits" or "license" for more information.//
//>>>//

The Python interpreter has been started and is now waiting for your input. This is one of the ways to use Python. Python is an interpreted language, which means that it takes instructions from the programmer, one at a time, and executes them. Let us get Python to print out something.

//>>>// **print "Hello, World!"**
//Hello, World!//
//>>>//

The interpreter executed your instruction and prompts you for the next command. Try an arithmetic expression:

//>>>// **3 + 2**
//5//
//>>>//

So the Python interpreter can also act as a simple calculator (and you can make it do much more complex calculations too). Let's see how we can stop the interpreter and come back to the Unix command prompt.

//>>>// **quit()**
//$//

That is it! Not particularly useful, but you have used Python to run a small script or program. Although some people may be finicky about the words "script" and "program", here we will use them interchangeably. Either one is a set of instructions to a computer written in some programming language, which in our case is Python.

Running Python programs from command line
We have seen the interactive mode of the Python interpreter. The advantage of using the Python interpreter in this way is that you quickly see the output of each action. When programs are being developed, this is a good way to interact with Python. However, once you leave the interpreter there is no saved history of the commands. To keep our instructions, we have to save them in a file and call the Python interpreter to execute that program from the command line.

To do that, let's first create a simple program which will do the same task as before: print out the string "Hello, World!". Use your favorite editor to create a file containing the program text shown below.

$ **emacs helloWorld.py**
code format="python"
#!/usr/bin/env python
print "Hello, World!"
code

Now you can execute the program by calling Python on this program file:

$ **python helloWorld.py**
//Hello, World!//

The first line of the program, starting with "#!", sets up the environment used to execute the file, in this case by calling "python", which is the interpreter. We do not have to specify the Python interpreter on the command line if the file is executable.

$ **chmod +x helloWorld.py**
$ **./helloWorld.py**

To try a similar program with varying input and output:

$ **cat helloDude.py**
code format="python"
#!/usr/bin/env python
dude = raw_input("Enter dude's name: ")
print "Hello, %s!" % dude
code

Expressions, Variables, Statements
Here we will look at mathematical expressions, variables (named objects), and statements in Python. The best way to understand these is to get back into the interactive Python interpreter. We are going to look at types, precedence of operations, and assignment in this section. Variable names can be anything subject to a few conditions: they must begin with a letter or an underscore, can contain letters, numbers, or underscores, and cannot be one of the Python keywords, which are:

code
and       del       from      not       while
as        elif      global    or        with
assert    else      if        pass      yield
break     except    import    print
class     exec      in        raise
continue  finally   is        return
def       for       lambda    try
code

$ **python**

code format="python"
Python 2.7.1 (r271:86832, May 12 2011, 10:02:04)
[GCC 4.6.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 2 + 5 - 3
4
>>> result = 2 + 5 * 4
>>> result
22
>>> print result
22
>>> type (result)
<type 'int'>
>>> greeting = "Welcome"
>>> greeting
'Welcome'
>>> print greeting
Welcome
>>> type (greeting)
<type 'str'>
>>> ltor = 50 / 5 * 2
>>> print ltor
20
>>> 50 / 5 ** 2
2
>>> 45 / (5 + 10)
3
>>> n
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'n' is not defined
>>>
code

When in doubt, use parentheses to explicitly specify the order of evaluation.
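As a small illustration of this advice, the sketch below compares a few expressions with and without parentheses. The variable names are made up for this example, and print is called with a single parenthesized argument so that the lines behave the same way in Python 2.7 and in Python 3.

```python
# Exponentiation binds tighter than multiplication and division,
# which in turn bind tighter than addition and subtraction.
no_parens = 2 + 5 * 4           # evaluated as 2 + (5 * 4)
with_parens = (2 + 5) * 4       # parentheses force the addition first

power_first = 2 * 3 ** 2        # evaluated as 2 * (3 ** 2)
product_first = (2 * 3) ** 2    # parentheses force the multiplication first

print("2 + 5 * 4    = %d" % no_parens)
print("(2 + 5) * 4  = %d" % with_parens)
print("2 * 3 ** 2   = %d" % power_first)
print("(2 * 3) ** 2 = %d" % product_first)
```

The parentheses cost nothing at run time, so adding them wherever the order is not obvious also makes the program easier for others to read.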

Numeric values
There are two main types of numeric values: integers and floating point values. When all the numbers in an expression are integers, the operations use integer arithmetic, so a division such as 7 / 3 discards the fractional part.

code format="python"
>>> x = y = z = 0
>>> x
0
>>> z
0
>>> width = 7
>>> length = 5 * 4
>>> width * length
140
>>> area = width * length
>>> 7 / 3
2
>>> 12 / 5.0
2.4
>>>
code
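The difference between integer and floating point division is easy to trip over, so here is a minimal sketch of how to ask for each one explicitly. The variable names are hypothetical. The "//" operator and the float() conversion are used because they give the same answers in Python 2.7 and Python 3, whereas a plain "/" between two integers behaves differently in the two versions.

```python
reads_total = 7
samples = 3

# Floor division: the fractional part is discarded.
# In Python 2.7 a plain "/" between two integers behaves this way too.
whole = reads_total // samples

# The remainder left over after the floor division.
remainder = reads_total % samples

# Converting one operand to a float gives floating point division.
exact = float(reads_total) / samples

print("whole part: %d" % whole)
print("remainder:  %d" % remainder)
print("exact:      %.2f" % exact)
```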

Strings
Strings are an important data type in Python. They are usually written enclosed in single or double quotes. If a quote character appears inside the string, the surrounding quotes should be of the other type, or the internal quote must be "escaped" with a backslash. We will look at some examples below.

code format="python"
>>> 'spam eggs'
'spam eggs'
>>> "doesn't"
"doesn't"
>>> 'doesn\'t'
"doesn't"
>>> '"Yes,", he said.'
'"Yes,", he said.'
>>> hello = '''This is a really long line
...    continuing on to a second line'''
>>> hello
'This is a really long line\n   continuing on to a second line'
>>> print hello
This is a really long line
   continuing on to a second line
>>> x = "AC"
>>> y = "TG"
>>> x + y
'ACTG'
>>> z = x + y
>>> z
'ACTG'
>>> z * 3
'ACTGACTGACTG'
>>> rep_z = z * 3
>>> len(rep_z)
12
>>> rep_z[0]
'A'
>>> rep_z[2]
'T'
>>> rep_z[11]
'G'
>>> rep_z[0:4]
'ACTG'
>>> rep_z[:4]
'ACTG'
>>> z[2:]
'TG'
>>> rep_z[2:]
'TGACTGACTG'
>>> new_z = z[1:-1] + 'q'
>>> new_z
'CTq'
code
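To connect slicing with sequence data, the sketch below cuts a short made-up DNA string into pieces and glues them back together. The sequence and the variable names are invented for illustration only.

```python
# A short, made-up coding sequence of nine bases.
seq = "ATGGCCTAA"

first_codon = seq[0:3]    # characters at positions 0, 1 and 2
last_codon = seq[-3:]     # negative indexes count from the end
middle = seq[3:-3]        # everything between the two

# Concatenating the slices gives back the original string.
rebuilt = first_codon + middle + last_codon

print("%s | %s | %s" % (first_codon, middle, last_codon))
print("length: %d" % len(seq))
```

Note that a slice such as seq[0:3] stops just before position 3, which is why adjacent slices can share a boundary without overlapping.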

Lists
Lists are similar to strings in that they are ordered collections. However, strings contain characters as their items, whereas lists can contain characters, strings, numbers, or even other lists. Items in a list can also be accessed through their indexes, just like characters in a string. Unlike strings, however, the contents of a list can be changed.

code format="python"
>>> cheeses = ['cheddar', 'gouda', 'cottage']
>>> numbers = [17, 15, 23]
>>> mixlist = ['contains', 34]
>>> empty = []
>>> print cheeses, numbers, empty, mixlist
['cheddar', 'gouda', 'cottage'] [17, 15, 23] [] ['contains', 34]
>>> a = ['spam', 'eggs', 100, 1234]
>>> a
['spam', 'eggs', 100, 1234]
>>> a[2] = a[2] + 23
>>> a
['spam', 'eggs', 123, 1234]
>>> a[0:2] = [1, 12]
>>> a
[1, 12, 123, 1234]
>>> a[0:2] = []
>>> a
[123, 1234]
>>> a[1:1] = ['foo', 'bar']
>>> a
[123, 'foo', 'bar', 1234]
>>> #Insert copy of a at beginning
... a[:0] = a
>>> a
[123, 'foo', 'bar', 1234, 123, 'foo', 'bar', 1234]
code
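The contrast between changeable lists and unchangeable strings can be sketched as follows. The names are hypothetical, and the try/except construct is used here only so that the example keeps running after the failed assignment; it is not otherwise covered in this material.

```python
seq = "ACTG"
bases = ["A", "C", "T", "G"]

# Assigning to a position in a list works.
bases[0] = "G"

# The same assignment on a string raises a TypeError.
try:
    seq[0] = "G"
    string_changed = True
except TypeError:
    string_changed = False

# To "change" a string we build a new one from slices instead.
new_seq = "G" + seq[1:]

print(bases)
print(seq)
print(new_seq)
```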

Control Flow
In addition to straight-line execution of instructions, we need to be able to specify actions based on conditions. Control-flow statements help achieve this objective. The two main control-flow statements we will look at are the "if-elif-else" statement and the "for" statement.

code format="python"
>>> x = int(raw_input("Please enter an integer: "))
Please enter an integer: 42
>>> if x < 0:
...     x = 0
...     print 'Negative changed to zero'
... elif x == 0:
...     print 'Zero'
... elif x == 1:
...     print 'Single'
... else:
...     print 'More'
...
More
>>> words = ['cat', 'jump', 'window']
>>> for w in words:
...     print w, len(w)
...
cat 3
jump 4
window 6
code
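The two statements are often combined, with an "if-elif-else" placed inside a "for" loop. The following is a minimal sketch that walks over a made-up sequence one character at a time and sorts each base into one of three tallies. The sequence and the counter names are assumptions for this example.

```python
# Made-up sequence; "N" stands for a base that could not be called.
seq = "ACGTTGCAAN"

gc = 0
at = 0
other = 0

# A for loop over a string visits its characters one by one.
for base in seq:
    if base == "G" or base == "C":
        gc = gc + 1
    elif base == "A" or base == "T":
        at = at + 1
    else:
        other = other + 1

print("GC: %d, AT: %d, other: %d" % (gc, at, other))
```

The indentation matters: the lines indented under "for" are repeated for every character, and the lines indented under each condition run only when that condition is true.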

Exercise
In DNA strings, the symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'. The reverse complement of a DNA string s is the string r formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC"). Write a Python script to read in a DNA string of at most 1000 bp and return the reverse complement of the string.

Functions
Functions help encapsulate code which can be reused over and over again. We have already used a few like "raw_input" and "len". These are predefined functions available in the language. We can also define our own.

code format="python"
>>> def print_twice (stuff):
...     print stuff
...     print stuff
...
>>> print_twice ("try this")
try this
try this
>>> print_twice ("sleep is divine")
sleep is divine
sleep is divine
code
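Functions can also hand a value back to the caller with the "return" statement, instead of only printing. Here is a sketch of a hypothetical helper, is_dna, that checks whether a string contains only the four DNA letters. The function name and the decision to accept lower-case letters are assumptions made for this example.

```python
def is_dna(seq):
    """Return True if every character of seq is one of A, C, G or T."""
    for base in seq:
        if base not in "ACGTacgt":
            # Leave the function as soon as one bad character is found.
            return False
    return True

# The returned value can be stored in a variable or used in a condition.
good = is_dna("GATTACA")
bad = is_dna("GAUUACA")

if good:
    print("GATTACA looks like DNA")
if not bad:
    print("GAUUACA does not look like DNA")
```

Returning a value rather than printing it makes the function reusable, because the caller decides what to do with the answer.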

File Input and Output
Files are necessary to persist data after programs are done executing. We can read data from files and write data to files.

See the example below to read the first few lines of a text file and print them out.

code format="python"
>>> fin = open ('/global/courses/spr2013/pythonWorkshop/frost.txt', 'r')
>>> x = 0
>>> while x < 4:
...     line = fin.readline().strip()
...     x = x + 1
...     print line
...
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
>>> fin.close()
code
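Writing works much the same way as reading. The sketch below writes three lines to a scratch file and then reads them back with a "for" loop, which visits the file one line at a time. The file name poem.txt, the temporary directory, and the text written are all assumptions for this example, so it does not depend on the workshop files being present.

```python
import os
import tempfile

# Hypothetical scratch location, created only for this example.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "poem.txt")

# Open for writing ('w'); each write needs its own newline character.
fout = open(path, "w")
fout.write("first line\n")
fout.write("second line\n")
fout.write("third line\n")
fout.close()

# Open for reading ('r') and collect the lines without their newlines.
lines = []
fin = open(path, "r")
for line in fin:
    lines.append(line.strip())
fin.close()

for text in lines:
    print(text)

# Clean up the scratch file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

Closing the output file before reading it back matters, since data may not be fully written to disk until close() is called.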

Exercises
1. Write a Python script to read a FASTA file and print out the sequence name and nucleotide counts for each entry in the file. Use the file /global/courses/spr2013/pythonWorkshop/mrna10.fa as a test case. This file is copied below.

code
>AX207369 1
cagacttccatcg
>AX207370 1
cgatggaagtttg
>AX207371 1
gcgctatcctcatcgcgac
>AX207372 1
gtcgtgctggggatagagc
>AX207373 1
ttatgattcttcctc
>AX207374 1
gcggaaggatcatta
>AX207375 1
cttcccttcaatttcttaaagcttc
>AX207376 1
gaaacttaaaggaattgacggaagg
>AX207382 1
ttctcgattccgtg
>AX319390 1
ccttgtactgtccgaagcgcagtcaggt
code

2. Write a Python program to read the file /global/courses/spr2013/pythonWorkshop/mrna.fa.gz (beware: this file is not small and is in compressed format) and write the sequence names and the GC percentage of each sequence to another file.