Introduction+to+Python+Fall+2014


 * Introduction to Python for Genomic Data Analysis **
 * Oct 2nd, 2014 **
 * Ravi Alla **

toc

Basics
Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze genomic data sets and especially Next Generation Sequencing data sets. Our goal is to provide an impetus to people to start writing small programs and then be able to explore further on their own. Since the expected audience is people who have little to no programming experience this material covered in this course will be at a beginner level. However, since programming requires some understanding of the command line environment we expect that the attendees have an understanding of the Unix or Unix-like environments (e.g. MacOS). To make it easier to write python scripts we will be using a Python Integrated Development Environment (IDE) called Pycharm

Why Python
Python programming language is new and is increasing in popularity. There are several features in the language that are helpful in tackling problems in the genomic data analysis field. Many of the newer programs that are available in this area are written in Python as well. Moreover it is easier than some other languages for a novice programmer to start on.

Python is an evolving language and there are several versions available now. We will be using version 2.7.* in this class. Many of the programs available for genomic data analysis are written in versions 2.6 or 2.7, but the difference between these are not very significant. These versions are still the most popular ones. The material covered in this class will work well for the most part in both. The newest versions Python 3.x have major differences from 2.x and some of the code covered in this class may not work with it.

Programming in general
INPUTS --> PROGRAMS ---> OUTPUTS

Programs are a set of instructions to the computer to do certain actions. Each program will take 0 or more inputs and then carry out the instructions given and produce 0 or more outputs. In general, there are multiple ways of achieving the same results by doing the different sets of actions. Therefore, there are many of writing computer programs to achieve the same outcome. However, some programs are more efficient and elegant than others. Some may be more easy to read and maintain over time. We will not place much emphasis on these points since it is a beginner level class, but it is good to learn those differences and incorporate these into your programs and programming style as you become more proficient.

Setting up Your Workspace
First, download and install Python 2.7 from https://www.python.org/downloads/ Then open the python interpreter and write your first program!

code format="python" $ python >>> print "Hello, World!" 'Hello, World!' code

When you type instructions, the interpreter will execute them and print the result. You can navigate your previous commands using the arrow keys, which makes it easier to try the same line over with a small change. The interpreter is a good tool to quickly run small code snippets or if you want to check something quick instead of re-running your whole program. You can also write the same instructions in a file and run it. This is the basis for writing scripts or for writing programs. Once you start writing longer programs you'll want to do it that way to avoid retyping things, and also to share your code with others. To do this we can use the Pycharm python IDE.





Here you can start making new python files and write scripts. The IDE keeps track of formatting, alerts you of potential errors and has code completion (kind of like tab completion with the UNIX command line). Make a new python file and name it MyFirstScript (if you picked a Python File it will add .py to it automatically).



You can also use the terminal and the python interpreter in pycharm. Go to Tools -> Run Python Console (for python interpreter) or Tools -> Open Terminal. Doing this will open both within Pycharm window, so you won't have to go back and forth between the terminal, python interpreter and the text editor (where you're writing your code).

Comments
Python ignores anything from a # to the end of the line; it's just there as a note for other humans.

code format="python" """ This line above is this shebang line, put it at the top of your scripts. When you run your .py script it will use the local python version on any linux terminal. This is a block of commented code. You can use either triple double quotes or triple single quotes """ code
 * 1) this is a comment
 * 2) !/usr/bin/env python

Variables and Operators
So what can you actually do in a program? Usually it involves reading inputs, creating and changing a bunch of internal variables, and finally communicating a couple of them by printing, sending to a file, etc. Variables names can be anything subject to a few conditions: they must begin with a letter, can contain letters or numbers or underscores, cannot be one of Python keywords such as code and         del        from        not       while as          elif       global      or        with assert      else       if          pass      yield break       except     import      print class       exec       in          raise continue    finally    is          return def         for        lambda      try code We've seen printing already, so let's try doing some math by creating and changing variables.

code x = 3 y = 10 print x + y # 13 z = x * y print z # 30 print y/x # 3 code In a real program, nothing is printed unless you say so. They interpreter just prints values you didn't store anywhere so they aren't lost. This also introduces the `+` operator, which works how you'd expect. There are `-`, `*`, and `/` operators too. `=` doesn't really mean 'equal'... it's better to think of it as 'assigned to/holds/refers to'.

Let's try division.

What happened? Division is trickier than the other operators, because of how numbers are represented in a computer. Which brings us to types...

Types
Everything you store in a computer is represented as a series of 0s and 1s, and depending on what you want to do different representations are best. Python doesn't bother you with types for the most part, but if you don't understand them you'll get tripped up. For example, `+` means something different with strings:

code format="python" print 'string1' + 'string2' # string1string2 code

Ints vs Floats
In Python there are two common types for representing numbers: `int` and `float`. `int`s are simple, exact, and efficient. But they can only represent integers, so division is kind of screwy. There are two division operators: `/` gives the rounded-down answer, and `%` gives the remainder. It's not as bad as it sounds though, because lots of situations come up where you want one of those, rather than traditional division.

code format="python" print 5/2 # 2 print 5%2 # 1 code

`float`s are written by adding a decimal. They handle division the "right" way:

code format="python" print 5.0/2.0 # 2.5 code

The main downside is they're approximate (limited to a certain number of significant digits) and therefore subject to rounding errors.

code format="python" print 100.0 / 23.3 # 4.291845493562231 code

You should never directly test if two `float`s are equal, because they might differ by a tiny amount.

Casting
There are functions for converting one type to another:

code format="python" print int(50.5) # 50 print float(50) # 50.0 print str(50.0) # '50.0' code

We'll get to functions in a bit; for now just think of them in the math sense: they take an input (in parentheses) and produce some output. You can use them yourself, and Python also uses them automatically when you write an expression involving more than one type:

code format="python" print 50.0 / 2 # 25.0 code

You can mix numbers and strings in the same expression if it makes sense.

code format="python" print 'String 1' * 3 # 'String 1String 1String 1' code

Strings
Strings are used so much it's worth going into some detail about their peculiarities. Strings are immutable. Once you declare a string, you cannot change the elements of a string. They can use either single or double quotes, which helps when you need to include one in the content. If you use triple quotes, you can span multiple lines.

code format="python" print '''this is a long string and it has line breaks in it''' print 'this is a long string\nand it has line breaks\nin it' print len('What is the length of this string') # 33 code

Notice the line breaks show up as '\n'; that's how they're actually stored on disk. Escaping characters with special meanings is a general problem in programming. For example, percent encoding in urls. In files, the main characters to know are '\t' (tab) and '\n' (or '\r\n' on windows).

len is a built in function that gives the length of the string. You can put placeholders in strings, and fill them in with data later. This is useful for printing.

code format="python" my_str = 'The placeholder was made into %s' % 7 print my_str #'The placeholder was made into 7' my_str = 'Two placeholders, %s and %s' my_str = my_str % (2, 'three') print my_str # 'Two placeholders, 2 and three' code

The only tricky thing is you need parentheses if there's more than one placeholder. You can use %f and %d for float and integer placeholders. Oh, and the `%` here has a totally different meaning from in math.

** String Methods **
Here are some of the most common string methods. A method is like a function, but it runs "on" an object. If the variable s is a string, then the code s.lower runs the lower method on that string object and returns the result (this idea of a method running on an object is one of the basic ideas that make up Object Oriented Programming, OOP). Here are some of the common string methods:
 * s.lower, s.upper -- returns the lowercase or uppercase version of the string
 * s.strip -- returns a string with whitespace removed from the start and end
 * s.isalpha/s.isdigit/s.isspace... -- tests if all the string chars are in the various character classes
 * s.startswith('other'), s.endswith('other') -- tests if the string starts or ends with the given other string
 * s.find('other') -- searches for the given other string (not a regular expression) within s, and returns the first index where it begins or -1 if not found
 * s.replace('old', 'new') -- returns a string where all occurrences of 'old' have been replaced by 'new'
 * s.split('delim') -- returns a list of substrings separated by the given delimiter. The delimiter is not a regular expression, it's just text. 'aaa,bbb,ccc'.split(',') -> ['aaa', 'bbb', 'ccc']. As a convenient special case s.split (with no arguments) splits on all whitespace chars.
 * s.join(list) -- opposite of split, joins the elements in the given list together using the string as the delimiter. e.g. '---'.join(['aaa', 'bbb', 'ccc']) -> aaa---bbb---ccc

** String Slicing **
The "slice" syntax is a handy way to refer to sub-parts of sequences -- typically strings and lists. The slice s[start:end] is the elements beginning at start and extending up to but not including end. the Suppose we have s = "Hello"  code s = 'Hello' s[1:4] # is 'ell' -- chars starting at index 1 and extending up to but not including index 4 s[1:] # is 'ello' -- omitting either index defaults to the start or end of the string s[:] # is 'Hello' -- omitting both always gives us a copy of the whole thing (this is the pythonic way to copy a sequence like a string or list) s[1:100] # is 'ello' -- an index that is too big is truncated down to the string length

code The standard zero-based index numbers give easy access to chars near the start of the string. As an alternative, Python uses negative numbers to give easy access to the chars at the end of the string: s[-1] is the last char 'o', s[-2] is 'l' the next-to-last char, and so on. Negative index numbers count back from the end of the string: code s[-1] # is 'o' -- last char (1st from the end) s[-4] # is 'e' -- 4th from the end s[:-3] # is 'He' -- going up to but not including the last 3 chars. s[-3:] # is 'llo' -- starting with the 3rd char from the end and extending to the end of the string.

code It is a neat truism of slices that for any index n, s[:n] + s[n:] == s. This works even for n negative or out of bounds. Or put another way s[:n] and s[n:] always partition the string into two string parts, conserving all the characters. As we'll see in the list section later, slices work with lists too.

Lists
Now you know how to do stuff with variables. But what if you want to do stuff with lots of variables at once? You group them into one large collection and refer to that as a variable. We'll focus on lists today. There are lots of other types of collections (dictionaries, sets, trees, etc) that are better for specific situations, but you should default to using a list unless you have some specific reason not to.

Here's how you create a list and access individual elements from it:

code format="python" my_list = [1,2,3,'elephant'] print my_list[0] # 1 print my_list[3] # 'elephant' print len(my_list) # 4 code

Elements in a list are numbered, starting from 0 (just like strings). len gives number of elements in the list. You can also use negative numbers relative to the end of the list:

code format="python" print my_list[-1] # 'elephant' print my_list[-2] # 3 code

Converting between lists and strings is slightly more complicated than what we've seen so far, but it makes joining things into a word, sentence, or line equally easy.

code format="python" letters = list('a string') print letters # ['a', ' ', 's', 't', 'r', 'i', 'n', 'g'] print str(letters) # "['a', ' ', 's', 't', 'r', 'i', 'n', 'g']" print ''.join(letters) # 'a string' print '-'.join(letters) # 'a- -s-t-r-i-n-g' print ['this','is','list1'] + ['this', 'is', 'list2'] # combines the two lists to create one list print ['a', 'list']*3 # ['a', 'list','a', 'list','a', 'list'] code Assignment with an = on lists does not make a copy. Instead, assignment makes the two variables point to the one list in memory.





code colors = ['red', 'blue', 'green'] b = colors print b # ['red', 'blue', 'green'] colors.remove('blue') print colors # ['red', 'green'] print b # ['red', 'green'] b = colors[:] print b # ['red', 'green'] colors.remove('red') print colors # ['green'] print b # ['red', 'green'] code Notice how in the first case, modifying colors also modifies b, because both colors and b variables point to the same list. When you b = colors[:], it is making a copy and so when the list colors changes b does not.

The "empty list" is just an empty pair of brackets [ ] and can be declared as list1 = [], useful for creating an empty list and populating using list append method.

** List Methods **
Join is a string method that we saw before. Just like strings have methods, lists have methods too. Here are some other common list methods.
 * <span style="color: #222222; font-family: inherit; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline; vertical-align: baseline;">list.append(elem) -- adds a single element to the end of the list. Common error: does not return the new list, just modifies the original.
 * <span style="color: #222222; font-family: inherit; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline; vertical-align: baseline;">list.insert(index, elem) -- inserts the element at the given index, shifting elements to the right.
 * <span style="color: #222222; font-family: inherit; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline; vertical-align: baseline;">list.extend(list2) adds the elements in list2 to the end of the list. Using + or += on a list is similar to using extend.
 * <span style="color: #222222; font-family: inherit; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline; vertical-align: baseline;">list.index(elem) -- searches for the given element from the start of the list and returns its index. Throws a ValueError if the element does not appear (use "in" to check without a ValueError).
 * <span style="color: #222222; font-family: inherit; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline; vertical-align: baseline;">list.remove(elem) -- searches for the first instance of the given element and removes it (throws ValueError if not present)
 * <span style="color: #222222; font-family: inherit; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline; vertical-align: baseline;">list.sort -- sorts the list in place (does not return it). (The sorted function shown below is preferred.)
 * <span style="color: #222222; font-family: inherit; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline; vertical-align: baseline;">list.reverse -- reverses the list in place (does not return it)
 * <span style="color: #222222; font-family: inherit; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline; vertical-align: baseline;">list.pop(index) -- removes and returns the element at the given index. Returns the rightmost element if index is omitted (roughly the opposite of append).

You can modify lists after you create them, unlike strings that are immutable.

code format="python" my_list.append('stuff') my_list.remove('elephant') print my_list # [1, 3, 'stuff'] my_list.insert(1, 'new 2nd element') print my_list # [1, 'new 2nd element', 3, 'stuff'] code

Some methods return objects whereas others don't. The three list methods append, insert and remove modify the list in place and do not return anything. The list method pop for example removes an element of the list from the specified index and returns the element, which you could assign to a new variable.

code pop_element = my_list.pop(1) print pop_element # 'new 2nd element' code

You can also use regular assignment to replace list elements:

code format="python" my_list[0] = 23 print my_list # [23, 1, 3, 'stuff'] code

** Sorting a list **
<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">The easiest way to sort is with the sorted(list) function, which takes a list and returns a new list with those elements in sorted order. The original list is not changed. code a = [5, 1, 4, 3] print sorted(a) ## [1, 3, 4, 5] print a ## [5, 1, 4, 3] code

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">It's most common to pass a list into the sorted function, but in fact it can take as input any sort of iterable collection. The older list.sort method is an alternative detailed below. The sorted function seems easier to use compared to sort, so I recommend using sorted. <span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">The sorted function can be customized though optional arguments. The sorted optional argument reverse=True, e.g. sorted(list, reverse=True), makes it sort backwards. code strs = ['aa', 'BB', 'zz', 'CC'] print sorted(strs) ## ['BB', 'CC', 'aa', 'zz'] (case sensitive) print sorted(strs, reverse=True)  ## ['zz', 'aa', 'CC', 'BB'] code

** Custom Sorting With key= **
code strs = ['ccc', 'aaaa', 'd', 'bb'] print sorted(strs, key=len) ## ['d', 'bb', 'ccc', 'aaaa'] code

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">For more complex custom sorting, sorted takes an optional "key=" specifying a "key" function that transforms each element before comparison. The key function takes in 1 value and returns 1 value, and the returned "proxy" value is used for the comparisons within the sort. <span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">For example with a list of strings, specifying key=len (the built in len function) sorts the strings by length, from shortest to longest. The sort calls len for each string to get the list of proxy length values, and the sorts with those proxy values.

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">As another example, specifying "str.lower" as the key function is a way to force the sorting to treat uppercase and lowercase the same: code print sorted(strs, key=str.lower) ## ['aa', 'BB', 'CC', 'zz'] code
 * 1) "key" argument specifying str.lower function to use for sorting

** sort method **
<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">As an alternative to sorted, the sort method on a list sorts that list into ascending order, e.g. list.sort. The sort method changes the underlying list and returns None, so use it like this: code alist.sort           ## correct alist = blist.sort   ## NO incorrect, sort returns None code <span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">The above is a very common misunderstanding with sort -- it *does not return* the sorted list. The sort method must be called on a list; it does not work on any enumerable collection (but the sorted function above works on anything). The sort method predates the sorted function, so you will likely see it in older code. The sort method does not need to create a new list, so it can be a little faster in the case that the elements to sort are already in a list.

Control Flow (For-In, If-Else, While loops)
So you know how to make large lists of values and do stuff with their individual elements. But what if you want to do stuff with all the elements at once? There's a loop for that: code for elem in my_list: print elem ''' 23 1 3 stuff ''' code

This is called a 'for loop'. It basically does

code format="python" elem = my_list[0] print elem # 23 elem = my_list[1] print elem # 1 code

and so on for everything in my_list. Besides being easier, it also works on any size list (even an empty one). In fact, it doesn't make any assumptions about the type of my_list. You could give it a set, dictionary, custom object or whatever and it would still work.

Notice the : and the indent in the body of the loop. Pycharm will automatically indent after a conditional loop detect if the indents don't align A more tedious way of writing the same for loop

code for i in range(len(my_list)): print my_list[i] ''' 23 1 3 stuff ''' code

Here range is another function that returns a list of integers from 0 up to but not including the number in parenthesis.

The general syntax for conditional and functions is:

for x in y: do stuff

Almost every construct in Python reads that way, with a line naming the construct and setting up variables, a colon, and some indented code that happens 'inside' it. 4 spaces is the most standard way, but you can indent however much you want as long as it's consistent. The constuct ends when you get to a line indented less.

Another way to loop through a list is to use the while loop. code i = 0 while i < len(my_list): print my_list[i] i += 1 # another way to say i = i + 1 code Don't forget to increment the i to prevent an infinite loop.

Another important construct for controlling your program is the 'if statement'. It lets you have two or more alternative blocks of code, and only do one of them depending on the situation.

code format="python" x = 5 if x == 0: print 'x is zero' elif x < 10: print 'x is small' elif x < 50: print 'x is big' else: print 'x is HUGE!' code
 * 1) 'x is small'

Only the first part with the 'if' is required. Then you can have zero or more 'elif's (short for 'else if'), and finally a single 'else' if you want. Each of them is called a 'branch'. Python checks the 'guard' at the beginning of each branch, and if it's a true statement that branch runs. The 'else' branch runs if none of them were true.

This function waits for an input at the console and then uses the typed input to assign to a variable in the program. It always returns a string. code b = raw_input print b # '10' c = raw_input print c # 'T' code
 * A segue to a function called raw_input**
 * 1) input 10
 * 1) input T


 * Exercise 1 : Take a user input for a nucleotide (a or t or g or c) and return its complementary base. If the entered string is not a base, print a message to the user. **

**Nesting**
The first thing to realize is that programming languages are like natural languages: as long as you follow the rules of grammar, anything can be nested inside anything else. Here are a couple examples:

code format="python" my_list = [1, 45, 3, 4, 5] for num in my_list: if num % 2 == 0: print '%s is even' % num else: print '%s is odd' % num ''' '1 is odd' '45 is odd' '4 is even' '5 is odd' ''' code

code format="python" list_in_list = 1,2,3], [4,5,6], [7,8], [ for regular_list in list_in_list: if len(regular_list) == 0: print 'The list %s is empty.' % regular_list else: print 'The list %s has %s elements.' % (regular_list, len(regular_list)) ''' 'The list [1, 2, 3] has 3 elements.' 'The list [4, 5, 6] has 3 elements.' 'The list [7, 8] has 2 elements.' 'The list [] is empty.' ''' code


 * Exercise 2: Remember when we found the complement of a DNA base? Put that in a loop so it works on a list of bases. You can remove the raw_input function for this. **

Tuples
<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">A tuple is a fixed size grouping of elements, such as an (x, y) co-ordinate. Tuples are like lists, except they are immutable and do not change size (tuples are not strictly immutable since one of the contained elements could be mutable). Tuples play a sort of "struct" role in Python -- a convenient way to pass around a little logical, fixed size bundle of values. A function that needs to return multiple values can just return a tuple of the values. For example, if I wanted to have a list of 3-d coordinates, the natural python representation would be a list of tuples, where each tuple is size 3 holding one (x, y, z) group. <span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">To create a tuple, just list the values within parenthesis separated by commas. The "empty" tuple is just an empty pair of parenthesis. Accessing the elements in a tuple is just like a list -- len, [ ], for, in, etc. all work the same. code tuple = (1, 2, 'hi') print len(tuple) ## 3 print tuple[2]   ## hi tuple[2] = 'bye'  ## NO, tuples cannot be changed tuple = (1, 2, 'bye') ## this works code

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">To create a size-1 tuple, the lone element must be followed by a comma. code tuple = ('hi',)  ## size-1 tuple code

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; line-height: 1.5;">It's a funny case in the syntax, but the comma is necessary to distinguish the tuple from the ordinary case of putting an expression in parentheses. In some cases you can omit the parenthesis and Python will see from the commas that you intend a tuple. <span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">Assigning a tuple to an identically sized tuple of variable names assigns all the corresponding values. If the tuples are not the same size, it throws an error. This feature works for lists too. code (x, y, z) = (42, 13, "hike") print z ## hike (err_string, err_code) = Foo ## Foo returns a length-2 tuple code

Dictionaries
<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">Python's efficient key/value hash table structure is called a "dict". The contents of a dict can be written as a series of key:value pairs within braces { }, e.g. dict = {key1:value1, key2:value2, ... }. The "empty dict" is just an empty pair of curly braces {}.

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">Looking up or setting a value in a dict uses square brackets, e.g. dict['foo'] looks up the value under the key 'foo'. Strings, numbers, and tuples work as keys, and any type can be a value. Other types may or may not work correctly as keys (strings and tuples work cleanly since they are immutable). Looking up a value which is not in the dict throws a KeyError -- use "in" to check if the key is in the dict, or use dict.get(key) which returns the value or None if the key is not present (or get(key, not-found) allows you to specify what value to return in the not-found case). code dict = {} dict['a'] = 'alpha' dict['g'] = 'gamma' dict['o'] = 'omega'
 * 1) Can build up a dict by starting with the the empty dict {}
 * 2) and storing key/value pairs into the dict like this:
 * 3) dict[key] = value-for-that-key

print dict ## {'a': 'alpha', 'o': 'omega', 'g': 'gamma'}

print dict['a']    ## Simple lookup, returns 'alpha' dict['a'] = 6      ## Put new key/value into dict 'a' in dict        ## True if 'z' in dict: print dict['z']    ## Avoid KeyError print dict.get('z') ## None (instead of KeyError) code
 * 1) print dict['z']                  ## Throws KeyError

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;"> <span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">A for loop on a dictionary iterates over its keys by default. The keys will appear in an arbitrary order. The methods dict.keys and dict.values return lists of the keys or values explicitly. There's also an items which returns a list of (key, value) tuples, which is the most efficient way to examine all the key value data in the dictionary. All of these lists can be passed to the sorted function. code for key in dict: print key for key in dict.keys: print key
 * 1) By default, iterating over a dict iterates over its keys.
 * 2) Note that the keys are in a random order.
 * 1) prints a g o
 * 2) Exactly the same as above

print dict.keys ## ['a', 'o', 'g']
 * 1) Get the .keys list:

print dict.values ## ['alpha', 'omega', 'gamma'] for key in sorted(dict.keys): print key, dict[key]
 * 1) Likewise, there's a .values list of values
 * 1) Common case -- loop over the keys in sorted order,
 * 2) accessing each key/value

print dict.items ##  [('a', 'alpha'), ('o', 'omega'), ('g', 'gamma')]
 * 1) .items is the dict expressed as (key, value) tuples

for k, v in dict.items: print k, '>', v code
 * 1) This loop syntax accesses the whole dict by looping
 * 2) over the .items tuple list, accessing one (key, value)
 * 3) pair on each iteration.
 * 1) a > alpha    o > omega     g > gamma

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">There are "iter" variants of these methods called iterkeys, itervalues and iteritems which avoid the cost of constructing the whole list -- a performance win if the data is huge. However, I generally prefer the plain keys and values methods with their sensible names.

<span style="color: #222222; font-family: Roboto,sans-serif; font-size: 14px; vertical-align: baseline;">Strategy note: from a performance point of view, the dictionary is one of your greatest tools, and you should use where you can as an easy way to organize data. For example, you might read a log file where each line begins with an ip address, and store the data into a dict using the ip address as the key, and the list of lines where it appears as the value. Once you've read in the whole file, you can look up any ip address and instantly see its list of lines. The dictionary takes in scattered data and make it into something coherent.


 * Exercise 3: Modify your complement script from before, create a dict with bases as keys and complementary bases as values, then use this dict to complement a DNA string (without using if, elif and else conditionals) **

Functions
Functions are used to group code into 'chunks' that do something meaningful, and hide the boring details. They also keep you from having to type things over and over. Here's a function. code format="python" def my_function(arg1, arg2): return arg1 + arg2

code

It takes two arguments and adds them together. Nothing happens yet though, because we haven't called it.

code format="python" my_function(2, 4) # 6 my_function('this', 'that') # 'thisthat' code

Functions can be thought of in two different ways, depending on what you want to do. First, you can treat them like functions in math, which take 0 or more arguments and return one result. Wherever you call a function, Python will get the result and put it in place of the function call.

code format="python" my_function(1,4) < 3 code
 * 1) False

The other way to look at them is as subroutines. They group a series of steps together so you can just write it once and then reference it in other code. When you're using a function that way you might not care about the result; it's the actions inside that are important.

code format="python" def my_function(arg1, arg2): print arg1 print arg2

my_function(2, 4) code
 * 2
 * 4

Any function without a return statement just returns None (which the interpreter doesn't bother printing). You can store that somewhere if you want, but usually you just ignore it.


 * Exercise 4: wrap your DNA-complementing logic in a function. Start by printing each letter, then see if you can get it to return a list of them instead. **
 * And if that was easy: write a function that finds the reverse complement of a DNA sequence, using your first function for most of the logic. Hint: lists have a built-in reverse function. **

Working with Files
Bioinformatics often involves reading and writing files. How do you think the code for that would look? You might expect something like

code format="python" >>> text = read_file('sequence.txt') code

Sometimes that's what you want, and we'll write a function for it. But in general you can't assume the file is small enough to load all at once.

Then you can loop through and do stuff with each line.

code format="python" in_file_handle = open('sequence.txt', 'r') out_file_handle = open('sequence_out.txt', 'a') for line in in_file_handle: print line if len(line) > 50: out_file_handle.write(line) in_file_handle.close out_file_handle.close """ 'atcgatcgatgcatcatctagtacgtagc' 'tagctcgagctgtgggtggtgatgctgtggctgcgctc' 'tagtcgttatcggcatgcattgtggtacgacgcg'

The file sequence_out.txt will have sequence lines which are > 50 bp """ code

You can loop through elements of the seq object like it was a list, but you can't access them at random. For example, file_handle[0] would cause an error. Also, you can only access them once. If you added a second for loop below that one, seq would appear to be an empty list. That's because it keeps track of its position in the file and always feeds you the next line. Once it's at the end there aren't any more.

The 'r' tells Python to open the file for reading. The other options are 'w' for write and 'a' for append. The difference is writing deletes any existing data in the file, while append adds lines to the end of it. (Think of a log file, where you add a line each time something happens).

file handle objects have their own methods.
 * There is a close method that closes an open file handle.
 * next method goes to the next line in the file and returns it
 * readlines method reads the entire file into memory and returns a list where each element is one line
 * read method reads the entire file into memory and returns the contents of the file as as string.
 * write method writes whatever you pass as an argument to it, to the file handle, you first need to open a file handle with write permissions, usually append permissions. This is useful to write outputs of your scripts to.


 * Exercise 5 **


 * 1. Write a Python script read a fasta file and print out the sequence name and nucleotide counts for each fasta record in the file. **


 * 2. Write a Python program to read the file (beware this file is not small and is in compressed format) and write the sequence names and GC percentage of the sequence to another file. **

Solutions: code format="python" base = raw_input('Please enter a base: ') base = base.upper if base == 'A': print 'T' elif base == 'T': print 'A' elif base == 'G': print 'C' elif base == 'C': print 'G' else: print 'Not a base'
 * Exercise 1**

code

code format="python" base_list = ['A', 'G', 'G', 'T', 'C', 'T', 'C', 'A'] comp_list = []
 * Exercise 2**

for base in base_list: if base == 'A': comp_list.append('T') elif base == 'T': comp_list.append('A') elif base == 'G': comp_list.append('C') elif base == 'C': comp_list.append('G') print comp_list code

code format="python" base_dict = {'A':'T', 'T':'A', 'G':'C', 'C':'G'} base_list = ['A', 'G', 'G', 'T', 'C', 'T', 'C', 'A'] comp_list = []
 * Exercise 3**

for base in base_list: comp_base = base_dict[base] comp_list.append(comp_base) print comp_list code

code def complement(base_list, base_dict): comp_list = [] for base in base_list: comp_base = base_dict[base] comp_list.append(comp_base) return comp_list
 * Exercise 4**

base_dict1 = {'A':'T', 'T':'A', 'G':'C', 'C':'G'} base_list2 = ['A', 'G', 'G', 'T', 'C', 'T', 'C', 'A']

print complement(base_list2, base_dict1)

rev_c = complement(base_list2, base_dict1) rev_c.reverse print rev_c code

code fasta_fh = open('mrna10.fa','r') for line in fasta_fh: if line.startswith('>'): seq_name = line[1:-3] # cut off the starting > and the trailing \n and number seq = fasta_fh.next # gets the next line which is the sequence length = str(len(seq)) # length of the sequence converted to string so we can format print '%s\t%s' %(seq_name, length) fasta_fh.close code
 * Exercise 5.1**


 * Exercise 5.2**

code fasta_fh = open('mrna.fa','r') out_fh = open('mrna_gc.out','a') for line in fasta_fh: if line.startswith('>'): seq_name = line[1:-3] # cut off the starting > and the trailing \n and number seq = fasta_fh.next # gets the next line which is the sequence seq_list = list(seq) # convert the sequence to a list so we can count number of occurences of G and C       num_G = seq_list.count('g') num_C = seq_list.count('c') GC_percent = (num_G + num_C)/float(len(seq)) # we convert the output of len(seq) which is an int to float, so we get float division out_fh.write('%s\t%s\n' %(seq_name, GC_percent)) fasta_fh.close out_fh.close code

References==