Introduction to Python for Genomic Data Analysis
Oct 2nd, 2014
Ravi Alla

Basics

Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze genomic data sets and especially Next Generation Sequencing data sets. Our goal is to provide an impetus to people to start writing small programs and then be able to explore further on their own. Since the expected audience is people who have little to no programming experience this material covered in this course will be at a beginner level. However, since programming requires some understanding of the command line environment we expect that the attendees have an understanding of the Unix or Unix-like environments (e.g. MacOS). To make it easier to write python scripts we will be using a Python Integrated Development Environment (IDE) called Pycharm

Why Python

Python programming language is new and is increasing in popularity. There are several features in the language that are helpful in tackling problems in the genomic data analysis field. Many of the newer programs that are available in this area are written in Python as well. Moreover it is easier than some other languages for a novice programmer to start on.

Python is an evolving language and there are several versions available now. We will be using version 2.7.* in this class. Many of the programs available for genomic data analysis are written in versions 2.6 or 2.7, but the difference between these are not very significant. These versions are still the most popular ones. The material covered in this class will work well for the most part in both. The newest versions Python 3.x have major differences from 2.x and some of the code covered in this class may not work with it.



Programming in general


INPUTS --> PROGRAMS ---> OUTPUTS

Programs are a set of instructions to the computer to do certain actions. Each program will take 0 or more inputs and then carry out the instructions given and produce 0 or more outputs. In general, there are multiple ways of achieving the same results by doing the different sets of actions. Therefore, there are many of writing computer programs to achieve the same outcome. However, some programs are more efficient and elegant than others. Some may be more easy to read and maintain over time. We will not place much emphasis on these points since it is a beginner level class, but it is good to learn those differences and incorporate these into your programs and programming style as you become more proficient.


Setting up Your Workspace

First, download and install Python 2.7 from https://www.python.org/downloads/
Then open the python interpreter and write your first program!

$ python
>>> print "Hello, World!"
'Hello, World!'

When you type instructions, the interpreter will execute them and print the result. You can navigate your previous commands using the arrow keys, which makes it easier to try the same line over with a small change. The interpreter is a good tool to quickly run small code snippets or if you want to check something quick instead of re-running your whole program.
You can also write the same instructions in a file and run it. This is the basis for writing scripts or for writing programs.
Once you start writing longer programs you'll want to do it that way to avoid retyping things, and also to share your code with others.
To do this we can use the Pycharm python IDE.

Screenshot 2014-10-01 15.53.04.jpg

Screenshot 2014-10-01 15.59.20.jpg

Here you can start making new python files and write scripts. The IDE keeps track of formatting, alerts you of potential errors and has code completion (kind of like tab completion with the UNIX command line).
Make a new python file and name it MyFirstScript (if you picked a Python File it will add .py to it automatically).

Screenshot 2014-10-01 16.17.24.jpg

You can also use the terminal and the python interpreter in pycharm. Go to Tools -> Run Python Console (for python interpreter) or Tools -> Open Terminal. Doing this will open both within Pycharm window, so you won't have to go back and forth between the terminal, python interpreter and the text editor (where you're writing your code).

Comments

Python ignores anything from a # to the end of the line; it's just there as a note for other humans.

#this is a comment
#!/usr/bin/env python
"""
This line above is this shebang line, put it at the top of your scripts. When you run your .py script it will use the local python version on any linux terminal.
This is a block of commented code.
You can use either triple double quotes or triple single quotes
"""


Variables and Operators


So what can you actually do in a program? Usually it involves reading inputs, creating and changing a bunch of internal variables, and finally communicating a couple of them by printing, sending to a file, etc.
Variables names can be anything subject to a few conditions: they must begin with a letter, can contain letters or numbers or underscores, cannot be one of Python keywords such as
and          del        from        not       while
as           elif       global      or        with
assert       else       if          pass      yield
break        except     import      print
class        exec       in          raise
continue     finally    is          return
def          for        lambda      try
We've seen printing already, so let's try doing some math by creating and changing variables.


x = 3
y = 10
print x + y # 13
z = x * y
print z # 30
print y/x # 3
In a real program, nothing is printed unless you say so. They interpreter just prints values you didn't store anywhere so they aren't lost.
This also introduces the `+` operator, which works how you'd expect. There are `-`, `*`, and `/` operators too.
`=` doesn't really mean 'equal'... it's better to think of it as 'assigned to/holds/refers to'.

Let's try division.

What happened? Division is trickier than the other operators, because of how numbers are represented in a computer.
Which brings us to types...

Types

Everything you store in a computer is represented as a series of 0s and 1s, and depending on what you want to do different representations are best.
Python doesn't bother you with types for the most part, but if you don't understand them you'll get tripped up.
For example, `+` means something different with strings:

print 'string1' + 'string2' # string1string2


Ints vs Floats

In Python there are two common types for representing numbers: `int` and `float`. `int`s are simple, exact, and efficient. But they can only represent integers, so division is kind of screwy. There are two division operators: `/` gives the rounded-down answer, and `%` gives the remainder. It's not as bad as it sounds though, because lots of situations come up where you want one of those, rather than traditional division.

print 5/2 # 2
print 5%2 # 1

`float`s are written by adding a decimal. They handle division the "right" way:

print 5.0/2.0 # 2.5

The main downside is they're approximate (limited to a certain number of significant digits) and therefore subject to rounding errors.

print 100.0 / 23.3 # 4.291845493562231

You should never directly test if two `float`s are equal, because they might differ by a tiny amount.


Casting

There are functions for converting one type to another:

print int(50.5) # 50
print float(50) # 50.0
print str(50.0) # '50.0'

We'll get to functions in a bit; for now just think of them in the math sense: they take an input (in parentheses) and produce some output.
You can use them yourself, and Python also uses them automatically when you write an expression involving more than one type:

print 50.0 / 2 # 25.0

You can mix numbers and strings in the same expression if it makes sense.

print 'String 1' * 3 # 'String 1String 1String 1'

Strings

Strings are used so much it's worth going into some detail about their peculiarities. Strings are immutable. Once you declare a string, you cannot change the elements of a string.
They can use either single or double quotes, which helps when you need to include one in the content.
If you use triple quotes, you can span multiple lines.

print '''this is a long string
and it has line breaks
in it'''
print 'this is a long string\nand it has line breaks\nin it'
print len('What is the length of this string') # 33

Notice the line breaks show up as '\n'; that's how they're actually stored on disk.
Escaping characters with special meanings is a general problem in programming.
For example, percent encoding in urls.
In files, the main characters to know are '\t' (tab) and '\n' (or '\r\n' on windows).

len() is a built in function that gives the length of the string.
You can put placeholders in strings, and fill them in with data later. This is useful for printing.

my_str = 'The placeholder was made into %s' % 7
print my_str #'The placeholder was made into 7'
my_str = 'Two placeholders, %s and %s'
my_str = my_str % (2, 'three')
print my_str # 'Two placeholders, 2 and three'

The only tricky thing is you need parentheses if there's more than one placeholder. You can use %f and %d for float and integer placeholders.
Oh, and the `%` here has a totally different meaning from in math.

String Methods

Here are some of the most common string methods. A method is like a function, but it runs "on" an object. If the variable s is a string, then the code s.lower() runs the lower() method on that string object and returns the result (this idea of a method running on an object is one of the basic ideas that make up Object Oriented Programming, OOP). Here are some of the common string methods:
  • s.lower(), s.upper() -- returns the lowercase or uppercase version of the string
  • s.strip() -- returns a string with whitespace removed from the start and end
  • s.isalpha()/s.isdigit()/s.isspace()... -- tests if all the string chars are in the various character classes
  • s.startswith('other'), s.endswith('other') -- tests if the string starts or ends with the given other string
  • s.find('other') -- searches for the given other string (not a regular expression) within s, and returns the first index where it begins or -1 if not found
  • s.replace('old', 'new') -- returns a string where all occurrences of 'old' have been replaced by 'new'
  • s.split('delim') -- returns a list of substrings separated by the given delimiter. The delimiter is not a regular expression, it's just text. 'aaa,bbb,ccc'.split(',') -> ['aaa', 'bbb', 'ccc']. As a convenient special case s.split() (with no arguments) splits on all whitespace chars.
  • s.join(list) -- opposite of split(), joins the elements in the given list together using the string as the delimiter. e.g. '---'.join(['aaa', 'bbb', 'ccc']) -> aaa---bbb---ccc

String Slicing

The "slice" syntax is a handy way to refer to sub-parts of sequences -- typically strings and lists. The slice s[start:end] is the elements beginning at start and extending up to but not including end. the Suppose we have s = "Hello"
the string 'hello' with letter indexes 0 1 2 3 4
the string 'hello' with letter indexes 0 1 2 3 4

s = 'Hello'
s[1:4] # is 'ell' -- chars starting at index 1 and extending up to but not including index 4
s[1:] # is 'ello' -- omitting either index defaults to the start or end of the string
s[:] # is 'Hello' -- omitting both always gives us a copy of the whole thing (this is the pythonic way to copy a sequence like a string or list)
s[1:100] # is 'ello' -- an index that is too big is truncated down to the string length
 
The standard zero-based index numbers give easy access to chars near the start of the string. As an alternative, Python uses negative numbers to give easy access to the chars at the end of the string: s[-1] is the last char 'o', s[-2] is 'l' the next-to-last char, and so on. Negative index numbers count back from the end of the string:
s[-1] # is 'o' -- last char (1st from the end)
s[-4] # is 'e' -- 4th from the end
s[:-3] # is 'He' -- going up to but not including the last 3 chars.
s[-3:] # is 'llo' -- starting with the 3rd char from the end and extending to the end of the string.
 
It is a neat truism of slices that for any index n, s[:n] + s[n:] == s. This works even for n negative or out of bounds. Or put another way s[:n] and s[n:] always partition the string into two string parts, conserving all the characters. As we'll see in the list section later, slices work with lists too.

Lists

Now you know how to do stuff with variables. But what if you want to do stuff with lots of variables at once? You group them into one large collection and refer to that as a variable. We'll focus on lists today. There are lots of other types of collections (dictionaries, sets, trees, etc) that are better for specific situations, but you should default to using a list unless you have some specific reason not to.

Here's how you create a list and access individual elements from it:

my_list = [1,2,3,'elephant']
print my_list[0] # 1
print my_list[3] # 'elephant'
print len(my_list) # 4

Elements in a list are numbered, starting from 0 (just like strings). len() gives number of elements in the list.
You can also use negative numbers relative to the end of the list:

print my_list[-1] # 'elephant'
print my_list[-2] # 3

Converting between lists and strings is slightly more complicated than what we've seen so far,
but it makes joining things into a word, sentence, or line equally easy.

letters = list('a string')
print letters # ['a', ' ', 's', 't', 'r', 'i', 'n', 'g']
print str(letters) # "['a', ' ', 's', 't', 'r', 'i', 'n', 'g']"
print ''.join(letters) # 'a string'
print '-'.join(letters) # 'a- -s-t-r-i-n-g'
print ['this','is','list1'] + ['this', 'is', 'list2'] # combines the two lists to create one list
print ['a', 'list']*3 # ['a', 'list','a', 'list','a', 'list']
Assignment with an = on lists does not make a copy. Instead, assignment makes the two variables point to the one list in memory.





list of strings 'red' 'blue 'green'
list of strings 'red' 'blue 'green'


both colors and b point to the one list
both colors and b point to the one list


colors = ['red', 'blue', 'green']
b = colors
print b # ['red', 'blue', 'green']
colors.remove('blue')
print colors # ['red', 'green']
print b # ['red', 'green']
b = colors[:]
print b # ['red', 'green']
colors.remove('red')
print colors # ['green']
print b # ['red', 'green']
Notice how in the first case, modifying colors also modifies b, because both colors and b variables point to the same list. When you b = colors[:], it is making a copy and so when the list colors changes b does not.


The "empty list" is just an empty pair of brackets [ ] and can be declared as list1 = [], useful for creating an empty list and populating using list append method.

List Methods

Join is a string method that we saw before. Just like strings have methods, lists have methods too.
Here are some other common list methods.
  • list.append(elem) -- adds a single element to the end of the list. Common error: does not return the new list, just modifies the original.
  • list.insert(index, elem) -- inserts the element at the given index, shifting elements to the right.
  • list.extend(list2) adds the elements in list2 to the end of the list. Using + or += on a list is similar to using extend().
  • list.index(elem) -- searches for the given element from the start of the list and returns its index. Throws a ValueError if the element does not appear (use "in" to check without a ValueError).
  • list.remove(elem) -- searches for the first instance of the given element and removes it (throws ValueError if not present)
  • list.sort() -- sorts the list in place (does not return it). (The sorted() function shown below is preferred.)
  • list.reverse() -- reverses the list in place (does not return it)
  • list.pop(index) -- removes and returns the element at the given index. Returns the rightmost element if index is omitted (roughly the opposite of append()).

You can modify lists after you create them, unlike strings that are immutable.


my_list.append('stuff')
my_list.remove('elephant')
print my_list # [1, 3, 'stuff']
my_list.insert(1, 'new 2nd element')
print my_list # [1, 'new 2nd element', 3, 'stuff']

Some methods return objects whereas others don't. The three list methods append, insert and remove modify the list in place and do not return anything.
The list method pop for example removes an element of the list from the specified index and returns the element, which you could assign to a new variable.


pop_element = my_list.pop(1)
print pop_element # 'new 2nd element'

You can also use regular assignment to replace list elements:

my_list[0] = 23
print my_list # [23, 1, 3, 'stuff']

Sorting a list

The easiest way to sort is with the sorted(list) function, which takes a list and returns a new list with those elements in sorted order. The original list is not changed.
a = [5, 1, 4, 3]
print sorted(a)  ## [1, 3, 4, 5]
print a  ## [5, 1, 4, 3]

It's most common to pass a list into the sorted() function, but in fact it can take as input any sort of iterable collection. The older list.sort() method is an alternative detailed below. The sorted() function seems easier to use compared to sort(), so I recommend using sorted().
The sorted() function can be customized though optional arguments. The sorted() optional argument reverse=True, e.g. sorted(list, reverse=True), makes it sort backwards.
strs = ['aa', 'BB', 'zz', 'CC']
print sorted(strs)  ## ['BB', 'CC', 'aa', 'zz'] (case sensitive)
print sorted(strs, reverse=True)   ## ['zz', 'aa', 'CC', 'BB']


Custom Sorting With key=
strs = ['ccc', 'aaaa', 'd', 'bb']
print sorted(strs, key=len)  ## ['d', 'bb', 'ccc', 'aaaa']

For more complex custom sorting, sorted() takes an optional "key=" specifying a "key" function that transforms each element before comparison. The key function takes in 1 value and returns 1 value, and the returned "proxy" value is used for the comparisons within the sort.
For example with a list of strings, specifying key=len (the built in len() function) sorts the strings by length, from shortest to longest. The sort calls len() for each string to get the list of proxy length values, and the sorts with those proxy values.

calling sorted with key=len
calling sorted with key=len



As another example, specifying "str.lower" as the key function is a way to force the sorting to treat uppercase and lowercase the same:
## "key" argument specifying str.lower function to use for sorting
print sorted(strs, key=str.lower)  ## ['aa', 'BB', 'CC', 'zz']

sort() method

As an alternative to sorted(), the sort() method on a list sorts that list into ascending order, e.g. list.sort(). The sort() method changes the underlying list and returns None, so use it like this:
alist.sort()            ## correct
alist = blist.sort()    ## NO incorrect, sort() returns None
The above is a very common misunderstanding with sort() -- it *does not return* the sorted list. The sort() method must be called on a list; it does not work on any enumerable collection (but the sorted() function above works on anything). The sort() method predates the sorted() function, so you will likely see it in older code. The sort() method does not need to create a new list, so it can be a little faster in the case that the elements to sort are already in a list.

Control Flow (For-In, If-Else, While loops)

So you know how to make large lists of values and do stuff with their individual elements. But what if you want to do stuff with all the elements at once? There's a loop for that:
for elem in my_list:
    print elem
'''
23
1
3
stuff
'''

This is called a 'for loop'. It basically does

elem = my_list[0]
print elem # 23
elem = my_list[1]
print elem # 1

and so on for everything in my_list. Besides being easier, it also works on any size list (even an empty one). In fact, it doesn't make any assumptions about the type of my_list. You could give it a set, dictionary, custom object or whatever and it would still work.

Notice the : and the indent in the body of the loop. Pycharm will automatically indent after a conditional loop detect if the indents don't align
A more tedious way of writing the same for loop

for i in range(len(my_list)):
    print my_list[i]
'''
23
1
3
stuff
'''

Here range() is another function that returns a list of integers from 0 up to but not including the number in parenthesis.

The general syntax for conditional and functions is:

for x in y:
do stuff

Almost every construct in Python reads that way, with a line naming the construct and setting up variables, a colon, and some indented code that happens 'inside' it. 4 spaces is the most standard way, but you can indent however much you want as long as it's consistent. The constuct ends when you get to a line indented less.

Another way to loop through a list is to use the while loop.
i = 0
while i < len(my_list):
    print my_list[i]
    i += 1 # another way to say i = i + 1
Don't forget to increment the i to prevent an infinite loop.

Another important construct for controlling your program is the 'if statement'. It lets you have two or more alternative blocks of code, and only do one of them depending on the situation.

x = 5
if x == 0:
    print 'x is zero'
elif x < 10:
    print 'x is small'
elif x < 50:
    print 'x is big'
else:
    print 'x is HUGE!'
# 'x is small'

Only the first part with the 'if' is required. Then you can have zero or more 'elif's (short for 'else if'), and finally a single 'else' if you want. Each of them is called a 'branch'. Python checks the 'guard' at the beginning of each branch, and if it's a true statement that branch runs. The 'else' branch runs if none of them were true.


A segue to a function called raw_input()
This function waits for an input at the console and then uses the typed input to assign to a variable in the program. It always returns a string.
b = raw_input()
# input 10
print b # '10'
c = raw_input()
# input T
print c # 'T'

Exercise 1: Take a user input for a nucleotide (a or t or g or c) and return its complementary base. If the entered string is not a base, print a message to the user.


Nesting

The first thing to realize is that programming languages are like natural languages: as long as you follow the rules of grammar, anything can be nested inside anything else. Here are a couple examples:

my_list = [1, 45, 3, 4, 5]
for num in my_list:
    if num % 2 == 0:
        print '%s is even' % num
    else:
        print '%s is odd' % num
'''
'1 is odd'
'45 is odd'
'4 is even'
'5 is odd'
'''

list_in_list = [[1,2,3], [4,5,6], [7,8], []]
for regular_list in list_in_list:
    if len(regular_list) == 0:
        print 'The list %s is empty.' % regular_list
    else:
        print 'The list %s has %s elements.' % (regular_list, len(regular_list))
'''
'The list [1, 2, 3] has 3 elements.'
'The list [4, 5, 6] has 3 elements.'
'The list [7, 8] has 2 elements.'
'The list [] is empty.'
'''

Exercise 2: Remember when we found the complement of a DNA base? Put that in a loop so it works on a list of bases. You can remove the raw_input() function for this.

Tuples

A tuple is a fixed size grouping of elements, such as an (x, y) co-ordinate. Tuples are like lists, except they are immutable and do not change size (tuples are not strictly immutable since one of the contained elements could be mutable). Tuples play a sort of "struct" role in Python -- a convenient way to pass around a little logical, fixed size bundle of values. A function that needs to return multiple values can just return a tuple of the values. For example, if I wanted to have a list of 3-d coordinates, the natural python representation would be a list of tuples, where each tuple is size 3 holding one (x, y, z) group.
To create a tuple, just list the values within parenthesis separated by commas. The "empty" tuple is just an empty pair of parenthesis. Accessing the elements in a tuple is just like a list -- len(), [ ], for, in, etc. all work the same.
tuple = (1, 2, 'hi')
print len(tuple)  ## 3
print tuple[2]    ## hi
tuple[2] = 'bye'  ## NO, tuples cannot be changed
tuple = (1, 2, 'bye')  ## this works


To create a size-1 tuple, the lone element must be followed by a comma.
tuple = ('hi',)   ## size-1 tuple

It's a funny case in the syntax, but the comma is necessary to distinguish the tuple from the ordinary case of putting an expression in parentheses. In some cases you can omit the parenthesis and Python will see from the commas that you intend a tuple.
Assigning a tuple to an identically sized tuple of variable names assigns all the corresponding values. If the tuples are not the same size, it throws an error. This feature works for lists too.
(x, y, z) = (42, 13, "hike")
print z  ## hike
(err_string, err_code) = Foo()  ## Foo() returns a length-2 tuple

Dictionaries


Python's efficient key/value hash table structure is called a "dict". The contents of a dict can be written as a series of key:value pairs within braces { }, e.g. dict = {key1:value1, key2:value2, ... }. The "empty dict" is just an empty pair of curly braces {}.

Looking up or setting a value in a dict uses square brackets, e.g. dict['foo'] looks up the value under the key 'foo'. Strings, numbers, and tuples work as keys, and any type can be a value. Other types may or may not work correctly as keys (strings and tuples work cleanly since they are immutable). Looking up a value which is not in the dict throws a KeyError -- use "in" to check if the key is in the dict, or use dict.get(key) which returns the value or None if the key is not present (or get(key, not-found) allows you to specify what value to return in the not-found case).
## Can build up a dict by starting with the the empty dict {}
## and storing key/value pairs into the dict like this:
## dict[key] = value-for-that-key
dict = {}
dict['a'] = 'alpha'
dict['g'] = 'gamma'
dict['o'] = 'omega'
 
print dict  ## {'a': 'alpha', 'o': 'omega', 'g': 'gamma'}
 
print dict['a']     ## Simple lookup, returns 'alpha'
dict['a'] = 6       ## Put new key/value into dict
'a' in dict         ## True
## print dict['z']                  ## Throws KeyError
if 'z' in dict: print dict['z']     ## Avoid KeyError
print dict.get('z')  ## None (instead of KeyError)


dict with keys 'a' 'o' 'g'
dict with keys 'a' 'o' 'g'

A for loop on a dictionary iterates over its keys by default. The keys will appear in an arbitrary order. The methods dict.keys() and dict.values() return lists of the keys or values explicitly. There's also an items() which returns a list of (key, value) tuples, which is the most efficient way to examine all the key value data in the dictionary. All of these lists can be passed to the sorted() function.
## By default, iterating over a dict iterates over its keys.
## Note that the keys are in a random order.
for key in dict: print key
## prints a g o
## Exactly the same as above
for key in dict.keys(): print key
 
## Get the .keys() list:
print dict.keys()  ## ['a', 'o', 'g']
 
## Likewise, there's a .values() list of values
print dict.values()  ## ['alpha', 'omega', 'gamma']
## Common case -- loop over the keys in sorted order,
## accessing each key/value
for key in sorted(dict.keys()):
  print key, dict[key]
 
## .items() is the dict expressed as (key, value) tuples
print dict.items()  ##  [('a', 'alpha'), ('o', 'omega'), ('g', 'gamma')]
 
## This loop syntax accesses the whole dict by looping
## over the .items() tuple list, accessing one (key, value)
## pair on each iteration.
for k, v in dict.items(): print k, '>', v
## a > alpha    o > omega     g > gamma


There are "iter" variants of these methods called iterkeys(), itervalues() and iteritems() which avoid the cost of constructing the whole list -- a performance win if the data is huge. However, I generally prefer the plain keys() and values() methods with their sensible names.

Strategy note: from a performance point of view, the dictionary is one of your greatest tools, and you should use where you can as an easy way to organize data. For example, you might read a log file where each line begins with an ip address, and store the data into a dict using the ip address as the key, and the list of lines where it appears as the value. Once you've read in the whole file, you can look up any ip address and instantly see its list of lines. The dictionary takes in scattered data and make it into something coherent.

Exercise 3: Modify your complement script from before, create a dict with bases as keys and complementary bases as values, then use this dict to complement a DNA string (without using if, elif and else conditionals)

Functions

Functions are used to group code into 'chunks' that do something meaningful, and hide the boring details.
They also keep you from having to type things over and over.
Here's a function.
def my_function(arg1, arg2):
    return arg1 + arg2
 

It takes two arguments and adds them together. Nothing happens yet though, because we haven't called it.

my_function(2, 4) # 6
my_function('this', 'that') # 'thisthat'

Functions can be thought of in two different ways, depending on what you want to do.
First, you can treat them like functions in math, which take 0 or more arguments and return one result.
Wherever you call a function, Python will get the result and put it in place of the function call.

my_function(1,4) < 3
# False

The other way to look at them is as subroutines. They group a series of steps together so you can just write it once and then reference it in other code. When you're using a function that way you might not care about the result; it's the actions inside that are important.

def my_function(arg1, arg2):
    print arg1
    print arg2
 
my_function(2, 4)
# 2
# 4

Any function without a return statement just returns None (which the interpreter doesn't bother printing). You can store that somewhere if you want, but usually you just ignore it.

Exercise 4: wrap your DNA-complementing logic in a function. Start by printing each letter, then see if you can get it to return a list of them instead.
And if that was easy: write a function that finds the reverse complement of a DNA sequence, using your first function for most of the logic. Hint: lists have a built-in reverse function.


Working with Files

Bioinformatics often involves reading and writing files.
How do you think the code for that would look? You might expect something like

>>> text = read_file('sequence.txt')

Sometimes that's what you want, and we'll write a function for it.
But in general you can't assume the file is small enough to load all at once.

Then you can loop through and do stuff with each line.

in_file_handle = open('sequence.txt', 'r')
out_file_handle = open('sequence_out.txt', 'a')
for line in in_file_handle:
     print line
if len(line) > 50:
     out_file_handle.write(line)
in_file_handle.close()
out_file_handle.close()
"""
'atcgatcgatgcatcatctagtacgtagc'
'tagctcgagctgtgggtggtgatgctgtggctgcgctc'
'tagtcgttatcggcatgcattgtggtacgacgcg'
 
The file sequence_out.txt will have sequence lines which are > 50 bp
"""

You can loop through elements of the seq object like it was a list, but you can't access them at random.
For example, file_handle[0] would cause an error.
Also, you can only access them once. If you added a second for loop below that one, seq would appear to be an empty list.
That's because it keeps track of its position in the file and always feeds you the next line. Once it's at the end there aren't any more.

The 'r' tells Python to open the file for reading. The other options are 'w' for write and 'a' for append. The difference is writing deletes any existing data in the file, while append adds lines to the end of it. (Think of a log file, where you add a line each time something happens).

file handle objects have their own methods.
  • There is a close() method that closes an open file handle.
  • next() method goes to the next line in the file and returns it
  • readlines() method reads the entire file into memory and returns a list where each element is one line
  • read() method reads the entire file into memory and returns the contents of the file as as string.
  • write() method writes whatever you pass as an argument to it, to the file handle, you first need to open a file handle with write permissions, usually append permissions. This is useful to write outputs of your scripts to.


Exercise 5

1. Write a Python script read a fasta file and print out the sequence name and nucleotide counts for each fasta record in the file.


2. Write a Python program to read the file (beware this file is not small and is in compressed format) and write the sequence names and GC percentage of the sequence to another file.






Solutions:
Exercise 1
base = raw_input('Please enter a base: ')
base = base.upper()
if base == 'A':
    print 'T'
elif base == 'T':
    print 'A'
elif base == 'G':
    print 'C'
elif base == 'C':
    print 'G'
else:
    print 'Not a base'
 


Exercise 2
base_list = ['A', 'G', 'G', 'T', 'C', 'T', 'C', 'A']
comp_list = []
 
for base in base_list:
    if base == 'A':
        comp_list.append('T')
    elif base == 'T':
        comp_list.append('A')
    elif base == 'G':
        comp_list.append('C')
    elif base == 'C':
        comp_list.append('G')
print comp_list


Exercise 3
base_dict = {'A':'T', 'T':'A', 'G':'C', 'C':'G'}
base_list = ['A', 'G', 'G', 'T', 'C', 'T', 'C', 'A']
comp_list = []
 
for base in base_list:
    comp_base = base_dict[base]
    comp_list.append(comp_base)
print comp_list



Exercise 4
def complement(base_list, base_dict):
    comp_list = []
    for base in base_list:
        comp_base = base_dict[base]
        comp_list.append(comp_base)
    return comp_list
 
base_dict1 = {'A':'T', 'T':'A', 'G':'C', 'C':'G'}
base_list2 = ['A', 'G', 'G', 'T', 'C', 'T', 'C', 'A']
 
print complement(base_list2, base_dict1)
 
rev_c = complement(base_list2, base_dict1)
rev_c.reverse()
print rev_c

Exercise 5.1
fasta_fh = open('mrna10.fa','r')
for line in fasta_fh:
    if line.startswith('>'):
        seq_name = line[1:-3] # cut off the starting > and the trailing \n and number
        seq = fasta_fh.next() # gets the next line which is the sequence
        length = str(len(seq)) # length of the sequence converted to string so we can format
        print '%s\t%s' %(seq_name, length)
fasta_fh.close()


Exercise 5.2

fasta_fh = open('mrna.fa','r')
out_fh = open('mrna_gc.out','a')
for line in fasta_fh:
    if line.startswith('>'):
        seq_name = line[1:-3] # cut off the starting > and the trailing \n and number
        seq = fasta_fh.next() # gets the next line which is the sequence
        seq_list = list(seq) # convert the sequence to a list so we can count number of occurences of G and C
        num_G = seq_list.count('g')
        num_C = seq_list.count('c')
        GC_percent = (num_G + num_C)/float(len(seq)) # we convert the output of len(seq) which is an int to float, so we get float division
        out_fh.write('%s\t%s\n' %(seq_name, GC_percent))
fasta_fh.close()
out_fh.close()





References==[1]
  1. ^ 1. https://developers.google.com/edu/python/
    2. http://intro-prog-bioinfo-2014.wikispaces.com
    3. http://cgrlucb.wikispaces.com/PythonSpring2013
    4. https://docs.python.org/2/index.html