Introduction to Python for Genomic Data Analysis

May 6, 2013

Jeff Johnson and Madhavan Ganesh



Basics

Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze genomic data sets and especially Next Generation Sequencing data sets. Our goal is to provide an impetus to people to start writing small programs and then be able to explore further on their own. Since the expected audience is people who have little to no programming experience this material covered in this course will be at a beginner level. However, since programming requires some understanding of the command line environment we expect that the attendees have an understanding of the Unix or Unix-like environments (e.g. MacOS). You are also expected to know one of the text editors to read and write text files in the Unix environment. Some of the common ones are emacs, vi, nano etc.

Why Python

Python programming language is new and is increasing in popularity. There are several features in the language that are helpful in tackling problems in the genomic data analysis field. Many of the newer programs that are available in this area are written in Python as well. Moreover it is easier than some other languages for a novice programmer to start on.

Python is an evolving language and there are several versions available now. We will be using version 2.7.1 in this class. Many of the programs available for genomic data analysis are written in versions 2.6 or 2.7, but the difference between these are not very significant. These versions are still the most popular ones. The material covered in this class will work well for the most part in both. The newest versions Python 3.x have major differences from 2.x and some of the code covered in this class may not work with it.

You might notice some Monty Python references.


Programming in general


INPUTS --> PROGRAMS ---> OUTPUTS

Programs are a set of instructions to the computer to do certain actions. Each program will take 0 or more inputs and then carry out the instructions given and produce 0 or more outputs. In general, there are multiple ways of achieving the same results by doing the different sets of actions. Therefore, there are many of writing computer programs to achieve the same outcome. However, some programs are more efficient and elegant than others. Some may be more easy to read and maintain over time. We will not place much emphasis on these points since it is a beginner level class, but it is good to learn those differences and incorporate these into your programs and programming style as you become more proficient.


Setting up Your Workspace

First, download and install Python 2.7 from python.org
Then open the IDLE interpreter and write your first program!

>>> print "Hello, World!"
'Hello, World!'

When you type instructions, the interpreter will execute them and print the result. You can navigate your previous commands using the arrow keys, which makes it easier to try the same line over with a small change.
You can also write the same instructions in a file and run it.
Once you start writing longer programs you'll want to do it that way to avoid retyping things, and also to share your code with others.
Go to File-->New, save the same instruction with the extension .py, and run it with Run->Load Module.


Comments

Python ignores anything from a # to the end of the line; it's just there as a note for other humans.

>>> # this is a comment
...

Aside: some blocks of code can span multiple lines. The Python interpreter will assume you want to keep writing, and show the '...' prompt. To finish the block and run it, just hit enter again.


Variables and Operators


So what can you actually do in a program? Usually it involves reading inputs, creating and changing a bunch of internal variables, and finally communicating a couple of them by printing, sending to a file, etc.
We've seen printing already, so let's try doing some math by creating and changing variables.

>>> x = 3
>>> y = 10
>>> x + y
13
>>> z = x * y
>>> z
30

In a real program, nothing is printed unless you say so. They interpreter just prints values you didn't store anywhere so they aren't lost.
This also introduces the `+` operator, which works how you'd expect. There are `-`, `*`, and `/` operators too.
`=` doesn't really mean 'equal'... it's better to think of it as 'assigned to/holds/refers to'.

Let's try division.

>>> 5 / 2
2

What happened? Division is trickier than the other operators, because of how numbers are represented in a computer.
Which brings us to types...


Types

Everything you store in a computer is represented as a series of 0s and 1s, and depending on what you want to do different representations are best.
Python doesn't bother you with types for the most part, but if you don't understand them you'll get tripped up.
For example, `+` means something different with strings:

>>> 'String 1' + 'String 2'
'String 1String 2'


Ints vs Floats

In Python there are two common types for representing numbers: `int` and `float`. `int`s are simple, exact, and efficient. But they can only represent integers, so division is kind of screwy. There are two division operators: `/` gives the rounded-down answer, and `%` gives the remainder. It's not as bad as it sounds though, because lots of situations come up where you want one of those, rather than traditional division.

>>> 5 / 2
2
>>> 5 % 2
1

`float`s are written by adding a decimal. They handle division the "right" way:

>>> 5.0 / 2.0
2.5

The main downside is they're approximate (limited to a certain number of significant digits) and therefore subject to rounding errors.

>>> 100.0 / 23.3
4.291845493562231

You should never directly test if two `float`s are equal, because they might differ by a tiny amount.


Casting

There are functions for converting one type to another:

>>> int(50.5)
50
>>> float(50)
50.0
>>> str(50.0)
'50.0'

We'll get to functions in a bit; for now just think of them in the math sense: they take an input (in parentheses) and produce some output.
You can use them yourself, and Python also uses them automatically when you write an expression involving more than one type:

>>> 50.0 / 2
25.0

You can mix numbers and strings in the same expression if it makes sense.

>>> 'String 1' * 3
'String 1String 1String 1'


Strings

Strings are used so much it's worth going into some detail about their peculiarities.
They can use either single or double quotes, which helps when you need to include one in the content.
If you use triple quotes, you can span multiple lines.

>>> '''this is a long string
... and it has line breaks
... in it'''
'this is a long string\nand it has line breaks\nin it'

Notice the line breaks show up as '\n'; that's how they're actually stored on disk.
Escaping characters with special meanings is a general problem in programming.
For example, percent encoding in urls.
In files, the main characters to know are '\t' (tab) and '\n' (or '\r\n' on windows).

You can put placeholders in strings, and fill them in with data later. This is useful for printing.

>>> my_str = 'The placeholder was made into %s' % 7
>>> print my_str
'The placeholder was made into 7'
>>> my_str = 'Two placeholders, %s and %s'
>>> my_str = my_str % (2, 'three')
>>> print my_str
'Two placeholders, 2 and three'

The only tricky thing is you need parentheses if there's more than one placeholder.
Oh, and the `%` here has a totally different meaning from in math.


Other Types

There are a couple other important types besides `int`s, `float`s, and `str`s. `bool`s have only two possible values: True or False. They're used mainly to determine what the program should do in different situations, and we'll see examples of that in a few minutes.

None is its own type, and just denotes lack of a value. In languages that don't have this it gets complicated; you might decide 0 should represent 'no value', or -1 or something. But then you have to be careful that your data never contains 0 or -1. Setting something to None instead is cleaner and safer.


Lists

Now you know how to do stuff with variables. But what if you want to do stuff with lots of variables at once? You group them into one large collection and refer to that as a variable. We'll focus on lists today. There are lots of other types of collections (dictionaries, sets, trees, etc) that are better for specific situations, but you should default to using a list unless you have some specific reason not to.

Here's how you create a list and access individual elements from it:

>>> my_list = [1,2,3,'elephant']
>>> my_list[0]
1
>>> my_list[3]
'elephant'

Elements in a list are numbered, starting from 0. There are reasons for not starting at 1, but they aren't important now.
You can also use negative numbers relative to the end of the list:

>>> my_list[-1]
'elephant'
>>> my_list[-2]
3

Converting between lists and strings is slightly more complicated than what we've seen so far,
but it makes joining things into a word, sentence, or line equally easy.

>>> letters = list('a string')
>>> letters
['a', ' ', 's', 't', 'r', 'i', 'n', 'g']
>>> str(letters)
"['a', ' ', 's', 't', 'r', 'i', 'n', 'g']"
>>> ''.join(letters)
'a string'
>>> '-'.join(letters)
'a- -s-t-r-i-n-g'

You can modify lists after you create them, but to understand that we need to take a detour...


Objects

Everything in Python is an "object". That means in addition to holding data, it also holds functions for doing stuff to that data. That's how operators do different things with different types. Numbers have functions for adding themselves to something, multiplying themselves by something, converting themselves to a string, etc.

Lists carry around functions for adding and removing elements to themselves. You can access them using dot notation:

>>> my_list.append('stuff')
>>> my_list.remove('elephant')
>>> my_list
[1, 3, 'stuff']
>>> my_list.insert(1, 'new 2nd element')
>>> my_list
[1, 'new 2nd element', 3, 'stuff']

A huge part of programming is making your own objects with special functions, like NonNegativeInteger or DnaSequence or WebPage. But that's a more advanced topic for another day.

You can also use regular assignment to replace list elements:

>>> my_list[0] = 23
>>> my_list
[23, 3, 'new third element', 'elephant', 'stuff']


Control Flow

So you know how to make large lists of values and do stuff with their individual elements. But what if you want to do stuff with all the elements at once? There's a loop for that:

>>> for elem in my_list:
...     print elem
...
23
3
'new third element'
'elephant'
'stuff'

This is called a 'for loop'. It basically does

>>> elem = my_list[0]
>>> print elem
23
>>> elem = my_list[1]
>>> print elem
3

and so on for everything in my_list. Besides being easier, it also works on any size list (even an empty one). In fact, it doesn't make any assumptions about the type of my_list. You could give it a set, dictionary, custom object or whatever and it would still work.

The general syntax is:

for x in y:
do stuff

Almost every construct in Python reads that way, with a line naming the construct and setting up variables, a colon, and some indented code that happens 'inside' it. 4 spaces is the most standard way, but you can indent however much you want as long as it's consistent. The constuct ends when you get to a line indented less.

Another important construct for controlling your program is the 'if statement'. It lets you have two or more alternative blocks of code, and only do one of them depending on the situation.

>>> x = 5
>>> if x == 0:
...     print 'x is zero'
... elif x < 10:
...     print 'x is small'
... elif x < 50:
...     print 'x is big'
... else:
...     print 'x is HUGE!'
...
'x is small'

Only the first part with the 'if' is required. Then you can have zero or more 'elif's (short for 'else if'), and finally a single 'else' if you want. Each of them is called a 'branch'. Python checks the 'guard' at the beginning of each branch, and if it's a true statement that branch runs. The 'else' branch runs if none of them were true.

Excercise: you have a DNA base stored in a variable. Write an if statement that prints the complementary base. It should work with 'a', 'c', 't', or 'g'. Something like this:

>>> base = 'a'
>>>  if ...
... ...
...
't'


Composing Programs

Well, that's most of the basic peices of Python. Congratulations! You're not nearly done, but you've reached the tipping point where programming goes from formulaic to creative. If this were guitar you would know how to hold the instrument and play some notes, scales, and chords. From here on it's all about combining them into larger performances: choruses, tracks, albums, symphonies.


Nesting

The first thing to realize is that programming languages are like natural languages: as long as you follow the rules of grammar, anything can be nested inside anything else. Here are a couple examples:

>>> my_list = [1, 45, 3, 4, 5]
>>> for num in my_list:
...     if num % 2 == 0:
...         print '%s is even' % num
...     else:
...         print '%s is odd' % num
...
'1 is odd'
'45 is odd'
'4 is even'
'5 is odd'

>>> list_in_list = [[1,2,3], [4,5,6], [7,8], []]
>>> for regular_list in list_in_list:
...     if len(regular_list) == 0:
...         print 'The list %s is empty.' % regular_list
...     else:
...         print 'The list %s has %s elements.' % (regular_list, len(regular_list))
...
'The list [1, 2, 3] has 3 elements.'
'The list [4, 5, 6] has 3 elements.'
'The list [7, 8] has 2 elements.'
'The list [] is empty.'

Exercise: Remember when we found the complement of a DNA base? Put that in a loop so it works on a list of bases.


Functions

Functions are used to group code into 'chunks' that do something meaningful, and hide the boring details.
They also keep you from having to type things over and over.
Here's a function.

>>> def my_function(arg1, arg2):
...     return arg1 + arg2
...

It takes two arguments and adds them together. Nothing happens yet though, because we haven't called it.

>>> my_function(2, 4)
6
>>> my_function('this', 'that')
'thisthat'

Functions can be thought of in two different ways, depending on what you want to do.
First, you can treat them like functions in math, which take 0 or more arguments and return one result.
Wherever you call a function, Python will get the result and put it in place of the function call.

>>> my_function(1,4) < 3
False

The other way to look at them is as subroutines. They group a series of steps together so you can just write it once and then reference it in other code. When you're using a function that way you might not care about the result; it's the actions inside that are important.

>>> def my_function(arg1, arg2):
...     print arg1
...     print arg2
...
>>> my_function(2, 4)
2
4

Any function without a return statement just returns None (which the interpreter doesn't bother printing). You can store that somewhere if you want, but usually you just ignore it.

Excercise: wrap your DNA-complementing logic in a function. Start by printing each letter, then see if you can get it to return a list of them instead.
And if that was easy: write a function that finds the reverse complement of a DNA sequence, using your first function for most of the logic. Hint: lists have a built-in reverse function.


Next Steps

There are more levels of organization to a typical program: classes, modules, maybe packages, all of which can be written yourself or imported from other peoples' code. But there isn't time to cover all that in one day, so let's skip it and go straight to reading and writing files, so you can say you did something useful things with actual data! If you want to really learn programming you should take some time and go through the official Python tutorial (or google for a different one that suits you better).


Working with Files

Bioinformatics often involves reading and writing files.
How do you think the code for that would look? You might expect something like

>>> text = read_file('sequence.txt')

Sometimes that's what you want, and we'll write a function for it.
But in general you can't assume the file is small enough to load all at once.
Instead you need to load it in little pieces, and for that we need a new construct: the 'with statement'.
It reates a special 'cursor' that acts sort of like a list of lines.
Then you can loop through and do stuff with each line.

>>> with open('sequence.txt', 'r') as seq:
...     for line in seq:
...         print line
...
'atcgatcgatgcatcatctagtacgtagc'
'tagctcgagctgtgggtggtgatgctgtggctgcgctc'
'tagtcgttatcggcatgcattgtggtacgacgcg'

You can loop through elements of the seq object like it was a list, but you can't access them at random.
For example, seq[0] would cause an error.
Also, you can only access them once. If you added a second for loop below that one, seq would appear to be an empty list.
That's because it keeps track of its position in the file and always feeds you the next line. Once it's at the end there aren't any more.

The 'r' tells Python to open the file for reading. The other options are 'w' for write and 'a' for append. The difference is writing deletes any existing data in the file, while append adds lines to the end of it. (Think of a log file, where you add a line each time something happens).

With statements are commonly used for files, but they also work with other cases where you want to access large sources of data:

>>> with connect_to_database('mydata.db') as db:
...     do stuff with database...

Complete Program

Download these two files, and try running the script. You should be able to understand everything it does now!