PythonSpring2013_Week2

=Introduction to Python for Genomic Data Analysis=

Basics
Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze genomic data sets and especially Next Generation Sequencing data sets. Our goal is to provide an impetus to people to start writing small programs and then be able to explore further on their own. Since the expected audience is people who have little to no programming experience this material covered in this course will be at a beginner level. However, since programming requires some understanding of the command line environment we expect that the attendees have an understanding of the Unix or Unix-like environments (e.g. MacOS). You are also expected to know one of the text editors to read and write text files in the Unix environment. Some of the common ones are emacs, vi, nano etc.

Why Python
Python programming language is new and is increasing in popularity. There are several features in the language that are helpful in tackling problems in the genomic data analysis field. Many of the newer programs that are available in this area are written in Python as well. Moreover it is easier than some other languages for a novice programmer to start on.

Python is an evolving language and there are several versions available now. We will be using version 2.7.1 in this class. Many of the programs available for genomic data analysis are written in versions 2.6 or 2.7, but the difference between these are not very significant. These versions are still the most popular ones. The material covered in this class will work well for the most part in both. The newest versions Python 3.x have major differences from 2.x and some of the code covered in this class may not work with it.

You might notice some Monty Python references.

Programming in general
INPUTS --> PROGRAMS ---> OUTPUTS

Programs are a set of instructions to the computer to do certain actions. Each program will take 0 or more inputs and then carry out the instructions given and produce 0 or more outputs. In general, there are multiple ways of achieving the same results by doing the different sets of actions. Therefore, there are many of writing computer programs to achieve the same outcome. However, some programs are more efficient and elegant than others. Some may be more easy to read and maintain over time. We will not place much emphasis on these points since it is a beginner level class, but it is good to learn those differences and incorporate these into your programs and programming style as you become more proficient.

Setting up Your Workspace
First, download and install Python 2.7 from python.org Then open the IDLE interpreter and write your first program!

code format="python" >>> print "Hello, World!" 'Hello, World!' code

When you type instructions, the interpreter will execute them and print the result. You can navigate your previous commands using the arrow keys, which makes it easier to try the same line over with a small change. You can also write the same instructions in a file and run it. Once you start writing longer programs you'll want to do it that way to avoid retyping things, and also to share your code with others. Go to File-->New, save the same instruction with the extension .py, and run it with Run->Load Module.

Comments
Python ignores anything from a # to the end of the line; it's just there as a note for other humans.

code format="python" >>> # this is a comment ... code

Aside: some blocks of code can span multiple lines. The Python interpreter will assume you want to keep writing, and show the '...' prompt. To finish the block and run it, just hit enter again.

Variables and Operators
So what can you actually do in a program? Usually it involves reading inputs, creating and changing a bunch of internal variables, and finally communicating a couple of them by printing, sending to a file, etc. We've seen printing already, so let's try doing some math by creating and changing variables.

code format="python" >>> x = 3 >>> y = 10 >>> x + y 13 >>> z = x * y >>> z 30 code

In a real program, nothing is printed unless you say so. They interpreter just prints values you didn't store anywhere so they aren't lost. This also introduces the `+` operator, which works how you'd expect. There are `-`, `*`, and `/` operators too. `=` doesn't really mean 'equal'... it's better to think of it as 'assigned to/holds/refers to'.

Let's try division.

code format="python" >>> 5 / 2 2 code

What happened? Division is trickier than the other operators, because of how numbers are represented in a computer. Which brings us to types...

Types
Everything you store in a computer is represented as a series of 0s and 1s, and depending on what you want to do different representations are best. Python doesn't bother you with types for the most part, but if you don't understand them you'll get tripped up. For example, `+` means something different with strings:

code format="python" >>> 'String 1' + 'String 2' 'String 1String 2' code

Ints vs Floats
In Python there are two common types for representing numbers: `int` and `float`. `int`s are simple, exact, and efficient. But they can only represent integers, so division is kind of screwy. There are two division operators: `/` gives the rounded-down answer, and `%` gives the remainder. It's not as bad as it sounds though, because lots of situations come up where you want one of those, rather than traditional division.

code format="python" >>> 5 / 2 2 >>> 5 % 2 1 code

`float`s are written by adding a decimal. They handle division the "right" way:

code format="python" >>> 5.0 / 2.0 2.5 code

The main downside is they're approximate (limited to a certain number of significant digits) and therefore subject to rounding errors.

code format="python" >>> 100.0 / 23.3 4.291845493562231 code

You should never directly test if two `float`s are equal, because they might differ by a tiny amount.

Casting
There are functions for converting one type to another:

code format="python" >>> int(50.5) 50 >>> float(50) 50.0 >>> str(50.0) '50.0' code

We'll get to functions in a bit; for now just think of them in the math sense: they take an input (in parentheses) and produce some output. You can use them yourself, and Python also uses them automatically when you write an expression involving more than one type:

code format="python" >>> 50.0 / 2 25.0 code

You can mix numbers and strings in the same expression if it makes sense.

code format="python" >>> 'String 1' * 3 'String 1String 1String 1' code

Strings
Strings are used so much it's worth going into some detail about their peculiarities. They can use either single or double quotes, which helps when you need to include one in the content. If you use triple quotes, you can span multiple lines.

code format="python" >>> '''this is a long string ... and it has line breaks ... in it''' 'this is a long string\nand it has line breaks\nin it' code

Notice the line breaks show up as '\n'; that's how they're actually stored on disk. Escaping characters with special meanings is a general problem in programming. For example, percent encoding in urls. In files, the main characters to know are '\t' (tab) and '\n' (or '\r\n' on windows).

You can put placeholders in strings, and fill them in with data later. This is useful for printing.

code format="python" >>> my_str = 'The placeholder was made into %s' % 7 >>> print my_str 'The placeholder was made into 7' >>> my_str = 'Two placeholders, %s and %s' >>> my_str = my_str % (2, 'three') >>> print my_str 'Two placeholders, 2 and three' code

The only tricky thing is you need parentheses if there's more than one placeholder. Oh, and the `%` here has a totally different meaning from in math.

Other Types
There are a couple other important types besides `int`s, `float`s, and `str`s. `bool`s have only two possible values: True or False. They're used mainly to determine what the program should do in different situations, and we'll see examples of that in a few minutes.

None is its own type, and just denotes lack of a value. In languages that don't have this it gets complicated; you might decide 0 should represent 'no value', or -1 or something. But then you have to be careful that your data never contains 0 or -1. Setting something to None instead is cleaner and safer.

Lists
Now you know how to do stuff with variables. But what if you want to do stuff with lots of variables at once? You group them into one large collection and refer to that as a variable. We'll focus on lists today. There are lots of other types of collections (dictionaries, sets, trees, etc) that are better for specific situations, but you should default to using a list unless you have some specific reason not to.

Here's how you create a list and access individual elements from it:

code format="python" >>> my_list = [1,2,3,'elephant'] >>> my_list[0] 1 >>> my_list[3] 'elephant' code

Elements in a list are numbered, starting from 0. There are reasons for not starting at 1, but they aren't important now. You can also use negative numbers relative to the end of the list:

code format="python" >>> my_list[-1] 'elephant' >>> my_list[-2] 3 code

Converting between lists and strings is slightly more complicated than what we've seen so far, but it makes joining things into a word, sentence, or line equally easy.

code format="python" >>> letters = list('a string') >>> letters ['a', ' ', 's', 't', 'r', 'i', 'n', 'g'] >>> str(letters) "['a', ' ', 's', 't', 'r', 'i', 'n', 'g']" >>> ''.join(letters) 'a string' >>> '-'.join(letters) 'a- -s-t-r-i-n-g' code

You can modify lists after you create them, but to understand that we need to take a detour...

Objects
Everything in Python is an "object". That means in addition to holding data, it also holds functions for doing stuff to that data. That's how operators do different things with different types. Numbers have functions for adding themselves to something, multiplying themselves by something, converting themselves to a string, etc.

Lists carry around functions for adding and removing elements to themselves. You can access them using dot notation:

code format="python" >>> my_list.append('stuff') >>> my_list.remove('elephant') >>> my_list [1, 3, 'stuff'] >>> my_list.insert(1, 'new 2nd element') >>> my_list [1, 'new 2nd element', 3, 'stuff'] code

A huge part of programming is making your own objects with special functions, like NonNegativeInteger or DnaSequence or WebPage. But that's a more advanced topic for another day.

You can also use regular assignment to replace list elements:

code format="python" >>> my_list[0] = 23 >>> my_list [23, 3, 'new third element', 'elephant', 'stuff'] code

Control Flow
So you know how to make large lists of values and do stuff with their individual elements. But what if you want to do stuff with all the elements at once? There's a loop for that:

code format="python" >>> for elem in my_list: ...    print elem ... 23 3 'new third element' 'elephant' 'stuff' code

This is called a 'for loop'. It basically does

code format="python" >>> elem = my_list[0] >>> print elem 23 >>> elem = my_list[1] >>> print elem 3 code

and so on for everything in my_list. Besides being easier, it also works on any size list (even an empty one). In fact, it doesn't make any assumptions about the type of my_list. You could give it a set, dictionary, custom object or whatever and it would still work.

The general syntax is:

for x in y: do stuff

Almost every construct in Python reads that way, with a line naming the construct and setting up variables, a colon, and some indented code that happens 'inside' it. 4 spaces is the most standard way, but you can indent however much you want as long as it's consistent. The constuct ends when you get to a line indented less.

Another important construct for controlling your program is the 'if statement'. It lets you have two or more alternative blocks of code, and only do one of them depending on the situation.

code format="python" >>> x = 5 >>> if x == 0: ...    print 'x is zero' ... elif x < 10: ...    print 'x is small' ... elif x < 50: ...    print 'x is big' ... else: ...    print 'x is HUGE!' ... 'x is small' code

Only the first part with the 'if' is required. Then you can have zero or more 'elif's (short for 'else if'), and finally a single 'else' if you want. Each of them is called a 'branch'. Python checks the 'guard' at the beginning of each branch, and if it's a true statement that branch runs. The 'else' branch runs if none of them were true.

Excercise: you have a DNA base stored in a variable. Write an if statement that prints the complementary base. It should work with 'a', 'c', 't', or 'g'. Something like this:

code format="python" >>> base = 'a' >>> if ... ... ... ... 't' code

Composing Programs
Well, that's most of the basic peices of Python. Congratulations! You're not nearly done, but you've reached the tipping point where programming goes from formulaic to creative. If this were guitar you would know how to hold the instrument and play some notes, scales, and chords. From here on it's all about combining them into larger performances: choruses, tracks, albums, symphonies.

Nesting
The first thing to realize is that programming languages are like natural languages: as long as you follow the rules of grammar, anything can be nested inside anything else. Here are a couple examples:

code format="python" >>> my_list = [1, 45, 3, 4, 5] >>> for num in my_list: ...    if num % 2 == 0: ...        print '%s is even' % num ...    else: ...        print '%s is odd' % num ... '1 is odd' '45 is odd' '4 is even' '5 is odd' code

code format="python" >>> list_in_list = 1,2,3], [4,5,6], [7,8], [ >>> for regular_list in list_in_list: ...    if len(regular_list) == 0: ...        print 'The list %s is empty.' % regular_list ...    else: ...        print 'The list %s has %s elements.' % (regular_list, len(regular_list)) ... 'The list [1, 2, 3] has 3 elements.' 'The list [4, 5, 6] has 3 elements.' 'The list [7, 8] has 2 elements.' 'The list [] is empty.' code

Exercise: Remember when we found the complement of a DNA base? Put that in a loop so it works on a list of bases.

Functions
Functions are used to group code into 'chunks' that do something meaningful, and hide the boring details. They also keep you from having to type things over and over. Here's a function.

code format="python" >>> def my_function(arg1, arg2): ...    return arg1 + arg2 ... code

It takes two arguments and adds them together. Nothing happens yet though, because we haven't called it.

code format="python" >>> my_function(2, 4) 6 >>> my_function('this', 'that') 'thisthat' code

Functions can be thought of in two different ways, depending on what you want to do. First, you can treat them like functions in math, which take 0 or more arguments and return one result. Wherever you call a function, Python will get the result and put it in place of the function call.

code format="python" >>> my_function(1,4) < 3 False code

The other way to look at them is as subroutines. They group a series of steps together so you can just write it once and then reference it in other code. When you're using a function that way you might not care about the result; it's the actions inside that are important.

code format="python" >>> def my_function(arg1, arg2): ...    print arg1 ...    print arg2 ... >>> my_function(2, 4) 2 4 code

Any function without a return statement just returns None (which the interpreter doesn't bother printing). You can store that somewhere if you want, but usually you just ignore it.

Excercise: wrap your DNA-complementing logic in a function. Start by printing each letter, then see if you can get it to return a list of them instead. And if that was easy: write a function that finds the //reverse// complement of a DNA sequence, using your first function for most of the logic. Hint: lists have a built-in reverse function.

Next Steps
There are more levels of organization to a typical program: classes, modules, maybe packages, all of which can be written yourself or imported from other peoples' code. But there isn't time to cover all that in one day, so let's skip it and go straight to reading and writing files, so you can say you did something useful things with actual data! If you want to really learn programming you should take some time and go through the official Python tutorial (or google for a different one that suits you better).

Working with Files
Bioinformatics often involves reading and writing files. How do you think the code for that would look? You might expect something like

code format="python" >>> text = read_file('sequence.txt') code

Sometimes that's what you want, and we'll write a function for it. But in general you can't assume the file is small enough to load all at once. Instead you need to load it in little pieces, and for that we need a new construct: the 'with statement'. It reates a special 'cursor' that acts sort of like a list of lines. Then you can loop through and do stuff with each line.

code format="python" >>> with open('sequence.txt', 'r') as seq: ...    for line in seq: ...        print line ... 'atcgatcgatgcatcatctagtacgtagc' 'tagctcgagctgtgggtggtgatgctgtggctgcgctc' 'tagtcgttatcggcatgcattgtggtacgacgcg' code

You can loop through elements of the seq object like it was a list, but you can't access them at random. For example, seq[0] would cause an error. Also, you can only access them once. If you added a second for loop below that one, seq would appear to be an empty list. That's because it keeps track of its position in the file and always feeds you the next line. Once it's at the end there aren't any more.

The 'r' tells Python to open the file for reading. The other options are 'w' for write and 'a' for append. The difference is writing deletes any existing data in the file, while append adds lines to the end of it. (Think of a log file, where you add a line each time something happens).

With statements are commonly used for files, but they also work with other cases where you want to access large sources of data:

code format="python" >>> with connect_to_database('mydata.db') as db: ...    do stuff with database... code

Complete Program
Download these two files, and try running the script. You should be able to understand everything it does now!