Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze molecular sequence data sets and especially Next Generation Sequencing data sets. Our goal is to provide an impetus to people to start writing small programs and then be able to explore further on their own. The expected audience for this course are those who have little to no programming experience, and this material covered in this course will be at a beginner level. However, since programming requires some understanding of the command line environment we expect that the attendees have an understanding of the Unix or Unix-like environments (e.g. MacOS). To make it easier to write python scripts we will be using a particular text editor called Aquamacs.
Why Python
Python programming language is new and is increasing in popularity. There are several features in the language that are helpful in tackling problems in the genomic data analysis field. Many of the newer programs that are available in this area are written in Python as well. Moreover it is easier than some other languages for a novice programmer to start on.
Python is an evolving language and there are several versions available now. We will be using version 2.7.* in this class. Many of the programs available for genomic data analysis are written in versions 2.6 or 2.7, but the difference between these are not very significant. These versions are still the most popular ones. The material covered in this class will work well for the most part in both. The newest versions Python 3.x have several differences from 2.x and parts of the code covered in this class may not work with it, so please be aware of this.
Programming in general
INPUTS --> PROGRAMS ---> OUTPUTS
Programs are a set of instructions to the computer to do certain actions. Each program will take 0 or more inputs and then carry out the instructions given and produce 0 or more outputs. In general, there are multiple ways of achieving the same results by doing the different sets of actions. Therefore, there are many ways to writing computer programs to achieve the same outcome, with some more efficient and elegant than others. The key point is that with a few basic tools, you will be able to build your own pipelines and have the power to automate certain tasks, such as batch processing files.
As you work through this material, try to think of ways to apply these methods to your specific problems. As beginners, it's good to think in pseudo-code, that is, break down your large problems into smaller actual tasks, and then think about how these coding methods can be used to solve each of those tasks. It's overwhelming to take on a large problem all at once, but breaking the problem into discrete tasks and then applying the python language to those tasks will help.
When you type instructions, the interpreter will execute them and print the result. You can navigate your previous commands using the arrow keys (up, down), which makes it easier to try the same line over with a small change. The interpreter is a good command line tool to quickly run small code snippets or if you want to check something quick instead of re-running your whole program.
You can also write the same instructions in a file and run it. This is the basis for writing scripts and programs. Once you start writing longer programs, you'll want to do use text files for editing your code, and also to share your code with others. To do this, we can use aquamacs to edit the text in our script files, which you also should have installed.
Let's first create a directory on our desktop, and then create a blank script (text file) to edit. We'll do this using basic Unix commands in our terminal window.
$ mkdir python_folder
$ cd python_folder
$ touch first_script.py
The extension '.py' is used to indicate this script can be executed using python. You should get used to naming your scripts with this extension. You can check to see the file is now in this directory. Now we want to begin editing this file using aquamacs.
Here are some quick options for using aquamacs: open a file from command line: aquamacs [filename] & save a file: CTRL-X-S, or command-s close a file: command + W
$ aquamacs first_script.py &
By adding the ampersand (&), you keep control of your shell prompt after aquamacs launches. If you don't use the ampersand, you'll need to open another terminal window in a moment to run our first program.
***If this doesn't work for you, simply open the script manually in aquamacs. The above is a nice command line shortcut to the program.
Now let's write the program to our script, 'first_script.py'. Copy and paste the following below into your aquamacs editor, then save it:
print"Hello world!"
Now go back to your terminal window and execute the script using Python:
$ python first_script.py
It should have produced the same output as before: Hello, World!
Now, let's try to change a couple quick things in this script to show different ways to use the "print" function:
#try this one firstprint"hello","world"#then try this one afterprint"hello"print"world"#and finally this oneprint"hello"printprint"world"
What was different about these, if anything?
Comments
Python ignores anything from a # to the end of the line; it's just there as a note for other humans.
#this is a comment#you can makes notes about what the code is doing throughout your script'''
This is a block of commented code.
You can use either triple double quotes or triple single quotes
I use this at the beginning of a script to explain what it does and how to use it.
'''
A variable is a datum with a human-readable name which can change values as the logic of the program dictates, as opposed to other types of data which have a constant, unchanging value once declared.
Variables must follow these rules:
1) Variable names are case-sensitive, so Variable1 and variable1 are different variables. 2) Variable names must start with a letter (a-z) but can contain letters, numbers and underscores. 3) Variable names cannot contain spaces or non-alphanumeric characters 4) Variable names cannot be any of the following keywords that already have special meaning in python:
Let's explore the three main types of variables (strings, integers, floating-point numbers) in a new script. From your terminal window, let's make a new script:
$ touch variables.py
$ aquamacs variables.py &
Variables are assigned by writing: - the name you want to use - a single equals sign - the value you'd like to store
This is the same whether the variable you're assigning is of type str (a character string), int (whole number), float (non-integer real number), or more fancy variables. Add this code (copy and paste) to your new variables script:
#create your variables
variable_1 ="Bananas"
variable_2 =45
variable_3 =45.5678#print out variables, then information about the variablesprintprint variable_1
print"the type of variable this is = ",type(variable_1),'\n'#type() is a function#'\n' is the character for newlineprint variable_2
print"the type of variable this is = ",type(variable_2),'\n'print variable_3
print"the type of variable this is = ",type(variable_3),'\n'
The output should look like this:
Bananas
the type of variable this is = <type 'str'>
45
the type of variable this is = <type 'int'>
45.5678
the type of variable this is = <type 'float'>
You can always figure out the variable type by using the Python built-in function 'type', one of many useful basic functions.
Numerical Operations with integers and floats
There are a number of numerical operations that can be performed with integers and floats, but differences do exist! Copy and paste the following over your variables script:
a =42
b =3# addition uses the plus sign (+)
addition = a + b
# subtraction uses the minus sign (-)
subtraction = a - b
# multiplication uses the asterisk (*)
multiplication = a * b
# division uses the slash (/)
division = a / b
# and exponents use a double-asterisk (**)
exponent = a ** b
print'sum', addition
print'diff', subtraction
print'prod', multiplication
print'quo', division
print'pow', exponent
x =5print"x should be 5, like we made it! see x = ", x
#edit a variable (type integer) by adding one, useful for creating a counter!
x +=1print"now x is one more than before! see x = ", x
x +=1print"now x is two more than before! see x = ", x
The output should look like this:
sum 45
diff 39
prod 126
quo 14
pow 74088
x should be 5, like we made it! see x = 5
now x is one more than before! see x = 6
now x is two more than before! see x = 7
Now try it with some 'floats' instead of 'ints'. Change the variables at the beginning of your script to:
a =58.672
b =3.745
The output should now look like this:
sum 62.417
diff 54.927
prod 219.72664
quo 15.6667556742
pow 4195403.08144
This is important to consider when doing operations, because: - for integers, 5 / 2 = 2 - for floats, 5.0 / 2.0 = 2.5
You can always change the type of variable through casting.
Note, performing an operation will do nothing visible unless it is printed or stored. For example:
#add variables
a =15
b =10#this won't 'create' anything visible
a/b
#but this willprint a/b
#and this too, where we store the value as a variable
c = a/b
print c
Casting and Coercion
There are coercion functions for converting one type to another. Let's demonstrate this using a script. It's a little long, but will demonstrate the changes that can be made between variable types and how they can affect operations.
int_1 =52
int_2 =4
int_3 =15#print information about variablesprint int_1,'is a',type(int_1)print int_2,'is a',type(int_2)print int_3,'is a',type(int_3)#perform some divisionprint int_1,"divided by", int_2,"=", int_1/int_2
print int_2,"divided by", int_3,"=", int_2/int_3
print'\n',"Now let's change those integers to floats!",'\n'#change variable types from integers to floats using the float() function
int_1 =float(int_1)
int_2 =float(int_2)
int_3 =float(int_3)#print information about newly coerced variables to confirm changesprint int_1,'is a',type(int_1)print int_2,'is a',type(int_2)print int_3,'is a',type(int_3)#perform some division with the floatsprint int_1,"divided by", int_2,"=", int_1/int_2
print int_2,"divided by", int_3,"=", int_2/int_3
print'\n',"Now let's change those floats to strings!",'\n'#change to a string using the str() function
int_1 =str(int_1)
int_2 =str(int_2)
int_3 =str(int_3)#confirm the changes have occurredprint int_1,'is a',type(int_1)print int_2,'is a',type(int_2)print int_3,'is a',type(int_3)
Notice in particular how 4 divided by 15 is calculated as an integer and following a transformation to a float.
This is a useful tool especially when it comes to importing text from files, we need control over what variable types certain items should be. We can even turn numbers into strings by using the str() function, but numerical operations won't work the same for these. String operations are a whole different topic!
Strings
Strings are used A LOT so it's worth going into details about their peculiarities. Much of what you'll import from text files will enter your python world as strings, so we need to know how to manipulate them and change them. - Strings are immutable. Once you declare a string, you cannot change the elements of a string. - They can use either single or double quotes, which helps when you need to include one in the content. - If you use triple quotes, you can span multiple lines.
Create a new script for some string manipulations:
$ touch strings.py
$ aquamacs strings.py &
And add the following:
s ='I think "coffee" is pretty fantastic.'print s,'\n'
s2 ="I much prefer tea."print s2,'\n'
s3 ='''I 'prefer' drinking both
because I "could"
really, really
use
caffeine.'''print s3,'\n'
Notice the output:
I think "coffee" is pretty fantastic.
I much prefer tea.
I 'prefer' drinking both
because I "could"
really, really
use
caffeine.
The triple quotes will preserve line breaks as is in the code, and allow single and double quotation pairs.
Escape Characters
Now for a difficulty in working with strings... escape characters. Try running the following code by copying and pasting over your string script:
#bring on the tab
s ='some\thing is missing'print s,'\n'#bring on the newline
s2 ="somethi\ng is broken"print s2,'\n'#bring on the bells
s3 ='''something th\at will drive you b\an\an\as'''print s3,'\n'#solution 1
s4 = r'\a solu\tio\n'print s4,'\n'#solution 2
s5 ='\\another solu\\tio\\n'print s5,'\n'
This produces a very ugly output that does not match most of our strings:
some hing is missing
somethi
g is broken
something tht will drive you bnns
\a solu\tio\n
\another solu\tio\n
Escaping characters with special meanings is a general problem in programming. In files, the main characters to know are '\t' (tab) and '\n' (newline), but also include bell noises '\a' and other bothers.
Python offers two ways around this: the safest is to escape your escape, using a second backslash (see s5 above, '\\'), but this requires adding text to the string. A fancier way involves a special kind of string, the raw string. Raw strings start with r' and end with ' will treat every character between as exactly what it looks like, with the exception of the single quote (which ends the string). If you do use raw strings, watch out for two catches: 1) You must still escape single quotes you want to appear inside the string. 2) The last character of the string cannot be a backslash, since this will escape the closing quote.
These may not be big problems for you if you don't write many strings containing backslashes, but you may not have control over what you import from other text files.
String Placeholders
You can put placeholders in strings, and fill them in using variables from outside of the string. This is VERY useful for printing and also when it comes to writing to files further down the line. We'll test this using the string.format() method. Try the following code below:
a ='teas'
b ='coffees'
c =5
d =4.5#use the string.format() method to include variables in the stringprint"I would like several {}".format(b)#use the same method with multiple variablesprint"Actually, I would like roughly {} {} and maybe {} of your {}".format(d, b, c, a)#you can even reuse the same variables if we assign them numbers, starting with zeroprint"Okay, just give me {0} {1} and {0} {2}, then {0} more {2}".format(c, a, b)
Your output should be:
I would like several coffees
Actually, I would like roughly 4.5 coffees and maybe 5 of your teas
Okay, just give me 5 teas and 5 coffees, then 5 more coffees
The curly braces allow you to insert a variable into the string anywhere you'd like. Follow the string with: .format(variable_to_insert), making sure to include multiple variables if multiple braces are used.
Notice we can include any type of variable in the string, including other strings, integers, and floats. This is especially useful for printing to the screen and for writing to output files, which will largely be in strings.
String Methods
Here are some of the most common string methods. A method is like a function, but it runs "on" an object. If the variable s is a string, then the code s.lower() runs the lower() method on that string object and returns the result. We've used functions [like str() and float() and len()] and methods [string.format()] in our previous scripts.
Here are some of the common string methods (the ones I use the most are in bold):
s.lower(), s.upper() -- returns the lowercase or uppercase version of the string
Useful for converting molecular sequence data to the same format (all uppercase)
s.strip() -- returns a string with whitespace removed from the start and end
An essential tool for removing oddities from lines when importing them from outside text files
s.isalpha()/s.isdigit()/s.isspace()... -- tests if all the string chars are in the various character classes
s.startswith('other'), s.endswith('other') -- tests if the string starts or ends with the given other string
Imagine searching for a particular file extension with s.endswith('fasta') in a directory...
s.find('other') -- searches for the given other string (not a regular expression) within s, and returns the first index where it begins or -1 if not found
s.replace('old', 'new') -- returns a string where all occurrences of 'old' have been replaced by 'new'
s.split('delim') -- returns a list of substrings separated by the given delimiter. The delimiter is not a regular expression, it's just text. 'aaa,bbb,ccc'.split(',') -> ['aaa', 'bbb', 'ccc']. As a convenient special case s.split() (with no arguments) splits on all whitespace chars.
This can be used with characters we already know, for example '\t' for splitting a tab-delimited text file.
s.join(list) -- opposite of split(), joins the elements in the given list together using the string as the delimiter. e.g. '---'.join(['aaa', 'bbb', 'ccc']) -> aaa---bbb---ccc
Let's show some examples using some fake sequence data.
seq1 ="aCtgttttggcagccGGGGAtG"print"seq1:", seq1
#change to all upper case
seq1 = seq1.upper()print"-seq1 using uppercase transform:"print seq1
#test whether beginning of sequence matches a patternprint"-test whether beginning matches ACT:"print seq1.startswith("ACT")print"-test whether beginning matches act:"print seq1.startswith("act")#what happens when we split?
seq1 = seq1.split("C")print"-sequence after splitting it using C:"print seq1
The output should look like this:
seq1: aCtgttttggcagccGGGGAtG
-seq1 using uppercase transform:
ACTGTTTTGGCAGCCGGGGATG
-test whether beginning matches ACT:
True
-test whether beginning matches act:
False
-sequence after splitting it using C:
['A', 'TGTTTTGG', 'AG', '', 'GGGGATG']
Notice that some of these methods return a result which may require you to assign a new variable. You can simply use the same name as the original variable, which effectively replaces the original variable information with the returned results (permanently), or assign to a new variable name. The last method in the script returns a list, which we will talk about in the next section. For a nice tutorial on some of these methods, check out the prior UC Berkeley Summer Python Bootcamp Wiki.
String Slicing
The "slice" syntax is a handy way to refer to sub-parts of sequences -- typically strings and lists. The slice s[start:end] consists of the elements beginning at start and extending up to (but not including) the end.
We'll start with a string example. All strings have a length that is equal to the number of characters they are composed of, and a slice is simply a portion of that total length. You can decide what that slice is composed of by providing index numbers, which can be thought of as "coordinates" for the substring.
Let's visualize this using a string, "Hello".
the string 'hello' with letter indexes 0 1 2 3 4
This string has a length of 5, but the index number will always start at ZERO, causing the final index number of the string to be 4 rather than 5. This is a fundamental concept for learning slicing with strings and lists.
Let's create a new script for slicing.
$ touch slice.py
$ aquamacs slice.py &
And play around with some slicing options:
s ='Hello'printlen(s)#5 -- using the len() function will return the total number of characters in the stringprint s[1:4]# is 'ell' -- chars starting at index 1 and extending up to but not including index 4 (the o)print s[1:]# is 'ello' -- omitting either index defaults to the start or end of the stringprint s[:]# is 'Hello' -- omitting both always gives us a copy of the whole thing
The standard zero-based index numbers give easy access to chars near the start of the string. As an alternative, Python uses negative numbers to give easy access to the chars at the opposite end of the string: s[-1] is the last char 'o' s[-2] is 'l' the next-to-last char and so on. Negative index numbers count back from the end of the string:
print s[-1]# is 'o' -- last char (1st from the end)print s[-4]# is 'e' -- 4th from the endprint s[:-3]# is 'He' -- going up to but not including the last 3 chars.print s[-3:]# is 'llo' -- starting with the 3rd char from the end and extending to the end of the string.
It is a neat truism of slices that for any index n, s[:n] + s[n:] == s. This works even for n negative or out of bounds. Or put another way s[:n] and s[n:] always partition the string into two string parts, conserving all the characters. As we'll see in the list section later, slices work with lists too.
String Concatenation
There is use for joining strings together. You can use string1 + string2 to accomplish this.
The + does what you might imagine. Its use is fairly straightforward, but it is oddly similar to a numerical operator we've seen: the plus sign. The difference between string concatenation and numerical addition is only whether the values on either side are strings or numbers. Thus, using one of each, like 'string' + 5 will confuse python, and therefore should be avoided. If you meant to concatenate 'string' and the string '5' to get 'string5', you should either coerce 5 into a string first or use the string.format() method. This is particularly relevant when it comes time to write files using variables within our scripts.
Let's examine the uses of concatenation vs. numerical operations with the plus sign using the code below:
seq1 ="AATTCGCTATTCGGGGAATCGATCGA"
seq2 ="TTACGATCGAGGATCCTATGAT"
seq3 ="CCTATTCGAGGTCTTGAGCTTTAGC"print"partial sequence lengths:",len(seq1),len(seq2),len(seq3)#notice what happens when the + is used in this context (an integer is produced)
pred_length =len(seq1) + len(seq2) + len(seq3)print"predicted complete length = ", pred_length
#notice what happens when the + is used in this context (a merged string is produced)
complete_seq = seq1+seq2 + seq3
print"complete sequence length = ",len(complete_seq)print complete_seq
That's all we will cover for strings in this introductory course, but there are more powerful string methods, including regular expression operations.
I've found that using the above tools, you can process text from files and write to output files with no problem.
Data Structures
Lists
Now you know how to do stuff with variables. But what if you want to do stuff with lots of variables at once? You group them into one large collection and refer to that as a variable. We'll focus on lists today. There are lots of other types of collections (dictionaries, sets, tuples, etc) that are better for specific situations, but you should default to using a list unless you have some specific reason not to.
Let's create a new script to examine lists and some list functions.
$ touch lists.py
$ aquamacs lists.py &
Here's how you create a list and access individual elements from it:
Notice that integers and floats and strings can be added to the same list, and that you should pay attention to how you are adding elements to a list. When printing a full list, the strings will always have single quotes around them, whereas the numerical variable will not. When printing an individual element from a list, notice though that these quotes disappear for strings.
Elements in a list are indexed, starting from 0 (just like with string slicing). The len() function gives the number of elements in the list. You can also use negative numbers relative to the end of the list, like with string slicing.
list of strings 'red' 'blue 'green'
To create an empty list, use empty brackets. The list can be populated with variables manually, or as we'll see later on, using a loop.
When working with text files, you will often be importing lines then splitting them using a delimiter to parse out useful information. As we saw above, splitting strings will create a list. We can take elements from those lists and create new lists that we are interested in. Here is a practical example:
header1 =">gi|307750512|gb|GU048508.1| Ampithoe valida isolate AV339 cytochrome c oxidase subunit I gene, partial cds; mitochondrial"#you can create a list from a string by splitting it#this is almost certainly what you'll do when importing lines from files#you can split by tabs, spaces, or other characters#here, I want the Genbank accession number, so I'll split using the | symbol
header1_list = header1.split("|")#how long is our list?printlen(header1_list)#what does the list look like?print header1_list
#find the element of interest in the list using the index numberprint header1_list[3]#we can assign this to a variable to keep better track of it
genbank_no = header1_list[3]print genbank_no
And the output should look like this:
5
['>gi', '307750512', 'gb', 'GU048508.1', ' Ampithoe valida isolate AV339 cytochrome c oxidase subunit I gene, partial cds; mitochondrial']
GU048508.1
GU048508.1
Sometimes it can be useful to simply print the list out and count which element you are interested in, it is easy to mess up the index numbers with complex string splits. Alternatively, you can print out each element one by one to locate the item of interest (list[0], list[1], list[2], etc).
Let's make a list of Genbank numbers from several of these line headers following the same methods:
#Here are a bunch of strings that you can imagine came from a fasta file downloaded from GenBank#We want to parse through these to pull out the genbank numbers, then put them in a new list
header1 =">gi|307750512|gb|GU048508.1| Ampithoe valida isolate AV339 cytochrome c oxidase subunit I gene, partial cds; mitochondrial"
header2 =">gi|307750510|gb|GU048507.1| Ampithoe valida isolate AV338 cytochrome c oxidase subunit I gene, partial cds; mitochondrial"
header3 =">gi|307750508|gb|GU048506.1| Ampithoe valida isolate AV337 cytochrome c oxidase subunit I gene, partial cds; mitochondrial"
header4 =">gi|307750506|gb|GU048505.1| Ampithoe valida isolate AV336 cytochrome c oxidase subunit I gene, partial cds; mitochondrial"#perform splitting operation for strings
header1_list = header1.split("|")
header2_list = header2.split("|")
header3_list = header3.split("|")
header4_list = header4.split("|")#create an empty list to populate
genbank_list =[]#we can use the index of the lists created to pull out the genbank number#add the genbank information to the list using the list.append() function
genbank_list.append(header1_list[3])
genbank_list.append(header2_list[3])
genbank_list.append(header3_list[3])
genbank_list.append(header4_list[3])#print the listprint genbank_list
We can do useful things with this list now, like sorting by name, deleting items, slicing, etc.
List Methods
Just like strings have methods, lists have methods too. Here are some other common list methods, but you'll probably be using the append() method most frequently.
list.append(elem) -- adds a single element to the end of the list. Common error: does not return the new list, just modifies the original.
list.insert(index, elem) -- inserts the element at the given index, shifting elements to the right.
list.extend(list2) adds the elements in list2 to the end of the list. Using + or += on a list is similar to using extend().
list.index(elem) -- searches for the given element from the start of the list and returns its index. Throws a ValueError if the element does not appear (use "in" to check without a ValueError).
list.remove(elem) -- searches for the first instance of the given element and removes it (throws ValueError if not present)
list.reverse() -- reverses the list in place (does not return it)
list.pop(index) -- removes and returns the element at the given index. Returns the rightmost element if index is omitted (roughly the opposite of append()).
You can modify lists after you create them, unlike strings that are immutable.
Some methods return objects whereas others don't. The three list methods append, insert and remove modify the list in place and do not return anything. However, the method pop() removes an element of the list from the specified index and returns the element, which you could assign to a new variable.
Sorting a list
The easiest way to sort is with the sorted(list) function, which takes a list and returns a new list with those elements in sorted order. The original list is not changed.
list1 =['GU048508.1','GU048507.1','GU048505.1','GU048506.1']#perform sorted function on list, assign to new variable
sorted_list1 =sorted(list1)#reverse sorting
sorted_list2 =sorted(list1, reverse=True)print"starting list:", list1
print"sorted list:", sorted_list1
print"reverse sorted list:", sorted_list2
#note, we can always print the new list without assigning it to a variableprint"without a variable assignment:",sorted(list1)
It's most common to pass a list into the sorted() function, but in fact it can take as input any sort of iterable collection. The sorted() function can be customized though optional arguments. The sorted() optional argument reverse=True, e.g. sorted(list, reverse=True), makes it sort backwards, which can be useful in certain cases.
Custom Sorting With key=
For more complex custom sorting, sorted() takes an optional "key=" specifying a "key" function that transforms each element before comparison. The key function takes in 1 value and returns 1 value, and the returned "proxy" value is used for the comparisons within the sort.
For example with a list of strings, specifying key=len (the built in len() function) sorts the strings by length, from shortest to longest. The sort calls len() for each string to get the list of proxy length values, and the sorts with those proxy values.
strs =['ccc','aaaa','d','bb']printsorted(strs, key=len)# produces ['d', 'bb', 'ccc', 'aaaa'], which is sorted by lengthprintsorted(strs, key=len, reverse=True)# produces ['aaaa', 'ccc', 'bb', 'd'], which is reverse sorted by length
Here is a way to visualize what is happening in this script:
calling sorted with key=len
As another example, specifying "str.lower" as the key function is a way to force the sorting to treat uppercase and lowercase the same:
strs =['zz','BB','aa','CC']#normal sortprintsorted(strs)# produces ['BB', 'CC', 'aa', 'zz'], which is sorted by uppercase, then alphabetical#use "key" argument specifying str.lower function to use for sortingprintsorted(strs, key=str.lower)# produces ['aa', 'BB', 'CC', 'zz']# this list is sorted alphabetically using the proxy values, which are all converted to lowercase
A list of lists
You can add a variety of thing to lists, including other lists. You can create and access these data structures in the following way:
#create lists of genes
geneset1 =['cytb','nd2','16s','12s']
geneset2 =['pomc','rag1','kiaa2013','tyr','pdc']
geneset3 =['aa1','aa2','aa3']#create empty list
new_list =[]#populate empty list with our existing lists
new_list.append(geneset1)
new_list.append(geneset2)
new_list.append(geneset3)#characterize this list of listsprint"length of new list:",len(new_list)print new_list
print"first element of new list:", new_list[0]print"second element of new list", new_list[1]print"third element of new list:", new_list[2]#retrieve specific elements from the list of listsprint"first element in the first element:", new_list[0][0]print"second element in the first element:", new_list[0][1]
The output looks like this:
length of new list: 3
[['cytb', 'nd2', '16s', '12s'], ['pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc'], ['aa1', 'aa2', 'aa3']]
first element of new list: ['cytb', 'nd2', '16s', '12s']
second element of new list ['pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc']
third element of new list: ['aa1', 'aa2', 'aa3']
first element in the first element: cytb
second element in the first element: nd2
Notice that the first index of the list ( new_list[0] ) will return that element, which itself is a list. To access an element within that list, you need more coordinates. Here, we add the second index, new_list[0][1]. This will retrieve first element in our 'new_list', which is the list 'geneset1', then retrieve the second element in
'geneset1', which is 'nd2'. You can use a list of lists for a variety of tasks, and it is a convenient way to store several related lists together.
Tuples
A tuple is essentially a list that you can not change. You can index, slice them and add them together to make new tuples but not use sort(), reverse(), delete or remove items from them. If you ever have a tuple that you want to change, you have to turn it into a list. Tuples have merits, but we won't say much more about them here.
Sets
Like a set in mathematics, a set has a bunch of elements with no repeats. To build a set, you can pass in a list and automatically remove duplicates, or build a set one by one (again automatically removing duplicates). Sets are like lists, but they are not sorted and have other special properties. The greatest utility of sets is the easy ability for making "Venn Diagrams". We can build one set to simply know how many unique elements it contains. We can compare two sets to find only the elements they have in common, only elements found in one or the other, and to "add" or "subtract" the unique values from respective sets. Sets can be converted into lists to use list methods if necessary.
Let's make a new script for investigating the properties of sets:
$ touch sets.py
$ aquamacs sets.py &
Then add the following code to the script:
#syntax for adding to a set
geneset1 =set(['pomc','pomc','pomc','pomc','rag1','kiaa2013','tyr','pdc','cytb','nd2','16s','12s'])
geneset2 =set(['pomc','rag1','kiaa2013','tyr','pdc','aa1','aa2','aa3'])#or create an empty set and populate it item by item (useful for looping)#similar to list.append() method, but we use set.add() instead
geneset3 =set()
geneset3.add('pomc')
geneset3.add('rag1')#notice the length of geneset1 is not 12, but rather 9, because duplicates are eliminatedprint"unique elements in geneset1:",len(geneset1)#should be 9print"unique elements in geneset2:",len(geneset2)#should be 8print"unique elements in geneset3:",len(geneset3)#should be 2
Now let's perform some comparisons between the sets. Replace your code with this:
geneset1 =set(['pomc','pomc','pomc','pomc','rag1','kiaa2013','tyr','pdc','cytb','nd2','16s','12s'])
geneset2 =set(['pomc','rag1','kiaa2013','tyr','pdc','aa1','aa2','aa3'])print"geneset1:", geneset1
print"elements in geneset1:",len(geneset1)print"geneset2:", geneset2
print"elements in geneset2:",len(geneset2),'\n'#combine sets to include all unique values
combined_set = geneset1 | geneset2
print"combined_set of all unique values:"printlen(combined_set)print combined_set,'\n'#combine sets but only include values shared between both sets
shared_set = geneset1 & geneset2
print"shared_set of only elements common to both sets:"printlen(shared_set)print shared_set,'\n'#combine sets by subtraction, removing all elements of the second set from the first
subtract_set = geneset1 - geneset2
print"subtract_set of geneset1 minus geneset2:"printlen(subtract_set)print subtract_set,'\n'#combine sets and remove only the shared elements between them, leaving only elements present in one set but not both
diff_set = geneset1 ^ geneset2
print"diff_set composed of elements not shared between both sets:"printlen(diff_set)print diff_set,'\n'
This produces the following output:
geneset1: set(['12s', 'kiaa2013', 'cytb', 'rag1', 'pomc', 'nd2', 'pdc', 'tyr', '16s'])
elements in geneset1: 9
geneset2: set(['kiaa2013', 'aa2', 'rag1', 'aa1', 'aa3', 'pomc', 'pdc', 'tyr'])
elements in geneset2: 8
combined_set of all unique values:
12
set(['pdc', 'cytb', 'aa2', '12s', 'aa1', 'aa3', 'pomc', 'rag1', 'nd2', 'kiaa2013', 'tyr', '16s'])
shared_set of only elements common to both sets:
5
set(['pdc', 'rag1', 'kiaa2013', 'tyr', 'pomc'])
subtract_set of geneset1 minus geneset2:
4
set(['cytb', '12s', 'nd2', '16s'])
diff_set composed of elements not shared between both sets:
7
set(['12s', 'cytb', 'aa1', 'aa3', 'aa2', 'nd2', '16s'])
That is pretty useful. You could imagine scenarios of comparing sets of molecular data where you may want to use one of the above methods.
A dictionary is another type of data structure, and very comparable to an actual dictionary in which we reference a word and find its definition. To retrieve information out of a python dictionary, you look up a word, called a key, and you find information associated with that word, called the value.
Strings, numbers, and tuples work as keys, and any type can be a value. Other types may or may not work correctly as keys (strings and tuples work cleanly since they are immutable). Just like lists and sets, dictionaries are special and have their own useful features.
Let's create a new script for examining dictionaries and their properties:
# An empty dictionary is started by setting a variable name = {}dict={}# key/value pairs can be added to the dict following this format:# dictionary_name[key_name] = valuedict['a']='alpha'dict['g']='gamma'dict['o']='omega'# note that if instead of 'a', we typed dict[a] = 'alpha', python would look for a variable called 'a'# a temporary placeholder can be used for a key value that can be filled in later# this is sometimes useful in loops (we will see this for parsing fasta files to create dictionaries)dict['p']=''dict['p']+='pi'#notice there is no order for the dictionary keys when it printsprintdict#look up a particular key, which will return the VALUE (not the key!)printdict['a']
Here is a way to visualize the dictionary data structure:
dict with keys 'a' 'o' 'g'
Dictionary methods
The methods dict.keys() and dict.values() return lists of the keys or values explicitly. There's also an items() function which returns a list of (key, value) tuples, which is the most efficient way to examine all the key value data in the dictionary. All of these lists can be passed to the sorted() function.
Try adding this code and running it:
# First, let's get the keys# we can simply print themprintdict.keys()print# or we can store them as a list in a new variable
key_list =dict.keys()print key_list
print# there are similar methods for the values in the dictionary# simply print the valuesprintdict.values()print#or store the values in a list
values_list =dict.values()print values_list
# Finally, there is a way to retrieve keys and items together expressed as (key, value) tuples#we can print thisprintdict.items()print#or store it as a tuple
key_value_tup =dict.items()print key_value_tup
From a performance point of view, the dictionary is one of your greatest tools, and you should use it to organize your data. A common usage for molecular data is to have a sequence name as a key and the DNA sequence as the corresponding value. You can iterate over an entire fasta file to create a sequence dictionary. Remember, keys and values can always be turned into lists or tuples for additional methods.
Tests and Loops: if, for, while
Now that you understand variables and several data structures, it is time to create statements that will execute some action across multiple variables automatically, or execute that action only under a certain condition. The three words if, for, and while can be used to construct such statements.
First word: if
The if statement is one of the most fundamental components of programming, and it is found in some form or another in every general-purpose programming language. It simply tells the computer to execute a block of code if a given statement is true. If the statement is false, the code is skipped. Although the concept is simple, conditional statements like if allow your programs to make decisions on their own based on their inputs. This allows you to have two or more alternative blocks of code, and only do one of them depending on the situation.
if <insert something that may or may not be true here>:
execute this line of code
execute more code if wanted too
The key pieces of syntax here are the if, the condition that we would like to evaluate (note: the condition is expressed within brackets <> for emphasis in this example, but brackets are not used in real code), the colon : that signifies the end of the logical test, and the tab-indentation for the things to do below the first line of the statement. Aquamacs will automatically tab over for you following this type of statement.
#create sequence
seq1 ="ACTGTATCGATGCGGCATTTTTCGGATC"#create a conditional statement#if the length of the sequence (an integer) is greater than 10, print the statementiflen(seq1)>10:
print"sequence is longer than 10 bp"#OR else if the length is less than 10, print this statementeliflen(seq1)<10:
print"sequence is shorter than 10 bp"#OR else if the length is exactly 10, print this statementeliflen(seq1)==10:
print"sequence is exactly 10 bp!"
Try changing the sequence length to meet the other conditions, and see the results.
Only the first part with the 'if' is required. Then you can have zero or more 'elif's (short for 'else if'), and finally a single 'else' if you want. Each of them is called a 'branch'. Python checks the 'guard' at the beginning of each branch, and if it's a true statement that branch runs. The 'else' branch runs if none of them were true.
Second word: for
The for loop allows you to perform the same actions multiple times, but with different data each time through (some languages have an equivalent construct and call it a for each loop).
seq1 ="ACTGT"
genelist1 =['nd2','16s','12s','cytb']
dict1 ={'a': 'alpha','p': 'pi','o': 'omega','g': 'gamma'}print"the for loop in a string iterates over each character of the string:"for x in seq1:
print x
print'\n',"the for loop in a list iterates over each item in the list:"for x in genelist1:
print x
print'\n',"the for loop in a dictionary will default to iterating over the keys:"for x in dict1:
print x
print'\n',"special for loop using a dictionary method for key, value pairs:"for key, value in dict1.items():
print key, value
and the output:
the for loop in a string iterates over each character of the string:
A
C
T
G
T
the for loop in a list iterates over each item in the list:
nd2
16s
12s
cytb
the for loop in a dictionary will default to iterating over the keys:
a
p
g
o
special for loop using a dictionary method for key, value pairs:
a alpha
p pi
g gamma
o omega
In these cases, we use the variable x to iterate across the string or list, and print out that value. We could change the name of x to be more meaningful in the context of the loop (bp, item, key, etc.), as long as the naming follows the rules for a variable. The loop ends when there are no more values left to iterate over. You could imagine that as you move through each item or character or key, you may want to perform some function.
You may also control what happens using the range() function:
# assign x to the integer zero
x =int(0)# for every number from 0 to 4 (5 numbers), execute the following codefor i inrange(5):
# print value of xprint x
# then add one increment to x
x +=1
and the output:
0
1
2
3
4
Here range() is another function that returns a list of integers from 0 up to but not including the number in parenthesis.
Third word: while
Another way to loop through a list is to use the while loop. This example is very similar to the previous one:
# assign x to be the integer zero
x =int(0)# as long as x is less than 5, execute the following codewhile x <5:
# add one to x
x +=1# print out the value of xprint"x is now {}".format(x)
and the output:
x is now 1
x is now 2
x is now 3
x is now 4
x is now 5
Don't forget to increment the x (x += 1), or you will enter an infinite loop!
Nesting and Complex Statements
The first thing to realize is that programming languages are like natural languages: as long as you follow the rules of grammar, anything can be nested inside anything else. Here are a couple examples:
Make a new script and call it GC_content.py Add the following:
seq1 ="TTGAGCTGGGATACTCGGAGCCTCACTTAGAATAATCATCCGGACGGAGTTAAGCGGCCCCGGTAATTTA"
seq2 ="ATTGGGGATGACCAAATTTACAACACCATCGTAACAGCTCATGCTTTTATCATAATTTTTTTTATAGTTA"
seq3 ="TACCTATTATAATCGGTGGCTTCGGAAATTGATTGGTTCCTCTAATATTAGGAAGACCGGATATAGCCTT"
seq4 ="CCCACGAATGAATAATATAAGATTTTGATTACTACCGCCCTCTTTAACTCTCCTATTAATAAGAGGAATG"
seq5 ="GTAGAAAGAGGGGTAGGTACTGGGTGAACAGTTTACCCCCCTTTAAGAGGGGCAATTGCCCATAGAGGGG"#create empty list to put these sequences in
seq_list =[]#add the sequences one by one
seq_list.append(seq1)
seq_list.append(seq2)
seq_list.append(seq3)
seq_list.append(seq4)
seq_list.append(seq5)#start a for loop, iterating over every sequence in the listfor seq in seq_list:
# each time a sequence is selected, start the GC_content and total_bases counters at zero
GC_count =float(0)
total_bases =float(0)# start the second for loop, this time iterating over every base in a given sequencefor base in seq:
# for every base in this sequence, add one to our total base pair counter
total_bases +=1# start a conditional statement, that if a 'C' is found, add one to the GC_content counterif base =="C":
GC_count +=1# another conditional statement, that if a 'G' is found, add one to the GC_content counterelif base =="G":
GC_count +=1#at the end of the sequence, before moving on to the next sequence in the list, calculate the percentage# of G's and C's in this sequence
GC_content =int((GC_count/total_bases)*100)# print this information out using the string.format() methodprint"The GC content of {0} is {1}%".format(seq, GC_content)#after this code, the loop will continue with the next sequence in the seq_list
and have a look at the output:
The GC content of TTGAGCTGGGATACTCGGAGCCTCACTTAGAATAATCATCCGGACGGAGTTAAGCGGCCCCGGTAATTTA is 50%
The GC content of ATTGGGGATGACCAAATTTACAACACCATCGTAACAGCTCATGCTTTTATCATAATTTTTTTTATAGTTA is 30%
The GC content of TACCTATTATAATCGGTGGCTTCGGAAATTGATTGGTTCCTCTAATATTAGGAAGACCGGATATAGCCTT is 38%
The GC content of CCCACGAATGAATAATATAAGATTTTGATTACTACCGCCCTCTTTAACTCTCCTATTAATAAGAGGAATG is 34%
The GC content of GTAGAAAGAGGGGTAGGTACTGGGTGAACAGTTTACCCCCCTTTAAGAGGGGCAATTGCCCATAGAGGGG is 52%
It's nice to use the same functions on multiple items in a row. By using combinations of tests and loops, you can accomplish a lot with relatively simple code. The main tricky part starting out is keeping track of what happens inside each loop, and when. It is also important to consider where to place a variable or list or dictionary that you are trying to build, either inside or outside of a loop. Sometimes your variable or item may reset each time the loop begins, if you've defined it inside the loop.
Take a look at the following example, which might help you think about nesting loops and how they operate:
outside_counter =int(0)for i inrange(10):
outside_counter +=1print outside_counter
# this prints 10# the counter is outside of the loop#############################for i inrange(10):
inside_counter =int(0)
inside_counter +=1print inside_counter
# this prints 1, rather than 10# why?#############################
first_loop =int(0)
second_loop =int(0)
third_loop =int(0)for i inrange(5):
first_loop +=1for j inrange(5):
second_loop +=1for h inrange(5):
third_loop +=1print first_loop
#this count is 5print second_loop
#this count is 25print third_loop
#this count is 125# how can you explain the differences in numbers? what order do the loops occur in?
Perhaps a better way to visualize this last example is to print a statement following each iteration of nested loops, so you can see what is being executed and when. It's a nice way to keep track. Try this code:
first_loop =int(0)
second_loop =int(0)
third_loop =int(0)for i inrange(5):
first_loop +=1print"I am in the primary for statement now!"for j inrange(5):
second_loop +=1print'\t',"I am in the secondary for statement here."for h inrange(5):
third_loop +=1print'\t','\t',"And here I am in the third for statement, doing loops!"
I put tabs in the statements to correspond with the tabs for the nested loops. Does this make more sense? Getting a grasp on nesting statements like these is important for really using the language to do your tasks.
Functions
Functions are used to group code into 'chunks' that do something meaningful, and hide the boring details. They also keep you from having to type things over and over.
All functions begin by: def name_of_function(arguments): block of code
Let's make a 'functions.py' file.
Here is an example of a function that is numerical in principle:
#assign a name to function and list arguments it will take, in this case twodef percent_calc(x,y):
percent =(float(x) / float(y))*100#return the percent variable from this functionreturn percent
#to execute the function, use the name of the function and give it#the appropriate arguments defined by you (in this case, two numbers)#print the output of the function, performed for 100 and 300 as argumentsprint percent_calc(100,300)#or store it as a variable for use later
perc = percent_calc(100,300)print perc
# in both cases 33.3333333333 is what prints to the screen
Functions can be thought of in two different ways, depending on what you want to do. First, you can treat them like functions in math, which take 0 or more arguments and return one result, like above.
The other way to look at them is as subroutines. They group a series of steps together so you can just write it once and then reference it in other code. Remember our GC_content calculator? We can create a function for that and execute that for a single sequence or a list of sequences, now that it is in a handy format.
def gc_calculator(seq):
total_bases =float(0)
GC_count =float(0)for base in seq:
# for every base in this sequence, add one to our total base pair counter
total_bases +=1# start a conditional statement, that if a 'C' is found, add one to the GC_content counterif base =="C":
GC_count +=1# another conditional statement, that if a 'G' is found, add one to the GC_content counterelif base =="G":
GC_count +=1
GC_content =int((GC_count/total_bases)*100)return GC_content
seq1 ="TTGAGCTGGGATACTCGGAGCCTCACTTAGAATAATCATCCGGACGGAGTTAAGCGGCCCCGGTAATTTA"
seq2 ="ATTGGGGATGACCAAATTTACAACACCATCGTAACAGCTCATGCTTTTATCATAATTTTTTTTATAGTTA"
seq3 ="TACCTATTATAATCGGTGGCTTCGGAAATTGATTGGTTCCTCTAATATTAGGAAGACCGGATATAGCCTT"
seq_list =[]
seq_list.append(seq1)
seq_list.append(seq2)
seq_list.append(seq3)#now we can use the function on one sequence, or iterate over the listprint"gc_content of seq1 is:", gc_calculator(seq1),"%"for seq in seq_list:
gc= gc_calculator(seq)print"the GC content of {0} is: {1}%".format(seq,gc)
and the output:
gc_content of seq1 is: 50 %
the GC content of TTGAGCTGGGATACTCGGAGCCTCACTTAGAATAATCATCCGGACGGAGTTAAGCGGCCCCGGTAATTTA is: 50%
the GC content of ATTGGGGATGACCAAATTTACAACACCATCGTAACAGCTCATGCTTTTATCATAATTTTTTTTATAGTTA is: 30%
the GC content of TACCTATTATAATCGGTGGCTTCGGAAATTGATTGGTTCCTCTAATATTAGGAAGACCGGATATAGCCTT is: 38%
Any function without a return statement just returns None (which the interpreter doesn't bother printing). You can store that somewhere if you want, but usually you just ignore it.
Working with Files
Bioinformatics often involves reading and writing files. So far, we've been delivering information to Python in the scripts we are writing, by assigning variables and creating data structures in the script itself. In most cases, this information will come from an outside file, and we need to learn how to import this information and use it in meaningful ways. That's what the focus of this section is, and we'll end with writing information to output files!
A filehandle is an object that controls the stream of information between your program and a file stored somewhere on the computer. Filehandles are not filenames, and they are not the files themselves. They are a tool that your program uses to interact with files, nothing more.
file handle objects have their own methods.
There is a close() method that closes an open file handle.
next() method goes to the next line in the file and returns it
readlines() method reads the entire file into memory and returns a list where each element is one line
read() method reads the entire file into memory and returns the contents of the file as as string.
write() method writes whatever you pass as an argument to it, to the file handle, you first need to open a file handle with write permissions, usually append permissions. This is useful to write outputs of your scripts to.
We create filehandles in the simplest sense with the open() command:
fh = open('some_file', [permissions])
The permissions include 'r', 'w', and 'a'. The 'r' tells Python to open the file for reading. The other options are 'w' for write and 'a' for append. The difference is writing deletes any existing data in the file, while append adds lines to the end of it. (Think of a log file, where you add a line each time something happens).
Create a new script called "file_open.py"
Let's try it on a phylip format alignment file, which you can download below:
Put this file in the same folder we've been using for this workshop. You can inspect it from terminal by typing:
$ less Alignment.phy
(hit q when done)
The first line has two numbers, the first is how many sequences there are and the second is the alignment length. Then all following lines are the sequence name followed by the sequence.
Let's open this file by entering the following code into "file_open.py":
#fh is a nice abbreviation for filehandle#the 'r' indicates read-only, so we can't make changes to this file
fh =open('Alignment.phy','r')
contents = fh.read()print contents
#always close a file after you are done using the following command:
fh.close()
That's a replica, but it is a single long string and not useful for parsing through. We want to use something more manageable.
Let's use a different method that involves looping through lines read from our file. The fh.readlines() method returns a list of strings, where each string represents a line in the file. Using the list and string methods we've already covered, we can iterate over the list of file lines and do things to these strings to pull out relevant information.
Let's do this here using the method described above, and our goal here is simply to pull out the names of the samples in this alignment.
Try the following code:
fh =open('Alignment.phy','r')#read the lines into a list
lines = fh.readlines()#we can now iterate over the lines to do useful things#let's skip the first line using index numbers (remember lines[0] will be the first line)for line in lines[1:]:
#eliminate the newline characters and odd whitespaces trailing lines#ALWAYS a good idea to use this function! You can assign to a new variable or keep it simple#and replace the line variable, since we only need it here
line = line.strip()#split the line based on a delimiter, in this case the space between the sequence name and sequence#at this stage you could also use line.split('\t') if you have a tab-delimited text file for example
line = line.split()#NOTE: we just turned our string into a list of strings... now we are dealing with a list!#assign the elements in the newly created list to variable names for quick reference
name = line[0]
sequence = line[1]print name
# prints the name part of this line, then will move on to the next line in our readlines() generated list#always close a file after you are done using the following command:
fh.close()
We are just beginning to scratch the surface here for possibilities. You could also calculate GC content for each sequence along the way, as long as we include the function we invented above and execute it on the sequence variable. You could put the sample names into a list and sort by alphabetical, or store the sequences in a list. You could check for duplicate sequences by creating a set and comparing the length of the set vs. the number of sequences read (start a counter!). You could search for a particular repeat in each sequence, or trim them to a specific length.
The last thing we will do is write to a file.
Create an output file by using a filehandle for it.
We'll modify the above script to output the sample names and sequences to a new text file, separated by a tab:
fh =open('Alignment.phy','r')
lines = fh.readlines()#create the output file here, and make sure we can append it ('a'), not write over it ('w')
output_name ="Output.txt"
fh_out =open(output_name,'a')for line in lines[1:]:
line = line.strip()
line = line.split()
name = line[0]
sequence = line[1]#this is where the writing happens with the fh.write() function#we are using a concatenated string here: name plus a tab character plus sequence name, plus newline#if you didn't include the newline, all names and seqs would all appear on a single line!
fh_out.write(name+'\t'+sequence+'\n')
fh.close()#add line to close output file
fh_out.close()
Nothing appeared on the screen, but we didn't ask anything to print. Check your directory to see the output file, "Output.txt".
If we used the 'w' option for our output file, only the last sample name and sequence would be there, because we would have been constantly writing over the file contents. You can do A LOT with output files, including making headers, then populating your file with information when you iterate over the lines of the input file. You could also summarize information and do mathematical operations, then output these too. Don't forget about the utility of the string.format() method here!
Tips
Congratulations, you survived your first major exposure to Python!
The best way to continue learning is through more focused tutorials and practice. I strongly recommend working through the UC Berkeley Python bootcamp wiki page, which has plenty of exercises and solutions, found here.
There are also many other resources for learning Python for free online or available as a download.
Some examples are: Learning Python Python Pocket Python Official Documentation
And 9 times out of 10, googling your problem will help you solve it via a programming forum.
Remember, a lot of bioinformatics is simply processing output files from one program in order to prepare them as inputs for a different program. You can do these formatting changes with the concepts explained above. Become familiar with the specific file types you work with, and play around with writing your own files using information from input files.
Introduction to Python
Sept 28, 2015Dan Portik
Table of Contents
Basics
Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze molecular sequence data sets and especially Next Generation Sequencing data sets. Our goal is to provide an impetus to people to start writing small programs and then be able to explore further on their own. The expected audience for this course are those who have little to no programming experience, and this material covered in this course will be at a beginner level. However, since programming requires some understanding of the command line environment we expect that the attendees have an understanding of the Unix or Unix-like environments (e.g. MacOS). To make it easier to write python scripts we will be using a particular text editor called Aquamacs.Why Python
Python programming language is new and is increasing in popularity. There are several features in the language that are helpful in tackling problems in the genomic data analysis field. Many of the newer programs that are available in this area are written in Python as well. Moreover it is easier than some other languages for a novice programmer to start on.Python is an evolving language and there are several versions available now. We will be using version 2.7.* in this class. Many of the programs available for genomic data analysis are written in versions 2.6 or 2.7, but the difference between these are not very significant. These versions are still the most popular ones. The material covered in this class will work well for the most part in both. The newest versions Python 3.x have several differences from 2.x and parts of the code covered in this class may not work with it, so please be aware of this.
Programming in general
INPUTS --> PROGRAMS ---> OUTPUTS
Programs are a set of instructions to the computer to do certain actions. Each program will take 0 or more inputs and then carry out the instructions given and produce 0 or more outputs. In general, there are multiple ways of achieving the same results by doing the different sets of actions. Therefore, there are many ways to writing computer programs to achieve the same outcome, with some more efficient and elegant than others. The key point is that with a few basic tools, you will be able to build your own pipelines and have the power to automate certain tasks, such as batch processing files.
As you work through this material, try to think of ways to apply these methods to your specific problems. As beginners, it's good to think in pseudo-code, that is, break down your large problems into smaller actual tasks, and then think about how these coding methods can be used to solve each of those tasks. It's overwhelming to take on a large problem all at once, but breaking the problem into discrete tasks and then applying the python language to those tasks will help.
Setting up your first script
You should have already downloaded and installed Python 2.7 from https://www.python.org/downloads/
Let's open our terminal window, then open the python interpreter, and write our first simple program.
When you type instructions, the interpreter will execute them and print the result. You can navigate your previous commands using the arrow keys (up, down), which makes it easier to try the same line over with a small change. The interpreter is a good command line tool to quickly run small code snippets or if you want to check something quick instead of re-running your whole program.
You can also write the same instructions in a file and run it. This is the basis for writing scripts and programs.
Once you start writing longer programs, you'll want to do use text files for editing your code, and also to share your code with others.
To do this, we can use aquamacs to edit the text in our script files, which you also should have installed.
Let's first create a directory on our desktop, and then create a blank script (text file) to edit. We'll do this using basic Unix commands in our terminal window.
The extension '.py' is used to indicate this script can be executed using python. You should get used to naming your scripts with this extension. You can check to see the file is now in this directory. Now we want to begin editing this file using aquamacs.
Here are some quick options for using aquamacs:
open a file from command line: aquamacs [filename] &
save a file: CTRL-X-S, or command-s
close a file: command + W
By adding the ampersand (&), you keep control of your shell prompt after aquamacs launches. If you don't use the ampersand, you'll need to open another terminal window in a moment to run our first program.
***If this doesn't work for you, simply open the script manually in aquamacs. The above is a nice command line shortcut to the program.
Now let's write the program to our script, 'first_script.py'. Copy and paste the following below into your aquamacs editor, then save it:
Now go back to your terminal window and execute the script using Python:
It should have produced the same output as before:
Hello, World!
Now, let's try to change a couple quick things in this script to show different ways to use the "print" function:
What was different about these, if anything?
Comments
Python ignores anything from a # to the end of the line; it's just there as a note for other humans.Variables: integers, floating-point numbers (decimal), strings (text)
A variable is a datum with a human-readable name which can change values as the logic of the program dictates, as opposed to other types of data which have a constant, unchanging value once declared.
Variables must follow these rules:
1) Variable names are case-sensitive, so Variable1 and variable1 are different variables.
2) Variable names must start with a letter (a-z) but can contain letters, numbers and underscores.
3) Variable names cannot contain spaces or non-alphanumeric characters
4) Variable names cannot be any of the following keywords that already have special meaning in python:
Types (Overview)
Let's explore the three main types of variables (strings, integers, floating-point numbers) in a new script.
From your terminal window, let's make a new script:
Variables are assigned by writing:
- the name you want to use
- a single equals sign
- the value you'd like to store
This is the same whether the variable you're assigning is of type str (a character string), int (whole number), float (non-integer real number), or more fancy variables. Add this code (copy and paste) to your new variables script:
The output should look like this:
You can always figure out the variable type by using the Python built-in function 'type', one of many useful basic functions.
Numerical Operations with integers and floats
There are a number of numerical operations that can be performed with integers and floats, but differences do exist!
Copy and paste the following over your variables script:
The output should look like this:
Now try it with some 'floats' instead of 'ints'. Change the variables at the beginning of your script to:
The output should now look like this:
This is important to consider when doing operations, because:
- for integers, 5 / 2 = 2
- for floats, 5.0 / 2.0 = 2.5
You can always change the type of variable through casting.
Note, performing an operation will do nothing visible unless it is printed or stored. For example:
Casting and Coercion
There are coercion functions for converting one type to another. Let's demonstrate this using a script. It's a little long, but will demonstrate the changes that can be made between variable types and how they can affect operations.Notice in particular how 4 divided by 15 is calculated as an integer and following a transformation to a float.
This is a useful tool especially when it comes to importing text from files, we need control over what variable types certain items should be. We can even turn numbers into strings by using the str() function, but numerical operations won't work the same for these. String operations are a whole different topic!
Strings
Strings are used A LOT so it's worth going into details about their peculiarities. Much of what you'll import from text files will enter your python world as strings, so we need to know how to manipulate them and change them.- Strings are immutable. Once you declare a string, you cannot change the elements of a string.
- They can use either single or double quotes, which helps when you need to include one in the content.
- If you use triple quotes, you can span multiple lines.
Create a new script for some string manipulations:
And add the following:
Notice the output:
The triple quotes will preserve line breaks as is in the code, and allow single and double quotation pairs.
Escape Characters
Now for a difficulty in working with strings... escape characters. Try running the following code by copying and pasting over your string script:This produces a very ugly output that does not match most of our strings:
Escaping characters with special meanings is a general problem in programming.
In files, the main characters to know are '\t' (tab) and '\n' (newline), but also include bell noises '\a' and other bothers.
Python offers two ways around this: the safest is to escape your escape, using a second backslash (see s5 above, '\\'), but this requires adding text to the string. A fancier way involves a special kind of string, the raw string. Raw strings start with r' and end with ' will treat every character between as exactly what it looks like, with the exception of the single quote (which ends the string). If you do use raw strings, watch out for two catches:
1) You must still escape single quotes you want to appear inside the string.
2) The last character of the string cannot be a backslash, since this will escape the closing quote.
These may not be big problems for you if you don't write many strings containing backslashes, but you may not have control over what you import from other text files.
String Placeholders
You can put placeholders in strings, and fill them in using variables from outside of the string. This is VERY useful for printing and also when it comes to writing to files further down the line. We'll test this using the string.format() method. Try the following code below:Your output should be:
The curly braces allow you to insert a variable into the string anywhere you'd like. Follow the string with: .format(variable_to_insert), making sure to include multiple variables if multiple braces are used.
Notice we can include any type of variable in the string, including other strings, integers, and floats. This is especially useful for printing to the screen and for writing to output files, which will largely be in strings.
String Methods
Here are some of the most common string methods. A method is like a function, but it runs "on" an object. If the variable s is a string, then the code s.lower() runs the lower() method on that string object and returns the result. We've used functions [like str() and float() and len()] and methods [string.format()] in our previous scripts.Here are some of the common string methods (the ones I use the most are in bold):
Let's show some examples using some fake sequence data.
The output should look like this:
Notice that some of these methods return a result which may require you to assign a new variable. You can simply use the same name as the original variable, which effectively replaces the original variable information with the returned results (permanently), or assign to a new variable name. The last method in the script returns a list, which we will talk about in the next section.
For a nice tutorial on some of these methods, check out the prior UC Berkeley Summer Python Bootcamp Wiki.
String Slicing
The "slice" syntax is a handy way to refer to sub-parts of sequences -- typically strings and lists. The slice s[start:end] consists of the elements beginning at start and extending up to (but not including) the end.We'll start with a string example. All strings have a length that is equal to the number of characters they are composed of, and a slice is simply a portion of that total length. You can decide what that slice is composed of by providing index numbers, which can be thought of as "coordinates" for the substring.
Let's visualize this using a string, "Hello".
This string has a length of 5, but the index number will always start at ZERO, causing the final index number of the string to be 4 rather than 5. This is a fundamental concept for learning slicing with strings and lists.
Let's create a new script for slicing.
And play around with some slicing options:
The standard zero-based index numbers give easy access to chars near the start of the string. As an alternative, Python uses negative numbers to give easy access to the chars at the opposite end of the string:
s[-1] is the last char 'o'
s[-2] is 'l' the next-to-last char
and so on. Negative index numbers count back from the end of the string:
It is a neat truism of slices that for any index n, s[:n] + s[n:] == s. This works even for n negative or out of bounds. Or put another way s[:n] and s[n:] always partition the string into two string parts, conserving all the characters. As we'll see in the list section later, slices work with lists too.
String Concatenation
There is use for joining strings together. You can use string1 + string2 to accomplish this.The + does what you might imagine. Its use is fairly straightforward, but it is oddly similar to a numerical operator we've seen: the plus sign. The difference between string concatenation and numerical addition is only whether the values on either side are strings or numbers. Thus, using one of each, like 'string' + 5 will confuse python, and therefore should be avoided. If you meant to concatenate 'string' and the string '5' to get 'string5', you should either coerce 5 into a string first or use the string.format() method. This is particularly relevant when it comes time to write files using variables within our scripts.
Let's examine the uses of concatenation vs. numerical operations with the plus sign using the code below:
The output should look like this:
That's all we will cover for strings in this introductory course, but there are more powerful string methods, including regular expression operations.
I've found that using the above tools, you can process text from files and write to output files with no problem.
Data Structures
Lists
Now you know how to do stuff with variables. But what if you want to do stuff with lots of variables at once? You group them into one large collection and refer to that as a variable. We'll focus on lists today. There are lots of other types of collections (dictionaries, sets, tuples, etc) that are better for specific situations, but you should default to using a list unless you have some specific reason not to.Let's create a new script to examine lists and some list functions.
Here's how you create a list and access individual elements from it:
Your output should look like this:
Notice that integers and floats and strings can be added to the same list, and that you should pay attention to how you are adding elements to a list. When printing a full list, the strings will always have single quotes around them, whereas the numerical variable will not. When printing an individual element from a list, notice though that these quotes disappear for strings.
Elements in a list are indexed, starting from 0 (just like with string slicing). The len() function gives the number of elements in the list.
You can also use negative numbers relative to the end of the list, like with string slicing.
To create an empty list, use empty brackets. The list can be populated with variables manually, or as we'll see later on, using a loop.
When working with text files, you will often be importing lines then splitting them using a delimiter to parse out useful information. As we saw above, splitting strings will create a list. We can take elements from those lists and create new lists that we are interested in. Here is a practical example:
And the output should look like this:
Sometimes it can be useful to simply print the list out and count which element you are interested in, it is easy to mess up the index numbers with complex string splits. Alternatively, you can print out each element one by one to locate the item of interest (list[0], list[1], list[2], etc).
Let's make a list of Genbank numbers from several of these line headers following the same methods:
This gives the following output:
We can do useful things with this list now, like sorting by name, deleting items, slicing, etc.
List Methods
Just like strings have methods, lists have methods too.Here are some other common list methods, but you'll probably be using the append() method most frequently.
You can modify lists after you create them, unlike strings that are immutable.
Some methods return objects whereas others don't. The three list methods append, insert and remove modify the list in place and do not return anything.
However, the method pop() removes an element of the list from the specified index and returns the element, which you could assign to a new variable.
Sorting a list
The easiest way to sort is with the sorted(list) function, which takes a list and returns a new list with those elements in sorted order. The original list is not changed.This produces:
It's most common to pass a list into the sorted() function, but in fact it can take as input any sort of iterable collection. The sorted() function can be customized though optional arguments. The sorted() optional argument reverse=True, e.g. sorted(list, reverse=True), makes it sort backwards, which can be useful in certain cases.
Custom Sorting With key=
For more complex custom sorting, sorted() takes an optional "key=" specifying a "key" function that transforms each element before comparison. The key function takes in 1 value and returns 1 value, and the returned "proxy" value is used for the comparisons within the sort.For example with a list of strings, specifying key=len (the built in len() function) sorts the strings by length, from shortest to longest. The sort calls len() for each string to get the list of proxy length values, and the sorts with those proxy values.
Here is a way to visualize what is happening in this script:
As another example, specifying "str.lower" as the key function is a way to force the sorting to treat uppercase and lowercase the same:
A list of lists
You can add a variety of thing to lists, including other lists. You can create and access these data structures in the following way:The output looks like this:
Notice that the first index of the list ( new_list[0] ) will return that element, which itself is a list. To access an element within that list, you need more coordinates. Here, we add the second index, new_list[0][1]. This will retrieve first element in our 'new_list', which is the list 'geneset1', then retrieve the second element in
'geneset1', which is 'nd2'. You can use a list of lists for a variety of tasks, and it is a convenient way to store several related lists together.
Tuples
A tuple is essentially a list that you can not change. You can index, slice them and add them together to make new tuples but not use sort(), reverse(), delete or remove items from them. If you ever have a tuple that you want to change, you have to turn it into a list. Tuples have merits, but we won't say much more about them here.
Sets
Like a set in mathematics, a set has a bunch of elements with no repeats. To build a set, you can pass in a list and automatically remove duplicates, or build a set one by one (again automatically removing duplicates). Sets are like lists, but they are not sorted and have other special properties. The greatest utility of sets is the easy ability for making "Venn Diagrams". We can build one set to simply know how many unique elements it contains. We can compare two sets to find only the elements they have in common, only elements found in one or the other, and to "add" or "subtract" the unique values from respective sets. Sets can be converted into lists to use list methods if necessary.
Let's make a new script for investigating the properties of sets:
Then add the following code to the script:
Now let's perform some comparisons between the sets. Replace your code with this:
This produces the following output:
That is pretty useful. You could imagine scenarios of comparing sets of molecular data where you may want to use one of the above methods.
For a complete list of set methods, look here.
Dictionaries
A dictionary is another type of data structure, and very comparable to an actual dictionary in which we reference a word and find its definition. To retrieve information out of a python dictionary, you look up a word, called a key, and you find information associated with that word, called the value.
Strings, numbers, and tuples work as keys, and any type can be a value. Other types may or may not work correctly as keys (strings and tuples work cleanly since they are immutable). Just like lists and sets, dictionaries are special and have their own useful features.
Let's create a new script for examining dictionaries and their properties:
And we will start a dictionary in the script:
And the output:
{'a': 'alpha', 'p': 'pi', 'o': 'omega', 'g': 'gamma'} alphaHere is a way to visualize the dictionary data structure:Dictionary methods
The methods dict.keys() and dict.values() return lists of the keys or values explicitly. There's also an items() function which returns a list of (key, value) tuples, which is the most efficient way to examine all the key value data in the dictionary. All of these lists can be passed to the sorted() function.
Try adding this code and running it:
and the output:
['a', 'p', 'o', 'g'] ['a', 'p', 'o', 'g'] ['alpha', 'pi', 'omega', 'gamma'] ['alpha', 'pi', 'omega', 'gamma'] [('a', 'alpha'), ('p', 'pi'), ('o', 'omega'), ('g', 'gamma')] [('a', 'alpha'), ('p', 'pi'), ('o', 'omega'), ('g', 'gamma')]From a performance point of view, the dictionary is one of your greatest tools, and you should use it to organize your data. A common usage for molecular data is to have a sequence name as a key and the DNA sequence as the corresponding value. You can iterate over an entire fasta file to create a sequence dictionary. Remember, keys and values can always be turned into lists or tuples for additional methods.Tests and Loops: if, for, while
Now that you understand variables and several data structures, it is time to create statements that will execute some action across multiple variables automatically, or execute that action only under a certain condition. The three words if, for, and while can be used to construct such statements.
First word: if
The if statement is one of the most fundamental components of programming, and it is found in some form or another in every general-purpose programming language. It simply tells the computer to execute a block of code if a given statement is true. If the statement is false, the code is skipped. Although the concept is simple, conditional statements like if allow your programs to make decisions on their own based on their inputs. This allows you to have two or more alternative blocks of code, and only do one of them depending on the situation.
if <insert something that may or may not be true here>:
The key pieces of syntax here are the if, the condition that we would like to evaluate (note: the condition is expressed within brackets <> for emphasis in this example, but brackets are not used in real code), the colon : that signifies the end of the logical test, and the tab-indentation for the things to do below the first line of the statement. Aquamacs will automatically tab over for you following this type of statement.
Let's create a script for this section:
And add the following code:
Try changing the sequence length to meet the other conditions, and see the results.
Only the first part with the 'if' is required. Then you can have zero or more 'elif's (short for 'else if'), and finally a single 'else' if you want. Each of them is called a 'branch'. Python checks the 'guard' at the beginning of each branch, and if it's a true statement that branch runs. The 'else' branch runs if none of them were true.
Second word: for
The for loop allows you to perform the same actions multiple times, but with different data each time through (some languages have an equivalent construct and call it a for each loop).
Let's make a script for this:
And try out some of these examples:
and the output:
In these cases, we use the variable x to iterate across the string or list, and print out that value. We could change the name of x to be more meaningful in the context of the loop (bp, item, key, etc.), as long as the naming follows the rules for a variable. The loop ends when there are no more values left to iterate over. You could imagine that as you move through each item or character or key, you may want to perform some function.
You may also control what happens using the range() function:
and the output:
Here range() is another function that returns a list of integers from 0 up to but not including the number in parenthesis.
Third word: while
Another way to loop through a list is to use the while loop. This example is very similar to the previous one:
and the output:
Don't forget to increment the x (x += 1), or you will enter an infinite loop!
Nesting and Complex Statements
The first thing to realize is that programming languages are like natural languages: as long as you follow the rules of grammar, anything can be nested inside anything else. Here are a couple examples:
Make a new script and call it GC_content.py
Add the following:
and have a look at the output:
It's nice to use the same functions on multiple items in a row. By using combinations of tests and loops, you can accomplish a lot with relatively simple code. The main tricky part starting out is keeping track of what happens inside each loop, and when. It is also important to consider where to place a variable or list or dictionary that you are trying to build, either inside or outside of a loop. Sometimes your variable or item may reset each time the loop begins, if you've defined it inside the loop.
Take a look at the following example, which might help you think about nesting loops and how they operate:
Perhaps a better way to visualize this last example is to print a statement following each iteration of nested loops, so you can see what is being executed and when. It's a nice way to keep track. Try this code:
I put tabs in the statements to correspond with the tabs for the nested loops. Does this make more sense? Getting a grasp on nesting statements like these is important for really using the language to do your tasks.
Functions
Functions are used to group code into 'chunks' that do something meaningful, and hide the boring details.They also keep you from having to type things over and over.
All functions begin by:
def name_of_function(arguments):
block of code
Let's make a 'functions.py' file.
Here is an example of a function that is numerical in principle:
Functions can be thought of in two different ways, depending on what you want to do.
First, you can treat them like functions in math, which take 0 or more arguments and return one result, like above.
The other way to look at them is as subroutines. They group a series of steps together so you can just write it once and then reference it in other code. Remember our GC_content calculator? We can create a function for that and execute that for a single sequence or a list of sequences, now that it is in a handy format.
and the output:
Any function without a return statement just returns None (which the interpreter doesn't bother printing). You can store that somewhere if you want, but usually you just ignore it.
Working with Files
Bioinformatics often involves reading and writing files. So far, we've been delivering information to Python in the scripts we are writing, by assigning variables and creating data structures in the script itself. In most cases, this information will come from an outside file, and we need to learn how to import this information and use it in meaningful ways. That's what the focus of this section is, and we'll end with writing information to output files!A filehandle is an object that controls the stream of information between your program and a file stored somewhere on the computer. Filehandles are not filenames, and they are not the files themselves. They are a tool that your program uses to interact with files, nothing more.
file handle objects have their own methods.
We create filehandles in the simplest sense with the open() command:
fh = open('some_file', [permissions])
The permissions include 'r', 'w', and 'a'.
The 'r' tells Python to open the file for reading. The other options are 'w' for write and 'a' for append. The difference is writing deletes any existing data in the file, while append adds lines to the end of it. (Think of a log file, where you add a line each time something happens).
Create a new script called "file_open.py"
Let's try it on a phylip format alignment file, which you can download below:
Put this file in the same folder we've been using for this workshop. You can inspect it from terminal by typing:
The first line has two numbers, the first is how many sequences there are and the second is the alignment length. Then all following lines are the sequence name followed by the sequence.
Let's open this file by entering the following code into "file_open.py":
That's a replica, but it is a single long string and not useful for parsing through. We want to use something more manageable.
Let's use a different method that involves looping through lines read from our file. The fh.readlines() method returns a list of strings, where each string represents a line in the file. Using the list and string methods we've already covered, we can iterate over the list of file lines and do things to these strings to pull out relevant information.
Let's do this here using the method described above, and our goal here is simply to pull out the names of the samples in this alignment.
Try the following code:
the output is:
We are just beginning to scratch the surface here for possibilities. You could also calculate GC content for each sequence along the way, as long as we include the function we invented above and execute it on the sequence variable. You could put the sample names into a list and sort by alphabetical, or store the sequences in a list. You could check for duplicate sequences by creating a set and comparing the length of the set vs. the number of sequences read (start a counter!). You could search for a particular repeat in each sequence, or trim them to a specific length.
The last thing we will do is write to a file.
Create an output file by using a filehandle for it.
We'll modify the above script to output the sample names and sequences to a new text file, separated by a tab:
Nothing appeared on the screen, but we didn't ask anything to print. Check your directory to see the output file, "Output.txt".
If we used the 'w' option for our output file, only the last sample name and sequence would be there, because we would have been constantly writing over the file contents. You can do A LOT with output files, including making headers, then populating your file with information when you iterate over the lines of the input file. You could also summarize information and do mathematical operations, then output these too. Don't forget about the utility of the string.format() method here!
Tips
Congratulations, you survived your first major exposure to Python!The best way to continue learning is through more focused tutorials and practice. I strongly recommend working through the UC Berkeley Python bootcamp wiki page, which has plenty of exercises and solutions, found here.
There are also many other resources for learning Python for free online or available as a download.
Some examples are:
Learning Python
Python Pocket
Python Official Documentation
And 9 times out of 10, googling your problem will help you solve it via a programming forum.
Remember, a lot of bioinformatics is simply processing output files from one program in order to prepare them as inputs for a different program. You can do these formatting changes with the concepts explained above. Become familiar with the specific file types you work with, and play around with writing your own files using information from input files.