Introduction+to+Python+Fall+2015

= Introduction to Python = Sept 28, 2015 Dan Portik

toc

Basics
Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze molecular sequence data sets and especially Next Generation Sequencing data sets. Our goal is to provide an impetus to people to start writing small programs and then be able to explore further on their own. The expected audience for this course are those who have little to no programming experience, and this material covered in this course will be at a beginner level. However, since programming requires some understanding of the command line environment we expect that the attendees have an understanding of the Unix or Unix-like environments (e.g. MacOS). To make it easier to write python scripts we will be using a particular text editor called Aquamacs.



Why Python
Python programming language is new and is increasing in popularity. There are several features in the language that are helpful in tackling problems in the genomic data analysis field. Many of the newer programs that are available in this area are written in Python as well. Moreover it is easier than some other languages for a novice programmer to start on.

Python is an evolving language and there are several versions available now. We will be using version 2.7.* in this class. Many of the programs available for genomic data analysis are written in versions 2.6 or 2.7, but the difference between these are not very significant. These versions are still the most popular ones. The material covered in this class will work well for the most part in both. The newest versions Python 3.x have several differences from 2.x and parts of the code covered in this class may not work with it, so please be aware of this.

Programming in general
INPUTS --> PROGRAMS ---> OUTPUTS

Programs are a set of instructions to the computer to do certain actions. Each program will take 0 or more inputs and then carry out the instructions given and produce 0 or more outputs. In general, there are multiple ways of achieving the same results by doing the different sets of actions. Therefore, there are many ways to writing computer programs to achieve the same outcome, with some more efficient and elegant than others. The key point is that with a few basic tools, you will be able to build your own pipelines and have the power to automate certain tasks, such as batch processing files.

As you work through this material, try to think of ways to apply these methods to your specific problems. As beginners, it's good to think in pseudo-code, that is, break down your large problems into smaller actual tasks, and then think about how these coding methods can be used to solve each of those tasks. It's overwhelming to take on a large problem all at once, but breaking the problem into discrete tasks and then applying the python language to those tasks will help.

Setting up your first script
You should have already downloaded and installed Python 2.7 from https://www.python.org/downloads/

Let's open our terminal window, then open the python interpreter, and write our first simple program. code $ python >>> print "Hello, World!" 'Hello, World!' code When you type instructions, the interpreter will execute them and print the result. You can navigate your previous commands using the arrow keys (up, down), which makes it easier to try the same line over with a small change. The interpreter is a good command line tool to quickly run small code snippets or if you want to check something quick instead of re-running your whole program.

You can also write the same instructions in a file and run it. This is the basis for writing scripts and programs. Once you start writing longer programs, you'll want to do use text files for editing your code, and also to share your code with others. To do this, we can use aquamacs to edit the text in our script files, which you also should have installed.

<span style="font-family: Arial,Helvetica,sans-serif;">Let's first create a directory on our desktop, and then create a blank script (text file) to edit. We'll do this using basic Unix commands in our terminal window. code $ mkdir python_folder

$ cd python_folder

$ touch first_script.py code <span style="font-family: Arial,Helvetica,sans-serif;">The extension '.py' is used to indicate this script can be executed using python. You should get used to naming your scripts with this extension. You can check to see the file is now in this directory. Now we want to begin editing this file using aquamacs.

<span style="font-family: Arial,Helvetica,sans-serif;">Here are some quick options for using aquamacs: <span style="font-family: Arial,Helvetica,sans-serif;">open a file from command line: aquamacs [filename] & <span style="font-family: Arial,Helvetica,sans-serif;">save a file: CTRL-X-S, or command-s <span style="font-family: Arial,Helvetica,sans-serif;">close a file: command + W code $ aquamacs first_script.py &

code <span style="font-family: Arial,Helvetica,sans-serif;">By adding the ampersand (&), you keep control of your shell prompt after aquamacs launches. If you don't use the ampersand, you'll need to open another terminal window in a moment to run our first program.

<span style="font-family: Arial,Helvetica,sans-serif;">***If this doesn't work for you, simply open the script manually in aquamacs. The above is a nice command line shortcut to the program.

<span style="font-family: Arial,Helvetica,sans-serif;">Now let's write the program to our script, 'first_script.py'. Copy and paste the following below into your aquamacs editor, then save it: code format="python" print "Hello world!"

code <span style="font-family: Arial,Helvetica,sans-serif;">Now go back to your terminal window and execute the script using Python: code $ python first_script.py code <span style="font-family: Arial,Helvetica,sans-serif;">It should have produced the same output as before: <span style="font-family: Arial,Helvetica,sans-serif;">Hello, World!

<span style="font-family: Arial,Helvetica,sans-serif;">Now, let's try to change a couple quick things in this script to show different ways to use the "print" function: code format="python" print "hello", "world"
 * 1) try this one first

print "hello" print "world"
 * 1) then try this one after

print "hello" print print "world"
 * 1) and finally this one

code <span style="font-family: Arial,Helvetica,sans-serif;">What was different about these, if anything?

<span style="font-family: Arial,Helvetica,sans-serif;">Comments
<span style="font-family: Arial,Helvetica,sans-serif;">Python ignores anything from a # to the end of the line; it's just there as a note for other humans. code format="python" ''' This is a block of commented code. You can use either triple double quotes or triple single quotes I use this at the beginning of a script to explain what it does and how to use it. ''' code
 * 1) this is a comment
 * 2) you can makes notes about what the code is doing throughout your script

<span style="font-family: Arial,Helvetica,sans-serif;">Variables: integers, floating-point numbers (decimal), strings (text)
<span style="font-family: Arial,Helvetica,sans-serif;">A variable is a datum with a human-readable name which can change values as the logic of the program dictates, as opposed to other types of data which have a constant, unchanging value once declared.

<span style="font-family: Arial,Helvetica,sans-serif;">Variables must follow these rules:

<span style="font-family: Arial,Helvetica,sans-serif;">1) Variable names are case-sensitive, so Variable1 and variable1 are different variables. <span style="font-family: Arial,Helvetica,sans-serif;">2) Variable names must start with a letter (a-z) but can contain letters, numbers and underscores. <span style="font-family: Arial,Helvetica,sans-serif;">3) Variable names cannot contain spaces or non-alphanumeric characters <span style="font-family: Arial,Helvetica,sans-serif;">4) Variable names cannot be any of the following keywords that already have special meaning in python: code format="python" and   assert   break    class      continue   def      del      elif else  except   exec     finally    for        from     global   if    import in       is       lambda     not        or       pass     print raise return   try      while      yield code

**<span style="font-family: Arial,Helvetica,sans-serif;">Types (Overview) **
<span style="font-family: Arial,Helvetica,sans-serif;">Let's explore the three main types of variables (strings, integers, floating-point numbers) in a new script. <span style="font-family: Arial,Helvetica,sans-serif;">From your terminal window, let's make a new script: code $ touch variables.py $ aquamacs variables.py & code <span style="font-family: Arial,Helvetica,sans-serif;">Variables are assigned by writing: <span style="font-family: Arial,Helvetica,sans-serif;">- the name you want to use <span style="font-family: Arial,Helvetica,sans-serif;">- a single equals sign <span style="font-family: Arial,Helvetica,sans-serif;">- the value you'd like to store

<span style="font-family: Arial,Helvetica,sans-serif;">This is the same whether the variable you're assigning is of type str (a character string), int (whole number), float (non-integer real number), or more fancy variables. Add this code (copy and paste) to your new variables script: code format="python" variable_1 = "Bananas" variable_2 = 45 variable_3 = 45.5678
 * 1) create your variables

print print variable_1 print "the type of variable this is = ", type(variable_1), '\n'
 * 1) print out variables, then information about the variables
 * 1) type is a function
 * 2) '\n' is the character for newline

print variable_2 print "the type of variable this is = ", type(variable_2), '\n'

print variable_3 print "the type of variable this is = ", type(variable_3), '\n'

code <span style="font-family: Arial,Helvetica,sans-serif;">The output should look like this: code Bananas the type of variable this is = <type 'str'>

45 the type of variable this is = <type 'int'>

45.5678 the type of variable this is = <type 'float'>

code <span style="font-family: Arial,Helvetica,sans-serif;">You can always figure out the variable type by using the Python built-in function 'type', one of many useful basic functions.

**<span style="font-family: Arial,Helvetica,sans-serif;">Numerical Operations with integers and floats **
<span style="font-family: Arial,Helvetica,sans-serif;">There are a number of numerical operations that can be performed with integers and floats, but differences do exist! <span style="font-family: Arial,Helvetica,sans-serif;">Copy and paste the following over your variables script: code format="python" a = 42 b = 3

addition = a + b subtraction = a - b multiplication = a * b division = a / b exponent = a ** b
 * 1) addition uses the plus sign (+)
 * 1) subtraction uses the minus sign (-)
 * 1) multiplication uses the asterisk (*)
 * 1) division uses the slash (/)
 * 1) and exponents use a double-asterisk (**)

print 'sum', addition print 'diff', subtraction print 'prod', multiplication print 'quo', division print 'pow', exponent

x = 5 print "x should be 5, like we made it! see x = ", x x += 1 print "now x is one more than before! see x = ", x x += 1 print "now x is two more than before! see x = ", x
 * 1) edit a variable (type integer) by adding one, useful for creating a counter!

code <span style="font-family: Arial,Helvetica,sans-serif;">The output should look like this: code sum 45 diff 39 prod 126 quo 14 pow 74088 x should be 5, like we made it! see x = 5 now x is one more than before! see x = 6 now x is two more than before! see x = 7 code <span style="font-family: Arial,Helvetica,sans-serif;">Now try it with some 'floats' instead of 'ints'. Change the variables at the beginning of your script to: code format="python" a = 58.672 b = 3.745

code <span style="font-family: Arial,Helvetica,sans-serif;">The output should now look like this: code sum 62.417 diff 54.927 prod 219.72664 quo 15.6667556742 pow 4195403.08144

code <span style="font-family: Arial,Helvetica,sans-serif;">This is important to consider when doing operations, because: <span style="font-family: Arial,Helvetica,sans-serif;">- for integers, 5 / 2 = 2 <span style="font-family: Arial,Helvetica,sans-serif;">- for floats, 5.0 / 2.0 = 2.5

<span style="font-family: Arial,Helvetica,sans-serif;">You can always change the type of variable through casting.

<span style="font-family: Arial,Helvetica,sans-serif;">Note, performing an operation will do nothing visible unless it is printed or stored. For example: code format="python" a = 15 b = 10
 * 1) add variables

a/b
 * 1) this won't 'create' anything visible

print a/b
 * 1) but this will

c = a/b print c
 * 1) and this too, where we store the value as a variable

code

<span style="font-family: Arial,Helvetica,sans-serif;">Casting and Coercion
<span style="font-family: Arial,Helvetica,sans-serif;">There are coercion functions for converting one type to another. Let's demonstrate this using a script. It's a little long, but will demonstrate the changes that can be made between variable types and how they can affect operations. code format="python" int_1 = 52 int_2 = 4 int_3 = 15

print int_1, 'is a', type(int_1) print int_2, 'is a', type(int_2) print int_3, 'is a', type(int_3)
 * 1) print information about variables

print int_1, "divided by", int_2, "=", int_1/int_2 print int_2, "divided by", int_3, "=", int_2/int_3
 * 1) perform some division

print '\n', "Now let's change those integers to floats!", '\n'

int_1 = float(int_1) int_2 = float(int_2) int_3 = float(int_3)
 * 1) change variable types from integers to floats using the float function

print int_1, 'is a', type(int_1) print int_2, 'is a', type(int_2) print int_3, 'is a', type(int_3)
 * 1) print information about newly coerced variables to confirm changes

print int_1, "divided by", int_2, "=", int_1/int_2 print int_2, "divided by", int_3, "=", int_2/int_3
 * 1) perform some division with the floats

print '\n', "Now let's change those floats to strings!", '\n'

int_1 = str(int_1) int_2 = str(int_2) int_3 = str(int_3) print int_1, 'is a', type(int_1) print int_2, 'is a', type(int_2) print int_3, 'is a', type(int_3)
 * 1) change to a string using the str function
 * 1) confirm the changes have occurred

code

<span style="font-family: Arial,Helvetica,sans-serif;">Notice in particular how 4 divided by 15 is calculated as an integer and following a transformation to a float.

<span style="font-family: Arial,Helvetica,sans-serif;">This is a useful tool especially when it comes to importing text from files, we need control over what variable types certain items should be. We can even turn numbers into strings by using the str function, but numerical operations won't work the same for these. String operations are a whole different topic!

<span style="font-family: Arial,Helvetica,sans-serif;">Strings
<span style="font-family: Arial,Helvetica,sans-serif;">Strings are used A LOT so it's worth going into details about their peculiarities. Much of what you'll import from text files will enter your python world as strings, so we need to know how to manipulate them and change them. <span style="font-family: Arial,Helvetica,sans-serif;">- Strings are immutable. Once you declare a string, you cannot change the elements of a string. <span style="font-family: Arial,Helvetica,sans-serif;">- They can use either single or double quotes, which helps when you need to include one in the content. <span style="font-family: Arial,Helvetica,sans-serif;">- If you use triple quotes, you can span multiple lines.

<span style="font-family: Arial,Helvetica,sans-serif;">Create a new script for some string manipulations: code $ touch strings.py $ aquamacs strings.py & code <span style="font-family: Arial,Helvetica,sans-serif;">And add the following: code format="python" s = 'I think "coffee" is pretty fantastic.' print s, '\n' s2 = "I much prefer tea." print s2, '\n' s3 = '''I 'prefer' drinking both because I "could" really, really use caffeine.''' print s3, '\n' code <span style="font-family: Arial,Helvetica,sans-serif;">Notice the output: code I think "coffee" is pretty fantastic.

I much prefer tea.

I 'prefer' drinking both because I "could" really, really use caffeine.

code <span style="font-family: Arial,Helvetica,sans-serif;">The triple quotes will preserve line breaks as is in the code, and allow single and double quotation pairs.

**<span style="font-family: Arial,Helvetica,sans-serif;">Escape Characters **
<span style="font-family: Arial,Helvetica,sans-serif;">Now for a difficulty in working with strings... escape characters. Try running the following code by copying and pasting over your string script: code format="python" s = 'some\thing is missing' print s, '\n'
 * 1) bring on the tab

s2 = "somethi\ng is broken" print s2, '\n'
 * 1) bring on the newline

s3 = something th\at will drive you b\an\an\as print s3, '\n'
 * 1) bring on the bells

s4 = r'\a solu\tio\n' print s4, '\n'
 * 1) solution 1

s5 = '\\another solu\\tio\\n' print s5, '\n' code <span style="font-family: Arial,Helvetica,sans-serif;">This produces a very ugly output that does not match most of our strings: code some   hing is missing
 * 1) solution 2

somethi g is broken

something tht will drive you bnns

\a solu\tio\n

\another solu\tio\n

code <span style="font-family: Arial,Helvetica,sans-serif;">Escaping characters with special meanings is a general problem in programming. <span style="font-family: Arial,Helvetica,sans-serif;">In files, the main characters to know are '\t' (tab) and '\n' (newline), but also include bell noises '\a' and other bothers.

<span style="font-family: Arial,Helvetica,sans-serif;">Python offers two ways around this: the safest is to escape your escape, using a second backslash (see s5 above, '\\'), but this requires adding text to the string. A fancier way involves a special kind of string, the raw string. Raw strings start with r' and end with ' will treat every character between as exactly what it looks like, with the exception of the single quote (which ends the string). If you do use raw strings, watch out for two catches: <span style="font-family: Arial,Helvetica,sans-serif;">1) You must still escape single quotes you want to appear inside the string. <span style="font-family: Arial,Helvetica,sans-serif;">2) The last character of the string cannot be a backslash, since this will escape the closing quote.

<span style="font-family: Arial,Helvetica,sans-serif;">These may not be big problems for you if you don't write many strings containing backslashes, but you may not have control over what you import from other text files.

**<span style="font-family: Arial,Helvetica,sans-serif;">String Placeholders **
<span style="font-family: Arial,Helvetica,sans-serif;">You can put placeholders in strings, and fill them in using variables from outside of the string. This is VERY useful for printing and also when it comes to writing to files further down the line. We'll test this using the string.format method. Try the following code below: code format="python" a = 'teas' b = 'coffees' c = 5 d = 4.5

print "I would like several {}".format(b)
 * 1) use the string.format method to include variables in the string

print "Actually, I would like roughly {} {} and maybe {} of your {}".format(d, b, c, a)
 * 1) use the same method with multiple variables

print "Okay, just give me {0} {1} and {0} {2}, then {0} more {2}".format(c, a, b)
 * 1) you can even reuse the same variables if we assign them numbers, starting with zero

code <span style="font-family: Arial,Helvetica,sans-serif;">Your output should be: code I would like several coffees Actually, I would like roughly 4.5 coffees and maybe 5 of your teas Okay, just give me 5 teas and 5 coffees, then 5 more coffees

code <span style="font-family: Arial,Helvetica,sans-serif;">The curly braces allow you to insert a variable into the string anywhere you'd like. Follow the string with: .format(variable_to_insert), making sure to include multiple variables if multiple braces are used.

<span style="font-family: Arial,Helvetica,sans-serif;">Notice we can include any type of variable in the string, including other strings, integers, and floats. This is especially useful for printing to the screen and for writing to output files, which will largely be in strings.

**<span style="font-family: Arial,Helvetica,sans-serif;">String Methods **
<span style="font-family: Arial,Helvetica,sans-serif;">Here are some of the most common string methods. A method is like a function, but it runs "on" an object. If the variable s is a string, then the code s.lower runs the lower method on that string object and returns the result. We've used functions [like str and float and len] and methods [string.format] in our previous scripts.

<span style="font-family: Arial,Helvetica,sans-serif;">Here are some of the common string methods (the ones I use the most are in bold):
 * <span style="font-family: Arial,Helvetica,sans-serif;">**s.lower, s.upper** -- returns the lowercase or uppercase version of the string
 * <span style="font-family: Arial,Helvetica,sans-serif;">Useful for converting molecular sequence data to the same format (all uppercase)


 * <span style="font-family: Arial,Helvetica,sans-serif;">**s.strip** -- returns a string with whitespace removed from the start and end
 * <span style="font-family: Arial,Helvetica,sans-serif;">An essential tool for removing oddities from lines when importing them from outside text files


 * <span style="font-family: Arial,Helvetica,sans-serif;">s.isalpha/s.isdigit/s.isspace... -- tests if all the string chars are in the various character classes


 * <span style="font-family: Arial,Helvetica,sans-serif;">**s.startswith('other'), s.endswith('other')** -- tests if the string starts or ends with the given other string
 * <span style="font-family: Arial,Helvetica,sans-serif;">Imagine searching for a particular file extension with s.endswith('fasta') in a directory...


 * <span style="font-family: Arial,Helvetica,sans-serif;">s.find('other') -- searches for the given other string (not a regular expression) within s, and returns the first index where it begins or -1 if not found


 * <span style="font-family: Arial,Helvetica,sans-serif;">s.replace('old', 'new') -- returns a string where all occurrences of 'old' have been replaced by 'new'


 * <span style="font-family: Arial,Helvetica,sans-serif;">**s.split('delim')** -- returns a list of substrings separated by the given delimiter. The delimiter is not a regular expression, it's just text. 'aaa,bbb,ccc'.split(',') -> ['aaa', 'bbb', 'ccc']. As a convenient special case s.split (with no arguments) splits on all whitespace chars.
 * This can be used with characters we already know, for example '\t' for splitting a tab-delimited text file.


 * <span style="font-family: Arial,Helvetica,sans-serif;">s.join(list) -- opposite of split, joins the elements in the given list together using the string as the delimiter. e.g. '---'.join(['aaa', 'bbb', 'ccc']) -> aaa---bbb---ccc

<span style="font-family: Arial,Helvetica,sans-serif;">Let's show some examples using some fake sequence data. code format="python" seq1 = "aCtgttttggcagccGGGGAtG" print "seq1:", seq1

seq1 = seq1.upper print "-seq1 using uppercase transform:" print seq1
 * 1) change to all upper case

print "-test whether beginning matches ACT:" print seq1.startswith("ACT")
 * 1) test whether beginning of sequence matches a pattern

print "-test whether beginning matches act:" print seq1.startswith("act")

seq1 = seq1.split("C") print "-sequence after splitting it using C:" print seq1
 * 1) what happens when we split?

code <span style="font-family: Arial,Helvetica,sans-serif;">The output should look like this: code seq1: aCtgttttggcagccGGGGAtG -seq1 using uppercase transform: ACTGTTTTGGCAGCCGGGGATG -test whether beginning matches ACT: True -test whether beginning matches act: False -sequence after splitting it using C: ['A', 'TGTTTTGG', 'AG', '', 'GGGGATG'] code <span style="font-family: Arial,Helvetica,sans-serif;">Notice that some of these methods return a result which may require you to assign a new variable. You can simply use the same name as the original variable, which effectively replaces the original variable information with the returned results (permanently), or assign to a new variable name. The last method in the script returns a list, which we will talk about in the next section. <span style="font-family: Arial,Helvetica,sans-serif;">For a nice tutorial on some of these methods, check out the prior UC Berkeley Summer Python Bootcamp Wiki.

**<span style="font-family: Arial,Helvetica,sans-serif;">String Slicing **
<span style="font-family: Arial,Helvetica,sans-serif;">The "slice" syntax is a handy way to refer to sub-parts of sequences -- typically strings and lists. The slice s[start:end] consists of the elements beginning at start and extending up to (but not including) the end.

<span style="font-family: Arial,Helvetica,sans-serif;">We'll start with a string example. All strings have a length that is equal to the number of characters they are composed of, and a slice is simply a portion of that total length. You can decide what that slice is composed of by providing index numbers, which can be thought of as "coordinates" for the substring.

<span style="font-family: Arial,Helvetica,sans-serif;">Let's visualize this using a string, "Hello".



<span style="font-family: Arial,Helvetica,sans-serif;">This string has a length of 5, but the index number will always start at ZERO, causing the final index number of the string to be 4 rather than 5. This is a fundamental concept for learning slicing with strings and lists.

<span style="font-family: Arial,Helvetica,sans-serif;">Let's create a new script for slicing. code $ touch slice.py $ aquamacs slice.py & code <span style="font-family: Arial,Helvetica,sans-serif;">And play around with some slicing options: code format="python" s = 'Hello' print len(s) print s[1:4] print s[1:] print s[:]
 * 1) 5 -- using the len function will return the total number of characters in the string
 * 1) is 'ell' -- chars starting at index 1 and extending up to but not including index 4 (the o)
 * 1) is 'ello' -- omitting either index defaults to the start or end of the string
 * 1) is 'Hello' -- omitting both always gives us a copy of the whole thing

code <span style="font-family: Arial,Helvetica,sans-serif;">The standard zero-based index numbers give easy access to chars near the start of the string. As an alternative, Python uses negative numbers to give easy access to the chars at the opposite end of the string: <span style="font-family: Arial,Helvetica,sans-serif;">s[-1] is the last char 'o' <span style="font-family: Arial,Helvetica,sans-serif;">s[-2] is 'l' the next-to-last char <span style="font-family: Arial,Helvetica,sans-serif;">and so on. Negative index numbers count back from the end of the string: code format="python" print s[-1] print s[-4] print s[:-3] print s[-3:] code <span style="font-family: Arial,Helvetica,sans-serif;">It is a neat truism of slices that for any index n, s[:n] + s[n:] == s. This works even for n negative or out of bounds. Or put another way s[:n] and s[n:] always partition the string into two string parts, conserving all the characters. As we'll see in the list section later, slices work with lists too.
 * 1) is 'o' -- last char (1st from the end)
 * 1) is 'e' -- 4th from the end
 * 1) is 'He' -- going up to but not including the last 3 chars.
 * 1) is 'llo' -- starting with the 3rd char from the end and extending to the end of the string.

**<span style="font-family: Arial,Helvetica,sans-serif;">String Concatenation **
<span style="font-family: Arial,Helvetica,sans-serif;">There is use for joining strings together. You can use string1 + string2 to accomplish this.

<span style="font-family: Arial,Helvetica,sans-serif;">The + does what you might imagine. Its use is fairly straightforward, but it is oddly similar to a numerical operator we've seen: the plus sign. The difference between string concatenation and numerical addition is only whether the values on either side are strings or numbers. Thus, using one of each, like 'string' + 5 will confuse python, and therefore should be avoided. If you meant to concatenate 'string' and the string '5' to get 'string5', you should either coerce 5 into a string first or use the string.format method. This is particularly relevant when it comes time to write files using variables within our scripts.

<span style="font-family: Arial,Helvetica,sans-serif;">Let's examine the uses of concatenation vs. numerical operations with the plus sign using the code below: code format="python" seq1 = "AATTCGCTATTCGGGGAATCGATCGA" seq2 = "TTACGATCGAGGATCCTATGAT" seq3 = "CCTATTCGAGGTCTTGAGCTTTAGC"

print "partial sequence lengths:", len(seq1), len(seq2), len(seq3)

pred_length = len(seq1) + len(seq2) + len(seq3) print "predicted complete length = ", pred_length
 * 1) notice what happens when the + is used in this context (an integer is produced)

complete_seq = seq1+seq2 + seq3
 * 1) notice what happens when the + is used in this context (a merged string is produced)

print "complete sequence length = ", len(complete_seq) print complete_seq

code <span style="font-family: Arial,Helvetica,sans-serif;">The output should look like this: code partial sequence lengths: 26 22 25 predicted complete length = 73 complete sequence length = 73 AATTCGCTATTCGGGGAATCGATCGATTACGATCGAGGATCCTATGATCCTATTCGAGGTCTTGAGCTTTAGC code That's all we will cover for strings in this introductory course, but there are more powerful string methods, including regular expression operations. I've found that using the above tools, you can process text from files and write to output files with no problem.

<span style="font-family: Arial,Helvetica,sans-serif;">Lists
<span style="font-family: Arial,Helvetica,sans-serif;">Now you know how to do stuff with variables. But what if you want to do stuff with lots of variables at once? You group them into one large collection and refer to that as a variable. We'll focus on lists today. There are lots of other types of collections (dictionaries, sets, tuples, etc) that are better for specific situations, but you should default to using a list unless you have some specific reason not to.

<span style="font-family: Arial,Helvetica,sans-serif;"> Let's create a new script to examine lists and some list functions. code $ touch lists.py $ aquamacs lists.py & code <span style="font-family: Arial,Helvetica,sans-serif;">Here's how you create a list and access individual elements from it: code format="python" mylist = [1, "1", "xyz", "seven", "24.5", 24.5]

print mylist print mylist[0], type(mylist[0]) print mylist[1], type(mylist[1]) print len(mylist)

code <span style="font-family: Arial,Helvetica,sans-serif;">Your output should look like this: code [1, '1', 'xyz', 'seven', '24.5', 24.5] 1 <type 'int'> 1 <type 'str'> 6 code <span style="font-family: Arial,Helvetica,sans-serif;">Notice that integers and floats and strings can be added to the same list, and that you should pay attention to how you are adding elements to a list. When printing a full list, the strings will always have single quotes around them, whereas the numerical variable will not. When printing an individual element from a list, notice though that these quotes disappear for strings.

<span style="font-family: Arial,Helvetica,sans-serif;">Elements in a list are indexed, starting from 0 (just like with string slicing). The len function gives the number of elements in the list. <span style="font-family: Arial,Helvetica,sans-serif;">You can also use negative numbers relative to the end of the list, like with string slicing.



To create an empty list, use empty brackets. The list can be populated with variables manually, or as we'll see later on, using a loop.

When working with text files, you will often be importing lines then splitting them using a delimiter to parse out useful information. As we saw above, splitting strings will create a list. We can take elements from those lists and create new lists that we are interested in. Here is a practical example: code format="python" header1 = ">gi|307750512|gb|GU048508.1| Ampithoe valida isolate AV339 cytochrome c oxidase subunit I gene, partial cds; mitochondrial"


 * 1) you can create a list from a string by splitting it
 * 2) this is almost certainly what you'll do when importing lines from files
 * 3) you can split by tabs, spaces, or other characters
 * 4) here, I want the Genbank accession number, so I'll split using the | symbol

header1_list = header1.split("|")

print len(header1_list)
 * 1) how long is our list?

print header1_list
 * 1) what does the list look like?

print header1_list[3]
 * 1) find the element of interest in the list using the index number

genbank_no = header1_list[3] print genbank_no
 * 1) we can assign this to a variable to keep better track of it

code And the output should look like this: code 5 ['>gi', '307750512', 'gb', 'GU048508.1', ' Ampithoe valida isolate AV339 cytochrome c oxidase subunit I gene, partial cds; mitochondrial'] GU048508.1 GU048508.1 code Sometimes it can be useful to simply print the list out and count which element you are interested in, it is easy to mess up the index numbers with complex string splits. Alternatively, you can print out each element one by one to locate the item of interest (list[0], list[1], list[2], etc).

Let's make a list of Genbank numbers from several of these line headers following the same methods: code format="python" header1 = ">gi|307750512|gb|GU048508.1| Ampithoe valida isolate AV339 cytochrome c oxidase subunit I gene, partial cds; mitochondrial" header2 = ">gi|307750510|gb|GU048507.1| Ampithoe valida isolate AV338 cytochrome c oxidase subunit I gene, partial cds; mitochondrial" header3 = ">gi|307750508|gb|GU048506.1| Ampithoe valida isolate AV337 cytochrome c oxidase subunit I gene, partial cds; mitochondrial" header4 = ">gi|307750506|gb|GU048505.1| Ampithoe valida isolate AV336 cytochrome c oxidase subunit I gene, partial cds; mitochondrial"
 * 1) Here are a bunch of strings that you can imagine came from a fasta file downloaded from GenBank
 * 2) We want to parse through these to pull out the genbank numbers, then put them in a new list

header1_list = header1.split("|") header2_list = header2.split("|") header3_list = header3.split("|") header4_list = header4.split("|")
 * 1) perform splitting operation for strings

genbank_list = [] genbank_list.append(header1_list[3]) genbank_list.append(header2_list[3]) genbank_list.append(header3_list[3]) genbank_list.append(header4_list[3])
 * 1) create an empty list to populate
 * 1) we can use the index of the lists created to pull out the genbank number
 * 2) add the genbank information to the list using the list.append function

print genbank_list
 * 1) print the list

code This gives the following output: code ['GU048508.1', 'GU048507.1', 'GU048506.1', 'GU048505.1'] code We can do useful things with this list now, like sorting by name, deleting items, slicing, etc.

**<span style="font-family: Arial,Helvetica,sans-serif;">List Methods **
<span style="font-family: Arial,Helvetica,sans-serif;">Just like strings have methods, lists have methods too. <span style="font-family: Arial,Helvetica,sans-serif;">Here are some other common list methods, but you'll probably be using the append method most frequently.
 * <span style="font-family: Arial,Helvetica,sans-serif;">**list.append(elem)** -- adds a single element to the end of the list. Common error: does not return the new list, just modifies the original.
 * <span style="font-family: Arial,Helvetica,sans-serif;">list.insert(index, elem) -- inserts the element at the given index, shifting elements to the right.
 * <span style="font-family: Arial,Helvetica,sans-serif;">list.extend(list2) adds the elements in list2 to the end of the list. Using + or += on a list is similar to using extend.
 * <span style="font-family: Arial,Helvetica,sans-serif;">list.index(elem) -- searches for the given element from the start of the list and returns its index. Throws a ValueError if the element does not appear (use "in" to check without a ValueError).
 * <span style="font-family: Arial,Helvetica,sans-serif;">list.remove(elem) -- searches for the first instance of the given element and removes it (throws ValueError if not present)
 * <span style="font-family: Arial,Helvetica,sans-serif;">list.reverse -- reverses the list in place (does not return it)
 * <span style="font-family: Arial,Helvetica,sans-serif;">list.pop(index) -- removes and returns the element at the given index. Returns the rightmost element if index is omitted (roughly the opposite of append).

<span style="font-family: Arial,Helvetica,sans-serif;">You can modify lists after you create them, unlike strings that are immutable.

<span style="font-family: Arial,Helvetica,sans-serif;">Some methods return objects whereas others don't. The three list methods append, insert and remove modify the list in place and do not return anything. <span style="font-family: Arial,Helvetica,sans-serif;">However, the method pop removes an element of the list from the specified index and returns the element, which you could assign to a new variable.

**<span style="font-family: Arial,Helvetica,sans-serif;">Sorting a list **
<span style="font-family: Arial,Helvetica,sans-serif;">The easiest way to sort is with the sorted(list) function, which takes a list and returns a new list with those elements in sorted order. The original list is not changed. code format="python" list1 = ['GU048508.1', 'GU048507.1', 'GU048505.1', 'GU048506.1']

sorted_list1 = sorted(list1) sorted_list2 = sorted(list1, reverse=True)
 * 1) perform sorted function on list, assign to new variable
 * 1) reverse sorting

print "starting list:", list1 print "sorted list:", sorted_list1 print "reverse sorted list:", sorted_list2

print "without a variable assignment:", sorted(list1)
 * 1) note, we can always print the new list without assigning it to a variable

code This produces: code starting list: ['GU048508.1', 'GU048507.1', 'GU048505.1', 'GU048506.1'] sorted list: ['GU048505.1', 'GU048506.1', 'GU048507.1', 'GU048508.1'] reverse sorted list: ['GU048508.1', 'GU048507.1', 'GU048506.1', 'GU048505.1'] without a variable assignment: ['GU048505.1', 'GU048506.1', 'GU048507.1', 'GU048508.1']

code <span style="font-family: Arial,Helvetica,sans-serif;">It's most common to pass a list into the sorted function, but in fact it can take as input any sort of iterable collection. The sorted function can be customized though optional arguments. The sorted optional argument reverse=True, e.g. sorted(list, reverse=True), makes it sort backwards, which can be useful in certain cases.

**<span style="font-family: Arial,Helvetica,sans-serif;">Custom Sorting With key= **
<span style="font-family: Arial,Helvetica,sans-serif;">For more complex custom sorting, sorted takes an optional "key=" specifying a "key" function that transforms each element before comparison. The key function takes in 1 value and returns 1 value, and the returned "proxy" value is used for the comparisons within the sort.

<span style="font-family: Arial,Helvetica,sans-serif;">For example with a list of strings, specifying key=len (the built in len function) sorts the strings by length, from shortest to longest. The sort calls len for each string to get the list of proxy length values, and the sorts with those proxy values. code format="python" strs = ['ccc', 'aaaa', 'd', 'bb']

print sorted(strs, key=len)
 * 1) produces ['d', 'bb', 'ccc', 'aaaa'], which is sorted by length

print sorted(strs, key=len, reverse=True) code Here is a way to visualize what is happening in this script:
 * 1) produces ['aaaa', 'ccc', 'bb', 'd'], which is reverse sorted by length



<span style="font-family: Arial,Helvetica,sans-serif;">As another example, specifying "str.lower" as the key function is a way to force the sorting to treat uppercase and lowercase the same: code format="python" strs = ['zz','BB','aa','CC']

print sorted(strs)
 * 1) normal sort
 * 1) produces ['BB', 'CC', 'aa', 'zz'], which is sorted by uppercase, then alphabetical

print sorted(strs, key=str.lower) code
 * 1) use "key" argument specifying str.lower function to use for sorting
 * 1) produces ['aa', 'BB', 'CC', 'zz']
 * 2) this list is sorted alphabetically using the proxy values, which are all converted to lowercase

**A list of lists**
You can add a variety of thing to lists, including other lists. You can create and access these data structures in the following way: code format="python" geneset1 = ['cytb', 'nd2', '16s', '12s'] geneset2 = ['pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc'] geneset3 = ['aa1', 'aa2', 'aa3']
 * 1) create lists of genes

new_list = [] new_list.append(geneset1) new_list.append(geneset2) new_list.append(geneset3)
 * 1) create empty list
 * 1) populate empty list with our existing lists

print "length of new list:", len(new_list) print new_list print "first element of new list:", new_list[0] print "second element of new list", new_list[1] print "third element of new list:", new_list[2]
 * 1) characterize this list of lists

print "first element in the first element:", new_list[0][0] print "second element in the first element:", new_list[0][1]
 * 1) retrieve specific elements from the list of lists

code The output looks like this: code length of new list: 3 'cytb', 'nd2', '16s', '12s'], ['pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc'], ['aa1', 'aa2', 'aa3' first element of new list: ['cytb', 'nd2', '16s', '12s'] second element of new list ['pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc'] third element of new list: ['aa1', 'aa2', 'aa3'] first element in the first element: cytb second element in the first element: nd2

code Notice that the first index of the list ( new_list[0] ) will return that element, which itself is a list. To access an element within that list, you need more coordinates. Here, we add the second index, new_list[0][1]. This will retrieve first element in our 'new_list', which is the list 'geneset1', then retrieve the second element in 'geneset1', which is 'nd2'. You can use a list of lists for a variety of tasks, and it is a convenient way to store several related lists together.

<span style="font-family: Arial,Helvetica,sans-serif;">Tuples
A tuple is essentially a list that you can not change. You can index, slice them and add them together to make new tuples but not use sort, reverse, delete or remove items from them. If you ever have a tuple that you want to change, you have to turn it into a list. Tuples have merits, but we won't say much more about them here.

Sets
Like a set in mathematics, a set has a bunch of elements with no repeats. To build a set, you can pass in a list and automatically remove duplicates, or build a set one by one (again automatically removing duplicates). Sets are like lists, but they are not sorted and have other special properties. The greatest utility of sets is the easy ability for making "Venn Diagrams". We can build one set to simply know how many unique elements it contains. We can compare two sets to find only the elements they have in common, only elements found in one or the other, and to "add" or "subtract" the unique values from respective sets. Sets can be converted into lists to use list methods if necessary.

Let's make a new script for investigating the properties of sets: code $ touch sets.py $ aquamacs sets.py &

code Then add the following code to the script: code format="python" geneset1 = set(['pomc', 'pomc', 'pomc', 'pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc','cytb', 'nd2', '16s', '12s']) geneset2 = set(['pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc', 'aa1', 'aa2', 'aa3'])
 * 1) syntax for adding to a set

geneset3 = set geneset3.add('pomc') geneset3.add('rag1')
 * 1) or create an empty set and populate it item by item (useful for looping)
 * 2) similar to list.append method, but we use set.add instead

print "unique elements in geneset1:", len(geneset1) print "unique elements in geneset2:", len(geneset2) print "unique elements in geneset3:", len(geneset3)
 * 1) notice the length of geneset1 is not 12, but rather 9, because duplicates are eliminated
 * 1) should be 9
 * 1) should be 8
 * 1) should be 2

code Now let's perform some comparisons between the sets. Replace your code with this: code format="python" geneset1 = set(['pomc', 'pomc', 'pomc', 'pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc','cytb', 'nd2', '16s', '12s']) geneset2 = set(['pomc', 'rag1', 'kiaa2013', 'tyr', 'pdc', 'aa1', 'aa2', 'aa3']) print "geneset1:", geneset1 print "elements in geneset1:", len(geneset1) print "geneset2:", geneset2 print "elements in geneset2:", len(geneset2), '\n'

combined_set = geneset1 | geneset2 print "combined_set of all unique values:" print len(combined_set) print combined_set, '\n'
 * 1) combine sets to include all unique values

shared_set = geneset1 & geneset2 print "shared_set of only elements common to both sets:" print len(shared_set) print shared_set, '\n'
 * 1) combine sets but only include values shared between both sets

subtract_set = geneset1 - geneset2 print "subtract_set of geneset1 minus geneset2:" print len(subtract_set) print subtract_set, '\n'
 * 1) combine sets by subtraction, removing all elements of the second set from the first

diff_set = geneset1 ^ geneset2 print "diff_set composed of elements not shared between both sets:" print len(diff_set) print diff_set, '\n' code This produces the following output: code geneset1: set(['12s', 'kiaa2013', 'cytb', 'rag1', 'pomc', 'nd2', 'pdc', 'tyr', '16s']) elements in geneset1: 9 geneset2: set(['kiaa2013', 'aa2', 'rag1', 'aa1', 'aa3', 'pomc', 'pdc', 'tyr']) elements in geneset2: 8
 * 1) combine sets and remove only the shared elements between them, leaving only elements present in one set but not both

combined_set of all unique values: 12 set(['pdc', 'cytb', 'aa2', '12s', 'aa1', 'aa3', 'pomc', 'rag1', 'nd2', 'kiaa2013', 'tyr', '16s'])

shared_set of only elements common to both sets: 5 set(['pdc', 'rag1', 'kiaa2013', 'tyr', 'pomc'])

subtract_set of geneset1 minus geneset2: 4 set(['cytb', '12s', 'nd2', '16s'])

diff_set composed of elements not shared between both sets: 7 set(['12s', 'cytb', 'aa1', 'aa3', 'aa2', 'nd2', '16s']) code That is pretty useful. You could imagine scenarios of comparing sets of molecular data where you may want to use one of the above methods.

For a complete list of set methods, look here.

<span style="font-family: Arial,Helvetica,sans-serif;">Dictionaries
A dictionary is another type of data structure, and very comparable to an actual dictionary in which we reference a word and find its definition. To retrieve information out of a python dictionary, you look up a word, called a key, and you find information associated with that word, called the value.

Strings, numbers, and tuples work as keys, and any type can be a value. Other types may or may not work correctly as keys (strings and tuples work cleanly since they are immutable). Just like lists and sets, dictionaries are special and have their own useful features.

Let's create a new script for examining dictionaries and their properties: code $ touch dictionaries.py $ aquamacs dictionaries.py &

code And we will start a dictionary in the script: code format="python" dict = {}
 * 1) An empty dictionary is started by setting a variable name = {}

dict['a'] = 'alpha' dict['g'] = 'gamma' dict['o'] = 'omega'
 * 1) key/value pairs can be added to the dict following this format:
 * 2) dictionary_name[key_name] = value
 * 1) note that if instead of 'a', we typed dict[a] = 'alpha', python would look for a variable called 'a'

dict['p'] = '' dict['p']+= 'pi'
 * 1) a temporary placeholder can be used for a key value that can be filled in later
 * 2) this is sometimes useful in loops (we will see this for parsing fasta files to create dictionaries)

print dict
 * 1) notice there is no order for the dictionary keys when it prints

print dict['a']
 * 1) look up a particular key, which will return the VALUE (not the key!)

code And the output: code {'a': 'alpha', 'p': 'pi', 'o': 'omega', 'g': 'gamma'} alpha

code Here is a way to visualize the dictionary data structure:



**<span style="font-family: Arial,Helvetica,sans-serif;">Dictionary methods **
<span style="font-family: Arial,Helvetica,sans-serif;">The methods dict.keys and dict.values return lists of the keys or values explicitly. There's also an items function which returns a list of (key, value) tuples, which is the most efficient way to examine all the key value data in the dictionary. All of these lists can be passed to the sorted function.

<span style="font-family: Arial,Helvetica,sans-serif;">Try adding this code and running it: code format="python" print dict.keys print
 * 1) First, let's get the keys
 * 2) we can simply print them

key_list = dict.keys print key_list print
 * 1) or we can store them as a list in a new variable

print dict.values print
 * 1) there are similar methods for the values in the dictionary
 * 2) simply print the values

values_list = dict.values print values_list
 * 1) or store the values in a list

print dict.items print
 * 1) Finally, there is a way to retrieve keys and items together expressed as (key, value) tuples
 * 2) we can print this

key_value_tup = dict.items print key_value_tup code <span style="font-family: Arial,Helvetica,sans-serif;">and the output: code ['a', 'p', 'o', 'g']
 * 1) or store it as a tuple

['a', 'p', 'o', 'g']

['alpha', 'pi', 'omega', 'gamma']

['alpha', 'pi', 'omega', 'gamma']

[('a', 'alpha'), ('p', 'pi'), ('o', 'omega'), ('g', 'gamma')]

[('a', 'alpha'), ('p', 'pi'), ('o', 'omega'), ('g', 'gamma')] code <span style="font-family: Arial,Helvetica,sans-serif;">From a performance point of view, the dictionary is one of your greatest tools, and you should use it to organize your data. A common usage for molecular data is to have a sequence name as a key and the DNA sequence as the corresponding value. You can iterate over an entire fasta file to create a sequence dictionary. Remember, keys and values can always be turned into lists or tuples for additional methods.

<span style="font-family: Arial,Helvetica,sans-serif;">Tests and Loops: if, for, while
<span style="font-family: Arial,Helvetica,sans-serif;">Now that you understand variables and several data structures, it is time to create statements that will execute some action across multiple variables automatically, or execute that action only under a certain condition. The three words **if**, **for**, and **while** can be used to construct such statements.

<span style="font-family: Arial,Helvetica,sans-serif;">First word: if
The **if** statement is one of the most fundamental components of programming, and it is found in some form or another in every general-purpose programming language. It simply tells the computer to execute a block of code if a given statement is true. If the statement is false, the code is skipped. Although the concept is simple, conditional statements like **if** allow your programs to make decisions on their own based on their inputs. This allows <span style="font-family: Arial,Helvetica,sans-serif;">you to have two or more alternative blocks of code, and only do one of them depending on the situation.

> **execute this line of code** > **execute more code if wanted too**
 * if** <insert something that may or may not be true here>:

The key pieces of syntax here are the **if**, the condition that we would like to evaluate (note: the condition is expressed within brackets **<>** for emphasis in this example, but brackets are not used in real code), the colon **:** that signifies the end of the logical test, and the tab-indentation for the things to do below the first line of the statement. Aquamacs will automatically tab over for you following this type of statement.

Let's create a script for this section: code $ touch if_statements.py $ aquamacs if_statements.py code And add the following code: code format="python" seq1 = "ACTGTATCGATGCGGCATTTTTCGGATC"
 * 1) create sequence

if len(seq1) > 10: print "sequence is longer than 10 bp"
 * 1) create a conditional statement
 * 2) if the length of the sequence (an integer) is greater than 10, print the statement

elif len(seq1) < 10: print "sequence is shorter than 10 bp"
 * 1) OR else if the length is less than 10, print this statement

elif len(seq1) == 10: print "sequence is exactly 10 bp!" code Try changing the sequence length to meet the other conditions, and see the results.
 * 1) OR else if the length is exactly 10, print this statement

<span style="font-family: Arial,Helvetica,sans-serif;">Only the first part with the 'if' is required. Then you can have zero or more 'elif's (short for 'else if'), and finally a single 'else' if you want. Each of them is called a 'branch'. Python checks the 'guard' at the beginning of each branch, and if it's a true statement that branch runs. The 'else' branch runs if none of them were true.

Second word: for
The for loop allows you to perform the same actions multiple times, but with different data each time through (some languages have an equivalent construct and call it a for each loop).

Let's make a script for this: code $ touch for_statements.py $ aquamacs for_statements.py code And try out some of these examples: code format="python" seq1 = "ACTGT"

genelist1 = ['nd2', '16s', '12s', 'cytb']

dict1 = {'a': 'alpha', 'p': 'pi', 'o': 'omega', 'g': 'gamma'}

print "the for loop in a string iterates over each character of the string:" for x in seq1: print x

print '\n', "the for loop in a list iterates over each item in the list:" for x in genelist1: print x

print '\n', "the for loop in a dictionary will default to iterating over the keys:" for x in dict1: print x

print '\n', "special for loop using a dictionary method for key, value pairs:" for key, value in dict1.items: print key, value

code and the output: code the for loop in a string iterates over each character of the string: A C T G T

the for loop in a list iterates over each item in the list: nd2 16s 12s cytb

the for loop in a dictionary will default to iterating over the keys: a p g o

special for loop using a dictionary method for key, value pairs: a alpha p pi g gamma o omega

code In these cases, we use the variable x to iterate across the string or list, and print out that value. We could change the name of x to be more meaningful in the context of the loop (bp, item, key, etc.), as long as the naming follows the rules for a variable. The loop ends when there are no more values left to iterate over. You could imagine that as you move through each item or character or key, you may want to perform some function.

You may also control what happens using the range function: code format="python" x = int(0)
 * 1) assign x to the integer zero

for i in range(5): # print value of x   print x    # then add one increment to x    x += 1 code <span style="font-family: Arial,Helvetica,sans-serif;">and the output: code 0 1 2 3 4
 * 1) for every number from 0 to 4 (5 numbers), execute the following code

code <span style="font-family: Arial,Helvetica,sans-serif;">Here range is another function that returns a list of integers from 0 up to but not including the number in parenthesis.

Third word: while
Another way to loop through a list is to use the while loop. This example is very similar to the previous one: code format="python" x = int(0)
 * 1) assign x to be the integer zero

while x < 5: # add one to x   x += 1 # print out the value of x   print "x is now {}".format(x)
 * 1) as long as x is less than 5, execute the following code

code <span style="font-family: Arial,Helvetica,sans-serif;">and the output: code x is now 1 x is now 2 x is now 3 x is now 4 x is now 5 code <span style="font-family: Arial,Helvetica,sans-serif;">Don't forget to increment the x (x += 1), or you will enter an infinite loop!

<span style="font-family: Arial,Helvetica,sans-serif;">Nesting and Complex Statements
<span style="font-family: Arial,Helvetica,sans-serif;">The first thing to realize is that programming languages are like natural languages: as long as you follow the rules of grammar, anything can be nested inside anything else. Here are a couple examples:

<span style="font-family: Arial,Helvetica,sans-serif;">Make a new script and call it GC_content.py <span style="font-family: Arial,Helvetica,sans-serif;">Add the following: code format="python" seq1 = "TTGAGCTGGGATACTCGGAGCCTCACTTAGAATAATCATCCGGACGGAGTTAAGCGGCCCCGGTAATTTA" seq2 = "ATTGGGGATGACCAAATTTACAACACCATCGTAACAGCTCATGCTTTTATCATAATTTTTTTTATAGTTA" seq3 = "TACCTATTATAATCGGTGGCTTCGGAAATTGATTGGTTCCTCTAATATTAGGAAGACCGGATATAGCCTT" seq4 = "CCCACGAATGAATAATATAAGATTTTGATTACTACCGCCCTCTTTAACTCTCCTATTAATAAGAGGAATG" seq5 = "GTAGAAAGAGGGGTAGGTACTGGGTGAACAGTTTACCCCCCTTTAAGAGGGGCAATTGCCCATAGAGGGG"

seq_list = []
 * 1) create empty list to put these sequences in

seq_list.append(seq1) seq_list.append(seq2) seq_list.append(seq3) seq_list.append(seq4) seq_list.append(seq5)
 * 1) add the sequences one by one

for seq in seq_list: # each time a sequence is selected, start the GC_content and total_bases counters at zero GC_count = float(0) total_bases = float(0)
 * 1) start a for loop, iterating over every sequence in the list

# start the second for loop, this time iterating over every base in a given sequence for base in seq: # for every base in this sequence, add one to our total base pair counter total_bases += 1 # start a conditional statement, that if a 'C' is found, add one to the GC_content counter if base == "C": GC_count += 1 # another conditional statement, that if a 'G' is found, add one to the GC_content counter elif base == "G": GC_count += 1

#at the end of the sequence, before moving on to the next sequence in the list, calculate the percentage # of G's and C's in this sequence GC_content = int( (GC_count/total_bases)*100 )

# print this information out using the string.format method print "The GC content of {0} is {1}%".format(seq, GC_content)

#after this code, the loop will continue with the next sequence in the seq_list

code and have a look at the output: code The GC content of TTGAGCTGGGATACTCGGAGCCTCACTTAGAATAATCATCCGGACGGAGTTAAGCGGCCCCGGTAATTTA is 50% The GC content of ATTGGGGATGACCAAATTTACAACACCATCGTAACAGCTCATGCTTTTATCATAATTTTTTTTATAGTTA is 30% The GC content of TACCTATTATAATCGGTGGCTTCGGAAATTGATTGGTTCCTCTAATATTAGGAAGACCGGATATAGCCTT is 38% The GC content of CCCACGAATGAATAATATAAGATTTTGATTACTACCGCCCTCTTTAACTCTCCTATTAATAAGAGGAATG is 34% The GC content of GTAGAAAGAGGGGTAGGTACTGGGTGAACAGTTTACCCCCCTTTAAGAGGGGCAATTGCCCATAGAGGGG is 52% code It's nice to use the same functions on multiple items in a row. By using combinations of tests and loops, you can accomplish a lot with relatively simple code. The main tricky part starting out is keeping track of what happens inside each loop, and when. It is also important to consider where to place a variable or list or dictionary that you are trying to build, either inside or outside of a loop. Sometimes your variable or item may reset each time the loop begins, if you've defined it inside the loop.

Take a look at the following example, which might help you think about nesting loops and how they operate: code format="python" outside_counter = int(0)

for i in range(10): outside_counter += 1

print outside_counter
 * 1) this prints 10
 * 2) the counter is outside of the loop



for i in range(10): inside_counter = int(0) inside_counter += 1

print inside_counter
 * 1) this prints 1, rather than 10
 * 2) why?



first_loop = int(0) second_loop = int(0) third_loop = int(0)

for i in range(5): first_loop += 1

for j in range(5): second_loop += 1

for h in range(5): third_loop += 1

print first_loop
 * 1) this count is 5

print second_loop
 * 1) this count is 25

print third_loop
 * 1) this count is 125


 * 1) how can you explain the differences in numbers? what order do the loops occur in?

code Perhaps a better way to visualize this last example is to print a statement following each iteration of nested loops, so you can see what is being executed and when. It's a nice way to keep track. Try this code: code format="python" first_loop = int(0) second_loop = int(0) third_loop = int(0)

for i in range(5): first_loop += 1 print "I am in the primary for statement now!"

for j in range(5): second_loop += 1 print '\t', "I am in the secondary for statement here."

for h in range(5): third_loop += 1 print '\t', '\t', "And here I am in the third for statement, doing loops!"

code I put tabs in the statements to correspond with the tabs for the nested loops. Does this make more sense? Getting a grasp on nesting statements like these is important for really using the language to do your tasks.

<span style="font-family: Arial,Helvetica,sans-serif;">Functions
<span style="font-family: Arial,Helvetica,sans-serif;">Functions are used to group code into 'chunks' that do something meaningful, and hide the boring details. <span style="font-family: Arial,Helvetica,sans-serif;">They also keep you from having to type things over and over.

<span style="font-family: Arial,Helvetica,sans-serif;">All functions begin by: <span style="font-family: Arial,Helvetica,sans-serif;">def name_of_function(arguments): <span style="font-family: Arial,Helvetica,sans-serif;"> block of code

Let's make a 'functions.py' file. Here is an example of a function that is numerical in principle: code format="python" def percent_calc(x,y): percent =( float(x) / float(y) )*100 #return the percent variable from this function return percent
 * 1) assign a name to function and list arguments it will take, in this case two

print percent_calc(100,300)
 * 1) to execute the function, use the name of the function and give it
 * 2) the appropriate arguments defined by you (in this case, two numbers)
 * 3) print the output of the function, performed for 100 and 300 as arguments

perc = percent_calc(100,300) print perc
 * 1) or store it as a variable for use later

code
 * 1) in both cases 33.3333333333 is what prints to the screen

<span style="font-family: Arial,Helvetica,sans-serif;">Functions can be thought of in two different ways, depending on what you want to do. <span style="font-family: Arial,Helvetica,sans-serif;">First, you can treat them like functions in math, which take 0 or more arguments and return one result, like above.

<span style="font-family: Arial,Helvetica,sans-serif;">The other way to look at them is as subroutines. They group a series of steps together so you can just write it once and then reference it in other code. Remember our GC_content calculator? We can create a function for that and execute that for a single sequence or a list of sequences, now that it is in a handy format. code format="python" def gc_calculator(seq): total_bases = float(0) GC_count = float(0) for base in seq: # for every base in this sequence, add one to our total base pair counter total_bases += 1 # start a conditional statement, that if a 'C' is found, add one to the GC_content counter if base == "C": GC_count += 1 # another conditional statement, that if a 'G' is found, add one to the GC_content counter elif base == "G": GC_count += 1 GC_content = int( (GC_count/total_bases)*100 ) return GC_content

seq1 = "TTGAGCTGGGATACTCGGAGCCTCACTTAGAATAATCATCCGGACGGAGTTAAGCGGCCCCGGTAATTTA" seq2 = "ATTGGGGATGACCAAATTTACAACACCATCGTAACAGCTCATGCTTTTATCATAATTTTTTTTATAGTTA" seq3 = "TACCTATTATAATCGGTGGCTTCGGAAATTGATTGGTTCCTCTAATATTAGGAAGACCGGATATAGCCTT"

seq_list = []

seq_list.append(seq1) seq_list.append(seq2) seq_list.append(seq3)

print "gc_content of seq1 is:", gc_calculator(seq1), "%"
 * 1) now we can use the function on one sequence, or iterate over the list

for seq in seq_list: gc = gc_calculator(seq) print "the GC content of {0} is: {1}%".format(seq,gc)

code <span style="font-family: Arial,Helvetica,sans-serif;">and the output:

code gc_content of seq1 is: 50 % the GC content of TTGAGCTGGGATACTCGGAGCCTCACTTAGAATAATCATCCGGACGGAGTTAAGCGGCCCCGGTAATTTA is: 50% the GC content of ATTGGGGATGACCAAATTTACAACACCATCGTAACAGCTCATGCTTTTATCATAATTTTTTTTATAGTTA is: 30% the GC content of TACCTATTATAATCGGTGGCTTCGGAAATTGATTGGTTCCTCTAATATTAGGAAGACCGGATATAGCCTT is: 38% code

<span style="font-family: Arial,Helvetica,sans-serif;">Any function without a return statement just returns None (which the interpreter doesn't bother printing). You can store that somewhere if you want, but usually you just ignore it.

<span style="font-family: Arial,Helvetica,sans-serif;">Working with Files
<span style="font-family: Arial,Helvetica,sans-serif;">Bioinformatics often involves reading and writing files. So far, we've been delivering information to Python in the scripts we are writing, by assigning variables and creating data structures in the script itself. In most cases, this information will come from an outside file, and we need to learn how to import this information and use it in meaningful ways. That's what the focus of this section is, and we'll end with writing information to output files!

A ** filehandle ** is an object that controls the stream of information between your program and a file stored somewhere on the computer. Filehandles are not filenames, and they are not the files themselves. They are a tool that your program uses to interact with files, nothing more.

<span style="font-family: Arial,Helvetica,sans-serif;">**file handle objects** have their own methods.


 * <span style="font-family: Arial,Helvetica,sans-serif;">There is a close method that closes an open file handle.
 * <span style="font-family: Arial,Helvetica,sans-serif;">next method goes to the next line in the file and returns it
 * <span style="font-family: Arial,Helvetica,sans-serif;">readlines method reads the entire file into memory and returns a list where each element is one line
 * <span style="font-family: Arial,Helvetica,sans-serif;">read method reads the entire file into memory and returns the contents of the file as as string.
 * <span style="font-family: Arial,Helvetica,sans-serif;">write method writes whatever you pass as an argument to it, to the file handle, you first need to open a file handle with write permissions, usually append permissions. This is useful to write outputs of your scripts to.

We create filehandles in the simplest sense with the open command:

fh = open('some_file', [permissions])

The permissions include 'r', 'w', and 'a'. <span style="font-family: Arial,Helvetica,sans-serif;">The 'r' tells Python to open the file for reading. The other options are 'w' for write and 'a' for append. The difference is writing deletes any existing data in the file, while append adds lines to the end of it. (Think of a log file, where you add a line each time something happens).

Create a new script called "file_open.py" Let's try it on a phylip format alignment file, which you can download below:

Put this file in the same folder we've been using for this workshop. You can inspect it from terminal by typing: code $ less Alignment.phy

(hit q when done)

code The first line has two numbers, the first is how many sequences there are and the second is the alignment length. Then all following lines are the sequence name followed by the sequence.

Let's open this file by entering the following code into "file_open.py": code format="python" fh = open('Alignment.phy', 'r') contents = fh.read print contents
 * 1) fh is a nice abbreviation for filehandle
 * 2) the 'r' indicates read-only, so we can't make changes to this file

fh.close code That's a replica, but it is a single long string and not useful for parsing through. We want to use something more manageable.
 * 1) always close a file after you are done using the following command:

Let's use a different method that involves looping through lines read from our file. The fh.readlines method returns a list of strings, where each string represents a line in the file. Using the list and string methods we've already covered, we can iterate over the list of file lines and do things to these strings to pull out relevant information.

Let's do this here using the method described above, and our goal here is simply to pull out the names of the samples in this alignment. Try the following code: code format="python" fh = open('Alignment.phy', 'r')

lines = fh.readlines
 * 1) read the lines into a list

for line in lines[1:]: #eliminate the newline characters and odd whitespaces trailing lines #ALWAYS a good idea to use this function! You can assign to a new variable or keep it simple #and replace the line variable, since we only need it here line = line.strip #split the line based on a delimiter, in this case the space between the sequence name and sequence #at this stage you could also use line.split('\t') if you have a tab-delimited text file for example line = line.split #NOTE: we just turned our string into a list of strings... now we are dealing with a list! #assign the elements in the newly created list to variable names for quick reference name = line[0] sequence = line[1]
 * 1) we can now iterate over the lines to do useful things
 * 2) let's skip the first line using index numbers (remember lines[0] will be the first line)

print name # prints the name part of this line, then will move on to the next line in our readlines generated list

fh.close code the output is: code VZ_235724 VZ_246017 VZ_250452 VZ_234475 VZ_238928 VZ_236603 VZ_236601 VZ_237077 AS_251028 VZ_235850 VZ_235851 VZ_234488 AS_251029 VZ_241417 VZ_250289 VZ_234491 code We are just beginning to scratch the surface here for possibilities. You could also calculate GC content for each sequence along the way, as long as we include the function we invented above and execute it on the sequence variable. You could put the sample names into a list and sort by alphabetical, or store the sequences in a list. You could check for duplicate sequences by creating a set and comparing the length of the set vs. the number of sequences read (start a counter!). You could search for a particular repeat in each sequence, or trim them to a specific length.
 * 1) always close a file after you are done using the following command:

The last thing we will do is **write to a file**. Create an output file by using a filehandle for it. We'll modify the above script to output the sample names and sequences to a new text file, separated by a tab: code format="python" fh = open('Alignment.phy', 'r') lines = fh.readlines

output_name = "Output.txt" fh_out = open(output_name, 'a')
 * 1) create the output file here, and make sure we can append it ('a'), not write over it ('w')

for line in lines[1:]: line = line.strip line = line.split name = line[0] sequence = line[1]

#this is where the writing happens with the fh.write function #we are using a concatenated string here: name plus a tab character plus sequence name, plus newline #if you didn't include the newline, all names and seqs would all appear on a single line! fh_out.write(name+'\t'+sequence+'\n')

fh.close fh_out.close
 * 1) add line to close output file

code

Nothing appeared on the screen, but we didn't ask anything to print. Check your directory to see the output file, "Output.txt". If we used the 'w' option for our output file, only the last sample name and sequence would be there, because we would have been constantly writing over the file contents. You can do A LOT with output files, including making headers, then populating your file with information when you iterate over the lines of the input file. You could also summarize information and do mathematical operations, then output these too. Don't forget about the utility of the string.format method here!

Tips
Congratulations, you survived your first major exposure to Python!

The best way to continue learning is through more focused tutorials and practice. I strongly recommend working through the UC Berkeley Python bootcamp wiki page, which has plenty of exercises and solutions, found here.

There are also many other resources for learning Python for free online or available as a download. Some examples are: Learning Python Python Pocket Python Official Documentation And 9 times out of 10, googling your problem will help you solve it via a programming forum.

Remember, a lot of bioinformatics is simply processing output files from one program in order to prepare them as inputs for a different program. You can do these formatting changes with the concepts explained above. Become familiar with the specific file types you work with, and play around with writing your own files using information from input files.