Welcome to the Introduction to Python workshop at CGRL. This course is aimed at biologists who are interested in learning how to analyze genomic data sets, especially Next Generation Sequencing data sets. Our goal is to give people the impetus to start writing small programs and then explore further on their own. Since the expected audience has little to no programming experience, the material covered in this course will be at a beginner level. However, since programming requires some understanding of the command line environment, we expect attendees to be familiar with Unix or Unix-like environments (e.g. MacOS). You are also expected to know one of the text editors used to read and write text files in the Unix environment. Some of the common ones are emacs, vi, and nano.
Why Python
The Python programming language is relatively new and is increasing in popularity. Several features of the language are helpful in tackling problems in genomic data analysis, and many of the newer programs available in this area are written in Python. Moreover, it is easier than some other languages for a novice programmer to start with.
Python is an evolving language and several versions are available now. We will be using version 2.7.1 in this class. Many of the programs available for genomic data analysis are written in versions 2.6 or 2.7, and the differences between these two are not very significant. These versions are still the most popular ones, and the material covered in this class will work well, for the most part, in both. The newest versions, Python 3.x, have major differences from 2.x, and some of the code covered in this class may not work with them.
Programming in general
INPUTS --> PROGRAMS --> OUTPUTS
A program is a set of instructions that tells the computer to carry out certain actions. Each program takes 0 or more inputs, carries out the instructions given, and produces 0 or more outputs. In general, the same result can be achieved through different sets of actions. Therefore, there are many ways of writing computer programs to achieve the same outcome. However, some programs are more efficient and elegant than others, and some are easier to read and maintain over time. We will not place much emphasis on these points since this is a beginner level class, but it is good to learn those differences and incorporate them into your programs and programming style as you become more proficient.
Working environment
You are welcome to use the programming environment on your own laptop if you prefer. Having Python 2.6 or 2.7 already installed will be best. However, we will not be able to help much if there are any major difficulties using it. You also have the option to use Python on the CGRL server machine. The details about logging into the machine will be given in class.
Let's get our feet wet
Once you have logged into your account and set up the environment (PATH variable for Python), check out the version of Python you have.
$ python -V
Python 2.7.1
The traditional way to start programming is to print out the words "Hello, World!". To do this in Python you will have to start the Python interpreter. We do this by simply typing in the "python" command at the prompt.
$ python
Python 2.7.1 (r271:86832, May 12 2011, 10:02:04)
[GCC 4.6.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
The Python interpreter has been started and is now waiting for your input. This is one of the ways to use Python. Python is an interpreted language, which means that it takes instructions from the programmer, one at a time, and executes them. Let us get Python to print out something.
>>> print "Hello, World!"
Hello, World!
>>>
The interpreter executed your instruction and prompts you for the next command. Try an arithmetic expression:
>>> 3 + 2
5
>>>
So the Python interpreter can also act as a simple calculator (and you can make it do much more complex calculations too). Let's see how we can stop the interpreter and come back to the Unix command prompt.
>>> quit()
$
That is it! Not particularly useful, but you have used Python to run a small script or program. Although some people may be finicky about the words "script" and "program", here we will use them interchangeably. Either one is a set of instructions to a computer in some programming language, which in our case is Python.
Running Python programs from command line
We have seen the interactive mode of the Python interpreter. The advantage of using the interpreter this way is that you quickly see the output of each action. When programs are being developed this is a good way to interact with Python. However, once you leave the interpreter there is no saved record of the commands. To keep our instructions, we have to save them in a file and call the Python interpreter to execute that file from the command line.
To do that, let's first create a simple program which does the same task as before: print out the string "Hello, World!". Use your favorite editor to create the file containing your program, the text of which is shown below.
$ emacs helloWorld.py
#!/usr/bin/env python
print "Hello, World!"
Now you can execute the program by calling Python on this program file:
$ python helloWorld.py
Hello, World!
The first line of the program, starting with "#!", sets up the environment to execute the file, in this case by calling "python", which is the interpreter. We do not have to name the Python interpreter on the command line if the file is executable.
$ chmod +x helloWorld.py
$ ./helloWorld.py
To try a similar program with varying input and output
$ cat helloDude.py
Expressions, Variables, Statements
Here we will look at mathematical expressions, variables (or named objects), and statements in Python. The best way to understand these is to get back into the interactive Python interpreter. We are going to look at types, precedence of operations, and assignment in this section. Variable names can be anything subject to a few conditions: they must begin with a letter (or an underscore), can contain letters, numbers, or underscores, and cannot be one of the Python keywords, which are
and del from not while
as elif global or with
assert else if pass yield
break except import print
class exec in raise
continue finally is return
def for lambda try
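The naming rules above can be checked programmatically. The sketch below is only an illustration: `is_valid_name` and `NAME_PATTERN` are hypothetical names introduced here, while `keyword.iskeyword` comes from Python's standard library. It is written so that it should run under both Python 2.7 and Python 3.

```python
import keyword
import re

# Hypothetical helper (not part of the workshop material): a name must start
# with a letter or underscore and continue with letters, digits, or underscores.
NAME_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_valid_name(name):
    """Return True if name follows the naming rules and is not a keyword."""
    return bool(NAME_PATTERN.match(name)) and not keyword.iskeyword(name)


# Try a few candidate variable names.
for candidate in ["gc_count", "read2", "2reads", "seq-length", "while"]:
    print(candidate + " -> " + str(is_valid_name(candidate)))
```

Names such as "2reads" (starts with a digit), "seq-length" (contains a minus sign), and "while" (a keyword) are rejected.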
$ python
Python 2.7.1 (r271:86832, May 12 2011, 10:02:04)
[GCC 4.6.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 2 + 5 - 3
4
>>> result = 2 + 5 * 4
>>> result
22
>>> print result
22
>>> type(result)
<type 'int'>
>>> greeting = "Welcome"
>>> greeting
'Welcome'
>>> print greeting
Welcome
>>> type(greeting)
<type 'str'>
>>> ltor = 50 / 5 * 2
>>> print ltor
20
>>> 50 / 5 ** 2
2
>>> 45 / (5 + 10)
3
>>> n
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'n' is not defined
>>>
When in doubt, use parentheses to explicitly specify the order of execution.
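As a small illustration of how parentheses change the result, the sketch below evaluates the same numbers with and without them. The variable names are made up for this example, and `//` is used for integer division so that the results are the same under Python 2.7 and Python 3.

```python
# Multiplication binds tighter than addition.
without_parens = 2 + 5 * 4         # evaluated as 2 + (5 * 4)
with_parens = (2 + 5) * 4          # parentheses force the addition first

# Exponentiation binds tighter than division.
power_first = 50 // 5 ** 2         # evaluated as 50 // (5 ** 2)
division_first = (50 // 5) ** 2    # parentheses force the division first

print(without_parens)   # 22
print(with_parens)      # 28
print(power_first)      # 2
print(division_first)   # 100
```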
Numeric values
There are mainly two types of numeric values: integers and floating point values. When all the numbers in an expression are integers, the operations use integer arithmetic, so a division discards the fractional part.
>>> x = y = z = 0
>>> x
0
>>> z
0
>>> width = 7
>>> length = 5 * 4
>>> width * length
140
>>> area = width * length
>>> 7 / 3
2
>>> 12 / 5.0
2.4
>>>
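The difference between integer and floating point arithmetic can be made explicit. This is a minimal sketch with made-up variable names. It uses `//` for integer division and `float()` for conversion, both of which behave the same way in Python 2.7 and Python 3 (whereas a plain `/` between two integers gives 2 in Python 2 and a floating point value in Python 3).

```python
whole = 7 // 3             # integer division: the fractional part is discarded
remainder = 7 % 3          # what is left over after the integer division
mixed = 12 / 5.0           # one floating point operand makes the result a float
converted = float(7) / 3   # convert an integer explicitly before dividing

print(whole)       # 2
print(remainder)   # 1
print(mixed)       # 2.4
print(converted)
```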
Strings
Strings are an important data type in Python. They are usually written enclosed in single or double quotes. If a quote character appears inside the string, the surrounding quotes should be of the other type or the internal quote must be "escaped" with a backslash. We will look at some examples below.
>>> 'spam eggs'
'spam eggs'
>>> "doesn't"
"doesn't"
>>> 'doesn\'t'
"doesn't"
>>> '"Yes,", he said.'
'"Yes,", he said.'
>>> hello = '''This is a really long line
...  continuing on to a second line'''
>>> hello
'This is a really long line\n continuing on to a second line'
>>> print hello
This is a really long line
 continuing on to a second line
>>> x = "AC"
>>> y = "TG"
>>> x + y
'ACTG'
>>> z = x + y
>>> z
'ACTG'
>>> z * 3
'ACTGACTGACTG'
>>> rep_z = z * 3
>>> len(rep_z)
12
>>> rep_z[0]
'A'
>>> rep_z[2]
'T'
>>> rep_z[11]
'G'
>>> rep_z[0:4]
'ACTG'
>>> rep_z[:4]
'ACTG'
>>> z[2:]
'TG'
>>> rep_z[2:]
'TGACTGACTG'
>>> new_z = z[1:-1] + 'q'
>>> new_z
'CTq'
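The indexing and slicing operations above can be collected into a short script. This is just a sketch with illustrative variable names. Negative indexes count from the end of the string, and `find` is a standard string method that returns the position of the first match. The code should run under both Python 2.7 and Python 3.

```python
seq = "ACTG" * 3               # 'ACTGACTGACTG'

first_three = seq[0:3]         # characters at positions 0, 1 and 2
last_base = seq[-1]            # a negative index counts from the end
tail = seq[-4:]                # the last four characters
position = seq.find("TGA")     # index of the first occurrence of 'TGA'

print(first_three)   # ACT
print(last_base)     # G
print(tail)          # ACTG
print(position)      # 2
```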
Lists
Lists are similar to strings in that they are ordered collections. However, strings contain characters as the items whereas lists can contain characters, strings, numbers, or even other lists as their items. The lists can also be accessed through their indexes just like strings. Unlike strings however, the contents of the lists can be changed.
>>> cheeses = ['cheddar', 'gouda', 'cottage']
>>> numbers = [17, 15, 23]
>>> mixlist = ['contains', 34]
>>> empty = []
>>> print cheeses, numbers, empty, mixlist
['cheddar', 'gouda', 'cottage'] [17, 15, 23] [] ['contains', 34]
>>> a = ['spam', 'eggs', 100, 1234]
>>> a
['spam', 'eggs', 100, 1234]
>>> a[2] = a[2] + 23
>>> a
['spam', 'eggs', 123, 1234]
>>> a[0:2] = [1, 12]
>>> a
[1, 12, 123, 1234]
>>> a[0:2] = []
>>> a
[123, 1234]
>>> a[1:1] = ['foo', 'bar']
>>> a
[123, 'foo', 'bar', 1234]
>>> # Insert copy of a at beginning
... a[:0] = a
>>> a
[123, 'foo', 'bar', 1234, 123, 'foo', 'bar', 1234]
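To see the difference between changeable lists and unchangeable strings side by side, consider the following sketch. The variable names are illustrative. `append` is a standard list method, and assigning to a position in a string raises a `TypeError` in both Python 2.7 and Python 3.

```python
bases = ['A', 'C', 'T', 'G']
bases[2] = 'U'        # lists can be changed in place
bases.append('N')     # and they can grow

seq = "ACTG"
try:
    seq[2] = 'U'      # strings cannot be changed in place
    string_changed = True
except TypeError:
    string_changed = False

# To "change" a string, build a new one from slices of the old one.
rna = seq[:2] + 'U' + seq[3:]

print(bases)            # ['A', 'C', 'U', 'G', 'N']
print(string_changed)   # False
print(rna)              # ACUG
```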
Control Flow
In addition to straight-line execution of instructions, we need to be able to specify actions based on conditions and to repeat actions. Control-flow statements help achieve this objective. The two main control-flow statements we will look at are the "if-elif-else" statement and the "for" statement.
>>> x = int(raw_input("Please enter an integer: "))
Please enter an integer: 42
>>> if x < 0:
...     x = 0
...     print 'Negative changed to zero'
... elif x == 0:
...     print 'Zero'
... elif x == 1:
...     print 'Single'
... else:
...     print 'More'
...
More
>>> words = ['cat', 'jump', 'window']
>>> for w in words:
...     print w, len(w)
...
cat 3
jump 4
window 6
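The two statements are often combined: a "for" loop visits each item and an "if-elif-else" statement decides what to do with it. The sketch below, using a made-up sequence and variable names, sorts the bases of a DNA string into purines (A, G), pyrimidines (C, T), and anything else. It should run under both Python 2.7 and Python 3.

```python
seq = "ACGTNAC"   # toy sequence; 'N' stands for an unknown base

purines = 0
pyrimidines = 0
unknown = 0

for base in seq:
    if base == 'A' or base == 'G':
        purines = purines + 1
    elif base == 'C' or base == 'T':
        pyrimidines = pyrimidines + 1
    else:
        unknown = unknown + 1

print(purines)       # 3
print(pyrimidines)   # 3
print(unknown)       # 1
```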
Exercise
In DNA strings, the symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'. The reverse complement of a DNA string s is the string r formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC"). Write a Python script to read in a DNA string of at most 1000 bp and return the reverse complement of the string.
Functions
Functions help encapsulate code which can be reused over and over again. We have already used a few like "raw_input()" and "len()". These are predefined functions available in the language. We can also define our own.
>>> def print_twice(stuff):
...     print stuff
...     print stuff
...
>>> print_twice("try this")
try this
try this
>>> print_twice("sleep is divine")
sleep is divine
sleep is divine
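Functions can also hand a value back to the caller with the "return" statement, so that the result can be stored in a variable or used in a condition. The following is a minimal sketch. `is_dna` is a hypothetical function name chosen for this example, and the code should run under both Python 2.7 and Python 3.

```python
def is_dna(seq):
    """Return True if every character in seq is one of A, C, G or T."""
    for base in seq:
        if base not in "ACGT":
            return False   # stop at the first unexpected character
    return True


good = is_dna("ACTGACTG")
bad = is_dna("ACTNACTG")

print(good)   # True
print(bad)    # False
```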
File Input and Output
Files are necessary to persist data after programs are done executing. We can read data from files and write data to files.
The example below reads the first few lines of a text file and prints them out.
>>> fin = open('/global/courses/spr2013/pythonWorkshop/frost.txt', 'r')
>>> x = 0
>>> while x < 4:
...     line = fin.readline().strip()
...     x = x + 1
...     print line
...
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
>>> fin.close()
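Writing to a file works much the same way, with the mode 'w' instead of 'r'. The sketch below writes two lines to a file and reads them back. The file name "poem.txt" and the temporary directory are assumptions made so that the example is self-contained and does not overwrite anything. The code should run under both Python 2.7 and Python 3.

```python
import os
import tempfile

# Work in a fresh temporary directory (an assumption made for this sketch).
work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "poem.txt")

# Open the file for writing; write() does not add newlines for us.
fout = open(path, 'w')
fout.write("Two roads diverged in a yellow wood,\n")
fout.write("And sorry I could not travel both\n")
fout.close()

# Open it again for reading and loop over the lines.
lines = []
fin = open(path, 'r')
for line in fin:
    lines.append(line.strip())
fin.close()

print(len(lines))   # 2
print(lines[0])     # Two roads diverged in a yellow wood,

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(work_dir)
```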
Exercises
1. Write a Python script to read a FASTA file and print out the sequence name and nucleotide counts for each entry in the file. Use the file /global/courses/spr2013/pythonWorkshop/mrna10.fa as a test case. This file is copied below
2. Write a Python program to read the file /global/courses/spr2013/pythonWorkshop/mrna.fa.gz (beware: this file is not small and is in compressed format) and write each sequence name and the GC percentage of that sequence to another file.
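As a hint for the compressed file in exercise 2, Python's standard gzip module can open a ".gz" file so that it is read line by line without uncompressing it first. The sketch below is not a solution to the exercise. It builds a tiny compressed file of its own (the name "toy.fa.gz" and its contents are made up) and counts the header lines in it. It should run under both Python 2.7 and Python 3.

```python
import gzip
import os
import tempfile

# Create a small compressed FASTA-like file in a temporary directory
# (file name and contents are made up for this sketch).
work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "toy.fa.gz")

gz_out = gzip.open(path, 'wb')
gz_out.write(b">seq1\nACTG\n>seq2\nGGCC\n")
gz_out.close()

# Read the compressed file one line at a time.
header_count = 0
gz_in = gzip.open(path, 'rb')
for raw_line in gz_in:
    line = raw_line.decode('ascii').strip()   # bytes -> text, drop the newline
    if line.startswith('>'):                  # FASTA names start with '>'
        header_count = header_count + 1
gz_in.close()

print(header_count)   # 2

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(work_dir)
```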
Introduction to Python for Genomic Data Analysis
April 29, 2013
Jeff Johnson and Madhavan Ganesh