How to teach yourself Python


This workshop will give you some advice on how to learn Python through self study, explain some basic concepts of computer science, guide you through setup of a Python programming environment, clarify what you can expect on your journey to becoming an expert hacker, and point you in the direction of resources to help you learn. This workshop will not teach you how to code in Python. We could barely scratch the surface in 2 hours, or even a full day, so instead I will try to prepare you and equip you to take on this challenge.

Learning to code is hard. It requires learning the syntax of a new language, a new discipline of logic, and a practical knowledge of computing all at once. It is a skill that anyone can learn, though, just like reading or mathematics. Coding can also be a lot of fun. Coding is a creative challenge, like solving a logic puzzle, and provides rapid rewards. Instead of waiting overnight to see if your cloning worked, you can run your script and get results near instantly. Stick with it and you will find working with computers joyful and rewarding.

Additionally, your first language is always the hardest. Once you learn one language, learning another will be easier since the underlying logic will be the same.


Part 1: Strategies to keep motivated and learn effectively


Commit

The key to learning from self-paced courses and tutorials is to commit to completing them. Set aside the time you need to complete your chosen course or tutorial. It will take at least 40 hours of concerted effort to complete a basic tutorial and start doing something. For example, the Python Bootcamp for bioinformatics will probably take you 40 hours to complete. If you're very new to computers, or you choose to complete an in depth tutorial, it will take much longer. For example, completing CS 61A will probably take you closer to 200 hours.

Obviously, the more you give, the more you will get. How much expertise you actually need depends on your goals. A basic tutorial should give you enough knowledge to get started stringing together bioinformatics pipelines and analyzing your next generation sequencing data. If you plan to write your own software or bioinformatics packages, though, you may want to consider a more in depth tutorial like the CS 61A course.

While it's essential to commit to learning the skills of programming, it's OK to stop when you're done. While you are working try to think of how you can use what you are learning in your research. This will hopefully keep you motivated, too. Hopefully, at some point, you will think, “Hey, I could use this to automate that boring task I hate!” When that happens, I say “Do it!” Set the tutorial aside and get to work. We are all busy, our time is precious, and one of the best ways to learn is by doing.

Be Consistent

I recommend working at Python every day. If you wait too long between lessons you will forget what you learned in a previous lesson. Set aside at least an hour every day. If you can, block out an entire week or two and do nothing but learn Python.

Use It or Lose It

Start learning Python when you need it. If you take two weeks to complete a tutorial and then wait 6 months for your NGS data to arrive, you will very likely have forgotten everything you learned. Learning a programming language is just like learning a spoken language in that regard. Unless you keep practicing, it will fade with time. Once you become fluent, though, it is easy to bring it back to the surface.

Work on a Project

Working on a project is a great way to motivate yourself to keep learning and to keep in practice. Choose a project that you are excited about working on. If you have a pile of data to analyze, or a bioinformatics project in mind, that is a great project. Work towards learning enough to get you started on that project, then take a swing at it.

Your progress will be slow at first. You may find yourself spending more time searching for answers to your questions and reading documentation than actually writing code. That's OK; you are learning as you go. Working on your project will focus you on learning the precise skills you need, rather than wasting time learning how to interface with email servers or create your own exception classes.

If you don't have any data currently then find something else you would be excited to work on. Like playing board games? Write an electronic version of your favorite board game. Have a messy music library? Write a script to help organize your music. Don't back up your files often enough? Write a script to automatically back up your home directory every night. It's not important what project you choose, or even if it's feasible or you can do it well, just that you enjoy working on it and you will continue to work on it.

Work with a Friend

Having a learning buddy can be hugely helpful. Working collaboratively can be more enjoyable than slogging along by yourself, and you can encourage each other to keep working by setting up practice dates. If you both commit to reviewing a lesson or exercises together then you can hold each other accountable for completing your goals on time.

You can also learn a lot from your Python buddy. You can go to your Python buddy with your questions, and when your buddy comes to you with a question you can practice explaining what you are learning. Teaching is one of the most effective ways to learn.

Finally, you can learn more by seeing each other's solutions and comparing your approaches. Does one method work better than another? Are there relative advantages or disadvantages? This will also help you practice reading and understanding other people's code.


Part 2: Computer Jargon


One of the more useful skills for independently studying how to code is the ability to ask the right question. In order to phrase your question precisely it is helpful to understand some computer jargon. This is also helpful in understanding the answers you get. For example, if you ask, "How can I get PyCharm to use Anaconda?" you may get an answer like, "You need to add the path to your Anaconda Python executable to PyCharm's list of interpreters." This answer may be hard to understand if you don't know what a path, an executable, or an interpreter are.

Here is a basic overview of terms and concepts that are helpful to know for anyone who wants to use a computer to do work. As you learn more Python you will learn many more Python specific terms, as well as general names for common tasks in programming. Breaking into computer jargon can be daunting at first - you may find yourself reading a wiki page to define a term, only to end up with three new terms or concepts you don't understand. For example, researching an "unhashable type" error may lead you to researching what a hash table is, as well as what a mutable object is. Then trying to understand mutability may lead you to trying to understand what a "variable" is in Python, what it means to "assign" a variable, and what it means for Python to be a "pass by reference" language. You may find yourself in many such rabbit holes that seem dark and confusing, but know that this is an important part of the learning process. Computer science can be surprisingly holistic, and as you understand one concept a little better it will help you to understand another more than you did before, and so forth.
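As a small taste of that particular rabbit hole, here is a minimal sketch (the variable names are made up for illustration) of what "mutable" means in practice and where an "unhashable type" error comes from:

```python
# Assigning a list to a second name does not copy it:
# both names refer to the same list object.
original = [1, 2]
alias = original
alias.append(3)          # changes the one shared list

print(original)          # the change is visible through both names

# A tuple cannot be changed after it is created (it is immutable),
# so Python can hash it and use it as a dictionary key.
lookup = {(1, 2): "a tuple works as a key"}

# A list can change, so Python refuses to hash it.
try:
    hash([1, 2])
except TypeError as error:
    message = str(error)
    print(message)
```

You don't need to follow every line yet. The point is only that each piece of jargon corresponds to behavior you can try out for yourself.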

Software

Programming Language – A language designed to communicate instructions to a computer.
  • cannot communicate your intent

Low Level Language – A language with a low level of abstraction from machine code.
  • communicates more directly with the computer hardware.
  • often specific to a computer's architecture
  • e.g. assembly language and machine code

High Level Language – A language that is highly abstracted from the machine code.
  • more similar to natural language
  • often portable
  • e.g. Python, JavaScript, R

Program – A set of instructions (for a machine).

Script – A program written to automate a task that could be done manually by a human.

Application – A program to do a set of coordinated tasks not related to essential system functions.

Operating System – A program that interfaces between system hardware and other software.
  • often bundled with a basic set of software to help users control the computer

Interfaces

Shell – an interface between the operating system and a human
  • wraps the operating system like a shell

Interpreter – a program that translates high level code into low level code and executes it.
  • instructions executed not always in same order they are interpreted
  • follows “order of operations” depending on language
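A tiny illustration of "order of operations" in Python, where multiplication is evaluated before addition unless parentheses say otherwise:

```python
# Multiplication is evaluated before addition, even though it is written second.
without_parentheses = 2 + 3 * 4

# Parentheses change which part is evaluated first.
with_parentheses = (2 + 3) * 4

print(without_parentheses)  # 14
print(with_parentheses)     # 20
```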

Command Line/Language Interpreter/Interface (CLI) – An interpreter where commands are entered by the user on the command line.

Graphical User Interface (GUI) – A program where users interact with graphical elements, rather than through a computer language.

BASH – Bourne Again SHell. The default command line shell in OS X and most Linux/Unix.
  • is both a language, and a CLI type shell

Software for Coding

Text Editor – Software for editing plain text files (like scripts and applications).
  • does not format text
  • often assists in writing code

Integrated Development Environment (IDE) – An application that has comprehensive tools to help you write code.
  • often includes a text editor, interpreter, file browser, and environment visualizer
  • often specific to a single language (e.g. IDLE and PyCharm for Python, Eclipse for Java, RStudio for R)

Hardware

Central Processing Unit (CPU) – the part of a computer that carries out instructions
  • can only follow instructions from one program at a time,
  • but can switch between programs very fast,
  • and almost all modern computers have multiple CPUs or “cores”

Hard Disk Drive (HDD) – long term data storage device
  • data stored on magnetic platters
  • very slow to read and write

Random Access Memory (RAM) – short term data storage device
  • data stored in capacitors – lost when powered off
  • very fast to read and write
  • “between” HDD and CPU

File Systems

File – a piece of data stored on the hard disk
  • durable – stays around after a program closes

Loading – reading data from a file on the hard drive into memory

Garbage Collection – freeing up the bits in RAM to store new data

Saving – writing data from the memory onto a hard drive
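Saving and loading can be sketched in a few lines of Python. This is only an illustration: the gene names and the file name genes.txt are invented, and the file is written to a temporary directory so that nothing important is touched.

```python
import os
import tempfile

# Data that so far exists only in memory (RAM).
gene_names = ["geneA", "geneB", "geneC"]

# Hypothetical file location inside a temporary directory.
file_path = os.path.join(tempfile.mkdtemp(), "genes.txt")

# Saving: write the data from memory to a file on disk.
with open(file_path, "w") as handle:
    handle.write("\n".join(gene_names))

# Loading: read the file back from disk into memory.
with open(file_path) as handle:
    loaded_names = handle.read().split("\n")

print(loaded_names == gene_names)
```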

Directory – a catalog of files or other directories
  • a piece of a hierarchical data structure
  • references its parent and its child files and directories

Folder – a metaphor to describe how directories work

Path – the human readable address of a file or directory

PATH – an environment variable that tells the OS which directories to look in for programs

Installing – Putting a program file into a PATH directory, or changing your PATH to contain a new directory with the program.
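To see what PATH looks like on your own machine, a short sketch like the following may help. It uses only Python's standard library; the program name "python" is just an example of something to search for.

```python
import os
import shutil

# PATH is one long string; os.pathsep is the separator the OS uses
# (":" on Linux and OS X, ";" on Windows).
path_directories = os.environ.get("PATH", "").split(os.pathsep)
for directory in path_directories:
    print(directory)

# shutil.which searches the PATH directories for a program by name and
# returns its full path, or None if the program is not found.
print(shutil.which("python"))
```

Each printed directory is a place the operating system will look when you type a program's name on the command line.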


Part 3: Setting up a Python environment


We are going to install three pieces of software that are integral to using Python. First, we will install Python itself, which includes the Python interpreter and a host of useful packages. We will also install a text editor that you can use to write scripts, and an IDE that some people may prefer to a basic text editor.

Installing Python

There are several different versions of Python, with the main versions being Python 2 and Python 3. Python 3 is newer, and arguably better, but within the scientific community Python 2 is still widely used because there are more packages written in Python 2. There are also a bunch of different Python distributions that package a version of Python with a suite of useful modules and tools. We are going to install the Anaconda Python distribution because it comes with many of the most commonly used modules for scientific and data analysis. You can choose whether you want to install the Python 3.5 version of Anaconda or the Python 2.7 version.

  • Navigate to https://www.continuum.io/downloads
  • Download the installer for your OS
    • You almost certainly want the 64-bit version, unless your computer is very old.
  • Run the installer
    • For Windows and MacOS, just double click it and go through the menus.
    • For Linux, navigate to the install script in a terminal and try sudo bash Anaconda2-4.2.0-Linux-x86_64.sh

Installing a Text Editor

There are a lot of text editors to choose from, and people can have very strong opinions about which one is "the best". Ultimately, though, it comes down to preference and your individual needs. To start with, just choose one and learn how to use it. I recommend finding or putting together a reference sheet of keyboard shortcuts, printing it out, and keeping it beside you for your first few months of coding. Learning to use keyboard shortcuts effectively will greatly improve your efficiency.

Atom
  • Atom is free as in "free beer" and free as in "free speech"
  • Atom works on OS X, Linux, and Windows

Emacs
  • Emacs is free as in "free beer" and free as in "free speech"
  • Aquamacs is the Mac version of Emacs
  • Emacs works on Linux and Windows, but is mostly used on Linux
    • to install in Ubuntu type sudo apt install emacs
  • Emacs is very powerful, but requires learning a new set of keybindings from what you're probably used to

BBEdit
  • BBEdit is a popular proprietary text editor for Mac OS
  • BBEdit has a free (beer) version, but is not free (speech)

Notepad++
  • Notepad++ is free as in "free beer" and free as in "free speech"
  • It is the most popular text editor for Windows

Unless you're on Linux and want to install Emacs, go ahead and download the installer for one or more text editors and run it. If you are on Linux and want to install Emacs, just use apt or yum from your console.

Installing an IDE

An IDE is an application that's designed to include all the tools you need to write software, usually for a single language. An IDE often bundles a text editor to write your code, a file browser to organize your projects, and an interpreter to run little bits of code or your program as you're testing it, and an integrated help system or manual. Some IDEs also have other tools or displays. For example, RStudio includes a pane just for graphs and figures that you produce and a pane showing all the data you're currently working with.

You don't need an IDE, but a lot of people find them helpful. Personally, I use Emacs as my text editor, and just keep a terminal window open next to Emacs and run my scripts from the terminal. I also usually have a file browser open with tabs to my directory of scripts and to the directory I'm working in, and a web browser open where I am constantly searching Python manuals and Google.

We will install PyCharm, which is not free (speech), but does work on Linux, OS X, and Windows.
  • Visit the webpage, click "Download Now", select your OS, then click "Download"
  • For Windows and OS X, run the installer ('.exe' or '.dmg', respectively)
  • For Linux, move the '.tar.gz' archive to /opt, then extract it with tar -xvzf pycharm-community-2016.2.3.tar.gz
    • you can remove the leftover archive once you've extracted it
    • you may need to use 'sudo' to move the archive into /opt and to extract it
  • Run PyCharm
    • For Linux, run pycharm_directory/bin/pycharm.sh
  • Set up PyCharm's configuration
    • For Linux and OS X you can install a command line script so that you can run PyCharm from the command line
    • For all operating systems you can choose to create a desktop launcher
  • Make sure you are using Anaconda's Python version by following the instructions here: https://docs.continuum.io/anaconda/ide_integration#pycharm
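One way to check which Python is actually running, whether in PyCharm or in a terminal, is to ask the interpreter itself. This is a minimal sketch; if Anaconda is in use, the printed path should point somewhere inside your Anaconda install directory.

```python
import sys

# Full path of the Python executable running this code.
interpreter_path = sys.executable
print(interpreter_path)

# Version details of that interpreter.
print(sys.version)

# The major version number tells you whether this is Python 2 or Python 3.
major_version = sys.version_info.major
print(major_version)
```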


Part 4: What can Python do for you?


First, there are general advantages of being able to write a script to automate computer tasks. Then we will discuss the advantages of Python as well as some other languages you might find it worthwhile to learn, or at least know about.

Advantages of Automation

1. Efficiency

The primary advantage of automation is, of course, that you can use computers to do work for you. Modern computers can do an incredible amount of computation very fast, but you need to know how to give very precise instructions to the computer in order to leverage that ability to its fullest extent. Any time you need to perform the same task on many things, or perform a series of related tasks, or perform some task on many combinations of a few things, the problem is well suited to automation.

That said, it is not always best to approach every problem by finding a way to automate it. Instead, first judge how long a task will take you to do manually, then think about how long it will take you to write a script to automate it. At first, when you're still new to coding, even simple scripts will take significant effort and troubleshooting. As you develop your skills, though, you will find that writing a short script is often faster than typing out the same command even a dozen times.

It is also not always wise to spend too much time writing the best, most efficient code possible. Consider how many times you will have to run the script. In science we often have to run an analysis just once or several times. In these cases it often saves time to write a quick and dirty script, even if you have to let it run overnight, than to spend all day figuring out a highly efficient algorithm.

2. Less Tedium

Computers are especially good at performing tedious, repetitive tasks, and with increased efficiency you will have to spend less time doing them! From printing out the name of every gene expressed over a certain level, to BLASTING those genes against the NCBI database, to sorting and counting the resulting hits, scripting saves you a huge amount of tedious labor. Nobody wants to type, or even copy and paste, hundreds of BLAST queries.
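As a rough sketch of this kind of task, the following filters genes by expression level and then counts hits. All of the data here (gene names, expression levels, threshold, and hit list) is invented for illustration; a real script would read it from your own files.

```python
from collections import Counter

# Hypothetical expression levels, invented for illustration.
expression_levels = {"geneA": 12.5, "geneB": 0.3, "geneC": 48.0, "geneD": 7.1}
threshold = 5.0

# Keep the name of every gene expressed above the threshold.
expressed_genes = sorted(
    name for name, level in expression_levels.items() if level > threshold
)
print(expressed_genes)

# Hypothetical hits, e.g. the organism of the best match for each query.
hits = ["N. crassa", "S. cerevisiae", "N. crassa", "N. crassa"]

# Count the hits and list them from most to least common.
hit_counts = Counter(hits)
for organism, count in hit_counts.most_common():
    print("{}: {}".format(organism, count))
```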

Even when writing a script doesn't actually save you much time it can be way more interesting than performing the task manually. Coding is an engaging, creative, and challenging process. I find that I can often turn a dull task that I am avoiding into an interesting problem by writing a script to do it for me.

3. Reproducibility

Perhaps the greatest advantage to a scientist in automating analysis is that the analysis can be reproduced exactly. Your exact methods are laid out in your Python script, where you and others can scrutinize, repeat, and modify them.

Fully automating your analysis leaves you with a start to finish pipeline that anyone can use. Ideally, your raw data will be read in, and all your figures and numerical results will be output. If you use a pseudorandom generator, though, remember to set a random number seed so that your output will be exactly the same every time.
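Here is a minimal sketch of what setting a seed buys you, using Python's standard random module. The function name simulate_measurements and the seed values are made up for illustration.

```python
import random

def simulate_measurements(seed, count=5):
    """Return `count` pseudorandom numbers generated from a fixed seed."""
    generator = random.Random(seed)  # a private generator with its own seed
    return [generator.random() for _ in range(count)]

first_run = simulate_measurements(seed=42)
second_run = simulate_measurements(seed=42)
other_seed = simulate_measurements(seed=7)

print(first_run == second_run)   # same seed, same numbers
print(first_run == other_seed)   # different seed, different numbers
```

Recording the seed in your script means anyone who reruns it gets the same "random" numbers you did.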

To make your code usable and your science reproducible, it is imperative to document your code with clear comments. This will also help you later when you come back and can't remember what a script does or why you wrote it that way.

4. Consistency

As well as other people being able to see your precise methods and recreate them, you will have greater certainty that you performed each analysis in exactly the same way every time.

As well as being incredibly mind-numbing, manually running bioinformatics tools is dangerous. What if you accidentally type 'Neurospora_crassa_CA_SNPs.vcf' instead of 'Neurospora_crassa_CO_SNPs.vcf', accidentally substituting your California population for your Colorado population? Or 'clean_reads.py expensive_dataset.fq > expensive_dataset.fq' instead of 'clean_reads.py expensive_dataset.fq > expensive_dataset.clean.fq'? There are thousands of ways you can accidentally screw up your analysis to either ruin your day or produce erroneous results.

Automation reduces the risk of stupid typos and other accidents. You won't forget to include mydata.part.14.bam in the analysis when you run results = [analyse(data) for data in mydata].
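A sketch of how that list of files might be built without typing any file names by hand. The analyse function below is a hypothetical stand-in, and the script creates empty placeholder files in a temporary directory only so that the example can run anywhere.

```python
import glob
import os
import tempfile

def analyse(path):
    """Hypothetical stand-in for a real analysis: report the file size."""
    return os.path.getsize(path)

# Create 20 empty placeholder files so the sketch can run anywhere.
data_directory = tempfile.mkdtemp()
for part in range(1, 21):
    file_name = "mydata.part.{}.bam".format(part)
    open(os.path.join(data_directory, file_name), "w").close()

# Collect every matching file instead of typing each name by hand.
mydata = sorted(glob.glob(os.path.join(data_directory, "mydata.part.*.bam")))

results = [analyse(data) for data in mydata]
print(len(results))
```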

5. Parallelization

Modern computers, even budget laptops, now have multiple processors, which means you can run several analyses at once, or even hundreds if you have access to a supercomputing cluster! Starting and managing multiple processes simultaneously is often best done in an automated way. Python provides a number of tools to help you manage these processes and make the most out of parallel computing.
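As one illustration, Python 3's standard concurrent.futures module can hand a list of tasks to a pool of workers. This sketch uses a thread pool to keep the example simple; for computation-heavy analyses, ProcessPoolExecutor in the same module offers the same map interface and runs the work in separate processes. The gc_content function and the sequences are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    gc_bases = sequence.count("G") + sequence.count("C")
    return gc_bases / float(len(sequence))

# Hypothetical sequences, invented for illustration.
sequences = ["ATGC", "GGCC", "ATAT", "GCGA"]

# Run up to four tasks at a time; map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(gc_content, sequences))

print(results)
```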

Why Python?

1. Scientific Packages

As a biologist, the primary reason to learn Python over other languages is because other biologists use it. Python is widely used for scientific and data analysis, and there are many tools already built that you can use without having to reinvent the wheel. These include the general science package SciPy, the plotting package MatPlotLib, and the biology package BioPython, as well as more specialized tools like the phylogenetic tree package ETE or NCBI's E-Utilities.

2. Approachable

Python's syntax is closer to natural language than that of many other languages, which makes it much easier to read for new and veteran programmers alike. One of the things Python does differently from other popular languages is that it is sensitive to white space. In many other languages, spaces, tabs, and line breaks are largely ignored by the language and serve only to make code readable. In Python, code is organized by indentation, and changing your indentation tells the interpreter which lines of code belong together in a block. This frustrates some veteran programmers who are used to formatting their code in a particular fashion, but it also forces Python code to be organized into a bullet-point like hierarchy that makes it very easy to see the "flow" of the code.
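A small sketch of how indentation groups code into blocks (the expression values are made up for illustration):

```python
expression_levels = [0.3, 12.5, 48.0]  # hypothetical values
highly_expressed = []

for level in expression_levels:
    # Indented once: these lines run for every item in the list.
    if level > 5.0:
        # Indented twice: this line runs only when the condition is true.
        highly_expressed.append(level)

# Back at the left margin: this line runs once, after the loop ends.
print(highly_expressed)
```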

It is also built on a philosophy of explicitness. This means that Python strives to avoid shorthand and implied values or actions. Veteran coders like shorthand that saves them typing, and sometimes feel overly clever by invoking a little known behavior to achieve a desired effect, but this kind of code is incomprehensible to others, especially newbie coders. This is kind of like minimizing the use of pronouns and in-jokes in ordinary language to increase clarity. Python also has an excellent system of exceptions, meaning that when something goes wrong Python is relatively good at telling you what went wrong and where the bug in your code is.
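To give a feel for Python's exceptions, here is a minimal sketch that deliberately triggers one and looks at what Python reports. The variable names are made up for illustration.

```python
import traceback

try:
    read_count = int("ten")   # "ten" is not a number Python can parse
except ValueError as error:
    error_type = type(error).__name__
    error_message = str(error)
    print(error_type)          # the kind of problem that occurred
    print(error_message)       # a description of what went wrong
    traceback.print_exc()      # also reports the line where it happened
```

Without the try and except lines, Python would stop the script and print the same report on its own.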

3. Powerful

Just because Python strives for simplicity doesn't mean it isn't also very powerful. Python is widely used by companies like Google, Dropbox, Netflix, and Spotify. Its wide usage means that there are a lot of people developing for Python. Python's structure is also highly modular, which makes it easy to extend functionality into a new area. As a result, Python has an extensive library of packages - code that other people have already written so that you don't have to.

One of the reasons for Python's wide adoption is that, like Java, it is highly portable. This means that Python programs written on one computer, maybe a Windows machine, can generally be run on any other computer that has a Python interpreter, even if it's running Linux or OS X. This portability helps desktop applications built with Python, such as the Dropbox client, run on Windows, OS X, and Linux.

Other Languages of Interest

BASH: The Bourne Again SHell scripting language is integral to many Unix based computers, including OS X. It is a very powerful shell, and is even widely used on Windows through Cygwin. Some level of knowledge of BASH or its relatives is essential for anyone who wants to do work with computers, and simple tasks that operate on batches of files and simple pipelines are very easy and fast to write with BASH.

R: R is a language written by statisticians and for statisticians. It is one of the most popular languages for data analysis, and continues to gain in popularity rapidly. The widespread development of packages for bioinformatics such as Bioconductor makes it the most powerful statistical language for biologists. R can perform advanced analysis very fast, and can also output beautiful figures. However, R does not share Python's love of beautiful code, and its matrix-centric design can make it extraordinarily frustrating to deal with data that can't be represented as a matrix. Python is much better at scripting, text manipulation, and flow and process controls, while R has powerful statistical tools built in.

Perl: Perl is sometimes considered a competitor to Python, and it also has a strong following of biologists who have developed tools like BioPerl. Perl is renowned for its text processing ability and use of regular expressions (a way to express a pattern of characters). However, Perl is also renowned for being difficult to read and unintelligible to those who don't already know Perl.
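Regular expressions are not unique to Perl; Python's standard re module supports them too. As a rough illustration of what "a pattern of characters" means, this sketch pulls identifiers out of a line of text. The text and the identifier format (the letters NCU followed by exactly five digits) are assumptions made for the example.

```python
import re

# Hypothetical text to search, invented for illustration.
text = "Hits: NCU01234 and NCU09876, but not NCU12."

# Pattern: the letters "NCU" followed by exactly five digits.
pattern = r"NCU\d{5}"

gene_ids = re.findall(pattern, text)
print(gene_ids)
```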

Java: Java is perhaps the most popular language today. Like Python, Java is highly portable. The ability of Java to run on any machine is key to its success.

C & C++: C and C++ are mid-level programming languages that allow you to delve down and manipulate the hardware more directly. This makes them slower to write, and because system architectures differ, C and C++ code needs to be compiled for the machine on which it is to be run. However, the fine level of control makes C the language of choice when highly efficient code is needed.


Part 5: Resources to help you learn Python


Classes at Berkeley

“Python Bootcamp” (Introduction to Programming for Bioinformatics) - 40 hours

Introduction to Bioinformatics - 120 hours
  • the bootcamp turned into a 6 week summer course
  • first 2 weeks will cover Python
  • last 4 weeks will cover statistics and data analysis
  • formal course that grants credits and grades

CS 61A (The Structure and Interpretation of Computer Programs) - 200+ hours
  • intense computer science course for majors
  • all course materials online: http://cs61a.org/
  • teaches deep understanding of programming
  • primer in Python recommended before attempting CS 61A
    • would be good follow-up to the bootcamp

Web Tutorials and Courses

There are many web tutorials and courses available – just search Google. Here are some that my colleagues and I have used that take different approaches. Go ahead and try a few and see which style suits you.

Note that some of these tutorials teach Python 3, some Python 2, and some either. I wouldn't worry about it too much: the differences between Python 2 and 3 are fairly slight, and if you learn one you can easily learn the other. If you aren't sure whether a tutorial is teaching you Python 2 or Python 3, look for this difference:

Python 2 uses a print statement without parentheses, like:
print "Hello, World!"

Python 3 uses a print function with parentheses, like:
print("Hello, World!")

The Python Tutorial
  • the official tutorial for Python in the official Python documentation
  • very extensive
    • covers everything you could need
    • probably covers more than you need
  • written for computer scientists
    • doesn't explain jargon too much
    • may make your head spin at first

Codecademy
  • free hands-on lessons; quizzes and projects are paid
  • web interpreter, so you can start immediately without needing to install Python
  • checks your work

HackerRank
  • minimal tutorial
  • challenge based
  • web interpreter
  • point and rank system gamifies learning

Rosalind
  • bioinformatics specific tutorial platform
  • learn by solving bioinformatics challenges
  • web interpreter

Learn Python the Hard Way
  • learn by copying
  • focuses on basic skills of reading and writing code
    • learning syntax
    • precise typing
    • spotting details in code
  • does not emphasize logic, design, or theory of programming

Coursera
  • many courses on Python from major universities
  • including some specific to bioinformatics

Google's Python Class
  • for people who have studied a little bit of programming or another language

Books

There are also a ton of books on Python. I recommend a book that gives you lots of exercises to practice with between short chapters or lessons. You won't learn Python by simply sitting down and reading a reference book from cover to cover.

Here is a list of introductory books on Python: https://wiki.python.org/moin/IntroductoryBooks

And a list of books specific to science: https://wiki.python.org/moin/ScientificProgrammingBooks

Here are some books that take different approaches:

Learning Python
  • an extensive textbook on Python
  • huge: 1648 pages
  • fairly Windows focused
    • could still be used on Linux or OS X if you understand this workshop
  • lighter on exercises, longer chapters

Learn to Program with Minecraft
  • a fun way to learn if you enjoy video games
  • creatively make your own challenges and projects

Automate the Boring Stuff with Python
  • a fun way to learn if you're into projects to increase efficiency
  • creatively make your own challenges and projects

Help & References

Here are some good places to go when you have questions or get stuck. Don't be afraid to use them. I usually have a web browser open to one of these sites as I code, and I consult them often.

Google (or another search engine)
  • this will usually point you to someone who has already answered your question or to the relevant documentation
  • remember to include “python” or the name of the relevant package in your search
    • e.g. “python search substring”
    • e.g. “matplotlib change axis labels”

The Python Documentation
  • all the reference documentation you need about basic Python
  • doesn't have packages like BioPython
  • doesn't have advice on strategies to solve common problems

Stack Overflow
  • a community question and answer platform
  • answers to almost any question you can imagine
  • don't simply copy-paste
    • make sure you understand why a solution works

Other Useful Tools

Python Visualizer
  • valuable tool to help you understand how Python works under the hood
    • shows how Python interpreter moves through code
    • shows how Python objects are referenced and changed
  • try pasting code here whenever you don't understand why something works the way it does
  • also paste your code here when you are stuck on an error or bug