Installing R

install.packages("ggplot2") # Install a package from CRAN
source("http://bioconductor.org/biocLite.R") # This allows you to install Bioconductor packages
biocLite("GenomicRanges")

What is R?

R is a free and powerful statistical programming language; it is also a project and an environment. It’s an interpreted language, as opposed to a compiled one: you run code directly in the R program, with no need to compile the code you write. That makes it a bit like Perl or Python, but with different syntax. R is a GNU project whose syntax developed from the S language, and it is probably most similar to the proprietary programming languages Stata and MATLAB.


What will you be introduced to in this R workshop?


Getting Started in R

The most useful function for a new user:

help.start()

Under “An Introduction to R”, there is a great “sample session” in Appendix A.

Note: if you just type help.start without the parentheses, you’ll see the function’s code, which will be pretty meaningless to you; the general syntax to execute a function in R is function_name()

The most important (RStudio) shortcuts:

  • The up and down arrow keys cycle through your previous entries
  • The tab key can autocomplete
  • cmd + the up arrow (control + the up arrow on Windows) provides suggestions from only what you have previously typed
  • cmd + return (control + enter on Windows) runs the code selected in the editor
  • esc aborts and erases the current line in the console (control + c from within R outside of RStudio)

What if I want to do something really specific, but I have no idea what to do, or the code I wrote doesn’t work?

Example: I want to make a kernel density plot, and I have no idea what the function is in R
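
If you have no idea what the function is called, you can search R’s help system by keyword with help.search (or its shorthand ??). A minimal sketch for the kernel density case, using base R’s density function:

```r
# Search the help system by keyword when you don't know a function's name
help.search("kernel density")   # shorthand: ??"kernel density"

# density() computes a kernel density estimate; plot() then draws it
x <- rnorm(100)
d <- density(x)
plot(d)
```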

Data structures

R reads in data and operates on it as objects. This makes R an object-oriented language, which is one reason it’s easier for beginners to learn. It also makes R easily extensible.

Let’s make an object using the assignment operator <-, which is equivalent to the = used in most languages for defining objects:

new_obj <- 1

What did we make?

new_obj

The naming of objects can be fairly arbitrary in R, and you can name them with underscores, dots and different cases:

new_obj <- 1
new.obj <- 1
newObj <- 1

R is case sensitive. Try to avoid naming objects the same as existing functions and objects in R.
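
A quick demonstration of case sensitivity (value and Value are just illustrative names):

```r
# value and Value are two distinct objects, because R is case sensitive
value <- 10
Value <- 20
value == Value  # FALSE
```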

To save memory and/or clean up the workspace, you can remove objects with the remove function rm:

rm(new_obj)
rm(new.obj)
rm(newObj)

The general scheme of R data structures

Dimension   Homogeneous elements   Heterogeneous elements
1           “Atomic” vector        List
2           Matrix                 Data frame
N           Array
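
You can see the homogeneous/heterogeneous distinction from the table directly: an atomic vector coerces all of its elements to a single type, while a list keeps each element’s type. For example:

```r
# Mixing types in an atomic vector coerces everything to character
v <- c(1, "a")
str(v)

# A list preserves each element's type
l <- list(1, "a")
str(l)
```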

Use the combine function c to create “atomic” vectors, of which there are 4 common types in R:

# R assumes here that you want a double vector
dbl_var <- c(1, 2.5, 4.5)

# With the L suffix, you can force an integer vector rather than a double
int_var <- c(1L, 6L, 10L)

# Use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)

# A character vector
char_var <- c("these are", "some strings")
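
You can confirm which type c produced with the typeof function:

```r
dbl_var <- c(1, 2.5, 4.5)
int_var <- c(1L, 6L, 10L)
log_var <- c(TRUE, FALSE, T, F)
char_var <- c("these are", "some strings")

# typeof() reports the storage type of each vector
typeof(dbl_var)   # "double"
typeof(int_var)   # "integer"
typeof(log_var)   # "logical"
typeof(char_var)  # "character"
```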

Let’s see the built-in example data using the data sets function:

data()

Ok, let’s see what is in the object pressure

pressure
##    temperature pressure
## 1            0   0.0002
## 2           20   0.0012
## 3           40   0.0060
## 4           60   0.0300
## 5           80   0.0900
## 6          100   0.2700
## 7          120   0.7500
## 8          140   1.8500
## 9          160   4.2000
## 10         180   8.8000
## 11         200  17.3000
## 12         220  32.1000
## 13         240  57.0000
## 14         260  96.0000
## 15         280 157.0000
## 16         300 247.0000
## 17         320 376.0000
## 18         340 558.0000
## 19         360 806.0000

What kind of R data structure is this object?

We can interrogate an object using the structure display str function:

str(pressure)
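
A few other interrogation functions complement str; here they are applied to the pressure data printed above:

```r
class(pressure)    # "data.frame"
dim(pressure)      # 19 rows, 2 columns
names(pressure)    # column names
head(pressure, 3)  # just the first 3 rows
```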

How do we access or subset parts of an object?

pressure$temperature

We can subset an individual element from a vector:

pressure$temperature[5]

We can coerce one data type into another:

pressure.matrix <- as.matrix(pressure)
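
The as.* family also coerces between element types, not just structures. A few base examples:

```r
as.numeric(c("1", "2.5"))   # character to double
as.character(1:3)           # integer to character
as.integer(c(TRUE, FALSE))  # logical to integer: 1 0
as.numeric("abc")           # fails gracefully: NA, with a coercion warning
```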

We can subset individual columns and rows:

pressure.matrix[1,]
pressure.matrix[,1]
pressure.matrix[4,2]

Flexible and nested subsetting is one of the strengths of R:

z <- c(3.4, 2.2, 0.4, -0.4, 1.2)
z[3:5]
z[c(-4, -5)]
z[c(3, 2, 4, 5, 1)]
z[5:1]
rev(z)
order(z)
z[order(z)]
z[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
z>0
z[z>0]

Subsetting extends to a 2-dimensional matrix using the convention [Rows, Columns]:

y <- cbind(z, rev(z))
y[1:2, ]
y[-2, ]
y[order(z), ]
y>0
y[y>0]

Basic functions

First, we use the function rnorm to randomly generate a normally distributed sample of 10 data points:

# Make 10 data points with mean 3.4 and standard deviation of 1
a <- rnorm(10, mean=3.4, sd=1)
a
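
rnorm draws are random, so your numbers will differ from those of the person next to you. For reproducible draws, set the random number seed first (42 below is an arbitrary choice):

```r
# Same seed, same draws
set.seed(42)
a1 <- rnorm(10, mean = 3.4, sd = 1)
set.seed(42)
a2 <- rnorm(10, mean = 3.4, sd = 1)
identical(a1, a2)  # TRUE
```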

We can calculate measures of central tendency, the sample mean and median:

mean(a)
median(a)

How about measures of dispersion?

# sample variance
var(a)

# sample standard deviation
sd(a)

# interquartile range
IQR(a)

# median absolute deviation
mad(a)

The quicker method for some basic stats is to use the summary function in R:

summary(a)

Finally, to quit the R program:

q()

Each function has its own inputs, outputs and arguments (parameters), which can be confusing. How do I use a particular function?

Use ?<topic> to learn more about a particular function. For example, let’s learn about the generic X-Y plotting function plot using the help function, which can be invoked with ?:

?plot

and the par graphical parameters function:

?par

Plotting

The most basic X-Y plot using plot:

plot(0,0)

Ok, what does the pressure data look like?

# X-Y plot of the pressure data set
plot(pressure)

# Fit a locally weighted scatterplot smoothing (LOWESS) curve by polynomial regression
fit <- lowess(pressure)

# Display the LOWESS fit
lines(fit)

# We can make the LOWESS fit less smooth by reducing the smoother span f (the default is 2/3)
fit.lesssmooth <- lowess(pressure, f=1/3)

# Display the less smooth fit with a dashed line
lines(fit.lesssmooth, lty=2)

Let’s look at real car engine data instead:

plot(mtcars$disp, mtcars$mpg)

# We'll change some graphical arguments to make it look better
plot(mtcars$disp, mtcars$mpg, xlab="Engine displacement", ylab="Miles per gallon", main="MPG of engines")

# We can add a LOWESS fit
fit.engines <- lowess(mtcars$disp, mtcars$mpg)
lines(fit.engines)

File handling

Find out what directory you’re in and change directory:

getwd()
setwd("") # Put the path to your working directory between the quotes

# Read in a table
tab <- read.table("temp-pressure.txt")

# Write out a table
write.table(t(tab), "temp-pressure_transposed.txt")

# Write out a table without the row and column names
write.table(t(tab), "temp-pressure_transposed.txt", col.names=FALSE, row.names=FALSE)
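
Since you may not have a temp-pressure.txt file yet, here is a self-contained round trip using the built-in pressure data and a temporary file:

```r
# Write the built-in pressure data frame out, then read it back in
tmp <- tempfile(fileext = ".txt")
write.table(pressure, tmp, row.names = FALSE)
tab2 <- read.table(tmp, header = TRUE)
all.equal(tab2, pressure)  # should be TRUE
unlink(tmp)                # clean up the temporary file
```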

Let’s write a plot to a file:

png("EngineMPG.png",width=1000,height=700)
plot(mtcars$disp, mtcars$mpg, xlab="Engine displacement", ylab="Miles per gallon", main="MPG of engines")
fit.engines <- lowess(mtcars$disp, mtcars$mpg)
lines(fit.engines)
dev.off()

Packages

Now we’ll introduce the ggplot2 package. A package is a set of functions, data and documentation authored to add functionality to R.

ggplot2 is developed by Hadley Wickham, Chief Scientist for RStudio. ggplot2 is a slightly more user-friendly plotting environment within R that lets you make accurate plots that incorporate features of the grammar of graphics.

If you’d like to become a ggplot2 expert or see what it’s capable of, check out Wickham’s YouTube video A Backstage Tour of ggplot2

Let’s load the ggplot2 package we installed already:

library(ggplot2)

Now you can learn about the package, using the search version of help:

??ggplot2

The most general up-to-date documentation is on the ggplot2 website

# We can build up a plot in a couple of lines of code in a straightforward way
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot()

# We can also change the presentation of the data in many informative ways
p + geom_boxplot() + geom_jitter(width = 0.2)

User-defined functions and loops

a <- rnorm(10, mean=3.4, sd=1)

# Bootstrap
for (i in 1:10) {
  print(sample(a, 10, replace=TRUE))
}

bootstrap.custom <- function(data, x) {
  for (i in 1:x) {
    print(sample(data, length(data), replace=TRUE))
  }
}

bootstrap.custom(a, 10)
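
Building on this, a common use of the bootstrap is estimating the standard error of a statistic. A sketch in base R, with a hypothetical helper boot.se:

```r
# Bootstrap standard error of the mean: resample, recompute, take the SD
boot.se <- function(data, n.reps = 1000) {
  boot.means <- replicate(n.reps,
                          mean(sample(data, length(data), replace = TRUE)))
  sd(boot.means)
}

a <- rnorm(10, mean = 3.4, sd = 1)
boot.se(a)               # compare to the analytic estimate below
sd(a) / sqrt(length(a))  # standard error of the mean
```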

Statistical tests

Next, we’ll randomly generate a normally distributed sample of 10 data points with a different mean:

# Make 10 data points with mean 4.8 and standard deviation of 1
b <- rnorm(10, mean = 4.8, sd = 1)
b

We can easily see their distributions in a boxplot:

boxplot(a, b, names = c("a", "b"))

Are they correlated? Let’s compute Pearson’s correlation coefficient:

cor(a, b, method = "pearson")

We can easily perform Student’s t-Test:

t.test(a, b, var.equal=TRUE)
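
t.test returns an object whose components you can extract individually, which is handy in scripts. A sketch (with a seed set for reproducibility, unlike the examples above):

```r
set.seed(1)
a <- rnorm(10, mean = 3.4, sd = 1)
b <- rnorm(10, mean = 4.8, sd = 1)

tt <- t.test(a, b, var.equal = TRUE)
tt$p.value   # just the p-value
tt$conf.int  # 95% confidence interval for the difference in means
tt$estimate  # the two sample means
```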

Interacting with genomic data in R

Vince Buffalo’s “Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools”, Chapters 8 through 10

Genomic ranges

library(IRanges)
# Two ranges: [40, 67] and [80, 114]
x <- IRanges(start=c(40, 80), end=c(67, 114))
# Arithmetic expands each range by 4 bases on both ends
x + 4L
# Ranges can also be specified by start and width
y <- IRanges(start=c(4, 6, 10, 12), width=13)
# 7-base flanking ranges upstream of x (and downstream with start=FALSE)
flank(x, width=7)
flank(x, width=7, start=FALSE)

RNA-seq

Demo using sleuth: let’s look at some QC and differential expression

This is adapted from the Pachter lab’s sleuth intro guide.

Install sleuth, including rhdf5 and biomaRt from Bioconductor:

source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
biocLite("biomaRt")
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")

After you download the example data from here, load it in (change the directory below to the correct one on your computer) and run sleuth, including the interactive Shiny app:

library(sleuth)
setwd("~/Downloads/cuffdiff2_data_kallisto_results")
sample_id <- dir(file.path("results"))
kal_dirs <- sapply(sample_id, function(id) file.path("results", id, "kallisto"))
s2c <- read.table(file.path("hiseq_info.txt"), header = TRUE, stringsAsFactors = FALSE)
s2c <- dplyr::select(s2c, sample = run_accession, condition)
s2c <- dplyr::mutate(s2c, path = kal_dirs)
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "hsapiens_gene_ensembl",
  host = 'ensembl.org')
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
    "external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g)

## Can instead aggregate on the gene level, rather than transcript level:
# so <- sleuth_prep(s2c, ~condition, target_mapping = t2g,
#  aggregation_column = 'ens_gene')

so <- sleuth_fit(so)
so <- sleuth_wt(so, 'conditionscramble')

## Or can alternatively do a likelihood ratio test:
# so <- sleuth_fit(so, ~1, 'reduced')
# so <- sleuth_lrt(so, 'reduced', 'full')
# results_table <- sleuth_results(so, 'reduced:full', test_type = 'lrt')

results_table <- sleuth_results(so, test='conditionscramble', test_type = 'wald')
# Look up HOXA1 in the results by its external gene name
results_table[grep("HOXA1", results_table$ext_gene), ]

sleuth_live(so)

An exceptionally good guide and cookbook for using DESeq2 (and tximport) for differential expression:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf

e.g., accurate transcript quantifications can be obtained using Salmon and then loaded into R with tximport for differential expression analysis

Also, the CGRL has a full workshop on differential expression using DESeq2

Also, there is a great newer pipeline for RNA-seq mapping, including transcript annotation & quantification, with its own R package ballgown

Pertea et al. “Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown” Nature Protocols 11, 1650–1667 (2016) doi:10.1038/nprot.2016.095


Genomic localization

Demo using seqplots: let’s look at some chromatin data with average metaplots, clustering and heatmaps

Install seqplots from Bioconductor and start an interactive session:

source("http://bioconductor.org/biocLite.R") # This allows you to install Bioconductor packages
biocLite("seqplots")
library(seqplots)
run()

Alternatively, try out the online Shiny seqplots demo: https://seqplots.shinyapps.io/seqplots/

Or install the SeqPlots app bundle on your computer: http://przemol.github.io/seqplots/#installation---app-for-mac-win-and-linux

More information and nice guide on seqplots: http://przemol.github.io/seqplots/


Advanced use: integration into pipelines

The main useful feature is the ability to run your R code non-interactively, without opening an interactive R session (the Rscript command is a simpler alternative to R CMD BATCH).

In Mac OS X terminal or Linux command line:

R CMD BATCH [options] my_script.R [outfile]

Windows command:

"C:\Program Files\R\R-3.3.3\bin\R.exe" CMD BATCH --vanilla --slave "C:\my projects\my_script.R"


Advanced use: publishing results

Let’s write a postscript:

postscript("EngineMPG.ps",width=7,height=5)
plot(mtcars$disp, mtcars$mpg, xlab="Engine displacement", ylab="Miles per gallon", main="MPG of engines")
fit.engines <- lowess(mtcars$disp, mtcars$mpg)
lines(fit.engines)
dev.off()

Advanced use: reproducible research

This whole file is an R Markdown document.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents and can be used to author documents on GitHub. For more details on using R Markdown see http://rmarkdown.rstudio.com.

The R Markdown cheatsheet is really useful for getting up and running.


What did we cover in this workshop?


How do I keep learning about R?

Fantastic cheatsheets from the RStudio team

https://www.rstudio.com/resources/cheatsheets/

The Manual: An Introduction to R by W. Venables, D. Smith and the R Core Team

Appendix A contains a quick sample session to get started in R, some of which is adapted in this workshop. The manual can also be opened from within R:

help.start()

Also, Chapter 12 has a lot of good info about graphics.

The manual is a super-dense companion that becomes very useful as you grow more familiar with R, or if you’re already familiar with other languages. Some of the companion material was adapted in this workshop.

The 3-/4-day SCF D-Lab R bootcamp at Berkeley typically happens each summer. Also, they have shorter workshops on specialized topics in R and other languages/environments.

The Department of Statistics has online Tutorials on advanced R topics.

For expert advice on clean and efficient R coding and the most advanced topics, like complex stats, machine learning, integrating different languages and parallelization, Chris Paciorek is a great point of contact on campus (paciorek at stat.berkeley.edu).