R is constantly updated and you should download a recent version; the version when this workshop was written was 3.3.2 “Sincere Pumpkin Patch”
I also highly recommend the free open source edition of RStudio integrated development environment
Finally, for this workshop install the ggplot2 and GenomicRanges packages using the following code:
install.packages("ggplot2")
source("http://bioconductor.org/biocLite.R") # This allows you to install Bioconductor packages
biocLite("GenomicRanges")
R is a free and powerful statistical programming language, and also a project, and also an environment. It’s an interpreted language, as opposed to a compiled language. That simply means you run code directly in the R program; there’s no need to compile the code you write. That makes it a bit like Perl or Python, but with different syntax. R syntax originally developed from S as a GNU project, and is probably most similar to the proprietary programming languages Stata and MATLAB.
The most useful function for a new user:
help.start()
Under “An Introduction to R”, there is a great “sample session” in Appendix A.
help.start, you’ll see the function code, which will be pretty meaningless to you; the general syntax to execute a function in R is function()Example: I want to make a kernel density plot, and I have no idea what the function is in R
R reads in data and operates on it as objects. This makes R an object-oriented language, which is one reason that it’s easier for beginners to learn. It also makes it easily extensible.
Let’s make an object using the <-, which is equivalent to the = in most languages for defining objects:
new_obj <- 1
What did we make?
new_obj
The naming of objects can be fairly arbitrary in R, and you can name them with underscores, dots and different cases:
new_obj <- 1
new.obj <- 1
newObj <- 1
To save memory and/or clean up the workspace, you can remove objects with the rm remove function:
rm(new_obj)
rm(new.obj)
rm(newObj)
| Dimension | Homogeneous elements | Heterogeneous elements |
|---|---|---|
| 1 | “Atomic” vector | List |
| 2 | Matrix | Data Frame |
| N | Array |
Use the combine function c to create “atomic” vectors, of which there a 4 most common types in R:
# R assumes here that you want a double vector
dbl_var <- c(1, 2.5, 4.5)
# With the L suffix, you can force an integer vector rather than a double
int_var <- c(1L, 6L, 10L)
# Use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
# A character vector
char_var <- c("these are", "some strings")
Let’s see the built-in example data using the data sets function:
data()
Ok, let’s see what is in the object pressure
pressure
## temperature pressure
## 1 0 0.0002
## 2 20 0.0012
## 3 40 0.0060
## 4 60 0.0300
## 5 80 0.0900
## 6 100 0.2700
## 7 120 0.7500
## 8 140 1.8500
## 9 160 4.2000
## 10 180 8.8000
## 11 200 17.3000
## 12 220 32.1000
## 13 240 57.0000
## 14 260 96.0000
## 15 280 157.0000
## 16 300 247.0000
## 17 320 376.0000
## 18 340 558.0000
## 19 360 806.0000
What kind of R data structure is this object?
We can interrogate an object using the structure display str function:
str(pressure)
How do we access or subset parts of an object?
pressure$temperature
We can subset an individual element from a vector:
pressure$temperature[5]
We can coerce one data type into another:
pressure.matrix <- as.matrix(pressure)
We can subset individual columns and rows:
pressure.matrix[1,]
pressure.matrix[,1]
pressure.matrix[4,2]
Reasonable and nested subsetting is one of the strengths of R:
z <- c(3.4, 2.2, 0.4, -0.4, 1.2)
z[3:5]
z[c(-4, -5)]
z[c(3, 2, 4, 5, 1)]
z[5:1]
rev(z)
order(z)
z[order(z)]
z[c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)]
z>0
z[z>0]
Subsetting extends to a 2-dimensional matrix using the convention [Rows, Columns]:
y <- cbind(z, rev(z))
y[1:2, ]
y[-2, ]
y[order(z), ]
y>0
y[y>0]
First, we use a function rnorm to randomly generate a normally distributed sample of 10 data points:
# Make 10 data points with mean 3.4 and standard deviation of 1
a <- rnorm(10, mean=3.4, sd=1)
a
We can calculate the sample average:
mean(a)
median(a)
How about measures of dispersion?
# sample variance
var(a)
# sample standard deviation
sd(a)
# interquartile range
IQR(a)
# median absolute deviation
mad(a)
The quicker method for some basic stats is to use the summary function in R:
summary(a)
Finally, quitting the R program
q()
Use the code ?<topic> to learn more about a particular function. For example, let’s learn about the plot generic X-Y plotting function using the help function, which can be used by ?:
?plot
and the par graphical parameters function:
?par
The most basic X-Y plot using plot:
plot(0,0)
Ok, what does the pressure data look like?
# X-Y plot of the pressure data set
plot(pressure)
# Fit a locally weighted scatter plot smoothing (LOWESS) by polynomial regression
fit <- lowess(pressure)
# Display the LOWESS fit
lines(fit)
# We can make the LOWESS less smooth by tweaking the argument f
fit.lesssmooth <- lowess(pressure, f=1/3)
# Display the LOWESS fit
lines(fit.lesssmooth, lty=2)
Let’s look at real car engine data instead:
plot(mtcars$disp, mtcars$mpg)
# We'll change some graphical arguments to make it look better
plot(mtcars$disp, mtcars$mpg, xlab="Engine displacement", ylab="Miles per gallon", main="MPG of engines")
# We can add a LOWESS fit
fit.engines <- lowess(mtcars$disp, mtcars$mpg)
lines(fit.engines)
Find out what directory you’re in and change directory:
getwd()
setwd("")
# Read in a table
tab <- read.table("temp-pressure.txt")
# Write out a table
write.table(t(tab), "temp-pressure_transposed.txt")
# Write out a table without the row and column names
write.table(t(tab), "temp-pressure_transposed.txt", col.names=FALSE, row.names=FALSE)
Let’s write a plot to a file:
png("EngineMPG.png",width=1000,height=700)
plot(mtcars$disp, mtcars$mpg, xlab="Engine displacement", ylab="Miles per gallon", main="MPG of engines")
fit.engines <- lowess(mtcars$disp, mtcars$mpg)
lines(fit.engines)
dev.off()
We’ll introduce now the ggplot2 package. A package is a set of functions, data and documentation authored to add functionality to R.
ggplot2 is developed by Hadley Wickham, Chief Scientist for RStudio. ggplot2 is a slightly more user-friendly plotting environment within R that lets you make accurate plots that incorporate features of the grammar of graphics.
If you’d lke to become a ggplot2 expert or see what it’s capable of, check out Wickham’s YouTube video A Backstage Tour of ggplot2
Let’s load the ggplot2 package we installed already:
library(ggplot2)
Now you can learn about the package, using the search version of help:
??ggplot2
# We can use a couple lines of code to break things down in a straightforward way
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot()
# We can also change the presentation of the data in many informative ways
p + geom_boxplot() + geom_jitter(width = 0.2)
a <- rnorm(10, mean=3.4, sd=1)
# Bootstrap
for (i in 1:10) {
print(sample(a, 10, replace=TRUE))
}
bootstrap.custom <- function(data, x) {
for (i in 1:x) {
print(sample(data , 10, replace=TRUE))
}
}
bootstrap.custom(a, 10)
Next, we’ll randomly generate a normally distributed samples of 10 data points with a different mean:
# Make 10 data points with mean 4.8 and standard deviation of 1
b <- rnorm(10, mean = 4.8, sd = 1)
b
We can easily see their distributions in a boxplot:
boxplot(a, b, names = c("a", "b"))
Are they correlated using Pearson’s correlation coefficient?
cor(a, b, method = "pearson")
We can easily perform Student’s t-Test:
t.test(a, b, var.equal=TRUE)
library(IRanges)
x <- IRanges(start=c(40, 80), end=c(67, 114))
x + 4L
y <- IRanges(start=c(4, 6, 10, 12), width=13)
flank(x, width=7)
flank(x, width=7, start=FALSE)
sleuth: let’s look at some QC and differential expressionThis is adapted from the Pachter lab’s sleuthintro guide.
Install sleuth, including rhdf5 and biomaRt from Bioconductor:
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
biocLite("biomaRt")
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
Load in the example data after you download it from here (change the directory below to the correct one on your computer) and run sleuth, including the interactive Shiny app:
library(sleuth)
setwd("~/Downloads/cuffdiff2_data_kallisto_results")
sample_id <- dir(file.path("results"))
kal_dirs <- sapply(sample_id, function(id) file.path("results", id, "kallisto"))
s2c <- read.table(file.path("hiseq_info.txt"), header = TRUE, stringsAsFactors = FALSE)
s2c <- dplyr::select(s2c, sample = run_accession, condition)
s2c <- dplyr::mutate(s2c, path = kal_dirs)
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
host = 'ensembl.org')
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
"external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g)
## Can instead aggregate on the gene level, rather than transcript level:
# so <- sleuth_prep(s2c, ~condition, target_mapping = t2g,
# aggregation_column = 'ens_gene')
so <- sleuth_fit(so)
so <- sleuth_wt(so, 'conditionscramble')
## Or can alternatively do a likelihood ratio test:
# so <- sleuth_fit(so, ~1, 'reduced')
# so <- sleuth_lrt(so, 'reduced', 'full')
# results_table <- sleuth_results(so, 'reduced:full', test_type = 'lrt')
results_table <- sleuth_results(so, test='conditionscramble', test_type = 'wald')
results_table[grep("HOXA1",results_table[,13]),]
sleuth_live(so)
DESeq2 (and tximport) for differential expression:https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
e.g., accurate quantification of transcripts can be obtained using Salmon and then loaded into R with tximport to analyze DE
DESeq2ballgownseqplots: let’s look at some chromatin data with average metaplots, clustering and heatmapsInstall seqplots from Bioconductor and start an interactive session:
source("http://bioconductor.org/biocLite.R") # This allows you to install Bioconductor packages
biocLite("seqplots")
library(seqplots)
run()
Alternatively, try out the online Shiny seqplots demo: https://seqplots.shinyapps.io/seqplots/
Or install the SeqPlots app bundle on your computer: http://przemol.github.io/seqplots/#installation---app-for-mac-win-and-linux
More information and nice guide on seqplots: http://przemol.github.io/seqplots/
The main thing that is useful is the ability to run your R code non-interactively by not invoking the R window.
In Mac OS X terminal or Linux command line:
R CMD BATCH [options] my_script.R [outfile]
Windows command:
“C:\Program Files\R\R-3.2.4\bin\R.exe” CMD BATCH --vanilla --slave “C:\my projects\my_script.R”
Let’s write a postscript:
postscript("EngineMPG.ps",width=7,height=5)
plot(mtcars$disp, mtcars$mpg, xlab="Engine displacement", ylab="Miles per gallon", main="MPG of engines")
fit.engines <- lowess(mtcars$disp, mtcars$mpg)
lines(fit.engines)
dev.off()
This whole file is an R Markdown document.
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents and can be used to author documents on GitHub. For more details on using R Markdown see http://rmarkdown.rstudio.com.
The R Markdown cheatsheet is really useful for getting up and running.
sleuth (after kallisto) for RNA-seq data explorationseqplots for genomic average metaplots, clustering and heatmapsIn Appendix A is a quick sample session to get started in R, some of which is adapted in this workshop. Can also be opened in R:
help.start()
Also, Chapter 12 has a lot of good info about graphics.