Recent Changes

Wednesday, May 10

  1. page CGRL user guide workshop edited ... What compute nodes are available? How do I run my jobs? Example interactive Slurm session …
    ...
    What compute nodes are available?
    How do I run my jobs?
    Example interactive Slurm sessionGetting data into the BRC HPC environment
    # see what Slurm jobs are running on CGRL's Vector cluster:
    squeue -p vector
    log in to the DTN
    ssh username@dtn.brc.berkeley.edu
    # log in to the Genomic Sequencing Laboratory's FTP server
    lftp ftp://gslftp@gslserver.qb3.berkeley.edu

    Example interactive Slurm session
    # see what Slurm jobs are running on CGRL's Vector cluster
    ...
    # add something like the following: export R_LIBS_USER="global/home/users/$USER/R"
    source ~/.bashrc
    module load r/3.2.5
    R

    (view changes)
    1:29 pm
  2. page CGRL user guide workshop edited ... # see what Slurm jobs are running on CGRL's Vector cluster: squeue -p vector # Example int…
    ...
    # see what Slurm jobs are running on CGRL's Vector cluster:
    squeue -p vector
    #Example interactive Slurm session
    # see what Slurm jobs are running on CGRL's Vector cluster
    squeue -p vector
    #
    see all
    ...
    at the moment:moment
    squeue -u $USER
    ...
    CGRL's Rosalind condo:condo
    srun --pty --partition=savio2_htc --account=co_rosalind --qos=rosalind_htc2_normal --time=00:30:00 bash -i
    ...
    you're currently running:running
    echo $SLURM_JOB_ID
    scontrol show job $SLURM_JOB_ID
    ...
    # exit your bash session
    exit
    into the [R] command line. This will bring up a text page with detailed information about the function
    Getting Help with [R]
    Google your question! Write a short description and add on at the end “R-help”. This will most likely bring up a thread in the R-help forum that deals with the same or similar problem. Almost every problem you could have can be found answered on this forum. You can also subscribe to the [R] mailing list by going to https://stat.ethz.ch/mailman/listinfo/r-help. A guide for posting questions can be found at http://www.r-project.org/posting-guide.html
    Elizabeth Purdom has several [R] tutorials that can be found at http://www.stat.berkeley.edu/~epurdom/. In particular, see http://www.stat.berkeley.edu/~epurdom/Saving/Saving.html for using directories and saving objects and www.stat.berkeley.edu/~epurdom/BrownLab/ for creating and manipulating data sets in [R].
    Installing edgeR
    With an internet connection, edgeR can be easily installed by running the following commands in [R]
    Example
    Example Slurm batch:batch job: RNA-Seq quantification with kallisto
    # download some example data to your Savio (Rosalind condo) scratch folder

    cd /global/scratch/$USER
    wgetcurl -L https://www.dropbox.com/sh/0abnf67z8m9iv02/AADbq28QEqBXmfPFe7jVvbiLa?dl=1 > download.zip
    unzip download.zip
    # editing batch a script
    vim kallisto_for_workshop.sh
    # running the Slurm batch job with 4 CPUs on a Savio2 HTC nodes in CGRL's Rosalind condo

    sbatch kallisto_for_workshop.sh
    ...
    40 test_R1.fastq test_2.fastqtest_R2.fastq
    # check output

    head test/abundance.tsv
    Customizing, locally installing
    # setting a local library directory for installing R packages

    cd ~
    vim .bashrc
    ...
    like the following to specify local package library location for R:following: export R_LIBS_USER="global/home/users/$USER"R_LIBS_USER="global/home/users/$USER/R"
    source ~/.bashrc
    Usually, a RNA-Seq data analysis starts with a set of FASTQ files which contain information on both the sequence and the quality of the short reads.
    There are several tools to align the reads to the reference genome (e.g. Bowtie, TopHat, ...).
    You have just seen how to align the reads to a reference genome using Tophat. Assume we already performed this step for all the replicates and we produced a table of read counts for each gene.
    We will consider an example based on the data analyzed in [1]. The Sherlock Lab in the Department of Genetics at Stanford sequenced 14 strains of Saccharomyces Cerevisiae grown in three media, namely YPD, Delft and Glycerol each with 3-8 biological replicates.
    The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie, considering only unique mapping and allowing up to two mismatches.
    The read count for a given gene is defined as the number of reads with 5’-end falling within the corresponding region. The gene-level counts for this example are provided in the yeastRNASeqRisso2011 R package.
    In addition to the read counts, the R package contains some sample and gene information, such as GC-content and gene length.
    To install the package you need to download the file from this link then in an R session type (on a Mac):
    install.packages("~/Downloads/yeastRNASeqRisso2011_0.0.2.tar.gz",repos=NULL,type="source")
    Windows users should simply replace "~/Downloads/" with the appropriate path to their Downloads directory.
    If you can't install the package, download the data from here:
    {geneLevelCounts.rda}
    Exploratory Analysis
    Setting up the count matrix
    First, we load the data into [R] and make sure we can access the read counts
    library(yeastRNASeqRisso2011)
    data(geneLevelCounts)
    data(laneInfo)
    data(geneInfo)
    head(geneLevelCounts)
    head(geneInfo)
    laneInfo
    We want to filter out the non-expressed genes. For simplicity, we consider only the genes with an average read count of 10 or more.
    means <- rowMeans(geneLevelCounts)
    filter <- means >= 10
    table(filter)
    geneLevelCounts <- geneLevelCounts[filter,]
    dim(geneLevelCounts)
    Looking at the data
    One of the main characteristics of RNA-Seq data is that each library will generally have a different sequencing depth.
    We can visualize the total number of mapped reads to known genes with the barplot function. To check for systematic effects we can color-code the plot by different biological or technical variables
    colors <- c(rep("yellow3",8),rep("blue",3),rep("green",3))
    totCounts <- colSums(geneLevelCounts)
    barplot(totCounts, las=2, col=colors)
    barplot(totCounts, las=2, col=c("red","blue")[laneInfo[,4]])
    The boxplot function provides an easy way to visualize the difference in distribution between each experiment.
    boxplot(geneLevelCounts, las=2, col=colors)
    boxplot(log2(geneLevelCounts+1), las=2, col=colors)
    Normalization
    Building the edgeR object
    DGEList is the function that coverts the count matrix into an edgeR object.
    In addition to the counts, we need to group the samples according to the variable of interest in our experiment.
    We can then see the elements that the object contains by using the names function
    library(edgeR)
    group <- laneInfo[,2]
    group <- droplevels(group)
    counts <- geneLevelCounts
    cds <- DGEList( counts , group = group )
    names(cds)
    These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object.
    head(cds$counts) # original count matrix
    cds$samples # contains a summary of your samples
    sum(cds$all.zeros) # How many genes have 0 counts across all samples
    Normalization in edgeR
    We can calculate the normalization factors which correct for the different sequencing depth of each library (or library size).
    cds <- calcNormFactors(cds)
    cds$samples
    By default, the function calcNormFactors normalize the data using the "weighted trimmed mean of M-values" (TMM) method, proposed by [2]. Other options are RLE [3] and upper-quartile [4]. If we want to use the upper-quartile to normalize, we can add an extra argument to the function
    cds <- calcNormFactors(cds, method="upperquartile")
    cds$samples
    If we want the normalized pseudo-counts, useful for instance for cluster analysis, we can get them with the following commands.
    scale <- cds$samples$lib.size*cds$samples$norm.factors
    normCounts <- round(t(t(counts)/scale)*mean(scale))
    boxplot(log2(normCounts+1), las=2, col=colors)
    Multi-Dimensional Scaling Plot
    An MDS plot measures the similarity of the samples and projects this measure into 2-dimensions. This can be useful for quality control and visualization of your samples.
    plotMDS(cds, main="MDS Plot for Count Data", labels=colnames(cds$counts))
    Differential Expression Testing
    A two-class comparison
    As a first example, let's focus on the comparison between the three Delft and the three Glycerol yeast libraries.
    group <- laneInfo[9:14,2]
    group <- droplevels(group)
    counts <- geneLevelCounts[, 9:14]
    cds <- DGEList( counts , group = group )
    cds <- calcNormFactors(cds, method="upperquartile")
    Estimating Dispersion
    First, we need to estimate the common dispersion: this assumes that all the genes have the same dispersion.
    cds <- estimateCommonDisp(cds, verbose=TRUE)
    To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be
    sqrt(200) # poisson sd
    sqrt(200 + 200^2 * cds$common.dispersion) # negative binomial
    sqrt(200 + 200^2 * cds$common.dispersion) / sqrt(200) # ratio: NB sd is almost 5 times larger
    In real experiments, the assumption of common dispersion is almost never met. Often, we observe a relation between mean counts and dispersion, i.e., the more expressed genes have less dispersion.
    The way edgeR estimates a tagwise (i.e. gene-wise) dispersion parameter is by "shrinking" the gene-wise dispersions toward a common value (the common dispersion estimated in the previous step). Alternatively, one can shrink the gene-wise estimates to a common trend, by estimating a smooth function prior to the shrinkage (using the estimateTrendedDisp function). Here we keep things simple and shrink the estimates to the common value
    cds <- estimateTagwiseDisp(cds)
    We can plot the biological coefficient of variation (i.e. the square root of the dispersions) with the following
    plotBCV(cds)
    Another useful plot is the mean-variance relation.
    meanVarPlot <- plotMeanVar(cds, show.raw.vars=TRUE, show.tagwise.vars=TRUE, show.binned.common.disp.vars=FALSE, show.ave.raw.vars=FALSE, NBline = TRUE , nbins = 100 , #these are arguments about what is plotted
    pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" ) #these arguments are to make it look prettier
    Testing for DE
    The function exactTest performs pair-wise tests for differential expression between two groups. The important parameter is pair which indicates which two groups should be compared. The output of exactTest is a list of elements: we can get the table of the results with the topTags function.
    et <- exactTest(cds, pair=levels(group))
    topTags(et)
    top <- topTags(et, n=nrow(cds$counts))$table
    head(top)
    We can store the ID of the DE genes and look at the distribution of the p-values
    de <- rownames(top[top$FDR<0.05,])
    length(de)
    head(de)
    hist(top$PValue, breaks=20)
    Visualizing the results
    We can use the function plotSmear to produce a mean-difference plot
    plotSmear(cds , de.tags=de)
    abline(h=c(-2, 2), col="green")
    Another useful plot is the "volcano plot" that relates log-fold-changes and p-values
    plot(top$logFC, -log10(top$PValue), pch=20, cex=.5, ylab="-log10(p-value)", xlab="logFC", col=as.numeric(rownames(top) %in% de)+1)
    abline(v=c(-2, 2), col="green")
    Finally, we can plot a heatmap of the top genes and extract lists of up- and down-regulated genes.
    heatmap(log(normCounts[de[1:500],9:14]+1), ColSideColor=colors[9:14], col=rev(heat.colors(20)))
    upreg <- rownames(top[rownames(top) %in% de & top$logFC > 0,])
    downreg <- rownames(top[rownames(top) %in% de & top$logFC < 0,])
    Outputting the results
    We can write the results to a file that we can later open in excel or in other programs
    write.table(top, file="two-class-results.txt", sep='\t', quote=FALSE)
    A three-class comparison
    In some experiments, the design is more complicated than a two-class comparison. edgeR can deal with complicated designs via a Generalized Linear Model (GLM) approach.
    As an example, we will consider the three-class comparison of YPD, Delft and Glycerol in the yeast data.
    The main ingredient that we will need in the GLM approach is the design matrix. This matrix tells the model what are the important variables in the experiment. These can be either biological (e.g., growth condition, KO/WT, disease status, ...) or technical (e.g., library prep protocol, flow-cell, ...). To keep things simple, we will model only the growth condition.
    Building the edgeR object
    Again, we need to build an edgeR object (this time with all the samples)
    group <- laneInfo[,2]
    counts <- geneLevelCounts
    cds <- DGEList( counts , group = group )
    names(cds)
    In addition, we need the design matrix. This can be constructed via the model.matrix function
    design <- model.matrix(~group)
    design
    Normalization
    The normalization step is the same as in the two-class comparison.
    cds <- calcNormFactors(cds, method="upperquartile")
    cds$samples
    Estimating Dispersion
    As before, we need to estimate dispersion. There are "GLM versions" of the functions we used before. Remember to add the design matrix as parameter.
    cds <- estimateGLMCommonDisp(cds, design, verbose=TRUE)
    cds <- estimateGLMTagwiseDisp(cds, design)
    We can plot the biological coefficient of variation and the mean-variance relation
    plotBCV(cds)
    meanVarPlot <- plotMeanVar(cds, show.raw.vars=TRUE, show.tagwise.vars=TRUE, show.binned.common.disp.vars=FALSE, show.ave.raw.vars=FALSE, NBline = TRUE , nbins = 100 , pch = 16 , xlab ="Mean Expression (Log10 Scale)" , ylab = "Variance (Log10 Scale)" , main = "Mean-Variance Plot" )
    Testing for DE
    The function glmFit fits a GLM to the data and the function glmLRT performs a Likelihood Ratio Test (LRT) for the null hypothesis that the variable of interest does not influence the gene expression. The coef argument refers to the column numbers of the design matrix to be tested. Alternatively, one can use contrasts to test a specific linear combination of the variables (see the package vignette for details).
    fit <- glmFit(cds, design)
    lrt <- glmLRT(fit, coef=2:3)
    The topTags function can still be used to extract the significant genes.
    top <- topTags(lrt, n=nrow(cds$counts))$table
    head(top)
    We can store the ID of the DE genes and look at the distribution of the p-values.
    de <- rownames(top[top$FDR<0.05,])
    length(de)
    head(de)
    hist(top$PValue, breaks=20)
    Visualizing the results
    We can use the function plotSmear to produce a mean-difference plot.
    plotSmear(cds, de.tags=de, pair=c("YPD", "Del"))
    abline(h=c(-2, 2), col="green")
    plotSmear(cds, de.tags=de, pair=c("YPD", "Gly"))
    abline(h=c(-2, 2), col="green")
    plotSmear(cds, de.tags=de, pair=c("Del", "Gly"))
    abline(h=c(-2, 2), col="green")
    Finally, we can plot a heatmap of the top genes and again extract lists of genes highly and lowly expressed in each condition.
    heatmap(log(X.upperquartile[de[1:500],]+1), ColSideColor=colors, col=rev(heat.colors(20)))
    upregDel <- rownames(top[rownames(top) %in% de & top$logFC.groupGly < 0 & top$logFC.groupYPD < 0,])
    downregDel <- rownames(top[rownames(top) %in% de & top$logFC.groupGly > 0 & top$logFC.groupYPD > 0,])
    upregYPD <- rownames(top[rownames(top) %in% de & top$logFC.groupYPD > 0 & top$logFC.groupYPD > top$logFC.groupGly,])
    downregYPD <- rownames(top[rownames(top) %in% de & top$logFC.groupYPD < 0 & top$logFC.groupYPD < top$logFC.groupGly,])
    upregGly <- rownames(top[rownames(top) %in% de & top$logFC.groupGly > 0 & top$logFC.groupGly > top$logFC.groupYPD,])
    downregGly <- rownames(top[rownames(top) %in% de & top$logFC.groupGly < 0 & top$logFC.groupGly < top$logFC.groupYPD,])
    Outputting the results
    We can write the results to a file that we can later open in excel or in other programs.
    write.table(top, file="three-class-results.txt", sep='\t', quote=FALSE)
    R script
    Here is the script from the workshop demonstration, which just compiles most of the code provided above.
    {RNAseqWorkshop.R}
    References
    Risso D, Schwartz K, Sherlock G, Dudoit S (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics 12:480
    Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11:R25.
    Anders S, Huber W (2010). Differential expression analysis for sequence count data. Genome Biology 11:R106.
    Bullard JH, Purdom E, Hansen KD, Dudoit S (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11:94
    Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
    Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23, 2881-2887
    Robinson MD and Smyth GK (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9, 321-332

    (view changes)
    1:20 pm
  3. file test_R2.fastq uploaded
    12:53 pm
  4. page CGRL user guide workshop edited ... With an internet connection, edgeR can be easily installed by running the following commands i…
    ...
    With an internet connection, edgeR can be easily installed by running the following commands in [R]
    Example Slurm batch: RNA-Seq quantification with kallisto
    cd /global/scratch/$USER
    wget

    sbatch kallisto_for_workshop.sh test_genome_index test 40 test_R1.fastq test_2.fastq
    head test/abundance.tsv
    (view changes)
    12:53 pm
  5. page CGRL user guide workshop edited ... CGRL User Guide for Computing Workshop Introduction What Being a user of the CGRL gives yo…
    ...
    CGRL User Guide for Computing Workshop
    Introduction
    WhatBeing a user of the CGRL gives you access to computing resources are available for CGRL users?
    edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used
    in edgeR do NOT support FPKM, RPKM or other types of data that are not counts.
    You can obtain more information about edgeR from
    the bioconductor website. Here you’ll findBerkeley Research Computing (BRC) high-performance computing (HPC) environment, including a user’s guide/vignette with detailed examples,large variety of genomic and bioinformatic programs. Getting to know how to use these resources efficiently is a reference manualchallenge, even for all edgeR functions, andthose familiar with command-line use. In this workshop, we'll introduce the [R] codeCGRL resources available to users, including new computing resources for your largest genomics projects and recommendations for best practices.
    Starting point: Berkeley Research Computing (BRC) supercluster high-performance computing (HPC) user guide
    http://research-it.berkeley.edu/services/high-performance-computing/user-guide
    CGRL-specific user guide
    for installing the package. A useful discussion on a forum about edgeR can be found on a SeqAnswers Thread .
    Getting Help with edgeR
    The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive
    Vector cluster and very helpful;Rosalind condo in Savio
    http://research-it.berkeley.edu/services/high-performance-computing/cgrl-vectorrosalind-user-guide
    How do
    I have used it several times. To subscribe go to https://stat.ethz.ch/mailman/listinfo/bioconductor . For guides to posting questions go to http://www.bioconductor.org/help/mailing-list/posting-guide/ .
    If you just need help understanding
    sign up for an edgeR function or wantaccount, link my account to see an example ofBRC HPC, and log in?
    Where should I store data and
    how it’s used simply typeshould I move it around?
    What compute nodes are available?
    How do I run my jobs?

    Example interactive Slurm session
    # see what Slurm jobs are running on CGRL's Vector cluster:
    (view changes)
    12:50 pm
  6. page CGRL user guide workshop edited ... The bioconductor website provides access to a mailing list that all can subscribe to. Any ques…
    ...
    The bioconductor website provides access to a mailing list that all can subscribe to. Any questions concerning any bio-conductor package can be sent to this mailing list. They are very responsive and very helpful; I have used it several times. To subscribe go to https://stat.ethz.ch/mailman/listinfo/bioconductor . For guides to posting questions go to http://www.bioconductor.org/help/mailing-list/posting-guide/ .
    If you just need help understanding an edgeR function or want to see an example of how it’s used simply type
    ?functionnameExample interactive Slurm session
    # see what Slurm jobs are running on CGRL's Vector cluster:
    squeue -p vector
    # see all of my jobs running at the moment:
    squeue -u $USER
    # start an interactive bash session Slurm job with 1 CPU on a Savio2 HTC node in CGRL's Rosalind condo:
    srun --pty --partition=savio2_htc --account=co_rosalind --qos=rosalind_htc2_normal --time=00:30:00 bash -i
    # see information about the job you're currently running:
    echo $SLURM_JOB_ID
    scontrol show job $SLURM_JOB_ID
    # see all of the jobs running on the HTC node partition of the Savio2 cluster
    squeue -p savio2_htc
    # exit your bash session
    exit

    into the [R] command line. This will bring up a text page with detailed information about the function
    Getting Help with [R]
    ...
    Installing edgeR
    With an internet connection, edgeR can be easily installed by running the following commands in [R]
    Example session:Slurm batch: RNA-Seq quantification
    sbatch kallisto_for_workshop.sh test_genome_index test 40 test_R1.fastq test_2.fastq
    head test/abundance.tsv
    (view changes)
    12:13 pm
  7. page CGRL user guide workshop edited ... Installing edgeR With an internet connection, edgeR can be easily installed by running the fo…
    ...
    Installing edgeR
    With an internet connection, edgeR can be easily installed by running the following commands in [R]
    Example session: RNA-Seq quantification with kallisto
    sbatch kallisto_for_workshop.sh test_genome_index test 40 test_R1.fastq test_2.fastq
    head test/abundance.tsv

    cd ~
    vim .bashrc
    ...
    R: export R_LIBS="global/home/users/$USER"R_LIBS_USER="global/home/users/$USER"
    source ~/.bashrc
    source("http://www.bioconductor.org/biocLite.R")
    biocLite("edgeR")
    Make sure you have a recent version of R (>=2.15), or some of the initial commands will not work.
    Example session: RNA-Seq quantification with kallisto

    Usually, a RNA-Seq data analysis starts with a set of FASTQ files which contain information on both the sequence and the quality of the short reads.
    There are several tools to align the reads to the reference genome (e.g. Bowtie, TopHat, ...).
    (view changes)
    11:58 am

More