We assume you have previously downloaded the yeast RNA-Seq data from Risso et al. (2011).
The Sherlock Lab in the Department of Genetics at Stanford sequenced 10 strains of Saccharomyces Cerevisiae grown in three media, namely YPD, Delft and Glycerol each with 3-4 biological replicates.
The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads
Download
One example of replicate is available on the Sequence Read Archive (SRA), a public repository of high-throughput sequencing data. It can be downloaded here. The default format of an archive downloaded from the SRA is the SRA format. For this one example, we prefer to download the sample directly as a FASTQ file. To do so, click on the run id (SRR390934) which leads you to the "Run Browser". Go to the "Reads" tab and click on "Filtered download" (which here won't be filtered since we leave the field empty). Choose the FASTQ format and download the data.
Note the size of the file (2.8G) which is quite large knowing that it is only for one sample and in the yeast organism that has a small genome.
A subset of the fastq file named exsub.fastq is available here.
The FASTQ format
The FASTQ format is text-based format for representing raw reads and provides three main sets of information about the data: the read sequences, their quality scores and the read physical coordinates on tiles (and lanes), as read by the sequencer. All of this information can be combined to evaluate read quality, and filter out bad-quality reads.
Let's have a look at the last three entries of our fastq file:
Each record corresponds to a read and comprises four lines. Line 1 begins with a “@” character and is followed by a sequence identifier (SRR390934.1.1). The
line may also contain an optional description specific to the sequencing platform: here, the length of the read (length=36). Line 2 reports the sequence of nucleotides. Line 3 begins with a “+” character and is optionally followed by the same sequence identifier (and any description) as in Line 1.
Line 4 reports quality scores for the sequence in Line 2. The quality scores are determined by the base-calling algorithm, here, Bustard from the standard Illumina Genome Analyzer pipeline. The Bustard quality score for a given position in the read is defined as the Phred-like score −10 log10 p, where p is the “probability” that the corresponding base-call is incorrect. The numeric scores are then encoded using ASCII characters by adding 64 to the Phred values.
Count the number of reads:
The sequence and quality parts of a .fastq entry span 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count the number of reads by dividing the number of lines in the file by four:
There are various tools built for simple quality control of raw sequence reads. FastQC is lightweight to download and simple to use, and also takes various data input formats.
In order to run, FastQC needs your system to have a suitable Java Runtime Environment (JRE) installed. Please look at the installation instructions.
FastQC can be run either in interactive mode or in non-interactive mode using the command line.
### Using the interactive mode in Linux:
./FastQC/fastqc
# then do Open > File and choose ex.fastqc
## In MaxOSX, just drag the application from the disk to your Applications folder
### Using the command line:
# create the output directory for FastQC in the current directory
$ mkdir fastq
# Run FastQC on our fastq file
$ ~/tools/FastQC/fastqc ex.fastq -o fastqc
The FastQC report is given as an html file in fastqc/ex_fastqc/fastqc_report.html.
Example FastQC-generated graphics (using the read sequences and the scores per base to summarize the data):
(upper plot) ACGT content per base across all reads;
In a random library the lines in this plot should run parallel with each other, and the relative amount of each base should reflect the overall amount of these bases in your genome.
When they are strongly imbalanced from each other, it means there are strong biases across the read bases. This usually indicates an overrepresented sequence which is contaminating your library.
Here we can see there are indeed biases. FastQC provides a list of the overrepresented sequences in the output html file. The top two represent more than 4% of the reads.
(bottom plot) boxplot of the quality score distribution per base position. In most experiments, quality declines with cycle.
Mapping
In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the genome or transcriptome. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section.
Depending on how you will be doing your analysis (transcriptome or genome) you will do your mapping differently. Below are the different pipelines:
Genome pipeline:
Transcriptome pipeline:
Building a Bowtie index
The first step in sequence alignment is acquiring a Bowtie index of your reference(s).
You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself.
To acquire an index, do the following:
You should now see a number of files in directory index/:
$ ls index
sacCer3.1.bt2 sacCer3.3.bt2 sacCer3.fa sacCer3.rev.2.bt2
sacCer3.2.bt2 sacCer3.4.bt2 sacCer3.rev.1.bt2
Remember to keep the original multi-FASTA file with the index (sacCer3.fa). Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.
Mapping with TopHat
How TopHat works
TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.
It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.
It also finds junctions by using the assembly of mapped reads which imply exons.
Please see TopHat paper for more information. Please note that with yeast, bowtie is enough. But since many of you work on other genomes we will use TopHat.
Choosing an annotation file
You can supply TopHat with an annotation file referencing the known transcripts of your genome of interest as a GTF or GFF3 formatted file.
If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output. Alternatively you could not use an annotation at all, we won't use one for our example.
Select the genome and assembly you will be working on
Make sure you choose the same assembly for both your genome reference and annotation
Select "Group: Genes and Gene Prediction Tracks"
Choose a track that contains the isoforms of interest. If you don't know:
RefSeq is rather conservative
Ensembl is much more complete
Ensure "Region: genome" is selected
Select "Output: GTF - gene transfer format"
Type a relevant "output file" name
e.g.: hg19_refGene.gtf
Click "get output"
Running a TopHat process
Please carefully read the TopHat manual to choose the parameters relevant to your experiment. In particular if you have paired reads, you will need to specify the expected (mean) inner distance between mate pairs (--mate-inner-distance), and the standard deviation for the distribution on inner distances between mate pairs (--mate-std-dev). We will only align a subset of the fastq file contained in exsub.fastq. This file contains the first 30,000 read entries stored in ex.fastq (if you haven't done it yet, please download this file in your working directory).
$ tophat -p 3 index/sacCer3 exsub.fastq
-p specifies how many concurrent threads will be run in Bowtie, you should use at most the number of processors available on your computer.
It is helpful to write a script in Bash to keep track of the parameters that were used. Below is the same example in a Bash script:
#!/bin/bash# Change these variables to match your experiment
EXP_NAME=exp1
DATA_DIR=~/exp1
FASTQ="ex.fastq"
BWT_IDX=~/index/sacCer3
OUT_DIR=$EXP_NAME
CMD="tophat -o $OUT_DIR $BWT_IDX $DATA_DIR/$FASTQ"
echo "Running $CMD"# Show output on the screen AND save in $EXP_NAME.log
$CMD 2>&1| tee $EXP_NAME.log
Once you have written your script, save it (as something like exp1.sh), and change the permissions (chmod 755 exp1.sh) so that you can run it (./exp1.sh).
Understanding the output
The standard output will be three output files in the directory you specified (using -o or the default tophat_out/ )
accepted_hits.bam
junctioned.bed
accepted_hits.bam
This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file.
It will appear in the "CL:" field of the header of the file:
$ samtools view -H tophat_out/accepted_hits.bam
Samtools can be used to get some summaries about the alignments contained in a BAM file (more informative when the reads are paired which is not the case in our example).
$ samtools flagstat tophat_out/accepted_hits.bam
4698975 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
4698975 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
We can also look at the alignments in the BAM file:
The accepted_hits.bam file is the file that you should later use for abundance estimation. You can convert the BAM file into SAM format using samtools view: $ samtools view tophat_out/accepted_hits.bam > accepted_hits.sam
junctions.bed
The junctioned.bed file contains a list of the junctions from the annotation and the novel junctions that TopHat found.
Since here our example was yeast, no junction was found.
Count summarization
Given a set of features of interest (e.g. genes or transcripts), we would like to count the number of reads mapping to these regions. HTSeq is a python package that allows to process data from high-throughput sequencing assays.
It includes a module, htseq-count; dedicated to the count of the number of hits per region of interest.
htseq-count needs two input file: 1) the SAM file containing the alignments of the reads on the genome of interest, 2) a GTF file listing the genomic positions of the features of interest.
We can for example download the Ensembl annotations for S. cerevisiae .
You have to make sure that the names of the chromosomes in the .sam file and in the .gtf file are the same.
Here they are not the same: a prefix 'chr' is used in the chromosome names in the .sam file while only the chromosome numbers are used in the .gtf file.
We therefore need to modify the gtf file before running htseq-count. Since the chromosome is the first element of each line, we just need to paste 'chr' at the beginning of each line.
We can do that using 'sed':
$ sed -e 's/^/chr/' Saccharomyces_cerevisiae.EF4.65.gtf > ensembl_sacCer3.gtf
Quality Control, Mapping, and Gene-level summary
The Data
We assume you have previously downloaded the yeast RNA-Seq data from Risso et al. (2011).
The Sherlock Lab in the Department of Genetics at Stanford sequenced 10 strains of Saccharomyces Cerevisiae grown in three media, namely YPD, Delft and Glycerol each with 3-4 biological replicates.
The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads
Download
One example of replicate is available on the Sequence Read Archive (SRA), a public repository of high-throughput sequencing data. It can be downloaded here.
The default format of an archive downloaded from the SRA is the SRA format.
For this one example, we prefer to download the sample directly as a FASTQ file. To do so, click on the run id (SRR390934) which leads you to the "Run Browser". Go to the "Reads" tab and click on "Filtered download" (which here won't be filtered since we leave the field empty). Choose the FASTQ format and download the data.
Decompress the file and rename it as 'ex.fastq':
Note the size of the file (2.8G) which is quite large knowing that it is only for one sample and in the yeast organism that has a small genome.
A subset of the fastq file named exsub.fastq is available here.
The FASTQ format
The FASTQ format is text-based format for representing raw reads and provides three main sets of information about the data: the read sequences, their quality scores and the read physical coordinates on tiles (and lanes), as read by the sequencer. All of this information can be combined to evaluate read quality, and filter out bad-quality reads.
Let's have a look at the last three entries of our fastq file:
Each record corresponds to a read and comprises four lines. Line 1 begins with a “@” character and is followed by a sequence identifier (SRR390934.1.1). The
line may also contain an optional description specific to the sequencing platform: here, the length of the read (length=36).
Line 2 reports the sequence of nucleotides.
Line 3 begins with a “+” character and is optionally followed by the same sequence identifier (and any description) as in Line 1.
Line 4 reports quality scores for the sequence in Line 2. The quality scores are determined by the base-calling algorithm, here, Bustard from the standard Illumina Genome Analyzer pipeline. The Bustard quality score for a given position in the read is defined as the Phred-like score −10 log10 p, where p is the “probability” that the corresponding base-call is incorrect. The numeric scores are then encoded using ASCII characters by adding 64 to the Phred values.
FASTQ format
Score Quality
Encoding
Count the number of reads:
The sequence and quality parts of a .fastq entry span 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count the number of reads by dividing the number of lines in the file by four:
This run produced 14,968,430 reads.
Quality control
There are various tools built for simple quality control of raw sequence reads. FastQC is lightweight to download and simple to use, and also takes various data input formats.
In order to run, FastQC needs your system to have a suitable Java Runtime Environment (JRE) installed. Please look at the installation instructions.
FastQC can be run either in interactive mode or in non-interactive mode using the command line.
The FastQC report is given as an html file in fastqc/ex_fastqc/fastqc_report.html.
Example FastQC-generated graphics (using the read sequences and the scores per base to summarize the data):
(upper plot) ACGT content per base across all reads;
In a random library the lines in this plot should run parallel with each other, and the relative amount of each base should reflect the overall amount of these bases in your genome.
When they are strongly imbalanced from each other, it means there are strong biases across the read bases. This usually indicates an overrepresented sequence which is contaminating your library.
Here we can see there are indeed biases. FastQC provides a list of the overrepresented sequences in the output html file. The top two represent more than 4% of the reads.
(bottom plot) boxplot of the quality score distribution per base position.
In most experiments, quality declines with cycle.
Mapping
In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the genome or transcriptome. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section.
Depending on how you will be doing your analysis (transcriptome or genome) you will do your mapping differently. Below are the different pipelines:
Genome pipeline:
Transcriptome pipeline:
Building a Bowtie index
The first step in sequence alignment is acquiring a Bowtie index of your reference(s).You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself.
To acquire an index, do the following:
- Build the actual index:
You should now see a number of files in directory index/:Remember to keep the original multi-FASTA file with the index (sacCer3.fa). Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.
Mapping with TopHat
How TopHat works
TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.
It also finds junctions by using the assembly of mapped reads which imply exons.
Please see TopHat paper for more information.
Please note that with yeast, bowtie is enough. But since many of you work on other genomes we will use TopHat.
Choosing an annotation file
You can supply TopHat with an annotation file referencing the known transcripts of your genome of interest as a GTF or GFF3 formatted file.If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.
Alternatively you could not use an annotation at all, we won't use one for our example.
You can get annotations from the UCSC Genome Browser:
Running a TopHat process
Please carefully read the TopHat manual to choose the parameters relevant to your experiment. In particular if you have paired reads, you will need to specify the expected (mean) inner distance between mate pairs (--mate-inner-distance), and
the standard deviation for the distribution on inner distances between mate pairs (--mate-std-dev).
We will only align a subset of the fastq file contained in exsub.fastq.
This file contains the first 30,000 read entries stored in ex.fastq (if you haven't done it yet, please download this file in your working directory).
-p specifies how many concurrent threads will be run in Bowtie, you should use at most the number of processors available on your computer.
The output files are in the repertory tophat_out:
It is helpful to write a script in Bash to keep track of the parameters that were used. Below is the same example in a Bash script:
Once you have written your script, save it (as something like exp1.sh), and change the permissions (chmod 755 exp1.sh) so that you can run it (./exp1.sh).
Understanding the output
The standard output will be three output files in the directory you specified (using -o or the default tophat_out/ )
accepted_hits.bam
This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file.It will appear in the "CL:" field of the header of the file:
Samtools can be used to get some summaries about the alignments contained in a BAM file (more informative when the reads are paired which is not the case in our example).
We can also look at the alignments in the BAM file:
The accepted_hits.bam file is the file that you should later use for abundance estimation.
You can convert the BAM file into SAM format using samtools view:
$ samtools view tophat_out/accepted_hits.bam > accepted_hits.sam
junctions.bed
The junctioned.bed file contains a list of the junctions from the annotation and the novel junctions that TopHat found.
Since here our example was yeast, no junction was found.
Count summarization
Given a set of features of interest (e.g. genes or transcripts), we would like to count the number of reads mapping to these regions.
HTSeq is a python package that allows to process data from high-throughput sequencing assays.
It includes a module, htseq-count; dedicated to the count of the number of hits per region of interest.
htseq-count needs two input file: 1) the SAM file containing the alignments of the reads on the genome of interest, 2) a GTF file listing the genomic positions of the features of interest.
We can for example download the Ensembl annotations for S. cerevisiae .
You have to make sure that the names of the chromosomes in the .sam file and in the .gtf file are the same.
Here they are not the same: a prefix 'chr' is used in the chromosome names in the .sam file while only the chromosome numbers are used in the .gtf file.
We therefore need to modify the gtf file before running htseq-count. Since the chromosome is the first element of each line, we just need to paste 'chr' at the beginning of each line.
We can do that using 'sed':
The five last lines of the counts.txt file give some summaries of the hits in the features: