Quality+Control,+Mapping+and+Gene-level+Summaries

=Quality Control, Mapping, and Gene-level summary=

=The Data=

We assume you have previously downloaded the y east RNA-Seq data from Risso et al. (2011). The Sherlock Lab in the Department of Genetics at Stanford sequenced 10 strains of //Saccharomyces Cerevisiae// grown in three media, namely YPD, Delft and Glycerol each with 3-4 biological replicates. The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads


 * Download**

One example of replicate is available on the Sequence Read Archive ([|SRA] ), a public repository of high-throughput sequencing data. It can be downloaded [|here]. The default format of an archive downloaded from the SRA is the SRA format. For this one example, we prefer to download the sample directly as a [|FASTQ file]. To do so, click on the run id ([|SRR390934] ) which leads you to the "Run Browser". Go to the "Reads" tab and click on "Filtered download" (which here won't be filtered since we leave the field empty). Choose the FASTQ format and download the data.

Decompress the file and rename it as 'ex.fastq': code $ gunzip sra_data.fastq.gz $ mv sra_data.fastq ex.fastq code Note the size of the file (2.8G) which is quite large knowing that it is only for one sample and in the yeast organism that has a small genome.

A subset of the fastq file named //exsub.fastq// is available.


 * The FASTQ format**

The FASTQ format is text-based format for representing raw reads and provides three main sets of information about the data: the read sequences, their quality scores and the read physical coordinates on tiles (and lanes), as read by the sequencer. All of this information can be combined to evaluate read quality, and filter out bad-quality reads.



Let's have a look at the last three entries of our fastq file: code $ tail -n 13 ex.fastq code

code @SRR390934.14968428.1 COLUMBO:6:120:1790:432.1 length=36 GAGGAATCGATTAAAAACATGGGNGCNNNNNNNNNN +SRR390934.14968428.1 COLUMBO:6:120:1790:432.1 length=36 AC:A@?BA>;:BCC;7B################### @SRR390934.14968429.1 COLUMBO:6:120:1790:319.1 length=36 ACGCTCTTCCGATCTATCCGCACNAGNNNNNNNNNN +SRR390934.14968429.1 COLUMBO:6:120:1790:319.1 length=36 B@1@@A>=A>=@?@?B<################### @SRR390934.14968430.1 COLUMBO:6:120:1790:273.1 length=36 ACGCTCTTCCGATCTCTGAGAGANCTNNNNNNNNNN +SRR390934.14968430.1 COLUMBO:6:120:1790:273.1 length=36 BB5B>AA>A>?A@@?B?################### code

Each record corresponds to a read and comprises four lines. Line 1 begins with a “@” character and is followed by a sequence identifier (SRR390934.1.1). The line may also contain an optional description specific to the sequencing platform: here, the length of the read (length=36). Line 2 reports the sequence of nucleotides. Line 3 begins with a “+” character and is optionally followed by the same sequence identifier (and any description) as in Line 1. Line 4 reports quality scores for the sequence in Line 2. The quality scores are determined by the base-calling algorithm, here, Bustard from the standard Illumina Genome Analyzer pipeline. The Bustard quality score for a given position in the read is defined as the Phred-like score −10 log10 p, where p is the “probability” that the corresponding base-call is incorrect. The numeric scores are then encoded using ASCII characters by adding 64 to the Phred values.

[|FASTQ format] [|Score Quality] [|Encoding]

The sequence and quality parts of a .fastq entry span 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count the number of reads by dividing the number of lines in the file by four: code $ nblines=`cat ex.fastq | wc -l` $ echo $nblines $ expr $nblines / 4
 * Count the number of reads:**

code This run produced 14,968,430 reads.

=Quality control= There are various tools built for simple quality control of raw sequence reads. [|FastQC] is lightweight to download and simple to use, and also takes various data input formats. In order to run, FastQC needs your system to have a suitable Java Runtime Environment (JRE) installed. Please look at [|the installation instructions]. FastQC can be run either in interactive mode or in non-interactive mode using the command line.

code ./FastQC/fastqc
 * 1) Using the interactive mode in Linux:
 * 1) then do Open > File and choose ex.fastqc
 * 2) In MaxOSX, just drag the application from the disk to your Applications folder

$ mkdir fastq $ ~/tools/FastQC/fastqc ex.fastq -o fastqc code
 * 1) Using the command line:
 * 2) create the output directory for FastQC in the current directory
 * 1) Run FastQC on our fastq file

The FastQC report is given as an html file in fastqc/ ex_fastqc/fastqc_report.html.

Example FastQC-generated graphics ( using the // read sequences // and the // scores per base // to summarize the data):

(upper plot) **ACGT content per base** across all reads; In a random library the lines in this plot should run parallel with each other, and the relative amount of each base should reflect the overall amount of these bases in your genome. When they are strongly imbalanced from each other, it means there are strong biases across the read bases. This usually indicates an overrepresented sequence which is contaminating your library. Here we can see there are indeed biases. FastQC provides a list of the **overrepresented sequences** in the output html file. The top two represent more than 4% of the reads.

(bottom plot) boxplot of the **quality score distribution per base position**. In most experiments, quality declines with cycle.    =Mapping=

In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the **genome** or **transcriptome**. The most common way to store a reference is in FASTA format. We will cover how to download a reference in the next section. Depending on how you will be doing your analysis (transcriptome or genome) you will do your mapping differently. Below are the different pipelines: Genome pipeline: Transcriptome pipeline:

Building a Bowtie index
The first step in sequence alignment is acquiring a Bowtie index of your reference(s). You can also download the reference with the annotation: iGenomes instead of building the index and downloading the annotation yourself. To acquire an index, do the following:
 * Check if your index is already built ( Bowtie Indexes)
 * If it does, download it and unzip to your index location
 * If it doesn't, download the genome from UCSC Genome Browser
 * On the left hand side, click "Downloads"
 * Select your model organism (e.g. S. //cerevisiae//)
 * Select a build (e.g. sacCer3)
 * Click "Full data set"
 * Download "chromFa.tar.gz"
 * Unzip it

code $ cat chromFa/*.fa > sacCer3.fa $ # move the file to a specific directory named index $ mkdir index $ mv sacCer3.fa index/ code code $ ~/tools/bowtie2-2.1.0/bowtie2-build -f index/sacCer3.fa index/sacCer3 code You should now see a number of files in directory : code $ ls index sacCer3.1.bt2 sacCer3.3.bt2 sacCer3.fa sacCer3.rev.2.bt2 sacCer3.2.bt2 sacCer3.4.bt2 sacCer3.rev.1.bt2 code
 * Once you have all of your sequences, the easiest way to make an index is to pipe all of your files into a multi-FASTA file:
 * Build the actual index:

Remember to keep the original multi-FASTA file with the index (sacCer3). Programs that use the index often also require the multi-FASTA and will look for it here. If the file is not there, then it will have to rebuild the multi-FASTA which can often take a while.

How TopHat works
TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.

It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.

It also finds junctions by using the assembly of mapped reads which imply exons.

Please see TopHat paper for more information.  **Please note that with yeast, bowtie is enough. B ut since many of you work on other genomes we will use TopHat. **

Choosing an annotation file
You can supply TopHat with an annotation file referencing the known transcripts of your genome of interest as a GTF or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output. Alternatively you could not use an annotation at all, we won't use one for our example.

You can get annotations from the UCSC Genome Browser:
 * Go to @http://genome.ucsc.edu
 * Click "Table Browser" on the left hand side
 * Select the genome and assembly you will be working on
 * Make sure you choose the same assembly for both your genome reference and annotation
 * Select "Group: Genes and Gene Prediction Tracks"
 * Choose a track that contains the isoforms of interest. If you don't know:
 * RefSeq is rather conservative
 * Ensembl is much more complete
 * Ensure "Region: genome" is selected
 * Select "Output: GTF - gene transfer format"
 * Type a relevant "output file" name
 * e.g.: hg19_refGene.gtf
 * Click "get output"

Running a TopHat process
Please carefully read the TopHat manual to choose the parameters relevant to your experiment. In particular if you have paired reads, you will need to specify the expected (mean) inner distance between mate pairs (--mate-inner-distance), and the standard deviation for the distribution on inner distances between mate pairs (--mate-std-dev). We will only align a subset of the fastq file contained in This file contains the first 30,000 read entries stored in ex.fastq (if you haven't done it yet, please download this file in your working directory). code $ tophat -p 3 index/sacCer3 exsub.fastq code -p specifies how many concurrent threads will be run in Bowtie, you should use at most the number of processors available on your computer.

The output files are in the repertory tophat_out: code $ ls tophat_out accepted_hits.bam align_summary.txt deletions.bed insertions.bed junctions.bed logs prep_reads.info unmapped.bam

code It is helpful to write a script in Bash to keep track of the parameters that were used. Below is the same example in a Bash script: code format="c"
 * 1) !/bin/bash

EXP_NAME=exp1 DATA_DIR=~/exp1 FASTQ="ex.fastq" BWT_IDX=~/index/sacCer3 OUT_DIR=$EXP_NAME
 * 1) Change these variables to match your experiment

CMD="tophat -o $OUT_DIR $BWT_IDX $DATA_DIR/$FASTQ"

echo "Running $CMD" $CMD 2>&1 | tee $EXP_NAME.log code
 * 1) Show output on the screen AND save in $EXP_NAME.log

Once you have written your script, save it (as something like, and change the permissions so that you can run it (./exp1.sh).

Understanding the output
 The standard output will be three output files in the directory you specified (using or the default  )

This is where the alignments will be stored in SAM format. If you have a TopHat BAM file, you can see the code that generated the file. It will appear in the "CL:" field of the header of the file: code $ samtools view -H tophat_out/accepted_hits.bam code   Samtools can be used to get some summaries about the alignments contained in a BAM file (more informative when the reads are paired which is not the case in our example).

code $ samtools flagstat tophat_out/accepted_hits.bam

4698975 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 4698975 + 0 mapped (100.00%:-nan%) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (-nan%:-nan%) 0 + 0 with itself and mate mapped 0 + 0 singletons (-nan%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) code   We can also look at the alignments in the BAM file:

code $ samtools view tophat_out/accepted_hits.bam | head

SRR390934.3081264.1 16 2micron 30 50 36M * 0 0 AGCAGCAAAGGTGCATTTTTAAAATATGAAATGAAG CCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCCCCCB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:36 YT:Z:UU NH:i:1 SRR390934.13224401.1 0 2micron 31 50 36M * 0 0 GCAGCAAAGGTGCATTTTTAAAATATGAAATGAAGA BB8ABCA>BB=BBBBBBBBAB@BB@ABBABBA@?BB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:36 YT:Z:UU NH:i:1 SRR390934.14283427.1 0 2micron 39 50 36M * 0 0 GGTGCATTTTTAAAATATGAAATGAAGATACCGCAG BB;BB@BBBBB?AB<>9BCA@BBC>3@@@?A4CA>< AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:36 YT:Z:UU NH:i:1 SRR390934.10178116.1 0 2micron 42 50 36M * 0 0 GCATTTTTAAAATATGAAATGAAGATACCGCAGTAC AA:ACCCB>BCCCACB?BCA=ABBBBBCBBB??ABB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:36 YT:Z:UU NH:i:1

code The file is the file that you should later use for abundance estimation.   You can convert the BAM file into SAM format using samtools view: $ samtools view tophat_out/accepted_hits.bam > accepted_hits.sam  <span style="background-color: #ffffff; display: block; font-family: sans-serif; line-height: 1.5; text-align: justify;">

The file contains a list of the junctions from the annotation and the novel junctions that TopHat found. Since here our example was yeast, no junction was found. <span style="background-color: #ffffff; display: block; font-family: sans-serif; line-height: 1.5; text-align: justify;"> <span style="background-color: #ffffff; display: block; font-family: sans-serif; line-height: 1.5; text-align: justify;">

Count summarization
<span style="background-color: #ffffff; display: block; font-family: sans-serif; line-height: 1.5; text-align: justify;"> Given a set of features of interest (e.g. genes or transcripts), we would like to count the number of reads mapping to these regions. [|HTSeq] is a python package that allows to <span style="background-color: #ffffff; font-family: Arial,Helvetica,sans-serif; font-size: 12px; line-height: 1.5;">process data from high-throughput sequencing assays. It includes a module, [|htseq-count]; dedicated to the count of the number of hits per region of interest.

htseq-count needs two input file: 1) the SAM file containing the alignments of the reads on the genome of interest, 2) a GTF file listing the genomic positions of the features of interest.

We can for example download the [|Ensembl annotations for]//[|S. cerevisiae] .// You have to make sure that the names of the chromosomes in the .sam file and in the .gtf file are the same. Here they are not the same: a prefix 'chr' is used in the chromosome names in the .sam file while only the chromosome numbers are used in the .gtf file. We therefore need to modify the gtf file before running htseq-count. Since the chromosome is the first element of each line, we just need to paste 'chr' at the beginning of each line. We can do that using 'sed': code $ sed -e 's/^/chr/' Saccharomyces_cerevisiae.EF4.65.gtf > ensembl_sacCer3.gtf code code $ python -m HTSeq.scripts.count -m intersection-nonempty -s no accepted_hits.sam -t exon ensembl_sacCer3.gtf > counts.txt code <span style="background-color: #ffffff; display: block; font-family: sans-serif; line-height: 1.5; text-align: justify;">The five last lines of the counts.txt file give some summaries of the hits in the features:

code $ tail -n 5 counts.txt

no_feature   313914 ambiguous   136211 too_low_aQual   0 not_aligned   0 alignment_not_unique   1425668

code <span style="background-color: #ffffff; display: block; font-family: sans-serif; line-height: 1.5; text-align: justify;">