Galaxy,+quality+control,+mapping+and+gene+counts

=Quality Control, Mapping, and Gene-level summary=

=Getting a Galaxy Account= Galaxy is a web interface for running a lot of command line tools used for NGS analysis. It is great for beginners and helps your streamline your data analysis. You can create pipelines (workflows) and share with lab mates. You can also download other open source published workflows to do your own analysis. There is a free public galaxy at usegalaxy.org but today we setup our own private galaxy in the Amazon cloud to work through the RNA-Seq exercise. To access this galaxy first go to https://www.globus.org/SignIn and signup for an account. Activate your account and login to globus. In your globus dashboard click on Groups and search for UCB on the left, pick UCB-Training and become a member of the group.



Then go to https://ucb-training1.globusgenomics.org or https://ucb-training2.globusgenomics.org or https://ucb-training3.globusgenomics.org and sign in to galaxy. =The Data=

The Sherlock Lab in the Department of Genetics at Stanford sequenced 10 strains of //Saccharomyces Cerevisiae// grown in three media, namely YPD, Delft and Glycerol each with 3-4 biological replicates. The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads. Data is available as a data library on Galaxy. It is easy to import data into your history to work on.


 * Import raw data into your history**

Go to the Data Libraries section under Shared Data Click on the RNASeq_Workshop_RawData library. In there you will see raw fastq files for the experiment. The files named SRR are raw files downloaded from the __S__hort __R__ead __A__rchive (SRA) database and are zipped. They can be quite large and analysis on these will take some time, so let's look at the **example.fastq** file which is more manageable. Select it with the checkbox and click Go to import to current history.



Then Go back to the home page of galaxy and you should see the example.fastq file in your history


 * The FASTQ format**

The FASTQ format is text-based format for representing raw reads and provides three main sets of information about the data: the read sequences, their quality scores and the read physical coordinates on tiles (and lanes), as read by the sequencer. All of this information can be combined to evaluate read quality, and filter out bad-quality reads.



To see the contents of the example.fastq file, click on the 'EYE' icon next to it in history. This will show you the contents of the fastq file in the middle pane.

code @SRR390934.5553.1 COLUMBO:6:1:168:1184.1 length=36 ACGCTCTTCCGATCTACGCTCTTCCGATCTATCTAG +SRR390934.5553.1 COLUMBO:6:1:168:1184.1 length=36 BAAABBBBABBBBBBCBBB@B@AAB@@AAA@B@BB@ @SRR390934.5554.1 COLUMBO:6:1:168:256.1 length=36 ACGCTCTTCCGATCTGCTCTTCCGATCTCCTGTCGA +SRR390934.5554.1 COLUMBO:6:1:168:256.1 length=36 BABAAABB>AA=A@@@?@@@>>B;:=A@>@=>1@?? @SRR390934.5555.1 COLUMBO:6:1:168:1432.1 length=36 ATCCCCGAGTCCTTCAATGCTAGATATCCCGTTGAT +SRR390934.5555.1 COLUMBO:6:1:168:1432.1 length=36 BBAABBABB=BBBABAABBBBC@BB>BB@@A?A@B@ @SRR390934.5556.1 COLUMBO:6:1:168:422.1 length=36 CCTGTCGACTCCTTCAATGCTATCATTTCCTTTGAT +SRR390934.5556.1 COLUMBO:6:1:168:422.1 length=36 ABBBBBABABABBBCCCCABCBCBBBCBBABBBBBB @SRR390934.5557.1 COLUMBO:6:1:168:972.1 length=36 CCTTCACGCTTCATACGCGGGTCATAGTTGGCAAAG +SRR390934.5557.1 COLUMBO:6:1:168:972.1 length=36 BBCCCCCBBCCCBC=CCCBB??B@BBCABBABA8A? code Each record corresponds to a read and comprises four lines. Line 1 begins with a “@” character and is followed by a sequence identifier (SRR390934.5553.1). The line may also contain an optional description specific to the sequencing platform: here, the length of the read (length=36). Line 2 reports the sequence of nucleotides. Line 3 begins with a “+” character and is optionally followed by the same sequence identifier (and any description) as in Line 1. Line 4 reports quality scores for the sequence in Line 2. The quality scores are determined by the base-calling algorithm, here, Bustard from the standard Illumina Genome Analyzer pipeline. The Bustard quality score for a given position in the read is defined as the Phred-like score −10 log10 p, where p is the “probability” that the corresponding base-call is incorrect. The numeric scores are then encoded using ASCII characters by adding 33 (newer standard) or 64 (old standard) to the Phred values. code

code [|FASTQ format] [|Score Quality] [|Encoding] code

code The sequence and quality parts of a .fastq entry span 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count the number of reads by dividing the number of lines in the file by four: code
 * Count the number of reads:**

code Number of reads = Line Count/4 = 120,000/4 = 30,000 code

code =Quality control=

There are various tools built for simple quality control of raw sequence reads. [|FastQC] is lightweight to download and simple to use, and also takes various data input formats. code

code To run fastqc, type fastqc in the search tools area in the left pane of Galaxy. Select FastQC: Read QC tool and from the drop down select example.fastq as the input file and click Execute code

code code

code The FastQC report FastQC_example.fastq.html is generated in the history panel. Clicking the 'EYE' icon shows FastQC output plots. code

code code

code Example FastQC-generated graphics (using the //read sequences// and the //scores per base// to summarize the data): code

code (upper plot) **ACGT content per base** across all reads; In a random library the lines in this plot should run parallel with each other, and the relative amount of each base should reflect the overall amount of these bases in your genome. When they are strongly imbalanced from each other, it means there are strong biases across the read bases. This usually indicates an overrepresented sequence which is contaminating your library. Here we can see there are indeed biases. FastQC provides a list of the **overrepresented sequences** in the output html file. The top two represent more than 4% of the reads. code

code (bottom plot) boxplot of the **quality score distribution per base position**. In most experiments, quality declines with cycle.

code

code Read Cleanup code

code Under NGS: QC and manipulation section are various tools to clean up reads. Read cleanup often involves one or more of the following code
 * trimming ends of reads to remove adapters
 * trimming ends to remove low quality reads
 * removing reads with lots of Ns
 * collapsing PCR duplicated reads (often this is done after alignment)

code Here we will trim our reads using the tool Fastq Quality Trimmer. Sickle and Trim Galore are more popular and better suited for paired end reads where maintaining mate pair parity is important. code

code code

code This tool uses a sliding window and averages (or finds min or max, depending on the option selected) the quality score of the bases in that window to determine where to chop off the read. Change window size, step size, aggregate action and quality score threshold as desired and Execute. Clicking on the FASTQ Quality Trimmer output will show you how many reads were processed and how many reads were discarded etc. code

code code

code After you do this you can rerun FastQC on the output of Fastq Quality Groomer to see if your data looks better. code

code It does look a little better at the ends.

code

code Mapping reads to a reference In order to align a set of reads, you need some sort of reference. In RNA-Seq you can use the **genome** or **transcriptome**. The most common way to store a reference is in FASTA format. Depending on how you will be doing your analysis (transcriptome or genome) you will do your mapping differently. Below are the different pipelines: Genome pipeline: Transcriptome pipeline: To align reads to a reference, we need a reference. Let us go get the S.Cerevisiae reference from the data library section, just like we got the example.fastq.



Import both the .fa (yeast genome multi fasta) and the .gtf (yeast gene annotation file, for later) into your history. Normally we would have to make an index from the fasta file, but galaxy let's you use the fasta as is and builds an index on the fly. This is great for small genomes (like yeast) but for larger ones like human, you will usually find a list of indices already built. In the next section we will use the aligner TopHat to map the cleaned up reads to the yeast reference.

How TopHat works
TopHat first uses Bowtie to map reads to the genome. After the initial mapping, there are unmapped reads. TopHat attempts to map these reads against both novel and known junctions. It does this by creating Bowtie indexes for junctions.

It finds junctions by finding all possible canonical sites (GT-AG) with introns between 70 - 20,000 bp long. This is based on the fact that 93% of known mouse introns in the UCSC annotation fall within this range. This is a parameter that can be adjusted if you suspect is incorrect for your model system.

It also finds junctions by using the assembly of mapped reads which imply exons.

Please see TopHat paper for more information.

**Please note that with yeast, bowtie is enough. B ut since many of you work on other genomes we will use TopHat. **

Choosing an annotation file
You can supply TopHat with an annotation file referencing the known transcripts of your genome of interest as a GTF or GFF3 formatted file. code

code If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output. Alternatively you could not use an annotation at all, we won't use one for our example.

You can get annotations from the UCSC Genome Browser:
 * Go to @http://genome.ucsc.edu
 * Click "Table Browser" on the left hand side
 * Select the genome and assembly you will be working on
 * Make sure you choose the same assembly for both your genome reference and annotation
 * Select "Group: Genes and Gene Prediction Tracks"
 * Choose a track that contains the isoforms of interest. If you don't know:
 * RefSeq is rather conservative
 * Ensembl is much more complete
 * Ensure "Region: genome" is selected
 * Select "Output: GTF - gene transfer format"
 * Type a relevant "output file" name
 * e.g.: hg19_refGene.gtf
 * Click "get output"

This is how we got the saccer3.gtf file. You will need a gtf for your own projects, go get one from UCSC or other source.

Running a TopHat process
To run TopHat on the trimmed data find the TopHat tool in the NGS: RNA Analysis section in the left pane and pick the Tophat for Illumina tool.



Here pick the trimmed fastq as the input and select use a genome from history for the reference (because we don't have a pre built index). Select saccer3.fa as the reference genome and single-end alignment option. Keep the settings default for now. There are tons of options that you can change in advanced settings. Read the TopHat manual to familiarize yourself with various options. Of note, if you have paired-end data the distance between the forward and reverse reads is an important parameter to use. If you are not interested in finding indels you can turn it off in the options. If you are not interested in finding novel splice junctions, you can give TopHat a known splice junction annotation file and it will only map to those regions. You can also specify what type of library you are sequencing.

Click Execute after selecting default options. This will take some time to run for large datasets, but for our small toy data it should be fairly fast. There will be four output files, accepted_hits.bam, splice junctions, deletions and insertions.

Understanding the output
The standard output will be three output files in the directory you specified (using or the default  )

This is where the alignments will be stored in SAM format. Actually a binary SAM file or a BAM file. If you have samtools on your computer you can download the accepted_hits.bam file to your computer and look at its contents. Download using the floppy disk icon under the bam file. Download the dataset and the bam_index (they will download to your downloads folder..don't worry about moving them)



These bam alignments can be viewed in genome browsers such as IGV and IGB. Lets view them in IGB. Download IGB from [|Small memory (1 Gb)] This might or may or may not open on its own, download the linked file somewhere and double click to open it. Open IGB and select S.Cerevisiae under species on the right hand pane and Apr2011 under the genome version. Then open the .bam file that you downloaded from galaxy. Then click load data button (right under the selection info bar on the top right). This will load the alignments, you can zoom in using the slider at the top and navigate with the slider at the bottom.



Under NGS: SAM tools section you can find various tools from the samtools suite. Try Flagstat. This will give you information about the alignments.

Samtools can be used to get some summaries about the alignments contained in a BAM file (more informative when the reads are paired which is not the case in our example). code 11671 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 11671 + 0 mapped (100.00%:-nan%) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (-nan%:-nan%) 0 + 0 with itself and mate mapped 0 + 0 singletons (-nan%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) code

We can also look at the alignments in the BAM file, to do this we have to convert our BAM file to SAM (to make it human readable). Use the tool BAM-to-SAM and then click the 'EYE' icon on the output once it is done running. You will see something like this.
 * ~ QNAME ||~ FLAG ||~ RNAME ||~ POS ||~ MAPQ ||~ CIGAR ||~ MRNM ||~ MPOS ||~ ISIZE ||~ SEQ ||~ QUAL ||~ OPT ||
 * SRR390934.16427.1 ||> 272 || chrI ||> 27066 ||> 3 || 36M || * ||> 0 ||> 0 || GTTGGTACCGGTGACGGTGGTCATTTCAGTAGATGT || 6A@:-?24<@>B@@B@=9>ABBBB>7AA@BB@ABBB || NM:i:0 NH:i:2 CC:Z:= CP:i:204270 HI:i:0 ||
 * SRR390934.16629.1 ||> 0 || chrI ||> 34642 ||> 255 || 36M || * ||> 0 ||> 0 || CGGAAAGATCAAGAAAGACTACGAGAATCAATAAAC || BCA?AC>CBCCCAAA=7=AAAB9=@@?B?CB@>6<> || NM:i:0 NH:i:1 ||
 * SRR390934.27425.1 ||> 0 || chrI ||> 35428 ||> 255 || 32M || * ||> 0 ||> 0 || GGCGACCACGTGGTCGTTGATGCTGCCAGCAG || @>@;@BBBBA@@@>A<3=:>>=><@A>7;@5; || NM:i:0 NH:i:1 ||
 * SRR390934.19587.1 ||> 0 || chrI ||> 35632 ||> 255 || 31M || * ||> 0 ||> 0 || GTTCCAAAGGAAATTCCCCTAGATGTGGCTG || ?BCBBCCBABA@BBBBBABBBB@BB?BA@@6 || NM:i:1 NH:i:1 ||
 * SRR390934.26765.1 ||> 0 || chrI ||> 36115 ||> 255 || 36M || * ||> 0 ||> 0 || GAAGACTTCGAAGAAGTTGTTCGTGCCATCCACAAC || BBBABAA@BCCBBAA>@BB@A?B@A@?BAB@BBBB@ || NM:i:0 NH:i:1 ||
 * SRR390934.28030.1 ||> 0 || chrI ||> 40189 ||> 255 || 36M || * ||> 0 ||> 0 || CTGAACGATTTTTTTGTAGACGGAATAAAGAACTTA || :BAACBCCCCCCCCCBBCACBCAACCACCBABACBC || NM:i:0 NH:i:1 ||
 * SRR390934.12601.1 ||> 0 || chrI ||> 46419 ||> 255 || 36M || * ||> 0 ||> 0 || GGAAGACAGTGCAAACAAAGTATGCGGCCTGGCCCA || >CCBBCCCCBACCCABB=BB@C?A@ 0 || chrI ||> 47587 ||> 255 || 36M || * ||> 0 ||> 0 || CTTTTTGTTTTTCAGTAATTTGTTCAAGCAACCAGC || BCCCCCCBCCCCCCCBCBCCCCACCCCCBCBCBCCB || NM:i:0 NH:i:1 ||
 * SRR390934.5459.1 ||> 16 || chrI ||> 57942 ||> 255 || 36M || * ||> 0 ||> 0 || AGCTTTGTTCAGTCATCATGAACCAGTGTCTTTTCG || BBBACBBCBCBACCCCCCCBCBCCCCCBCCCCBBBB || NM:i:0 NH:i:1 ||
 * SRR390934.21101.1 ||> 16 || chrI ||> 58137 ||> 255 || 36M || * ||> 0 ||> 0 || CAGTAGATCTCGGAGGCTGACTTGACGGACTCAATG || ABBBBBABA@ACCCAA?BCBBBCBB@CCBCCCCCCB || NM:i:2 NH:i:1 ||
 * SRR390934.29353.1 ||> 0 || chrI ||> 61367 ||> 255 || 36M || * ||> 0 ||> 0 || GACGTACGAGTCCGCACCAGGGCCGGCGGGCTGATC || BBBCBCBCCC@BABBBBBBA>BAAB=AABA=A@A>= || NM:i:0 NH:i:1 ||
 * SRR390934.4848.1 ||> 0 || chrI ||> 61688 ||> 255 || 25M || * ||> 0 ||> 0 || GGGAACGGCGACGGAACCGCGCCGG || =AAAB?AB=A5=8=4<?>3>=A:<< || NM:i:0 NH:i:1 ||
 * SRR390934.23538.1 ||> 16 || chrI ||> 67033 ||> 255 || 36M || * ||> 0 ||> 0 || TTGGACGAAATCCAAAGCGCGGTCAAGGACAAGAGC || ==:A>3=@AAB@;@B???=@5@067==BAA@B@CAA || NM:i:0 NH:i:1 ||
 * SRR390934.5683.1 ||> 0 || chrI ||> 71755 ||> 255 || 36M || * ||> 0 ||> 0 || GACCAATCAAAACAAATAAAACATCATCACAATGTC || ?CABCCCCBCCBBBBBCCBBCBBB@BBAC@BBBBAB || NM:i:1 NH:i:1 ||
 * SRR390934.16435.1 ||> 0 || chrI ||> 71759 ||> 255 || 36M || * ||> 0 ||> 0 || AATCAAAACAAATAAAACATCATCACAATGTCTAGA || BABBBAB?@AABBBA?@@BBB@A@AAA@B@:AAA== || NM:i:0 NH:i:1 ||
 * SRR390934.6929.1 ||> 0 || chrI ||> 71762 ||> 255 || 36M || * ||> 0 ||> 0 || CAAAACAAATAAAACATCATCACAATGTCTAGATTA || @BCBBCCBCCCCACBCBCCBBBCCCC>9>CBCBCCB || NM:i:0 NH:i:1 ||
 * SRR390934.28084.1 ||> 0 || chrI ||> 71904 ||> 255 || 36M || * ||> 0 ||> 0 || TGAGAAAGGCTGGTTTGAACATTGTCCGTATGAACT || BCBBBBBBBBBCB?BB?ABBBB@>>A@B<<@A>A?B || NM:i:0 NH:i:1 ||
 * SRR390934.8720.1 ||> 0 || chrI ||> 71912 ||> 255 || 36M || * ||> 0 ||> 0 || GCTGGTTTGAACATTGTCCGTATGAACTTCTCTCAC || ?BBBCABBBBCABCBB@BBBACBBBBABBBBBBBBA || NM:i:0 NH:i:1 ||
 * SRR390934.9576.1 ||> 0 || chrI ||> 71943 ||> 255 || 36M || * ||> 0 ||> 0 || CTCACGGTTCTTACGAATACCACAAGTCTGTTATTG || ACCCCBC?BBCBBBCCCCCCCCCBCC?ACCBCBCC@ || NM:i:1 NH:i:1 ||
 * SRR390934.28885.1 ||> 0 || chrI ||> 71962 ||> 255 || 36M || * ||> 0 ||> 0 || CCACAAGTCTGTCATTGACAACGCCAGAAAGTCCGA || BBB>BBB:BAA=ABBBB>@BB?BB 0 || chrI ||> 71996 ||> 255 || 34M || * ||> 0 ||> 0 || GAAGAATTGTACCCAGGTAGACCATTGGCCATTG || ?BCABCBCB@CBBBBBB;BAA@AA@@BBB>@@B5 || NM:i:0 NH:i:1 ||

The file is the file that you should later use for abundance estimation.

The file contains a list of the junctions from the annotation and the novel junctions that TopHat found. Since here our example was yeast, no junction was found.

code

code

Count summarization
Given a set of features of interest (e.g. genes or transcripts), we would like to count the number of reads mapping to these regions. [|HTSeq] is a python package that allows to process data from high-throughput sequencing assays. It includes a module, [|htseq-count]; dedicated to the count of the number of hits per region of interest.

htseq-count needs two input file: 1) the SAM file containing the alignments of the reads on the genome of interest, 2) a GTF file listing the genomic positions of the features of interest. This is where our saccer3.gtf will be useful. Always make sure the chromosome names in the BAM (SAM) file and the annotation file match, otherwise tools won't know how to match the two fields to count.

htseq-count normally takes SAM files but the galaxy version can take BAM files. So find the htseq-count tool under the NGS: RNA Analysis section



Here we pick the accepted_hits.bam file as the aligned file and the saccer3.gtf file as the GFF file. For Mode pick either Union or Intersection(strict) and intersection(non-empty). These modes decide how to count a read that spans one (or multiple) features.



Since our data is not stranded change Stranded option to No. Feature type is exon because we only have exons and no introns. And ID attribute is left as gene_id. Run the program.

This outputs two files, one which has counts over the gene features (from the saccer3.gtf file)(first list below) and another which is a log telling us how many reads were able to be mapped to a feature and how many were not (second list).

code 15S_rRNA   0 21S_rRNA   0 HRA1   0 ICR1   0 LSR1   0 NME1   0 PWR1   0 Q0010   0 Q0017   0 Q0032   0 Q0045   0 Q0050   0 Q0055   0 Q0060   0 Q0065   0 Q0070   0 Q0075   0 Q0080   0 Q0085   0 Q0092   0 Q0105   0 Q0110   0 Q0115   0 code

code no_feature   553 ambiguous   321 too_low_aQual   0 not_aligned   0 alignment_not_unique   4912 code code

code Now we can take these counts for different conditions and normalize and call differential expression