Basic raw data quality control

I. **The FASTQ format**

The FASTQ format is a text-based format for representing raw reads. It provides three main sets of information about the data: the read sequences, their quality scores, and the physical coordinates of each read on its tile (and lane), as recorded by the sequencer. All of this information can be combined to evaluate read quality and filter out low-quality reads.



[|FASTQ format] [|Score Quality] [|Encoding]
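As a minimal sketch of how those three kinds of information can be pulled out of a file, the Python snippet below reads four-line FASTQ records and extracts lane, tile and x/y coordinates from the read identifier. It assumes the older Illumina identifier layout `instrument:lane:tile:x:y` (for example `@HWUSI-EAS100R:6:73:941:1973#0/1`) and sequences that are not wrapped over several lines; other header layouts would need a different pattern. The names `FastqRecord`, `read_fastq` and `parse_position` are made up for this illustration.

```python
import re
from typing import Dict, Iterable, Iterator, NamedTuple, Optional


class FastqRecord(NamedTuple):
    header: str    # line 1 of the record, without the leading '@'
    sequence: str  # line 2: the bases
    quality: str   # line 4: one encoded score character per base


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming exactly 4 lines per record."""
    it = iter(line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        plus = next(it, None)
        quality = next(it, None)
        if sequence is None or plus is None or quality is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Assumed layout (older Illumina pipelines): instrument:lane:tile:x:y
_POSITION = re.compile(
    r"^[^:]+:(?P<lane>\d+):(?P<tile>\d+):(?P<x>-?\d+):(?P<y>-?\d+)"
)


def parse_position(header: str) -> Optional[Dict[str, int]]:
    """Return lane, tile, x and y as integers, or None if the layout differs."""
    match = _POSITION.match(header)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    example = [
        "@HWUSI-EAS100R:6:73:941:1973#0/1\n",
        "GATTT\n",
        "+\n",
        "hhhhB\n",
    ]
    for record in read_fastq(example):
        print(record.sequence, record.quality, parse_position(record.header))
```

For a real file, `read_fastq(open(path))` could be used in place of the in-memory list.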

II. **FastQC**

There are various tools built for simple quality control of raw sequence reads. [|FastQC] is a small download, simple to use, and accepts several input data formats.

Example FastQC-generated graphics: (left) ACGT content per base across all reads; (right) boxplot of the quality score distribution per base position.



This is an example of using the //read sequences// and the //scores per base// to summarize the data.
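The per-position summary behind a plot like the boxplot can be sketched in a few lines of Python. This is not FastQC's code, only an illustration of the idea: decode each quality character to a numeric score and take the median at every base position. The function name `per_base_medians` is hypothetical, and the default offset of 64 is an assumption that matches Illumina 1.3/1.5 style files (Sanger-encoded files use 33).

```python
from statistics import median
from typing import List, Sequence


def per_base_medians(qualities: Sequence[str], offset: int = 64) -> List[float]:
    """Median decoded score at each base position across all reads.

    `qualities` holds the quality line of each read; `offset` is the ASCII
    offset of the encoding (assumed 64 here, 33 for Sanger-style files).
    Reads shorter than the longest read simply do not contribute to the
    positions they lack.
    """
    if not qualities:
        return []
    longest = max(len(q) for q in qualities)
    medians = []
    for pos in range(longest):
        column = [ord(q[pos]) - offset for q in qualities if pos < len(q)]
        medians.append(median(column))
    return medians


if __name__ == "__main__":
    # Three toy reads of length 3; the last position is uniformly poor.
    print(per_base_medians(["hhB", "hgB", "fhB"]))
```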

III. **tileQC**

[|tileQC] is a tile-based approach to quality control. Summarizing scores spatially can reveal areas of bad quality caused by the sequencing process or the library preparation. Unfortunately, I found tileQC quite hard to install and use, so I wrote my own script to plot read quality scores spatially (see IV). This means associating the read and its score with its position on a given tile, and summarizing scores over tile areas.

In practice, the position is in the first line of each FASTQ record (the read identifier, which some files repeat on the third line), and the scores are in the fourth line.
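Summarizing scores over tile areas can be sketched as binning reads onto a square grid and averaging within each cell. The Python below is an illustration under assumptions, not the code of tileQC or of tileqc.R: the function name `grid_means` is invented, each read is reduced beforehand to one `(x, y, score)` triple, and `ngrid` is taken to mean the number of cells per side.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]


def grid_means(
    points: Iterable[Tuple[float, float, float]], ngrid: int = 200
) -> Dict[Cell, float]:
    """Average the score of all reads falling into each cell of the grid.

    `points` holds one (x, y, score) triple per read on a single tile.
    Cells that contain no reads are absent from the result.
    """
    pts = list(points)
    if not pts:
        return {}
    x_min, x_max = min(p[0] for p in pts), max(p[0] for p in pts)
    y_min, y_max = min(p[1] for p in pts), max(p[1] for p in pts)

    def cell_index(value: float, low: float, high: float) -> int:
        if high == low:
            return 0  # all reads share this coordinate
        # Scale to [0, ngrid] and clamp the maximum into the last cell.
        return min(int((value - low) / (high - low) * ngrid), ngrid - 1)

    sums: Dict[Cell, float] = defaultdict(float)
    counts: Dict[Cell, int] = defaultdict(int)
    for x, y, score in pts:
        key = (cell_index(x, x_min, x_max), cell_index(y, y_min, y_max))
        sums[key] += score
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    reads = [(0, 0, 40), (1, 1, 30), (100, 100, 2), (99, 99, 4)]
    print(grid_means(reads, ngrid=2))
```

A coarser grid averages more reads per cell and gives a smoother picture; a finer grid shows more detail but leaves more cells with few or no reads.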

Here are some examples for three different tiles of //some// dataset.




IV. **Tile visual quality in R**

Instructions on how to run the script:

1/ Download [|tileqc.R]

2/ Download [|example_fastq_file.txt.zip] to the same directory and unzip it. Alternatively, copy your own FASTQ file to the same directory. Use single-lane data only.

3/ Start R in the same directory.

4/ > source('tileqc.R')

5/ Now we are ready to plot tile scores. But first we have to select some inputs.

tileqc.R loads a set of functions, of which we need tileVisualQuality. That function takes 5 inputs:

-- **fastq file name**: If you're using the workshop example file, you need 'example_fastq_file.txt'.
-- **encoding**: If none is selected, the default encoding is 'illumina1.5'. Other possibilities are 'sanger', 'illumina1.0', 'illumina1.3'. The right value depends on how your FASTQ data file was produced.
-- **tile**: Tile number for which you want to plot the quality scores. If none is selected, all tile plots will be generated.
-- **base**: Base position for which the scores should be plotted. This can range from 1 to the length of one read. If empty, the median of the base scores of every read is plotted.
-- **ngrid**: This is essentially the resolution: the size of the grid on which all reads in one tile are averaged. The default is 200.
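To make the **encoding** and **base** inputs concrete, here is a hedged Python sketch of what they imply for a single read. The encoding names are the ones listed above; the offsets and the Solexa-to-Phred conversion are the standard definitions of those formats, but whether tileqc.R performs the same conversion is not stated here, so treat this as an illustration rather than a description of the script. `decode_scores` and `read_score` are hypothetical names.

```python
import math
from statistics import median
from typing import List, Optional

# Phred-scaled encodings and their ASCII offsets.
PHRED_OFFSETS = {"sanger": 33, "illumina1.3": 64, "illumina1.5": 64}


def decode_scores(quality: str, encoding: str = "illumina1.5") -> List[float]:
    """Turn a FASTQ quality string into Phred-scaled numeric scores."""
    if encoding in PHRED_OFFSETS:
        offset = PHRED_OFFSETS[encoding]
        return [float(ord(char) - offset) for char in quality]
    if encoding == "illumina1.0":
        # Solexa-scaled scores with offset 64, converted to the Phred scale:
        # Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
        return [
            10 * math.log10(10 ** ((ord(char) - 64) / 10) + 1)
            for char in quality
        ]
    raise ValueError("unknown encoding: " + encoding)


def read_score(
    quality: str, encoding: str = "illumina1.5", base: Optional[int] = None
) -> float:
    """Score of one read: at a 1-based base position, or the median if None."""
    scores = decode_scores(quality, encoding)
    if base is None:
        return median(scores)
    if not 1 <= base <= len(scores):
        raise ValueError("base must be between 1 and the read length")
    return scores[base - 1]


if __name__ == "__main__":
    print(read_score("II5", "sanger"))          # median over the read
    print(read_score("II5", "sanger", base=3))  # score at the third base
```

Picking the wrong encoding shifts every score by a constant (31 between the offset-33 and offset-64 formats), which is why this input matters for the plots.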

Executing the function means entering a variation of the following at the prompt:

> tileVisualQuality(fname, encoding, tile, base, ngrid)

6/ Examples:

Using the default values:

> source('tileqc.R')
> fname <- 'example_fastq_file.txt'
> tileVisualQuality(fname,,,,)

Changing the default values (encoding and ngrid are still default):

> fname <- 'example_fastq_file.txt'
> tile <- 14
> base <- 1
> tileVisualQuality(fname,,tile,base,)

Increasing the resolution (could be slower and grainier):

> fname <- 'example_fastq_file.txt'
> tile <- 14
> base <- 1
> ngrid <- 500
> tileVisualQuality(fname,'',tile,base,ngrid)

Example of low-quality base position:

> fname <- 'example_fastq_file.txt'
> tile <- 14
> base <- 33
> ngrid <- 200
> tileVisualQuality(fname,'',tile,base,ngrid)

7/ Disclaimer: Use this code at your own risk, especially because I am an R novice. Contact me if you want to follow the code updates. A more refined and faster version is available in Python [|here]. The current version works for single-lane FASTQ files only.