Skip to main content
Try Wikispaces Classroom now.
Brand new from Wikispaces.
Pages and Files
Fall 2016 Workshops
Spring 2016 Workshops
Fall 2015 Workshops
Spring 2015 Workshops
Fall 2014 Workshops
Spring 2014 Workshops
Fall 2013 Workshops
Spring 2013 Workshops
Bioinformatics Pipeline for de novo Targeted Capture
You can download the virtual box image (.ova file) at
To open it in virtual box:
1) Open Virtual Box program
2) Click on File->Import Appliance
3) Find the OVA file you downloaded (click on the small folder icon next to the box) and click continue
4) Leave the appliance settings as default and click import.
5) This will take 5-10 minutes to import the virtual image into virtual box.
6) Once import is complete, select the new VM in the side bar of virtual box and select start to start up the OS. If you ever need a password use:cgrl2014
Instructor: Ke Bi
Contributors: Sonal Singhal, Ke Bi, Tyler Linderoth
For questions or to report bugs, please contact Ke Bi (
. Singhal S. 2013. De novo transcriptomic analyses for non-model organisms: an evaluation of methods across a multi-species data set. Molecular Ecology Resources 13:403-416.
. Bi K, Linderoth T, Vanderpool D, Good JM, Nielsen R and Moritz C. 2013. Unlocking the vault: next‐generation museum population genomics. Molecular Ecology 22:6018-6032.
. Bi K, Vanderpool D, Singhal S, Linderoth T, Moritz C and Good JM. 2012. Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales. BMC Genomics 13: e403.
The pipelines are deposited in
Scripts included in this pipeline:
Use “chmod +x script” to make each of these perl scripts executable.
*: Reformats raw sequencing reads from Illumina HiSeq or MiSeq for
Specifically, in this step we will remove reads that did not pass the Illumina quality control filters and modify the sequence identifiers.
Raw sequence data files are grouped and saved in folders named by their sample IDs. For instance, three libraries (CGRL_index1, CGRL_index15, CGRL_index40) are saved under “/home/ke/Desktop/SeqCap/data/rawdata/library/”. Compressed fastq sequence files are saved in each of these folders.
Fastq files use the following naming scheme:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz
For example, in “CGRL_index15_CGACCTG_L006_R1_001.fastq.gz”:
sample name: CGRL_index15
barcode sequence: CGACCTG
lane (0-padded to 3 digits): 006
read number: 1
set number (0-padded to 3 digits): 001
#Make a new folder called “raw” under “~/Desktop/SeqCap/data/rawdata/”:
ke@NGS:~/Desktop/SeqCap/data/rawdata$ mkdir raw
Copy all these compressed fastq files from each folder (CGRL_index1, CGRL_index15, CGRL_index40) to “raw”
ke@NGS:~/Desktop/SeqCap/data/rawdata$ cp library/CGRL_index*/*.gz raw
#Check data files in “raw”:
ke@NGS:~/Desktop/SeqCap/data/rawdata$ ls raw/*
#cd to the working directory:
ke@NGS:~/Desktop/SeqCap/data/rawdata$ cd ..
#run 1-PreCleanup with fastq evaluation
ke@NGS:~/Desktop/SeqCap/data$ 1-PreCleanup ~/Desktop/SeqCap/data/rawdata/raw/ fastqc
Three new folders will be created under “~/Desktop/SeqCap/data/rawdata/raw/”:
- Folder “pre-clean” contains reformatted raw fastq reads.
- Folder “combined” contains merged, compressed, fastq data files (not used by the following pipeline).
- Folder “evaluation” contains fastQC results for each data file.
1. Check the sequence identifiers and the number of reads in fastq files before and after running
and compare the results.
2. Check the fastQC evaluation results for the raw data
*: Clean up raw data, which includes trimming for quality, removing adapters, merging overlapping reads, removing duplicates and reads sourced from contamination
FLASh-modified: modified version of FLASh by Filipe G. Vieira.
1. Reformatted fastq files created by
#Check the raw data files:
ke@NGS:~/Desktop/SeqCap/data/rawdata/raw/pre-clean$ ls *.fq
2. A fasta file that contains adapter sequences:
#Check the format of adapter sequence file:
ke@NGS:~/Desktop/SeqCap/denovoTargetCapture/associated_files $ less -S Adapters.fasta
Note: The header of each adapter sequence has to be named strictly as “
N” or “
N”. N is the number of index. It is OK to put all adapters in this file but your libraries only use a subset of them.
3. Library info file (Tab-delimited txt file):
#Check the format of
ke@NGS:~/Desktop/SeqCap/denovoTargetCapture/associated_files $ less -S libInfo.txt
library P7 P5
leave the “P5” column blank if you only have indexes in P7 adapters in the libraries.
4. Contaminant file:
( + human + other genome resources if desired) genome in fasta format.
This file (e_coli_K12.fasta) is saved in “~/Desktop/SeqCap/denovoTargetCapture/associated_files/ecoli/”
#Make a new folder called “cleaned_data” in “~/Desktop/SeqCap/data/”:
ke@NGS:~/Desktop/SeqCap/data$ mkdir cleaned_data
ke@NGS:~/Desktop/SeqCap/data$ 2-ScrubReads -f ~/Desktop/SeqCap/data/rawdata/raw/pre-clean/ -o ~/Desktop/SeqCap/data/cleaned_data/ -a ~/Desktop/SeqCap/denovoTargetCapture/associated_files/Adapters.fasta -b ~/Desktop/SeqCap/denovoTargetCapture/associated_files/libInfo.txt -t /home/ke/Desktop/SeqCap/programs/Trimmomatic-0.32/trimmomatic-0.32.jar -c ~/Desktop/SeqCap/denovoTargetCapture/associated_files/ecoli/e_coli_K12.fasta -e 200 -m 15 -z
Note: I use default values for most of the arguments. Users should adjust these parameters when processing the real datasets.
1. In “~/Desktop/SeqCap/data/cleaned_data/”, six .txt files per library are produced:
For example for library CGRL_index1, the six files are:
CGRL_index1_1_final.txt (left reads)
CGRL_index1_2_final.txt (right reads)
CGRL_index1_u_final.txt (merged or unpaired reads)
CGRL_index1.contam.out (headers of reads aligned to bacteria)
CGRL_index1.duplicates.out (headers of duplicated reads)
CGRL_index1.lowComplexity.out (headers of low complexity reads)
In “~/Desktop/SeqCap/data/cleaned_data/evaluation/”, you can find fastQC results for cleaned reads from each library.
1. Check the %reads that are exact duplicates, %reads that are likely derived from microbial genome and %reads that contain low complexity.
2. Check the fastQC evaluation results of cleaned reads and then compare them to those of raw reads. Is the quality improved?
*: Assemble sequence capture data using ABySS.
We use a multiple-kmer approach to assemble our data. If there is even coverage and even polymorphism levels across the assembled genome, there should (in theory) be one k-mer that best assembles the data. In reality, coverage and polymorphism vary across captured loci, and using multiple k-mers is a way to bet hedge and get good assemblies for all loci. In assembling your data, it is important to consider which samples to use in your assembly. Ideally, you could assemble across multiple individuals to increase your read depth, and thus, assembly contiguity and continuity. However, for many projects, more individuals can also mean increased polymorphism. While we have found the assemblers are more robust to polymorphism than the program writers themselves often suggest, increased polymorphism does lead to shorter contigs and increased misassemblies. With these sample data, we assembled across all in-group samples – this seemed like the best balance between having enough data to power assembly while not introducing too much polymorphism.
ABySS (compiled with OpenMPI and Google sparsehash):
Concatenated cleaned reads from libraries that you would like to assemble together. The libraries to be assembled together have to be genetically similar: ideally, samples from the same population. In this example we want to assemble CGRL_index1, CGRL_index15 and CGRL_index40 together.
#Make a new folder called “raw_assembly” under “~/Desktop/SeqCap/data/”:
ke@NGS:~/Desktop/SeqCap/data$ mkdir raw_assembly
#Concatenate cleaned reads and save them in “raw_assembly”:
ke@NGS:~/Desktop/SeqCap/data$ cat cleaned_data/CGRL_index*_1_final.txt > raw_assembly/combined_1_final.txt
ke@NGS:~/Desktop/SeqCap/data$ cat cleaned_data/CGRL_index*_2_final.txt > raw_assembly/combined_2_final.txt
ke@NGS:~/Desktop/SeqCap/data$ cat cleaned_data/CGRL_index*_u_final.txt > raw_assembly/combined_u_final.txt
#Inside “raw_assemblies” make a new folder “results”:
ke@NGS:~/Desktop/SeqCap/data$ mkdir raw_assembly/results
#Run ABySS on two processors using kmer sizes of 21, 31, 41, 51, 61, and 71.
ke@NGS:~/Desktop/SeqCap/data$ 3-GenerateAssemblies abyss -reads ~/Desktop/SeqCap/data/raw_assembly/ -mpi /usr/bin/mpirun -out ~/Desktop/SeqCap/data/raw_assembly/results/ -kmer 21 31 41 51 61 71 -np 2
Note: Your labtop will not be able to handle memory intensive
There are a lot of intermediate files created in “~/Desktop/SeqCap/data/raw_assembly/results/combined/”.
#To show the assemblies that we need for the next step:
ke@NGS:~/Desktop/SeqCap/data$ cd raw_assembly/results/combined
ke@NGS:~/Desktop/SeqCap/data/raw_assembly/results/combined$ ls *-contigs.fa
#Combine all the raw assemblies and write the result to a new file called “all_assemblies.fasta”:
ke@NGS:~/Desktop/SeqCap/data/raw_assembly/results/combined$ cat combined_*_cov_default-contigs.fa > all_assemblies.fasta
#Make a new folder called “merge_assemblies” under “~/Desktop/SeqCap/data/”:
ke@NGS:~/Desktop/SeqCap/data $ mkdir merge_assemblies
opy “all_assemblies.fasta” into “merge_assemblies/”:
ke@NGS:~/Desktop/SeqCap/data $ cp raw_assembly/results/combined/all_assemblies.fasta merge_assemblies
*:Combining assembled contigs across multiple k-mers to generate a final assembly introduces a lot of redundancy into the final assembly. To address this, we use a lightweight assembler cap3 and other programs (blat,
cd-hit-est) to merge contigs and to remove redundancies.
Concatenated raw assemblies “~/Desktop/SeqCap/data/merge_assemblies/all_assemblies.fasta” produced by
ke@NGS:~/Desktop/SeqCap/data$ 4-FinalAssembly -a ~/Desktop/SeqCap/data/merge_assemblies/ -c 1000
Note: when analyzing real data, users should test these parameters (-d -e -b) for optimal results.
Several files are created in “~/Desktop/SeqCap/data/merge_assemblies/”. The data file that we need for the next step is “all_assemblies.fasta.final”.
ke@NGS:~/Desktop/SeqCap/data/merge_assemblies$ mv all_assemblies.fasta.final all_assemblies_final.fasta
*: identify contigs that are stemmed from the targeted loci and use these contigs as a reference (aka. a pseudo-reference)
Here, we suggest taking a very conservative approach to define the reference genome against which you will align your reads. You will likely get many multiples more contigs than loci you targeted. Some of these might be junk; some might be real. Rather than try to identify which of the extraneous contigs are junk or real, we suggest using only those contigs which match to the original targets. To do so, we implement a BLAST approach, which identifies which contig has the best-hit match to one’s targeted loci.
1. “~/Desktop/SeqCap/data/merge_assemblies/all_assemblies_final.fasta” produced by
2. Targeted loci fasta file: “~/Desktop/SeqCap/denovoTargetCapture/original_target /targeted_loci.fasta” contains loci/genes/exons from which probes are designed.
#Make a new folder called “in_target_assemblies”: under “~/Desktop/SeqCap/data/”:
ke@NGS:~/Desktop/SeqCap/data$ mkdir in_target_assemblies
ke@NGS:~/Desktop/SeqCap/data/in_target_assemblies$ 5-FindingTargets -t ~/Desktop/SeqCap/denovoTargetCapture/original_target/targeted_loci.fasta -a ~/Desktop/SeqCap/data/merge_assemblies/all_assemblies_final.fasta -o in_target.fasta -e in_target.captured
Two files are created and stored in “~/Desktop/SeqCap/data/in_target_assemblies/”:
1. “in_target.fasta”: A fasta sequence file containing contigs that are stemmed from the targeted loci.
2. “in_target.captured”: A txt file containing percent captured for each target (tab-delimited).
* (Optional): function “BASIC” evaluates the quality of in-target assemblies by reporting basic stats: mean, median, total length, gc%, N50 etc. It also generates a distribution of contigs by binned lengths.
In reality, “BASIC” can evaluates quality of any assemblies.
This script also assesses the quality of transcriptome/targeted capture assemblies form several other aspects. For example:
“COVERAGE” calculates error rate, average quality score of the aligned bases and its variance/std, and average base coverage. Users need to generate alignment first.
“FIX” fixes assembly errors. Users need to generate alignment first.
For more details please execute “
” in a terminal window.
“~/Desktop/SeqCap/data/in_target_assemblies/in_target.fasta” produced by
ke@NGS:~/Desktop/SeqCap/data $ 6-AssemblyEvaluation BASIC -a ~/Desktop/SeqCap/data/in_target_assemblies
Two files are created in “~/Desktop/SeqCap/data/in_target_assemblies/”:
1. “in_target.hist”: distribution of contigs by binned lengths
#Display first few lines of the file:
ke@NGS:~/Desktop/SeqCap/data/in_target_assemblies$ head in_target.hist
2. “basic_evaluation.out”: results of assembly evaluation
#Display first few lines of the file:
ke@NGS:~/Desktop/SeqCap/data/in_target_assemblies$ head basic_evaluation.out
Note: users might want to compare the metric between in_target assemblies to original targeted loci and see how much flanking are captured and assembled. Under most circumstances, the mean, median, N50 etc should be much higher in in_target assemblies than in original targeted loci.
The example dataset that is used for the purpose of demonstration, however, should not show this pattern indicated above since it is assembled from a tiny fraction of data.
: aligning cleaned reads against the pseudo-reference using Novoalign
Novoalign “is an aligner for single-ended and paired-end reads from the Illumina. Novoalign finds global optimum alignments using full Needleman-Wunsch algorithm with affine gap penalties.”
“Question: How does Novoalign compare to programs like BWA, Bowtie, ELAND and BFAST?
Novoalign was designed to be an accurate short read aligner that combines fast K-mer index searching with dynamic programming. In terms of speed Novoalign will be slower than Burrows-Wheeler transform aligners e.g. BWA, Bowtie and in some cases faster than BFAST. In terms of accuracy Novoalign is in most cases more sensitive than these tools because it uses full dynamic programming to find the best alignment of a short read to a genome sequence.”
According to Heng Li, author of SAMTools & MAQ, Novoalign “is the most accurate aligner to date”.
1. A pseudo-reference genome, “~/Desktop/SeqCap/data/ in_target_assemblies/in_target.fasta”, generated by
#make a new directory called “reference” under “~/Desktop/SeqCap/data/”:
ke@NGS:~/Desktop/SeqCap/data$ mkdir reference
#Copy “in_target.fasta” to “~/Desktop/SeqCap/data/reference/”:
ke@NGS:~/Desktop/SeqCap/data$ cp in_target_assemblies/in_target.fasta reference/
2. Cleaned reads generated by
Cleaned reads are saved in "~/Desktop/SeqCap/data/cleaned_data/".
#Take a look at these reads:
ke@NGS:~/Desktop/SeqCap/data/cleaned_data$ ls *.txt
#Make a new folder called “alignment” under “~/Desktop/SeqCap/data/”:
ke@NGS:~/Desktop/SeqCap/data$ mkdir alignment
ke@NGS:~/Desktop/SeqCap/data$ 7-Alignment -f ~/Desktop/SeqCap/data/reference/in_target.fasta -r ~/Desktop/SeqCap/data/cleaned_data/ -o ~/Desktop/SeqCap/data/alignment/ -i 200 -v 20 -t 90
Note: do not set t for alignment of very divergent genomes.
BAMS and indexed bam files.
#Take a look at these files:
* (Optional): Function “Evaluation” provides evaluation for capture efficiency: %reads mapped, %target captured, average sequence depth, etc.
Note: %reads mapped (specificity), %target captured (sensitivity), and average sequence depth are typically reported in papers.
1. A pseudo-reference “in_target.fasta” generated by
You can find this file in “~/Desktop/SeqCap/data/reference/”.
2. Cleaned reads generated by
These reads are located in “~/Desktop/SeqCap/data/cleaned_data/”:
3. Raw reads generated by
These data are located in “~/Desktop/SeqCap/data/rawdata/raw/pre-clean/”:
4. All bam (alignment) files generated by
The bams (sorted and indexed) are located in “~/Desktop/SeqCap/data/alignment/”:
5. A .bed file generated by
A BED file (.bed) is a tab-delimited text file that defines a feature track of each locus. In this case, this file defines targeted region in each assembled contig.
For example if the length of contig125 is 1000bp, but the targeted region starts from position 120 and ends by 350, then the correct expression is:
Contig125 119 350 (note: in bed the start position is one less than it's actual value)
For more details of BED format please go to:
#Make a new folder called “ExonCapEval” under “~/Desktop/SeqCap/data/”:
ke@NGS:~/Desktop/SeqCap/data$ mkdir ExonCapEval
ke@NGS:~/Desktop/SeqCap/data$ 8-ExonCaptureEvaluation Evaluation -genome ~/Desktop/SeqCap/data/reference/in_target.fasta -cleanDir ~/Desktop/SeqCap/data/cleaned_data/ -rawDir ~/Desktop/SeqCap/data/rawdata/raw/pre-clean/ -bamDir ~/Desktop/SeqCap/data/alignment/ -InstrID HS -resDir ~/Desktop/SeqCap/data/ExonCapEval/ -readlen 100
Note: If you just evaluate how targeted regions worked, you should provide a bed file (generated by
) while running 8-ExonCaptureEvaluation Evaluation.
“data_metrics.txt” under “~/Desktop/SeqCap/data/ExonCapEval/”
You can use “less” to check the results reported in this file.
bed” generates a bed for exonic region(s) from each contig in in-target assemblies (aka. the reference) and a bed for all assembled contigs (start position is 0 and the end position is the length of the contig); “
a list of contigs that fall outside the desired coverage percentiles;
” also produces base coverage values at different level of percentile.
1.Targeted loci: “
2. “~/Desktop/SeqCap/data/reference/in_target.fasta” generated by
#Make a new folder called “bed_files” under “~/Desktop/SeqCap/data/”:
ke@NGS:~/Desktop/SeqCap/data$ mkdir bed_files
#cd to this folder:
ke@NGS:~/Desktop/SeqCap/data$ cd bed_files
ke@NGS:~/Desktop/SeqCap/data/bed_files$ 9-preFiltering bed ~/Desktop/SeqCap/denovoTargetCapture/original_target/targeted_loci.fasta ~/Desktop/SeqCap/data/reference/in_target.fasta
Two file under “~/Desktop/SeqCap/data/bed_files/”:
1. “final.bed” is used as input for
2. “All_contig.bed” is used as input for
1. Make a new folder called “pre-filtering” under “~/Desktop/SeqCap/data/” and cd to this folder:
ke@NGS:~/Desktop/SeqCap/data$ mkdir pre-filtering
ke@NGS:~/Desktop/SeqCap/data$ cd pre-filtering
2. In “~/Desktop/SeqCap/data/pre-filtering/”, g
enerate a merged, sorted bam for all samples:
ke@NGS:~/Desktop/SeqCap/data/pre-filtering$ samtools merge merge.bam ~/Desktop/SeqCap/data/alignment/*.bam
ke@NGS:~/Desktop/SeqCap/data/pre-filtering$ samtools sort merge.bam merge_sorted
“~/Desktop/SeqCap/data/pre-filtering/merge_sorted.bam” is the input bam.
3. A bed file:
“~/Desktop/SeqCap/data/bed_files/final.bed” is generated by
ke@NGS:~/Desktop/SeqCap/data/pre-filtering$ 9-preFiltering percentile -b ~/Desktop/SeqCap/data/pre-filtering/merge_sorted.bam -o CGRL -B ~/Desktop/SeqCap/data/bed_files/final.bed
In the folder “~/Desktop/SeqCap/data/pre-filtering/” there are a couple of files created:
1. “CGRL_gene_outside_percentile.txt” shows a list of contigs having coverage <X% or >Y% percentiles of the data. X and Y are defined by users. This file will be used in
2. “CGRL_site_depth_percentile.txt” shows base coverage at different level of percentiles. The information in this file will be used by
3. “CGRL_gene_depth_percentile.txt” shows average base coverage at different level of percentiles.
4. “CGRL_gene_depth.txt” shows average coverage of each contig. If you want to know more about empirical coverage distribution of your data then you take the coverage value from this file and use R to plot it.
5. “CGRL_site_depth.txt” shows per-base coverage of the data. You can plot it to get a sense of empirical distribution of base coverage.
6. “CGRL_gene_outside_sd_filter.txt”: shows a list of contigs all outside N standard deviation of the mean. Users set N when running the command.
Note: users might want to perform filtering based on other criteria such as 3 standard deviations of the mean. However, this method usually requires a normal distribution of the data. In reality per-base depth of exon capture data rarely follows a normal distribution.
: Raw variant filtering and generates a “keep” file for the following SNP/genotype calling by ANGSD. This script is mainly for filtering data at contig and site levels. Users need to perform individual-level filtering before running this script. See below for more details.
Before we call SNPs /genotypes and estimate allele frequencies using ANGSD,
we usually employ three levels of filtering on the data sets in a hierarchical order: individual level, contig level and site level. The filters in each step of the hierarchy are applied only to the subset of data that pass the quality control thresholds at all previous levels. The first filters applied are the individual-level filters to remove entire individuals deviating excessively from the average across-individual coverage and error rate. Contig-level filters, followed by site-level filters, are then applied to remove entire contigs and sites, respectively, that appeared to be quality outliers. All individual specimens, contigs and sites should be filtered on multiple aspects of quality (e.g. potential cross-sample DNA contamination, sequencing errors, paralogy).
1. Filtering at individual level
a. Remove individuals having extremely low or high coverage. Individual coverage can be estimated using
The file you want to examine is
ke@NGS:~/Desktop/SeqCap/data$ less –S
b. Remove individuals with excessively high sequencing error rates measured as the percentage of mismatched bases out of the total number of aligned bases in the mitochondrial genome. Empirical error can be estimated using
you need first to generate pileup files for mitochondrial locus for each sample.
Usage 6-AssemblyEvaluation COVERAGE [options]
-p DIR folder containing all pileup
files generated by "samtools
mpileup -f ref.fa sample1.bam
-c INT coverage cutoff 
-q INT base quality cutoff 
2. Filtering at contig level
a. Remove contigs that show extremely low or high coverage based on the empirical coverage distribution across all contigs.
can be used to generate a list of contigs that show extreme coverage based on percentile values (for example: 1% and 99%; 5% and 95% etc.). This list can then be used as one of input files in
for the purpose of filtering.
b. Remove contigs with at least one SNP having allele frequencies highly deviating from Hardy–Weinberg equilibrium expectations. Done by
. Note this is a very stringent filter even for exon capture dataset and not suitable at all for genomic dataset. To use this filter you need to provide “~/Desktop/SeqCap/data/bed_files/All_contig.bed” generated by
3. Filtering at site level
a. Remove sites with excessively low or high coverage based on the empirical coverage distribution. To determine high (e.g. 99% or 95%) and low (e.g. 1% or 5%) percentiles of base coverage you need run
b. Remove sites having allele frequencies highly deviating from Hardy–Weinberg equilibrium expectations (exact test). Done by
. This filter can be combined with the contig HWE filter (2.b).
c. Remove sites with biases associated with reference and alternative allele Phred quality, mapping quality and distance of alleles from the ends of reads. Also remove sites that show a bias towards SNPs coming from the forward or reverse strand. These will be done by
-> Strand Bias: Tests if variant bases tend to come from one strand.
-> End Distance Bias: Tests if variant bases tend to occur at a fixed distance from the end of reads, which is usually an indication of misalignment.
-> Base Quality Bias: Tests if variant bases tend to occur with a Phred-scale quality bias.
-> Mapping Quality Bias: Tests if variant bases tend to occur with a mapping quality bias.
d. Remove sites for which there are not at least M of the individuals sequenced at N coverage each. This makes sure that the remaining data matrix does not contain too much missing data. This will be done by
e. Remove sites with a root mean square (RMS) mapping quality for SNPs across all samples below a certain threshold. It is a measure of the variance of quality scores. This will be done by
f. (optional) For historic samples, characterize the pattern of base mis-incorporation first. Sometimes it is necessary to remove C to T and G to A SNPs from the dataset. This can be done by
make sure that
individual-level filtering is finished.
1. “~/Desktop/SeqCap/data/pre-filtering/CGRL_gene_outside_percentile.txt” by
2. “~/Desktop/SeqCap/data/bed_files/All_contig.bed” generated by
3. “~/Desktop/SeqCap/data/pre-filtering/CGRL_site_depth_percentile.txt” by
is ready and will be used to guide the site-level coverage filtering.
4. Create a new folder “SNPcleaning” under “~/Desktop/SeqCap/data/” and inside this folder generate a raw vcf that contains all sites from all individual samples that pass individual-level filters:
ke@NGS:~/Desktop/SeqCap/data$ mkdir SNPcleaning
ke@NGS:~/Desktop/SeqCap/data$ cd SNPcleaning/
ke@NGS:~/Desktop/SeqCap/data/SNPcleaning$ samtools mpileup -B -D -I -S -uf ~/Desktop/SeqCap/data/reference/in_target.fasta ~/Desktop/SeqCap/data/alignment/*sorted.bam | bcftools view -cg - > raw.vcf
-D output per-sample DP in BCF
-B disable BAQ computation
-I do not perform indel calling
-S output per-sample strand bias P-value in BCF
ke@NGS:~/Desktop/SeqCap/data/SNPcleaning$ 10-SNPcleaner -d 2 -D 7 -k 2 -u 1 -a 0 -B CGRL.bed -p CGRL_filtered -r ~/Desktop/SeqCap/data/pre-filtering/CGRL_gene_outside_percentile.txt -X ~/Desktop/SeqCap/data/bed_files/All_contig.bed -g -v raw.vcf> out.vcf
Note: for “-D 7”, 7 is the 99% percentile of the base coverage. We get this number from “~/Desktop/SeqCap/data/pre-filtering/CGRL_site_depth_percentile.txt”.
In “~/Desktop/SeqCap/data/SNPcleaning/”, several files are created:
1. “CGRL.bed” contains sites (potentially variable and non-variable) passing all filters.
#To generate a keep file for ANGSD:
ke@NGS:~/Desktop/SeqCap/data/SNPcleaning$ cut -f1,2 CGRL.bed > CGRL.keep
2. “CGRL_filtered” (dumped with option -p) contains all sites that failed to pass certain filters. Characters in front of filtered sites indicate filters that the site failed to pass.
#To view this file:
ke@NGS:~/Desktop/SeqCap/data/SNPcleaning$ bunzip2 -c CGRL_filtered | less -S
3. “out.vcf” is the resulting vcf that contains sites (both variable and non-variable) passed all filters
1. Check how many sites are present before and after filtering?
2. Check why some sites are filtered out by examining “CGRL_filtered”.
help on how to format text
Turn off "Getting Started"