UNIX_Fall2012_session2_solutions

**Solutions**
1. A different option for running bowtie ignores quality scores and requires reads to align end-to-end with no more than a user-specified number of mismatches. Use the command-line help and/or the bowtie manual (on the bowtie website) to figure out how to run bowtie with this option. What might be a good number of mismatches to allow under the scenario outlined above? Rerun bowtie with these options and save the output in a new file.

On the command line, typing "bowtie" or "bowtie --help" gives a short synopsis of all the options available to you. Scrolling down to the group of options under the heading "Alignment" you can see that the first line reads:

-v report end-to-end hits w/ <=v mismatches; ignore qualities

We are mapping sequence reads from a single basepair point mutant back to the wild-type strain that is otherwise genetically identical to this mutant. Therefore, in the absence of sequencing errors, we would expect all reads to map back with at most, a single mismatch. So allowing a maximum of one mismatch would be a logical starting point. The full command to save the output to a new file called "S288C_mutant.bowtie.v1" would therefore be:

$ **bowtie** **-v 1** **bowtie-0.12.7/indexes/S288C /global/courses/unixworkshop/S288C_mutant.fastq** **S288C_mutant.bowtie.v1**

2. Using the new bowtie output, use one of the unix utilities you learned about this morning to figure out which chromosome this 20 kb region resides on.

Out of curiosity, you might want to just take a look at the output, using the spacebar to scroll through a few pages:

$ **less** **S288C_mutant.bowtie.v1** Frag_2 - chromosome_V 517225 ACTTCACCCAAGGAAACACAAGATCTTCTTCAATCGCCCCAATTT N[SFMFrag_3 - chromosome_V 504829 TCCTGATATCATTACTGATATAAGATTATGTCAATTATTGATACC 1>CFrag_4 + chromosome_V 503427 ACCCGTTTTGTTCTCTTACGTTCAAATACTCTCTAACCTCTTGAG IFH;NEU?Frag_8 - chromosome_V 500404 TATATTCAAATTACACTAGTTATTTTATTGTATTCATATATTAAT >07.A;(KEGF>SHKRCM2C@KPLJPV=GFTGABTBFRPV5N;OMTDLGKLR 0 97:T>A ... ...

It looks like most of the reads are mapping to chromosome 5. Let's see if any map to any other chromosomes:

$ **grep -v chromosome_V S288C_mutant.bowtie.v1** //NO OUTPUT//

No results, which implies that there are no lines in the file that do not match the word 'chromosome_V'. I'd say we're pretty safe assuming that this region is on chromosome_V.

It actually doesn't matter in this case, but if there were reads also mapping to chromosome_VI in the bowtie results file, the above grep command still would not have returned any lines. Why not? As you may have figured out, it's because it can still find 'chromosome_V' in 'chromosome_VI'. To prevent this sneaky behavior from interfering with what you are trying to do, you can use the "-w" argument to tell grep it has to match the entire word. So the proper answer to the above question would be:
 * IMPORTANT NOTE**

$ **grep -v -w chromosome_V S288C_mutant.bowtie.v1**

3. Using a different utility, or a combination of several utilities to estimate the coordinates of this region on this chromosome.

We should be able to estimate the coordinates by looking for the smallest and largest position on chromosome_V where reads align. One way to do that would be to extract the column with that information from the bowtie output file, sort it, and look at the beginning and end of the list:

$ **cut -f 4 | sort | head** //499999// //500000// //500001// //500001// //500001// //500002// //500003// //500006// //500006// //500008//

$ **cut -f 4 | sort | tail** //519889// //519891// //519892// //519894// //519894// //519894// //519895// //519895// //519896// //519899//

Roughly, it looks like the PCR amplified region lies between 499999 and 519899 on chromosome_V

By default, sort does not sort numerically! We lucked out this time because, for the coordinates we're dealing with, alphanumeric and numeric end up being the same order.
 * IMPORTANT NOTE**

Bonus question
What is the mutation (i.e. a change from which base to which base)? Hint: you can infer this by using a single unix utility that you learned about this morning, twelve different times, on the bowtie output from Excercise #1.

In a perfect world where sequencing errors do not exist, all of our reads would map back to the genome without any mismatches, except for the one region with the point mutation. Unfortunately, we do not live in that perfect world which means that the point mutation is hidden amongst other mismatches that occur due to sequence errors. However, the errors should be fairly randomly distributed across the region and should occur as all possible types of mismatches. To make this a little more clear and to refine our expectation, let's estimate the sequence coverage across this region: Bowtie tells us we mapped 19,803 101bp reads. 19803 x 101 = 2000103 We also know that the region itself is approx. 20,000 bp. 2000103/20000 = ~100 So our expected coverage for this region is 100X. This means that, since there will be one true mismatch (due to the point mutation), we should expect to see roughly 100 instances of that particular mismatch beyond those generated by sequencing errors. It turns out that, if you enumerate all possible mismatches, there are twelve of them. You can use "grep -c" to count how many times each of these occurs in the bowtie output.

For example: $ **cut -f 8 | grep -c "T>A" S288C_mutant.bowtie.v1** //318//

You'll end up with a list that looks like this: C>G 141 G>C 150

C>T 186 G>A 268

T>C 301 A>G 329

T>A 307 A>T 344

G>T 356 C>A 365

T>G 533 A>C 546

If you group them by their complements as above, you can see that there is only one pair where one of the two complementary mismatches occurs (roughly) an extra 100 times: G>A

What is the coordinate of the mutation? Hint: it's easier to figure this out if you already know the mutation, but not necessary.

Knowing the mutation, you can grep for it and look at all the start coordinates for read alignments that contain that mismatch. It's fairly obvious that there are a bunch of coordinates between 505293 and 505392.

$ **grep "G>A" S288C_mutant.bowtie.v1 | cut -f 4 | sort** //**...**// //505373// //505378// //505378// //505380// //505380// //505383// //505387// //505387// //505388// //505388// //505389// //...//

Grep for one of these coordinates and use the position of the mismatch within the read versus the align start position to infer the position of the mutation: $ **grep 505391 S288C_mutant.bowtie.v1** Frag_3018 - chromosome_V 505391 ATAGCAGCCAAGGATATTTCTACTGAA... 6ISMSIMKMS6JKJQ@/QUG2ISKLSFTI... 0 98:G>A Frag_3671 + chromosome_V 505391 ATAGCAGCCAAGGATATTTCTACTGAA... V[SRWUJRHRYT;,RSZXS@XHSRSTWZS... 0 2:G>A Frag_6331 + chromosome_V 505391 ATAGCAGCCAAGGATATTTCTACTGAA... 7SUGTRDKBGQLAR;SLROPUOMNQCSGP... 0 2:G>A

Accounting for plus vs minus strand alignments, you can infer that the mutation is at position 505393.

IMPORTANT NOTE Bowtie coordinates are zero-based by default.