UNIX_Fall2012_session1_solutions


==Solutions==


**1) Yeast Genome**

Unzip all the files with a wildcard match.

$ **cd fasta_files**

$ **gunzip *.fa.gz**

Concatenate all the chromosome files into a single genome file with another wildcard match.

$ **cat *.fa > cerevisiae_genome.fasta**

Count the number of chromosomes. Each FASTA record starts with a > header line, so counting those lines counts the sequences.

$ **grep -c '>' cerevisiae_genome.fasta**


**Watch out for searching for just > without the quotes. The shell reads an unquoted > as an output redirect, so you may overwrite your genome with a blank file.**

Look up the command **wc**. How do we use it?
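To see why the quotes matter, the hazard can be reproduced safely on a throwaway file. This is only a sketch: the scratch directory and the file name demo.fasta are made up for illustration and are not part of the exercise.

```shell
# Work in a scratch directory so no real data is at risk.
tmp=$(mktemp -d)
printf '>chrI\nACGT\n>chrII\nGGCC\n' > "$tmp/demo.fasta"

# Quoted: '>' reaches grep as the pattern, so header lines are counted.
quoted=$(grep -c '>' "$tmp/demo.fasta")
echo "headers found: $quoted"

# Unquoted: the shell treats > as "send output to demo.fasta" and empties
# the file before grep starts. grep then fails because it has no pattern.
grep -c > "$tmp/demo.fasta" 2>/dev/null || true

# -s is true only for files larger than zero bytes.
if [ -s "$tmp/demo.fasta" ]; then state="intact"; else state="empty"; fi
echo "demo.fasta is $state"
```

The file is emptied even though grep itself reports an error, because the redirect is set up by the shell before the command runs.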

$ **man wc**

Use **wc** to count the length of the genome.

$ **wc cerevisiae_genome.fasta**

What does each column mean?
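For reference, wc prints the number of lines, words, and bytes, followed by the file name. The sketch below checks this on a tiny made-up file (demo.fasta is a hypothetical name), requesting each column separately with -l, -w, and -c.

```shell
tmp=$(mktemp -d)
printf '>chrI\nACGT\nACG\n' > "$tmp/demo.fasta"

# Default output: lines, words, bytes, then the file name.
wc "$tmp/demo.fasta"

# Each column can also be requested on its own. Reading from stdin
# suppresses the file name; tr strips the padding some systems add.
lines=$(wc -l < "$tmp/demo.fasta" | tr -d ' ')
words=$(wc -w < "$tmp/demo.fasta" | tr -d ' ')
bytes=$(wc -c < "$tmp/demo.fasta" | tr -d ' ')
echo "lines=$lines words=$words bytes=$bytes"
```

Note that the byte count includes the newline at the end of every line, which matters when the goal is to count nucleotides.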

The file contains more than nucleotides: the > header lines are counted too. Use grep -v to drop them before counting.

$ **grep -v '>' cerevisiae_genome.fasta | wc**
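One subtlety remains: the character count from wc still includes the newline at the end of each sequence line. A possible refinement, sketched here on a made-up file (demo.fasta is hypothetical), is to delete the newlines with tr before counting.

```shell
tmp=$(mktemp -d)
printf '>chrI\nACGT\nACG\n>chrII\nGGCC\n' > "$tmp/demo.fasta"

# Headers removed, but each remaining line still ends in a newline.
with_newlines=$(grep -v '>' "$tmp/demo.fasta" | wc -c | tr -d ' ')

# Delete the newlines as well, so only sequence characters are counted.
bases=$(grep -v '>' "$tmp/demo.fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "with newlines: $with_newlines, bases only: $bases"
```

The difference between the two numbers equals the number of sequence lines, so subtracting the first wc column from the third gives the same answer.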

**2) SGD Features**

Download the feature table with wget:

$ **wget ftp://genome-ftp.stanford.edu/pub/yeast/chromosomal_feature/SGD_features.tab**

The -O flag of wget allows you to specify the filename of the downloaded file.

A first count of the ORFs in the file (every line that mentions ORF anywhere):

$ **grep -c ORF SGD_features.tab**

And the Verified ones:

$ **grep ORF SGD_features.tab | grep -c Verified**

And the Dubious ones:

$ **grep ORF SGD_features.tab | grep -c Dubious**

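The counts above match ORF anywhere on the line, including the descriptions of other features. For the real number of listed ORFs, search only the second column, which holds the feature type: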

$ **cut -f 2 SGD_features.tab | grep -c ORF**
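The difference between searching whole lines and searching one column can be seen on a small made-up table. Everything in this sketch is hypothetical (the file mini.tab, the IDs, the names); only the tab-separated layout with the feature type in column 2 follows the commands above. The awk line is an alternative, not part of the original solution, that requires column 2 to equal ORF exactly.

```shell
tmp=$(mktemp -d)
# Hypothetical rows: ID, feature type, qualifier, name, description.
printf 'ID1\tORF\tVerified\tnameA\tsome gene\n'             >  "$tmp/mini.tab"
printf 'ID2\tORF\tDubious\tnameB\tanother gene\n'           >> "$tmp/mini.tab"
printf 'ID3\tCDS\t\tnameA\tcoding part of ORF nameA\n'      >> "$tmp/mini.tab"
printf 'ID4\ttRNA_gene\t\tnameC\ta tRNA\n'                  >> "$tmp/mini.tab"

# Matches ORF anywhere, so the CDS row is counted by mistake.
any_column=$(grep -c ORF "$tmp/mini.tab")

# Restrict the search to column 2 first.
second_column=$(cut -f 2 "$tmp/mini.tab" | grep -c ORF)

# Exact comparison on column 2, with tab as the field separator.
exact=$(awk -F'\t' '$2 == "ORF"' "$tmp/mini.tab" | wc -l | tr -d ' ')

echo "any column: $any_column, column 2: $second_column, exact: $exact"
```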

List the distinct genomic feature types:

$ **cut -f 2 SGD_features.tab | sort | uniq**
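A natural extension is to count how often each feature type occurs, using uniq -c and a numeric sort. The sketch below runs on a made-up two-column file (mini.tab and its contents are hypothetical) rather than the real table.

```shell
tmp=$(mktemp -d)
printf 'ID1\tORF\nID2\ttRNA_gene\nID3\tORF\nID4\tCDS\nID5\tORF\n' > "$tmp/mini.tab"

# Number of distinct feature types.
distinct=$(cut -f 2 "$tmp/mini.tab" | sort | uniq | wc -l | tr -d ' ')

# uniq -c prefixes each distinct value with its count;
# sort -rn then puts the most frequent type first.
cut -f 2 "$tmp/mini.tab" | sort | uniq -c | sort -rn

# Name of the most frequent feature type.
top=$(cut -f 2 "$tmp/mini.tab" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "distinct types: $distinct, most common: $top"
```

The sort before uniq is required because uniq only collapses identical lines that are adjacent.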

**3) In depth with grep**

Adding the -o flag tells grep to report only the part of each line that matches the pattern.

egrep (equivalent to grep -E) lets you use extended regular expressions in your pattern, which gives you more flexibility in what you match.

Combining egrep with the -o flag, we can search for runs of one or more GA repeats:


$ **egrep -o '(GA)+' genome.fasta | sort | uniq -c**

657488 GA
 38100 GAGA
  2278 GAGAGA
   159 GAGAGAGA
    23 GAGAGAGAGA
     2 GAGAGAGAGAGA
     2 GAGAGAGAGAGAGA
     1 GAGAGAGAGAGAGAGA
     1 GAGAGAGAGAGAGAGAGA
     1 GAGAGAGAGAGAGAGAGAGA
     1 GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
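The same pipeline can be tried on a short made-up sequence to see how the counts arise. This sketch uses grep -E, the equivalent of egrep; the sequences are invented for illustration. It also shows one caveat that is an observation about grep rather than something stated in the exercise: grep searches line by line, so a repeat that is wrapped across two lines of a FASTA file is reported as two shorter runs.

```shell
dna='TTGAGAGATTGACCGAGATGA'

# Each maximal run of GA is printed on its own line, then tallied.
printf '%s\n' "$dna" | grep -E -o '(GA)+' | sort | uniq -c

# Number of runs that are exactly one GA long (-x matches whole lines).
ga_only=$(printf '%s\n' "$dna" | grep -E -o '(GA)+' | grep -c -x 'GA')

# A GAGAGA repeat wrapped over two lines is seen as GAGA plus GA...
split_runs=$(printf 'TTGAGA\nGATT\n' | grep -E -o '(GA)+' | wc -l | tr -d ' ')

# ...unless the newlines are removed before searching.
joined_runs=$(printf 'TTGAGA\nGATT\n' | tr -d '\n' | grep -E -o '(GA)+' | wc -l | tr -d ' ')

echo "single GA: $ga_only, wrapped: $split_runs, joined: $joined_runs"
```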