Solutions


1) Yeast Genome

Unzip all the files with a wildcard match.

$ cd fasta_files
$ gunzip *.fa.gz

Create the file from all the chromosomes with another wildcard match.

$ cat *.fa > cerevisiae_genome.fasta

Count the number of chromosomes by counting the FASTA header lines, one per sequence, each of which contains a > character.

$ grep -c '>' cerevisiae_genome.fasta

Watch out for searching for just > without the quotes: the shell treats an unquoted > as output redirection, so you may overwrite your genome with a blank file.
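The pitfall can be reproduced safely on a throwaway file. The sketch below is only an illustration: toy.fasta and its two records are made up, and everything happens in a temporary directory rather than in fasta_files.

```shell
# Scratch directory and a made-up two-record FASTA file (both hypothetical).
tmp=$(mktemp -d)
printf '>chr1\nACGT\n>chr2\nGGCC\n' > "$tmp/toy.fasta"

# Quoted: '>' is passed to grep as the pattern, so header lines are counted.
quoted_count=$(grep -c '>' "$tmp/toy.fasta")
echo "header lines: $quoted_count"

# Unquoted: the shell sees a redirection and empties the file before grep
# even starts; grep itself then fails because no pattern is left.
grep -c > "$tmp/toy.fasta" 2>/dev/null || true

if [ -s "$tmp/toy.fasta" ]; then
    echo "toy.fasta still has content"
else
    echo "toy.fasta is now empty"
fi
```

The file is emptied even though grep exits with an error, because the shell sets up the redirection before it runs the command.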

Look up the command wc. How do we use it?

$ man wc

Use wc to count the length of the genome.

$ wc cerevisiae_genome.fasta

What does each column mean? In order, wc reports the number of lines, words, and bytes, followed by the filename.

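The file contains characters other than nucleotides, such as the header lines that contain >, so how do we get rid of them? Use grep -v to drop the header lines before counting. The last column still includes one newline character per line.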

$ grep -v '>' cerevisiae_genome.fasta | wc
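The count from the pipeline above still includes the newline at the end of every sequence line. One possible refinement, sketched here on a made-up toy file rather than the real genome, is to delete the newlines with tr before counting bytes with wc -c.

```shell
# Made-up FASTA file in a scratch directory (hypothetical data).
tmp=$(mktemp -d)
printf '>chr1\nACGT\nAC\n>chr2\nGGCC\n' > "$tmp/toy.fasta"

# Header lines removed, newlines still counted.
# The trailing tr -d ' ' strips the padding some wc implementations add.
with_newlines=$(grep -v '>' "$tmp/toy.fasta" | wc -c | tr -d ' ')

# Header lines and newlines removed: nucleotides only.
bases=$(grep -v '>' "$tmp/toy.fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "with newlines: $with_newlines"
echo "nucleotides:   $bases"
```

On the toy file the two numbers differ by three, one for each sequence line.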

2) SGD Features


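Download the features table with wget.

$ wget ftp://genome-ftp.stanford.edu/pub/yeast/chromosomal_feature/SGD_features.tab

The optional -O flag lets you specify the filename of the downloaded file. Without it, wget keeps the remote name, SGD_features.tab.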

The number of ORFs in the file:

$ grep -c ORF SGD_features.tab

And the Verified ones:

$ grep ORF SGD_features.tab | grep -c Verified

And the Dubious ones:

$ grep ORF SGD_features.tab | grep -c Dubious

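Now the real number of listed ORFs. The searches above match ORF anywhere on a line, including the description, so restrict the search to column 2, the feature type: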

$ cut -f 2 SGD_features.tab | grep -c ORF
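The difference between the whole-line and column-only counts is easy to see on a small table. The rows below are invented for illustration and are not real SGD records; they only mimic the tab-separated layout, with the feature type in column 2 and the qualifier in column 3. The awk variant is one possible stricter alternative that compares whole fields.

```shell
# Four made-up tab-separated rows (hypothetical, not real SGD data).
tmp=$(mktemp -d)
printf 'S1\tORF\tVerified\tGENE1\ttoy verified gene\n'       >  "$tmp/toy.tab"
printf 'S2\tORF\tDubious\tGENE2\toverlaps ORF GENE1\n'       >> "$tmp/toy.tab"
printf 'S3\ttRNA_gene\t\tTRNA1\tnext to Dubious ORF GENE2\n' >> "$tmp/toy.tab"
printf 'S4\tARS\t\tARS1\ttoy replication origin\n'           >> "$tmp/toy.tab"

# Whole-line search: also counts the tRNA row, whose description mentions ORF.
whole_line=$(grep -c ORF "$tmp/toy.tab")

# Column 2 only: counts just the rows whose feature type is ORF.
column_only=$(cut -f 2 "$tmp/toy.tab" | grep -c ORF)

# The same issue affects the chained grep for Dubious.
dubious_loose=$(grep ORF "$tmp/toy.tab" | grep -c Dubious)

# Exact field comparison with awk: type must be ORF and qualifier Dubious.
dubious_exact=$(awk -F'\t' '$2 == "ORF" && $3 == "Dubious" { n++ } END { print n + 0 }' "$tmp/toy.tab")

echo "whole line: $whole_line, column 2 only: $column_only"
echo "Dubious, chained grep: $dubious_loose, exact fields: $dubious_exact"
```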

The distinct genomic feature types listed in column 2:

$ cut -f 2 SGD_features.tab | sort | uniq
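A small extension of the same pipeline adds a count per feature type with uniq -c. This is a sketch on an invented two-column table, not on SGD_features.tab itself.

```shell
# Made-up table: an ID in column 1, a feature type in column 2 (hypothetical).
tmp=$(mktemp -d)
printf 'S1\tORF\nS2\ttRNA_gene\nS3\tORF\nS4\tARS\n' > "$tmp/toy.tab"

# Distinct feature types; sort -u is shorthand for sort | uniq.
types=$(cut -f 2 "$tmp/toy.tab" | sort -u)
echo "$types"

# Occurrences of each type, most frequent first.
cut -f 2 "$tmp/toy.tab" | sort | uniq -c | sort -rn

# Name of the most frequent type (second field of the first line).
most_common=$(cut -f 2 "$tmp/toy.tab" | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')
echo "most common: $most_common"
```

The sort before uniq matters, because uniq only collapses identical lines that are adjacent.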

3) In depth with grep


Adding the -o flag tells grep to report only the part of each line that matches the pattern.

egrep (equivalent to grep -E) lets you use extended regular expressions, which makes your pattern more flexible about what it matches.

Combining egrep with the -o argument, we can search for any number of GA repeats:

$ egrep -o '(GA)+' genome.fasta | sort | uniq -c

657488 GA
38100 GAGA
2278 GAGAGA
159 GAGAGAGA
23 GAGAGAGAGA
2 GAGAGAGAGAGA
2 GAGAGAGAGAGAGA
1 GAGAGAGAGAGAGAGA
1 GAGAGAGAGAGAGAGAGA
1 GAGAGAGAGAGAGAGAGAGA
1 GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
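To see how the pattern behaves, here is a small sketch on a made-up sequence string that is not taken from the genome. It uses grep -E, which behaves like egrep. It also illustrates one caveat: grep works line by line, so a repeat that continues across a line break in a FASTA file is reported as two shorter matches unless the newlines are removed first.

```shell
# Made-up sequence containing GA, GAGA, GA and GAGAGA (hypothetical data).
seq='GATTGAGACCGATTGAGAGA'

# -o prints each match on its own line; (GA)+ is one or more copies of GA.
matches=$(printf '%s\n' "$seq" | grep -E -o '(GA)+')
printf '%s\n' "$matches" | sort | uniq -c

# A repeat split over two lines is reported as two separate matches...
split=$(printf 'TTGA\nGATT\n' | grep -E -o '(GA)+')

# ...but as a single longer match once the newlines are removed.
joined=$(printf 'TTGA\nGATT\n' | tr -d '\n' | grep -E -o '(GA)+')

echo "split over lines:"
echo "$split"
echo "joined: $joined"
```

Because of this, counts of the longer repeats taken directly from a line-wrapped FASTA file may be underestimates.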