UNIX_Fall2012_session1_solutions
Solutions
1) Yeast Genome
Unzip all the files with a wildcard match.
$ cd fasta_files
$ gunzip *.fa.gz
Concatenate all the chromosome files into a single genome file with another wildcard match.
$ cat *.fa > cerevisiae_genome.fasta
Count the number of chromosomes by counting the FASTA header lines, which contain a > character:
$ grep -c '>' cerevisiae_genome.fasta
Watch out for searching for just > without the quotes: the shell treats an unquoted > as output redirection, so you may overwrite your genome with a blank file.
Look up the command wc. How do we use it?
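The hazard can be reproduced safely on a throwaway file. The sketch below is only an illustration: demo.fasta and its contents are made up, and everything happens in a temporary directory. It runs the quoted form first, then the unquoted form, and checks how many bytes are left.

```shell
# Work in a scratch directory so no real data is at risk.
tmp=$(mktemp -d)
printf '>chr1\nACGT\n>chr2\nGGCC\n' > "$tmp/demo.fasta"

# Quoted: '>' is the search pattern, so grep counts the header lines.
quoted_count=$(grep -c '>' "$tmp/demo.fasta")
echo "headers: $quoted_count"

# Unquoted: the shell treats > as output redirection and empties
# demo.fasta before grep starts. grep then fails because it has no pattern.
grep -c > "$tmp/demo.fasta" 2>/dev/null || true
size_after=$(wc -c < "$tmp/demo.fasta" | tr -d ' ')
echo "bytes left: $size_after"

rm -rf "$tmp"
```

Keeping a compressed copy of the original files, or working on a copy, is a cheap safeguard against this kind of slip.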
$ man wc
Use wc to count the length of the genome.
$ wc cerevisiae_genome.fasta
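What does each column mean? By default, wc prints the number of lines, words, and bytes, followed by the filename.
There are other characters besides nucleotides, such as the header lines, so how do we get rid of them? The command below removes the header lines; note that wc still counts the newline at the end of each sequence line.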
$ grep -v '>' cerevisiae_genome.fasta | wc
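To count nucleotides only, one option is to delete the newlines as well before counting. This is a minimal sketch on a made-up file (demo.fasta is hypothetical), assuming the sequence lines contain nothing but nucleotide letters:

```shell
tmp=$(mktemp -d)
printf '>chr1\nACGT\nAC\n>chr2\nGGCC\n' > "$tmp/demo.fasta"

# Drop header lines, delete newlines, then count the remaining characters.
nucleotides=$(grep -v '>' "$tmp/demo.fasta" | tr -d '\n' | wc -c | tr -d ' ')
echo "nucleotides: $nucleotides"

rm -rf "$tmp"
```

The same pipeline should work on cerevisiae_genome.fasta by substituting the filename.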
2) SGD Features
$ wget ftp://genome-ftp.stanford.edu/pub/yeast/chromosomal_feature/SGD_features.tab
If you want a different local name, the -O flag of wget allows you to specify the filename of the downloaded file.
Count the lines in the file that contain ORF:
$ grep -c ORF SGD_features.tab
And the Verified ones:
$ grep ORF SGD_features.tab | grep -c Verified
And the Dubious ones:
$ grep ORF SGD_features.tab | grep -c Dubious
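Now the real number of listed ORFs, matching only the second column (the feature type) so that mentions of ORF elsewhere on a line are not counted: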
$ cut -f 2 SGD_features.tab | grep -c ORF
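The difference between the two counts is easier to see on a small example. The three-column file below is invented for illustration (the real SGD_features.tab has more columns); it contains one tRNA row whose description happens to mention ORF.

```shell
tmp=$(mktemp -d)
# Hypothetical rows: ID, feature type, description (tab-separated).
printf 'S1\tORF\tkinase\nS2\ttRNA\tnear an ORF\nS3\tORF\tDubious\n' > "$tmp/demo.tab"

# Whole-line match: also counts the tRNA row that mentions ORF.
whole_line=$(grep -c ORF "$tmp/demo.tab")

# Column-restricted match: only rows whose second column contains ORF.
column_two=$(cut -f 2 "$tmp/demo.tab" | grep -c ORF)

# awk can additionally require the second column to equal ORF exactly.
exact=$(awk -F '\t' '$2 == "ORF"' "$tmp/demo.tab" | wc -l | tr -d ' ')

echo "whole line: $whole_line, column 2: $column_two, exact: $exact"
rm -rf "$tmp"
```

The awk variant is stricter than grep on the cut output, since it would not count a feature type that merely contains the letters ORF.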
List the distinct genomic feature types:
$ cut -f 2 SGD_features.tab | sort | uniq
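A small extension, sketched here on invented data, is to add -c to uniq so that each feature type is listed with the number of rows it has, then sort by that count. The file demo.tab and its rows are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'S1\tORF\nS2\ttRNA\nS3\tORF\nS4\tCDS\nS5\tORF\n' > "$tmp/demo.tab"

# sort groups identical types together, uniq -c counts each group,
# and sort -rn puts the most frequent type first.
counts=$(cut -f 2 "$tmp/demo.tab" | sort | uniq -c | sort -rn)
echo "$counts"

# Pull out the most frequent type and its count.
top=$(echo "$counts" | head -n 1 | awk '{print $2}')
top_count=$(echo "$counts" | head -n 1 | awk '{print $1}')

rm -rf "$tmp"
```

The first sort is needed because uniq only collapses adjacent identical lines.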
3) In depth with grep
Adding the -o flag tells grep to report only the section of each line that matches the pattern.
egrep (equivalent to grep -E) allows you to use extended regular expressions for your pattern, which lets you be more flexible about what you are matching.
Combining egrep with the -o argument, we can search for any number of GA repeats:
$ egrep -o '(GA)+' genome.fasta | sort | uniq -c
657488 GA
38100 GAGA
2278 GAGAGA
159 GAGAGAGA
23 GAGAGAGAGA
2 GAGAGAGAGAGA
2 GAGAGAGAGAGAGA
1 GAGAGAGAGAGAGAGA
1 GAGAGAGAGAGAGAGAGA
1 GAGAGAGAGAGAGAGAGAGA
1 GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
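To see how the pipeline produces a table like the one above, here is a minimal sketch on a short made-up sequence (demo.fasta is hypothetical). It uses grep -E, which is equivalent to egrep.

```shell
tmp=$(mktemp -d)
printf '>demo\nTTGAGATTGACCGAGAGATTGA\n' > "$tmp/demo.fasta"

# -o prints each match on its own line; (GA)+ matches the longest
# run of GA at each position, so GAGA is reported once, not as two GA.
repeats=$(grep -E -o '(GA)+' "$tmp/demo.fasta" | sort | uniq -c)
echo "$repeats"

# Number of plain GA matches and number of distinct repeat lengths.
ga_count=$(echo "$repeats" | awk '$2 == "GA" {print $1}')
distinct=$(echo "$repeats" | wc -l | tr -d ' ')

rm -rf "$tmp"
```

Because grep works line by line, a repeat that is split across two lines of a FASTA file is counted as two shorter matches, so the longest repeats in the table above may be underestimated.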