Recent Changes

Sunday, September 17

  1. page Introduction to basic Unix commands edited ... [cgrlunix@poset ~]$ awk '$3 - $2 + 1 > 1000' gene.bed We can also chain patterns, by using…
    ...
    [cgrlunix@poset ~]$ awk '$3 - $2 + 1 > 1000' gene.bed
    We can also chain patterns by using the logical operators && (and), || (or), and ! (not). For example, to extract all records from chromosome 1 whose length is greater than 1000:
    ...
    - $2 + 1 > 1000'
    The first pattern, $1 ~ /^1$/, is how we specify a regular expression. Regular expressions are written between slashes. Here, we’re matching the first field, $1, against the regular expression ^1$, which matches a field that is exactly 1. The tilde, ~, means match; to not match the regular expression we would use !~ (or !($1 ~ /^1$/)).
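As a small illustration of chained patterns, the sketch below runs the same kind of filter on a few made-up records instead of gene.bed. The record values and the variable names (bed, chr1_long, other_long) are assumptions for the example only.

```shell
# Hypothetical three-column records (chromosome, start, end); not taken from gene.bed.
bed=$(printf '1\t100\t2000\n1\t100\t200\n11\t100\t5000\nMT\t1\t3000\n')

# Chromosome "1" only (the anchors ^ and $ stop "11" from matching) and length > 1000.
chr1_long=$(printf '%s\n' "$bed" | awk '$1 ~ /^1$/ && $3 - $2 + 1 > 1000')

# Negate the match with !~ to keep long records from every other chromosome.
other_long=$(printf '%s\n' "$bed" | awk '$1 !~ /^1$/ && $3 - $2 + 1 > 1000')

printf 'chromosome 1:\n%s\n' "$chr1_long"
printf 'other chromosomes:\n%s\n' "$other_long"
```

Only the first record passes the chromosome 1 filter; the "11" and "MT" records pass the negated one.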
    - Pattern and action
    We can combine patterns with actions more complex than printing the entire record. For example, to add a column with the length of each feature (end position - start position) for mitochondrial records only, we could use:
    ...
    - $2 + 1}' gene.bed
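The full command is elided in this excerpt, so the following is only a sketch of the pattern-and-action idea, not the page's original command. It assumes made-up records and that mitochondrial records carry "MT" in the first field; the variable names are hypothetical.

```shell
# Hypothetical records; "MT" in field 1 marks mitochondrial features (an assumption).
bed=$(printf '1\t100\t2000\nMT\t1\t3000\nMT\t500\t600\n')

# Pattern: field 1 is exactly MT. Action: print the record plus a length column.
# OFS is set to a tab so the appended column is tab-separated like the rest.
mt_with_len=$(printf '%s\n' "$bed" | awk -v OFS='\t' '$1 ~ /^MT$/ { print $0, $3 - $2 + 1 }')

printf '%s\n' "$mt_with_len"
```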
    So far, these exercises illustrate how Awk is useful in (1) filtering data using rules that can combine regular expressions and arithmetic, and (2) reformatting the columns of data using arithmetic. These two applications alone make Awk an extremely handy tool in bioinformatics, and a huge time saver.
    Now let’s look at some slightly more advanced use cases. We’ll start by introducing two special patterns: BEGIN and END.
    The BEGIN pattern specifies what to do before the first record is read in, and END specifies what to do after the last record’s processing is complete. BEGIN is useful to initialize and set up variables, and END is useful to print data summaries at the end of file processing. For example, suppose we want to calculate the mean length of all the records in gene.bed. We would have to take the sum of all lengths, and then divide the sum by the total number of records. We can do this with:
    ...
    s += ($3 - $2 + 1) }; END
    ...
    gene.bed
    mean: 177.881224.604
    NR is the current record number, so on the last record NR is set to the total number of records processed. In the example above, we’ve initialized a variable s to 0 in BEGIN (variables you define do not need a dollar sign). Then, for each record we increment s by the length of the feature. At the end of the records, we print this sum s divided by the number of records NR, giving the mean.
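To make the BEGIN/END flow concrete, here is a minimal sketch on three made-up records whose lengths are 10, 20, and 30. The data and the variable names (bed, mean) are assumptions for the example; the command on the page is elided, so this is not necessarily its exact form.

```shell
# Hypothetical records with lengths 10, 20 and 30 (end - start + 1).
bed=$(printf '1\t1\t10\n1\t21\t40\nMT\t5\t34\n')

# BEGIN runs once before any input, the middle block runs per record,
# and END runs once after the last record, when NR holds the record count.
mean=$(printf '%s\n' "$bed" | awk 'BEGIN { s = 0 } { s += ($3 - $2 + 1) } END { print "mean: " s / NR }')

printf '%s\n' "$mean"
```

With a sum of 60 over 3 records, this prints mean: 20.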
    While Awk is designed to work with whitespace-separated tabular data, it’s easy to set a different field separator: simply specify which separator to use with the -F argument. For example, we could work with a CSV file in Awk by starting with awk -F",".
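A short sketch of the -F option on comma-separated input follows. The CSV content, its column layout, and the names geneA and geneB are invented for illustration.

```shell
# Hypothetical CSV with a header row: name, chromosome, start, end.
csv=$(printf 'name,chrom,start,end\ngeneA,1,100,250\ngeneB,MT,5,34\n')

# -F"," splits fields on commas; NR > 1 skips the header line.
lengths=$(printf '%s\n' "$csv" | awk -F"," 'NR > 1 { print $1, $4 - $3 + 1 }')

printf '%s\n' "$lengths"
```

Note that -F only changes how input is split. Output fields are still joined with a space unless OFS is set as well, for example with -v OFS=','.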
    7:05 pm
  2. page Introduction to basic Unix commands edited ... - Pattern only Let’s now look at how we can incorporate simple pattern matching. Suppose we w…
    ...
    - Pattern only
    Let’s now look at how we can incorporate simple pattern matching. Suppose we wanted to write a filter that only output lines where the length of the feature (end position - start position) is greater than 1,000. Awk supports arithmetic with the standard operators +, -, *, /, % (remainder), and ^ (exponentiation). We can subtract within a pattern to calculate the length of a feature, and filter on that expression:
    ...
    - $2 + 1 > 1000'
    We can also chain patterns by using the logical operators && (and), || (or), and ! (not). For example, to extract all records from chromosome 1 whose length is greater than 1000:
    [cgrlunix@poset ~]$ awk '$1 ~ /^1$/ && $3 - $2 > 1000' gene.bed
    6:59 pm
  3. page Introduction to basic Unix commands edited ... remove the first and last lines [cgrlunix@poset ~]$ sed '1d;$d' Homo_sapiens.GRCh38.81.gtf | …
    ...
    remove the first and last lines
    [cgrlunix@poset ~]$ sed '1d;$d' Homo_sapiens.GRCh38.81.gtf | less -S
    ...
    or transliterate)
    tr reads from standard input (to process a file, pipe or redirect it into tr) and replaces, removes, or compresses specific characters
    1. replacement:
    For example, we can use tr to transform lowercase to uppercase
    [cgrlunix@poset ~]$ echo "hello world" | tr 'h' 'H'
    ...
    to uppercase
    [cgrlunix@poset ~]$ echo "hello world" | tr 'a-z' 'A-Z'
    use tr for searching and replacing specific characters
    ...
    [cgrlunix@poset ~]$ echo "hello world" | tr -d 'o'
    3. compress:
    ...
    identical characters into one
    [cgrlunix@poset ~]$ echo "hello world" | tr -s 'l'
    [cgrlunix@poset ~]$ echo "hello world" | tr -s ' '
    2:57 pm
  4. page Introduction to basic Unix commands edited ... remove the first and last lines [cgrlunix@poset ~]$ sed '1d;$d' Homo_sapiens.GRCh38.81.gtf | …
    ...
    remove the first and last lines
    [cgrlunix@poset ~]$ sed '1d;$d' Homo_sapiens.GRCh38.81.gtf | less -S
    tr (translate or transliterate)
    tr reads from standard input (to process a file, pipe or redirect it into tr) and replaces, removes, or compresses specific characters
    1. replacement:
    For example, we can use tr to transform lowercase to uppercase
    [cgrlunix@poset ~]$ echo "hello world" | tr 'h' 'H'
    use tr to transform all lowercase to uppercase
    [cgrlunix@poset ~]$ echo "hello world" | tr 'a-z' 'A-Z'
    use tr for searching and replacing specific characters
    [cgrlunix@poset ~]$ echo "hello world" | tr 'eo' '12'
    2. delete:
    we can use tr -d to search for characters and delete them
    [cgrlunix@poset ~]$ echo "hello world" | tr -d 'o'
    3. compress:
    tr -s can be used to compress consecutive identical characters into one
    [cgrlunix@poset ~]$ echo "hello world" | tr -s 'l'
    [cgrlunix@poset ~]$ echo "hello world" | tr -s ' '
    [cgrlunix@poset ~]$ echo "hello world" | tr -s ' ' ','
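The three modes above can be checked side by side. This sketch captures each result in a variable (the variable names are arbitrary) and uses an input with several consecutive spaces so that the effect of -s is visible.

```shell
# 1. replacement: map each character in the first set to its partner in the second.
upper=$(echo "hello world" | tr 'a-z' 'A-Z')      # HELLO WORLD
swapped=$(echo "hello world" | tr 'eo' '12')      # h1ll2 w2rld

# 2. delete: drop every occurrence of the listed characters.
deleted=$(echo "hello world" | tr -d 'o')         # hell wrld

# 3. compress: squeeze runs of a repeated character into one.
squeezed=$(echo "hello   world" | tr -s ' ')      # hello world
# With two sets, tr translates first and then squeezes the result.
joined=$(echo "hello   world" | tr -s ' ' ',')    # hello,world

printf '%s\n' "$upper" "$swapped" "$deleted" "$squeezed" "$joined"
```

To run tr over a file, supply the file through redirection, for example tr 'a-z' 'A-Z' < list1.txt.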

    awk:
    Awk is a specialized language that allows you to do a variety of text processing tasks with ease. It mainly helps us extract data from and manipulate tabular plaintext files. Throughout the workshop, we’ve seen how we can use simple Unix tools like grep, cut, and sort to inspect and manipulate plaintext tabular data in the shell. For many trivial bioinformatics tasks, these tools allow us to get the job done quickly and easily (and often very efficiently). Still, some tasks are slightly more complex and require a more expressive and powerful tool. This is where the language and tool Awk excels.
    2:56 pm
  5. page Introduction to basic Unix commands edited ... CCTGCTaGGGCTTTCTGTTGCCaaGaGGCCTCTCTGGaGaCaGGCaTCTaTGCaaaGTGGGaaGGaCaCCaCTGaGCaaGaaaTTCTGaaaGCT…
    ...
    CCTGCTaGGGCTTTCTGTTGCCaaGaGGCCTCTCTGGaGaCaGGCaTCTaTGCaaaGTGGGaaGGaCaCCaCTGaGCaaGaaaTTCTGaaaGCTaT
    CaaCaTCaaTTCCTTTGCaGaGTGTGGCaTCaaTTTaTTCCaTGaGaGTGTaTCTaaaTCaGCCCTGaGCCaaGaaTTCGaaGCTTTCTTTCGT
    Now, suppose we want to capture the chromosome name, and start and end positions in a string containing a genomic region in the format "chr1:28427874-28425431", and output these as three columns (bed format). We could use:
    [cgrlunix@poset ~]$ echo chr1:28427874-28425431 |sed 's/:/\t/' | sed 's/-/\t/'
    chr1 28427874 28425431
    Or we can combine the two sed commands:
    [cgrlunix@poset ~]$ echo chr1:28427874-28425431 | sed 's/[:-]/\t/g'
    chr1 28427874 28425431
    [ ] (Square Brackets) — Matches any of a set of characters inside the bracket
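The \t escape in a sed replacement is understood by GNU sed but is not guaranteed by every sed implementation. As an alternative sketch (not the page's own method), awk can do the same split by using the bracket expression as its field separator. The variable names are arbitrary.

```shell
region="chr1:28427874-28425431"

# -F'[:-]' splits on either ":" or "-"; OFS='\t' joins the output fields with tabs.
bed_line=$(echo "$region" | awk -F'[:-]' -v OFS='\t' '{ print $1, $2, $3 }')

printf '%s\n' "$bed_line"
```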

    Substitutions make up the majority of sed’s use cases, but this is just scratching the surface of sed’s capabilities. It’s also possible to select and print certain ranges of lines with sed. In this case, we’re not doing pattern matching, so we don’t need slashes. To print lines 5 through 8 of the fastq file, we use:
    [cgrlunix@poset ~]$ sed -n '5,8p' data.fastq
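Line addressing is easy to verify on numbered input. The sketch below uses seq to generate ten lines in place of data.fastq (an assumption made so the example is self-contained) and also shows the 1d;$d form used earlier on the page.

```shell
# Ten numbered lines stand in for a real file.
lines=$(seq 1 10)

# -n suppresses automatic printing; 5,8p prints only lines 5 through 8.
middle=$(printf '%s\n' "$lines" | sed -n '5,8p')

# 1d deletes the first line and $d deletes the last one.
trimmed=$(printf '%s\n' "$lines" | sed '1d;$d')

printf 'lines 5-8:\n%s\n' "$middle"
printf 'without first and last:\n%s\n' "$trimmed"
```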
    1:18 pm
  6. page Introduction to basic Unix commands edited ... 11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT The above command grep each line starting with…
    ...
    11:CGTCGACTGATCGTAGCTGATCGTACGTCGACTGATCGTAGCT
    The above command finds each line that starts with "C" and ends with "T", with any characters in between, and prints the line number of each matching line. Here . means any single character (except newline), and * means that the preceding item is matched zero or more times.
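The command itself is elided in this excerpt. A sketch consistent with the description, assuming the pattern ^C.*T$ together with -n and four made-up sequences, looks like this:

```shell
# Hypothetical sequences, one per line.
seqs=$(printf 'ACGT\nCGTA\nCAAT\nCT\n')

# ^C anchors a leading C, .* allows any run of characters (including none),
# T$ anchors a trailing T, and -n prefixes each match with its line number.
hits=$(printf '%s\n' "$seqs" | grep -n '^C.*T$')

printf '%s\n' "$hits"
```

Lines 3 and 4 match. "CT" qualifies because .* may match zero characters.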
    ...
    work on file1.fasta. It is
    [workshop@poset ~]$ grep -A 1 ">Transcript10" file1.fasta
    -A N[int] means that it will print N lines of trailing context after matching lines.
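Because a FASTA record here is a header line followed by its sequence, -A 1 returns the whole record. The sketch below builds a tiny temporary FASTA file with invented headers and sequences rather than using file1.fasta.

```shell
# Write a small hypothetical FASTA file to a temporary location.
fasta=$(mktemp)
printf '>Transcript1\nACGT\n>Transcript10\nGGCC\n>Transcript11\nTTAA\n' > "$fasta"

# Print the matching header plus one line of trailing context (the sequence).
record=$(grep -A 1 '>Transcript10' "$fasta")

rm -f "$fasta"
printf '%s\n' "$record"
```

A shorter pattern such as '>Transcript1' would also match the Transcript10 and Transcript11 headers, so it helps to make the pattern as specific as the headers require.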
    12:33 pm
  7. page Introduction to basic Unix commands edited ... Here, we specify the columns (and their order) we want to sort by as -k arguments. In technica…
    ...
    Here, we specify the columns (and their order) we want to sort by as -k arguments. In technical terms, -k specifies the sorting keys and their order. Each -k argument takes a range of columns as start, end, so to sort by a single column we use start, start. In the example above, we first sorted by the first column (chromosome), since the first -k argument was -k1,1. Sorting by the first column alone leads to many ties in rows with the same chromosomes (e.g. “1” and “MT”). Adding a second -k argument with a different column tells sort how to break these ties. In our example, -k2,2n tells sort to sort by the second column (start position), treating this column as numerical data (since there’s an n in -k2,2n).
    \b matches the empty string at the edge of a word. It sets a boundary to the matches.
    ...
    the -r argument:
    [cgrlunix@poset ~]$ tail -n +6 Homo_sapiens.GRCh38.81.gtf | grep "\bexon\b" | cut -f1,4,5 | sort -k1,1 -k2,2nr | less -S
    uniq (when fed a text file, outputs the file with consecutive identical lines collapsed to one)
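The word "consecutive" matters: uniq only collapses neighbouring duplicates, which is why it is usually paired with sort. A minimal sketch with made-up feature names:

```shell
features=$(printf 'exon\nexon\ngene\nexon\n')

# Without sorting, the final "exon" survives because it is not adjacent to the others.
collapsed=$(printf '%s\n' "$features" | uniq)

# Sorting first groups identical lines; -c adds counts, and awk tidies the
# variable-width output of uniq -c into name:count form.
counted=$(printf '%s\n' "$features" | sort | uniq -c | awk '{ print $2 ":" $1 }')

printf '%s\n' "$collapsed"
printf '%s\n' "$counted"
```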
    11:53 am
  8. page Introduction to basic Unix commands edited ... #!genome-build-accession NCBI:GCA_000001405.18 #!genebuild-last-updated 2015-06 ... sort t…
    ...
    #!genome-build-accession NCBI:GCA_000001405.18
    #!genebuild-last-updated 2015-06
    ...
    sort these records by chromosomes
    ...
    then store the results in a
    ...
    file called "gene.bed" - this is
    [cgrlunix@poset ~]$ tail -n +6 Homo_sapiens.GRCh38.81.gtf | grep "\bexon\b" | cut -f1,4,5 | sort -k1,1 -k2,2n > gene.bed
    Here, we specify the columns (and their order) we want to sort by as -k arguments. In technical terms, -k specifies the sorting keys and their order. Each -k argument takes a range of columns as start, end, so to sort by a single column we use start, start. In the example above, we first sorted by the first column (chromosome), since the first -k argument was -k1,1. Sorting by the first column alone leads to many ties in rows with the same chromosomes (e.g. “1” and “MT”). Adding a second -k argument with a different column tells sort how to break these ties. In our example, -k2,2n tells sort to sort by the second column (start position), treating this column as numerical data (since there’s an n in -k2,2n).
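The tie-breaking behaviour of the second key can be seen on a few made-up records. Two things in this sketch are additions that the page's command does not use: LC_ALL=C, to make the text ordering of the first key predictable, and -t with a tab, to split fields on tabs explicitly.

```shell
tab=$(printf '\t')
# Hypothetical records: chromosome, start, end.
bed=$(printf 'MT\t100\t200\n1\t500\t600\n10\t5\t9\n1\t20\t80\n')

# First key: column 1 as text. Second key: column 2 as a number, ascending.
ascending=$(printf '%s\n' "$bed" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n)

# Same keys, but the r flag reverses the numeric order within each chromosome.
descending=$(printf '%s\n' "$bed" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2nr)

printf 'ascending:\n%s\n' "$ascending"
printf 'descending:\n%s\n' "$descending"
```

The two chromosome "1" records tie on the first key, so their order is decided by the start position: 20 before 500 with n, and 500 before 20 with nr.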
    11:42 am
  9. page Introduction to basic Unix commands edited ... [cgrlunix@poset ~]$ ls Homo_sapiens.GRCh38.81.gtf biglist.txt data.fastq file1.fasta file…
    ...
    [cgrlunix@poset ~]$ ls
    Homo_sapiens.GRCh38.81.gtf biglist.txt data.fastq file1.fasta file2.fasta list1.txt list2.txt test.txt unixstuff
    If you want to gzip a file and keep the original file, use -c (write output to standard output; keep original files unchanged):
    [cgrlunix@poset ~]$ gzip -c file1.fasta > testfile.gz
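A quick way to confirm that -c leaves the original in place is to compress a small file and then decompress the copy back to standard output. This sketch works in a temporary directory with an invented two-line FASTA file; the variable names are arbitrary.

```shell
workdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$workdir/file1.fasta"

# -c writes the compressed stream to stdout, which is redirected into a new file.
gzip -c "$workdir/file1.fasta" > "$workdir/testfile.gz"

# The original is still present, and the compressed copy round-trips to the same text.
kept=no
[ -f "$workdir/file1.fasta" ] && kept=yes
roundtrip=$(gzip -dc "$workdir/testfile.gz")

rm -rf "$workdir"
printf 'original kept: %s\n%s\n' "$kept" "$roundtrip"
```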

    tar
    tar stands for tape archive: we use tar to combine files from one or more folders into a single file for easy storage and distribution. tar and gzip can be used together to compress directories.
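As a sketch of tar and gzip working together, the following archives a small made-up directory into one compressed file and then lists the archive contents. The directory and file names are hypothetical.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/project"
printf 'a\n' > "$workdir/project/list1.txt"
printf 'b\n' > "$workdir/project/list2.txt"

# -c create, -z compress with gzip, -f archive name; -C changes into workdir first
# so that the archive stores the relative path "project/...".
tar -czf "$workdir/project.tar.gz" -C "$workdir" project

# -t lists the contents without extracting them.
listing=$(tar -tzf "$workdir/project.tar.gz" | sort)

rm -rf "$workdir"
printf '%s\n' "$listing"
```

Extracting later uses -x in place of -c, for example tar -xzf project.tar.gz.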
    10:10 am

Friday, September 15

  1. page Introduction to basic Unix commands edited ... -h means that it prints sizes in human readable format (e.g., 1K 234M 2G) nproc ... how ma…
    ...
    -h means that it prints sizes in human readable format (e.g., 1K 234M 2G)
    nproc
    ...
    how many computation units (e.g. cores) are in
    [cgrlunix@poset ~]$ nproc
    24
    3:32 pm
