Your instructor is Ke Bi

Important note: most of the course material, including examples and lectures, is adapted or borrowed directly from free online Perl tutorials:

http://qntm.org/perl (I used a lot of Sam Hughes' original lectures in this workshop)

http://www.perlmonks.org/

http://perlmaven.com/

http://www.tutorialspoint.com/perl/

http://www.perlfect.com/

http://perldoc.perl.org/

http://perl101.org/

http://www.tizag.com/perlT/

http://stackoverflow.com/

etc.


Perl (Practical Extraction and Report Language) is a dynamic, dynamically-typed, high-level, scripting (interpreted) language most comparable with PHP and Python. Perl's syntax owes a lot to ancient shell scripting tools, and it is famed for its overuse of confusing symbols, the majority of which are impossible to Google for. Perl's shell scripting heritage makes it great for writing glue code: scripts which link together other scripts and programs. Perl is ideally suited for processing text data and producing more text data. Perl is widespread, popular, highly portable and well-supported. Perl was designed with the philosophy "There's More Than One Way To Do It" (TMTOWTDI) (contrast with Python, where "there should be one - and preferably only one - obvious way to do it").

Please refer to https://wiki.python.org/moin/PerlPhrasebook for a nice comparison of Perl and Python syntax.

The first Perl script: Hello world

A Perl script is a plain text file, conventionally saved with the extension .pl.

Type the following lines of code into a text editor and save it as "firstscript.pl":

#! /usr/bin/perl
use strict;
use warnings;
 
#the following line will print "Hello, World!"
print "Hello, World!", "\n";

Perl scripts are interpreted by the Perl interpreter, perl or perl.exe. To run it, in a terminal window type:

$ perl firstscript.pl


The script starts with a shebang line: #! (number sign + exclamation mark) followed by the full path of the interpreter. When the script is executed directly (for example as ./firstscript.pl on a Unix-like system), this line tells the operating system which Perl interpreter to use. If you have multiple versions of Perl installed on your computer and want to use a particular one, point the shebang at that interpreter.

After the shebang line you can see two other statements, each ending with a semicolon: use strict; and use warnings; They are placed at the top of each script and are called "pragmas". A pragma sends a signal to the Perl interpreter at the stage of initial syntactic validation, before the program starts running. These lines have no effect when the interpreter encounters them at run time. The pragmas use strict and use warnings are used to detect certain types of coding errors (they catch typos, restrict unsafe constructs, flag variable naming collisions, etc.).

print is a built-in Perl function which writes "Hello, World!" to STDOUT (standard output, normally the terminal). "\n" is a special character which adds a newline (return) after "Hello, World!". Double quotation marks, "", are used to enclose data that needs to be interpolated before processing. The semicolon, ; , is the statement terminator. The number sign # begins a comment, which helps the programmer document what a line of code means in the current context. The number sign can also comment out a line of code for debugging purposes. A comment lasts until the end of the line.

Whitespaces in Perl

A Perl program does not care about whitespace as long as it is not inside a quoted string. The following program works perfectly fine:
#! /usr/bin/perl
use strict;
use warnings;
 
#the following line will print "Hello, World!"
print                                       "Hello, World!", "\n";

But whitespace inside a quoted string, including a line break, is printed as is. For example:
#! /usr/bin/perl
use strict;
use warnings;
 
print "Hello,
World!", "\n";

Variables

Perl variables come in three types: scalars, arrays and hashes. Each type has its own sigil: $ (dollar sign), @ (at sign) and % (percent sign), respectively.

- Scalars are simple variables. A scalar is either a number, a string, or a reference. A Perl reference is a scalar data type that holds the memory location of another variable, which could be a scalar, an array, or a hash. Because of its scalar nature, a reference can be used anywhere a scalar can be used.

- Arrays are ordered lists of scalars that you access with a numeric index which starts with 0.

- Hashes are unordered sets of key/value pairs whose values you access using the keys as subscripts.

Note:

1. Perl is a case sensitive programming language. Thus $results and $Results are two different variables.

2. Avoid variable naming collisions. Even experienced programmers make errors in variable names. A common case is forgetting to rename an instance of a variable when cleaning up or refactoring code. The use strict pragma forces programmers to declare each variable by putting "my" in front of it the first time it is used. The variable then remains visible (valid) in scope until the end of the enclosing block (delimited by {}) or, if it is declared outside any block, until the end of the script.
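
The scoping rule in note 2 can be seen in a small sketch. The variable names $count and $inner are made up for illustration only; the commented-out line shows what use strict would reject.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my $count = 1;              # declared outside any block: visible until the end of the script
{
    my $inner = $count + 1; # declared inside the block: visible only until the closing brace
    print "inside the block: ", $inner, "\n";
}
# print $inner, "\n";       # would not compile under "use strict": $inner no longer exists here
print "outside the block: ", $count, "\n";
```

If you uncomment the marked line, Perl refuses to run the script at all, which is exactly how use strict catches misspelled or out-of-scope variable names early.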


Scalar variables

Again, a scalar is either a number, a string, or a reference. A reference is an address of another variable.

Now let’s try doing some practice by creating and changing scalar variables.


#! /usr/bin/perl
use strict;
use warnings;
 
my $x = 6;  # "=" looks like an equal sign, but it is the assignment operator: it assigns 6 to $x
my $y = 4;
 
my $c = $x + $y;
print $c, "\n";
 
my $d = $x - $y;
print $d , "\n";
 
my $e = $x * $y;
print $e, "\n";
 
my $f = $x / $y;
print $f, "\n";
 
my $g = $x ** $y; ## ** is the exponentiation operator. In this case it is 6 to the power of 4
print $g, "\n";
 
my $h = $x % $y; ## % is a modulus operator. The value of the expression 6 % 4
                 ## is the remainder when 6 is divided by 4, which is 2
print $h, "\n";
 
###############
 
 
$x += 1; # this is equivalent to $x = $x + 1
print $x , "\n";
 
$x -= 1; # this is equivalent to $x = $x - 1
print $x , "\n";
 
$x *=2; # this is equivalent to $x = $x *2
print $x , "\n";
 
$x /= 2; # this is equivalent to $x = $x /2
print $x , "\n";
 
###############
 
my $j = $x; #assign the value of $x to a new variable $j
$j += 2; # increment $j by 2
print $x, "\n";
print $j, "\n";
 
##############
 
# String concatenation using the string concatenation operator . :
my $k = $x . $y;
print $k, "\n";
 
$k = $k . $x . $y;
print $k, "\n";
 
my $l = "Hello";
my $m = "World";
my $n = $l . "\t".  $m . "!"; #"\t" is a special character meaning a tab
print $n, "\n";
 
# .= is a concatenation assignment operator, which means appending a string to an existing string
$l .= "," . "\t". "my name is Ke!";
print $l, "\n";
 
##############
 
#to get the length of a scalar variable use the function length
my $o = length ($l);
print $o, "\n";
 
###############
 
# substr means extracting a portion from a string. The syntax is substr(string, startPosition, len).
# This function starts counting from 0, not 1!
my $new = "Hello, my name is Ke!";
my $p = substr ($new, 18, 2);
print $p, "\n";
 
##we can also use substr to replace a substring within the string
substr ($new, 18, 2) = "Adam";
print $new, "\n";

Array variables

An array variable is declared with an @ sign and is typically initialized with a parenthesised list of scalars. Its elements are indexed by integers beginning at 0.

#! /usr/bin/perl
use strict;
use warnings;
 
my @array = ("print", "these", "strings", "out", "for", "me");
 
print "@array", "\n"; #In perl, when you print an array inside double quotes,
                      #the array elements are printed with spaces inserted between them.
 
#what if printing an array without double quotes
print @array, "\n";
 
# You have to use a dollar sign to access a value from an array,
# because the value being retrieved is not an array but a scalar:
print $array[0], "\n"; #print the first item of this array
print $array[1], "\n"; #print the second item of this array
print $array[6], "\n"; #this item doesn’t exist since it is outside the indexes (0 to 5).
                       #its value is undefined, so you will get an "uninitialized value" warning
print "@array[1..3]","\n"; #print the 2nd to 4th items in order
 
# You can use negative indices to retrieve entries starting from the end and working backwards
print $array[-1], "\n"; #print the last item of this array
print $array[-2], "\n"; #print the second last item of this array
 
#how many items in the array?
print "This array has ", scalar @array, " elements", "\n";
 
#print the last populated index of this array:
print "The last populated index is ", $#array, "\n";
 
#manipulating arrays
#push: Pushes the values of the list onto the end of the array.
 
#push a scalar to the end of @array;
push @array, "!";
print "@array", "\n";
 
#push a new array to the end
push @array, ("!","!");
print "@array", "\n";
 
#pop: Pops off the last item of the array.
pop @array;
print "@array", "\n";
 
#shift: Shifts the first value of the array off
#shortening the array by 1 and moving everything down.
shift @array;
print "@array","\n";
 
#unshift: Prepends a list (scalars or arrays) to the front of the array
unshift @array, "print";
print "@array","\n";
 
#splice: splice removes and returns an array slice.
#splice (@array, StartingIndex, NumElements);
print splice (@array,2,2), "\n"; #in this case it cuts off a chunk starting with the element
                                 #in the 3rd position (index 2) and 2 elements long.
print "@array","\n"; # print out the remaining elements in the array
 
#reverse: returns elements in an array in reverse order.
my @revarray = reverse @array;
print "@array","\n";
print "@revarray","\n";
 
#split and join
#split: split function is to break up strings into a list of substrings by user-defined delimiters
#and we can place the resulting list of substrings into an array.
 
my $string = "perl is powerful but complicated!";
my @test1 = split (/\s+/, $string); #Splits this string and uses one or more whitespace
                                    #as the delimiter. Store the list in an array called @test1.
print "@test1", "\n";
 
my @test2 = split (/o/, $string); #Splits line and uses letter "o" as the delimiter.
                                  #"o" is discarded, returning only what is found to either
                                  #side of the delimiters
print "@test2", "\n";
 
 
#Preserving delimiters after splitting:
#If you want to keep the delimiters, here's an example of how.
 
my @test3 = split ( /(o)/, $string ); #Splits line and uses letter "o" as the delimiter.
                                      #"o" is kept as an independent item. The parentheses cause
                                      #the delimiters to be captured into the list passed to
                                      #@test3 right alongside the stuff between the delimiters.
print "@test3", "\n";
 
#The null delimiter: delimiter is indicated to be a null string (a string of zero characters).
my $title = "doctor";
my @letters = split ( //, $title ); #Now @letters contains a list of six letters, "d", "o", "c",
                                    #"t", "o" and "r". If split is given a null string as a
                                    #delimiter, it splits on each null position in the string,
                                    #or in other words, every character boundary. The effect is
                                    #that the split returns a list broken into individual
                                    #characters of $title.
print "@letters\n";
 
#join: join function is in some ways the inverse of split.
#It takes a list of strings in an array and joins them together with a delimiter and returns that new string.
 
my @names = ("my", "name", "is", "adam");
my $joined = join ("\t" , @names); #it concatenates the items of the array,
                                   #separating them with tabs, into one new string.
 
print $joined, "\n";
 
my $joined2 = join ("," , @names); #it concatenates the items of the array,
                                   #separating them with commas, into one string.
print $joined2, "\n";
 
#map: takes an array as input and applies an operation to every item ($_) in this array.
#It then constructs a new list out of the results. This list can be stored in an array.
#This is provided in the form of a single expression inside braces:
my @map_result = map {uc $_ } @array;
print "@map_result","\n";
 
#grep: This function takes an array as input and returns a filtered list as output.
#The syntax is similar to map. This time, the argument is evaluated for each scalar $_
#in the input array. If a boolean true value is returned, the scalar is put into
#the output list which can be stored in an array, otherwise not.
my @grep_result = grep { length $_ == 5 } @array;
print "@grep_result", "\n";
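
The functions split, map and join are often chained: split a line into fields, transform each field, and join the results back into one string. The following is a minimal sketch of that pattern; the record in $line is hypothetical example data.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my $line   = "gene1,chr2,1500";       # hypothetical comma-separated record
my @fields = split (/,/, $line);      # ("gene1", "chr2", "1500")
my @upper  = map { uc $_ } @fields;   # upper-case every field
my $tabbed = join ("\t", @upper);     # glue the fields back together with tabs
print $tabbed, "\n";
```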
 
 

Caution. Some day you will put somebody's email address inside a double-quoted string, "jeff@gmail.com". This will cause Perl to look for an array variable called @gmail to interpolate into the string. Because no such array has been declared, the script fails under use strict (without use strict, the "@gmail" part silently disappears from the string). Interpolation can be prevented in two ways: by backslash-escaping the sigil, or by using single quotes instead of double quotes.

The backslash can perform one of two tasks: it either takes away the special meaning of the character following it (for instance, "\@gmail" produces the literal text @gmail instead of interpolating an array called @gmail), or it starts an escape sequence (\n, \t).

#assuming @array from the array examples above is still in scope:
print "@array", "\n"; # this prints the list saved in @array
print "\@array","\n"; # this prints the literal string @array
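
Applied to the email address from the caution above, the two ways of preventing interpolation might look like this (a small sketch; the variable names are made up for illustration):

```perl
#! /usr/bin/perl
use strict;
use warnings;

my $escaped = "jeff\@gmail.com";   # double quotes with the @ backslash-escaped
my $single  = 'jeff@gmail.com';    # single quotes: nothing is interpolated
print $escaped, "\n";
print $single, "\n";
print "both strings are identical", "\n" if ($escaped eq $single);
```

Keep in mind that single quotes also switch off escape sequences such as \n and \t, so use them only when the string should be taken literally.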
 

Hash variables

A hash is an un-ordered group of key-value pairs. The keys are unique strings and the values are scalars (each either a number, a string, or a reference).

Some people think that hashes are like arrays (the old name 'associative array' also indicates this, and in some other languages, such as PHP, there is no difference between arrays and hashes.), but there are two major differences between arrays and hashes. Arrays are ordered, and you access an element of an array using its numerical index. Hashes are un-ordered and you access a value using a unique key which is a string.

#! /usr/bin/perl
use strict;
use warnings;
 
#Some examples:
#create an empty hash;
my %hash1;
#to insert key-value pairs, the basic syntax is $hash1{key} = value. key is placed in {}.
#to access a specific value using a key, we use $ sign (not a % sign) because value is a scalar.
 
#Insert 4 key-value pairs into a hash
$hash1{"apple"} = "green";
$hash1{"banana"} = "yellow";
$hash1{"strawberry"} = "red";
$hash1{"grape"} = "purple";
 
#now this hash contains four key-values pairs, the values are strings.
 
#print the color of apple
print $hash1{"apple"} , "\n";
 
#print all the keys
print keys %hash1, "\n"; # produce a list of the keys without delimiter separating them.
 
#use function join to join these keys by tabs
print join ("\t", keys %hash1), "\n";
 
#print all the values
print values %hash1, "\n";
 
#use function join to join these values by tabs
print join ("\t", values %hash1), "\n";
 
#Note: The order of keys %hash1 and values %hash1 is effectively random.
#They will differ between runs of the program.
 
#If the key does not exist, we'll get a warning about uninitialized value.
print $hash1{"orange"}, "\n";
 
#We can also fill a hash in one step by assigning it a list of key-value pairs:
 
my %hash2 = ("blueberry" => "blue", "orange" => "orange", "cherry" => "red");
#=> is called the fat arrow or fat comma, and it is used to indicate pairs of elements.
 
#print the size (key-values pairs) of the hash.
print scalar (keys %hash2), "\n";
 
#delete a key-value pair from the hash.
delete $hash2{"cherry"};
print scalar (keys %hash2), "\n"; #check how many key-value pairs are left in this hash

To recap, you have to use square brackets to retrieve a value from an array, but you have to use braces to retrieve a value from a hash. The square brackets are effectively a numerical operator and the braces are effectively a string operator.
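
As noted in the hash example, reading a key that does not exist triggers an uninitialized-value warning. One way to avoid it is to test the key first with the built-in function exists. The sketch below uses a hypothetical %color hash and an if/else block, which is explained in the Conditionals section.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my %color = ("apple" => "green", "banana" => "yellow");   # hypothetical example data
my $key = "orange";

if (exists $color{$key}) {
   print $key, " is ", $color{$key}, "\n";
}
else {
   print $key, " is not in the hash", "\n";   # this branch runs: no warning is produced
}
```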

Multi-dimensional arrays and hashes (nested data structures)

Note: Perl arrays and hashes CAN NOT contain other arrays and hashes as elements. They can only contain scalars. To manage complicated data structures like multidimensional arrays and nested hashes, Perl introduced a feature called "reference", and using references is the key to managing complicated, structured data in Perl.


#use () for actual arrays and hashes
my @actual_array = ("a", "b", "c");
my %actual_hash = ("a"=>"1", "b"=>"2", "c"=> "3");
 
#to get reference of target arrays and hashes, use backslash \. note: references are scalars
my $arrayref = \@actual_array;
my $hashref = \%actual_hash;
 
#to de-reference arrays and hashes, put @ or % in front of the reference
my @array_copy = @{$arrayref};
my %hash_copy = %{$hashref};
 
#essentially @$arrayref is exactly the same as @actual_array, and %$hashref is the same as %actual_hash
 
#in nested data structures we use [] for array references and {} for hash references
$arrayref = ["a", "b", "c"];
$hashref = {"a"=> "1", "b"=> "2", "c"=> "3"};
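
Because a reference points at the original variable rather than at a copy, changes made through the reference show up in the original. A minimal sketch with made-up variable names:

```perl
#! /usr/bin/perl
use strict;
use warnings;

my @letters = ("a", "b", "c");
my $ref = \@letters;

push @{$ref}, "d";             # adds an element to @letters itself
$ref->[0] = "A";               # the arrow operator reaches a single element

print "@letters", "\n";        # A b c d
print scalar @{$ref}, "\n";    # 4

my %count = ("a" => 1);
my $href = \%count;
$href->{"b"} = 2;              # adds a key to %count itself
print join (",", sort keys %count), "\n";   # a,b
```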

array of arrays: Each element can have an internal array (indeed a reference to an array). And each element of the internal array can have its own internal array and so on.
#! /usr/bin/perl
use strict;
use warnings;
 
my @aoa1 = (
             [ "one", "two", "three"],
             [ "4", "5", "6", "7" ],
             [ "alpha", "beta" ]
           );
 
#in this example, the outer array has 3 internal array references (enclosed using square brackets []). In this case ["one","two","three"] is a reference to array ("one","two","three").
 
print $aoa1[1], "\n";
#it prints something like ARRAY(0x7fc12382d128). As mentioned, Perl does not have multi-dimensional arrays.
#What you see here is that the second element of the @aoa1 array is a reference to an internal,
#so-called anonymous array that holds the actual values. The ARRAY(0x7fc12382d128) is the address
#of that internal array in memory; the exact value differs from run to run.
 
#to print items of the entire array
print "@{$aoa1[1]}", "\n";
 
#To access the third element of the second array, we need to:
print $aoa1[1][2], "\n"; #or do print $aoa1[1]->[2]. If your reference is a reference
                         #to an array or hash variable, you can get data using the more
                         #popular arrow operator, ->. -> can be omitted between subscripts ([][])
 
#to construct array of arrays
my @aoa2;
$aoa2[0][0] = "one";
$aoa2[0][1] = "two";
$aoa2[0][2] = "three";
$aoa2[1][0] = "4";
$aoa2[1][1] = "5";
$aoa2[1][2] = "6";
$aoa2[1][3] = "7";
$aoa2[2][0] = "alpha";
$aoa2[2][1] = "beta";
 
#you can build arrays with more than 2 dimensions (not covered by the workshop)

hash of arrays

#! /usr/bin/perl
use strict;
use warnings;
 
my %hoa = (
          "fruits" => [ "banana", "apple", "orange" ],
          "vegetables" => [ "pepper", "lettuce", "spinach"]
          );
 
#print the array reference:
print $hoa{"fruits"}, "\n";
 
#print the entire array of the fruit category
print join (" ", @{$hoa{"fruits"}}), "\n";
 
#or you can simply do:
print "@{$hoa{fruits}}", "\n"; # the quotes around a simple key such as fruits can be omitted
 
#now let's access the second item in the vegetables category.
print $hoa{"vegetables"}[1], "\n"; #or print $hoa{"vegetables"}->[1]
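
To add items to a hash of arrays, dereference the inner array and push onto it. The sketch below assumes a smaller, hypothetical %hoa; the "nuts" category is invented for illustration. If the key does not exist yet, Perl creates the inner array on first use.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my %hoa = ( "fruits" => [ "banana", "apple" ] );

push @{$hoa{"fruits"}}, "orange";   # append to an existing inner array
push @{$hoa{"nuts"}}, "almond";     # "nuts" did not exist: its inner array is created here

print scalar @{$hoa{"fruits"}}, "\n";      # 3
print join (" ", sort keys %hoa), "\n";    # fruits nuts
```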

array of hashes

#! /usr/bin/perl
use strict;
use warnings;
 
my @aoh = (
          {
            "husband" => "adam",
            "wife" => "betty",
            "son" => "john",
          },
          {
            "husband" => "george",
            "wife" => "jane",
            "son" => "peter",
          },
          {
           "husband" => "leo",
           "wife" => "marry",
           "son" => "jeremy",
          }
          );
 
 
#print the second hash reference
print $aoh[1],"\n";
 
#print all key-values stored in the second hash
print join (" ", %{$aoh[1]}), "\n";
 
#print name of the husband of the second family
print $aoh[1]{"husband"}, "\n";
 
#add another hash (indeed a reference to the hash) to this array
push @aoh, { "husband" => "fred", "wife"=> "jen", "daughter" => "kate" };
 
 

hash of hashes

#! /usr/bin/perl
use strict;
use warnings;
 
my %hoh = (
          "fruits" => {
                     "banana" => "yellow",
                     "apple" => "red",
                     "orange" => "orange"
                     },
          "vegetables" => {
                     "pepper" => "green",
                     "lettuce" => "green",
                     "spinach" => "green"
                     }
          );
 
#print the first hash reference
print $hoh{"fruits"},"\n";
 
#print all key-value pairs of the first inner hash
print join (" ", %{$hoh{"fruits"}}), "\n";
 
#to extract the color of apple
print $hoh{"fruits"}{"apple"}, "\n";

Conditionals

if ... elsif ... else ...

#! /usr/bin/perl
use strict;
use warnings;
 
my $word = "antidisestablishmentarianism";
my $strlen = length $word;
 
if ($strlen >= 15) {
   print "'", $word, "' is a very long word!", "\n";
}
elsif (10 <= $strlen && $strlen < 15) { #<= means smaller than or equal to, && (double ampersand)
                                        #means "and", you can also do elsif (10 <= $strlen and
                                        #$strlen < 15). || means "or".
   print "'", $word, "' is a medium-length word!", "\n";
}
else {
   print "'", $word, "' is a short word!","\n";
}
 
#Perl provides a shorter "statement if condition" syntax
#which is highly recommended for short statements
print "'", $word, "' is actually enormous", "\n" if ($strlen >= 20);

unless ... else ...


#! /usr/bin/perl
use strict;
use warnings;
 
my $temperature = 20;
unless ($temperature > 30) {
  print $temperature, " degrees Celsius is not very hot!","\n";
}
else {
  print $temperature, " degrees Celsius is actually pretty hot!\n";
}
 
#This, by comparison, is highly recommended because it is so easy to read:
print "Oh no it's too cold!", "\n" unless ($temperature > 15);

Loops

while loop

#! /usr/bin/perl
use strict;
use warnings;
 
#an example:
my $counter = 20;
while ($counter > 0) {
   print $counter, "\n";
   $counter = $counter - 2;
}
 
print "done!\n";

The while loop has a condition, in our case checking whether the variable $counter is larger than 0, followed by a block of code wrapped in curly braces. When execution first reaches the while loop, it checks whether the condition is true or false. If it is FALSE, the block is skipped and the next statement, in our case printing 'done!', is executed. If the condition is TRUE, the block is executed and then the condition is evaluated again. This repeats for as long as the condition stays true; once it becomes false, the block is skipped and 'done!' is printed. In sort-of English: while (the-condition-is-true) { do-something }
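
A while loop is also handy when the number of iterations is not known in advance, for example when working through a list until it is empty. The following sketch combines while with the shift function from the array section; @queue and @done are made-up names.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my @queue = ("first", "second", "third");   # hypothetical work list
my @done;

while (scalar(@queue) > 0) {     # keep going while items are left
   my $item = shift @queue;      # take the first item off the array
   push @done, $item;
   print "processing ", $item, "\n";
}

print "done!\n";
```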

for loop

#! /usr/bin/perl
use strict;
use warnings;
 
#1. C-style for loop
# the basic syntax of a C-style for loop is
# for (initialization; condition; iterator) {
# BODY;
#}
 
#example
for (my $i = 0 ; $i <= 10 ; $i++) { ## ++ means increment by 1 each time
  print $i, "\n";
}
 
#Inside the c-style for loop, there are 3 components, separated by semicolons.
#These are: the starting statement, the continuation condition, and the iterating statement.
#the starting statement is usually just an assignment. The second statement is a continuation condition.
#This will be evaluated at the beginning of every iteration.
#The first time it evaluates to false, the loop terminates. The third statement is an iterator.
 
#count down
for (my $i = 10; $i >= 0; $i--) {
  print "$i", "\n";
}
 
#A C-style loop can be translated into the form of a standard for loop (below)
 
 
#2. standard for loop
for my $i (0..10) { #range operator
  print $i , "\n";
}
 
#count down
for my $i (reverse 0..10) {
  print $i , "\n";
}
 
#Another way to iterate
for (0..10) {
  print $_, "\n";
}
#$_ is a special variable: the default input and pattern-searching space.
#This means that in each iteration of the loop the current item is placed in $_.
#Many functions, including print, use $_ by default when no argument is given.
 
#standard for loop is equivalent to the following constructs. for and foreach can be used interchangeably
foreach my $number (0..10) {
  print $number, "\n";
}
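
When you need the position of each item as well as its value, you can loop over the indexes instead of the items. The sketch below uses $#colors (the last populated index, as introduced in the array section) together with the range operator; @colors is hypothetical example data.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my @colors = ("red", "green", "blue");   # hypothetical example data
my @lines;

for my $i (0..$#colors) {                # $#colors is 2, so $i takes the values 0, 1, 2
  push @lines, $i . ": " . $colors[$i];
}

print join ("\n", @lines), "\n";
```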
 
 

until loop

#! /usr/bin/perl
use strict;
use warnings;
 
my $num1 = 5;
# until loop execution
until( $num1 > 10 ){
  print "Value of num1: ", $num1, "\n";
  $num1++;
}
 
# An until loop statement repeatedly executes a target statement as long as a given condition is false.
 
 

do .. until loop

#! /usr/bin/perl
use strict;
use warnings;
 
my $num2 = 5;
do {
  print "Value of num2: $num2\n";
  $num2++;
   } until ($num2 > 10);
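
The difference between until and do .. until shows up when the condition is already true at the start: until checks the condition before running the block, while do .. until runs the block once before checking. A small sketch with made-up counters:

```perl
#! /usr/bin/perl
use strict;
use warnings;

my $n = 20;            # the condition ($n > 10) is true from the start

my $until_runs = 0;
until ($n > 10) {      # checked first: the block never runs
  $until_runs++;
}

my $do_runs = 0;
do {                   # the block runs once before the check
  $do_runs++;
} until ($n > 10);

print "until ran ", $until_runs, " times; do .. until ran ", $do_runs, " times\n";
```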

loop through arrays and hashes
#! /usr/bin/perl
use strict;
use warnings;
 
#one dimensional array
 
my @numbers = ("2", "4", "6", "8", "10", "12");
 
#now we want to generate a new array where each item is greater than its original by 1.
 
my @new;
foreach my $item (@numbers) {
  push @new, $item+1;
}
 
print "@new","\n";
 
 
#nested arrays
my @aoa3 = (
           ["one", "two", "three"],
           ["4", "5", "6", "7"],
           ["alpha", "beta"]
           );
 
#print each item
 
foreach my $array_ref (@aoa3) {
  foreach my $item (@{$array_ref}) {
  #in this case the @ symbol is essentially an array dereference operator.
  #It can dereference any value which is an array reference.
 
  print $item, "\n";
 }
}
 
#one dimensional hash
my %hash3 = ("blueberry" => "blue", "orange" => "orange", "cherry" => "red");
 
#print out each fruit and its color:
 
foreach my $fruit (keys %hash3) {
  print "the color of ", $fruit, " is ", $hash3{$fruit},"\n";
}
 
#to sort the fruits (keys) alphabetically
 
foreach my $fruit (sort { $a cmp $b} keys %hash3) { #the keys to be compared are passed into the sort
                                                    #subroutine as the package global variables $a and $b
 
  print "the color of ", $fruit, " is ", $hash3{$fruit},"\n";
}
 
#to sort the color (values) alphabetically
 
foreach my $fruit (sort {$hash3{$a} cmp $hash3{$b}} keys %hash3 ) {
  print "the color of ", $fruit, " is ", $hash3{$fruit},"\n";
}
 
 
#NOTE: to sort numbers, use sort {$a <=> $b}. <=> is also called the spaceship operator
 
#nested hash
my %hoh2 = (
           "fruit" => {
                      "banana" => "yellow",
                      "apple" => "red",
                      "orange" => "orange"
                      },
           "vegetables" => {
                      "pepper" => "green",
                      "lettuce" => "green",
                      "spinach" => "green"
                      }
                      );
 
#print each item (keys in the nested "hash") in each category (keys in the outer/top-level hash),
#sort the items alphabetically
 
foreach my $category (sort {$a cmp $b} keys %hoh2) {
   foreach my $item (sort {$a cmp $b} keys %{$hoh2{$category}}) { #% dereferences a hash reference
        print "the color of ", $item, " is ", $hoh2{$category}{$item},"\n";
   }
}
 
#what if you only want to print out color for vegetables. You need to add a condition using if
 
foreach my $category (sort {$a cmp $b} keys %hoh2) {
  if ($category eq "vegetables") {
    foreach my $item (sort {$a cmp $b} keys %{$hoh2{$category}}) {
       print "the color of ", $item, " is ", $hoh2{$category}{$item},"\n";
    }
  }
}
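
The note about the spaceship operator matters more than it may seem: sorting numbers with cmp compares them as strings, which gives a surprising order. A short sketch with hypothetical numbers:

```perl
#! /usr/bin/perl
use strict;
use warnings;

my @numbers = (10, 9, 100, 2);                  # hypothetical example data

my @as_strings = sort { $a cmp $b } @numbers;   # string comparison
my @as_numbers = sort { $a <=> $b } @numbers;   # numeric comparison
my @descending = sort { $b <=> $a } @numbers;   # swap $a and $b to reverse the order

print "@as_strings", "\n";   # 10 100 2 9
print "@as_numbers", "\n";   # 2 9 10 100
print "@descending", "\n";   # 100 10 9 2
```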

User-defined subroutines

Perl has many built-in functions such as sort, split, shift, pop, etc. Perl also allows users to define their own functions, called subroutines. Building subroutines is the simplest way to reuse code.

Subroutines are declared using the sub keyword. In contrast with built-in functions, user-defined subroutines always accept the same input: a list of scalars. If you pass an array or a hash directly, it is flattened into that list and its structure is lost, so pass an array reference or a hash reference instead. Inside the subroutine you can dereference it to get the actual array or hash. Subroutines should be invoked using parentheses, even when called with no arguments. This makes it clear that a subroutine call is happening.

When you call a subroutine you can pass any number of arguments (scalars) to that subroutine, and the values will be placed in the local array @_ .

#! /usr/bin/perl
use strict;
use warnings;
 
#example: you want to write a simple function that adds two values together to get the total
my $a1 = 15;
my $a2 = 20;
my $all = getsum ($a1, $a2); #$a1 and $a2 are stored in @_;
print "The sum of ", $a1, " and ", $a2, " is ", $all, "\n";
 
sub getsum { #start the subroutine, enclosed using {}
  my ($x, $y) = @_; #create two scalar variables to get the values from @_;
  my $sum = $x + $y; #a variable declared with my is visible only
                     #within the block in which it is declared.
  return ($sum);
}
#Another example: we want to print the sum of all elements in an array
#! /usr/bin/perl
use strict;
use warnings;
my @aon = ("15", "20", "2", "9" );
my $soa = getsum2 (\@aon); #If you put a \ in front of a variable, you get a reference to that variable.
                           #Then you can pass the reference to the subroutine.
print "The sum of all elements is ", $soa, "\n";
 
sub getsum2 {
 my ($x) = @_;
 my @array = @{$x};
 my $total = 0; #start the running total at zero
 foreach (@array) {
   $total += $_;
 }
 return ($total);
}
#now let’s write a script to convert meters to feet (1 meter = 3.28084 feet)
#and feet to meters (1 foot = 0.3048 meters) by taking input from the command line (shell)
 
#! /usr/bin/perl
use warnings;
use strict;
 
die "
Usage: perl converter.pl <number> <unit>
number:  provide a number to convert
unit:    m(meters) or f(feet)?
 
examples:
1. to convert 2 meters to feet
perl converter.pl 2 m
 
2. to convert 2 feet to meters
perl converter.pl 2 f
 
"
unless (scalar @ARGV == 2); #@ARGV is a perl special variable that contains the arguments
                            #given to the program, as ordered by the shell.
 
 
my $result = convert ($ARGV[0], $ARGV[1]);
if ($ARGV[1] eq "m") {
  print $ARGV[0], " meters equals ", $result, " feet.", "\n";
}
if ($ARGV[1] eq "f") {
  print $ARGV[0], " feet equals ", $result, " meters.", "\n";
}
 
sub convert {
  my ($num, $t) = @_;
  my $out;
  if ($t eq "f") {
    $out = $num * 0.3048;
  }
  if ($t eq "m") {
    $out = $num * 3.28084;
  }
  return ($out);
}
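
The examples above pass plain scalars and an array reference. A hash can be passed the same way, as a hash reference. The following is a sketch only: the subroutine name most_expensive and the %price data are invented for illustration.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my %price = ("apple" => 3, "banana" => 1, "cherry" => 8);   # hypothetical data
my $expensive = most_expensive (\%price);    # pass a reference to the hash
print "The most expensive item is ", $expensive, "\n";

sub most_expensive {
  my ($ref) = @_;        # the hash reference arrives as a single scalar
  my %h = %{$ref};       # dereference it to get the actual hash
  my $best;
  foreach my $item (sort { $a cmp $b } keys %h) {
    $best = $item if (!defined $best || $h{$item} > $h{$best});
  }
  return ($best);
}
```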


Regular expression

Perl's text processing power comes from its use of regular expressions. A regular expression (regex or regexp) is a string of characters that describes the pattern or patterns you are looking for. Regular expressions are often used in conditionals.

#! /usr/bin/perl
use strict;
use warnings;
 
 
#a simple example:
 
my $text1 = "Chatfield";
print "Found a hat!","\n" if ($text1 =~ m/hat/);

The match operator (m//, abbreviated //) identifies a regular expression, in this example hat. This pattern is not a word. Instead it means "the h character, followed by the a character, followed by the t character." Each character in the pattern is an indivisible element, or atom. It matches or it doesn't.

The regex binding operator (=~) is an infix operator which applies the regex of its second operand to a string provided by its first operand. When evaluated in scalar context, a match evaluates to a true value if it succeeds. The negated form of the binding operator (!~) evaluates to a true value unless the match succeeds.
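
To illustrate the two binding operators side by side, the sketch below sorts a few words into those that contain hat and those that do not. The word list and the array names are made up for this example.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my @words = ("Chatfield", "Berkeley", "that");   # hypothetical example data
my (@with_hat, @without_hat);

foreach my $w (@words) {
   push @with_hat, $w    if ($w =~ m/hat/);   # true when the match succeeds
   push @without_hat, $w if ($w !~ m/hat/);   # true when the match fails
}

print "with hat: @with_hat", "\n";         # with hat: Chatfield that
print "without hat: @without_hat", "\n";   # without hat: Berkeley
```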


Some commonly used special characters in regex


\n # A newline

\t # A tab

\w # Any word character (a letter, a digit, or “_”).

\W # Any non-word character.

\d # Any digit. The same as [0-9]

\D # Any non-digit. The same as [^0-9]

\s # Any whitespace character: space,tab,newline, etc

\S # Any non-whitespace character

\b # A word boundary

\B # No word boundary

. # Any single character except a newline

^ # The beginning of the line or string

$ # The end of the line or string

* # Zero or more of the last character

+ # One or more of the last character

? # Zero or one of the last character

[] # A character class: matches any one of the characters inside the brackets

| # Alternation: matches the pattern on either side of the bar

\ # Escape character
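
The special characters listed above can be tried out with a short script. This is only a sketch: the test strings and variable names are invented for the illustration.

```perl
#! /usr/bin/perl
use strict;
use warnings;

#hypothetical line: a sample name, a tab, a number and a word
my $line = "sample_01\t42 reads";

print "starts with a word", "\n"      if ($line =~ m/^\w+/);     #^ anchors at the start
print "contains a tab", "\n"          if ($line =~ m/\t/);
print "has a standalone number", "\n" if ($line =~ m/\b\d+\b/);  #digits between word boundaries
print "ends with reads", "\n"         if ($line =~ m/reads$/);   #$ anchors at the end

#? makes the preceding character optional
my @colors = grep { m/^colou?r$/ } ("color", "colour", "colouur");

#[] matches any one of the listed characters, + repeats it one or more times
my @dna = grep { m/^[ACGT]+$/ } ("ACGT", "ACGU", "");

#| matches either alternative
my @pets = grep { m/cat|dog/ } ("cat", "bird", "hotdog");

print join (",", @colors), "\n";
print join (",", @dna), "\n";
print join (",", @pets), "\n";
```

Here grep keeps only the list elements for which the match succeeds, which makes it easy to see what each pattern accepts and rejects.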

Clearly characters like $, |, [, ), \, / and so on have special meanings in regular expressions. If you want to match one of them literally then you have to precede it with a backslash. So:

\| # Vertical bar

\[ # An open square bracket

\) # A closing parenthesis

\* # An asterisk

\^ # A caret symbol

\/ # A slash

\\ # A backslash
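
To see the difference an escape makes, here is a small sketch. All strings and variable names are invented for the illustration; each result is stored as 1 (match) or 0 (no match).

```perl
#! /usr/bin/perl
use strict;
use warnings;

my $expr = 'a+b=c'; #single quotes: nothing is interpolated

#unescaped, + is a quantifier: "one or more a, followed by b"
my $as_quantifier = ($expr =~ m/a+b/) ? 1 : 0;

#escaped, \+ matches a literal plus sign
my $as_literal = ($expr =~ m/a\+b/) ? 1 : 0;

#\Q...\E escapes every special character inside an interpolated variable
my $needle = 'a+b';
my $quoted = ($expr =~ m/\Q$needle\E/) ? 1 : 0;

#a slash must be escaped because it is also the delimiter of m//
my $path = "/usr/bin/perl";
my $has_bin = ($path =~ m/^\/usr\/bin\//) ? 1 : 0;

print $as_quantifier, $as_literal, $quoted, $has_bin, "\n";
```

The \Q...\E form is convenient when the text to search for comes from a variable and may contain special characters.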

#! /usr/bin/perl
use strict;
use warnings;
 
#another example:
my $str = "Usage:524/1000 messages; Usage:666/1000 messages";
if ( $str =~ m/^Usage:(\d+)/) {
   my $used = $1;
   print "The first user used ", $used, " messages!","\n";
}
 
#Parentheses perform sub-matches. After a successful match operation is performed,
#the sub-matches are stored in the built-in variables $1, $2, $3, ...:
 
my $text2 = "Hello world";
if ($text2 =~ m/(\w+)\s+(\w+)/) {
   print "success!","\n";
   print $1, "\t";
   print $2, "\n";
}
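
Besides $1, $2, ..., the sub-matches can also be collected directly by evaluating the match in list context. The following is a minimal sketch that reuses the usage string from the example above; the variable names $used, $limit and @all_used are made up here.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my $str = "Usage:524/1000 messages; Usage:666/1000 messages";

#in list context a successful match returns the captured sub-matches
my ($used, $limit) = ($str =~ m/^Usage:(\d+)\/(\d+)/);
print "used ", $used, " out of ", $limit, "\n";

#with the g flag, list context returns the captures of every match in the string
my @all_used = ($str =~ m/Usage:(\d+)/g);
print "all users: ", join (", ", @all_used), "\n";
```

Note that the second pattern has no ^ anchor, otherwise only the first occurrence could match.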

Substitution operations are performed using =~ s/A/B/g. The first operand (A) is a regular expression that is matched against the string on the left of the regex binding operator. The second operand (B) is the text that replaces each matched portion of that string.
#! /usr/bin/perl
use strict;
use warnings;
 
#I want to replace all "o" and "e" in "Hello world" with "r".
my $text3 = "Hello world";
$text3 =~ s/[oe]/r/g;
print $text3, "\n";
 
#In this case the g flag makes s/// perform a global search/replace:
#every match is replaced, not just the first one
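
The effect of the g flag, and the use of sub-matches in the replacement, can be sketched as follows. The variable names are invented for this illustration.

```perl
#! /usr/bin/perl
use strict;
use warnings;

#without the g flag only the first match is replaced
my $once = "Hello world";
$once =~ s/o/0/;

#with the g flag every match is replaced; s/// returns the number of replacements
my $all = "Hello world";
my $n = ($all =~ s/o/0/g);

#sub-matches captured by parentheses can be used in the replacement
my $swapped = "Hello world";
$swapped =~ s/(\w+)\s+(\w+)/$2 $1/;

print $once, "\n";
print $all, " (", $n, " replacements)", "\n";
print $swapped, "\n";
```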
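
tr/ABC/abc/ means transliteration: each character in the first list is replaced by the character at the same position in the second list. It is not a regular expression operator. It is suitable (and faster than s///) for substitutions of one single character with another single character.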
#! /usr/bin/perl
use strict;
use warnings;
 
my $text4 = "a1ab2c3";
$text4 =~ tr/abc/123/;
print $text4, "\n";
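
tr/// also returns the number of characters it matched, which makes it handy for counting. As a sketch (the sequence and variable names are invented), the GC content of a DNA string could be computed like this:

```perl
#! /usr/bin/perl
use strict;
use warnings;

my $seq = "ATGCGCTA"; #hypothetical DNA sequence

#with an empty replacement list tr counts the characters without changing the string
my $gc = ($seq =~ tr/GCgc//);
my $percent = 100 * $gc / length ($seq);
print "GC content: ", $percent, "%", "\n";

#ranges are allowed: convert lower case to upper case
my $upper = "acgtn";
$upper =~ tr/a-z/A-Z/;
print $upper, "\n";
```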

Perl file handling: open, read, write and close files

The basics of handling files in Perl are simple: you associate a filehandle with a file and then use a variety of operators and functions within Perl to read and update the data stream associated with the filehandle. In other words, a filehandle is essentially a reference to a specific location inside a specific file. Whether you can read from a filehandle, write to it, or both depends on the mode you give to the open function.

A file handle can be represented by a scalar variable.
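
As a minimal sketch of the most common open modes, the script below writes to, appends to and then reads from a temporary file. It assumes the core module File::Temp is available; the file content and variable names are invented for the illustration.

```perl
#! /usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

#create a temporary file so the example does not touch real data
my ($tmp, $path) = tempfile (UNLINK => 1);
close $tmp;

open (my $out, ">", $path) || die "can not open $path for writing: $!\n";
print $out "first line", "\n"; #">" creates the file or overwrites its content
close $out;

open (my $add, ">>", $path) || die "can not open $path for appending: $!\n";
print $add "second line", "\n"; #">>" appends to the end of the file
close $add;

open (my $in, "<", $path) || die "can not open $path for reading: $!\n";
my @lines = <$in>; #"<" opens the file for reading only
close $in;

print "the file now has ", scalar @lines, " lines", "\n";
```

The special variable $! holds the reason why open failed, which is useful to include in the error message.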

Please download the example fastq file "data.fastq" from the CGRL wiki.

In a terminal you can run:
wget http://cgrlucb.wikispaces.com/file/view/data.fastq
or
curl -O http://cgrlucb.wikispaces.com/file/view/data.fastq

Read a file:
#! /usr/bin/perl
use strict;
use warnings;
 
#let's read a file saved on disk and print it line by line
my $fastq = "/Users/kebi/Desktop/PerlWorkshop/data.fastq";
open (my $fh, "<", $fastq) || die "can not open $fastq!\n";
#open the file and associate it with a filehandle.
#open means open a channel for your program to "talk to" the file.
#For this Perl provides the open function. "<" means read in and ">" means write out.
 
while (<$fh>) { #Iterate over each line in the filehandle.
                #Note that in a while loop the <$fh> (angle brackets) expression
                #reads one line at a time into $_. It does not read the whole
                #file into memory in one go.
   chomp (my $line = $_); #saving the current line to the scalar variable
                          #$line and remove the ending newline character
   print $line , "\n";
 
}
 
close $fh; #close a filehandle, and therefore disassociate the filehandle from the corresponding file.
           #Until a filehandle opened for writing is closed, it is possible that some data has not yet
           #been written to disk. Other applications will not see that data yet. Closing a filehandle
           #releases the filehandle resource. Furthermore, closing a filehandle improves code readability.
           #It tells future readers "I'm done with that". Although in many cases leaving a
           #filehandle open is fine, it is generally considered good practice to close the
           #filehandle once you are done working with it.
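
How much <$fh> reads depends on the context in which it is used. The sketch below illustrates this with an in-memory filehandle (opened on a reference to a string) so that it runs without any file on disk; the text and variable names are invented for the illustration.

```perl
#! /usr/bin/perl
use strict;
use warnings;

my $text = "line1\nline2\nline3\n";

#opening a reference to a string gives a filehandle that reads from that string
open (my $fh, "<", \$text) || die "can not open in-memory file: $!\n";

my $first = <$fh>; #scalar context: read a single line
my @rest = <$fh>;  #list context: read all the remaining lines at once
close $fh;

chomp $first;
chomp @rest;       #chomp also works on every element of an array

print "first: ", $first, "\n";
print "rest: ", join (",", @rest), "\n";
```

For large files such as fastq data, reading line by line in a while loop is usually preferable because only one line is held in memory at a time.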

Read and write files:
#! /usr/bin/perl
use strict;
use warnings;
 
#let's read a file saved on disk and print it line by line into an outfile
 
my $fastq = "/Users/kebi/Desktop/PerlWorkshop/data.fastq";
my $outfile = "/Users/kebi/Desktop/PerlWorkshop/data.fastq_copy";
 
open (my $fh, "<", $fastq) || die "can not open $fastq!\n";
open (my $out, ">", $outfile) || die "can not open $outfile!\n"; #open for output, link a filehandle $out to the outfile $outfile
 
while (<$fh>) {
   chomp (my $line = $_);
   print $out $line , "\n"; #print this line (by adding a newline character)
                            #into the outfile filehandle
}
close $fh;
close $out;

Now let's write a script that converts this fastq file to fasta format:
#! /usr/bin/perl
use strict;
use warnings;
 
my $fastq = "/Users/kebi/Desktop/PerlWorkshop/data.fastq";
my $fasta = "/Users/kebi/Desktop/PerlWorkshop/data.fasta";
 
open (my $fh, "<", $fastq) || die "can not open $fastq!\n";
open (my $out, ">", $fasta) || die "can not open $fasta!\n";
 
while (<$fh>) {
  chomp (my $line = $_);
  if ($line =~ m/^\@(\S+\/1)$/) { #a header line: starts with @ and ends with /1
    chomp (my $seq = <$fh>);
    print $out ">", $1, "\n"; #print the header
    print $out $seq, "\n";
  }
}
close $fh;
close $out;
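
The script above recognizes a header by its pattern. Since a quality line may also start with @, an alternative is to rely on the four-line layout of each fastq record. The following is only a sketch under the assumption that every record occupies exactly four lines (header, sequence, +, quality); the two reads are invented, and an in-memory filehandle is used so the sketch runs without data.fastq.

```perl
#! /usr/bin/perl
use strict;
use warnings;

#two made-up reads; the second quality line starts with @ on purpose
my $fastq_text = <<'END';
@read1/1
ACGT
+
IIII
@read2/1
GGCC
+
@III
END

open (my $fh, "<", \$fastq_text) || die "can not open in-memory fastq: $!\n";

my $fasta_text = "";
while (my $header = <$fh>) {
  my $seq = <$fh>;  #line 2: the sequence
  my $plus = <$fh>; #line 3: the + separator (ignored)
  my $qual = <$fh>; #line 4: the quality string (ignored)
  last unless (defined $qual); #stop at an incomplete record
  chomp ($header, $seq);
  if ($header =~ m/^\@(\S+)/) {
    $fasta_text .= ">" . $1 . "\n" . $seq . "\n";
  }
}
close $fh;

print $fasta_text;
```

Because the lines are consumed four at a time, the quality line "@III" is never mistaken for a header.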

Now let's write a script to reverse complement the DNA sequences in a fasta file, reading the file name from the command line:
hint: use @ARGV to get the arguments passed from the shell
#! /usr/bin/perl
use warnings;
use strict;
 
die (qq/
 
Usage: perl revcomp.pl <fasta_file>
fasta_file: provide a sequence file in fasta format
 
\n/) unless (scalar @ARGV == 1);
#qq can be used instead of double quotes
 
my $fasta = $ARGV[0];
my $revcomp_fasta = $fasta. "_revcomp";
 
open (my $fh, "<", $fasta) || die "can not open $fasta!\n";
open (my $out, ">", $revcomp_fasta) || die "can not open $revcomp_fasta!\n";
 
while (<$fh>) {
  chomp (my $line = $_);
  if ($line =~ m /^>\S+/) {
    chomp (my $seq = <$fh>);
    my $revcomp = reverse $seq; #reverse the DNA sequence
    $revcomp =~ tr/ACGTacgt/TGCAtgca/; #replace each nucleotide with its complement
    print $out $line, "\n"; #print the header
    print $out $revcomp, "\n";
  }
}
close $fh;
close $out;
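
The script above expects each sequence on a single line. As a sketch of one possible extension, the reverse complement step can be wrapped in a subroutine and applied to records whose sequence spans several lines. The subroutine name revcomp, the example records and the in-memory filehandle are assumptions made for this illustration.

```perl
#! /usr/bin/perl
use strict;
use warnings;

#return the reverse complement of a DNA string
sub revcomp {
  my ($seq) = @_;
  my $rc = reverse $seq;          #scalar context: reverse the characters
  $rc =~ tr/ACGTacgt/TGCAtgca/;   #replace each nucleotide with its complement
  return ($rc);
}

#made-up fasta data; the first sequence is split over two lines
my $fasta_text = ">seq1\nAAAC\nGG\n>seq2\nTTGA\n";
open (my $fh, "<", \$fasta_text) || die "can not open in-memory fasta: $!\n";

my $result = "";
my ($header, $seq) = ("", "");
while (<$fh>) {
  chomp (my $line = $_);
  if ($line =~ m/^>/) {
    #a new header: write out the previous record, if there is one
    $result .= $header . "\n" . revcomp ($seq) . "\n" if ($header ne "");
    ($header, $seq) = ($line, "");
  }
  else {
    $seq .= $line; #join the sequence lines of the current record
  }
}
#do not forget the last record in the file
$result .= $header . "\n" . revcomp ($seq) . "\n" if ($header ne "");
close $fh;

print $result;
```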

System calls

In a Perl script, you can call any external program or other script as you would from the command line by using a system call. We use the function system.
Now let's write another simple script called "system_call.pl" to call revcomp.pl:
#! /usr/bin/perl
use warnings;
use strict;
 
die (qq/
Usage: perl system_call.pl <fasta_file>
fasta_file:  provide a sequence file in fasta format!
\n/) unless (@ARGV);
 
system ("perl revcomp.pl $ARGV[0]");
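
system returns the status of the command it ran, so a script can check whether the call succeeded. The following is a minimal sketch: instead of revcomp.pl it runs two one-line Perl programs through $^X (the path of the running perl interpreter) as stand-ins for a real external command, and the variable names are made up here.

```perl
#! /usr/bin/perl
use strict;
use warnings;

#passing a list to system runs the command directly, without going through the shell
my $ok_status = system ($^X, "-e", "exit 0");  #a command that succeeds
my $bad_status = system ($^X, "-e", "exit 3"); #a command that fails with exit code 3

#system returns the wait status; shifting it right by 8 bits gives the exit code
my $exit_code = $bad_status >> 8;

if ($ok_status == 0) {
  print "first command succeeded", "\n";
}
if ($bad_status != 0) {
  print "second command failed with exit code ", $exit_code, "\n";
}
```

A return value of 0 means success, so a common idiom is to die when system returns anything other than 0.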