Thursday, May 1, 2008

Perl: Regular Expressions

Regular Expressions in Perl are used for
String comparisons: $string =~ m/text/;
String selections : $string =~ m/whatever(sought_text)whatever2/; $soughtText = $1;
String replacements: $string =~ s/originaltext/newtext/;
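
As a quick illustration of these three uses, here is a minimal sketch. The sample string and the variable names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $string = "order id=12345 shipped";

# Comparison: does the string contain "shipped"?
my $is_shipped = ($string =~ m/shipped/) ? 1 : 0;

# Selection: capture the digits after "id=" into $1.
my $order_id;
if ($string =~ m/id=(\d+)/) {
    $order_id = $1;
}

# Replacement: work on a copy so the original string is kept.
(my $updated = $string) =~ s/shipped/delivered/;

print "shipped: $is_shipped, id: $order_id, updated: $updated\n";
```

Copying into $updated before substituting is just a convenience here; applying s/// directly to $string would change it in place.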

Wildcards
.: Match any character except newline
\w: Match "word" character (alphanumeric plus "_")
\W: Match non-word character
\s: Match whitespace character
\S: Match non-whitespace character
\d: Match digit character
\D: Match non-digit character
\t: Match tab
\n: Match newline
\r: Match carriage return
\f: Match formfeed
\a: Match alarm (bell, beep, etc)
\e: Match escape
\021: Match octal char (in this case, 21 octal)
\xf0: Match hex char (in this case, f0 hexadecimal)
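
To see a few of these wildcards in action, here is a small sketch. The sample text is invented, and each result is stored as 1 or 0 so it is easy to print.

```perl
use strict;
use warnings;

# Made-up sample line: a tab separates "42" and "is".
my $line = "Room 42\tis ready\n";

my $has_digit    = $line =~ /\d/     ? 1 : 0;  # "4" is a digit
my $has_tab      = $line =~ /\t/     ? 1 : 0;  # the tab after "42"
my $has_nonword  = $line =~ /\W/     ? 1 : 0;  # the space is a non-word character
my $word_gap     = $line =~ /\w\s\w/ ? 1 : 0;  # word char, whitespace, word char
my $dot_newline  = "\n"  =~ /./      ? 1 : 0;  # "." does not match a newline by default
my $hex_is_A     = "A"   =~ /\x41/   ? 1 : 0;  # 41 hex is "A"
my $octal_is_tab = "\t"  =~ /\011/   ? 1 : 0;  # 011 octal is the tab character

print "digit=$has_digit tab=$has_tab nonword=$has_nonword gap=$word_gap ",
      "dot_newline=$dot_newline hex=$hex_is_A octal=$octal_is_tab\n";
```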

Repetitions
*: Match 0 or more times
+: Match 1 or more times
?: Match 1 or 0 times
{n}: Match exactly n times
{n,}: Match at least n times
{n,m}: Match at least n but not more than m times
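
The repetition counts are easiest to compare side by side. This sketch runs each quantifier against the same invented list of strings; the ^ and $ anchors make the pattern cover the whole string.

```perl
use strict;
use warnings;

# Invented test strings: zero to four "a" characters followed by "b".
my @candidates = ("b", "ab", "aab", "aaab", "aaaab");

my @star     = grep { /^a*b$/ }     @candidates;  # 0 or more "a"
my @plus     = grep { /^a+b$/ }     @candidates;  # 1 or more "a"
my @optional = grep { /^a?b$/ }     @candidates;  # 0 or 1 "a"
my @exactly2 = grep { /^a{2}b$/ }   @candidates;  # exactly 2 "a"
my @atleast2 = grep { /^a{2,}b$/ }  @candidates;  # 2 or more "a"
my @two_to_3 = grep { /^a{2,3}b$/ } @candidates;  # 2 or 3 "a"

print "*     : @star\n";
print "+     : @plus\n";
print "?     : @optional\n";
print "{2}   : @exactly2\n";
print "{2,}  : @atleast2\n";
print "{2,3} : @two_to_3\n";
```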

Using Groups ( ) in Matching
Groups are regular expression characters surrounded by parentheses. They have two major uses:
- To allow alternative phrases as in /(Clinton|Bush|Reagan)/i. Note that for single character alternatives, you can also use character classes.
- As a means of retrieving selected text in selection and substitution, used with the $1, $2, etc. scalars.
This section discusses only the first use, along with some examples.

Detect all strings containing vowels
if($string =~ m/(A|E|I|O|U|Y|a|e|i|o|u|y)/) {print "String contains a vowel!\n"}

Detect if the line starts with any of the last three presidents
if($string =~ m/^(Clinton|Bush|Reagan)/i) {print "$string\n"};
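
Here is a short sketch of alternation over several lines at once. The sample lines are made up; note that the parentheses also record which alternative matched in $1.

```perl
use strict;
use warnings;

# Made-up sample lines for illustration.
my @lines = (
    "Clinton signed the bill",
    "bush spoke today",
    "The senator met Reagan",
);

# Keep only lines that START with one of the alternatives, ignoring case.
my @starts_with_name = grep { /^(Clinton|Bush|Reagan)/i } @lines;

# Without the ^ anchor the name may appear anywhere; $1 holds the one found.
my $which;
if ($lines[2] =~ /(Clinton|Bush|Reagan)/i) {
    $which = $1;
}

print "$_\n" for @starts_with_name;
print "third line mentions: $which\n";
```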

Using Character Classes [ ]
Character classes are alternative single characters within square brackets.
Character classes have three main advantages:
- Shorthand notation, as [AEIOUY] instead of (A|E|I|O|U|Y). This advantage is minor at best.
- Character Ranges, such as [A-Z].
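- One to one mapping from one set of characters to another, as in tr/a-z/A-Z/. Note that tr takes plain character lists and ranges, so the square brackets are not needed there.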

A caret (^) immediately following the opening square bracket means "Anything but these characters", and effectively negates the character class.
To match anything that is not a vowel
if($string =~ /[^AEIOUYaeiouy]/){print "This string contains a non-vowel"}

To detect a string that contains no vowels at all
if($string !~ /[AEIOUYaeiouy]/){print "This string contains no vowels at all"}
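
The following sketch puts ranges and negated classes together. The sample values are hypothetical, and "y" is treated as a vowel to stay consistent with the examples above.

```perl
use strict;
use warnings;

my $code = "AB-1234";   # hypothetical sample value

my $starts_upper  = $code =~ /^[A-Z]/          ? 1 : 0;  # range: first char is A to Z
my $has_other     = $code =~ /[^A-Za-z0-9]/    ? 1 : 0;  # negated class: finds the "-"
my $brr_no_vowels = "brr" !~ /[AEIOUYaeiouy]/  ? 1 : 0;  # no vowel anywhere in "brr"
my $sky_no_vowels = "sky" !~ /[AEIOUYaeiouy]/  ? 1 : 0;  # "y" counts as a vowel here

print "starts_upper=$starts_upper has_other=$has_other ",
      "brr=$brr_no_vowels sky=$sky_no_vowels\n";
```

The difference between the two tests is worth keeping in mind: /[^...]/ asks whether at least one character is outside the class, while !~ /[...]/ asks whether no character is inside it.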

Matching
Print everyone whose last name is Clinton, Bush or Reagan
if($string =~ m/^\S+\s+(Clinton|Bush|Reagan)/i) {print "$string\n"};
Each element of the list is assumed to be a first name, whitespace, a last name, and possibly more whitespace and other info after the last name

Print every line with a valid phone number
if($string =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/) {print "Phone line: $string\n"};
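
To show what the phone number pattern accepts and rejects, here is a small sketch with invented sample lines. The pattern itself is copied from the example above; only the test data is new.

```perl
use strict;
use warnings;

# Pattern from the text: ")", whitespace or "-" before the number,
# and whitespace or punctuation after it.
my $phone_re = qr/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/;

# Invented sample lines.
my @samples = (
    "Call (555) 123-4567 today",   # matches
    "Is it 123-4567?",             # matches: "?" follows the number
    "Call 123-4567",               # no match: nothing follows the number
    "123-4567 is my number",       # no match: nothing precedes the number
);

my @results = map { $_ =~ $phone_re ? 1 : 0 } @samples;
print "$samples[$_] => $results[$_]\n" for 0 .. $#samples;
```

Because the pattern requires a character on each side of the number, a number at the very start of the string, or at the very end with no trailing newline, is missed. Lines read from a file usually still carry their newline, which satisfies the trailing \s.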

Substitutions
Replace every "Bill Clinton" with "Al Gore" (the g modifier replaces all occurrences, not just the first)
$string =~ s/Bill Clinton/Al Gore/g;
Now do it ignoring case, so that bIlL ClInToN is replaced too
$string =~ s/Bill Clinton/Al Gore/gi;
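
The effect of the g and i modifiers on a substitution is easy to check with a short sketch. The sample sentence is made up; s/// returns the number of replacements it made.

```perl
use strict;
use warnings;

# Made-up sample text with two differently cased occurrences.
my $text = "Bill Clinton met bill clinton.";

# Without g, only the first match is replaced.
(my $first_only = $text) =~ s/Bill Clinton/Al Gore/;

# With g and i, every occurrence is replaced regardless of case.
(my $all = $text) =~ s/Bill Clinton/Al Gore/gi;

# In scalar context, s/// returns how many replacements were made.
my $copy  = $text;
my $count = ($copy =~ s/Bill Clinton/Al Gore/gi);

print "$first_only\n$all\nreplacements: $count\n";
```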

Translations
Translations are like substitutions, except they happen on a letter by letter basis instead of substituting a single phrase for another single phrase.

Make all vowels upper case
$string =~ tr/aeiouy/AEIOUY/;

Change all lower case letters to upper case
$string =~ tr/a-z/A-Z/;

Change all vowels to numbers to avoid "4 letter words" in a serial number
$string =~ tr/AEIOUY/123456/;
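
Here is a minimal sketch of translations using plain character lists, which is the form tr expects (square brackets and commas inside tr are treated as ordinary characters). The serial number is an invented value.

```perl
use strict;
use warnings;

# Upper-case only the vowels.
(my $upper_vowels = "mississippi") =~ tr/aeiouy/AEIOUY/;

# Upper-case every lower case ASCII letter.
(my $shout = "hello") =~ tr/a-z/A-Z/;

# Map each vowel to a digit: A=1, E=2, I=3, O=4, U=5, Y=6.
my $serial = "BEAD-YOU";   # hypothetical serial number
(my $masked = $serial) =~ tr/AEIOUY/123456/;

# With an empty replacement list, tr only counts the matching characters.
my $vowel_count = ($serial =~ tr/AEIOUY//);

print "$upper_vowels $shout $masked vowels=$vowel_count\n";
```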

Greedy and Ungreedy Matching
Perl regular expressions normally match the longest string possible. For instance:
my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";

Run the preceding code, and here's what you get
ississ

It matches the first i, the last s, and everything in between them. But what if you want to match the first i to the s most closely following it? Use this code:
my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";
Now look what the code produces:
is
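
The same difference shows up whenever a delimiter occurs more than once. This sketch uses an invented HTML-like string to compare the greedy and ungreedy forms.

```perl
use strict;
use warnings;

# Invented sample text with several angle-bracket pairs.
my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the LAST ">" in the string.
my ($greedy) = $html =~ /<(.+)>/;

# Ungreedy: .+? stops at the FIRST ">" it can.
my ($lazy) = $html =~ /<(.+?)>/;

# Combined with g in list context, the ungreedy form collects every tag.
my @all_tags = $html =~ /<(.+?)>/g;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "tags:   @all_tags\n";
```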

Symbol Explanations
=~
This operator appears between the string var you are comparing, and the regular expression you're looking for (note that in selection or substitution a regular expression operates on the string var rather than comparing). Here's a simple example:
$string =~ m/Bill Clinton/; #return true if var $string contains the name of the president
$string =~ s/Bill Clinton/Al Gore/; #replace the president with the vice president

!~
Just like =~, except negated. With matching, it returns true if the string DOESN'T match. With substitutions and translations the operation is still carried out, and the return value is logically negated, which is rarely useful.
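
A common use of !~ is filtering out lines that match a pattern. The sketch below uses made-up lines, and also shows that a substitution applied with !~ still changes the string.

```perl
use strict;
use warnings;

# Made-up input lines.
my @lines = ("# a comment", "print 1;", "# another comment", "exit;");

# Keep only the lines that do NOT start with "#".
my @code_lines = grep { $_ !~ /^#/ } @lines;

# With s///, the substitution still happens; only the return value is negated.
my $s = "abc";
my $negated = ($s !~ s/b/X/) ? 1 : 0;   # 0, because a substitution was made

print "code: @code_lines\n";
print "s is now $s, negated result is $negated\n";
```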

/
This is the usual delimiter for the text part of a regular expression. If the sought-after text contains slashes, it's sometimes easier to use pipe symbols (|) for delimiters, but this is rare. Here are simple examples:
$string =~ m/Bill Clinton/; #return true if var $string contains the name of the president
$string =~ s/Bill Clinton/Al Gore/; #replace the president with the vice president
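
When the sought-after text is full of slashes, alternative delimiters save a lot of backslashes. Here is a minimal sketch with an invented file path; all three match forms are equivalent.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";   # invented sample path

# Slash delimiters: every slash in the pattern must be escaped.
my $with_slashes = $path =~ m/^\/usr\/local\// ? 1 : 0;

# Pipe delimiters: no escaping needed (the m is required here).
my $with_pipes = $path =~ m|^/usr/local/| ? 1 : 0;

# Brace delimiters are another common choice.
my $with_braces = $path =~ m{^/usr/local/} ? 1 : 0;

# Substitution accepts alternative delimiters too.
(my $moved = $path) =~ s|/usr/local|/opt|;

print "$with_slashes $with_pipes $with_braces $moved\n";
```

One trade-off with pipes: inside m|...| an unescaped | ends the pattern, so alternation such as (Clinton|Bush) is awkward there. Braces avoid that problem.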

m
The match operator. Placed before the opening delimiter, it means: read the string expression on the left of the =~, and see if any part of it matches the expression within the delimiters following the m. Note that if the delimiters are slashes (which is the normal state of affairs), the m is optional and often not included. Whether it's there or not, it's still a match operation. Here are some examples:
$string =~ m/Bill Clinton/; #return true if var $string contains the name of the president
$string =~ /Bill Clinton/; #same result as previous statement

^
This is the "beginning of line" symbol. When used at the start of the pattern, it signifies "at the beginning of the string" (or at the beginning of each line, if the m modifier follows the closing delimiter). For instance:
$string =~ m/^Bill Clinton/; #true only when "Bill Clinton" is the first text in the string

$
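This is the "end of line" symbol. When used at the end of the pattern, it signifies "at the end of the string", allowing for a single trailing newline (or at the end of each line, if the m modifier follows the closing delimiter). For instance: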
$string =~ m/Bill Clinton$/; #true only when "Bill Clinton" is the last text in the string
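
Here is a small sketch of how the two anchors behave, including the effect of the m modifier on multi-line text. The sample strings are invented.

```perl
use strict;
use warnings;

# "^" anchors to the start of the string.
my $at_start = "Bill Clinton spoke"     =~ /^Bill Clinton/ ? 1 : 0;  # 1
my $mid_text = "President Bill Clinton" =~ /^Bill Clinton/ ? 1 : 0;  # 0

# "$" anchors to the end, and tolerates one trailing newline.
my $at_end       = "President Bill Clinton"   =~ /Bill Clinton$/ ? 1 : 0;  # 1
my $with_newline = "President Bill Clinton\n" =~ /Bill Clinton$/ ? 1 : 0;  # 1

# In multi-line text, "^" matches after an embedded newline only with the m modifier.
my $text = "first line\nBill Clinton was here\n";
my $second_line_plain = $text =~ /^Bill/  ? 1 : 0;  # 0
my $second_line_multi = $text =~ /^Bill/m ? 1 : 0;  # 1

print "$at_start $mid_text $at_end $with_newline ",
      "$second_line_plain $second_line_multi\n";
```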

i
This is the "case insensitivity" modifier when used immediately after the closing delimiter. For instance:
$string =~ m/Bill Clinton/i; #true when $string contains "Bill Clinton" or "BilL ClInToN"
