7.4 More on the Matching Operator

We have already looked at the simplest uses of the matching operator (a regular expression enclosed in slashes). Now let's look at a zillion ways to make this operator do something slightly different.

7.4.1 Selecting a Different Target (the =~ Operator)

Usually the string you'll want to match your pattern against is not within the $_ variable, and it would be a nuisance to put it there. (Perhaps you already have a value in $_ you're quite fond of.) No problem. The =~ operator helps us here. This operator takes a regular expression operator on the right side, and changes the target of the operator to something besides the $_ variable: namely, some value named on the left side of the operator. It looks like this:

$a = "hello world";
$a =~ /^he/;         # true
$a =~ /(.)\1/;       # also true (matches the double l)
if ($a =~ /(.)\1/) { # true, so yes...
                     # some stuff
}

The target of the =~ operator can be any expression that yields some scalar string value. For example, <STDIN> yields a scalar string value when used in a scalar context, so we can combine this with the =~ operator and a regular expression match operator to get a compact check for particular input, as in:

print "any last request? ";
if (<STDIN> =~ /^[yY]/) {          # does the input begin with a y?
    print "And just what might that request be? ";
    <STDIN>;                       # discard a line of standard input
    print "Sorry, I'm unable to do that.\n";
}

In this case, <STDIN> yields the next line from standard input, which is then immediately used as the string to match against the pattern ^[yY]. Note that you never stored the input into a variable, so if you wanted to match the input against another pattern, or possibly echo the data out in an error message, you'd be out of luck. But this form frequently comes in handy.
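If the input is needed more than once, one option is to store it in a variable first and bind each pattern to that variable. The following is a minimal sketch, not part of the original example: a hard-coded string stands in for the line read from standard input so the snippet runs on its own, and the names $reply, $starts_with_y, and $is_polite are purely illustrative.

```perl
use strict;
use warnings;

# Stand-in for: my $reply = <STDIN>;
my $reply = "Yes, please\n";
chomp($reply);

# The same stored value can be tested against more than one pattern.
my $starts_with_y = ($reply =~ /^[yY]/)  ? 1 : 0;
my $is_polite     = ($reply =~ /please/) ? 1 : 0;

print "starts with y: $starts_with_y, polite: $is_polite\n";
print "You said: $reply\n";    # the input is still around to echo back
```

The cost is one extra variable; the benefit is that the input survives the first match.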

7.4.2 Ignoring Case

In the previous example, we used [yY] to match either a lower- or uppercase y. For very short strings, such as y or fred, this is easy enough, as in [fF][rR][eE][dD]. But what if the string you want to match is the word "procedure" in either lower- or uppercase?

In some versions of grep, a -i flag indicates "ignore case." Perl also has such an option. You indicate the ignore-case option by appending a lowercase i to the closing slash, as in /somepattern/i. This says that the letters of the pattern will match letters in the string in either case. For example, to match the word procedure in either case at the beginning of the line, use /^procedure/i.

Now our previous example looks like this:

print "any last request? ";
if (<STDIN> =~ /^y/i) { # does the input begin with a y?
    # yes! deal with it
    ...
}
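To see the ignore-case option at work on the "procedure" case, here is a small sketch. The sample strings and the @matched array are made up for illustration.

```perl
use strict;
use warnings;

my @matched;
foreach my $line ("procedure one", "Procedure two", "PROCEDURE three", "a procedure") {
    # /i lets each letter of the pattern match in either case;
    # the ^ anchor still requires the word at the start of the string
    push @matched, $line if $line =~ /^procedure/i;
}
print "$_\n" foreach @matched;
```

The first three strings match regardless of capitalization. The last one does not, because the word is not at the beginning of the string.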

7.4.3 Using a Different Delimiter

If you are looking for a string with a regular expression that contains slash characters (/), you must precede each slash with a backslash (\). For example, you can look for a string that begins with /usr/etc like this:

$path = <STDIN>; # read a pathname (from "find" perhaps?)
if ($path =~ /^\/usr\/etc/) {
    # begins with /usr/etc...
}

As you can see, the backslash-slash combination makes it look like there are little valleys between the text pieces. Doing this for a lot of slash characters can get cumbersome, so Perl allows you to specify a different delimiter character. Simply precede any nonalphanumeric, nonwhitespace character[5] (your selected delimiter) with an m, then list your pattern followed by another identical delimiter character, as in:

/^\/usr\/etc/     # using standard slash delimiter
m@^/usr/etc@      # using @ for a delimiter
m#^/usr/etc#      # using # for a delimiter (my favorite)

[5] If the delimiter happens to be the left character of a left-right pair (parenthesis, brace, angle bracket, or square bracket), the closing delimiter is the corresponding right character of the same pair. But otherwise, the characters are the same for begin and end.

You can even use slashes again if you want, as in m/fred/. So the common regular-expression matching operator is really the m operator; however, the m is optional if you choose slash for a delimiter.
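The paired delimiters described in the footnote can be sketched as follows. The $path value and the @results array are illustrative only; each line performs the same match as m#^/usr/etc#.

```perl
use strict;
use warnings;

my $path = "/usr/etc/passwd";

# With a left-right pair, the opening character is closed by its partner.
my @results = (
    ($path =~ m{^/usr/etc}) ? 1 : 0,    # braces
    ($path =~ m(^/usr/etc)) ? 1 : 0,    # parentheses
    ($path =~ m<^/usr/etc>) ? 1 : 0,    # angle brackets
    ($path =~ m[^/usr/etc]) ? 1 : 0,    # square brackets
);
print "@results\n";
```

Which delimiter to pick is mostly a matter of taste; the usual aim is to choose one that does not appear in the pattern itself.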

7.4.4 Using Variable Interpolation

A regular expression is variable interpolated before it is considered for other special characters. Therefore, you can construct a regular expression from computed strings rather than just literals. For example:

$what = "bird";
$sentence = "Every good bird does fly.";
if ($sentence =~ /\b$what\b/) {
    print "The sentence contains the word $what!\n";
}

Here we have used a variable reference to effectively construct the regular expression operator /\bbird\b/.

Here's a slightly more complicated example:

$sentence = "Every good bird does fly.";
print "What should I look for? ";
$what = <STDIN>;
chomp($what);
if ($sentence =~ /$what/) { # found it!
    print "I saw $what in $sentence.\n";
} else {
    print "nope... didn't find it.\n";
}

If you enter bird, it is found. If you enter scream, it isn't. If you enter [bw]ird, that's also found, showing that the regular expression pattern-matching characters are indeed still significant.

How would you make them insignificant? You'd have to arrange for the non-alphanumeric characters to be preceded by a backslash, which would then turn them into literal matches. That sounds hard, unless you have the \Q quoting escape at your disposal:

$what = "[box]";
foreach (qw(in[box] out[box] white[sox])) {
    if (/\Q$what\E/) {
        print "$_ matched!\n";
    }
}

Here, the \Q$what\E construct turns into \[box\], making the match look for a literal pair of enclosing brackets, instead of treating the whole thing as a character class.
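Perl's built-in quotemeta function applies the same escaping as \Q, which gives an easy way to look at the escaped string directly. A short sketch, with $escaped and @hits as illustrative names:

```perl
use strict;
use warnings;

my $what    = "[box]";
my $escaped = quotemeta($what);    # the same escaping that \Q...\E applies
print "$escaped\n";                # prints \[box\]

# Collect the strings that contain the literal text [box]
my @hits = grep { /\Q$what\E/ } qw(in[box] out[box] white[sox]);
print "@hits\n";                   # prints in[box] out[box]
```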

7.4.5 Special Read-Only Variables

After a successful pattern match, the variables $1, $2, $3, and so on are set to the same values as \1, \2, \3, and so on. You can use this to look at a piece of the match in later code. For example:

$_ = "this is a test";
/(\w+)\W+(\w+)/; # match first two words
                 # $1 is now "this" and $2 is now "is"

You can also gain access to the same values ($1, $2, $3, and so on) by placing a match in a list context. The result is a list of values from $1 up to the number of memorized things, but only if the regular expression matches. (Otherwise the match returns an empty list, so any variables assigned from it are undefined.) Taking that last example in another way:

$_ = "this is a test";
($first, $second) = /(\w+)\W+(\w+)/; # match first two words
     # $first is now "this" and $second is now "is"
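Because a failed match in a list context yields an empty list, it is worth checking the result before using the captured values. The sketch below shows one way to do that; the strings and the names $text, $status, $w1, and $w2 are invented for the example.

```perl
use strict;
use warnings;

my $text = "oneword";
my ($first, $second) = $text =~ /(\w+)\W+(\w+)/;    # needs two words, so it fails

if (defined $first) {
    print "first two words: $first $second\n";
} else {
    print "no match, nothing was captured\n";
}

# A list assignment used as a condition is true only if the match
# returned something, so testing and capturing can be combined.
my $status = "unmatched";
if (my ($w1, $w2) = "this is a test" =~ /(\w+)\W+(\w+)/) {
    $status = "$w1/$w2";
}
print "$status\n";    # prints this/is
```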

Other predefined read-only variables include $&, which is the part of the string that matched the regular expression; $`, which is the part of the string before the part that matched; and $', which is the part of the string after the part that matched. For example:

$_ = "this is a sample string";
/sa.*le/; # matches "sample" within the string
          # $` is now "this is a "
          # $& is now "sample"
          # $' is now " string"

Because these variables are set on each successful match, you should save the values elsewhere if you need them later in the program.[6]

[6] See O'Reilly's Mastering Regular Expressions for performance ramifications of using these variables.
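Saving the values can be as simple as copying them into ordinary variables right after the match. A minimal sketch, with $before, $matched, and $after as illustrative names and a second, unrelated match added to show the overwrite:

```perl
use strict;
use warnings;

my $line = "this is a sample string";
my ($before, $matched, $after);
if ($line =~ /sa.*le/) {
    # Copy immediately; the next successful match replaces all three.
    ($before, $matched, $after) = ($`, $&, $');
}

if ("other text" =~ /t.x/) {
    print "the match variable now holds: $&\n";    # tex
}

# The saved copies are unaffected by the later match.
print "before=[$before] matched=[$matched] after=[$after]\n";
```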

