Recipe 6.6. Matching Multiple Lines

Problem

You want to use regular expressions on a string containing more than one line, but the special characters . (any character but newline), ^ (start of string), and $ (end of string) don't seem to work for you. This might happen if you're reading in multiline records or the whole file at once.

Solution

Use /m, /s, or both as pattern modifiers. /s lets . match newline (normally it doesn't). If the string had more than one line in it, then /foo.*bar/s could match a "foo" on one line and a "bar" on a following line. This doesn't affect dots in character classes like [#%.], since they are regular periods anyway.

The /m modifier lets ^ and $ match next to a newline. /^=head[1-7]$/m would match that pattern not just at the beginning of the record, but anywhere right after a newline as well.
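
As a quick illustration of both modifiers, here is a minimal sketch that runs against an in-memory string instead of a file. The variable names and the sample text are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical three-line record; any multiline string behaves the same way.
my $record = "foo on line one\nbar on line two\n=head2\n";

# Without /s, dot stops at the newline, so foo and bar can't be joined.
my $plain  = $record =~ /foo.*bar/  ? 1 : 0;
# With /s, dot matches the newline too.
my $with_s = $record =~ /foo.*bar/s ? 1 : 0;

# Without /m, ^ matches only at the very start of the string.
my $head_plain  = $record =~ /^=head[1-7]$/  ? 1 : 0;
# With /m, ^ and $ also match next to the embedded newlines.
my $head_with_m = $record =~ /^=head[1-7]$/m ? 1 : 0;

print "plain=$plain s=$with_s head=$head_plain head_m=$head_with_m\n";
```

Only the /s and /m variants should report a match here, since "bar" and "=head2" both sit after a newline.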

Discussion

A common, brute-force approach to parsing documents where newlines are not significant is to read the file one paragraph at a time (or sometimes even the entire file as one string) and then extract tokens one by one. To match across newlines, you need to make . match a newline; it ordinarily does not. In cases where newlines are important and you've read more than one line into a string, you'll probably prefer to have ^ and $ match beginning- and end-of-line, not just beginning- and end-of-string.

The difference between /m and /s is important: /m makes ^ and $ match next to a newline, while /s makes . match newlines. You can even use them together - they're not mutually exclusive options.

Example 6.2 creates a filter to strip HTML tags out of each file in @ARGV and send the results to STDOUT. First we undefine the record separator so each read operation fetches one entire file. (There could be more than one file if @ARGV holds several arguments; in that case, each read gets one whole file.) Then we strip out instances of beginning and ending angle brackets, plus anything in between them. We can't use just .* for two reasons: first, it would match closing angle brackets, and second, the dot wouldn't cross newline boundaries. Using .*? in conjunction with /s solves these problems - at least in this case.

Example 6.2: killtags

#!/usr/bin/perl
# killtags - very bad html tag killer
undef $/;           # each read is whole file
while (<>) {        # get one whole file at a time
    s/<.*?>//gs;    # strip tags (terribly)
    print;          # print file to STDOUT
}

Because the closing delimiter is just a single character, it would be much faster to use s/<[^>]*>//gs, but that's still a naïve approach: it doesn't correctly handle tags inside HTML comments or angle brackets in quotes (<IMG SRC="here.gif" ALT="<<Ooh la la!>>">). Recipe 20.6 explains how to avoid these problems.
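
To see both the negated-class version and its weakness, here is a small sketch. The helper name strip_tags is hypothetical and exists only for this illustration. It leaves off /s, because a negated character class such as [^>] already matches newlines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags is a made-up helper name, not part of any module.
sub strip_tags {
    my ($html) = @_;
    # [^>]* runs up to the first closing bracket, newlines included.
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Tags that span a line break are removed cleanly.
my $simple = strip_tags("<B>bold\ntext</B> and <A\nHREF=\"x.html\">link</A>");

# A quoted > ends the match too early and leaves debris behind.
my $broken = strip_tags('<IMG SRC="here.gif" ALT="<<Ooh la la!>>">');

print "simple: $simple\n";
print "broken: $broken\n";
```

The second call leaves the tail of the attribute value in the output, which is the kind of failure the text warns about.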

Example 6.3 takes a plain text document and looks for lines at the start of paragraphs that look like "Chapter 20: Better Living Through Chemisery". It wraps these with an appropriate HTML level one header. Because the pattern is relatively complex, we use the /x modifier so we can embed whitespace and comments.

Example 6.3: headerfy

#!/usr/bin/perl
# headerfy: change certain chapter headers to html
$/ = '';
while ( <> ) {              # fetch a paragraph
    s{
        \A                  # start of record
        (                   # capture in $1
            Chapter         # text string
            \s+             # mandatory whitespace
            \d+             # decimal number
            \s*             # optional whitespace
            :               # a real colon
            . *             # anything not a newline till end of line
        )
    }{<H1>$1</H1>}gx;
    print;
}

Here it is as a one-liner from the command line if those extended comments just get in the way of understanding:

% perl -00pe 's{\A(Chapter\s+\d+\s*:.*)}{<H1>$1</H1>}gx' datafile

This problem is interesting because we need to be able to specify both start-of-record and end-of-line in the same pattern. We could normally use ^ for start-of-record, but if we also wanted $ to indicate end-of-line and not just end-of-record, we would have to add the /m modifier, which changes both ^ and $. So instead of using ^ to match beginning-of-record, we use \A, which keeps its meaning with or without /m. (The pattern as shown gets by without $ or /m: because /s is not in effect, .* stops at the first newline by itself. In case you're interested, the version of $ that always matches end-of-record even in the presence of /m is \Z.)
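
The difference between the anchors under /m can be checked with a short sketch like the following. The sample paragraph and the array names are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical paragraph in which two lines look like chapter headers.
my $para = "Chapter 1: Intro\nChapter 2: More\n";

# Under /m, ^ matches at the start of every line in the record.
my @caret = $para =~ /^(Chapter\s+\d+)/mg;
# \A still matches only at the very start of the record.
my @big_a = $para =~ /\A(Chapter\s+\d+)/mg;

# Under /m, $ matches before each newline.
my @dollar = $para =~ /(\w+)$/mg;
# \Z still matches only at the end of the record.
my @big_z  = $para =~ /(\w+)\Z/mg;

print "caret=@caret | A=@big_a | dollar=@dollar | Z=@big_z\n";
```

Both ^ and $ find a match on each line, while \A and \Z each find one match for the whole record.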

The following example demonstrates using both /s and /m together, because we want ^ to match the beginning of any line in the paragraph and also want dot to be able to match a newline. (Because they are unrelated, using them together is simply the sum of the parts. If you have the questionable habit of using "single line" as a mnemonic for /s and "multiple line" for /m, then you may think you can't use them together.) The predefined variable $. holds the current record number of the filehandle most recently read. The predefined variable $ARGV holds the name of the file currently opened by implicit <ARGV> processing.

$/ = '';            # paragraph read mode for readline access
while (<ARGV>) {
    while (m#^START(.*?)^END#gsm) { # /g resumes after each match so the loop ends
                                    # /s makes . span line boundaries
                                    # /m makes ^ match just after newlines
        print "chunk $. in $ARGV has <<$1>>\n";
    }
}
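
To try the same pattern without a data file, here is a sketch that applies it to an in-memory string, using /g in list context to collect every chunk at once. The sample text and variable names are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical paragraph holding two START ... END chunks.
my $para = "intro\nSTART one\ntwo\nEND\nmiddle\nSTART three\nEND\n";

# /s lets .*? cross newlines; /m lets ^ match at the start of each line;
# /g in list context returns the captured text of every match.
my @chunks = $para =~ m#^START(.*?)^END#gsm;

my $n = 0;
for my $chunk (@chunks) {
    $n++;
    print "chunk $n has <<$chunk>>\n";
}
```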

If you've already committed to using the /m modifier, you can use \A and \Z to get the old meanings of ^ and $ respectively. But what if you've used the /s modifier and want to get the original meaning of .? You can use [^\n]. If you don't care to use /s but want the notion of matching any character, you could construct a character class that matches any one byte, such as [\000-\377] or even [\d\D]. You can't use [.\n] because . is not special in a character class.
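
These substitutes for dot can be compared in a short sketch. The test string and variable names are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line string.
my $text = "ab\ncd";

# Under /s, [^\n] gives back the default meaning of dot.
my ($line) = $text =~ /([^\n]*)/s;
# Without /s, [\d\D] matches any character, newline included.
my ($all)  = $text =~ /([\d\D]*)/;
# [.\n] matches only a literal period or a newline, so it matches nothing here.
my ($none) = $text =~ /([.\n]*)/;

printf "line=%d all=%d none=%d\n", length $line, length $all, length $none;
```

The first capture stops at the newline, the second takes the whole string, and the third comes back empty because "a" is neither a period nor a newline.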
