Sun, 30 Nov 2008

Regexes strike back


NAME

"Perl 5 to 6" Lesson 19 - Regexes strike back

SYNOPSIS

    # normal matching:
    if 'abc' ~~ m/../ {
        say $/;                 # ab
    }

    # match with implicit :sigspace modifier
    if 'ab cd ef'  ~~ ms/ (..) ** 2 / {
        say $0[1];              # cd
    }

    # substitute with the :samespace modifier
    my $x = "abc     defg";
    $x ~~ ss/c d/x y/;
    say $x;                     # abx     yefg

DESCRIPTION

Since the basics of regexes are already covered in lesson 07, here are some useful (but not very structured) additional facts about regexes.

Matching

You don't need to write a grammar to match a regex: the traditional form m/.../ still works. It has a new sibling, the ms/.../ form, which implies the :sigspace modifier. Remember, that means that whitespace in the regex is replaced by a call to the <.ws> rule.

By default, this rule matches \s+ if it is surrounded by two word characters (i.e. characters matching \w), and \s* otherwise.
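A minimal sketch of what that default means in practice, assuming a current Rakudo. The test strings are made up for illustration:

    # between two word characters, whitespace is mandatory
    say so 'ab cd' ~~ ms/ab cd/;        # True
    say so 'abcd'  ~~ ms/ab cd/;        # False

    # next to a non-word character, whitespace is optional
    say so 'a + b' ~~ ms/a '+' b/;      # True
    say so 'a+b'   ~~ ms/a '+' b/;      # True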

In substitutions, the :samespace modifier ensures that whitespace matched by the ws rule is preserved. Likewise, the :samecase modifier, or :ii for short (since it's a variant of :i), preserves case.

    my $x = 'Abcd';
    $x ~~ s:ii/^../foo/;
    say $x;                     # Foocd
    $x = 'ABC';
    $x ~~ s:ii/^../foo/;
    say $x                      # FOOC

This is very useful if you want to globally rename your module Foo to Bar, but in some places, for example in environment variables, the name is written in all uppercase. With the :ii modifier the case is preserved automatically.
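To make the renaming use case concrete, here is a small sketch. The sample string and the use of the :g (global) modifier are my own additions for illustration, and the output shown assumes that the case of each matched character is copied onto the replacement:

    my $text = 'use Foo; FOO_HOME is read by foo';
    $text ~~ s:g:ii/foo/bar/;
    say $text;                  # use Bar; BAR_HOME is read by bar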

The :ii modifier copies case information character by character. But there's also a more intelligent version: when combined with the :sigspace (short :s) modifier, it tries to find a pattern in the case information of the source string. The recognized patterns are .lc, .uc, .lc.ucfirst, .uc.lcfirst and .lc.capitalize (Str.capitalize uppercases the first character of each word). If such a pattern is found, it is also applied to the substitution string.

    my $x = 'The Quick Brown Fox';
    $x ~~ s :s :ii /brown.*/perl 6 developer/;
    # $x is now 'The Quick Perl 6 Developer'

Alternations

Alternations are still formed with the single bar |, but it means something different than in Perl 5. Instead of trying the alternatives sequentially and taking the first one that matches, it now matches all alternatives in parallel and takes the longest match.

    'aaaa' ~~ m/ a | aaa | aa /;
    say $/                          # aaa

While this might seem like a trivial change, it has far-reaching consequences, and is crucial for extensible grammars. Since Perl 6 is parsed using a Perl 6 grammar, longest-match alternation is the reason that in ++$a the ++ is parsed as a single token, not as two prefix:<+> tokens.

The old, sequential style is still available with ||:

    grammar Math::Expression {
        token value {
            | <number>
            | '(' 
              <expression> 
              [ ')' || { fail("Parenthesis not closed") } ]
        }

        ...
    }

The { ... } block executes a closure, and calling fail in that closure makes the expression fail. That branch is guaranteed to be executed only if the previous alternative (here the ')') fails, so it can be used to emit useful error messages while parsing.
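To see the difference between the two operators side by side, here is the 'aaaa' example from above matched once with | and once with ||. This is only a sketch of the semantics described here, assuming a current Rakudo:

    # longest alternative wins, regardless of the order
    'aaaa' ~~ m/ a | aaa | aa /;
    say ~$/;                        # aaa

    # first alternative that matches wins
    'aaaa' ~~ m/ a || aaa || aa /;
    say ~$/;                        # a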

There are other ways to write alternations. For example, if you "interpolate" an array, it will match as an alternation of its values:

    $_ = '12 oranges';
    my @fruits = <apple orange banana kiwi>;
    if m:i:s/ (\d+) (@fruits)s? / {
        say "You've got $0 $1s, I've got { $0 + 2 } of them. You lost.";
    }

There is yet another construct that automatically matches the longest alternative: multi regexes. They can be written either as multi token name or with a proto:

    grammar Perl {
        ...
        proto token sigil { * }
        token sigil:sym<$> { <sym> }
        token sigil:sym<@> { <sym> }
        token sigil:sym<%> { <sym> }
        ...

        token variable { <sigil> <twigil>? <identifier> }
    }

This example shows multiple tokens called sigil, which are parameterized by sym. When the short name, i.e. sigil, is used, all of these tokens are matched in an alternation. You may think that this is a very inconvenient way to write an alternation, but it has a huge advantage over writing '$'|'@'|'%': it is easily extensible:

    grammar AddASigil is Perl {
        token sigil:sym<!> { <sym> }
    }
    # wow, we have a Perl 6 grammar with an additional sigil!

Likewise you can override existing alternatives:

    grammar WeirdSigil is Perl {
        token sigil:sym<$> { '°' }
    }

In this grammar the sigil for scalar variables is °, so whenever the grammar looks for a sigil it searches for a ° instead of a $, but the compiler will still know that it was the regex sigil:sym<$> that matched it.
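Since the grammars above are abbreviated with ..., here is a small self-contained sketch of the same extension mechanism that can actually be run, assuming a current Rakudo. The grammar names Sigils and MoreSigils and the simplified identifier token are made up for this illustration and are not part of the real Perl 6 grammar:

    grammar Sigils {
        token TOP { <sigil> <identifier> }

        proto token sigil { * }
        token sigil:sym<$> { <sym> }
        token sigil:sym<@> { <sym> }

        # simplified stand-in for a real identifier rule
        token identifier { \w+ }
    }

    # the subclass adds one alternative, everything else is inherited
    grammar MoreSigils is Sigils {
        token sigil:sym<!> { <sym> }
    }

    say so Sigils.parse('$foo');        # True
    say so Sigils.parse('!foo');        # False
    say so MoreSigils.parse('!foo');    # True

The parent grammar is not modified at all; the new sigil is only known to the derived grammar.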

In the next lesson you'll see the development of a real, working grammar with Rakudo.
