Fri, 26 Sep 2008

Regexes (also called "rules")

NAME

"Perl 5 to 6" Lesson 07 - Regexes (also called "rules")

SYNOPSIS

    grammar URL {
        token TOP {
            <schema> '://' 
            [<ip> | <hostname> ]
            [ ':' <port>]?
            '/' <path>?
        }
        token byte {
            (\d**1..3) <?{ $0 < 256 }>
        }
        token ip {
            <byte> [\. <byte> ] ** 3
        }
        token schema {
            \w+
        }
        token hostname {
            (\w+) ( \. \w+ )*
        }
        token port {
            \d+
        }
        token path {
            <[ a..z A..Z 0..9 \-_.!~*'():@&=+$,/ ]>+
        }
    }

    my $match = URL.parse('http://perl6.org/documentation/');
    say $match<hostname>;       # perl6.org

DESCRIPTION

Regexes are one of the areas that has been improved and revamped most in Perl 6. We don't call them regular expressions anymore because they are even less regular than they are in Perl 5.

There are three large changes and enhancements to the regexes

Syntax clean up

Many small changes make rules easier to write. For example the dot . matches any character now, the old semantics (anything but newlines) can be achieved with \N.

Modifiers now go at the start of a regex, and non-capturing groups are [...], which are a lot easier to read and write than Perl 5 (?:...).

Nested captures and match object

In Perl 5, a regex like this (a(b))(c) would put ab into $1, b into $2 and c into $3 upon successful match. This has changed. Now $0 (enumeration starts at zero) contains ab, and $0[0] or $/[0][0] contains b. $1 holds c. So each nesting level of parenthesis is reflected in a new nesting level in the result match object.

All the match variables are aliases into $/, which is the so-called Match object, and it actually contains a full match tree.

Named regexes and grammars

You can declare regexes with names just like you can with subs and methods. You can refer to these inside other rules with <name>. And you can put multiple regexes into grammars, which are just like classes and support inheritance and composition

These changes make Perl 6 regexes and grammars much easier to write and maintain than Perl 5 regexes.

All of these changes go quite deep, and only the surface can be scratched here.

Syntax clean up

Letter characters (ie underscore, digits and all Unicode letters) match literally, and have a special meaning (they are metasyntactic) when escaped with a backslash. For all other characters it's the other way round - they are metasyntactic unless escaped.

    literal         metasyntactic
    a  b  1  2      \a \b \1 \2
    \* \: \. \?     *  :  .  ?

Not all metasyntactic tokens have a meaning (yet). It is illegal to use those without a defined meaning.

There is another way to escape strings in regexes: with quotes.

    m/'a literal text: $#@!!'/

The change in semantics of . has already been mentioned, and also that [...] now construct non-capturing groups. Character classes are <[...]>, and negated char classes <-[...]>. ^ and $ always match begin and end of the string respectively, to match begin and end of lines use ^^ and $$.

This means that the /s and /m modifiers are gone. Modifiers are now given at the start of a regex, and are given in this notation:

    if "abc" ~~ m:i/B/ {
        say "Matched a B.";
    }

... which happens to be the same as the colon pair notation that you can use for passing named arguments to routines.

Modifiers have a short and a long form. The old /x modifier is now the default, i.e. white spaces are ignored.

    short   long            meaning
    -------------------------------
    :i      :ignorecase     ignore case (formerly /i)
    :m      :ignoremark     ignore marks (accents, diaeresis etc.)
    :g      :global         match as often as possible (/g)
    :s      :sigspace       Every white space in the regex matches
                            (optional) white space
    :P5     :Perl5          Fall back to Perl 5 compatible regex syntax
    :4x     :x(4)           Match four times (works for other numbers as well)
    :3rd    :nth(3)         Third match
    :ov     :overlap        Like :g, but also consider overlapping matches
    :ex     :exhaustive     Match in all possible ways
            :ratchet        Don't backtrack

The :sigspace needs a bit more explanation. It replaces all whitespace in the pattern with <.ws> (that is it calls the rule ws without keeping its result). You can override that rule. By default it matches one or more whitespaces if it's enclosed in word characters, and zero or more otherwise.

(There are more new modifiers, but probably not as important as the listed ones).

The Match Object

Every match generates a so-called match object, which is stored in the special variable $/. It is a versatile thing. In boolean context it returns Bool::True if the match succeeded. In string context it returns the matched string, when used as a list it contains the positional captures, and when used as a hash it contains the named captures. The .from and .to methods contain the first and last string position of the match respectively.

    if 'abcdefg' ~~ m/(.(.)) (e | bla ) $<foo> = (.) / {
        say $/[0][0];           # d
        say $/[0];              # cd
        say $/[1];              # e
        say $/<foo>             # f
    }

$0, $1 etc are just aliases for $/[0], $/[1] etc. Likewise $/<x> and $/{'x'} are aliased to $<x>.

Note that anything you access via $/[...] and $/{...} is a match object (or a list of Match objects) again. This allows you to build real parse trees with rules.

Named Regexes and Grammars

Regexes can either be used with the old style m/.../, or be declared like subs and methods.

    regex a { ... }
    token b { ... }
    rule  c { ... }

The difference is that token implies the :ratchet modifier (which means no backtracking, like a (?> ... ) group around each part of the regex in perl 5), and rule implies both :ratchet and :sigspace.

To call such a rule (we'll call them all rules, independently with which keyword they were declared) you put the name in angle brackets: <a>. This implicitly anchors the sub rule to its current position in the string, and stores the result in the match object in $/<a>, ie it's a named capture. You can also call a rule without capturing its result by prefixing its name with a dot: <.a>.

If you want to refer to a rule outside of a Grammar, you need to call them with a routine sigil, like <&other>.

A grammar is a group of rules, just like a class (see the SYNOPSIS for an example). Grammars can inherit, override rules and so on.

    grammar URL::HTTP is URL {
        token schema { 'http' }
    }

MOTIVATION

Perl 5 regexes are often rather unreadable, the grammars encourage you to split a large regex into more readable, short fragments. Named captures make the rules more self-documenting, and many things are now much more consistent than they were before.

Finally grammars are so powerful that you can parse about every programming language with them, including Perl 6 itself. That makes the Perl 6 grammar easier to maintain and to change than the Perl 5 one, which is written in C and not changeable at parse time.

Categories

Posts in this category