Sun, 30 Nov 2008
Regexes strike back
NAME
"Perl 5 to 6" Lesson 19 - Regexes strike back
SYNOPSIS
    # normal matching:
    if 'abc' ~~ m/../ {
        say $/;                 # ab
    }
    # match with implicit :sigspace modifier
    if 'ab cd ef' ~~ ms/ (..) ** 2 / {
        say $0[1];              # cd
    }
    # substitute with the :samespace modifier
    my $x = "abc defg";
    $x ~~ ss/c d/x y/;
    say $x;                     # abx yefg
DESCRIPTION
Since the basics of regexes are already covered in lesson 07, here are some useful (but not very structured) additional facts about regexes.
Matching
You don't need to write grammars to match regexes; the traditional form m/.../ still works, and has a new sibling, the ms/.../ form, which implies the :sigspace modifier. Remember, that means that whitespace in the regex is replaced by a call to the <.ws> rule.
The default for the rule is to match \s+ if it is surrounded by two word characters (i.e. those matching \w), and \s* otherwise.
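The rule above can be sketched with three small matches. This is a minimal illustration, assuming a current Rakudo in which :s is accepted as the short form of :sigspace; the variable names are made up for the example.

```raku
# :s is the short form of :sigspace; whitespace after an atom calls <.ws>
my $spaced   = so 'ab cd' ~~ m:s/ab cd/;     # whitespace between two word characters
my $glued    = so 'abcd'  ~~ m:s/ab cd/;     # no whitespace between two word characters
my $operator = so 'a+b'   ~~ m:s/a '+' b/;   # no pair of word characters, zero whitespace is enough

say $spaced;     # True
say $glued;      # False
say $operator;   # True
```

The second match fails because 'b' and 'c' are both word characters, so the whitespace between them is mandatory.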
In substitutions the :samespace modifier makes sure that whitespace matched by the ws rule is preserved. Likewise the :samecase modifier, short :ii (since it's a variant of :i), preserves case.
    my $x = 'Abcd';
    $x ~~ s:ii/^../foo/;
    say $x;                     # Foocd
    $x = 'ABC';
    $x ~~ s:ii/^../foo/;
    say $x                      # FOOC
This is very useful if you want to globally rename your module Foo to Bar, but in some places (environment variables, for example) the name is written in all uppercase. With the :ii modifier the case is automatically preserved.
It copies case information character by character. But there's also a more intelligent version: when combined with the :sigspace (short :s) modifier, it tries to find a pattern in the case information of the source string. The recognized patterns are .lc, .uc, .lc.ucfirst, .uc.lcfirst and .lc.capitalize (Str.capitalize uppercases the first character of each word). If such a pattern is found, it is also applied to the substitution string.
    my $x = 'The Quick Brown Fox';
    $x ~~ s :s :ii /brown.*/perl 6 developer/;
    # $x is now 'The Quick Perl 6 Developer'
Alternations
Alternations are still formed with the single bar |, but it means something different than in Perl 5. Instead of trying the alternatives sequentially and taking the first one that matches, it now matches all alternatives in parallel and takes the longest one.
    'aaaa' ~~ m/ a | aaa | aa /;
    say $/                      # aaa
While this might seem like a trivial change, it has far-reaching consequences, and is crucial for extensible grammars. Since Perl 6 is parsed using a Perl 6 grammar, longest-match alternation is the reason that in ++$a the ++ is parsed as a single token, not as two prefix:<+> tokens.
The old, sequential style is still available with ||:
    grammar Math::Expression {
        token value {
            | <number>
            | '('
              <expression>
              [ ')' || { fail("Parenthesis not closed") } ]
        }
        ...
    }
The { ... } block executes a closure, and calling fail in that closure makes the expression fail. That branch is guaranteed to be executed only if the previous one (here the ')') fails, so it can be used to emit useful error messages while parsing.
There are other ways to write alternations. For example, if you "interpolate" an array, it will match as an alternation of its values:
    $_ = '12 oranges';
    my @fruits = <apple orange banana kiwi>;
    if m:i:s/ (\d+) (@fruits)s? / {
        say "You've got $0 $1s, I've got { $0 + 2 } of them. You lost.";
    }
There is yet another construct that automatically matches the longest alternation: multi regexes. They can be either written as multi token name or with a proto:
    grammar Perl {
        ...
        proto token sigil { * }
        token sigil:sym<$> { <sym> }
        token sigil:sym<@> { <sym> }
        token sigil:sym<%> { <sym> }
        ...
        token variable { <sigil> <twigil>? <identifier> }
    }
This example shows multiple tokens called sigil, which are parameterized by sym. When the short name, i.e. sigil, is used, all of these tokens are matched in an alternation. You may think that this is a very inconvenient way to write an alternation, but it has a huge advantage over writing '$'|'@'|'%': it is easily extensible:
    grammar AddASigil is Perl {
        token sigil:sym<!> { <sym> }
    }
    # wow, we have a Perl 6 grammar with an additional sigil!
Likewise you can override existing alternatives:
    grammar WeirdSigil is Perl {
        token sigil:sym<$> { '°' }
    }
In this grammar the sigil for scalar variables is °, so whenever the grammar looks for a sigil it searches for a ° instead of a $, but the compiler will still know that it was the regex sigil:sym<$> that matched it.
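Since the Perl grammar above is elided, the same idea can be tried out on a tiny stand-alone grammar. The following is only a sketch, assuming a current Rakudo; the grammar names TinySigils, ExtraSigil and OddSigil and the identifier token are hypothetical and exist only for this illustration.

```raku
# A hypothetical miniature grammar: a variable is a sigil followed by an identifier
grammar TinySigils {
    token TOP        { <sigil> <identifier> }
    proto token sigil {*}
    token sigil:sym<$> { <sym> }
    token sigil:sym<@> { <sym> }
    token identifier { \w+ }
}

# Extending: one more candidate joins the sigil alternation
grammar ExtraSigil is TinySigils {
    token sigil:sym<!> { <sym> }
}

# Overriding: the scalar sigil is now written as °
grammar OddSigil is TinySigils {
    token sigil:sym<$> { '°' }
}

my $base-dollar = so TinySigils.parse('$foo');
my $base-bang   = so TinySigils.parse('!foo');
my $extra-bang  = so ExtraSigil.parse('!foo');
my $odd-degree  = so OddSigil.parse('°foo');

say $base-dollar;   # True
say $base-bang;     # False
say $extra-bang;    # True
say $odd-degree;    # True
```

Neither derived grammar has to touch TOP or restate the other sigils; adding or replacing a single token is enough.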
In the next lesson you'll see the development of a real, working grammar with Rakudo.