Published 2008-05-29, preview for #perl6 on 2008-05-24
A Mutable Grammar For Perl 6
Did you ever wonder why modifying perl 5 syntax is so hard? Do you want to know what Perl 6 will do to make it easier?
The Perl 5 parsing problem
Parsing Perl 5 code is hard. The part of perl that parses Perl is large, and most hackers don't really want to touch it.
wc perly.y toke.c 1347 4338 36247 perly.y 12917 39097 333537 toke.c
Actually you can't parse Perl without executing it and since there's no other implementation for Perl 5 than perl5, you have to rely on that one implementation.
So if you want to change perl 5's syntax you have to dig into C and yacc/bison code and recompile perl, or write a source filter. But if you want to write a correct source filter, you have to parse perl - which you can't really, as just discussed.
But that's not the only reason why source filters are frowned upon. The second reason is that you can't really combine two ore more source filters, unless they are explicitly written for it.
You might wonder why perl is so hard to parse. The reason is that perl is designed to be easy to write, whereas most other programming languages are designed with ease of parsing in mind.
Goals for Perl 6
Learning from perl 5's problems, there were additionally design goals set for Perl 6 syntax:
- There should be a uniform syntax across different implementations
- It should be easy to transform the syntax, especially without writing a hand-rolled parser
- Multiple syntax mutations should be compatible (unless they change the same syntactical element, of course)
- Perl 6 programs and external tools (IDEs, syntax hilighting engines etc.) should get access to the parse tree
How to achieve it
The Perl 6 syntax isn't very easy either, but now there are tools in place that can handle it. And these tools are grammars, rules, bottom-up parsing, proto regexes and longest-token matching (LTM).
Rules
Rules are just like perl 5's regexes, only better. They are declared like subs or methods, and they can call other rules (and of course recurse directly into itself as well). For ease of use and efficiency there are additional techniques for handling whitespaces and backtracking control.
Here is a made-up example of how you could parse basic variable names:
grammar Perl6 { # among other stuff: # token alpha is a pre-defined rule token identifier { <alpha> \w+ } # match a full-qualified identifier # [ ... ] are non-capturing groups token name { <identifier> [ '::' <identifier> ] * } # .. | .. are alternations. The longest match wins. token sigil { '$' | '@' | '&' | '%' | '::' } # <rule> calls the named rule, implicit anchored # at the current position token variable { <sigil> <name> } }
(You'll notice that the grammar for parsing Perl 6 is written in Perl 6 itself - which is very neat for extensibility, it just requires an extra boot-strapping step.)
Grammars
A grammar is very much the same as a class, with rules instead of methods. Or to put it another way, grammars are collections of rules that support inheritance.
With rules and grammars in place you can already build a basic, modifiable parser. Let's try to make a simple syntax change. For example a lolspeak-like Perl 6 should only allow upper case variable names:
# we inherit from the original grammar... grammar PERL6 is Perl6 { # ... and override that parsing rule that we want to change token identifier { # char classes are <[ ... ]> in Perl 6 <[A..Z]> <[A..Z0..9_]>* } }
Now we just have to tell our compiler to use grammar PERL6
instead
of our default grammar. Every call to the subrule identifier
is handled
by our custom rule, so we're done.
However there's a pitfall. Suppose you want to change a sigil, for example
$
to ¢
(because you don't have enough $$$ for all
these variables, right?). Seems simple:
grammar LowBudgetPerl6 is Perl 6 { token sigil { '¢' | '@' | '&' | '%' | '::' } }
The parsing works fine with our new grammar, but everything after that is
bound to fail. When the compiler sees a sigil
match in the parse tree,
it has to find out which one it is - which means it has to check the literal
value of the matched text, and won't know what to do with ¢
.
So we need even more technologies...
Proto Regexes
A proto regex is a set of regexes/rules with the same name, that can be distinguished nonetheless.
The current Perl 6 grammar uses this construct:
proto token sigil {*} # ... token sigil:sym<$> { <sym> } token sigil:sym<@> { <sym> } token sigil:sym<%> { <sym> } token sigil:sym<&> { <sym> } token sigil:sym<::> { <sym> }
This creates a group (proto
) named sigil
,
and five rules in that group (they belong to the group because they have the
same name as the group) that are
parameterized with the sym
identifier. The first one sets
sym
to $
and then matches this symbol (with
<sym>
). The second one matches @
etc.
Now if you call the rule
<sigil>
,
you get the "or"-associated list of all those five rules. So it still matches
the same as the regex
'$' | '@' | '%' | '&' | '::'
, but is easier to extend.
If you want to add a new sigil, the only change in the grammar is another
sigil:
rule:
grammar SigilRichP6 is Perl6 { token sigil:sym<ħ> { <sym> } # physicists will love you }
Or to bring us back to the original example, you can override existing rules:
grammar LowBudgetPerl6 is Perl 6 { token sigil:sym<$> { '¢' } }
Now that's a grammar that uses a different sigil for scalars, but since
it's the same rule with the same parameter (sigil:sym<$>
)
as in the original grammar, the compiler still knows what to do with it.
So with proto regexes in place we can extend and modify grammars quite easily.
Bottom-up parsing
There's one thing left to ultimate extensibility: introducing new precedence levels.
Traditional top-down grammars introduce one rule for each precedence level, like this:
token category:add_op { <sym> } proto token add_op {...} token category:mul_op { <sym> } proto token mul_op {...} rule expression { <term> [<add_op> <term>]* } rule term { <factor> [<mul_op> <factor>]* } rule factor { | ['(' <expression> ')' ] | <value> } token add_op:sym<+> { <sym> } token add_op:sym<-> { <sym> } token mul_op:sym<*> { <sym> } token mul_op:sym</> { <sym> } token mul_op:sym<%> { <sym> }
Now if we want a derived grammar with a precedence level in between +
and *
, we have to modify two rules (expression
and term
),
which breaks any other changes made to these rules. (Remember, we're ambitious
and what to allow multiple local grammar changes).
Also notice that we produced very similar code for each precedence level, and that's boring.
What we really want is a list of operators and their precedence (and
associativity), and that the parser figures out the rest for us. The pattern is
just <term> [ <operator> <term>]+
, and once we
are able to parse that we just have to make a tree out of it.
That's called bottom-up parsing, and Perl 6 uses it a lot. You can add new operators and precedence levels without altering other rules:
token infix:sym<bla> is tighter:<+> { <sym> }
This introduces the bla
operator, which has tighter precedence
than +
, but looser than *
.
Longest-Token Matching
We're mostly there now, small changes and enhancements work like a charm. The example syntax changes took great care not to conflict with existing Perl 6 syntax. For larger changes you can't avoid that.
It seems you have to define an order in which the rules from the original grammar and the derived rules are tried, for example "derived first". But that isn't very robust if you have multiple grammar modifications (for example multiple mixins).
So instead of defining an order, all rules that could possibly match in an alternation are matched in parallel, and the one that matches the most text wins.
For example if you want to introduce a +++
postfix operator that
increments a variable by two, you'd just introduce a new rule for it, and when
the parser expects a postfix
token, both ++
and +++
match, but since the latter matches more text, it wins.
LTM is a tad more complicated than you might think from reading this, but there are some gory details involved.
A nice syntax for grammar changes
Now the ordinary Perl hacker who wants a new operator shouldn't have to know all this stuff about mutable grammars, they want a nice syntax that hides all that.
Being foremost practical, Perl 6 provides that of course. You can add the following lines in your normal Perl 6 code, and immediately afterwards you have the newly defined operators in place.
# previous example multi sub postfix:<+++> is equiv:<++> ( Num $x is rw ) { $x += 2; } # define the factorial operator multi sub postfix:<!> is equiv:<++> ( Int $x ) { return [*] 1 .. $x; } # test the thing: my $x = 2; $x+++; say $x; # prints 4 say $x!; # prints 24
And It Really Works
You might think this is all fine and cool, but you don't believe it's actually implementable.
Larry Wall put quite some effort into writing both the standard Perl 6 grammar and some programs that translate it into Perl 5 code. As of August 2008 this translated grammar can parse both itself and nearly all files of the official test suite. (The one file that it fails to parse actually adds an operator at compile time, which needs a real compiler backend to be implemented).
Internally it modifies copies of itself to parse various DSLs such as regexes and quoted constructs.
It is quite slow at the time of writing, but that's not really surprising because it's not optimized for speed yet.
Open Problems
As far as I can tell there are still open problems with the parser. For example how exactly macros will work. For example it's not yet clear (at least to me) how macros can be bound to a particular precedence level, or to a particular grammar rule.
Also it would be nice to make macros work on the AST level rather than on the source code directly. Quasiquotes seem to tackle that problem, but so far I don't see how it actually solves it.
The biggest open problem is, of course, that you have to actually execute
compile time statements (like use
, BEGIN
, and
macros), so you need a real Perl 6 compiler. That's what many people are
working on, but it's not an easy task.
Summary
With lots of new ideas grammars become easier to write and maintain, and can be easily modified with understanding only a small subset of the grammar, or in simple cases without it.