Published on 2010-09-01
How To Debug a Perl 6 Grammar
When a programmer starts to learn his craft, he spends a lot of time making small, stupid mistakes that prevent his programs from running. With a bit of practice, he learns to make fewer errors, and to write more code that runs on the first try.
With grammars, it's the same all over again. In the author's experience, even expert programmers start with silly mistakes when they begin to write grammars. It's just vastly different from writing ordinary code, and requires a similar learning experience.
Here are some instructions that help you to write and debug grammars.
Start with small steps
Start with small steps, and test along the way.
Start with a simple, single parsing rule, and test cases for it. Keep expanding the test cases and the grammar simultaneously. Only add more features when all tests that you expect to pass actually do.
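As a rough sketch of that workflow, the following uses the standard Test module. The grammar name KeyValue and its tokens key and pair are made up for this illustration; pair stands for a rule that was added only after the tests for key passed.

```perl6
use Test;
plan 4;

# Hypothetical grammar, for illustration only.
grammar KeyValue {
    # first step: a single, simple token
    token key  { \w+ }
    # second step: added only once the tests for <key> passed
    token pair { <key> '=' <key> }
}

# tests for the first step
ok  KeyValue.parse('foo', :rule<key>),      'key: accepts a word';
nok KeyValue.parse('=',   :rule<key>),      'key: rejects a non-word character';

# tests added together with the second step
ok  KeyValue.parse('foo=bar', :rule<pair>), 'pair: accepts key=value';
nok KeyValue.parse('foo=',    :rule<pair>), 'pair: rejects a missing value';
```

Each new token gets its own test cases before anything else is built on top of it, so a failing test always points at the most recent change.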
Test rules individually
If you can't understand certain behavior, test rules individually. That way you can figure out if a rule is wrong, wrongly (or never) called, or interacts badly with other rules.
    grammar MyGrammar {
        token TOP {
            ^ [ <comment> | <chunk> ]* $
        }

        token comment {
            '#' \N* $$
        }

        token chunk {
            ^^(\S+) \= (\S+) $$
        }
    }

    # try to parse the whole thing
    say ?MyGrammar.parse("#a comment\nfoo = bar"); # 0
    # and now one by one
    say ?MyGrammar.parse("#a comment\n", :rule<comment>); # 1
    say ?MyGrammar.parse("foo = bar", :rule<chunk>); # 0
The example above shows a simple grammar that doesn't match a test string, due to a stupid thinko. The last two lines test the rules individually, identifying token chunk as the faulty one.
Debug with print or say
Just like ordinary code, you can sprinkle your grammar rules with calls to say(). You just need to embed them in curly braces, so that they get executed as ordinary code.
    grammar MyGrammar {
        token chunk {
            { say "chunk: called" }
            ^^
            { say "chunk: found start of line" }
            (\S+)
            { say "chunk: found first identifier: $0" }
            \=
            { say "chunk: found =" }
            (\S+)
            $$
        }
    }

    say ?MyGrammar.parse("foo = bar", :rule<chunk>);

    # output:
    #
    # chunk: called
    # chunk: found start of line
    # chunk: found first identifier: foo
    # 0
You can see that the rule matched the start of the line, and foo, but not the equals sign. What's between the two? A space, and nothing in the token matches it. Making chunk a rule instead of a token fixes this problem, because in a rule the whitespace in the pattern also matches whitespace in the input.
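A minimal sketch of the fixed version, reusing the MyGrammar and chunk names from above. The variable $m is only there for this illustration.

```perl6
grammar MyGrammar {
    # 'rule' is like 'token', but whitespace after an atom in the
    # pattern also matches (optional) whitespace in the input
    rule chunk {
        ^^(\S+) \= (\S+) $$
    }
}

my $m = MyGrammar.parse("foo = bar", :rule<chunk>);
say ?$m;     # true this time
say ~$m[0];  # foo
say ~$m[1];  # bar
```

Note that the captures still contain only foo and bar; the whitespace is consumed between the atoms, not inside the capturing groups.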
Remember that backtracking can cause a single block to be executed multiple times, even if not part of a quantified construct.
    $ perl6 -e '"aabcd" ~~ /^ (.*) { say $0 } b /'
    aabcd
    aabc
    aab
    aa
Be careful with backtracking control
Programmers who are familiar with Perl 5 regexes or similar regex engines are used to backtracking: If the "most obvious" way to match a string does not work out, the regex engine tries all possible other ways.
This is what many expect for small regexes, but when writing a grammar that has several nesting levels, it can be deeply confusing.
Most day-to-day parsing problems can be formulated in a way that requires little or no backtracking, and it should be done that way, both for efficiency and programmer sanity.
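To make the difference concrete, here is a small sketch that declares the same pattern once as a regex (which backtracks) and once as a token (which does not), followed by a reformulation that needs no backtracking at all. The names backtracking, ratcheting and explicit are invented for this example.

```perl6
# the same pattern, with and without backtracking
my regex backtracking { .* b }       # .* may give characters back
my token ratcheting   { .* b }       # .* keeps everything it consumed

# a formulation that does not rely on backtracking
my token explicit     { <-[b]>* b }  # never consumes the 'b' too early

say so 'aab' ~~ / ^ <backtracking> $ /;  # matches
say so 'aab' ~~ / ^ <ratcheting> $ /;    # does not match
say so 'aab' ~~ / ^ <explicit> $ /;      # matches
```

In the token, .* swallows the whole string and there is nothing left for b to match. The last version says what it means (anything that is not a b, followed by a b), so it works without ever having to give characters back.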
Some constructs are easier with backtracking, but if you use them, embed them in a non-backtracking rule (i.e. token or rule, which have the :ratchet modifier implicitly set):
    rule verbatim {
        '[%' ~ '%]' verbatim
        # switch on backtracking from here on,
        # to the bottom of the rule
        :!ratchet
        .*?
        '[%' endverbatim '%]'
    }
This uses backtracking inside the regex, but once it has found a possible match, it will never try another, because here verbatim is a rule, which (like token) suppresses backtracking into itself.
Grammar::Tracer for Rakudo Grammars
Jonathan Worthington's excellent Grammar::Tracer module in the Grammar::Debugger distribution is a very useful tool for debugging grammars. It is limited to Rakudo only.
If you add use Grammar::Tracer; to your code, all grammars in that lexical scope will emit debug information at run time. The debug output is colored, and shows which rules tried to match, and whether they succeeded or failed. The Perl 6 advent calendar has an entry with more details.