Sun, 12 Feb 2017

Perl 6 By Example: Generating Good Parse Errors from a Parser



This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list. It will be low volume (less than an email per month, on average).


Good error messages are paramount to the user experience of any product. Parsers are no exception to this. Consider the difference between the message "Square bracket [ on line 5 closed by curly bracket } on line 5" and Python's lazy and generic "SyntaxError: invalid syntax".

In addition to the textual message, knowing the location of the parse error helps tremendously in figuring out what's wrong.

We'll explore how to generate better parsing error messages from a Perl 6 grammar, using the INI file parser from the previous blog posts as an example.

Failure is Normal

Before we start, it's important to realize that in a grammar-based parser, it's normal for a regex to fail to match, even in an overall successful parse.

Let's recall a part of the parser:

token block { [<pair> | <comment>]* }
token section { <header> <block> }
token TOP { <block> <section>* }

When this grammar matches against the string

key=value
[header]
other=stuff

then TOP calls block, which calls both pair and comment. The pair match succeeds, the comment match fails. No big deal. But since there is a * quantifier in token block, it tries again to match pair or comment. Neither succeeds, but the overall match of token block still succeeds.

A nice way to visualize passed and failed submatches is to install the Grammar::Tracer module (zef install Grammar::Tracer or panda install Grammar::Tracer), and simply add the statement use Grammar::Tracer before the grammar definition. This produces debug output showing which rules matched and which didn't:

TOP
|  block
|  |  pair
|  |  |  key
|  |  |  * MATCH "key"
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  value
|  |  |  * MATCH "value"
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  ws
|  |  |  * MATCH ""
|  |  * MATCH "key=value\n"
|  |  pair
|  |  |  key
|  |  |  * FAIL
|  |  * FAIL
|  |  comment
|  |  * FAIL
|  * MATCH "key=value\n"
|  section
...
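To see the same effect without the tracer, here is a minimal, self-contained sketch. Only block, section and TOP are taken from above; the grammar name IniSketch and the definitions of pair, comment, header, key, value and ws are simplified assumptions for illustration, not the exact ones from the earlier posts.

```perl6
grammar IniSketch {
    # These three tokens are the ones quoted above.
    token TOP     { <block> <section>* }
    token section { <header> <block> }
    token block   { [<pair> | <comment>]* }

    # Everything below is a simplified assumption for this sketch.
    rule  pair    { <key> '=' <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N* \n+ }
    token key     { \w+ }
    token value   { <-[\n;]>+ }
    token ws      { \h* }   # only horizontal whitespace is skipped implicitly
}

my $input = "key=value\n[header]\nother=stuff\n";

# pair, comment and header each fail at least once while parsing this
# input, yet the parse as a whole succeeds.
say so IniSketch.parse($input);   # True
```

The failed attempts are simply how the * quantifiers find out that there is nothing more to match.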

Detecting Harmful Failure

To produce good parsing error messages, you must distinguish between expected and unexpected parse failures. As explained above, a match failure of a single regex or token is not generally an indication of a malformed input. But you can identify points where you know that once the regex engine got this far, the rest of the match must succeed.

If you recall pair:

rule pair { <key>  '='  <value> \n+ }

we know that if a key was parsed, we really expect the next character to be an equals sign. If not, the input is malformed.

In code, this looks like this:

rule pair {
    <key>
    [ '=' || <expect('=')> ]
    <value> \n+
}

|| is a sequential alternative, which first tries to match the subregex on the left-hand side, and only executes the right-hand side if that failed. On the other hand, | executes all alternatives notionally in parallel, and takes the longest match.
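A small sketch of the difference, using plain regexes outside of any grammar (the string and the alternatives are made up for illustration):

```perl6
# Sequential alternation: branches are tried left to right,
# and the first one that matches wins.
say ~('foobar' ~~ / foo || foobar /);   # foo

# Longest-token alternation: the branch with the longest match wins,
# regardless of the order in which the branches are written.
say ~('foobar' ~~ / foo | foobar /);    # foobar
```

For error reporting the left-to-right behavior is exactly what we want: the error branch should only run once the regular branch has been tried and has failed.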

So now we have to define expect:

method expect($what) {
    die "Cannot parse input as INI file: Expected $what";
}

Yes, you can call methods just like regexes, because regexes really are methods under the hood. die throws an exception, so now the malformed input justakey produces the error

Cannot parse input as INI file: Expected =

followed by a backtrace. That's already better than "invalid syntax", though the position is still missing. Inside method expect, we can find the current parsing position through method pos, a method supplied by the implicit parent class Grammar that the grammar declaration brings with it.

We can use that to improve the error message a bit:

method expect($what) {
    die "Cannot parse input as INI file: Expected $what at character {self.pos}";
}
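Put together as something you can run, this might look like the following sketch. The grammar name PairSketch, the one-line TOP and the simplified key, value and ws tokens are assumptions made to keep the example self-contained; they are not the definitions from the real INI grammar.

```perl6
grammar PairSketch {
    token TOP   { <pair>+ }
    rule  pair  {
        <key>
        [ '=' || <expect('=')> ]
        <value> \n+
    }

    # Simplified stand-ins for the real tokens (assumptions).
    token key   { \w+ }
    token value { <-[\n]>+ }
    token ws    { \h* }

    # Called like a regex; it never returns, it always throws.
    method expect($what) {
        die "Cannot parse input as INI file: Expected $what at character {self.pos}";
    }
}

# Well-formed input parses as before.
say so PairSketch.parse("key=value\n");

# Malformed input now raises an exception instead of quietly failing to match.
try PairSketch.parse("justakey\n");
say $!.message if $!;
```

Wrapping the call in try is only there to print the message without the backtrace; the exception itself comes from method expect.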

Providing Context

For larger inputs, we really want to print the line number. To calculate that, we need to get hold of the target string, which is available as method target:

method expect($what) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: Expected $what at line @lines.elems(), after '@lines[*-1]'";
}

This brings us from the "meh" realm of error messages to quite good.

IniFile.parse(q:to/EOI/);
key=value
[section]
key_without_value
more=key
EOI

now dies with

Cannot parse input as INI file: Expected = at line 3, after 'key_without_value'

You can refine method expect more, for example by providing context both before and after the position of the parse failure.
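One possible refinement is sketched below as a standalone helper, so that it can be tried without a grammar. The name describe-position and the wording of the returned text are hypothetical; inside the grammar you would pass it self.target and self.pos.

```perl6
# Hypothetical helper: describe a character position in $target by its
# line number and by the text around it on that line.
sub describe-position(Str $target, Int $pos --> Str) {
    # Everything up to the position, split into lines. The last piece is
    # the part of the current line that has already been consumed.
    my @before = $target.substr(0, $pos).split(/\n/);
    # The unconsumed remainder of the current line.
    my $rest   = $target.substr($pos).split(/\n/).head // '';

    my $line   = @before.elems;      # number of newlines seen so far, plus one
    my $seen   = @before.tail // '';
    "line $line, after '{$seen}', before '{$rest}'";
}

my $input = "key=value\n[section]\nkey without equals\n";
my $pos   = $input.index(' without');

say describe-position($input, $pos);
# line 3, after 'key', before ' without equals'
```

Counting pieces after splitting on newlines also gives the right line number when the position sits at the very start of a line, a case that counting .lines alone would report as the previous line.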

And of course you have to apply the [ thing || <expect('thing')> ] pattern in more places inside the regex to get better error messages.

Finally, you can provide different kinds of error messages too. For example, when parsing a section header, once the initial [ is parsed, you likely don't want an error message "expected rest of section header", but rather "malformed section header, at line ...":

rule pair {
    <key>
    [ '=' || <expect('=')> ]
    [ <value> || <expect('value')>]
    \n+
}
token header {
    '['
    [ ( <-[ \[ \] \n ]>+ )  ']'
        || <error("malformed section header")> ]
}
...

method expect($what) {
    self.error("expected $what");
}

method error($msg) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: $msg at line @lines.elems(), after '@lines[*-1]'";
}

Since Rakudo Perl 6 uses grammars to parse Perl 6 input, you can use Rakudo's own grammar as a source of inspiration for more ways to make error reporting even better.

Summary

To generate good error messages from a parser, you need to distinguish between expected and unexpected match failures. The sequential alternative || is a tool you can use to turn unexpected match failures into error messages by raising an exception from the second branch of the alternative.
