
Sun, 05 Feb 2017

Perl 6 By Example: Improved INI Parsing with Grammars

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article. It will be low volume (less than an email per month, on average).

Last week we saw a collection of regexes that can parse a configuration file in the INI format that's popular in the world of Microsoft Windows applications.

Here we'll explore grammars, a feature that groups regexes into a class-like structure, and how to extract structured data from a successful match.

Grammars

A grammar is a class with some extra features that make it suitable for parsing text. Along with methods and attributes you can put regexes into a grammar.

This is what the INI file parser looks like when formulated as a grammar:

grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

You can use it to parse some text by calling the parse method, which uses regex or token TOP as the entry point:

my $result = IniFile.parse($text);
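As a minimal sketch of what that call gives you, the snippet below repeats the grammar and parses two made-up inputs (the strings and the variable names $good and $bad are assumptions chosen for illustration). A successful parse returns a Match object, which is true in boolean context. Since parse only succeeds if TOP matches the whole input, the second call returns a false value:

```raku
grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

# Illustrative input: one sectionless pair, then one section
my $good = IniFile.parse("key1=value1\n[section1]\nkey2=value2\n");
say so $good;   # True

# The closing bracket is missing, so TOP cannot consume the whole string
my $bad = IniFile.parse("[unterminated\n");
say so $bad;    # False
```

Checking the truth value of the result is the same test that the parse-ini examples further down use before extracting any data.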

Besides the standardized entry point, a grammar offers more advantages. You can inherit from it like from a normal class, thus bringing even more reusability to regexes. You can group extra functionality together with the regexes by adding methods to the grammar. And then there are some mechanisms in grammars that can make your life as a developer easier.

One of them is dealing with whitespace. In INI files, horizontal whitespace is generally considered to be insignificant, in that key=value and key = value lead to the same configuration of the application. So far we've dealt with that explicitly by adding \h* to token pair. But there are places we haven't actually considered. For example it's OK to have a comment that's not at the start of a line.

The mechanism that grammars offer is that you can define a token called ws, and when you declare a regex with rule instead of token (or enable this feature in a regex through the :sigspace modifier), Perl 6 inserts implicit <ws> calls for you where there is whitespace in the regex definition:

grammar IniFile {
    token ws { \h* }
    rule pair { <key>  '='  <value> \n+ }
    # rest as before
}

This might not be worth the effort for a single rule that needs to parse whitespace, but when there are more, this really pays off by keeping whitespace parsing in a single place.

Note that you should only parse insignificant whitespace in token ws. For example for INI files, newlines are significant, so ws shouldn't match them.
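To try the ws mechanism without editing the original grammar, here is a hedged sketch that combines it with the inheritance mentioned earlier. The subclass name IniFile::Sigspace and the sample inputs are assumptions made for this illustration. The subclass overrides only ws and pair and inherits everything else:

```raku
grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

# Hypothetical subclass: only ws and pair change, the rest is inherited
grammar IniFile::Sigspace is IniFile {
    token ws   { \h* }                      # horizontal whitespace only
    rule  pair { <key> '=' <value> \n+ }    # implicit <ws> calls at the spaces
}

# The same pair written with different amounts of horizontal whitespace
for "key=value\n", "key = value\n", "key \t=  value\n" -> $text {
    my $m = IniFile::Sigspace.parse($text);
    say $m<block><pair>[0]<value>.Str;      # value
}
```

All three inputs produce the same value, because the whitespace around the equals sign is consumed by the implicit ws calls rather than by explicit \h* in the rule.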

Extracting Data from the Match

So far the IniFile grammar only checks whether a given input matches the grammar or not. But when it does match, we really want the result of the parse in a data structure that's easy to use. For example we could translate this example INI file:

key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff

Into this data structure of nested hashes:

{
    _ => {
        key1 => "value2"
    },
    section1 => {
        key2 => "value2",
        key3 => "with spaces"
    },
    section2 => {
        more => "stuff"
    }
}

Key-value pairs from outside of any section show up in the _ top-level key.

The result from the IniFile.parse call is a Match object that has (nearly) all the information necessary to extract the desired data. If you turn a Match object into a string, it becomes the matched string. But there's more. You can use it like a hash to extract the matches from named submatches. For example if the top-level match from

token TOP { <block> <section>* }

produces a Match object $m, then $m<block> is again a Match object, this one from the match of the call of token block. And $m<section> is a list of Match objects from the repeated calls to token section. So a Match is really a tree of matches.
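The tree structure can be explored directly. The following sketch repeats the original grammar and walks one level of the tree. The input string is made up for illustration:

```raku
grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

# Illustrative input: one sectionless pair and two sections
my $m = IniFile.parse("a=1\n[s1]\nb=2\n[s2]\nc=3\n");

# A named submatch is again a Match; quantified submatches form a list
say $m<block><pair>[0]<key>.Str;    # a
say $m<section>.elems;              # 2

for $m<section>.list -> $section {
    # .trim drops the newline that token header also matches
    say $section<header>.Str.trim, ' has ',
        $section<block><pair>.elems, ' pair(s)';
}
```

Because pair is called inside a quantified group in token block, $m<block><pair> is a list, just like $m<section>.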

We can walk this data structure to extract the nested hashes. Token header matches a string like "[section1]\n", and we're only interested in "section1". To get to the inner part, we can modify token header by inserting a pair of round parentheses around the subregex whose match we're interested in:

token header { '[' ( <-[ \[ \] \n ]>+ ) ']' \n+ }
#                  ^^^^^^^^^^^^^^^^^^^^  a capturing group

That's a capturing group, and we can get its match by using the top-level match for header as an array, and accessing its first element. This leads us to the full INI parser:

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input);
    unless $m {
        die "The input is not a valid INI file.";
    }

    sub hash-from-block(Match $m) {
        my %result;
        for $m<block><pair> -> $pair {
            %result{ $pair<key>.Str } = $pair<value>.Str;
        }
        return %result;
    }

    my %result;
    %result<_> = hash-from-block($m);
    for $m<section> -> $section {
        %result{ $section<header>[0].Str } = hash-from-block($section);
    }
    return %result;
}

This top-down approach works, but it requires a very intimate understanding of the grammar's structure. That means that if you change the structure during maintenance, you'll have a hard time figuring out how to change the data extraction code.

So Perl 6 offers a bottom-up approach as well. It allows you to write a data extraction or action method for each regex, token or rule. The grammar engine passes in the match object as the single argument, and the action method can call the routine make to attach a result to the match object. The result is available through the .made method on the match object.

This execution of action methods happens as soon as a regex matches successfully, which means that an action method for a regex can rely on the fact that the action methods for subregex calls have already run. For example when the rule pair { <key> '=' <value> \n+ } is being executed, first token key matches successfully, and its action method runs immediately afterwards. Then token value matches, and its action method runs too. Then finally rule pair itself can match successfully, so its action method can rely on $m<key>.made and $m<value>.made being available, assuming that the match result is stored in variable $m.

Speaking of variables, a regex match implicitly stores its result in the special variable $/, and it is customary to use $/ as the parameter in action methods. And there is a shortcut for accessing named submatches: instead of writing $/<key>, you can write $<key>. With this convention in mind, the action class becomes:

class IniFile::Actions {
    method key($/)     { make $/.Str }
    method value($/)   { make $/.Str }
    method header($/)  { make $/[0].Str }
    method pair($/)    { make $<key>.made => $<value>.made }
    method block($/)   { make $<pair>.map({ .made }).hash }
    method section($/) { make $<header>.made => $<block>.made }
    method TOP($/)     {
        make {
            _ => $<block>.made,
            $<section>.map: { .made },
        }
    }
}

The first two action methods are really simple. The result of a key or value match is simply the string that matched. For a header, it's just the substring inside the brackets. Fittingly, a pair returns a Pair object, composed from key and value. Method block constructs a hash from all the lines in the block by iterating over each pair submatch, extracting the already attached Pair object. One level above that in the match tree, section takes that hash and pairs it with the name of the section, extracted from $<header>.made. Finally the top-level action method gathers the sectionless key-value pairs under the key _ as well as all the sections, and returns them in a hash.

In each method of the action class, we only rely on the knowledge of the first level of regexes called directly from the regex that corresponds to the action method, and the data types of the values they make. Thus when you refactor one regex, you have to change only the corresponding action method. Nobody needs to be aware of the global structure of the grammar.

Now we just have to tell Perl 6 to actually use the action class:

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input, :actions(IniFile::Actions));
    unless $m {
        die "The input is not a valid INI file.";
    }

    return $m.made
}
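The order in which action methods run, as described earlier, can be observed with a small tracing action class. This is only a sketch: the grammar KeyValue, the class TraceActions and its calls attribute are hypothetical names invented for this demonstration, and the grammar is deliberately smaller than IniFile. It also shows that :actions accepts an object instance, not just a class:

```raku
# Deliberately tiny grammar, not the IniFile grammar from this article
grammar KeyValue {
    token key   { \w+ }
    token value { \w+ }
    token pair  { <key> '=' <value> }
    token TOP   { <pair> }
}

# Hypothetical action class that records the order of the calls
class TraceActions {
    has @.calls;
    method key($/)   { @!calls.push('key');   make $/.Str }
    method value($/) { @!calls.push('value'); make $/.Str }
    method pair($/)  { @!calls.push('pair');  make $<key>.made => $<value>.made }
}

my $actions = TraceActions.new;
my $m = KeyValue.parse('k=v', :$actions);

say $actions.calls;     # [key value pair]
say $m<pair>.made;      # k => v
```

The action methods for key and value have both run by the time the one for pair is called, which is why it can safely use their .made values.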

If you want to start parsing with a different rule than TOP (which you might want to do in a test, for example), you can pass a named argument rule to method parse:

sub parse-ini(Str $input, :$rule = 'TOP') {
    my $m = IniFile.parse($input,
        :actions(IniFile::Actions),
        :$rule,
    );
    unless $m {
        die "The input is not a valid INI file.";
    }

    return $m.made
}

say parse-ini($ini).perl;

use Test;

is-deeply parse-ini("k = v\n", :rule<pair>), 'k' => 'v',
    'can parse a simple pair';
done-testing;

To better encapsulate all the parsing functionality within the grammar, we can turn parse-ini into a method:

grammar IniFile {
    # regexes/tokens unchanged as before

    method parse-ini(Str $input, :$rule = 'TOP') {
        my $m = self.parse($input,
            :actions(IniFile::Actions),
            :$rule,
        );
        unless $m {
            die "The input is not a valid INI file.";
        }

        return $m.made
    }
}

# Usage:

my $result = IniFile.parse-ini($text);

To make this work, the class IniFile::Actions either has to be declared before the grammar, or it needs to be pre-declared with class IniFile::Actions { ... } at the top of the file (with literal three dots to mark it as a forward declaration).

Summary

Match objects are really a tree of matches, with nodes for each named submatch and for each capturing group. Action methods make it easy to decouple parsing from data extraction.

Next we'll explore how to generate better error messages from a failed parse.
