Posts in this category

Sun, 29 Jan 2017

Perl 6 By Example: Parsing INI files

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).

You've probably seen .ini files before; they are quite common as configuration files on the Microsoft Windows platform, but are also in many other places like ODBC configuration files, Ansible's inventory files and so on.

This is what they look like:

key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff

Perl 6 offers regexes for parsing, and grammars for structuring and reusing regexes.

Regex Basics

A regex is a piece of code that acts as a pattern for strings with a common structure. It's derived from the computer science concept of a regular expression, but adopted to allow more constructs than pure regular expressions allow, and with some added features that make them easier to use.

We'll use named regexes to match the primitives, and then use regexes that call these named regexes to build a parser for the INI files. Since INI files have no universally accepted, formal grammar, we have to make stuff up a we go.

Let's start with parsing value pairs, like key1=value1. First the key. It may contain letters, digits and the underscore _. There's a shortcut for match such characters, \w, and matching more at least one works by appending a + character:

use v6;

my regex key { \w+ }

multi sub MAIN('test') {
    use Test;
    ok 'abc'    ~~ /^ <key> $/, '<key> matches a simple identifier';
    ok '[abc]' !~~ /^ <key> $/, '<key> does not match a section header';
    done-testing;
}

my regex key { \w+ } declares a lexically (my) scoped regex called key that matches one or more word character.

There is a long tradition in programming languages to support so-called Perl Compatible Regular Expressions, short PCRE. Many programming languages support some deviations from PCRE, including Perl itself, but common syntax elements remain throughout most implementations. Perl 6 still supports some of these elements, but deviates substantially in others.

Here \w+ is the same as in PCRE, but the fact that white space around the \w+ is ignored is not. In the testing routine, the slashes in 'abc' ~~ /^ <key> $/ delimit an anonymous regex. In this regex, ^ and $ stand for the start and the end of the matched string, respectively, which is familiar from PCRE. But then <key> calls the named regex key from earlier. This again is a Perl 6 extension. In PCRE, the < in a regex matches a literal <. In Perl 6 regexes, it introduces a subrule call.

In general, all non-word characters are reserved for "special" syntax, and you have to quote or backslash them to get the literal meaning. For example \< or '<' in a regex match a less-then sign. Word characters (letters, digits and the underscore) always match literally.

Parsing the INI primitives

Coming back the INI parsing, we have to think about what characters are allowed inside a value. Listing allowed characters seems to be like a futile exercise, since we are very likely to forget some. Instead, we should think about what's not allowed in a value. Newlines certainly aren't, because they introduce the next key/value pair or a section heading. Neither are semicolons, because they introduce a comment.

We can formulate this exclusion as a negated character class: <-[ \n ; ]> matches any single character that is neither a newline nor a semicolon. Note that inside a character class, nearly all characters lose their special meaning. Only backslash, whitespace and the closing bracket stand for anything other than themselves. Inside and outside of character classes alike, \n matches a single newline character, and \s and whitespace. The upper-case inverts that, so that for example \S matches any single character that is not whitespace.

This leads us to a version version of a regex for match a value in an ini file:

my regex value { <-[ \n ; ]>+ }

There is one problem with this regex: it also matches leading and trailing whitespace, which we don't want to consider as part of the value:

my regex value { <-[ \n ; ]>+ }
if ' abc ' ~~ /<value>/ {
    say "matched '$/'";         # matched ' abc '
}

If Perl 6 regexes were limited to a regular language in the Computer Science sense, we'd have to something like this:

my regex value { 
    # match a first non-whitespace character
    <-[ \s ; ]> 
    [
        # then arbitrarily many that can contain whitespace
        <-[ \n ; ]>* 
        # ... terminated by one non-whitespace 
        <-[ \s ; ]> 
    ]?  # and make it optional, in case the value is only
        # only one non-whitespace character
}

And now you know why people respond with "And now you have two problems" when proposing to solve problems with regexes. A simpler solution is to match a value as introduced first, and then introduce a constraint that neither the first nor the last character may be whitespace:

my regex value { <!before \s> <-[ \s ; ]>+ <!after \s> }

along with accompanying tests:

is ' abc ' ~~ /<value>/, 'abc', '<value> does not match leading or trailing whitespace';
is ' a' ~~ /<value>/, 'a', '<value> watches single non-whitespace too';
ok "a\nb" !~~ /^ <value> $/, '<valuee> does not match \n';

<!before regex> is a negated look-ahead, that is, the following text must not match the regex, and the text isn't consumed while matching. Unsurprisingly, <!after regex> is the negated look-behind, which tries to match text that's already been matched, and must not succeed in doing so for the whole match to be successful.

This being Perl 6, there is of course yet another way to approach this problem. If you formulate the requirements as "a value must not contain a newline or semicolon and start with a non-whitepsace and end with a non-whitespace", it become obvious that if we just had an AND operator in regexes, this could be easy. And it is:

my regex value { <-[ \s ; ]>+ & \S.* & .*\S }

The & operator delimits two or more smaller regex expressions that must all match the same string successfully for the whole match to succeed. \S.* matches any string that starts with a non-whitesspace character (\S), followed by any character (.) any number of times *. Likewise .*\S matches any string that ends with non-whitespace character.

Who would have thought that matching something as seemingly simple as a value in a configuration file could be so involved? Luckily, matching a pair of key and value is much simpler, now that we know how to match each on their own:

my regex pair { <key> '=' <value> }

And this works great, as long as there are no blanks surrounding the equality sign. If there is, we have to match it separately:

my regex pair { <key> \h* '=' \h* <value> }

\h matches a horizontal whitespace, that is, a blank, a tabulator character, or any other fancy space-like thing that Unicode has in store for us (for example also the non-breaking space), but not a newline.

Speaking of newlines, it's a good idea to match a newline at the end of regex pair, and since we ignore empty lines, let's match more than one too:

my regex pair { <key> \h* '=' \h* <value> \n+ }

Time to write some tests as well:

ok "key=vaule\n" ~~ /<pair>/, 'simple pair';
ok "key = value\n\n" ~~ /<pair>/, 'pair with blanks';
ok "key\n= value\n" !~~ /<pair>/, 'pair with newline before assignment';

A section header is a string in brackets, so the string itself shouldn't contain brackets or a newline:

my regex header { '[' <-[ \[ \] \n ]>+ ']' \n+ }

# and in multi sub MAIN('test'):
ok "[abc]\n"    ~~ /^ <header> $/, 'simple header';
ok "[a c]\n"    ~~ /^ <header> $/, 'header with spaces';
ok "[a [b]]\n" !~~ /^ <header> $/, 'cannot nest headers';
ok "[a\nb]\n"  !~~ /^ <header> $/, 'No newlines inside headers';

The last remaining primitive is the comment:

my regex comment { ';' \N*\n+ }

\N matches any character that's not a newline, so the comment is just a semicolon, and then anything until the end of the line.

Putting Things Together

A section of an INI file is a header followed by some key-value pairs or comment lines:

my regex section {
    <header>
    [ <pair> | <comment> ]*
}

[...] groups a part of a regex, so that the quantifier * after it applies to the whole group, not just to the last term.

The whole INI file consists of potentially some initial key/value pairs or comments followed by some sections:

my regex inifile {
    [ <pair> | <comment> ]*
    <section>*
}

The avid reader has noticed that the [ <pair> | <comment> ]* part of a regex has been used twice, so it's a good idea to extract it into a standalone regex:

my regex block   { [ <pair> | <comment> ]* }
my regex section { <header> <block> }
my regex inifile { <block> <section>* }

It's time for the "ultimate" test:

my $ini = q:to/EOI/;
key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff
EOI

ok $ini ~~ /^<inifile>$/, 'Can parse a full ini file';

Backtracking

Regex matching seems magical to many programmers. You just state the pattern, and the regex engine determines for you whether a string matches the pattern or not. While implementing a regex engine is a tricky business, the basics aren't too hard to understand.

The regex engine goes through the parts of a regex from left to right, trying to match each part of the regex. It keeps track of what part of the string it matched so far in a cursor. If a part of a regex can't find a match, the regex engine tries to alter the previous match to take up fewer characters, and the retry the failed match at the new position.

For example if you execute the regex match

'abc' ~~ /.* b/

the regex engine first evaluates the .*. The . matches any character. The * quantifier is greedy, which means it tries to match as many characters as it can. It ends up matching the whole string, abc. Then the regex engine tries to match the b, which is a literal. Since the previous match gobbled up the whole string, matching c against the remaining empty string fails. So the previous regex part, .*, must give up a character. It now matches ab, and the literal matcher for the b compares b and c, and fails again. So there is a final iteration where the .* once again gives up one character it matched, and now the b literal can match the second character in the string.

This back and forth between the parts of a regex is called backtracking. It's great feature when you search for a pattern in a string. But in a parser, it is usually not desirable. If for example the regex key matched the substring "key2" in the input"key2=value2`, you don't want it to match a shorter substring just because the next part of the regex can't match.

There are three major reasons why you don't want that. The first is that it makes debugging harder. When humans think about how a text is structured, they usually commit pretty quickly to basic tokenization, such as where a word or a sentence ends. Thus backtracking can be very uninuitive. If you generate error messages based on which regexes failed to match, backtracking basically always leads to the error message being pretty useless.

The second reason is that backtracking can lead to unexpected regex matches. For example you want to match two words, optionally separated by whitespaces, and you try to translate this directly to a regex:

say "two words" ~~ /\w+\s*\w+/;     # ｢two words｣

This seems to work: the first \w+ matches the first word, the seccond oen matches the second word, all fine and good. Until you find that it actually matches a single word too:

say "two" ~~ /\w+\s*\w+/;           # ｢two｣

How did that happen? Well, the first \w+ matched the whole word, \s* successfully matched the empty string, and then the second \w+ failed, forcing the previous to parts of the regex to match differently. So in the second iteration, the first \w+only matchestw, and the second\w+matcheso`. And then you realize that if two words aren't delimtied by whitespace, how do you even tell where one word ends and the next one starts? With backtracking disabled, the regex fails to match instead of matching in an unintended way.

The third reason is performance. When you disable backtracking, the regex engine has to look at each character only once, or once for each branch it can take in the case of alternatives. With backtracking, the regex engine can be stuck in backtracking loops that take over-proportionally longer with increasing length of the input string.

To disable backtracking, you simply have to replace the word regex by token in the declaration, or by using the :ratchet modifier inside the regex.

In the INI file parser, only regex value needs backtracking (though other formualtions discussed above don't need it), all the other regexes can be switched over to tokens safely:

my token key     { \w+ }
my regex value   {  <!before \s> <-[\n;]>+ <!after \s> }
my token pair    { <key> \h* '=' \h* <value> \n+ }
my token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
my token comment { ';' \N*\n+  }
my token block { [ <pair> | <comment> ]* }
my token section { <header> <block> }
my token inifile { <block> <section>* }

Summary

Perl 6 allows regex reuse by treating them as first-class citizens, allowing them to be named and called like normal routines. Further clutter is removed by allowing whitespace inside regexes.

These features allows you to write regexes to parse proper file formats and even programming languages. So far we have only seen a binary decision about whether a string matches a regex or not. In the future, we'll explore ways to improve code reuse even more, extract structured data from the match, and give better error messages when the parse fails.

[/perl-6] Permanent link

Categories