Categories
Posts in this category
- Current State of Exceptions in Rakudo and Perl 6
- Meet DBIish, a Perl 6 Database Interface
- doc.perl6.org and p6doc
- Exceptions Grant Report for May 2012
- Exceptions Grant Report -- Final update
- Perl 6 Hackathon in Oslo: Be Prepared!
- Localization for Exception Messages
- News in the Rakudo 2012.05 release
- News in the Rakudo 2012.06 release
- Perl 6 Hackathon in Oslo: Report From The First Day
- Perl 6 Hackathon in Oslo: Report From The Second Day
- Quo Vadis Perl?
- Rakudo Hack: Dynamic Export Lists
- SQLite support for DBIish
- Stop The Rewrites!
- Upcoming Perl 6 Hackathon in Oslo, Norway
- A small regex optimization for NQP and Rakudo
- Pattern Matching and Unpacking
- Rakudo's Abstract Syntax Tree
- The REPL trick
- First day at YAPC::Europe 2013 in Kiev
- YAPC Europe 2013 Day 2
- YAPC Europe 2013 Day 3
- A new Perl 6 community server - call for funding
- New Perl 6 community server now live, accepting signups
- A new Perl 6 community server - update
- All Perl 6 modules in a box
- doc.perl6.org: some stats, future directions
- Profiling Perl 6 code on IRC
- Why is it hard to write a compiler for Perl 6?
- Writing docs helps you take the user's perspective
- Perl 6 Advent Calendar 2016 -- Call for Authors
- Perl 6 By Example: Running Rakudo
- Perl 6 By Example: Formatting a Sudoku Puzzle
- Perl 6 By Example: Testing the Say Function
- Perl 6 By Example: Testing the Timestamp Converter
- Perl 6 By Example: Datetime Conversion for the Command Line
- What is Perl 6?
- Perl 6 By Example, Another Perl 6 Book
- Perl 6 By Example: Silent Cron, a Cron Wrapper
- Perl 6 By Example: Testing Silent Cron
- Perl 6 By Example: Stateful Silent Cron
- Perl 6 By Example: Perl 6 Review
- Perl 6 By Example: Parsing INI files
- Perl 6 By Example: Improved INI Parsing with Grammars
- Perl 6 By Example: Generating Good Parse Errors from a Parser
- Perl 6 By Example: A File and Directory Usage Graph
- Perl 6 By Example: Functional Refactorings for Directory Visualization Code
- Perl 6 By Example: A Unicode Search Tool
- What's a Variable, Exactly?
- Perl 6 By Example: Plotting using Matplotlib and Inline::Python
- Perl 6 By Example: Stacked Plots with Matplotlib
- Perl 6 By Example: Idiomatic Use of Inline::Python
- Perl 6 By Example: Now "Perl 6 Fundamentals"
- Perl 6 Books Landscape in June 2017
- Living on the (b)leading edge
- The Loss of Name and Orientation
- Perl 6 Fundamentals Now Available for Purchase
- My Ten Years of Perl 6
- Perl 6 Coding Contest 2019: Seeking Task Makers
- A shiny perl6.org site
- Creating an entry point for newcomers
- An offer for software developers: free IRC logging
- Sprixel, a 6 compiler powered by JavaScript
- Announcing try.rakudo.org, an interactive Perl 6 shell in your browser
- Another perl6.org iteration
- Blackjack and Perl 6
- Why I commit Crud to the Perl 6 Test Suite
- This Week's Contribution to Perl 6 Week 5: Implement Str.trans
- This Week's Contribution to Perl 6
- This Week's Contribution to Perl 6 Week 8: Implement $*ARGFILES for Rakudo
- This Week's Contribution to Perl 6 Week 6: Improve Book markup
- This Week's Contribution to Perl 6 Week 2: Fix up a test
- This Week's Contribution to Perl 6 Week 9: Implement Hash.pick for Rakudo
- This Week's Contribution to Perl 6 Week 11: Improve an error message for Hyper Operators
- This Week's Contribution to Perl 6 - Lottery Intermission
- This Week's Contribution to Perl 6 Week 3: Write supporting code for the MAIN sub
- This Week's Contribution to Perl 6 Week 1: A website for proto
- This Week's Contribution to Perl 6 Week 4: Implement :samecase for .subst
- This Week's Contribution to Perl 6 Week 10: Implement samespace for Rakudo
- This Week's Contribution to Perl 6 Week 7: Implement try.rakudo.org
- What is the "Cool" class in Perl 6?
- Report from the Perl 6 Hackathon in Copenhagen
- Custom operators in Rakudo
- A Perl 6 Date Module
- Defined Behaviour with Undefined Values
- Dissecting the "Starry obfu"
- The case for distributed version control systems
- Perl 6: Failing Softly with Unthrown Exceptions
- Perl 6 Compiler Feature Matrix
- The first Perl 6 module on CPAN
- A Foray into Perl 5 land
- Gabor: Keep going
- First Grant Report: Structured Error Messages
- Second Grant Report: Structured Error Messages
- Third Grant Report: Structured Error Messages
- Fourth Grant Report: Structured Error Messages
- Google Summer of Code Mentor Recap
- How core is core?
- How fast is Rakudo's "nom" branch?
- Building a Huffman Tree With Rakudo
- Immutable Sigils and Context
- Is Perl 6 really Perl?
- Mini-Challenge: Write Your Prisoner's Dilemma Strategy
- List.classify
- Longest Palindrome by Regex
- Perl 6: Lost in Wonderland
- Lots of momentum in the Perl 6 community
- Monetize Perl 6?
- Musings on Rakudo's spectest chart
- My first executable from Perl 6
- My first YAPC - YAPC::EU 2010 in Pisa
- Trying to implement new operators - failed
- Programming Languages Are Not Zero Sum
- Perl 6 notes from February 2011
- Notes from the YAPC::EU 2010 Rakudo hackathon
- Let's build an object
- Perl 6 is optimized for fun
- How to get a parse tree for a Perl 6 Program
- Pascal's Triangle in Perl 6
- Perl 6 in 2009
- Perl 6 in 2010
- Perl 6 in 2011 - A Retrospection
- Perl 6 ticket life cycle
- The Perl Survey and Perl 6
- The Perl 6 Advent Calendar
- Perl 6 Questions on Perlmonks
- Physical modeling with Math::Model and Perl 6
- How to Plot a Segment of a Circle with SVG
- Results from the Prisoner's Dilemma Challenge
- Protected Attributes Make No Sense
- Publicity for Perl 6
- PVC - Perl 6 Vocabulary Coach
- Fixing Rakudo Memory Leaks
- Rakudo architectural overview
- Rakudo Rocks
- Rakudo "star" announced
- My personal "I want a PONIE" wish list for Rakudo Star
- Rakudo's rough edges
- Rats and other pets
- The Real World Strikes Back - or why you shouldn't forbid stuff just because you think it's wrong
- Releasing Rakudo made easy
- Set Phasers to Stun!
- Starry Perl 6 obfu
- Recent Perl 6 Developments August 2008
- The State of Regex Modifiers in Rakudo
- Strings and Buffers
- Subroutines vs. Methods - Differences and Commonalities
- A SVG plotting adventure
- A Syntax Highlighter for Perl 6
- Test Suite Reorganization: How to move tests
- The Happiness of Design Convergence
- Thoughts on masak's Perl 6 Coding Contest
- The Three-Fold Function of the Smart Match Operator
- Perl 6 Tidings from September and October 2008
- Perl 6 Tidings for November 2008
- Perl 6 Tidings from December 2008
- Perl 6 Tidings from January 2009
- Perl 6 Tidings from February 2009
- Perl 6 Tidings from March 2009
- Perl 6 Tidings from April 2009
- Perl 6 Tidings from May 2009
- Perl 6 Tidings from May 2009 (second iteration)
- Perl 6 Tidings from June 2009
- Perl 6 Tidings from August 2009
- Perl 6 Tidings from October 2009
- Timeline for a syntax change in Perl 6
- Visualizing match trees
- Want to write shiny SVG graphics with Perl 6? Port Scruffy!
- We write a Perl 6 book for you
- When we reach 100% we did something wrong
- Where Rakudo Lives Now
- Why Rakudo needs NQP
- Why was the Perl 6 Advent Calendar such a Success?
- What you can write in Perl 6 today
- Why you don't need the Y combinator in Perl 6
- You are good enough!
Sun, 29 Jan 2017
Perl 6 By Example: Parsing INI files
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
You've probably seen .ini
files before; they are quite common as
configuration files on the Microsoft Windows platform, but are also in many
other places like ODBC configuration files, Ansible's inventory
files and so on.
This is what they look like:
key1=value2
[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser
[section2]
more=stuff
Perl 6 offers regexes for parsing, and grammars for structuring and reusing regexes.
Regex Basics
A regex is a piece of code that acts as a pattern for strings with a common structure. It's derived from the computer science concept of a regular expression, but adopted to allow more constructs than pure regular expressions allow, and with some added features that make them easier to use.
We'll use named regexes to match the primitives, and then use regexes that call these named regexes to build a parser for the INI files. Since INI files have no universally accepted, formal grammar, we have to make stuff up a we go.
Let's start with parsing value pairs, like key1=value1
. First the key.
It may contain letters, digits and the underscore _
. There's a shortcut for
match such characters, \w
, and matching more at least one works by appending
a +
character:
use v6;
my regex key { \w+ }
multi sub MAIN('test') {
use Test;
ok 'abc' ~~ /^ <key> $/, '<key> matches a simple identifier';
ok '[abc]' !~~ /^ <key> $/, '<key> does not match a section header';
done-testing;
}
my regex key { \w+ }
declares a lexically (my) scoped regex called key
that matches one or more word character.
There is a long tradition in programming languages to support so-called Perl Compatible Regular Expressions, short PCRE. Many programming languages support some deviations from PCRE, including Perl itself, but common syntax elements remain throughout most implementations. Perl 6 still supports some of these elements, but deviates substantially in others.
Here \w+
is the same as in PCRE, but the fact that white space around the
\w+
is ignored is not. In the testing routine, the slashes in 'abc' ~~ /^
<key> $/
delimit an anonymous regex. In this regex, ^
and $
stand for the
start and the end of the matched string, respectively, which is familiar from
PCRE. But then <key>
calls the named regex key
from earlier. This again
is a Perl 6 extension. In PCRE, the <
in a regex matches a literal <
. In
Perl 6 regexes, it introduces a subrule call.
In general, all non-word characters are reserved for "special" syntax, and you
have to quote or backslash them to get the literal meaning. For example \<
or '<'
in a regex match a less-then sign. Word characters (letters, digits
and the underscore) always match literally.
Parsing the INI primitives
Coming back the INI parsing, we have to think about what characters are allowed inside a value. Listing allowed characters seems to be like a futile exercise, since we are very likely to forget some. Instead, we should think about what's not allowed in a value. Newlines certainly aren't, because they introduce the next key/value pair or a section heading. Neither are semicolons, because they introduce a comment.
We can formulate this exclusion as a negated character class: <-[ \n ; ]>
matches any single character that is neither a newline nor a semicolon. Note
that inside a character class, nearly all characters lose their special
meaning. Only backslash, whitespace and the closing bracket stand for anything
other than themselves. Inside and outside of character classes alike, \n
matches a single newline character, and \s
and whitespace. The upper-case
inverts that, so that for example \S
matches any single character that is
not whitespace.
This leads us to a version version of a regex for match a value in an ini file:
my regex value { <-[ \n ; ]>+ }
There is one problem with this regex: it also matches leading and trailing whitespace, which we don't want to consider as part of the value:
my regex value { <-[ \n ; ]>+ }
if ' abc ' ~~ /<value>/ {
say "matched '$/'"; # matched ' abc '
}
If Perl 6 regexes were limited to a regular language in the Computer Science sense, we'd have to something like this:
my regex value {
# match a first non-whitespace character
<-[ \s ; ]>
[
# then arbitrarily many that can contain whitespace
<-[ \n ; ]>*
# ... terminated by one non-whitespace
<-[ \s ; ]>
]? # and make it optional, in case the value is only
# only one non-whitespace character
}
And now you know why people respond with "And now you have two problems" when proposing to solve problems with regexes. A simpler solution is to match a value as introduced first, and then introduce a constraint that neither the first nor the last character may be whitespace:
my regex value { <!before \s> <-[ \s ; ]>+ <!after \s> }
along with accompanying tests:
is ' abc ' ~~ /<value>/, 'abc', '<value> does not match leading or trailing whitespace';
is ' a' ~~ /<value>/, 'a', '<value> watches single non-whitespace too';
ok "a\nb" !~~ /^ <value> $/, '<valuee> does not match \n';
<!before regex>
is a negated look-ahead, that is, the following text must
not match the regex, and the text isn't consumed while matching.
Unsurprisingly, <!after regex>
is the negated look-behind, which tries to
match text that's already been matched, and must not succeed in doing so for
the whole match to be successful.
This being Perl 6, there is of course yet another way to approach this problem. If you formulate the requirements as "a value must not contain a newline or semicolon and start with a non-whitepsace and end with a non-whitespace", it become obvious that if we just had an AND operator in regexes, this could be easy. And it is:
my regex value { <-[ \s ; ]>+ & \S.* & .*\S }
The &
operator delimits two or more smaller regex expressions that must all
match the same string successfully for the whole match to succeed. \S.*
matches any string that starts with a non-whitesspace character (\S
), followed
by any character (.
) any number of times *
. Likewise .*\S
matches any
string that ends with non-whitespace character.
Who would have thought that matching something as seemingly simple as a value in a configuration file could be so involved? Luckily, matching a pair of key and value is much simpler, now that we know how to match each on their own:
my regex pair { <key> '=' <value> }
And this works great, as long as there are no blanks surrounding the equality sign. If there is, we have to match it separately:
my regex pair { <key> \h* '=' \h* <value> }
\h
matches a horizontal whitespace, that is, a blank, a tabulator character,
or any other fancy space-like thing that Unicode has in store for us (for
example also the non-breaking space), but not a newline.
Speaking of newlines, it's a good idea to match a newline at the end of regex
pair
, and since we ignore empty lines, let's match more than one too:
my regex pair { <key> \h* '=' \h* <value> \n+ }
Time to write some tests as well:
ok "key=vaule\n" ~~ /<pair>/, 'simple pair';
ok "key = value\n\n" ~~ /<pair>/, 'pair with blanks';
ok "key\n= value\n" !~~ /<pair>/, 'pair with newline before assignment';
A section header is a string in brackets, so the string itself shouldn't contain brackets or a newline:
my regex header { '[' <-[ \[ \] \n ]>+ ']' \n+ }
# and in multi sub MAIN('test'):
ok "[abc]\n" ~~ /^ <header> $/, 'simple header';
ok "[a c]\n" ~~ /^ <header> $/, 'header with spaces';
ok "[a [b]]\n" !~~ /^ <header> $/, 'cannot nest headers';
ok "[a\nb]\n" !~~ /^ <header> $/, 'No newlines inside headers';
The last remaining primitive is the comment:
my regex comment { ';' \N*\n+ }
\N
matches any character that's not a newline, so the comment is just a
semicolon, and then anything until the end of the line.
Putting Things Together
A section of an INI file is a header followed by some key-value pairs or comment lines:
my regex section {
<header>
[ <pair> | <comment> ]*
}
[...]
groups a part of a regex, so that the quantifier *
after it
applies to the whole group, not just to the last term.
The whole INI file consists of potentially some initial key/value pairs or comments followed by some sections:
my regex inifile {
[ <pair> | <comment> ]*
<section>*
}
The avid reader has noticed that the [ <pair> | <comment> ]*
part of a regex
has been used twice, so it's a good idea to extract it into a standalone
regex:
my regex block { [ <pair> | <comment> ]* }
my regex section { <header> <block> }
my regex inifile { <block> <section>* }
It's time for the "ultimate" test:
my $ini = q:to/EOI/;
key1=value2
[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser
[section2]
more=stuff
EOI
ok $ini ~~ /^<inifile>$/, 'Can parse a full ini file';
Backtracking
Regex matching seems magical to many programmers. You just state the pattern, and the regex engine determines for you whether a string matches the pattern or not. While implementing a regex engine is a tricky business, the basics aren't too hard to understand.
The regex engine goes through the parts of a regex from left to right, trying to match each part of the regex. It keeps track of what part of the string it matched so far in a cursor. If a part of a regex can't find a match, the regex engine tries to alter the previous match to take up fewer characters, and the retry the failed match at the new position.
For example if you execute the regex match
'abc' ~~ /.* b/
the regex engine first evaluates the .*
. The .
matches any character.
The *
quantifier is
greedy, which means it tries to match as many characters as it can. It ends
up matching the whole string, abc
. Then the regex engine tries to match the
b
, which is a literal. Since the previous match gobbled up the whole string,
matching c
against the remaining empty string fails. So the previous regex
part, .*
, must give up a character. It now matches ab
, and the literal
matcher for the b
compares b
and c
, and fails again. So there is a final
iteration where the .*
once again gives up one character it matched, and now
the b
literal can match the second character in the string.
This back and forth between the parts of a regex is called backtracking.
It's great feature when you search for a pattern in a string. But in a parser,
it is usually not desirable. If for example the regex key
matched the
substring "key2" in the input
"key2=value2`, you don't want it to match a
shorter substring just because the next part of the regex can't match.
There are three major reasons why you don't want that. The first is that it makes debugging harder. When humans think about how a text is structured, they usually commit pretty quickly to basic tokenization, such as where a word or a sentence ends. Thus backtracking can be very uninuitive. If you generate error messages based on which regexes failed to match, backtracking basically always leads to the error message being pretty useless.
The second reason is that backtracking can lead to unexpected regex matches. For example you want to match two words, optionally separated by whitespaces, and you try to translate this directly to a regex:
say "two words" ~~ /\w+\s*\w+/; # 「two words」
This seems to work: the first \w+
matches the first word, the seccond oen
matches the second word, all fine and good. Until you find that it actually
matches a single word too:
say "two" ~~ /\w+\s*\w+/; # 「two」
How did that happen? Well, the first \w+
matched the whole word, \s*
successfully matched the empty string, and then the second \w+
failed,
forcing the previous to parts of the regex to match differently. So in the
second iteration, the first \w+only matches
tw, and the second
\w+
matches
o`. And then you realize that if two words aren't delimtied by
whitespace, how do you even tell where one word ends and the next one starts?
With backtracking disabled, the regex fails to match instead of matching in an
unintended way.
The third reason is performance. When you disable backtracking, the regex engine has to look at each character only once, or once for each branch it can take in the case of alternatives. With backtracking, the regex engine can be stuck in backtracking loops that take over-proportionally longer with increasing length of the input string.
To disable backtracking, you simply have to replace the word regex
by
token
in the declaration, or by using the :ratchet
modifier inside the
regex.
In the INI file parser, only regex value
needs backtracking (though other
formualtions discussed above don't need it), all the other regexes can be
switched over to tokens safely:
my token key { \w+ }
my regex value { <!before \s> <-[\n;]>+ <!after \s> }
my token pair { <key> \h* '=' \h* <value> \n+ }
my token header { '[' <-[ \[ \] \n ]>+ ']' \n+ }
my token comment { ';' \N*\n+ }
my token block { [ <pair> | <comment> ]* }
my token section { <header> <block> }
my token inifile { <block> <section>* }
Summary
Perl 6 allows regex reuse by treating them as first-class citizens, allowing them to be named and called like normal routines. Further clutter is removed by allowing whitespace inside regexes.
These features allows you to write regexes to parse proper file formats and even programming languages. So far we have only seen a binary decision about whether a string matches a regex or not. In the future, we'll explore ways to improve code reuse even more, extract structured data from the match, and give better error messages when the parse fails.