Categories
Posts in this category
- Current State of Exceptions in Rakudo and Perl 6
- Meet DBIish, a Perl 6 Database Interface
- doc.perl6.org and p6doc
- Exceptions Grant Report for May 2012
- Exceptions Grant Report -- Final update
- Perl 6 Hackathon in Oslo: Be Prepared!
- Localization for Exception Messages
- News in the Rakudo 2012.05 release
- News in the Rakudo 2012.06 release
- Perl 6 Hackathon in Oslo: Report From The First Day
- Perl 6 Hackathon in Oslo: Report From The Second Day
- Quo Vadis Perl?
- Rakudo Hack: Dynamic Export Lists
- SQLite support for DBIish
- Stop The Rewrites!
- Upcoming Perl 6 Hackathon in Oslo, Norway
- A small regex optimization for NQP and Rakudo
- Pattern Matching and Unpacking
- Rakudo's Abstract Syntax Tree
- The REPL trick
- First day at YAPC::Europe 2013 in Kiev
- YAPC Europe 2013 Day 2
- YAPC Europe 2013 Day 3
- A new Perl 6 community server - call for funding
- New Perl 6 community server now live, accepting signups
- A new Perl 6 community server - update
- All Perl 6 modules in a box
- doc.perl6.org: some stats, future directions
- Profiling Perl 6 code on IRC
- Why is it hard to write a compiler for Perl 6?
- Writing docs helps you take the user's perspective
- Perl 6 Advent Calendar 2016 -- Call for Authors
- Perl 6 By Example: Running Rakudo
- Perl 6 By Example: Formatting a Sudoku Puzzle
- Perl 6 By Example: Testing the Say Function
- Perl 6 By Example: Testing the Timestamp Converter
- Perl 6 By Example: Datetime Conversion for the Command Line
- What is Perl 6?
- Perl 6 By Example, Another Perl 6 Book
- Perl 6 By Example: Silent Cron, a Cron Wrapper
- Perl 6 By Example: Testing Silent Cron
- Perl 6 By Example: Stateful Silent Cron
- Perl 6 By Example: Perl 6 Review
- Perl 6 By Example: Parsing INI files
- Perl 6 By Example: Improved INI Parsing with Grammars
- Perl 6 By Example: Generating Good Parse Errors from a Parser
- Perl 6 By Example: A File and Directory Usage Graph
- Perl 6 By Example: Functional Refactorings for Directory Visualization Code
- Perl 6 By Example: A Unicode Search Tool
- What's a Variable, Exactly?
- Perl 6 By Example: Plotting using Matplotlib and Inline::Python
- Perl 6 By Example: Stacked Plots with Matplotlib
- Perl 6 By Example: Idiomatic Use of Inline::Python
- Perl 6 By Example: Now "Perl 6 Fundamentals"
- Perl 6 Books Landscape in June 2017
- Living on the (b)leading edge
- The Loss of Name and Orientation
- Perl 6 Fundamentals Now Available for Purchase
- My Ten Years of Perl 6
- Perl 6 Coding Contest 2019: Seeking Task Makers
- A shiny perl6.org site
- Creating an entry point for newcomers
- An offer for software developers: free IRC logging
- Sprixel, a 6 compiler powered by JavaScript
- Announcing try.rakudo.org, an interactive Perl 6 shell in your browser
- Another perl6.org iteration
- Blackjack and Perl 6
- Why I commit Crud to the Perl 6 Test Suite
- This Week's Contribution to Perl 6 Week 5: Implement Str.trans
- This Week's Contribution to Perl 6
- This Week's Contribution to Perl 6 Week 8: Implement $*ARGFILES for Rakudo
- This Week's Contribution to Perl 6 Week 6: Improve Book markup
- This Week's Contribution to Perl 6 Week 2: Fix up a test
- This Week's Contribution to Perl 6 Week 9: Implement Hash.pick for Rakudo
- This Week's Contribution to Perl 6 Week 11: Improve an error message for Hyper Operators
- This Week's Contribution to Perl 6 - Lottery Intermission
- This Week's Contribution to Perl 6 Week 3: Write supporting code for the MAIN sub
- This Week's Contribution to Perl 6 Week 1: A website for proto
- This Week's Contribution to Perl 6 Week 4: Implement :samecase for .subst
- This Week's Contribution to Perl 6 Week 10: Implement samespace for Rakudo
- This Week's Contribution to Perl 6 Week 7: Implement try.rakudo.org
- What is the "Cool" class in Perl 6?
- Report from the Perl 6 Hackathon in Copenhagen
- Custom operators in Rakudo
- A Perl 6 Date Module
- Defined Behaviour with Undefined Values
- Dissecting the "Starry obfu"
- The case for distributed version control systems
- Perl 6: Failing Softly with Unthrown Exceptions
- Perl 6 Compiler Feature Matrix
- The first Perl 6 module on CPAN
- A Foray into Perl 5 land
- Gabor: Keep going
- First Grant Report: Structured Error Messages
- Second Grant Report: Structured Error Messages
- Third Grant Report: Structured Error Messages
- Fourth Grant Report: Structured Error Messages
- Google Summer of Code Mentor Recap
- How core is core?
- How fast is Rakudo's "nom" branch?
- Building a Huffman Tree With Rakudo
- Immutable Sigils and Context
- Is Perl 6 really Perl?
- Mini-Challenge: Write Your Prisoner's Dilemma Strategy
- List.classify
- Longest Palindrome by Regex
- Perl 6: Lost in Wonderland
- Lots of momentum in the Perl 6 community
- Monetize Perl 6?
- Musings on Rakudo's spectest chart
- My first executable from Perl 6
- My first YAPC - YAPC::EU 2010 in Pisa
- Trying to implement new operators - failed
- Programming Languages Are Not Zero Sum
- Perl 6 notes from February 2011
- Notes from the YAPC::EU 2010 Rakudo hackathon
- Let's build an object
- Perl 6 is optimized for fun
- How to get a parse tree for a Perl 6 Program
- Pascal's Triangle in Perl 6
- Perl 6 in 2009
- Perl 6 in 2010
- Perl 6 in 2011 - A Retrospection
- Perl 6 ticket life cycle
- The Perl Survey and Perl 6
- The Perl 6 Advent Calendar
- Perl 6 Questions on Perlmonks
- Physical modeling with Math::Model and Perl 6
- How to Plot a Segment of a Circle with SVG
- Results from the Prisoner's Dilemma Challenge
- Protected Attributes Make No Sense
- Publicity for Perl 6
- PVC - Perl 6 Vocabulary Coach
- Fixing Rakudo Memory Leaks
- Rakudo architectural overview
- Rakudo Rocks
- Rakudo "star" announced
- My personal "I want a PONIE" wish list for Rakudo Star
- Rakudo's rough edges
- Rats and other pets
- The Real World Strikes Back - or why you shouldn't forbid stuff just because you think it's wrong
- Releasing Rakudo made easy
- Set Phasers to Stun!
- Starry Perl 6 obfu
- Recent Perl 6 Developments August 2008
- The State of Regex Modifiers in Rakudo
- Strings and Buffers
- Subroutines vs. Methods - Differences and Commonalities
- A SVG plotting adventure
- A Syntax Highlighter for Perl 6
- Test Suite Reorganization: How to move tests
- The Happiness of Design Convergence
- Thoughts on masak's Perl 6 Coding Contest
- The Three-Fold Function of the Smart Match Operator
- Perl 6 Tidings from September and October 2008
- Perl 6 Tidings for November 2008
- Perl 6 Tidings from December 2008
- Perl 6 Tidings from January 2009
- Perl 6 Tidings from February 2009
- Perl 6 Tidings from March 2009
- Perl 6 Tidings from April 2009
- Perl 6 Tidings from May 2009
- Perl 6 Tidings from May 2009 (second iteration)
- Perl 6 Tidings from June 2009
- Perl 6 Tidings from August 2009
- Perl 6 Tidings from October 2009
- Timeline for a syntax change in Perl 6
- Visualizing match trees
- Want to write shiny SVG graphics with Perl 6? Port Scruffy!
- We write a Perl 6 book for you
- When we reach 100% we did something wrong
- Where Rakudo Lives Now
- Why Rakudo needs NQP
- Why was the Perl 6 Advent Calendar such a Success?
- What you can write in Perl 6 today
- Why you don't need the Y combinator in Perl 6
- You are good enough!
Wed, 09 Aug 2017
My Ten Years of Perl 6
Time for some old man's reminiscence. Or so it feels when I realize that I've spent more than 10 years involved with the Perl 6 community.
How I Joined the Perl 6 Community
It was February 2007.
I was bored. I had lots of free time (crazy to imagine that now...), and I spent some of that answering (Perl 5) questions on perlmonks. There was a category of questions where I routinely had no good answers, and those were related to threads. So I decided to play with threads, and got frustrated pretty quickly.
And then I remembered that a friend in school had told me (about four years earlier) that there was this Perl 6 project that wanted to do concurrency really well, and even automatically parallelize some stuff. That was some time ago, so maybe they had gotten somewhere by now?
So I searched the Internet, and found out about Pugs, a Perl 6 compiler written in Haskell. And I wanted to learn more, but some of the links to the presentations were dead. I joined the #perl6 IRC channel to report the broken link.
And within three minutes I got a "thank you" for the report, the broken links were gone, and I had an invitation for a commit bit to the underlying SVN repo.
I stayed.
The Early Days
Those were the wild young days of Perl 6 and Pugs. Audrey Tang was pushing Pugs (and Haskell) very hard, and often implemented a feature within 20 minutes after somebody mentioned it. Things were unstable, broken often, and usually fixed quickly. No idea was too crazy to be considered or even implemented.
We had bots that evaluated Perl 6 and Haskell code, and gave the result directly on IRC. There were lots of cool (and sometimes somewhat frightening) automations, for example for inviting others to the SVN repo, to the shared hosting system (called feather), for searching SVN logs and so on. Since git was still obscure and rather unusable, people tried to use SVK, an attempt to implement a decentralized version control system on top of the SVN protocol.
Despite some half-hearted attempts, I didn't really make inroads into compiler development. Having worked with neither Haskell nor compilers before made for a pretty steep learning curve. Instead I focused on some early modules, documentation, tests, and asking and answering questions. When the IRC logger went offline for a while, I wrote my own, which is still in use today.
I felt at home in that IRC channel and the community. When the community asked for mentors for the Google Summer of Code project, I stepped up. The project was a revamp of the Perl 6 test suite, and to prepare for the mentoring task, I decided to dive deeper. That made me the maintainer of the test suite.
Pet Projects
I can't recount a full history of Perl 6 projects during that time range, but I want to reflect on some projects that I considered my pet projects, at least for some time.
- In early 2008 I started to contribute some patches to perl6.org. Over time I became frustrated with the outdated content on many Perl 6 websites, so I created perl6-projects.org, which only contained links to up-to-date resources. Over time, people with an actual sense of design contributed a nice layout. Later we switched perl6.org over to the content from perl6-projects.org. Its structure today is still mostly that of a link collection.
- In September 2008 I started with the Perl 5 to 6 blog series, which was the de facto resource for learning Perl 6 for some years.
- In 2009 I wrote JSON::Tiny, the first pure-Perl 6 JSON parser. I mostly did it to become familiar with grammars, and because the book "Real World Haskell" also developed a JSON module.
- May 2010 kicked off a program to make contributing to Perl 6 more accessible.
- September 2010: the Pugs SVN repository became unmaintainable. It had accumulated the official Perl 6 test suite, some documentation, modules, at least one IRC bot and a proverbial metric ton of other stuff. In coordination with #perl6, I split it up into a collection of git repositories and put them on GitHub. A decision we haven't regretted yet.
- In 2012 and 2016 I was part of the organization team of the German Perl Workshop in Erlangen and Nürnberg, which both featured several Perl 6 talks.
- 2012 saw the birth of the documentation project. Initially I was nearly the sole contributor, and I'm very happy that, over time, others have taken over.
- Between 2014 and 2015, our community server "feather" reached its end of life, so the Perl 6 community funded a new server, which my employer, noris network AG, kindly hosts for free in their data center in Nürnberg, Germany.
- In 2016 I started writing Perl 6 by Example, which Apress later adopted and published as Perl 6 Fundamentals.
- Currently I'm working on a treatise on Perl 6 Regexes and Grammars, which Apress will also publish.
It is not quite clear from this (very selective) timeline, but my Perl 6 related activity dropped around 2009 or 2010. This is when I started to work full time, moved in with my girlfriend (now wife), and started to plan a family.
Relationships
The technologies and ideas in Perl 6 are fascinating, but that's not what kept me. I came for the technology, but stayed for the community.
There were and are many great people in the Perl 6 community, some of whom I am happy to call my friends. Whenever I get the chance to attend a Perl conference, workshop or hackathon, I find a group of Perl 6 hackers to hang out and discuss with, and generally have a good time.
Four events stand out in my memory. In 2010 I was invited to the Open Source Days in Copenhagen. I missed most of the conference, but spent a day or two with (if memory serves right) Carl Mäsak, Patrick Michaud, Jonathan Worthington and Arne Skjærholt. We spent some fun time trying to wrap our minds around macros, the intricacies of human and computer language, and Japanese food. (Ok, the last one was easy). Later the same year, I attended my first YAPC::EU in Pisa, and met most of the same crowd again -- this time joined by Larry Wall, and over three or four days. I still fondly remember the Perl 6 hallway track from that conference. And in 2012 I flew to Oslo for a Perl 6 hackathon, with a close-knit, fabulous group of Perl 6 hackers. Finally, the Perl Reunification Summit in the beautiful town of Perl in Germany brought together Perl 5 and Perl 6 hackers in a very relaxed atmosphere.
For three of these four events, different private sponsors from the Perl and Perl 6 community covered travel and/or hotel costs, with their only motivation being meeting folks they liked, and seeing the community and technology flourish.
The Now
The Perl 6 community has evolved a lot over the last ten years, but it is still a very friendly and welcoming place. There are lots of "new" folks (where "new" is everybody who joined after me, of course :D), and a surprising number of the old guard still hang around, some more involved, some less, all of them still very friendly and supportive.
The Future
I anticipate that my family and other projects will continue to occupy much of my time, and it is unlikely that I'll be writing another Perl 6 book (after the one about regexes) any time soon. But the Perl 6 community has become a second home for me, and I don't want to miss it.
In the future, I see myself supporting the Perl 6 community through infrastructure (community servers, IRC logs, running IRC bots etc.), answering questions, writing a blog article here and there, but mostly empowering the "new" guard to do whatever they deem best.
Sat, 22 Jul 2017
Perl 6 Fundamentals Now Available for Purchase
After about nine months of work, my book Perl 6 Fundamentals is now available for purchase on apress.com and springer.com.
The ebook can be purchased right now, and comes in the epub and PDF formats (with watermarks, but DRM free). The print form can be pre-ordered from Amazon, and will become ready for shipping in about a week or two.
I will make a copy of the ebook available for free for everybody who purchased an earlier version, "Perl 6 by Example", from LeanPub.
The book is aimed at people familiar with the basics of programming; prior Perl 5 or Perl 6 knowledge is not required. It features a practical example in most chapters (no mammal hierarchies or class Rectangle inheriting from class Shape), ranging from simple input/output and text formatting to plotting with Python's matplotlib libraries. Other examples include date and time conversion, a Unicode search tool and a directory size visualization.
I use these examples to explain a subset of Perl 6, with many pointers to more documentation where relevant. Perl 6 topics include the basic lexicographic structure, testing, input and output, multi dispatch, object orientation, regexes and grammars, usage of modules, functional programming and interaction with Python libraries through Inline::Python.
Let me finish with Larry Wall's description of this book, quoted from his foreword:
It's not just a reference, since you can always find such materials online. Nor is it just a cookbook. I like to think of it as an extended invitation, from a well-liked and well-informed member of our circle, to people like you who might want to join in on the fun. Because joy is what's fundamental to Perl. The essence of Perl is an invitation to love, and to be loved by, the Perl community. It's an invitation to be a participant of the gift economy, on both the receiving and the giving end.
Tue, 11 Jul 2017
The Loss of Name and Orientation
The Perl 6 naming debate has started again. And I guess with good reason. Teaching people that Perl 6 is a Perl, but not the Perl, requires too much effort. Two years ago, I didn't believe that. Now you're reading a tired man's words.
I'm glad that this time, we're not discussing giving up the "Perl" brand, which still has very positive connotations in my mind, and in many other minds as well.
And yet, I can't bring myself to like "Rakudo Perl 6" as a name. There are two very shallow reasons for that: Going from two syllables, "Perl six", to five of them seems a step in the wrong direction. And two, I remember the days when the name was pretty young, and people would misspell it all the time. That seems to have abated, though I don't know why.
But there's also a deeper reason, probably a sentimental old man's reason. I remember the days when Pugs was actively developed, and formed the center of a vibrant community. When kp6 and SMOP and all those weird projects were around. And then, just when it looked like only a single compiler was left, Stefan O'Rear conjured up niecza, almost single-handedly, and out of thin air. Within months, it was a viable Perl 6 compiler that people on #perl6 readily recommended.
All of this was born out of the vision that Perl 6 was a language with no single, preferred compiler. Changing the language name to include the compiler name means abandoning this vision. How can we claim to welcome alternative implementations when the commitment to one compiler is right in the language name?
However I can't weigh this loss of vision against a potential gain in popularity. I can't decide if it's my long-term commitment to the name "Perl 6" that makes me resent the new name, or valid objections. The lack of vision mirrors my own state of mind pretty well.
I don't know where this leaves us. I guess I must apologize for wasting your time by publishing this incoherent mess.
Sun, 25 Jun 2017
Living on the (b)leading edge
Perl 6 is innovative in many ways, and sometimes we don't fully appreciate all the implications, for good or for bad.
There's one I stumbled upon recently: The use of fancy Unicode symbols for built-in stuff. In this case: the `.gist` output of Match objects. For example
my token word { \w+ }
say 'abc=def' ~~ /<word> '=' <word>/;
produces this output:
「abc=def」
 word => 「abc」
 word => 「def」
And that's where the problems start. In my current quest to write a book on Perl 6 regexes, I noticed that the PDF that LeanPub generates from my Markdown sources doesn't correctly display those pesky 「」 characters, which are
$ uni -c 「」
「 - U+0FF62 - HALFWIDTH LEFT CORNER BRACKET
」 - U+0FF63 - HALFWIDTH RIGHT CORNER BRACKET
When I copied the text from the PDF and pasted into my editor, they showed up correctly, which indicates that the characters are likely missing from the monospace font.
The toolchain allows control over the font used for displaying code, so I tried all the monospace fonts that were available. I tried them in alphabetical order. Among the earlier fonts I tried was Deja Vu Sans Mono, which I use in my terminal, and which hasn't let me down yet. No dice. I arrived at Noto, a font designed to cover all Unicode codepoints. And it didn't work either. So it turns out these two characters are part of some Noto Sans variants, but not of the monospace font.
My terminal, and even some font viewers, use some kind of fallback where they use glyphs from other fonts to render missing characters. The book generation toolchain does not.
The Google Group for Leanpub was somewhat helpful: if I could recommend an Open Source monospace font that fits my needs, they'd likely include it in their toolchain.
So I searched and searched, learning more about fonts than I wanted to know. My circle of geek friends came up with several suggestions, one of them being Iosevka, which actually contains those characters. So now I wait for others to step up, either for LeanPub to include that font, or for the Noto maintainers to create a monospace variant of those characters (and then LeanPub updating their version of the font).
And all of that because Perl 6 was being innovative, and used two otherwise little-used characters as delimiters, in an attempt to avoid collisions between delimiters and content.
(In the meantime I've replaced the two offending characters with ones that look similar. It means the example output is technically incorrect, but at least it's readable.)
Thu, 08 Jun 2017
Perl 6 Books Landscape in June 2017
There are lots of news around Perl 6 books to share these days. If you follow the community very closely, you might be aware of most of it. If not, read on :-).
Think Perl 6 is now available for purchase, and also for download as a free ebook. Heck, it's even Open Source, with the LaTeX sources on GitHub!
Perl 6 at a Glance, previously only available in print form, is now available as an ebook. Save paper and shipping costs!
My own book, Perl 6 Fundamentals, is now in the "production" phase: copyediting, indexing, layout. And just before the manuscript submission deadline, Larry Wall has contributed a foreword. How awesome is that?
I've revamped perl6book.com to provide a short overview of the current and future Perl 6 books. As a small gimmick, it contains a flow chart explaining which book to choose. And I even got input from two other Perl 6 book authors (Laurent Rosenfeld of "Think Perl 6", and Andrew Shitov of "Perl 6 at a Glance" and "Migrating to Perl 6").
From a pull request to perl6book.com, it looks like Andrew Shitov is working on two more Perl 6 books. Keep 'em coming!
Last but not least, Gabor Szabo has started a crowd funding campaign for a Perl 6 book on web app development. There are still a few days left, so you can help it succeed!
And as always, if you want to keep informed about Perl 6 books, you can sign up at perl6book.com for my Perl 6 books mailing list (low volume, typically less than one email per month).
Mon, 01 May 2017
Perl 6 By Example: Now "Perl 6 Fundamentals"
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
After some silence during the past few weeks, I can finally share some exciting news about the book project. Apress has agreed to publish the book, both as print and electronic book.
The title is Perl 6 Fundamentals, with A Primer with Examples, Projects, and Case Studies as the subtitle. The editorial process is happening right now. I've received some great feedback on my initial manuscript, so there's a lot to do for me.
Stay tuned for more updates!
Sun, 02 Apr 2017
Perl 6 By Example: Idiomatic Use of Inline::Python
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
In the two previous installments, we've seen Python libraries being used in Perl 6 code through the Inline::Python module. Here we will explore some options to make the Perl 6 code more idiomatic and closer to the documentation of the Python modules.
Types of Python APIs
Python is an object-oriented language, so many APIs involve method calls, which Inline::Python helpfully translates for us automatically.
But the objects must come from somewhere and typically this is by calling a function that returns an object, or by instantiating a class. In Python, those two are really the same under the hood, since instantiating a class is the same as calling the class as if it were a function.
An example of this (in Python) would be
from matplotlib.pyplot import subplots
result = subplots()
But the matplotlib documentation tends to use another, equivalent syntax:
import matplotlib.pyplot as plt
result = plt.subplots()
This uses the subplots symbol (class or function) as a method on the module matplotlib.pyplot, which the import statement aliases to plt. This is a more object-oriented syntax for the same API.
Mapping the Function API
The previous code examples used this Perl 6 code to call the subplots symbol:
my $py = Inline::Python.new;
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
$py.call('matplotlib.pyplot', $name, |c);
}
my ($figure, $subplots) = plot('subplots');
If we want to call subplots() instead of plot('subplots'), and bar(args) instead of plot('bar', args), we can use a function to generate wrapper functions:
my $py = Inline::Python.new;
sub gen(Str $namespace, *@names) {
$py.run("import $namespace");
return @names.map: -> $name {
sub (|args) {
$py.call($namespace, $name, |args);
}
}
}
my (&subplots, &bar, &legend, &title, &show)
= gen('matplotlib.pyplot', <subplots bar legend title show>);
my ($figure, $subplots) = subplots();
# more code here
legend($@plots, $@top-authors);
title('Contributions per day');
show();
This makes the functions' usage quite nice, but comes at the cost of duplicating their names. One can view this as a feature, because it allows the creation of different aliases, or as a source of bugs, when the order is messed up or a name is misspelled.
How could we avoid the duplication should we choose to create wrapper functions?
This is where Perl 6's flexibility and introspection abilities pay off. There are two key components that allow a nicer solution: the fact that declarations are expressions and that you can introspect variables for their names.
The first part means you can write mysub my ($a, $b), which declares the variables $a and $b, and calls a function with those variables as arguments. The second part means that $a.VAR.name returns a string '$a', the name of the variable.
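As a tiny standalone illustration of the introspection part (my own sketch, not code from the book):
my $greeting = 'hello';
say $greeting.VAR.name;    # $greeting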
Let's combine this to create a wrapper that initializes subroutines for us:
sub pysub(Str $namespace, |args) {
$py.run("import $namespace");
for args[0] <-> $sub {
my $name = $sub.VAR.name.substr(1);
$sub = sub (|args) {
$py.call($namespace, $name, |args);
}
}
}
pysub 'matplotlib.pyplot',
my (&subplots, &bar, &legend, &title, &show);
This avoids duplicating the name, but forces us to use some lower-level Perl 6 features in sub pysub. Using ordinary variables means that accessing their .VAR.name results in the name of the variable, not the name of the variable that's used on the caller side. So we can't use slurpy arguments as in
sub pysub(Str $namespace, *@subs)
Instead we must use |args to obtain the rest of the arguments in a Capture. This doesn't flatten the list of variables passed to the function, so when we iterate over them, we must do so by accessing args[0]. By default, loop variables are read-only, which we can avoid by using <-> instead of -> to introduce the signature. Fortunately, that also preserves the name of the caller-side variable.
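Here is a minimal sketch of mine (not part of the original code) showing how <-> makes loop variables writable:
my @values = 1, 2, 3;
for @values <-> $v {
    $v *= 10;              # allowed: <-> binds $v read-write
}
say @values;               # [10 20 30]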
An Object-Oriented Interface
Instead of exposing the functions, we can also create types that emulate the method calls on Python modules. For that we can implement a class with a method FALLBACK, which Perl 6 calls for us when calling a method that is not implemented in the class:
class PyPlot is Mu {
has $.py;
submethod TWEAK {
$!py.run('import matplotlib.pyplot');
}
method FALLBACK($name, |args) {
$!py.call('matplotlib.pyplot', $name, |args);
}
}
my $pyplot = PyPlot.new(:$py);
my ($figure, $subplots) = $pyplot.subplots;
# plotting code goes here
$pyplot.legend($@plots, $@top-authors);
$pyplot.title('Contributions per day');
$pyplot.show;
Class PyPlot inherits directly from Mu, the root of the Perl 6 type hierarchy, instead of Any, the default parent class (which in turn inherits from Mu). Any introduces a large number of methods that Perl 6 objects get by default, and since FALLBACK is only invoked when a method is not present, this is something to avoid.
The method TWEAK is another method that Perl 6 calls automatically for us, after the object has been fully instantiated. All-caps method names are reserved for such special purposes. It is marked as a submethod, which means it is not inherited into subclasses. Since TWEAK is called at the level of each class, if it were a regular method, a subclass would call it twice implicitly. Note that TWEAK is only supported in Rakudo version 2016.11 and later.
There's nothing specific to the Python package matplotlib.pyplot in class PyPlot, except the namespace name. We could easily generalize it to any namespace:
class PythonModule is Mu {
has $.py;
has $.namespace;
submethod TWEAK {
$!py.run("import $!namespace");
}
method FALLBACK($name, |args) {
$!py.call($!namespace, $name, |args);
}
}
my $pyplot = PythonModule.new(:$py, :namespace<matplotlib.pyplot>);
This is one Perl 6 type that can represent any Python module. If instead we want a separate Perl 6 type for each Python module, we could use roles, which are optionally parameterized:
role PythonModule[Str $namespace] is Mu {
has $.py;
submethod TWEAK {
$!py.run("import $namespace");
}
method FALLBACK($name, |args) {
$!py.call($namespace, $name, |args);
}
}
my $pyplot = PythonModule['matplotlib.pyplot'].new(:$py);
Using this approach, we can create type constraints for Python modules in Perl 6 space:
sub plot-histogram(PythonModule['matplotlib.pyplot'], @data) {
# implementation here
}
Passing in any other wrapped Python module than matplotlib.pyplot results in a type error.
Summary
Perl 6 offers enough flexibility to create function and method call APIs around Python modules. With a bit of meta programming, we can emulate the typical Python APIs close enough that translating from the Python documentation to Perl 6 code becomes easy.
Sun, 26 Mar 2017
Perl 6 By Example: Stacked Plots with Matplotlib
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
In a previous episode, we've explored plotting git statistics in Perl 6 using matplotlib.
Since I wasn't quite happy with the result, I want to explore using stacked plots for presenting the same information. In a regular plot, the y coordinate of each plotted value is proportional to its value. In a stacked plot, it is the distance to the previous value that is proportional to its value. This is nice for values that add up to a total that is also interesting.
Matplotlib offers a method called stackplot for that. Contrary to multiple plot calls on the subplots object, it requires a shared x axis for all data series. So we must construct one array for each author of git commits, where dates with no value come out as zero.
As a reminder, this is what the logic for extracting the stats looked like in the first place:
my $proc = run :out, <git log --date=short --pretty=format:%ad!%an>;
my (%total, %by-author, %dates);
for $proc.out.lines -> $line {
my ( $date, $author ) = $line.split: '!', 2;
%total{$author}++;
%by-author{$author}{$date}++;
%dates{$date}++;
}
my @top-authors = %total.sort(-*.value).head(5)>>.key;
And some infrastructure for plotting with matplotlib:
my $py = Inline::Python.new;
$py.run('import datetime');
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
$py.call('matplotlib.pyplot', $name, |c);
}
sub pydate(Str $d) {
$py.call('datetime', 'date', $d.split('-').map(*.Int));
}
my ($figure, $subplots) = plot('subplots');
$figure.autofmt_xdate();
So now we have to construct an array of arrays, where each inner array has the values for one author:
my @dates = %dates.keys.sort;
my @stack = $[] xx @top-authors;
for @dates -> $d {
for @top-authors.kv -> $idx, $author {
@stack[$idx].push: %by-author{$author}{$d} // 0;
}
}
Now plotting becomes a simple matter of a method call, followed by the usual commands adding a title and showing the plot:
$subplots.stackplot($[@dates.map(&pydate)], @stack);
plot('title', 'Contributions per day');
plot('show');
The result (again run on the zef source repository) is this:
Comparing this to the previous visualization reveals a discrepancy: there were no commits in 2014, and yet the stacked plot makes it appear as though there were. In fact, the previous plots would have shown the same "alternative facts" if we had chosen lines instead of points. This happens because matplotlib (like nearly all plotting libraries) interpolates linearly between data points. But in our case, a date with no data points means zero commits happened on that date.
To communicate this to matplotlib, we must explicitly insert zero values for missing dates. This can be achieved by replacing
my @dates = %dates.keys.sort;
with the line
my @dates = %dates.keys.minmax;
The minmax method finds the minimal and maximal values, and returns them in a Range. Assigning the range to an array turns it into an array of all values between the minimal and the maximal value. The logic for assembling the @stack variable already maps missing values to zero.
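A small, self-contained illustration of minmax and range-to-array assignment (my own example, not from the original post):
my @numbers = 3, 5, 4, 1;
say @numbers.minmax;       # 1..5
my @all = @numbers.minmax;
say @all;                  # [1 2 3 4 5] -- the Range expands on assignment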
The result looks a bit better, but still far from perfect:
Thinking more about the problem, contributions from separate days should not be joined together, because that produces misleading results. Matplotlib doesn't support adding a legend automatically to stacked plots, so this seems to be a dead end.
Since a dot plot didn't work very well, let's try a different kind of plot that represents each data point separately: a bar chart, or more specifically, a stacked bar chart. Matplotlib offers the bar plotting method, and a named parameter bottom can be used to generate the stacking:
my @dates = %dates.keys.sort;
my @stack = $[] xx @top-authors;
my @bottom = $[] xx @top-authors;
for @dates -> $d {
my $bottom = 0;
for @top-authors.kv -> $idx, $author {
@bottom[$idx].push: $bottom;
my $value = %by-author{$author}{$d} // 0;
@stack[$idx].push: $value;
$bottom += $value;
}
}
We need to supply color names ourselves, and set the edge color of the bars to the same color, otherwise the black edge color dominates the result:
my $width = 1.0;
my @colors = <red green blue yellow black>;
my @plots;
for @top-authors.kv -> $idx, $author {
@plots.push: plot(
'bar',
$[@dates.map(&pydate)],
@stack[$idx],
$width,
bottom => @bottom[$idx],
color => @colors[$idx],
edgecolor => @colors[$idx],
);
}
plot('legend', $@plots, $@top-authors);
plot('title', 'Contributions per day');
plot('show');
This produces the first plot that's actually informative and not misleading (provided you're not color blind):
If you want to improve the result further, you could experiment with limiting the number of bars by lumping together contributions by week, by month, or maybe by $n-day period.
Next, we'll investigate ways to make the matplotlib API more idiomatic to use from Perl 6 code.
Sun, 19 Mar 2017
Perl 6 By Example: Plotting using Matplotlib and Inline::Python
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
Occasionally I come across git repositories, and want to know how active they are, and who the main developers are.
Let's develop a script that plots the commit history, and explore how to use Python modules in Perl 6.
Extracting the Stats
We want to plot the number of commits by author and date.
Git makes it easy for us to get to this information by giving some options to git log:
my $proc = run :out, <git log --date=short --pretty=format:%ad!%an>;
my (%total, %by-author, %dates);
for $proc.out.lines -> $line {
my ( $date, $author ) = $line.split: '!', 2;
%total{$author}++;
%by-author{$author}{$date}++;
%dates{$date}++;
}
run executes an external command, and :out tells it to capture the command's output and make it available as $proc.out. The command is a list, with the first element being the actual executable, and the rest of the elements being command line arguments to this executable.
Here git log gets the options --date=short --pretty=format:%ad!%an, which instruct it to produce lines like 2017-03-01!John Doe. This line can be parsed with a simple call to $line.split: '!', 2, which splits on the !, and limits the result to two elements. Assigning it to a two-element list ( $date, $author ) unpacks it. We then use hashes to count commits by author (in %total), by author and date (%by-author), and finally by date. In the second case, %by-author{$author} isn't even a hash yet, and we can still hash-index it. This is due to a feature called autovivification, which automatically creates ("vivifies") objects where we need them. The use of ++ creates integers, {...} indexing creates hashes, [...] indexing and .push create arrays, and so on.
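A minimal sketch of autovivification in action (my own example with a made-up key, not from the original post):
my %by-author;
%by-author<alice><2017-03-01>++;    # the inner hash springs into existence
say %by-author<alice><2017-03-01>;  # 1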
To get from these hashes to the top contributors by commit count, we can sort %total by value. Since this sorts in ascending order, sorting by the negative value gives the list in descending order. The list contains Pair objects, and we only want the first five of these, and only their keys:
my @top-authors = %total.sort(-*.value).head(5).map(*.key);
For each author, we can extract the dates of their activity and their commit counts like this:
my @dates = %by-author{$author}.keys.sort;
my @counts = %by-author{$author}{@dates};
The last line uses slicing, that is, indexing a hash with a list of keys to return a list of elements.
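A short illustration of such a slice (my own example, not from the original text):
my %count = a => 1, b => 2, c => 3;
say %count{<a c>};    # (1 3)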
Plotting with Python
Matplotlib is a very versatile library for all sorts of plotting and visualization. It's written in Python and for Python programs, but that won't stop us from using it in a Perl 6 program.
But first, let's take a look at a basic plotting example that uses dates on the x axis:
import datetime
import matplotlib.pyplot as plt
fig, subplots = plt.subplots()
subplots.plot(
[datetime.date(2017, 1, 5), datetime.date(2017, 3, 5), datetime.date(2017, 5, 5)],
[ 42, 23, 42 ],
label='An example',
)
subplots.legend(loc='upper center', shadow=True)
fig.autofmt_xdate()
plt.show()
To make this run, you have to install Python 2.7 and matplotlib. You can do this on Debian-based Linux systems with apt-get install -y python-matplotlib. The package name is the same on RPM-based distributions such as CentOS or SUSE Linux. MacOS users are advised to install Python 2.7 through homebrew or macports, and then use pip2 install matplotlib or pip2.7 install matplotlib to get the library. Windows installation is probably easiest through the conda package manager, which offers pre-built binaries of both Python and matplotlib.
When you run this script with python2.7 dates.py, it opens a GUI window, showing the plot and some controls, which allow you to zoom, scroll, and write the plot graphic to a file:
Bridging the Gap
The Rakudo Perl 6 compiler comes with a handy library for calling foreign functions, which allows you to call functions written in C, or anything with a compatible binary interface.
The Inline::Python library uses the native call functionality to talk to python's C API, and offers interoperability between Perl 6 and Python code. At the time of writing, this interoperability is still fragile in places, but can be worth using for some of the great libraries that Python has to offer.
To install Inline::Python, you must have a C compiler available, and then run
$ zef install Inline::Python
(or the same with panda instead of zef, if that's your module installer).
Now you can start to run Python 2 code in your Perl 6 programs:
use Inline::Python;
my $py = Inline::Python.new;
$py.run: 'print("Hello, Pyerl 6")';
Besides the run method, which takes a string of Python code and executes it, you can also use call to call Python routines by specifying the namespace, the routine to call, and a list of arguments:
use Inline::Python;
my $py = Inline::Python.new;
$py.run('import datetime');
my $date = $py.call('datetime', 'date', 2017, 1, 31);
$py.call('__builtin__', 'print', $date); # 2017-01-31
The arguments that you pass to call are Perl 6 objects, like the three Int objects in this example. Inline::Python automatically translates them to the corresponding Python built-in data structures. It translates numbers, strings, arrays and hashes. Return values are also translated in the opposite direction, though since Python 2 does not distinguish properly between byte and Unicode strings, Python strings end up as buffers in Perl 6.
Objects that Inline::Python cannot translate are handled as opaque objects on the Perl 6 side. You can pass them back into Python routines (as shown with the print call above), or you can also call methods on them:
say $date.isoformat().decode; # 2017-01-31
Perl 6 exposes attributes through methods, so Perl 6 has no syntax for accessing attributes of foreign objects directly. If you try to access, for example, the year attribute of datetime.date through the normal method call syntax, you get an error.
say $date.year;
Dies with
'int' object is not callable
Instead, you have to use the getattr builtin:
say $py.call('__builtin__', 'getattr', $date, 'year');
Using the Bridge to Plot
We need access to two namespaces in Python, datetime and matplotlib.pyplot, so let's start by importing them, and write some short helpers:
my $py = Inline::Python.new;
$py.run('import datetime');
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
$py.call('matplotlib.pyplot', $name, |c);
}
sub pydate(Str $d) {
$py.call('datetime', 'date', $d.split('-').map(*.Int));
}
We can now call pydate('2017-03-01') to create a Python datetime.date object from an ISO-formatted string, and call the plot function to access functionality from matplotlib:
my ($figure, $subplots) = plot('subplots');
$figure.autofmt_xdate();
my @dates = %dates.keys.sort;
$subplots.plot:
$[@dates.map(&pydate)],
$[ %dates{@dates} ],
label => 'Total',
marker => '.',
linestyle => '';
The Perl 6 call plot('subplots') corresponds to the Python code fig, subplots = plt.subplots(). Passing arrays to Python functions needs a bit of extra work, because Inline::Python flattens arrays. Using an extra $ sigil in front of an array puts it into an extra scalar, and thus prevents the flattening.
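The effect of the extra $ sigil can also be seen in pure Perl 6; here is a small sketch of mine (not from the original post) using a flattening slurpy parameter:
sub count-args(*@args) { @args.elems }
my @a = 1, 2, 3;
say count-args(@a);     # 3 -- the array flattens into the slurpy parameter
say count-args($@a);    # 1 -- itemized, so it arrives as a single argument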
Now we can actually plot the number of commits by author, add a legend, and plot the result:
for @top-authors -> $author {
my @dates = %by-author{$author}.keys.sort;
my @counts = %by-author{$author}{@dates};
$subplots.plot:
$[ @dates.map(&pydate) ],
$@counts,
label => $author,
marker =>'.',
linestyle => '';
}
$subplots.legend(loc=>'upper center', shadow=>True);
plot('title', 'Contributions per day');
plot('show');
When run in the zef git repository, it produces this plot:
Summary
We've explored how to use the Python library matplotlib to generate a plot from git contribution statistics. Inline::Python provides convenient functionality for accessing Python libraries from Perl 6 code.
In the next installment, we'll explore ways to improve both the graphics and the glue code between Python and Perl 6.
Sun, 12 Mar 2017
What's a Variable, Exactly?
When you learn programming, you typically first learn about basic expressions, like 2 * 21, and then the next topic is control structures or variables. (If you start with functional programming, maybe it takes you a bit longer to get to variables).
So, every programmer knows what a variable is, right?
Turns out, it might not be that easy.
Some people like to say that in ruby, everything is an object. Well, a variable isn't really an object. The same holds true for other languages.
But let's start from the bottom up. In a low-level programming language like C, a local variable is a name that the compiler knows, with a type attached. When the compiler generates code for the function that the variable is in, the name resolves to an address on the stack (unless the compiler optimizes the variable away entirely, or manages it through a CPU register).
So in C, the variable only exists as such while the compiler is running. When the compiler is finished, and the resulting executable runs, there might be some stack offset or memory location that corresponds to our understanding of the variable. (And there might be debugging symbols that allow some mapping back to the variable name, but that's really a special case).
In case of recursion, a local variable can exist once for each time the function is called.
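To make this concrete, here is a small Perl 6 sketch of mine (not from the original text) where each recursive call gets its own instance of the local variable:
sub countdown(Int $n) {
    my $local = $n * 10;         # a fresh $local for every call
    countdown($n - 1) if $n > 1;
    say $local;                  # prints 10, 20, 30 -- one value per call
}
countdown(3);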
Closures
In programming languages with closures, local variables can be referenced from inner functions. They can't generally live on the stack, because the reference keeps them alive. Consider this piece of Perl 6 code (though we could write the same in Javascript, Ruby, Perl 5, Python or most other dynamic languages):
sub outer() {
my $x = 42;
return sub inner() {
say $x;
}
}
my &callback = outer();
callback();
The outer function has a local (lexical) variable $x, and the inner function uses it. So once outer has finished running, there's still an indirect reference to the value stored in this variable.
They say you can solve any problem in computer science through another layer of indirection, and that's true for the implementation of closures. The &callback variable, which points to a closure, actually stores two pointers under the hood. One goes to the static byte code representation of the code, and the second goes to a run-time data structure called a lexical pad, or lexpad for short. Each time you invoke the outer function, a new instance of the lexpad is created, and the closure points to the new instance, and always to the same static code.
But even in dynamic languages with closures, variables themselves don't need to be objects. If a language forbids the creation of variables at run time, the compiler knows what variables exist in each scope, and can for example map each of them to an array index, so the lexpad becomes a compact array, and an access to a variable becomes an indexing operation into that array. Lexpads generally live on the heap, and are garbage collected (or reference counted) just like other objects.
Lexpads are mostly a performance optimization. You could have separate runtime representations of each variable, but then you'd have to perform an allocation for each variable in each function call, which is generally much slower than a single allocation of the whole lexpad.
The Plot Thickens
To summarize, a variable has a name, a scope, and in languages that support it, a type. Those are properties known to the compiler, but not necessarily present at run time. At run time, a variable typically resolves to a stack offset in low-level languages, or to an index into a lexpad in dynamic languages.
Even in languages that boldly claim that "everything is an object", a variable often isn't. The value inside a variable may be, but the variable itself typically not.
Perl 6 Intricacies
The things I've written above generalize pretty neatly to many programming languages. I am a Perl 6 developer, so I have some insight into how Perl 6 implements variables. If you don't resist, I'll share it with you :-).
Variables in Perl 6 typically come with one more level of indirection, which we call a container. This allows two types of write operations: assignment, which stores a value inside a container (which again might be referenced by a variable), and binding, which places either a value or a container directly into the variable.
Here's an example of assignment and binding in action:
my $x;
my $y;
# assignment:
$x = 42;
$y = 'a string';
say $x; # => 42
say $y; # => a string
# binding:
$x := $y;
# now $x and $y point to the same container, so that assigning to one
# changes the other:
$y = 21;
say $x; # => 21
Why, I hear you cry?
There are three major reasons.
The first is that it makes assignment something that's not special. For example in Python, if you assign to anything other than a plain variable, the compiler translates it to some special method call (obj.attr = x to setattr(obj, 'attr', x), obj[idx] = x to a __setitem__ call, etc.). In Perl 6, if you want to implement something you can assign to, you simply return a container from that expression, and then assignment works naturally.
For example an array is basically just a list in which the elements are containers. This makes @array[$index] = $value work without any special cases, and allows you to assign to the return value of methods, functions, or anything else you can think of, as long as the expression returns a container.
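As a minimal sketch (mine, not from the original post) of assigning to the return value of a function: a routine marked is rw can return a container, and the call itself becomes assignable:
my @storage = 1, 2, 3;
sub cell(Int $i) is rw { @storage[$i] }   # returns the element's container
cell(1) = 42;
say @storage;                             # [1 42 3]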
The second reason for having both binding and assignment is that it makes it pretty easy to make things read-only. If you bind a non-container into a variable, you can't assign to it anymore:
my $a := 42;
$a = "hordor"; # => Cannot assign to an immutable value
Perl 6 uses this mechanism to make function parameters read-only by default. Likewise, returning from a function or method by default strips the container, which avoids accidental action-at-a-distance (though an is rw annotation can prevent that, if you really want it).
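A small illustration of mine (not from the original text) of read-only parameters and the is rw escape hatch:
sub halve($x)          { $x /= 2 }   # calling this dies: parameters are read-only by default
sub double($x is rw)   { $x *= 2 }
my $n = 21;
double($n);
say $n;                              # 42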
This automatic stripping of containers also makes expressions like $a + 2 work, independently of whether $a holds an integer directly, or a container that holds an integer. (In the implementation of Perl 6's core types, sometimes this has to be done manually. If you ever wondered what nqp::decont does in Rakudo's source code, that's what).
The third reason relates to types.
Perl 6 supports gradual typing, which means you can optionally annotate your variables (and other things) with types, and Perl 6 enforces them for you. It detects type errors at compile time where possible, and falls back to run-time checking types.
The type of a variable only applies to binding, but the variable passes this type on to its default container. And the container type is enforced at run time. You can observe this difference by binding a container with a different type constraint to a variable:
my Any $x;
my Int $i;
$x := $i;
$x = "foo"; # => Type check failed in assignment to $i; expected Int but got Str ("foo")
Int is a subtype of Any, which is why the binding of $i to $x succeeds. Now $x and $i share a container that is type-constrained to Int, so assigning a string to it fails.
Did you notice how the error message mentions $i as the variable name, even though we've tried to assign to $x? The variable name in the error message is really a heuristic, which works often enough, but sometimes fails. The container that's shared between $x and $i has no idea which variable you used to access it; it just knows the name of the variable that created it, here $i.
Binding checks the variable type, not the container type, so this code doesn't complain:
my Any $x;
my Int $i;
$x := $i;
$x := "a string";
This distinction between variable type and container type might seem weird for scalar variables, but it really starts to make sense for arrays, hashes and other compound data structures that might want to enforce a type constraint on their elements:
sub f($x) {
$x[0] = 7;
}
my Str @s;
f(@s);
This code declares an array whose elements must all be of type Str (or subtypes thereof). When you pass it to a function, that function has no compile-time knowledge of the type. But since $x[0] returns a container with type constraint Str, assigning an integer to it produces the error you would expect.
Summary
Variables typically only exists as objects at compile time. At run time, they are just some memory location, either on the stack or in a lexical pad.
Perl 6 makes the understanding of the exact nature of variables a bit more involved by introducing a layer of containers between variables and values. This offers great flexibility when writing libraries that behaves like built-in classes, but comes with the burden of additional complexity.
Sun, 05 Mar 2017
Perl 6 By Example: A Unicode Search Tool
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
Every so often I have to identify or research some Unicode characters. There's a tool called uni in the Perl 5 distribution App::Uni.
Let's reimplement its basic functionality in a few lines of Perl 6 code and use that as an occasion to talk about Unicode support in Perl 6.
If you give it one character on the command line, it prints out a description of the character:
$ uni 🕐
🕐 - U+1f550 - CLOCK FACE ONE OCLOCK
If you give it a longer string instead, it searches in the list of Unicode character names and prints out the same information for each character whose description matches the search string:
$ uni third|head -n5
⅓ - U+02153 - VULGAR FRACTION ONE THIRD
⅔ - U+02154 - VULGAR FRACTION TWO THIRDS
↉ - U+02189 - VULGAR FRACTION ZERO THIRDS
㆛ - U+0319b - IDEOGRAPHIC ANNOTATION THIRD MARK
𐄺 - U+1013a - AEGEAN WEIGHT THIRD SUBUNIT
Each line corresponds to what Unicode calls a "code point", which is usually a character on its own, but occasionally also something like a U+00300 - COMBINING GRAVE ACCENT, which, combined with a a - U+00061 - LATIN SMALL LETTER A, makes the character à.
Perl 6 offers a method uniname in both the classes Str and Int that produces the Unicode code point name for a given character, either in its direct character form, or in the form of the code point number. With that, we can implement the first part of uni's desired functionality:
#!/usr/bin/env perl6
use v6;
sub format-codepoint(Int $codepoint) {
sprintf "%s - U+%05x - %s\n",
$codepoint.chr,
$codepoint,
$codepoint.uniname;
}
multi sub MAIN(Str $x where .chars == 1) {
print format-codepoint($x.ord);
}
Let's look at it in action:
$ uni ø
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
The chr method turns a code point number into the character, and ord is the reverse; in other words, it goes from character to code point number.
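A quick standalone illustration of mine (the code point value is taken from the clock-face example above):
say 0x1F550.chr;             # 🕐
say '🕐'.ord.base(16);        # 1F550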
The second part, searching in all Unicode character names, works by brute-force enumerating all possible characters and searching through their uniname:
multi sub MAIN($search is copy) {
$search.=uc;
for 1..0x10FFFF -> $codepoint {
if $codepoint.uniname.contains($search) {
print format-codepoint($codepoint);
}
}
}
Since all character names are in upper case, the search term is first converted to upper case with $search.=uc, which is short for $search = $search.uc. By default, parameters are read only, which is why its declaration here uses is copy to prevent that.
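A tiny sketch of mine showing the .= mutating method call and the is copy trait together:
sub shout($word is copy) {
    $word .= uc;             # same as $word = $word.uc; needs is copy
    say $word;
}
shout('perl');               # PERL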
Instead of this rather imperative style, we can also formulate it in a more functional style. We could think of it as a list of all characters, which we whittle down to those characters that interest us, to finally format them the way we want:
multi sub MAIN($search is copy) {
$search.=uc;
print (1..0x10FFFF).grep(*.uniname.contains($search))
.map(&format-codepoint)
.join;
}
To make it easier to identify (rather than search for) a string of more than one character, an explicit option can help disambiguate:
multi sub MAIN($x, Bool :$identify!) {
print $x.ords.map(&format-codepoint).join;
}
Str.ords returns the list of code points that make up the string. With this multi candidate of sub MAIN in place, we can do something like
$ uni --identify øre
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
r - U+00072 - LATIN SMALL LETTER R
e - U+00065 - LATIN SMALL LETTER E
Code Points, Grapheme Clusters and Bytes
As alluded to above, not all code points are fully-fledged characters on their own. Or put another way, some things that we visually identify as a single character are actually made up of several code points. Unicode calls such a sequence of one base character and potentially several combining characters a grapheme cluster.
Strings in Perl 6 are based on these grapheme clusters. If you get a list of characters in a string with $str.comb, or extract a substring with $str.substr(0, 4), match a regex against a string, determine the length, or do any other operation on a string, the unit is always the grapheme cluster. This best fits our intuitive understanding of what a character is and avoids accidentally tearing apart a logical character through a substr, comb or similar operation:
my $s = "ø\c[COMBINING TILDE]";
say $s; # ø̃
say $s.chars; # 1
The Uni type is akin to a string and represents a sequence of code points. It is useful in edge cases, but doesn't support the same wealth of operations as Str. The typical way to go from Str to a Uni value is to use one of the NFC, NFD, NFKC, or NFKD methods, which yield a Uni value in the normalization form of the same name.
Below the Uni
level you can also represent strings as bytes by choosing an
encoding. If you want to get from string to the byte level, call the
encode
method:
my $bytes = 'Perl 6'.encode('UTF-8');
UTF-8 is the default encoding and also the one Perl 6 assumes when reading
source files. The result is something that does the
Blob role; you can access
individual bytes with positional indexing, such as $bytes[0]
. The
decode
method helps
you to convert a Blob
to a Str
.
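For example, a quick round trip between the two levels:
my $bytes = 'Perl 6'.encode('UTF-8');
say $bytes[0];       # 80, the byte value of 'P'
say $bytes.elems;    # 6
say $bytes.decode;   # Perl 6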
Numbers
Number literals in Perl 6 aren't limited to the Arabic digits we are so used
to in the English speaking part of the world. All Unicode code points that
have the Decimal_Number
(short Nd
) property are allowed, so you can for
example use Bengali digits:
say ৪২; # 42
The same holds true for string to number conversions:
say "৪২".Int; # 42
For other numeric code points you can use the unival method to obtain their numeric value:
say "\c[TIBETAN DIGIT HALF ZERO]".unival;
which produces the output -0.5
and also illustrates how to use a codepoint
by name inside a string literal.
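The same works for other characters with a numeric value, and the \c[...] escape again selects the character by name:
say "¾".unival;                                    # 0.75
say "\c[VULGAR FRACTION THREE QUARTERS]".unival;   # 0.75, same character selected by name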
Other Unicode Properties
The uniprop
method
in type Str
returns the general category by default:
say "ø".uniprop; # Ll
say "\c[TIBETAN DIGIT HALF ZERO]".uniprop; # No
The return value needs some Unicode knowledge in order to make sense of it,
or one could read
Unicode's Technical Report 44 for the gory details.
Ll stands for Letter_Lowercase, and No is Other_Number. This is what Unicode calls the General Category, but you can ask the uniprop method (or the uniprop-bool method if you're only interested in a boolean result) for other properties as well:
say "a".uniprop-bool('ASCII_Hex_Digit'); # True
say "ü".uniprop-bool('Numeric_Type'); # False
say ".".uniprop("Word_Break"); # MidNumLet
Collation
Sorting strings starts to become complicated when you're not limited to ASCII
characters. Perl 6's sort
method uses the cmp
infix operator, which does a
pretty standard lexicographic comparison based on the codepoint number.
If you need to use a more sophisticated collation algorithm, Rakudo 2017.02 and newer offer the Unicode Collation Algorithm as an experimental feature:
my @list = <a ö ä Ä o ø>;
say @list.sort; # (a o Ä ä ö ø)
use experimental :collation;
say @list.collate; # (a ä Ä o ö ø)
$*COLLATION.set(:tertiary(False));
say @list.collate; # (a Ä ä o ö ø)
The default sort considers any character with diacritics to be larger than ASCII characters, because that's how they appear in the code point list. collate, on the other hand, knows that characters with diacritics belong directly after their base character; that's not perfect in every language, but generally a good compromise.
For Latin-based scripts, the primary sorting criterion is alphabetic, the secondary is diacritics, and the tertiary is case. $*COLLATION.set(:tertiary(False)) thus makes .collate ignore case, so it no longer forces lower case characters to come before upper case characters.
At the time of writing, language specification of collation is not yet implemented.
Summary
Perl 6 takes languages other than English very seriously, and goes to great lengths to facilitate working with them and the characters they use.
This includes basing strings on grapheme clusters rather than code points, support for non-Arabic digits in numbers, and access to large parts of the Unicode database through built-in methods.
Sun, 26 Feb 2017
Perl 6 By Example: Functional Refactorings for Directory Visualization Code
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
In the last installment we've seen some code that generated tree maps and flame graphs from a tree of directory and file sizes.
There's a pattern that occurs three times in that code: dividing an area based on the size of the files and directories in the tree associated with the area.
Extracting such common code into a function is a good idea, but it's slightly hindered by the fact that there is custom code inside the loop that's part of the common code. Functional programming offers a solution: Put the custom code inside a separate function and have the common code call it.
Applying this technique to the tree map and flame graph code looks like this:
sub subdivide($tree, $lower, $upper, &todo) {
    my $base = ($upper - $lower ) / $tree.total-size;
    my $var  = $lower;
    for $tree.children -> $child {
        my $incremented = $var + $base * $child.total-size;
        todo($child, $var, $incremented);
        $var = $incremented;
    }
}
sub flame-graph($tree, :$x1!, :$x2!, :$y!, :$height!) {
    return if $y >= $height;
    take 'rect' => [
        x      => $x1,
        y      => $y,
        width  => $x2 - $x1,
        height => 15,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
    ];
    return if $tree ~~ File;
    subdivide( $tree, $x1, $x2, -> $child, $x1, $x2 {
        flame-graph( $child, :$x1, :$x2, :y($y + 15), :$height );
    });
}
sub tree-map($tree, :$x1!, :$x2!, :$y1!, :$y2!) {
    return if ($x2 - $x1) * ($y2 - $y1) < 20;
    take 'rect' => [
        x      => $x1,
        y      => $y1,
        width  => $x2 - $x1,
        height => $y2 - $y1,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name],
    ];
    return if $tree ~~ File;
    if $x2 - $x1 > $y2 - $y1 {
        # split along the x-axis
        subdivide $tree, $x1, $x2, -> $child, $x1, $x2 {
            tree-map $child, :$x1, :$x2, :$y1, :$y2;
        }
    }
    else {
        # split along the y-axis
        subdivide $tree, $y1, $y2, -> $child, $y1, $y2 {
            tree-map $child, :$x1, :$x2, :$y1, :$y2;
        }
    }
}
The newly introduced subroutine subdivide
takes a directory tree, a start
point and an end point, and finally a code object &todo
. For each child of
the directory tree it calculates the new coordinates and then calls the
&todo
function.
The usage in subroutine flame-graph
looks like this:
subdivide( $tree, $x1, $x2, -> $child, $x1, $x2 {
flame-graph( $child, :$x1, :$x2, :y($y + 15), :$height );
});
The code object being passed to subdivide
starts with ->
,
which introduces the signature of a block. The code block recurses into
flame-graph
, adding some extra arguments, and turning two positional
arguments into named arguments along the way.
This refactoring shortened the code and made it overall more pleasant to work
with. But there's still quite a bit of duplication between tree-map
and
flame-graph
: both have an initial termination condition, a take
of a
rectangle, and then a call or two to subdivide
. If we're willing to put all
the small differences into small, separate functions, we can unify it further.
If we pass all those new functions as arguments to each call, we create an
unpleasantly long argument list. Instead, we can use those functions to
generate the previous functions flame-graph
and tree-map
:
sub svg-tree-gen(:&terminate!, :&base-height!, :&subdivide-x!, :&other!) {
    sub inner($tree, :$x1!, :$x2!, :$y1!, :$y2!) {
        return if terminate(:$x1, :$x2, :$y1, :$y2);
        take 'rect' => [
            x      => $x1,
            y      => $y1,
            width  => $x2 - $x1,
            height => base-height(:$y1, :$y2),
            style  => "fill:" ~ random-color(),
            title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
        ];
        return if $tree ~~ File;
        if subdivide-x(:$x1, :$y1, :$x2, :$y2) {
            # split along the x-axis
            subdivide $tree, $x1, $x2, -> $child, $x1, $x2 {
                inner($child, :$x1, :$x2, :y1(other($y1)), :$y2);
            }
        }
        else {
            # split along the y-axis
            subdivide $tree, $y1, $y2, -> $child, $y1, $y2 {
                inner($child, :x1(other($x1)), :$x2, :$y1, :$y2);
            }
        }
    }
}
my &flame-graph = svg-tree-gen
    terminate   => -> :$y1, :$y2, | { $y1 > $y2 },
    base-height => -> | { 15 },
    subdivide-x => -> | { True },
    other       => -> $y1 { $y1 + 15 },
    ;
my &tree-map = svg-tree-gen
    terminate   => -> :$x1, :$y1, :$x2, :$y2 { ($x2 - $x1) * ($y2 - $y1) < 20 },
    base-height => -> :$y1, :$y2 { $y2 - $y1 },
    subdivide-x => -> :$x1, :$x2, :$y1, :$y2 { $x2 - $x1 > $y2 - $y1 },
    other       => -> $a { $a },
    ;
So there's a new function svg-tree-gen
, which returns a function. The
behavior of the returned function depends on the four small functions that
svg-tree-gen
receives as arguments.
The first argument, terminate
, determines
under what condition the inner function should terminate early. For tree-map
that's when the area is below 20 pixels, for flame-graph
when the current
y-coordinate $y1
exceeds the height of the whole image, which is stored in
$y2
. svg-tree-gen
always calls this function with the four named arguments
x1
, x2
, y1
and y2
, so the terminate
function must ignore the x1
and x2
values. It does this by adding |
as a parameter, which is an
anonymous capture. Such a parameter can bind arbitrary positional and named
arguments, and since it's an anonymous parameter, it discards all the values.
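Here is a tiny standalone illustration of such a signature; the block and argument names are made up for this example:
my &is-taller = -> :$y1, :$y2, | { $y1 > $y2 };
# x1 and x2 are bound by the anonymous capture | and simply discarded:
say is-taller(x1 => 0, x2 => 100, y1 => 10, y2 => 5);   # True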
The second configuration function, base-height
, determines the height of the
rectangle in the base case. For flame-graph
it's a constant, so the
configuration function must discard all arguments, again with a |
. For tree-map, it must return the difference between $y2 and $y1, as before the refactoring.
The third function determines when to subdivide along the x-axis. Flame graphs
always divide along the x-axis, so -> | { True }
accomplishes that. Our
simplistic approach to tree graphs divides along the longer axis, so only
along the x-axis if $x2 - $x1 > $y2 - $y1
.
The fourth and final function we pass to svg-tree-gen
calculates the
coordinate of the axis that isn't being subdivided. In the case of
flame-graph
that's increasing over the previous value by the height of the
bars, and for tree-map
it's the unchanged coordinate, so we pass the
identity function -> $a { $a }
.
The inner function only needs a name because we need to call it from itself
recursively; otherwise an anonymous function sub ($tree, :$x1!, :$x2!,
:$y1!, :$y2!) { ... }
would have worked fine.
Now that we have very compact definitions of flame-graph
and tree-map
,
it's a good time to play with some of the parameters. For example we can
introduce a bit of margin in the flame graph by having the increment in
other
greater than the bar height in base-height
:
my &flame-graph = svg-tree-gen
    base-height => -> | { 15 },
    other       => -> $y1 { $y1 + 16 },
    # rest as before
Another knob to turn is to change the color generation to something more deterministic, and make it configurable from the outside:
sub svg-tree-gen(:&terminate!, :&base-height!, :&subdivide-x!, :&other!,
                 :&color = &random-color) {
    sub inner($tree, :$x1!, :$x2!, :$y1!, :$y2!) {
        return if terminate(:$x1, :$x2, :$y1, :$y2);
        take 'rect' => [
            x      => $x1,
            y      => $y1,
            width  => $x2 - $x1,
            height => base-height(:$y1, :$y2),
            style  => "fill:" ~ color(:$x1, :$x2, :$y1, :$y2),
            title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
        ];
        # rest as before
    }
We can, for example, keep state within the color generator and return a slightly different color during each iteration:
sub color-range(|) {
    state ($r, $g, $b) = (0, 240, 120);
    $r = ($r + 5)  % 256;
    $g = ($g + 10) % 256;
    $b = ($b + 15) % 256;
    return "rgb($r,$g,$b)";
}
state
variables keep their values between calls to the same subroutine and
their initialization runs only on the first call. So this function slightly
increases the lightness in each color channel for each invocation, except
when it reaches 256, where the modulo operator %
resets it back to a small
value.
If we plug this into our functions by passing color => &color-range
to the
calls to svg-tree-gen
, we get much less chaotic looking output:
And the flame graph:
More Language Support for Functional Programming
As you've seen in the examples above, functional programming typically involves writing lots of small functions. Perl 6 has some language features that make it very easy to write such small functions.
A common task is to write a function that calls a particular method on its argument, as we've seen here:
method total-size() {
$!total-size //= $.size + @.children.map({.total-size}).sum;
# ^^^^^^^^^^^^^
}
This can be abbreviated to *.total-size
:
method total-size() {
$!total-size //= $.size + @.children.map(*.total-size).sum;
}
This works for chains of method calls too, so you could write @.children.map(*.total-size.round) if total-size returned a fractional number and you wanted to call the .round method on the result.
There are more cases where you can replace an expression with the "Whatever"
star *
to create a small function. To create a function that adds 15 to its
argument, you can write * + 15
instead of -> $a { $a + 15 }
.
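For example, the following two lines do the same thing:
say (1, 2, 3).map(* + 15);              # (16 17 18)
say (1, 2, 3).map(-> $a { $a + 15 });   # (16 17 18)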
If you need to write a function that just calls another function, but passes more arguments to the second function, you can use the method assuming. For example -> $x { f(42, $x) } can be replaced with &f.assuming(42). This also works for named arguments, so -> $x { f($x, height => 42 ) } can be replaced with &f.assuming(height => 42).
Summary
Functional programming offers techniques for extracting common logic into separate functions. The desired differences in behavior can be encoded in more functions that you pass in as arguments to other functions.
Perl 6 supports functional programming by making functions first class, so you can pass them around as ordinary objects. It also offers closures (access to outer lexical variables from functions), and various shortcuts that make it more pleasant to write short functions.
Sun, 19 Feb 2017
Perl 6 By Example: A File and Directory Usage Graph
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
A File and Directory Usage Graph
You bought a shiny new 2TB disk just a short while ago, and you're already getting low disk space warnings. What's taking up all that space?
To answer this question, and experiment a bit with data visualization, let's write a small tool that visualizes which files use up how much disk space.
To do that, we must first recursively read all directories and files in a given directory, and record their sizes. To get a listing of all elements in a directory, we can use the dir function, which returns a lazy list of IO::Path objects.
We distinguish between directories, which can have child entries, and files, which don't. Both can have a direct size, and in the case of directories also a total size, which includes files and subdirectories, recursively:
class File {
    has $.name;
    has $.size;
    method total-size() { $.size }
}

class Directory {
    has $.name;
    has $.size;
    has @.children;
    has $!total-size;
    method total-size() {
        $!total-size //= $.size + @.children.map({.total-size}).sum;
    }
}

sub tree(IO::Path $path) {
    if $path.d {
        return Directory.new(
            name     => $path.basename,
            size     => $path.s,
            children => dir($path).map(&tree),
        );
    }
    else {
        return File.new(
            name => $path.Str,
            size => $path.s,
        );
    }
}
Method total-size in class Directory uses the construct $var //= EXPR. The // stands for defined-OR: it returns the left-hand side if that has a defined value, and otherwise evaluates and returns the value of EXPR. Combined with the assignment operator, it evaluates the right-hand side only if the variable is undefined, and then stores the value of the expression in the variable. That's a short way to write a cache.
The code for reading a file tree recursively uses the d
and s
methods on
IO::Path
. d
returns True for directories, and False for files. s
returns
the size. (Note that .s
on directories used to throw an exception in older
Rakudo versions. You must use Rakudo 2017.01-169 or newer for this to work; if
you are stuck on an older version of Rakudo, you could hard code the size of a
directory to a typical block size, like 4096 bytes. It typically won't skew
your results too much).
Just to check that we've got a sensible data structure, we can write a short routine that prints it recursively, with indention to indicate nesting of directory entries:
sub print-tree($tree, Int $indent = 0) {
    say ' ' x $indent, format-size($tree.total-size), ' ', $tree.name;
    if $tree ~~ Directory {
        print-tree($_, $indent + 2) for $tree.children
    }
}

sub format-size(Int $bytes) {
    my @units = flat '', <k M G T P>;
    my @steps = (1, { $_ * 1024 } ... *).head(6);
    for @steps.kv -> $idx, $step {
        my $in-unit = $bytes / $step;
        if $in-unit < 1024 {
            return sprintf '%.1f%s', $in-unit, @units[$idx];
        }
    }
}

sub MAIN($dir = '.') {
    print-tree(tree($dir.IO));
}
The subroutine print-tree is pretty boring, if you're used to recursion. It prints out the name and size of the current node, and if the current node is a directory, recurses into each child with an increased indentation. The indentation is applied through the x string repetition operator, which when called as $string x $count repeats $string $count times.
To get a human-readable representation of the size of a number, format-size knows a list of six units: the empty string for one, k (kilo) for 1024, M (mega) for 1024*1024 and so on. This list is stored in the array @units. The multiplier associated with each unit is stored in @steps, which is initialized through the sequence operator .... Its structure is INITIAL, CALLABLE ... LIMIT, and it repeatedly applies CALLABLE, first to the initial value, then to the next value generated, and so on, until it hits LIMIT. The limit here is *, a special term called Whatever, which in this context means it's unlimited. So the sequence operator returns a lazy, potentially infinite list, and the trailing .head(6) call limits it to 6 values.
To find the most appropriate unit to print the size in, we have to iterate over both the values and the indexes of the array, which is what for @steps.kv -> $idx, $step { ... } accomplishes. sprintf, known from other programming languages, does the actual formatting to one digit after the dot, and appends the unit.
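To see the sequence operator in isolation:
say (1, { $_ * 1024 } ... *).head(6);
# (1 1024 1048576 1073741824 1099511627776 1125899906842624)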
Generating a Tree Map
One possible visualization of file and directory sizes is a tree map, which represents each directory as a rectangle, and a each file inside it as a rectangle inside directory's rectangle. The size of each rectangle is proportional to the size of the file or directory it represents.
We'll generate an SVG file containing all those rectangles. Modern browsers support displaying those files, and also show mouse-over texts for each rectangle. This alleviates the burden of actually labeling the rectangles, which can be quite a hassle.
To generate the SVG, we'll use the SVG
module, which you can install with
$ zef install SVG
or
$ panda install SVG
depending on the module installer you have available.
This module provides a single static method, to which you pass nested pairs. Pairs whose values are arrays are turned into XML tags, and other pairs are turned into attributes. For example this Perl 6 script
use SVG;
print SVG.serialize(
    :svg[
        width  => 100,
        height => 20,
        title  => [
            'example',
        ]
    ],
);
produces this output:
<svg xmlns="http://www.w3.org/2000/svg"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
width="100"
height="20">
<title>example</title>
</svg>
(without the indentation). The xmlns attributes are helpfully added by the SVG module, and are necessary for programs to recognize the file as SVG.
To return to the tree maps: a very simple way to lay out the rectangles is to recurse into areas, and for each area subdivide it either horizontally or vertically, depending on which axis is longer:
sub tree-map($tree, :$x1!, :$x2!, :$y1!, :$y2!) {
    # do not produce rectangles for small files/dirs
    return if ($x2 - $x1) * ($y2 - $y1) < 20;

    # produce a rectangle for the current file or dir
    take 'rect' => [
        x      => $x1,
        y      => $y1,
        width  => $x2 - $x1,
        height => $y2 - $y1,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name],
    ];
    return if $tree ~~ File;

    if $x2 - $x1 > $y2 - $y1 {
        # split along the x axis
        my $base  = ($x2 - $x1) / $tree.total-size;
        my $new-x = $x1;
        for $tree.children -> $child {
            my $increment = $base * $child.total-size;
            tree-map(
                $child,
                x1 => $new-x,
                x2 => $new-x + $increment,
                :$y1,
                :$y2,
            );
            $new-x += $increment;
        }
    }
    else {
        # split along the y axis
        my $base  = ($y2 - $y1) / $tree.total-size;
        my $new-y = $y1;
        for $tree.children -> $child {
            my $increment = $base * $child.total-size;
            tree-map(
                $child,
                :$x1,
                :$x2,
                y1 => $new-y,
                y2 => $new-y + $increment,
            );
            $new-y += $increment;
        }
    }
}
sub random-color {
    return 'rgb(' ~ (1..3).map({ (^256).pick }).join(',') ~ ')';
}

sub MAIN($dir = '.', :$type="flame") {
    my $tree = tree($dir.IO);
    use SVG;
    my $width  = 1024;
    my $height = 768;
    say SVG.serialize(
        :svg[
            :$width,
            :$height,
            | gather tree-map $tree, x1 => 0, x2 => $width, y1 => 0, y2 => $height
        ]
    );
}
Tree map of an example directory, with random colors and a mouse-over hover identifying one of the files.
The generated file is not pretty, due to the random colors, and due to some
files being identified as very narrow rectangles. But it does make it obvious
that there are a few big files, and many mostly small files in a directory
(which happens to be the .git
directory of a repository). Viewing a file in
a browser shows the name of the file on mouse over.
How did we generate this file?
Sub tree-map calls take to add elements to a result list, so it must be called in the context of a gather statement. gather { take 1; take 2 } returns a lazy list of two elements, 1, 2. But the take calls don't have to occur in the lexical scope of the gather; they can be in any code that's directly or indirectly called from the gather. We call that the dynamic scope.
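A minimal illustration of take working in the dynamic scope of a gather; the sub name here is made up:
sub count-up($limit) {
    # take is called here, but the surrounding gather lives in the caller
    take $_ for 1 .. $limit;
}
my @numbers = gather count-up(3);
say @numbers;   # [1 2 3]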
The rest of sub tree-map
is mostly straight-forward. For each direction in
which the remaining rectangle can be split, we calculate a base unit that
signifies how many pixels a byte should take up. This is used to split up the
current canvas into smaller ones, and use those to recurse into tree-map
.
The random color generation uses ^256
to create a range from 0 to 256
(exclusive), and .pick
returns a random element from this range. The result
is a random CSS color string like rgb(120,240,5)
.
In sub MAIN, the gather
returns a list, which would normally be
nested inside the outer array. The pipe symbol |
in :svg[ ..., | gather
... ]
before the gather prevents the normal nesting, and flattens the list
into the outer array.
Flame Graphs
The disadvantage of tree maps as generated before is that the human brain isn't very good at comparing rectangle sizes of different aspect ratios, that is, when their ratios of width to height differ strongly. Flame graphs prevent these perception errors by showing file sizes as horizontal bars. The vertical arrangement indicates the nesting of directories and files inside other directories. The disadvantage is that less of the available space is used for visualizing the file sizes.
Generating flame graphs is easier than generating tree maps, because you only need to subdivide in one direction, whereas the height of each bar is fixed, here to 15 pixels:
sub flame-graph($tree, :$x1!, :$x2!, :$y!, :$height!) {
    return if $y >= $height;
    take 'rect' => [
        x      => $x1,
        y      => $y,
        width  => $x2 - $x1,
        height => 15,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
    ];
    return if $tree ~~ File;
    my $base  = ($x2 - $x1) / $tree.total-size;
    my $new-x = $x1;
    for $tree.children -> $child {
        my $increment = $base * $child.total-size;
        flame-graph(
            $child,
            x1 => $new-x,
            x2 => $new-x + $increment,
            y  => $y + 15,
            :$height,
        );
        $new-x += $increment;
    }
}
We can add a switch to sub MAIN
to call either tree-map
or flame-graph
,
depending on a command line option:
sub MAIN($dir = '.', :$type="flame") {
my $tree = tree($dir.IO);
use SVG;
my $width = 1024;
my $height = 768;
my &grapher = $type eq 'flame'
?? { flame-graph $tree, x1 => 0, x2 => $width, y => 0, :$height }
!! { tree-map $tree, x1 => 0, x2 => $width, y1 => 0, y2 => $height }
say SVG.serialize(
:svg[
:$width,
:$height,
| gather grapher()
]
);
}
Since SVG's coordinate system places the zero of the vertical axis at the top, this actually produces an inverted flame graph, sometimes called icicle graph:
Inverted flame graph with random colors, where the width of each bar
represents a file/directory size, and the vertical position the nesting inside
a directory.
Summary
We've explored tree maps and flame graphs to visualize which files and directories use up how much disk space.
But the code contains quite a bit of duplication. Next week we'll explore techniques from functional programming to reduce code duplication. We'll also try to make the resulting files a bit prettier.
Sun, 12 Feb 2017
Perl 6 By Example: Generating Good Parse Errors from a Parser
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
Good error messages are paramount to the user experience of any product. Parsers are no exception to this. Consider the difference between a message like "Square bracket [ on line 5 closed by curly bracket } on line 5" and Python's lazy and generic "SyntaxError: invalid syntax".
In addition to the textual message, knowing the location of the parse error helps tremendously in figuring out what's wrong.
We'll explore how to generate better parsing error messages from a Perl 6 grammar, using the INI file parser from the previous blog posts as an example.
Failure is Normal
Before we start, it's important to realize that in a grammar-based parser, it's normal for a regex to fail to match, even in an overall successful parse.
Let's recall a part of the parser:
token block { [<pair> | <comment>]* }
token section { <header> <block> }
token TOP { <block> <section>* }
When this grammar matches against the string
key=value
[header]
other=stuff
then TOP
calls block
, which calls both pair
and comment
. The pair
match succeeds, the comment
match fails. No big deal. But since there is a
*
quantifier in token block
, it tries again to match pair
or comment
. Neither succeeds, but the overall match of token block still succeeds.
A nice way to visualize successful and failed submatches is to install the Grammar::Tracer module (zef install Grammar::Tracer or panda install Grammar::Tracer), and simply add the statement use Grammar::Tracer before the grammar definition. This produces debug output showing which rules matched and which didn't:
TOP
| block
| | pair
| | | key
| | | * MATCH "key"
| | | ws
| | | * MATCH ""
| | | ws
| | | * MATCH ""
| | | value
| | | * MATCH "value"
| | | ws
| | | * MATCH ""
| | | ws
| | | * MATCH ""
| | * MATCH "key=value\n"
| | pair
| | | key
| | | * FAIL
| | * FAIL
| | comment
| | * FAIL
| * MATCH "key=value\n"
| section
...
Detecting Harmful Failure
To produce good parsing error messages, you must distinguish between expected and unexpected parse failures. As explained above, a match failure of a single regex or token is not generally an indication of a malformed input. But you can identify points where you know that once the regex engine got this far, the rest of the match must succeed.
If you recall pair
:
rule pair { <key> '=' <value> \n+ }
we know that if a key was parsed, we really expect the next character to be an equals sign. If not, the input is malformed.
In code, this looks like this:
rule pair {
    <key>
    [ '=' || <expect('=')> ]
    <value> \n+
}
|| is a sequential alternative, which first tries to match the subregex on the left-hand side, and only tries the right-hand side if that failed. The plain |, on the other hand, executes all alternatives notionally in parallel, and takes the longest match.
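A quick way to see the difference between the two:
say "abcd" ~~ / ab || abcd /;   # 「ab」, the first alternative that matches wins
say "abcd" ~~ / ab |  abcd /;   # 「abcd」, the longest alternative wins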
So now we have to define expect
:
method expect($what) {
die "Cannot parse input as INI file: Expected $what";
}
Yes, you can call methods just like regexes, because regexes really are
methods under the hood. die
throws an exception, so now the malformed input
justakey
produces the error
Cannot parse input as INI file: Expected =
followed by a backtrace. That's already better than "invalid syntax", though
the position is still missing. Inside method expect
, we can find the current
parsing position through method pos
, a method supplied by the implicit
parent class Grammar that the grammar
declaration brings with it.
We can use that to improve the error message a bit:
method expect($what) {
die "Cannot parse input as INI file: Expected $what at character {self.pos}";
}
Providing Context
For larger inputs, we really want to print the line number. To calculate that,
we need to get hold of the target string, which is available as method
target
:
method expect($what) {
my $parsed-so-far = self.target.substr(0, self.pos);
my @lines = $parsed-so-far.lines;
die "Cannot parse input as INI file: Expected $what at line @lines.elems(), after '@lines[*-1]'";
}
This brings us from the "meh" realm of error messages to quite good.
IniFile.parse(q:to/EOI/);
key=value
[section]
key_without_value
more=key
EOI
now dies with
Cannot parse input as INI file: Expected = at line 3, after 'key_without_value'
You can refine method expect
more, for example by providing context both before
and after the position of the parse failure.
And of course you have to apply the [ thing || <expect('thing')> ]
pattern
at more places inside the regex to get better error messages.
Finally you can provide different kinds of error messages too. For example
when parsing a section header, once the initial [
is parsed, you likely
don't want an error message "expected rest of section header", but rather
"malformed section header, at line ...":
rule pair {
    <key>
    [ '=' || <expect('=')> ]
    [ <value> || <expect('value')> ]
    \n+
}
token header {
    '['
    [ ( <-[ \[ \] \n ]>+ ) ']'
        || <error("malformed section header")> ]
}

...

method expect($what) {
    self.error("expected $what");
}

method error($msg) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: $msg at line @lines.elems(), after '@lines[*-1]'";
}
Since Rakudo Perl 6 uses grammars to parse Perl 6 input, you can use Rakudo's own grammar as source of inspiration for more ways to make error reporting even better.
Summary
To generate good error messages from a parser, you need to distinguish between
expected and unexpected match failures. The sequential alternative ||
is a
tool you can use to turn unexpected match failures into error messages by
raising an exception from the second branch of the alternative.
Sun, 05 Feb 2017
Perl 6 By Example: Improved INI Parsing with Grammars
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
Last week we saw a collection of regexes that can parse a configuration file in the INI format that's popular in the world of Microsoft Windows applications.
Here we'll explore grammars, a feature that groups regexes into a class-like structure, and how to extract structured data from a successful match.
Grammars
A grammar is a class with some extra features that make it suitable for parsing text. Along with methods and attributes you can put regexes into a grammar.
This is what the INI file parser looks like when formulated as a grammar:
grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+ }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}
You can use it to parse some text by calling the parse
method, which uses
regex or token TOP
as the entry point:
my $result = IniFile.parse($text);
Besides the standardized entry point, a grammar offers more advantages. You can inherit from it like from a normal class, thus bringing even more reusability to regexes. You can group extra functionality together with the regexes by adding methods to the grammar. And then there are some mechanisms in grammars that can make your life as a developer easier.
One of them is dealing with whitespace. In INI files, horizontal whitespace is generally considered to be insignificant, in that key=value and key = value lead to the same configuration of the application. So far we've dealt with that explicitly by adding \h* to token pair. But there are places we haven't actually considered. For example it's OK to have a comment that doesn't start at the beginning of a line.
The mechanism that grammars offer is that you can define a rule called ws
,
and when you declare a token with rule
instead of token
(or enable this
feature in regex through the :sigspace
modifier), Perl 6 inserts implicit
<ws>
calls for you where there is whitespace in the regex definition:
grammar IniFile {
token ws { \h* }
rule pair { <key> '=' <value> \n+ }
# rest as before
}
This might not be worth the effort for a single rule that needs to parse whitespace, but when there are more, it really pays off by keeping whitespace parsing in a single place.
Note that you should only parse insignificant whitespace in token ws
. For
example for INI files, newlines are significant, so ws
shouldn't match
them.
Extracting Data from the Match
So far the IniFile
grammar only checks whether a given input matches the
grammar or not. But when it does match, we really want the result of the parse
in a data structure that's easy to use. For example we could translate this
example INI file:
key1=value2
[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser
[section2]
more=stuff
Into this data structure of nested hashes:
{
_ => {
key1 => "value2"
},
section1 => {
key2 => "value2",
key3 => "with spaces"
},
section2 => {
more => "stuff"
}
}
Key-value pairs from outside of any section show up in the _
top-level
key.
The result from the IniFile.parse call is a Match object that has (nearly) all the information necessary to extract the desired data. If you turn a Match object into a string, it becomes the matched string. But there's more. You can use it like a hash to extract the matches from named submatches. For example if the top-level match from
token TOP { <block> <section>* }
produces a Match object $m, then $m<block> is again a Match object, this one from the match of the call of token block. And $m<section> is a list of Match objects from the repeated calls to token section. So a Match is really a tree of matches.
We can walk this data structure to extract the nested hashes. Token header matches a string like "[section1]\n", and we're only interested in "section1". To get to the inner part, we can modify token header by inserting a pair of round parentheses around the subregex whose match we're interested in:
token header { '[' ( <-[ \[ \] \n ]>+ ) ']' \n+ }
# ^^^^^^^^^^^^^^^^^^^^ a capturing group
That's a capturing group, and we can get its match by using the top-level
match for header
as an array, and accessing its first element. This leads us
to the full INI parser:
sub parse-ini(Str $input) {
    my $m = IniFile.parse($input);
    unless $m {
        die "The input is not a valid INI file.";
    }

    sub hash-from-block(Match $m) {
        my %result;
        for $m<block><pair> -> $pair {
            %result{ $pair<key>.Str } = $pair<value>.Str;
        }
        return %result;
    }

    my %result;
    %result<_> = hash-from-block($m);
    for $m<section> -> $section {
        %result{ $section<header>[0].Str } = hash-from-block($section);
    }
    return %result;
}
This top-down approach works, but it requires a very intimate understanding of the grammar's structure. Which means that if you change the structure during maintenance, you'll have a hard time figuring out how to change the data extraction code.
So Perl 6 offers a bottom-up approach as well. It allows you to write a data
extraction or action method for each regex, token or rule. The grammar engine
passes in the match object as the single argument, and the action method can
call the routine make
to attach a result to the match object. The result is
available through the .made
method on the match object.
This execution of action methods happens as soon as a regex matches
successfully, which means that an action method for a regex can rely on the
fact that the action methods for subregex calls have already run. For example
when the rule pair { <key> '=' <value> \n+ }
is being executed, first
token key
matches successfully, and its action method runs immediately
afterwards. Then token value
matches, and its action method runs too. Then
finally rule pair
itself can match successfully, so its action method can
rely on $m<key>.made
and $m<value>.made
being available, assuming that the
match result is stored in variable $m
.
Speaking of variables, a regex match implicitly stores its result in the
special variable $/
, and it is custom to use $/
as parameter in action
methods. And there is a shortcut for accessing named submatches: instead of
writing $/<key>
, you can write $<key>
. With this convention in mind, the
action class becomes:
class IniFile::Actions {
    method key($/)     { make $/.Str }
    method value($/)   { make $/.Str }
    method header($/)  { make $/[0].Str }
    method pair($/)    { make $<key>.made => $<value>.made }
    method block($/)   { make $<pair>.map({ .made }).hash }
    method section($/) { make $<header>.made => $<block>.made }
    method TOP($/)     {
        make {
            _ => $<block>.made,
            $<section>.map: { .made },
        }
    }
}
The first two action methods are really simple. The result of a key
or
value
match is simply the string that matched. For a header
, it's just the
substring inside the brackets. Fittingly, a pair
returns a
Pair object, composed from key and value.
Method block
constructs a hash from all the lines in the block by iterating
over each pair
submatch, extracting the already attached Pair
object.
One level above that in the match tree, section takes that hash and pairs it with the name of the section, extracted from $<header>.made. Finally the top-level action method gathers the sectionless key-value pairs under the key _, as well as all the sections, and returns them in a hash.
In each method of the action class, we only rely on the knowledge of the
first level of regexes called directly from the regex that corresponds to the
action method, and the data types that they .made
. Thus when you refactor one
regex, you also have to change only the corresponding action method. Nobody
needs to be aware of the global structure of the grammar.
Now we just have to tell Perl 6 to actually use the action class:
sub parse-ini(Str $input) {
    my $m = IniFile.parse($input, :actions(IniFile::Actions));
    unless $m {
        die "The input is not a valid INI file.";
    }
    return $m.made
}
If you want to start parsing with a different rule than TOP
(which you might
want to do in a test, for example), you can pass a named argument rule
to
method parse
:
sub parse-ini(Str $input, :$rule = 'TOP') {
    my $m = IniFile.parse($input,
        :actions(IniFile::Actions),
        :$rule,
    );
    unless $m {
        die "The input is not a valid INI file.";
    }
    return $m.made
}

say parse-ini($ini).perl;

use Test;
is-deeply parse-ini("k = v\n", :rule<pair>), 'k' => 'v',
    'can parse a simple pair';
done-testing;
To better encapsulate all the parsing functionality within the grammar, we can
turn parse-ini
into a method:
grammar IniFile {
    # regexes/tokens unchanged as before

    method parse-ini(Str $input, :$rule = 'TOP') {
        my $m = self.parse($input,
            :actions(IniFile::Actions),
            :$rule,
        );
        unless $m {
            die "The input is not a valid INI file.";
        }
        return $m.made
    }
}

# Usage:
my $result = IniFile.parse-ini($text);
To make this work, the class IniFile::Actions either has to be declared before the grammar, or it needs to be pre-declared with class IniFile::Actions { ... } at the top of the file (with a literal three dots to mark it as a forward declaration).
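As a sketch, the forward-declaration variant looks like this (bodies abbreviated):
class IniFile::Actions { ... }   # forward declaration: the real body comes later

grammar IniFile {
    # ... tokens and the parse-ini method that mentions IniFile::Actions
}

class IniFile::Actions {
    method key($/) { make $/.Str }
    # ... remaining action methods as shown above
}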
Summary
Match objects are really a tree of matches, with nodes for each named submatch and for each capturing group. Action methods make it easy to decouple parsing from data extraction.
Next we'll explore how to generate better error messages from a failed parse.
Sun, 29 Jan 2017
Perl 6 By Example: Parsing INI files
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
You've probably seen .ini
files before; they are quite common as
configuration files on the Microsoft Windows platform, but are also in many
other places like ODBC configuration files, Ansible's inventory
files and so on.
This is what they look like:
key1=value2
[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser
[section2]
more=stuff
Perl 6 offers regexes for parsing, and grammars for structuring and reusing regexes.
Regex Basics
A regex is a piece of code that acts as a pattern for strings with a common structure. It's derived from the computer science concept of a regular expression, but adopted to allow more constructs than pure regular expressions allow, and with some added features that make them easier to use.
We'll use named regexes to match the primitives, and then use regexes that call these named regexes to build a parser for the INI files. Since INI files have no universally accepted, formal grammar, we have to make stuff up as we go.
Let's start with parsing key-value pairs, like key1=value1. First the key. It may contain letters, digits and the underscore _. There's a shortcut for matching such characters, \w, and matching at least one of them works by appending a + character:
use v6;

my regex key { \w+ }

multi sub MAIN('test') {
    use Test;
    ok 'abc' ~~ /^ <key> $/, '<key> matches a simple identifier';
    ok '[abc]' !~~ /^ <key> $/, '<key> does not match a section header';
    done-testing;
}
my regex key { \w+ }
declares a lexically (my) scoped regex called key
that matches one or more word characters.
There is a long tradition in programming languages to support so-called Perl Compatible Regular Expressions, short PCRE. Many programming languages support some deviations from PCRE, including Perl itself, but common syntax elements remain throughout most implementations. Perl 6 still supports some of these elements, but deviates substantially in others.
Here \w+
is the same as in PCRE, but the fact that white space around the
\w+
is ignored is not. In the testing routine, the slashes in 'abc' ~~ /^
<key> $/
delimit an anonymous regex. In this regex, ^
and $
stand for the
start and the end of the matched string, respectively, which is familiar from
PCRE. But then <key>
calls the named regex key
from earlier. This again
is a Perl 6 extension. In PCRE, the <
in a regex matches a literal <
. In
Perl 6 regexes, it introduces a subrule call.
In general, all non-word characters are reserved for "special" syntax, and you have to quote or backslash them to get the literal meaning. For example \< or '<' in a regex matches a less-than sign. Word characters (letters, digits and the underscore) always match literally.
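A quick illustration (so coerces the match result to a boolean):
say so "a<b" ~~ / a \< b /;    # True
say so "a<b" ~~ / a '<' b /;   # True
say so "a<b" ~~ / a '(' b /;   # False: the quoted ( matches a literal parenthesis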
Parsing the INI primitives
Coming back to the INI parsing, we have to think about what characters are allowed inside a value. Listing allowed characters seems like a futile exercise, since we are very likely to forget some. Instead, we should think about what's not allowed in a value. Newlines certainly aren't, because they introduce the next key/value pair or a section heading. Neither are semicolons, because they introduce a comment.
We can formulate this exclusion as a negated character class: <-[ \n ; ]>
matches any single character that is neither a newline nor a semicolon. Note
that inside a character class, nearly all characters lose their special
meaning. Only backslash, whitespace and the closing bracket stand for anything
other than themselves. Inside and outside of character classes alike, \n
matches a single newline character, and \s matches whitespace. The upper-case variants invert that, so that for example \S matches any single character that is not whitespace.
This leads us to a first version of a regex for matching a value in an INI file:
my regex value { <-[ \n ; ]>+ }
There is one problem with this regex: it also matches leading and trailing whitespace, which we don't want to consider as part of the value:
my regex value { <-[ \n ; ]>+ }
if ' abc ' ~~ /<value>/ {
say "matched '$/'"; # matched ' abc '
}
If Perl 6 regexes were limited to a regular language in the Computer Science sense, we'd have to write something like this:
my regex value {
    # match a first non-whitespace character
    <-[ \s ; ]>
    [
        # then arbitrarily many that can contain whitespace
        <-[ \n ; ]>*
        # ... terminated by one non-whitespace
        <-[ \s ; ]>
    ]?  # and make it optional, in case the value is
        # only one non-whitespace character
}
And now you know why people respond with "And now you have two problems" when proposing to solve problems with regexes. A simpler solution is to match a value as introduced first, and then introduce a constraint that neither the first nor the last character may be whitespace:
my regex value { <!before \s> <-[ \s ; ]>+ <!after \s> }
along with accompanying tests:
is ' abc ' ~~ /<value>/, 'abc', '<value> does not match leading or trailing whitespace';
is ' a' ~~ /<value>/, 'a', '<value> matches a single non-whitespace character too';
ok "a\nb" !~~ /^ <value> $/, '<value> does not match \n';
<!before regex>
is a negated look-ahead, that is, the following text must
not match the regex, and the text isn't consumed while matching.
Unsurprisingly, <!after regex>
is the negated look-behind, which tries to
match text that's already been matched, and must not succeed in doing so for
the whole match to be successful.
This being Perl 6, there is of course yet another way to approach this problem. If you formulate the requirements as "a value must not contain a newline or semicolon, must start with a non-whitespace character and must end with a non-whitespace character", it becomes obvious that if we just had an AND operator in regexes, this could be easy. And it is:
my regex value { <-[ \n ; ]>+ & \S.* & .*\S }
The & operator delimits two or more smaller regex expressions that must all match the same string successfully for the whole match to succeed. \S.* matches any string that starts with a non-whitespace character (\S), followed by any character (.) any number of times (*). Likewise .*\S matches any string that ends with a non-whitespace character.
Who would have thought that matching something as seemingly simple as a value in a configuration file could be so involved? Luckily, matching a pair of key and value is much simpler, now that we know how to match each on their own:
my regex pair { <key> '=' <value> }
And this works great, as long as there are no blanks surrounding the equality sign. If there is, we have to match it separately:
my regex pair { <key> \h* '=' \h* <value> }
\h
matches a horizontal whitespace, that is, a blank, a tabulator character,
or any other fancy space-like thing that Unicode has in store for us (for
example also the non-breaking space), but not a newline.
Speaking of newlines, it's a good idea to match a newline at the end of regex
pair
, and since we ignore empty lines, let's match more than one too:
my regex pair { <key> \h* '=' \h* <value> \n+ }
Time to write some tests as well:
ok "key=vaule\n" ~~ /<pair>/, 'simple pair';
ok "key = value\n\n" ~~ /<pair>/, 'pair with blanks';
ok "key\n= value\n" !~~ /<pair>/, 'pair with newline before assignment';
A section header is a string in brackets, so the string itself shouldn't contain brackets or a newline:
my regex header { '[' <-[ \[ \] \n ]>+ ']' \n+ }
# and in multi sub MAIN('test'):
ok "[abc]\n" ~~ /^ <header> $/, 'simple header';
ok "[a c]\n" ~~ /^ <header> $/, 'header with spaces';
ok "[a [b]]\n" !~~ /^ <header> $/, 'cannot nest headers';
ok "[a\nb]\n" !~~ /^ <header> $/, 'No newlines inside headers';
The last remaining primitive is the comment:
my regex comment { ';' \N*\n+ }
\N
matches any character that's not a newline, so the comment is just a
semicolon, and then anything until the end of the line.
Putting Things Together
A section of an INI file is a header followed by some key-value pairs or comment lines:
my regex section {
<header>
[ <pair> | <comment> ]*
}
[...]
groups a part of a regex, so that the quantifier *
after it
applies to the whole group, not just to the last term.
The whole INI file consists of potentially some initial key/value pairs or comments followed by some sections:
my regex inifile {
[ <pair> | <comment> ]*
<section>*
}
The avid reader has noticed that the [ <pair> | <comment> ]*
part of a regex
has been used twice, so it's a good idea to extract it into a standalone
regex:
my regex block { [ <pair> | <comment> ]* }
my regex section { <header> <block> }
my regex inifile { <block> <section>* }
It's time for the "ultimate" test:
my $ini = q:to/EOI/;
key1=value2
[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser
[section2]
more=stuff
EOI
ok $ini ~~ /^<inifile>$/, 'Can parse a full ini file';
Backtracking
Regex matching seems magical to many programmers. You just state the pattern, and the regex engine determines for you whether a string matches the pattern or not. While implementing a regex engine is a tricky business, the basics aren't too hard to understand.
The regex engine goes through the parts of a regex from left to right, trying to match each part of the regex. It keeps track of what part of the string it has matched so far in a cursor. If a part of a regex can't find a match, the regex engine tries to alter the previous match to take up fewer characters, and then retries the failed match at the new position.
For example if you execute the regex match
'abc' ~~ /.* b/
the regex engine first evaluates the .*
. The .
matches any character.
The *
quantifier is
greedy, which means it tries to match as many characters as it can. It ends
up matching the whole string, abc
. Then the regex engine tries to match the
b
, which is a literal. Since the previous match gobbled up the whole string,
matching b
against the remaining empty string fails. So the previous regex
part, .*
, must give up a character. It now matches ab
, and the literal
matcher for the b
compares b
and c
, and fails again. So there is a final
iteration where the .*
once again gives up one character it matched, and now
the b
literal can match the second character in the string.
This back and forth between the parts of a regex is called backtracking. It's a great feature when you search for a pattern in a string. But in a parser, it is usually not desirable. If for example the regex key matched the substring "key2" in the input "key2=value2", you don't want it to match a shorter substring just because the next part of the regex can't match.
There are three major reasons why you don't want that. The first is that it makes debugging harder. When humans think about how a text is structured, they usually commit pretty quickly to basic tokenization, such as where a word or a sentence ends. Thus backtracking can be very unintuitive. If you generate error messages based on which regexes failed to match, backtracking basically always leads to the error message being pretty useless.
The second reason is that backtracking can lead to unexpected regex matches. For example you want to match two words, optionally separated by whitespaces, and you try to translate this directly to a regex:
say "two words" ~~ /\w+\s*\w+/; # 「two words」
This seems to work: the first \w+ matches the first word, the second one matches the second word, all fine and good. Until you find that it actually matches a single word too:
say "two" ~~ /\w+\s*\w+/; # 「two」
How did that happen? Well, the first \w+ matched the whole word, \s* successfully matched the empty string, and then the second \w+ failed, forcing the previous two parts of the regex to match differently. So in the second iteration, the first \w+ only matches tw, and the second \w+ matches o. And then you realize that if two words aren't delimited by whitespace, how do you even tell where one word ends and the next one starts? With backtracking disabled, the regex fails to match instead of matching in an unintended way.
The third reason is performance. When you disable backtracking, the regex engine has to look at each character only once, or once for each branch it can take in the case of alternatives. With backtracking, the regex engine can be stuck in backtracking loops that take over-proportionally longer with increasing length of the input string.
To disable backtracking, you simply have to replace the word regex
by
token
in the declaration, or by using the :ratchet
modifier inside the
regex.
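A small demonstration of the difference, reusing the two-words example from above; the regex names are made up:
my regex two-words-r { \w+ \s* \w+ }
my token two-words-t { \w+ \s* \w+ }
say so "two"       ~~ /^ <two-words-r> $/;   # True  (backtracking splits "two" into "tw" and "o")
say so "two"       ~~ /^ <two-words-t> $/;   # False (no backtracking, so the second \w+ finds nothing)
say so "two words" ~~ /^ <two-words-t> $/;   # True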
In the INI file parser, only regex value needs backtracking (though the other formulations discussed above don't need it); all the other regexes can be switched over to tokens safely:
my token key { \w+ }
my regex value { <!before \s> <-[\n;]>+ <!after \s> }
my token pair { <key> \h* '=' \h* <value> \n+ }
my token header { '[' <-[ \[ \] \n ]>+ ']' \n+ }
my token comment { ';' \N*\n+ }
my token block { [ <pair> | <comment> ]* }
my token section { <header> <block> }
my token inifile { <block> <section>* }
Summary
Perl 6 allows regex reuse by treating them as first-class citizens, allowing them to be named and called like normal routines. Further clutter is removed by allowing whitespace inside regexes.
These features allow you to write regexes to parse proper file formats and even programming languages. So far we have only seen a binary decision about whether a string matches a regex or not. In the future, we'll explore ways to improve code reuse even more, extract structured data from the match, and give better error messages when the parse fails.
Sun, 22 Jan 2017
Perl 6 By Example: Perl 6 Review
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
In the previous "Perl 6 by Example" blog posts we've discussed some examples interleaved with the Perl 6 mechanics that make them work. Here I want to summarize and deepen the Perl 6 knowledge that we've touched on so far, removed from the original examples.
Variables and Scoping
In Perl 6, variable names are made of a sigil, $
, @
, %
or &
,
followed by an identifier. The sigil implies a type constraint, where $
is
the most general one (no restriction by default), @
is for arrays, %
for
hashes (associative arrays/maps), and &
for code objects.
Identifiers can contain -
and '
characters, as long as the character after
it is a letter. Identifiers must start with a letter or underscore.
Subroutines and variables declared with my
are lexically scoped. They are
visible from the point of the declaration to the end of the current
{}
-enclosed block (or the current file, in case the declaration is outside a
block). Subroutine parameters are visible in the signature and block of the
subroutine.
An optional twigil between the sigil and identifier can influence the
scoping. The *
twigil marks a dynamically scoped variable, thus lookup is
performed in the current call stack. !
marks attributes, that is,
a per-instance variable that's attached to an object.
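For example, a dynamically scoped variable is visible in subroutines called from the scope that declares it; the names here are made up for the example:
sub log-line($message) {
    # $*PREFIX is looked up in the dynamic scope, i.e. in the callers
    say $*PREFIX ~ $message;
}
sub run() {
    my $*PREFIX = '[run] ';
    log-line('starting');   # [run] starting
}
run();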
Subroutines
A subroutine, or short sub, is a piece of code with its own scope and usually also a name. It has a signature which specifies what kind of values you have to pass in when you call it:
sub chunks(Str $s, Int $chars) {
# ^^^^^^^^^^^^^^^^^^^^ signature
# ^^^^^^ name
gather for 0 .. $s.chars / $chars - 1 -> $idx {
take substr($s, $idx * $chars, $chars);
}
}
The variables used in the signature are called parameters, whereas we call the values that you pass in arguments.
To refer to a subroutine without calling it, put an ampersand character in front of it, for example
say &chunks.^name; # Sub
to call it, simply use its name, followed by the list of arguments, which can optionally be in parentheses:
say chunks 'abcd', 2; # (ab cd)
say chunks('abcd', 2); # (ab cd)
You only need the parentheses if some other construct would otherwise interfere with the subroutine call. For example if you intend to write
say chunks(join('x', 'ab', 'c'), 2);
and you leave out the inner pair of parentheses:
say chunks(join 'x', 'ab', 'c', 2);
then all the arguments go to the join
function, leaving only one argument to the
chunks
function. On the other hand it is fine to leave out the outer pair of
parentheses and write
say chunks join('x', 'ab', 'c'), 2;
because there's no ambiguity here.
One case worth noting is that if you call a subroutine without arguments in the
condition of an if
statement or a for
loop (or similar constructs), you have
to include the parentheses, because otherwise the following block is parsed as an
argument to the function.
sub random-choice() {
Bool.pick;
}
# right way:
if random-choice() {
say 'You were lucky.';
}
# wrong way:
if random-choice {
say 'You were lucky.';
}
If you do happen to make this mistake, the Perl 6 compiler tries very hard to detect it. In the example above, it says
Function 'random-choice' needs parens to avoid gobbling block
and when it tries to parse the block for the if
-statement, it doesn't find
one:
Missing block (apparently claimed by 'random-choice')
When you have a sub called MAIN
, Perl 6 uses its signature to parse the
command line arguments and passes them to MAIN
.
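For instance, a hypothetical script with the following MAIN would accept --name and --count options (the option names are invented here):
# hypothetical usage:  perl6 greet.p6 --name=World --count=2
sub MAIN(Str :$name = 'you', Int :$count = 1) {
    say "Hello, $name!" for 1 .. $count;
}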
Multi subs are several subroutines with the same name but different signatures. When you call one, the candidate whose parameters best match the arguments is chosen.
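A minimal, made-up example of multi dispatch on the type of the argument:
multi sub describe(Int $x) { "an integer: $x" }
multi sub describe(Str $x) { "a string: $x"   }
say describe(42);        # an integer: 42
say describe('moin');    # a string: moin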
Classes and Objects
Class declarations follow the same syntactic schema as subroutine
declarations: the keyword class
, followed by the name, followed by the body
in curly braces:
class OutputCapture {
has @!lines;
method print(\s) {
@!lines.push(s);
}
method captured() {
@!lines.join;
}
}
By default, type names are scoped to the current namespace; however, you can
make a class lexically scoped by adding a my
in front of class
:
my class OutputCapture { ... }
Creating a new instance generally works by calling the new
method on the
type object. The new
method is inherited from the implicit parent class
Any
that all types get:
my $c = OutputCapture.new;
Per-instance state is stored in attributes, which are declared with the
has
keyword, as seen above in has @!lines
. Attributes are always private,
as indicated by the !
twigil. If you use the dot .
twigil in the
declaration instead, you have both the private attribute @!lines
and a
public, read-only accessor method:
my class OutputCapture {
has @.lines;
method print(\s) {
# the private name with ! still works
@!lines.push(s);
}
method captured() {
@!lines.join;
}
}
my $c = OutputCapture.new;
$c.print('42');
# use the `lines` accessor method:
say $c.lines; # [42]
When you declare attributes with the dot twigil, you can also initialize the
attributes from the constructor through named arguments, as in
OutputCapture.new( lines => [42] )
.
Private methods start with a !
and can only be called from inside the class
body as self!private-method
.
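A small, invented example of a private helper method:
class Report {
    has @.lines;
    # private helper; only callable as self!indent(...) from inside the class
    method !indent($text) {
        '    ' ~ $text;
    }
    method add($text) {
        @!lines.push(self!indent($text));
    }
}
my $r = Report.new;
$r.add('first entry');
say $r.lines;            # [    first entry]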
Methods are basically just subroutines, with two differences. The first is
that they get an implicit parameter called self
, which contains the object
the method is called on (which we call the invocant). The second is that
when you call a subroutine, the compiler searches for it in the
current lexical scope and in outer scopes. Methods, on the other hand,
are looked up in the class of the object and its superclasses.
Concurrency
Perl 6 provides high-level primitives for concurrency and parallel execution.
Instead of explicitly spawning new threads, you are encouraged to run a
computation with start
, which returns a
Promise. This is an object that
promises that in the future the computation will yield a result. The status
can thus be Planned
, Kept
or Broken
. You can chain promises, combine
them, and wait for them.
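A short illustrative example (the sleep is just a stand-in for real work):
my $p = start {
    sleep 0.5;           # pretend to do some work
    'computed value';
};
say $p.status;           # most likely still Planned at this point
await $p;
say $p.status;           # Kept
say $p.result;           # computed value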
In the background, a scheduler distributes such computations to operating system level threads. The default scheduler is a thread pool scheduler with an upper limit to the number of threads to use.
Communication between parallel computations should happen through thread-safe
data structures. Foremost among them are the
Channel, a thread-safe queue, and
Supply, Perl 6's implementation of the
Observer Pattern. Supplies
are very powerful, because you can transform them with methods such as map
,
grep
, throttle
or delayed
, and use their actor
semantics to ensure that a
consumer is run in only one thread at a time.
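For instance, a small made-up supply pipeline:
my $numbers = Supply.from-list(1 .. 10);
my $squares = $numbers.grep(* %% 2).map(* ** 2);
$squares.tap(&say);      # 4, 16, 36, 64, 100 (one value per line)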
Sun, 15 Jan 2017
Perl 6 By Example: Stateful Silent Cron
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
In the last two installments we've looked at silent-cron
, a wrapper around
external programs that silences them in case their exit status is zero. But
to make it really practical, it should also silence occasional failures.
External APIs fail, networks become congested, and other things happen that
prevent a job from succeeding, so some kind of retry mechanism is desirable.
In case of a cron job, cron already takes care of retrying a job on a regular
basis, so silent-cron
should just suppress occasional errors. On the other
hand, if a job fails consistently, this is usually something that an operator
or developer should look into, so it's a problem worth reporting.
To implement this functionality, silent-cron
needs to store persistent
state between separate runs. It needs to record the results from the current
run and then analyze if the failure history qualifies as "occasional".
Persistent Storage
The storage backend needs to write and retrieve structured data, and protect concurrent access to the state file with locking. A good library for such a storage backend is SQLite, a zero-maintenance SQL engine that's available as a C library. It's public domain software and in use in most major browsers, operating systems and even some airliners.
Perl 6 gives you access to SQLite's functionality through
DBIish, a generic database interface with
backend drivers for SQLite, MySQL, PostgreSQL and Oracle DB. To use it, first
make sure that SQLite3 is installed, including its header files. On a Debian-based
Linux system, for example, you can achieve this with apt-get install
libsqlite3-dev
. If you are using the Rakudo Star distribution, DBIish is
already available. If not, you can use one of the module installers to
retrieve and install it: panda install DBIish
or zef install DBIish
.
To use DBIish's SQLite backend, you first have to create a database handle by selecting the backend and supplying connection information:
use DBIish;
my $dbh = DBIish.connect('SQLite', :database('database-file.sqlite3'));
Connecting to a database file that does not yet exist creates that file.
One-off SQL statements can be executed directly on the database handle:
$dbh.do('INSERT INTO player (name) VALUES (?)', 'John');
The ?
in the SQL is a placeholder; the value for it is passed out-of-band as a separate
argument to the do
method, which avoids potential errors such as SQL
injection vulnerabilities.
Queries tend to work by first preparing a statement which returns a
statement handle. You can execute a statement once or multiple times, and
retrieve result rows after each execute
call:
my $sth = $dbh.prepare('SELECT id FROM player WHERE name = ?');
my %ids;
for <John Jack> -> $name {
$sth.execute($name);
%ids{ $name } = $sth.row[0];
}
$sth.finish;
Developing the Storage Backend
We shouldn't just stuff all the storage handling code into sub MAIN
. Instead, we
should carefully design a useful API for the storage backend. To
start with, we need only two pieces of functionality: inserting the result of a job
execution, and retrieving the most recent results.
Since silent-cron
can be used to guard multiple cron jobs on the same
machine, we might need something to distinguish the different jobs so that
one of them succeeding doesn't prevent error reporting for one that is constantly
failing. For that we introduce a job name, which can default to the
command (including arguments) being executed but which can be set explicitly
on the command line.
The API for the storage backend could look something like this:
my $repo = ExecutionResultRepository.new(
jobname => 'refresh cache',
statefile => 'silent-cron.sqlite3',
);
$repo.insert($result);
my @last-results = $repo.tail(5);
This API isn't specific to the SQLite backend at all; a storage backend that works with plain text files could have the exact same API.
Let's implement this API. First we need the class and the two attributes that should be obvious from the usage example above:
class ExecutionResultRepository {
has $.jobname is required;
has $.statefile is required;
# ... more code
To implement the insert
method, we need to connect to the database
and create the relevant table if it doesn't exist yet.
has $!db;
method !db() {
return $!db if $!db;
$!db = DBIish.connect('SQLite', :database($.statefile));
self!create-schema();
return $!db;
}
This code uses a private attribute $!db
to cache the database handle and a
private method !db
to create the handle if it doesn't exist yet.
Private methods are declared like ordinary methods, except that the name
starts with an exclamation mark. To call one, substitute the method call dot
for the exclamation mark, in other words, use self!db()
instead of self.db()
.
The !db
method also calls the next private method, !create-schema
, which
creates the storage table and some indexes:
method !create-schema() {
$!db.do(qq:to/SCHEMA/);
CREATE TABLE IF NOT EXISTS $table (
id INTEGER PRIMARY KEY,
jobname VARCHAR NOT NULL,
exitcode INTEGER NOT NULL,
timed_out INTEGER NOT NULL,
output VARCHAR NOT NULL,
executed TIMESTAMP NOT NULL DEFAULT (DATETIME('NOW'))
);
SCHEMA
$!db.do(qq:to/INDEX/);
CREATE INDEX IF NOT EXISTS {$table}_jobname_exitcode ON $table ( jobname, exitcode );
INDEX
$!db.do(qq:to/INDEX/);
CREATE INDEX IF NOT EXISTS {$table}_jobname_executed ON $table ( jobname, executed );
INDEX
}
Multi-line string literals are best written with the heredoc
syntax. qq:to/DELIMITER/
tells Perl 6 to finish parsing the current
statement so that you can still close the method call parenthesis and add the
statement-ending semicolon. The next line starts the string literal, which
goes on until Perl 6 finds the delimiter on a line on its own. From each line
of the string literal, as much leading whitespace is stripped as the
closing delimiter is indented.
For example
print q:to/EOS/;
Not indented
Indented four spaces
EOS
Produces the output
Not indented
Indented four spaces
Now that we have a working database connection and know that the database table exists, inserting a new record becomes easy:
method insert(ExecutionResult $r) {
self!db.do(qq:to/INSERT/, $.jobname, $r.exitcode, $r.timed-out, $r.output);
INSERT INTO $table (jobname, exitcode, timed_out, output)
VALUES(?, ?, ?, ?)
INSERT
}
Selecting the most recent records is a bit more work, partially because we need to convert the table rows into objects:
method tail(Int $count) {
my $sth = self!db.prepare(qq:to/SELECT/);
SELECT exitcode, timed_out, output
FROM $table
WHERE jobname = ?
ORDER BY executed DESC
LIMIT $count
SELECT
$sth.execute($.jobname);
$sth.allrows(:array-of-hash).map: -> %h {
ExecutionResult.new(
exitcode => %h<exitcode>,
timed-out => ?%h<timed_out>,
output => %h<output>,
);
}
}
The last statement in the tail
method deserves a bit of extra attention.
$sth.allrows(:array-of-hash)
produces the database rows as a list of hashes.
This list is lazy, that is, it's generated on-demand. Lazy lists are a very
convenient feature because they allow you to use iterators and lists with the
same API. For example when reading lines from a file, you can write for
$handle.lines -> $line { ... }
, and the lines
method doesn't have to load
the whole file into memory; instead it can read a line whenever it is
accessed.
$sth.allrows(...)
is lazy, and so is the .map
call that comes after it.
map
transforms a list one element at a time by calling the code object
that's passed to it. And that is done lazily as well. So SQLite only retrieves
rows from the database file when elements of the resulting list are actually
accessed.
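As a small illustration of laziness outside the database context:
my @squares = (1 .. Inf).map(* ** 2);    # lazy: nothing is computed yet
say @squares[^5];                        # (1 4 9 16 25)
say @squares[9];                         # 100, computed on demand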
Using the Storage Backend
With the storage API in place, it's time to use it:
multi sub MAIN(*@cmd, :$timeout, :$jobname is copy,
:$statefile='silent-cron.sqlite3', Int :$tries = 3) {
$jobname //= @cmd.Str;
my $result = run-with-timeout(@cmd, :$timeout);
my $repo = ExecutionResultRepository.new(:$jobname, :$statefile);
$repo.insert($result);
my @runs = $repo.tail($tries);
unless $result.is-success or @runs.grep({.is-success}) {
say "The last @runs.elems() runs of @cmd[] all failed, the last execution ",
$result.timed-out ?? "ran into a timeout"
!! "exited with code $result.exitcode()";
print "Output:\n", $result.output if $result.output;
}
exit $result.exitcode // 2;
}
Now a job that succeeds a few times and then fails up to two times in a row
doesn't produce any error output; only the third consecutive failure
produces output. You can override that number on the command line with --tries=5
.
Summary
We've discussed DBIish
, a database API with pluggable backend, and explored
using it with SQLite to store persistent data. In the process we also came
across lazy lists and a new form of string literals called heredocs.
Sun, 08 Jan 2017
Perl 6 By Example: Testing Silent Cron
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
The previous blog post left us with a bare-bones silent-cron implementation, but without tests. I probably sound like a broken record for bringing this up time and again, but I really want some tests when I start refactoring or extending my programs. And this time, getting the tests in is a bit harder, so I think it's worth discussing how to do it.
Refactoring
As a short reminder, this is what the program looks like:
#!/usr/bin/env perl6
sub MAIN(*@cmd, :$timeout) {
my $proc = Proc::Async.new(|@cmd);
my $collector = Channel.new;
for $proc.stdout, $proc.stderr -> $supply {
$supply.tap: { $collector.send($_) }
}
my $promise = $proc.start;
my $waitfor = $promise;
$waitfor = Promise.anyof(Promise.in($timeout), $promise)
if $timeout;
$ = await $waitfor;
$collector.close;
my $output = $collector.list.join;
if !$timeout || $promise.status ~~ Kept {
my $exitcode = $promise.result.exitcode;
if $exitcode != 0 {
say "Program @cmd[] exited with code $exitcode";
print "Output:\n", $output if $output;
}
exit $exitcode;
}
else {
$proc.kill;
say "Program @cmd[] did not finish after $timeout seconds";
sleep 1 if $promise.status ~~ Planned;
$proc.kill(9);
$ = await $promise;
exit 2;
}
}
There's logic in there for executing external programs with a timeout, and then there's logic for dealing with two possible outcomes. In terms of both testability and future extensions, it makes sense to factor out the execution of external programs into a subroutine. The result of this code is not a single value; we're potentially interested in the output it produced, the exit code, and whether it ran into a timeout. We could write a subroutine that returns a list or a hash of these values, but here I chose to write a small class instead:
class ExecutionResult {
has Int $.exitcode = -1;
has Str $.output is required;
has Bool $.timed-out = False;
method is-success {
!$.timed-out && $.exitcode == 0;
}
}
We've seen classes before, but this one has a few new features. Attributes
declared with the .
twigil automatically get an accessor method, so
has Int $.exitcode;
is roughly the same as
has Int $!exitcode;
method exitcode() { $!exitcode }
So it allows a user of the class to access the value in the attribute from the
outside. As a bonus, you can also initialize it from the standard constructor
as a named argument, ExecutionResult.new( exitcode => 42 )
. The exit code is
not a required attribute, because we can't know the exit code of a program
that has timed out. So with has Int $.exitcode = -1
we give it a default
value that applies if the attribute hasn't been initialized.
The output is a required attribute, so we mark it as such with is
required
. That's a trait. Traits are pieces of code that modify the behavior
of other things, here of an attribute. They crop up in several places, for
example in subroutine signatures (is copy
on a parameter), variable
declarations and classes. If you try to call ExecutionResult.new()
without
specifying an output
, you get an error like this:
The attribute '$!output' is required, but you did not provide a value for it.
Mocking and Testing
Now that we have a convenient way to return more than one value from a hypothetical subroutine, let's look at what this subroutine might look like:
sub run-with-timeout(@cmd, :$timeout) {
my $proc = Proc::Async.new(|@cmd);
my $collector = Channel.new;
for $proc.stdout, $proc.stderr -> $supply {
$supply.tap: { $collector.send($_) }
}
my $promise = $proc.start;
my $waitfor = $promise;
$waitfor = Promise.anyof(Promise.in($timeout), $promise)
if $timeout;
$ = await $waitfor;
$collector.close;
my $output = $collector.list.join;
if !$timeout || $promise.status ~~ Kept {
say "No timeout";
return ExecutionResult.new(
:$output,
:exitcode($promise.result.exitcode),
);
}
else {
$proc.kill;
sleep 1 if $promise.status ~~ Planned;
$proc.kill(9);
$ = await $promise;
return ExecutionResult.new(
:$output,
:timed-out,
);
}
}
The usage of Proc::Async
has remained the same, but instead of printing a message when an error occurs, the
routine now returns ExecutionResult
objects.
This simplifies the MAIN
sub quite a bit:
multi sub MAIN(*@cmd, :$timeout) {
my $result = run-with-timeout(@cmd, :$timeout);
unless $result.is-success {
say "Program @cmd[] ",
$result.timed-out ?? "ran into a timeout"
!! "exited with code $result.exitcode()";
print "Output:\n", $result.output if $result.output;
}
exit $result.exitcode // 2;
}
A new syntactic feature here is the ternary operator, CONDITION ??
TRUE-BRANCH !! FALSE-BRANCH
, which you might know from other programming
languages such as C or Perl 5 as CONDITION ? TRUE-BRANCH : FALSE-BRANCH
.
Finally, the logical defined-or operator LEFT // RIGHT
returns the LEFT
side if it's defined, and if not, runs the RIGHT
side and returns its
value. It works like the ||
and or
infix operators, except that those
check for the boolean value of the left, not whether they are defined.
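A few made-up lines illustrating the difference:
my $timeout;                 # undefined
say $timeout // 30;          # 30, because the left side is undefined
say 0 // 30;                 # 0, because 0 is defined (even though it's false)
say 0 || 30;                 # 30, because || looks at truth, not definedness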
In Perl 6, we distinguish between defined and true values. By
default, all instances are true and defined, and all type objects
are false and undefined.
Several built-in types override what they consider to be true. Numbers
that equal 0 evaluate to False
in a boolean context, as do
empty strings and empty containers such as arrays, hashes and sets.
On the other hand, only the built-in type Failure overrides definedness.
You can override the truth value of a custom type by implementing
a method Bool
(which should return True
or False
), and the
definedness with a method defined
.
Now we could start testing the sub run-with-timeout
by writing custom external
commands with defined characteristics (output, run time, exit code), but
that's rather fiddly to do in a reliable, cross-platform way. So instead I
want to replace Proc::Async
with a mock implementation, and give the sub a
way to inject that:
sub run-with-timeout(@cmd, :$timeout, :$executer = Proc::Async) {
my $proc = $executer.defined ?? $executer !! $executer.new(|@cmd);
# rest as before
Looking through sub run-with-timeout
, we can make a quick list of methods
that the stub Proc::Async
implementation needs: stdout
, stderr
, start
and kill
. Both stdout
and stderr
need to return a
Supply. The simplest thing that could
possibly work is to return a Supply that will emit just a single value:
my class Mock::Proc::Async {
has $.out = '';
has $.err = '';
method stdout {
Supply.from-list($.out);
}
method stderr {
Supply.from-list($.err);
}
Supply.from-list returns a Supply that will emit all the arguments passed to it; in this case just a single string.
The simplest possible implementation of kill
just does nothing:
method kill($?) {}
$?
in a signature is an optional argument ($foo?
) without a name.
Only one method remains that needs to be stubbed: start
. It's supposed to
return a Promise that, after a defined number of seconds, returns a Proc
object or a mock thereof. Since the code only calls the exitcode
method on
it, writing a stub for it is easy:
has $.exitcode = 0;
has $.execution-time = 1;
method start {
Promise.in($.execution-time).then({
(class {
has $.exitcode;
}).new(:$.exitcode);
});
}
Since we don't need the class for the mock Proc
anywhere else, we don't even
need to give it a name. class { ... }
creates an anonymous class, and the
.new
call on it creates a new object from it.
As mentioned before, a Proc
with a non-zero exit code throws an exception
when evaluated in void context, or sink context as we call it in Perl 6. We
can emulate this behavior by extending the anonymous class a bit:
class {
has $.exitcode;
method sink() {
die "mock Proc used in sink context";
}
}
With all this preparation in place, we can finally write some tests:
multi sub MAIN('test') {
use Test;
my class Mock::Proc::Async {
has $.exitcode = 0;
has $.execution-time = 0;
has $.out = '';
has $.err = '';
method kill($?) {}
method stdout {
Supply.from-list($.out);
}
method stderr {
Supply.from-list($.err);
}
method start {
Promise.in($.execution-time).then({
(class {
has $.exitcode;
method sink() {
die "mock Proc used in sink context";
}
}).new(:$.exitcode);
});
}
}
# no timeout, success
my $result = run-with-timeout([],
timeout => 2,
executer => Mock::Proc::Async.new(
out => 'mocked output',
),
);
isa-ok $result, ExecutionResult;
is $result.exitcode, 0, 'exit code';
is $result.output, 'mocked output', 'output';
ok $result.is-success, 'success';
# timeout
$result = run-with-timeout([],
timeout => 0.1,
executer => Mock::Proc::Async.new(
execution-time => 1,
out => 'mocked output',
),
);
isa-ok $result, ExecutionResult;
is $result.output, 'mocked output', 'output';
ok $result.timed-out, 'timeout reported';
nok $result.is-success, 'success';
}
This runs through two scenarios, one where a timeout is configured but not used (because the mocked external program exits first), and one where the timeout takes effect.
Improving Reliability and Timing
Relying on timing in tests is always unattractive. If the times are too short (or too close together), you risk sporadic test failures on slow or heavily loaded machines. If you use more conservative temporal spacing of tests, the tests can become very slow.
There's a module (not distributed with Rakudo) to alleviate this pain: Test::Scheduler provides a thread scheduler with virtualized time, allowing you to write the tests like this:
use Test::Scheduler;
my $*SCHEDULER = Test::Scheduler.new;
my $result = start run-with-timeout([],
timeout => 5,
executer => Mock::Proc::Async.new(
execution-time => 2,
out => 'mocked output',
),
);
$*SCHEDULER.advance-by(5);
$result = $result.result;
isa-ok $result, ExecutionResult;
# more tests here
This installs the custom scheduler, and $*SCHEDULER.advance-by(5)
instructs
it to advance the virtual time by 5 seconds, without having to wait five
actual seconds. At the time of writing (December 2016), Test::Scheduler
is a
rather new module, and it has a bug that prevents the second test case from
working this way.
Installing a Module
If you want to try out Test::Scheduler
, you need to install it first. If you
run Rakudo Star, it has already provided you with the panda
module installer. You can
use that to download and install the module for you:
$ panda install Test::Scheduler
If you don't have panda available, you can instead bootstrap zef
, an
alternative module installer:
$ git clone https://github.com/ugexe/zef.git
$ cd zef
$ perl6 -Ilib bin/zef install .
and then use zef
to install the module:
$ zef install Test::Scheduler
Summary
In this installment, we've seen attributes with accessors, the ternary
operator and anonymous classes. Testing of threaded code has been discussed,
and how a third-party module can help. Finally we had a very small glimpse
at the two module installers, panda
and zef
.
Sun, 01 Jan 2017
Perl 6 By Example: Silent Cron, a Cron Wrapper
Permanent link
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
On Linux and UNIX-Like systems, a program called cron periodically executes user-defined commands in the background. It is used for system maintenance tasks such as refreshing or removing caches, rotating and deleting old log files and so on.
If such a command produces any output, cron typically sends an email containing the output so that an operator can look at it and judge if some action is required.
But not all command line programs are written for usage with cron. For example they might produce output even on successful execution, and indicate failure through a non-zero exit code. Or they might hang, or otherwise misbehave.
To deal with such commands, we'll develop a small program called
silent-cron
, which wraps such commands and suppresses output when the
exit code is zero. It also allows you to specify a timeout that kills the
wrapped program if it takes too long:
$ silent-cron -- command-that-might-fail args
$ silent-cron --timeout=5 -- command-that-might-hang
Running Commands Asynchronously
When you want to run external commands, Perl 6 gives you basically two choices: run, a simple, synchronous interface, and Proc::Async, an asynchronous and slightly more complex option. Even though we will omit the timeout in the first iteration, we need to be aware that implementing the timeout is easier in the asynchronous interface, so that's what we'll use:
#!/usr/bin/env perl6
sub MAIN(*@cmd) {
my $proc = Proc::Async.new(@cmd);
my $collector = Channel.new;
for $proc.stdout, $proc.stderr -> $supply {
$supply.tap: { $collector.send($_) }
}
my $result = $proc.start.result;
$collector.close;
my $output = $collector.list.join;
my $exitcode = $result.exitcode;
if $exitcode != 0 {
say "Program @cmd[] exited with code $exitcode";
print "Output:\n", $output if $output;
}
exit $exitcode;
}
There's a big chunk of new features and concepts in here, so let's go through the code bit by bit.
sub MAIN(*@cmd) {
my $proc = Proc::Async.new(@cmd);
This collects all the command line arguments in the array variable @cmd
,
where the first element is the command to be executed, and any further
elements are arguments passed to this command.
The second line creates a new Proc::Async
instance, but doesn't yet run
the command.
We need to capture all output from the command; thus we capture the output
of the STDOUT and STDERR streams (file handles 1 and 2 on Linux), and
combine it into a single string. In the asynchronous API, STDOUT and STDERR
are modeled as objects of type Supply,
and hence are streams of events. Since supplies can emit events in parallel, we need
a thread-safe data structure for collecting the result, and Perl 6
conveniently provides a Channel
for that:
my $collector = Channel.new;
To actually get the output from the program, we need to tap into the STDOUT and STDERR streams:
for $proc.stdout, $proc.stderr -> $supply {
$supply.tap: { $collector.send($_) }
}
Each supply executes the block { $collector.send($_) }
for each string it
receives. The string can be a character, a line or something larger if the
stream is buffered. All we do with it is put the string into the
channel $collector
via the send
method.
Now that the streams are tapped, we can start the program and wait for it to finish:
my $result = $proc.start.result;
Proc::Async.start
executes the external process and returns a
Promise. A promise wraps a piece of code that
potentially runs on another thread, has a status (Planned
, Kept
or
Broken
), and once it's finished, a result. Accessing the result
automatically waits for the wrapped code to finish. Here the code is the
one that runs the external program, and the result is an object of type
Proc (which happens to be the same type that
the run()
function from the synchronous interface returns).
After this line, we can be sure that the external command has terminated,
and thus no more output will come from $proc.stdout
and $proc.stderr
.
Hence we can safely close the channel and access all its elements through
Channel.list
:
$collector.close;
my $output = $collector.list.join;
Finally it's time to check if the external command was successful -- by checking its exit code -- and to exit the wrapper with the command's exit code:
my $exitcode = $result.exitcode;
if $exitcode != 0 {
say "Program @cmd[] exited with code $exitcode";
print "Output:\n", $output if $output;
}
exit $exitcode;
Implementing Timeouts
The idiomatic way to implement timeouts in Perl 6 is to use the
Promise.anyof
combinator together with a timer:
sub MAIN(*@cmd, :$timeout) {
my $proc = Proc::Async.new(|@cmd);
my $collector = Channel.new;
for $proc.stdout, $proc.stderr -> $supply {
$supply.tap: { $collector.send($_) }
}
my $promise = $proc.start;
my $waitfor = $promise;
$waitfor = Promise.anyof(Promise.in($timeout), $promise)
if $timeout;
await $waitfor;
The initialization of $proc
hasn't changed. But instead of accessing
$proc.start.result
, we store the promise returned from $proc.start
. If
the user specified a timeout, we run this piece of code:
$waitfor = Promise.anyof(Promise.in($timeout), $promise)
Promise.in($seconds)
returns a promise that will be fulfilled in
$seconds
seconds. It's basically the same as start { sleep $seconds }
,
but the scheduler can be a bit smarter about not allocating a whole thread
just for sleeping.
Promise.anyof($p1, $p2)
returns a promise that is fulfilled as soon as
one of the arguments (which should also be promises) is fulfilled. So we
wait either until the external program finished, or until the sleep promise is
fulfilled.
With await $waitfor;
the program waits for the promise to be fulfilled
(or broken). When that is the case, we can't simply access
$promise.result
as before, because $promise
(which is the promise for
the external program) might not be fulfilled in the case of a timeout. So
we have to check the status of the promise first and only then can we
safely access $promise.result
:
if !$timeout || $promise.status ~~ Kept {
my $exitcode = $promise.result.exitcode;
if $exitcode != 0 {
say "Program @cmd[] exited with code $exitcode";
print "Output:\n", $output if $output;
}
exit $exitcode;
}
else {
...
}
In the else { ... }
branch, we need to handle the timeout case. This
might be as simple as printing a statement that a timeout has occurred, and
when silent-cron
exits immediately afterwards, that might be acceptable.
But we might want to do more in the future, so we should kill the external
program. And if the program doesn't terminate after the friendly kill
signal, it should receive a kill(9)
, which on UNIX systems forcefully
terminates the program:
else {
$proc.kill;
say "Program @cmd[] did not finish after $timeout seconds";
sleep 1 if $promise.status ~~ Planned;
$proc.kill(9);
await $promise;
exit 2;
}
await $promise
returns the result of the promise, so here a Proc
object.
Proc
has a safety feature built in that if the command returned with a
non-zero exit code, evaluating the object in void context throws an exception.
Since we explicitly handle the non-zero exit code in the code, we can suppress
the generation of this exception by assigning the return value from await
to
a dummy variable:
my $dummy = await $promise
Since we don't need the value, we can also assign it to an anonymous variable instead:
$ = await $promise
More on Promises
If you have worked with concurrent or parallel programs in other languages, you might have come across threads, locks, mutexes, and other low-level constructs. These exist in Perl 6 too, but their direct usage is discouraged.
The problem with such low-level primitives is that they don't compose well. You can have two libraries that use threads and work fine on their own, but lead to deadlocks when combined within the same program. Or different components might launch threads on their own, which can lead to too many threads and high memory consumption when several such components come together in the same process.
Perl 6 provides higher-level primitives. Instead of spawning a thread, you
use start
to run code asynchronously and the scheduler
decides which thread to run this on. If more start
calls
ask for code to be run than there are threads available, some of them will run serially.
Here is a very simple example of running a computation in the background:
sub count-primes(Int $upto) {
(1..$upto).grep(&is-prime).elems;
}
my $p = start count-primes 10_000;
say $p.status;
await $p;
say $p.result;
It gives this output:
Planned
1229
You can see that the main line of execution continued after the start
call,
and $p
immediately had a value -- the promise, with status Planned
.
As we've seen before, there are combinators for promises, anyof
and
allof
. You can also chain actions to a promise using the then
method:
sub count-primes(Int $upto) {
(1..$upto).grep(&is-prime).elems;
}
my $p1 = start count-primes 10_000;
my $p2 = $p1.then({ say .result });
await $p2;
If an exception is thrown inside asynchronously executing code, the status of
the promise becomes Broken
, and calling its .result
method re-throws
the exception.
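A small, made-up example of a broken promise:
my $p = start { die 'oh noes' };
await Promise.allof($p);     # allof is kept even if $p itself is broken
say $p.status;               # Broken
try { $p.result }            # .result re-throws the exception ...
say $!.message;              # ... which try caught: oh noes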
As a demonstration of the scheduler distributing tasks, let's consider a small Monte Carlo simulation to calculate an approximation for π. We generate a pair of random numbers between zero and one, and interpret them as dots in a square. A quarter circle with radius one covers the area of π/4, so the ratio of randomly placed dots within the quarter circle to the total number of dots approaches π/4, if we use enough dots.
sub pi-approx($iterations) {
my $inside = 0;
for 1..$iterations {
my $x = 1.rand;
my $y = 1.rand;
$inside++ if $x * $x + $y * $y <= 1;
}
return ($inside / $iterations) * 4;
}
my @approximations = (1..1000).map({ start pi-approx(80) });
await @approximations;
say @approximations.map({.result}).sum / @approximations;
The program starts one thousand computations asynchronously, but if you look at a system monitoring tool while it runs, you'll observe only 16 threads running. This magic number comes from the default thread scheduler, and we can override it by providing our own instance of a scheduler above the previous code:
my $*SCHEDULER = ThreadPoolScheduler.new(:max_threads(3));
For CPU bound tasks like this Monte Carlo Simulation, it is a good idea to limit the number of threads roughly to the number of (possibly virtual) CPU cores; if many threads are stuck waiting for IO, a higher number of threads can yield better performance.
Possible Extensions
If you want to play with silent-cron
, you could add a retry mechanism.
If a command fails because of an external dependency (like an API or an NFS
share), it might take time for that external dependency to recover.
Hence you should add a quadratic or exponential backoff, that is, the wait
time between retries should increase quadratically (1, 4, 9, 16, 25, ...) or
exponentially (1, 2, 4, 8, 16, 32, ...).
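A minimal sketch of such a retry wrapper with exponential backoff, reusing the run-with-timeout sub and the ExecutionResult class from earlier (the sub name and the sleep times are just placeholders):
sub run-with-retries(@cmd, Int :$tries = 5, :$timeout) {
    my $result;
    for ^$tries -> $attempt {
        $result = run-with-timeout(@cmd, :$timeout);
        last if $result.is-success;
        # exponential backoff: wait 1, 2, 4, 8, ... seconds before the next attempt
        sleep 2 ** $attempt if $attempt < $tries - 1;
    }
    return $result;
}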
Summary
We've seen an asynchronous API for running external programs and how to use Promises to implement timeouts. We've also discussed how promises are distributed to threads by a scheduler, allowing you to start an arbitrary number of promises without overloading your computer.