Categories

Ads

Your advertisement could be here -- Contact me!.

Sun, 02 Apr 2017

Perl 6 By Example: Idiomatic Use of Inline::Python


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In the two previous installments, we've seen Python libraries being used in Perl 6 code through the Inline::Python module. Here we will explore some options to make the Perl 6 code more idiomatic and closer to the documentation of the Python modules.

Types of Python APIs

Python is an object-oriented language, so many APIs involve method calls, which Inline::Python helpfully automatically translates for us.

But the objects must come from somewhere and typically this is by calling a function that returns an object, or by instantiating a class. In Python, those two are really the same under the hood, since instantiating a class is the same as calling the class as if it were a function.

An example of this (in Python) would be

from matplotlib.pyplot import subplots
result = subplots()

But the matplotlib documentation tends to use another, equivalent syntax:

import matplotlib.pyplot as plt
result = plt.subplots()

This uses the subplots symbol (class or function) as a method on the module matplotlib.pyplot, which the import statement aliases to plt. This is a more object-oriented syntax for the same API.

Mapping the Function API

The previous code examples used this Perl 6 code to call the subplots symbol:

my $py = Inline::Python.new;
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
    $py.call('matplotlib.pyplot', $name, |c);
}

my ($figure, $subplots) = plot('subplots');

If we want to call subplots() instead of plot('subplots'), and bar(args) instead of `plot('bar', args), we can use a function to generate wrapper functions:

my $py = Inline::Python.new;

sub gen(Str $namespace, *@names) {
    $py.run("import $namespace");

    return @names.map: -> $name {
        sub (|args) {
            $py.call($namespace, $name, |args);
        }
    }
}

my (&subplots, &bar, &legend, &title, &show)
    = gen('matplotlib.pyplot', <subplots bar legend title show>);

my ($figure, $subplots) = subplots();

# more code here

legend($@plots, $@top-authors);
title('Contributions per day');
show();

This makes the functions' usage quite nice, but comes at the cost of duplicating their names. One can view this as a feature, because it allows the creation of different aliases, or as a source for bugs when the order is messed up, or a name misspelled.

How could we avoid the duplication should we choose to create wrapper functions?

This is where Perl 6's flexibility and introspection abilities pay off. There are two key components that allow a nicer solution: the fact that declarations are expressions and that you can introspect variables for their names.

The first part means you can write mysub my ($a, $b), which declares the variables $a and $b, and calls a function with those variables as arguments. The second part means that $a.VAR.name returns a string '$a', the name of the variable.

Let's combine this to create a wrapper that initializes subroutines for us:

sub pysub(Str $namespace, |args) {
    $py.run("import $namespace");

    for args[0] <-> $sub {
        my $name = $sub.VAR.name.substr(1);
        $sub = sub (|args) {
            $py.call($namespace, $name, |args);
        }
    }
}

pysub 'matplotlib.pyplot',
    my (&subplots, &bar, &legend, &title, &show);

This avoids duplicating the name, but forces us to use some lower-level Perl 6 features in sub pysub. Using ordinary variables means that accessing their .VAR.name results in the name of the variable, not the name of the variable that's used on the caller side. So we can't use slurpy arguments as in

sub pysub(Str $namespace, *@subs)

Instead we must use |args to obtain the rest of the arguments in a Capture. This doesn't flatten the list of variables passed to the function, so when we iterate over them, we must do so by accessing args[0]. By default, loop variables are read-only, which we can avoid by using <-> instead of -> to introduce the signature. Fortunately, that also preserves the name of the caller side variable.

An Object-Oriented Interface

Instead of exposing the functions, we can also create types that emulate the method calls on Python modules. For that we can implement a class with a method FALLBACK, which Perl 6 calls for us when calling a method that is not implemented in the class:

class PyPlot is Mu {
    has $.py;
    submethod TWEAK {
        $!py.run('import matplotlib.pyplot');
    }
    method FALLBACK($name, |args) {
        $!py.call('matplotlib.pyplot', $name, |args);
    }
}

my $pyplot = PyPlot.new(:$py);
my ($figure, $subplots) = $pyplot.subplots;
# plotting code goes here
$pyplot.legend($@plots, $@top-authors);

$pyplot.title('Contributions per day');
$pyplot.show;

Class PyPlot inherits directly from Mu, the root of the Perl 6 type hierarchy, instead of Any, the default parent class (which in turn inherits from Mu). Any introduces a large number of methods that Perl 6 objects get by default and since FALLBACK is only invoked when a method is not present, this is something to avoid.

The method TWEAK is another method that Perl 6 calls automatically for us, after the object has been fully instantiated. All-caps method names are reserved for such special purposes. It is marked as a submethod, which means it is not inherited into subclasses. Since TWEAK is called at the level of each class, if it were a regular method, a subclass would call it twice implicitly. Note that TWEAK is only supported in Rakudo version 2016.11 and later.

There's nothing specific to the Python package matplotlib.pyplot in class PyPlot, except the namespace name. We could easily generalize it to any namespace:

class PythonModule is Mu {
    has $.py;
    has $.namespace;
    submethod TWEAK {
        $!py.run("import $!namespace");
    }
    method FALLBACK($name, |args) {
        $!py.call($!namespace, $name, |args);
    }
}

my $pyplot = PythonModule.new(:$py, :namespace<matplotlib.pyplot>);

This is one Perl 6 type that can represent any Python module. If instead we want a separate Perl 6 type for each Python module, we could use roles, which are optionally parameterized:

role PythonModule[Str $namespace] is Mu {
    has $.py;
    submethod TWEAK {
        $!py.run("import $namespace");
    }
    method FALLBACK($name, |args) {
        $!py.call($namespace, $name, |args);
    }
}

my $pyplot = PythonModule['matplotlib.pyplot'].new(:$py);

Using this approach, we can create type constraints for Python modules in Perl 6 space:

sub plot-histogram(PythonModule['matplotlib.pyplot'], @data) {
    # implementation here
}

Passing in any other wrapped Python module than matplotlib.pyplot results in a type error.

Summary

Perl 6 offers enough flexibility to create function and method call APIs around Python modules. With a bit of meta programming, we can emulate the typical Python APIs close enough that translating from the Python documentation to Perl 6 code becomes easy.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 26 Mar 2017

Perl 6 By Example: Stacked Plots with Matplotlib


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In a previous episode, we've explored plotting git statistics in Perl 6 using matplotlib.

Since I wasn't quite happy with the result, I want to explore using stacked plots for presenting the same information. In a regular plot, the y coordiante of each plotted value is proportional to its value. In a stacked plot, it is the distance to the previous value that is proportional to its value. This is nice for values that add up to a total that is also interesting.

Matplotlib offers a method called stackplot for that. Contrary to multiple plot calls on subplot object, it requires a shared x axis for all data series. So we must construct one array for each author of git commits, where dates with no value come out as zero.

As a reminder, this is what the logic for extracting the stats looked like in the first place:

my $proc = run :out, <git log --date=short --pretty=format:%ad!%an>;
my (%total, %by-author, %dates);
for $proc.out.lines -> $line {
    my ( $date, $author ) = $line.split: '!', 2;
    %total{$author}++;
    %by-author{$author}{$date}++;
    %dates{$date}++;
}
my @top-authors = %total.sort(-*.value).head(5)>>.key;

And some infrastructure for plotting with matplotlib:

my $py = Inline::Python.new;
$py.run('import datetime');
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
    $py.call('matplotlib.pyplot', $name, |c);
}
sub pydate(Str $d) {
    $py.call('datetime', 'date', $d.split('-').map(*.Int));
}

my ($figure, $subplots) = plot('subplots');
$figure.autofmt_xdate();

So now we have to construct an array of arrays, where each inner array has the values for one author:

my @dates = %dates.keys.sort;
my @stack = $[] xx @top-authors;

for @dates -> $d {
    for @top-authors.kv -> $idx, $author {
        @stack[$idx].push: %by-author{$author}{$d} // 0;
    }
}

Now plotting becomes a simple matter of a method call, followed by the usual commands adding a title and showing the plot:

$subplots.stackplot($[@dates.map(&pydate)], @stack);
plot('title', 'Contributions per day');
plot('show');

The result (again run on the zef source repository) is this:

Stacked plot of zef contributions over time

Comparing this to the previous visualization reveals a discrepancy: There were no commits in 2014, and yet the stacked plot makes it appear this way. In fact, the previous plots would have shown the same "alternative facts" if we had chosen lines instead of points. It comes from matplotlib (like nearly all plotting libraries) interpolates linearly between data points. But in our case, a date with no data points means zero commits happened on that date.

To communicate this to matplotlib, we must explicitly insert zero values for missing dates. This can be achieved by replacing

my @dates = %dates.keys.sort;

with the line

my @dates = %dates.keys.minmax;

The minmax method finds the minimal and maximal values, and returns them in a Range. Assigning the range to an array turns it into an array of all values between the minimal and the maximal value. The logic for assembling the @stack variable already maps missing values to zero.

The result looks a bit better, but still far from perfect:

Stacked plot of zef contributions over time, with missing dates mapped to zero

Thinking more about the problem, contributions from separate days should not be joined together, because it produces misleading results. Matplotlib doesn't support adding a legend automatically to stacked plots, so this seems to be to be a dead end.

Since a dot plot didn't work very well, let's try a different kind of plot that represents each data point separately: a bar chart, or more specifically, a stacked bar chart. Matplotlib offers the bar plotting method, and a named parameter bottom can be used to generate the stacking:

my @dates = %dates.keys.sort;
my @stack = $[] xx @top-authors;
my @bottom = $[] xx @top-authors;

for @dates -> $d {
    my $bottom = 0;
    for @top-authors.kv -> $idx, $author {
        @bottom[$idx].push: $bottom;
        my $value = %by-author{$author}{$d} // 0;
        @stack[$idx].push: $value;
        $bottom += $value;
    }
}

We need to supply color names ourselves, and set the edge color of the bars to the same color, otherwise the black edge color dominates the result:

my $width = 1.0;
my @colors = <red green blue yellow black>;
my @plots;

for @top-authors.kv -> $idx, $author {
    @plots.push: plot(
        'bar',
        $[@dates.map(&pydate)],
        @stack[$idx],
        $width,
        bottom => @bottom[$idx],
        color => @colors[$idx],
        edgecolor => @colors[$idx],
    );
}
plot('legend', $@plots, $@top-authors);

plot('title', 'Contributions per day');
plot('show');

This produces the first plot that's actually informative and not misleading (provided you're not color blind):

Stacked bar plot of zef contributions over time

If you want to improve the result further, you could experiment with limiting the number of bars by lumping together contributions by week or month (or maybe $n-day period).

Next, we'll investigate ways to make the matplotlib API more idiomatic to use from Perl 6 code.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 19 Mar 2017

Perl 6 By Example: Plotting using Matplotlib and Inline::Python


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Occasionally I come across git repositories, and want to know how active they are, and who the main developers are.

Let's develop a script that plots the commit history, and explore how to use Python modules in Perl 6.

Extracting the Stats

We want to plot the number of commits by author and date. Git makes it easy for us to get to this information by giving some options to git log:

my $proc = run :out, <git log --date=short --pretty=format:%ad!%an>;
my (%total, %by-author, %dates);
for $proc.out.lines -> $line {
    my ( $date, $author ) = $line.split: '!', 2;
    %total{$author}++;
    %by-author{$author}{$date}++;
    %dates{$date}++;
}

run executes an external command, and :out tells it to capture the command's output, and makes it available as $proc.out. The command is a list, with the first element being the actual executable, and the rest of the elements are command line arguments to this executable.

Here git log gets the options --date short --pretty=format:%ad!%an, which instructs it to print produce lines like 2017-03-01!John Doe. This line can be parsed with a simple call to $line.split: '!', 2, which splits on the !, and limits the result to two elements. Assigning it to a two-element list ( $date, $author ) unpacks it. We then use hashes to count commits by author (in %total), by author and date (%by-author) and finally by date. In the second case, %by-author{$author} isn't even a hash yet, and we can still hash-index it. This is due to a feature called autovivification, which automatically creates ("vivifies") objects where we need them. The use of ++ creates integers, {...} indexing creates hashes, [...] indexing and .push creates arrays, and so on.

To get from these hashes to the top contributors by commit count, we can sort %total by value. Since this sorts in ascending order, sorting by the negative value gives the list in descending order. The list contains Pair objects, and we only want the first five of these, and only their keys:

my @top-authors = %total.sort(-*.value).head(5).map(*.key);

For each author, we can extract the dates of their activity and their commit counts like this:

my @dates  = %by-author{$author}.keys.sort;
my @counts = %by-author{$author}{@dates};

The last line uses slicing, that is, indexing an array with list to return a list elements.

Plotting with Python

Matplotlib is a very versatile library for all sorts of plotting and visualization. It's written in Python and for Python programs, but that won't stop us from using it in a Perl 6 program.

But first, let's take a look at a basic plotting example that uses dates on the x axis:

import datetime
import matplotlib.pyplot as plt

fig, subplots = plt.subplots()
subplots.plot(
    [datetime.date(2017, 1, 5), datetime.date(2017, 3, 5), datetime.date(2017, 5, 5)],
    [ 42, 23, 42 ],
    label='An example',
)
subplots.legend(loc='upper center', shadow=True)
fig.autofmt_xdate()
plt.show()

To make this run, you have to install python 2.7 and matplotlib. You can do this on Debian-based Linux systems with apt-get install -y python-matplotlib. The package name is the same on RPM-based distributions such as CentOS or SUSE Linux. MacOS users are advised to install a python 2.7 through homebrew and macports, and then use pip2 install matplotlib or pip2.7 install matplotlib to get the library. Windows installation is probably easiest through the conda package manager, which offers pre-built binaries of both python and matplotlib.

When you run this scripts with python2.7 dates.py, it opens a GUI window, showing the plot and some controls, which allow you to zoom, scroll, and write the plot graphic to a file:

Basic matplotlib plotting window

Bridging the Gap

The Rakudo Perl 6 compiler comes with a handy library for calling foreign functions, which allows you to call functions written in C, or anything with a compatible binary interface.

The Inline::Python library uses the native call functionality to talk to python's C API, and offers interoperability between Perl 6 and Python code. At the time of writing, this interoperability is still fragile in places, but can be worth using for some of the great libraries that Python has to offer.

To install Inline::Python, you must have a C compiler available, and then run

$ zef install Inline::Python

(or the same with panda instead of zef, if that's your module installer).

Now you can start to run Python 2 code in your Perl 6 programs:

use Inline::Python;

my $py = Inline::Python.new;
$py.run: 'print("Hello, Pyerl 6")';

Besides the run method, which takes a string of Python code and execute it, you can also use call to call Python routines by specifying the namespace, the routine to call, and a list of arguments:

use Inline::Python;

my $py = Inline::Python.new;
$py.run('import datetime');
my $date = $py.call('datetime', 'date', 2017, 1, 31);
$py.call('__builtin__', 'print', $date);    # 2017-01-31

The arguments that you pass to call are Perl 6 objects, like three Int objects in this example. Inline::Python automatically translates them to the corresponding Python built-in data structure. It translate numbers, strings, arrays and hashes. Return values are also translated in opposite direction, though since Python 2 does not distinguish properly between byte and Unicode strings, Python strings end up as buffers in Perl 6.

Object that Inline::Python cannot translate are handled as opaque objects on the Perl 6 side. You can pass them back into python routines (as shown with the print call above), or you can also call methods on them:

say $date.isoformat().decode;               # 2017-01-31

Perl 6 exposes attributes through methods, so Perl 6 has no syntax for accessing attributes from foreign objects directly. If you try to access for example the year attribute of datetime.date through the normal method call syntax, you get an error.

say $date.year;

Dies with

'int' object is not callable

Instead, you have to use the getattr builtin:

say $py.call('__builtin__', 'getattr', $date, 'year');

Using the Bridge to Plot

We need access to two namespaces in python, datetime and matplotlib.pyplot, so let's start by importing them, and write some short helpers:

my $py = Inline::Python.new;
$py.run('import datetime');
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
    $py.call('matplotlib.pyplot', $name, |c);
}

sub pydate(Str $d) {
    $py.call('datetime', 'date', $d.split('-').map(*.Int));
}

We can now call pydate('2017-03-01') to create a python datetime.date object from an ISO-formatted string, and call the plot function to access functionality from matplotlib:

my ($figure, $subplots) = plot('subplots');
$figure.autofmt_xdate();

my @dates = %dates.keys.sort;
$subplots.plot:
    $[@dates.map(&pydate)],
    $[ %dates{@dates} ],
    label     => 'Total',
    marker    => '.',
    linestyle => '';

The Perl 6 call plot('subplots') corresponds to the python code fig, subplots = plt.subplots(). Passing arrays to python function needs a bit extra work, because Inline::Python flattens arrays. Using an extra $ sigil in front of an array puts it into an extra scalar, and thus prevents the flattening.

Now we can actually plot the number of commits by author, add a legend, and plot the result:

for @top-authors -> $author {
    my @dates = %by-author{$author}.keys.sort;
    my @counts = %by-author{$author}{@dates};
    $subplots.plot:
        $[ @dates.map(&pydate) ],
        $@counts,
        label     => $author,
        marker    =>'.',
        linestyle => '';
}


$subplots.legend(loc=>'upper center', shadow=>True);

plot('title', 'Contributions per day');
plot('show');

When run in the zef git repository, it produces this plot:

Contributions to zef, a Perl 6 module installer

Summary

We've explored how to use the python library matplotlib to generate a plot from git contribution statistics. Inline::Python provides convenient functionality for accessing python libraries from Perl 6 code.

In the next installment, we'll explore ways to improve both the graphics and the glue code between Python and Perl 6.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 12 Mar 2017

What's a Variable, Exactly?


Permanent link

When you learn programming, you typically first learn about basic expressions, like 2 * 21, and then the next topic is control structures or variables. (If you start with functional programming, maybe it takes you a bit longer to get to variables).

So, every programmer knows what a variable is, right?

Turns out, it might not be that easy.

Some people like to say that in ruby, everything is an object. Well, a variable isn't really an object. The same holds true for other languages.

But let's start from the bottom up. In a low-level programming language like C, a local variable is a name that the compiler knows, with a type attached. When the compiler generates code for the function that the variable is in, the name resolves to an address on the stack (unless the compiler optimizes the variable away entirely, or manages it through a CPU register).

So in C, the variable only exists as such while the compiler is running. When the compiler is finished, and the resulting executable runs, there might be some stack offset or memory location that corresponds to our understanding of the variable. (And there might be debugging symbols that allows some mapping back to the variable name, but that's really a special case).

In case of recursion, a local variable can exist once for each time the function is called.

Closures

In programming languages with closures, local variables can be referenced from inner functions. They can't generally live on the stack, because the reference keeps them alive. Consider this piece of Perl 6 code (though we could write the same in Javascript, Ruby, Perl 5, Python or most other dynamic languages):

sub outer() {
    my $x = 42;
    return sub inner() {
        say $x;
    }
}

my &callback = outer();
callback();

The outer function has a local (lexical) variable $x, and the inner function uses it. So once outer has finished running, there's still an indirect reference to the value stored in this variable.

They say you can solve any problem in computer science through another layer of indirection, and that's true for the implementation of closures. The &callback variable, which points to a closure, actually stores two pointers under the hood. One goes to the static byte code representation of the code, and the second goes to a run-time data structure called a lexical pad, or short lexpad. Each time you invoke the outer function, a new instance of the lexpad is created, and the closure points to the new instance, and the always the same static code.

But even in dynamic languages with closures, variables themselves don't need to be objects. If a language forbids the creation of variables at run time, the compiler knows what variables exist in each scope, and can for example map each of them to an array index, so the lexpad becomes a compact array, and an access to a variable becomes an indexing operation into that array. Lexpads generally live on the heap, and are garbage collected (or reference counted) just like other objects.

Lexpads are mostly performance optimizations. You could have separate runtime representations of each variable, but then you'd have to have an allocation for each variable in each function call you perform, whereas which are generally much slower than a single allocation of the lexpad.

The Plot Thickens

To summarize, a variable has a name, a scope, and in languages that support it, a type. Those are properties known to the compiler, but not necessarily present at run time. At run time, a variable typically resolves to a stack offset in low-level languages, or to an index into a lexpad in dynamic languages.

Even in languages that boldly claim that "everything is an object", a variable often isn't. The value inside a variable may be, but the variable itself typically not.

Perl 6 Intricacies

The things I've written above generalize pretty neatly to many programming languages. I am a Perl 6 developer, so I have some insight into how Perl 6 implements variables. If you don't resist, I'll share it with you :-).

Variables in Perl 6 typically come with one more level of indirection, we which call a container. This allows two types of write operations: assignment stores a value inside a container (which again might be referenced to by a variable), and binding, which places either a value or a container directly into variable.

Here's an example of assignment and binding in action:

my $x;
my $y;
# assignment:
$x = 42;
$y = 'a string';

say $x;     # => 42
say $y;     # => a string

# binding:
$x := $y;

# now $x and $y point to the same container, so that assigning to one
# changes the other:
$y = 21;
say $x;     # => 21

Why, I hear you cry?

There are three major reasons.

The first is that makes assignment something that's not special. For example in python, if you assign to anything other than a plain variable, the compiler translates it to some special method call (obj.attr = x to setattr(obj, 'attr', x), obj[idx] = x to a __setitem__ call etc.). In Perl 6, if you want to implement something you can assign to, you simply return a container from that expression, and then assignment works naturally.

For example an array is basically just a list in which the elements are containers. This makes @array[$index] = $value work without any special cases, and allows you to assign to the return value of methods, functions, or anything else you can think of, as long as the expression returns a container.

The second reason for having both binding and assignment is that it makes it pretty easy to make things read-only. If you bind a non-container into a variable, you can't assign to it anymore:

my $a := 42;
$a = "hordor";  # => Cannot assign to an immutable value

Perl 6 uses this mechanism to make function parameters read-only by default.

Likewise, returning from a function or method by default strips the container, which avoids accidental action-at-a-distance (though an is rw annotation can prevent that, if you really want it).

This automatic stripping of containers also makes expressions like $a + 2 work, independently of whether $a holds an integer directly, or a container that holds an integer. (In the implementation of Perl 6's core types, sometimes this has to be done manually. If you ever wondered what nqp::decont does in Rakudo's source code, that's what).

The third reason relates to types.

Perl 6 supports gradual typing, which means you can optionally annotate your variables (and other things) with types, and Perl 6 enforces them for you. It detects type errors at compile time where possible, and falls back to run-time checking types.

The type of a variable only applies to binding, but it inherits this type to its default container. And the container type is enforced at run time. You can observe this difference by binding a container with a different constraint:

my Any $x;
my Int $i;
$x := $i;
$x = "foo";     # => Type check failed in assignment to $i; expected Int but got Str ("foo")

Int is a subtype of Any, which is why the binding of $i to $x succeeds. Now $x and $i share a container that is type-constrained to Int, so assigning a string to it fails.

Did you notice how the error message mentions $i as the variable name, even though we've tried to assign to $x? The variable name in the error message is really a heuristic, which works often enough, but sometimes fails. The container that's shared between $x and $i has no idea which variable you used to access it, it just knows the name of the variable that created it, here $i.

Binding checks the variable type, not the container type, so this code doesn't complain:

my Any $x;
my Int $i;
$x := $i;
$x := "a string";

This distinction between variable type and container type might seem weird for scalar variables, but it really starts to make sense for arrays, hashes and other compound data structures that might want to enforce a type constraint on its elements:

sub f($x) {
    $x[0] = 7;
}
my Str @s;
f(@s);

This code declares an array whose element all must be of type Str (or subtypes thereof). When you pass it to a function, that function has no compile-time knowledge of the type. But since $x[0] returns a container with type constraint Str, assigning an integer to it can produce the error you expect from it.

Summary

Variables typically only exists as objects at compile time. At run time, they are just some memory location, either on the stack or in a lexical pad.

Perl 6 makes the understanding of the exact nature of variables a bit more involved by introducing a layer of containers between variables and values. This offers great flexibility when writing libraries that behaves like built-in classes, but comes with the burden of additional complexity.

[/perl-6] Permanent link

Sun, 05 Mar 2017

Perl 6 By Example: A Unicode Search Tool


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Every so often I have to identify or research some Unicode characters. There's a tool called uni in the Perl 5 distribution App::Uni.

Let's reimplement its basic functionality in a few lines of Perl 6 code and use that as an occasion to talk about Unicode support in Perl 6.

If you give it one character on the command line, it prints out a description of the character:

$ uni 🕐
🕐 - U+1f550 - CLOCK FACE ONE OCLOCK

If you give it a longer string instead, it searches in the list of Unicode character names and prints out the same information for each character whose description matches the search string:

$ uni third|head -n5
⅓ - U+02153 - VULGAR FRACTION ONE THIRD
⅔ - U+02154 - VULGAR FRACTION TWO THIRDS
↉ - U+02189 - VULGAR FRACTION ZERO THIRDS
㆛ - U+0319b - IDEOGRAPHIC ANNOTATION THIRD MARK
𐄺 - U+1013a - AEGEAN WEIGHT THIRD SUBUNIT

Each line corresponds to what Unicode calls a "code point", which is usually a character on its own, but occasionally also something like a U+00300 - COMBINING GRAVE ACCENT, which, combined with a a - U+00061 - LATIN SMALL LETTER A makes the character à.

Perl 6 offers a method uniname in both the classes Str and Int that produces the Unicode code point name for a given character, either in its direct character form, or in the form the code point number. With that, the first part of uni's desired functionality:

#!/usr/bin/env perl6

use v6;

sub format-codepoint(Int $codepoint) {
    sprintf "%s - U+%05x - %s\n",
        $codepoint.chr,
        $codepoint,
        $codepoint.uniname;
}

multi sub MAIN(Str $x where .chars == 1) {
    print format-codepoint($x.ord);
}

Let's look at it in action:

$ uni ø
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE

The chr method turns a code point number into the character and ord is the reverse, in other words: from character to code point number.

The second part, searching in all Unicode character names, works by brute-force enumerating all possible characters and searching through their uniname:

multi sub MAIN($search is copy) {
    $search.=uc;
    for 1..0x10FFFF -> $codepoint {
        if $codepoint.uniname.contains($search) {
            print format-codepoint($codepoint);
        }
    }
}

Since all character names are in upper case, the search term is first converted to upper case with $search.=uc, which is short for $search = $search.uc. By default, parameters are read only, which is why its declaration here uses is copy to prevent that.

Instead of this rather imperative style, we can also formulate it in a more functional style. We could think of it as a list of all characters, which we whittle down to those characters that interest us, to finally format them the way we want:

multi sub MAIN($search is copy) {
    $search.=uc;
    print (1..0x10FFFF).grep(*.uniname.contains($search))
                       .map(&format-codepoint)
                       .join;
}

To make it easier to identify (rather than search for) a string of more than one character, an explicit option can help disambiguate:

multi sub MAIN($x, Bool :$identify!) {
    print $x.ords.map(&format-codepoint).join;
}

Str.ords returns the list of code points that make up the string. With this multi candidate of sub MAIN in place, we can do something like

$ uni --identify øre
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
r - U+00072 - LATIN SMALL LETTER R
e - U+00065 - LATIN SMALL LETTER E

Code Points, Grapheme Clusters and Bytes

As alluded to above, not all code points are fully-fledged characters on their own. Or put another way, some things that we visually identify as a single character are actually made up of several code points. Unicode calls these sequences of one base character and potentially several combining characters as a grapheme cluster.

Strings in Perl 6 are based on these grapheme clusters. If you get a list of characters in string with $str.comb, or extract a substring with $str.substr(0, 4), match a regex against a string, determine the length, or do any other operation on a string, the unit is always the grapheme cluster. This best fits our intuitive understanding of what a character is and avoids accidentally tearing apart a logical character through a substr, comb or similar operation:

my $s = "ø\c[COMBINING TILDE]";
say $s;         # ø̃
say $s.chars;   # 1

The Uni type is akin to a string and represents a sequence of codepoints. It is useful in edge cases, but doesn't support the same wealth of operations as Str. The typical way to go from Str to a Uni value is to use one of the NFC, NFD, NFKC, or NFKD methods, which yield a Uni value in the normalization form of the same name.

Below the Uni level you can also represent strings as bytes by choosing an encoding. If you want to get from string to the byte level, call the encode method:

my $bytes = 'Perl 6'.encode('UTF-8');

UTF-8 is the default encoding and also the one Perl 6 assumes when reading source files. The result is something that does the Blob role; you can access individual bytes with positional indexing, such as $bytes[0]. The decode method helps you to convert a Blob to a Str.

Numbers

Number literals in Perl 6 aren't limited to the Arabic digits we are so used to in the English speaking part of the world. All Unicode code points that have the Decimal_Number (short Nd) property are allowed, so you can for example use Bengali digits:

say ৪২;             # 42

The same holds true for string to number conversions:

say "৪২".Int;       # 42

For other numeric code points you can use the unival method to obtain its numeric value:

say "\c[TIBETAN DIGIT HALF ZERO]".unival;

which produces the output -0.5 and also illustrates how to use a codepoint by name inside a string literal.

Other Unicode Properties

The uniprop method in type Str returns the general category by default:

say "ø".uniprop;                            # Ll
say "\c[TIBETAN DIGIT HALF ZERO]".uniprop;  # No

The return value needs some Unicode knowledge in order to make sense of it, or one could read Unicode's Technical Report 44 for the gory details. Ll stands for Letter_Lowercase, No is Other_Number. This is what Unicode calls the General Category, but you can ask the uniprop (or uniprop-bool method if you're only interested in a boolean result) for other properties as well:

say "a".uniprop-bool('ASCII_Hex_Digit');    # True
say "ü".uniprop-bool('Numeric_Type');       # False
say ".".uniprop("Word_Break");              # MidNumLet

Collation

Sorting strings starts to become complicated when you're not limited to ASCII characters. Perl 6's sort method uses the cmp infix operator, which does a pretty standard lexicographic comparison based on the codepoint number.

If you need to use a more sophisticated collation algorithm, Rakudo 2017.02 and newer offer the Unicode Collation Algorithm as an experimental feature:

my @list = <a ö ä Ä o ø>;
say @list.sort;                     # (a o Ä ä ö ø)

use experimental :collation;
say @list.collate;                  # (a ä Ä o ö ø)
$*COLLATION.set(:tertiary(False));
say @list.collate;                  # (a Ä ä o ö ø)

The default sort considers any character with diacritics to be larger than ASCII characters, because that's how they appear in the code point list. On the other hand, collate knows that characters with diacritics belong directly after their base character, which is not perfect in every language, but internally a good compromise.

For Latin-based scripts, the primary sorting criteria is alphabetic, the secondary diacritics, and the third is case. $*COLLATION.set(:tertiary(False)) thus makes .collate ignore case, so it doesn't force lower case characters to come before upper case characters anymore.

At the time of writing, language specification of collation is not yet implemented.

Summary

Perl 6 takes languages other than English very seriously, and goes to great lengths to facilitate working with them and the characters they use.

This includes basing strings on grapheme clusters rather than code points, support for non-Arabic digits in numbers, and access to large parts of Unicode database through built-in methods.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 26 Feb 2017

Perl 6 By Example: Functional Refactorings for Directory Visualization Code


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In the last installment we've seen some code that generated tree maps and flame graphs from a tree of directory and file sizes.

There's a pattern that occurs three times in that code: dividing an area based on the size of the files and directories in the tree associated with the area.

Extracting such common code into a function is a good idea, but it's slightly hindered by the fact that there is custom code inside the loop that's part of the common code. Functional programming offers a solution: Put the custom code inside a separate function and have the common code call it.

Applying this technique to the tree graph flame graph looks like this:

sub subdivide($tree, $lower, $upper, &todo) {
    my $base = ($upper - $lower ) / $tree.total-size;
    my $var  = $lower;
    for $tree.children -> $child {
        my $incremented = $var + $base * $child.total-size;
        todo($child, $var, $incremented);
        $var = $incremented,
    }
}

sub flame-graph($tree, :$x1!, :$x2!, :$y!, :$height!) {
    return if $y >= $height;
    take 'rect' => [
        x      => $x1,
        y      => $y,
        width  => $x2 - $x1,
        height => 15,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
    ];
    return if $tree ~~ File;
    subdivide( $tree, $x1, $x2, -> $child, $x1, $x2 {
        flame-graph( $child, :$x1, :$x2, :y($y + 15), :$height );
    });
}

sub tree-map($tree, :$x1!, :$x2!, :$y1!, :$y2) {
    return if ($x2 - $x1) * ($y2 - $y1) < 20;
    take 'rect' => [
        x      => $x1,
        y      => $y1,
        width  => $x2 - $x1,
        height => $y2 - $y1,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name],
    ];
    return if $tree ~~ File;

    if $x2 - $x1 > $y2 - $y1 {
        # split along the x-axis
        subdivide $tree, $x1, $x2, -> $child, $x1, $x2 {
            tree-map $child, :$x1, :$x2, :$y1, :$y2;
        }
    }
    else {
        # split along the y-axis
        subdivide $tree, $y1, $y2, -> $child, $y1, $y2 {
            tree-map $child, :$x1, :$x2, :$y1, :$y2;
        }
    }
}

The newly introduced subroutine subdivide takes a directory tree, a start point and an end point, and finally a code object &todo. For each child of the directory tree it calculates the new coordinates and then calls the &todo function.

The usage in subroutine flame-graph looks like this:

subdivide( $tree, $x1, $x2, -> $child, $x1, $x2 {
flame-graph( $child, :$x1, :$x2, :y($y + 15), :$height );
});

The code object being passed to subdivide starts with ->, which introduces the signature of a block. The code block recurses into flame-graph, adding some extra arguments, and turning two positional arguments into named arguments along the way.

This refactoring shortened the code and made it overall more pleasant to work with. But there's still quite a bit of duplication between tree-map and flame-graph: both have an initial termination condition, a take of a rectangle, and then a call or two to subdivide. If we're willing to put all the small differences into small, separate functions, we can unify it further.

If we pass all those new functions as arguments to each call, we create an unpleasantly long argument list. Instead, we can use those functions to generate the previous functions flame-graph and tree-map:

sub svg-tree-gen(:&terminate!, :&base-height!, :&subdivide-x!, :&other!) {
    sub inner($tree, :$x1!, :$x2!, :$y1!, :$y2!) {
        return if terminate(:$x1, :$x2, :$y1, :$y2);
        take 'rect' => [
            x      => $x1,
            y      => $y1,
            width  => $x2 - $x1,
            height => base-height(:$y1, :$y2),
            style  => "fill:" ~ random-color(),
            title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
        ];
        return if $tree ~~ File;
        if subdivide-x(:$x1, :$y1, :$x2, :$y2) {
            # split along the x-axis
            subdivide $tree, $x1, $x2, -> $child, $x1, $x2 {
                inner($child, :$x1, :$x2, :y1(other($y1)), :$y2);
            }
        }
        else {
            # split along the y-axis
            subdivide $tree, $y1, $y2, -> $child, $y1, $y2 {
                inner($child, :x1(other($x1)), :$x2, :$y1, :$y2);
            }
        }
    }
}

my &flame-graph = svg-tree-gen
    terminate   => -> :$y1, :$y2, | { $y1 > $y2 },
    base-height => -> | { 15 },
    subdivide-x => -> | { True },
    other       => -> $y1 { $y1 + 15 },

my &tree-map = svg-tree-gen
    terminate   => -> :$x1, :$y1, :$x2, :$y2 { ($x2 - $x1) * ($y2 - $y1) < 20 },
    base-height => -> :$y1, :$y2 {  $y2 - $y1 },
    subdivide-x => -> :$x1, :$x2, :$y1, :$y2 { $x2 - $x1 > $y2 - $y1 },
    other       => -> $a { $a },
    ;

So there's a new function svg-tree-gen, which returns a function. The behavior of the returned function depends on the four small functions that svg-tree-gen receives as arguments.

The first argument, terminate, determines under what condition the inner function should terminate early. For tree-map that's when the area is below 20 pixels, for flame-graph when the current y-coordinate $y1 exceeds the height of the whole image, which is stored in $y2. svg-tree-gen always calls this function with the four named arguments x1, x2, y1 and y2, so the terminate function must ignore the x1 and x2 values. It does this by adding | as a parameter, which is an anonymous capture. Such a parameter can bind arbitrary positional and named arguments, and since it's an anonymous parameter, it discards all the values.

The second configuration function, base-height, determines the height of the rectangle in the base case. For flame-graph it's a constant, so the configuration function must discard all arguments, again with a |. For tree-graph, it must return the difference between $y2 and $y1, as before the refactoring.

The third function determines when to subdivide along the x-axis. Flame graphs always divide along the x-axis, so -> | { True } accomplishes that. Our simplistic approach to tree graphs divides along the longer axis, so only along the x-axis if $x2 - $x1 > $y2 - $y1.

The fourth and final function we pass to svg-tree-gen calculates the coordinate of the axis that isn't being subdivided. In the case of flame-graph that's increasing over the previous value by the height of the bars, and for tree-map it's the unchanged coordinate, so we pass the identity function -> $a { $a }.

The inner function only needs a name because we need to call it from itself recursively; otherwise an anonymous function sub ($tree, :$x1!, :$x2!, :$y1!, :$y2!) { ... } would have worked fine.

Now that we have very compact definitions of flame-graph and tree-map, it's a good time to play with some of the parameters. For example we can introduce a bit of margin in the flame graph by having the increment in other greater than the bar height in base-height:

my &flame-graph = svg-tree-gen
    base-height => -> | { 15 },
    other       => -> $y1 { $y1 + 16 },
    # rest as before

Another knob to turn is to change the color generation to something more deterministic, and make it configurable from the outside:

sub svg-tree-gen(:&terminate!, :&base-height!, :&subdivide-x!, :&other!,
                 :&color=&random-color) {
    sub inner($tree, :$x1!, :$x2!, :$y1!, :$y2!) {
        return if terminate(:$x1, :$x2, :$y1, :$y2);
        take 'rect' => [
            x      => $x1,
            y      => $y1,
            width  => $x2 - $x1,
            height => base-height(:$y1, :$y2),
            style  => "fill:" ~ color(:$x1, :$x2, :$y1, :$y2),
            title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
        ];
        # rest as before
}

We can, for example, keep state within the color generator and return a slightly different color during each iteration:

sub color-range(|) {
    state ($r, $g, $b) = (0, 240, 120);
    $r = ($r + 5) % 256;
    $g = ($g + 10) % 256;
    $b = ($b + 15) % 256;
    return "rgb($r,$g,$b)";
}

state variables keep their values between calls to the same subroutine and their initialization runs only on the first call. So this function slightly increases the lightness in each color channel for each invocation, except when it reaches 256, where the modulo operator % resets it back to a small value.

If we plug this into our functions by passing color => &color-range to the calls to svg-tree-gen, we get much less chaotic looking output:

Tree map with deterministic color generation

And the flame graph:

Flame graph with deterministic color generation and one pixel margin between
bars

More Language Support for Functional Programming

As you've seen in the examples above, functional programming typically involves writing lots of small functions. Perl 6 has some language features that make it very easy to write such small functions.

A common task is to write a function that calls a particular method on its argument, as we've seen here:

method total-size() {
    $!total-size //= $.size + @.children.map({.total-size}).sum;
    #                                        ^^^^^^^^^^^^^
}

This can be abbreviated to *.total-size:

method total-size() {
    $!total-size //= $.size + @.children.map(*.total-size).sum;
}

This works for chains of method calls too, so you could write @.children.map(*.total-size.round) if total-size returned a fractional number and you wanted to the call .round method on the result.

There are more cases where you can replace an expression with the "Whatever" star * to create a small function. To create a function that adds 15 to its argument, you can write * + 15 instead of -> $a { $a + 15 }.

If you need to write a function to just call another function, but pass more arguments to the second function, you can use the method assuming. For example -> $x { f(42, $x } can be replaced with &f.assuming(42). This works also for named arguments, so -> $x { f($x, height => 42 ) } can be replaced with &f.assuming(height => 42).

Summary

Functional programming offers techniques for extracting common logic into separate functions. The desired differences in behavior can be encoded in more functions that you pass in as arguments to other functions.

Perl 6 supports functional programming by making functions first class, so you can pass them around as ordinary objects. It also offers closures (access to outer lexical variables from functions), and various shortcuts that make it more pleasant to write short functions.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 19 Feb 2017

Perl 6 By Example: A File and Directory Usage Graph


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


A File and Directory Usage Graph

You bought a shiny new 2TB disk just a short while ago, and you're already getting low disk space warnings. What's taking up all that space?

To answer this question, and experiment a bit with data visualization, let's write a small tool that visualizes which files use up how much disk space.

To do that, we must first recursively read all directories and files in a given directory, and record their sizes. To get a listing of all elements in a directory, we can use the dir function, which returns a lazy list of IO::Path objects.

We distinguish between directories, which can have child entries, and files, which don't. Both can have a direct size, and in the case of directories also a total size, which includes files and subdirectories, recursively:

class File {
    has $.name;
    has $.size;
    method total-size() { $.size }
}

class Directory {
    has $.name;
    has $.size;
    has @.children;
    has $!total-size;
    method total-size() {
        $!total-size //= $.size + @.children.map({.total-size}).sum;
    }
}

sub tree(IO::Path $path) {
    if $path.d {
        return Directory.new(
            name     => $path.basename,
            size     => $path.s,
            children => dir($path).map(&tree),
        );
    }
    else {
        return File.new(
            name => $path.Str,
            size => $path.s,
        );
    }
}

Method total-size in class Directory uses the construct $var //= EXPR´. The//stands for *defined-OR*, so it returns the left-hand side if it has a defined value. Otherwise, it evalutes and returns the value ofEXPR`. Combined with the assignment operator, it evaluates the right-hand side only if the variable is undefined, and then stores the value of the expression in the variable. That's a short way to write a cache.

The code for reading a file tree recursively uses the d and s methods on IO::Path. d returns True for directories, and false for files. s returns the size. (Note that .s on directories used to throw an exception in older Rakudo versions. You must use Rakudo 2017.01-169 or newer for this to work; if you are stuck on an older version of Rakudo, you could hard code the size of a directory to a typical block size, like 4096 bytes. It typically won't skew your results too much).

Just to check that we've got a sensible data structure, we can write a short routine that prints it recursively, with indention to indicate nesting of directory entries:

sub print-tree($tree, Int $indent = 0) {
    say ' ' x $indent, format-size($tree.total-size), '  ', $tree.name;
    if $tree ~~ Directory {
        print-tree($_, $indent + 2) for $tree.children
    }
}

sub format-size(Int $bytes) {
    my @units = flat '', <k M G T P>;
    my @steps = (1, { $_ * 1024 } ... *).head(6);
    for @steps.kv -> $idx, $step {
        my $in-unit = $bytes / $step;
        if $in-unit < 1024 {
            return sprintf '%.1f%s', $in-unit, @units[$idx];
        }
    }
}

sub MAIN($dir = '.') {
    print-tree(tree($dir.IO));
}

The subroutine print-tree is pretty boring, if you're used to recursion. It prins out the name and size of the current node, and if the current node is a directory, recurses into each children with an increased indention. The indention is applied through the x string repetition operator, which when called as $string x $count repeates the $string for $count times.

To get a human-readable repretation of the size of a number, format-size knows a list of six units: the empty string for one, k (kilo) for 1024, M (mega) for 1024*1024 and so on. This list is stored in the array @units. The multiplies assoziated with each unit is stored in @steps, which is iniitliazed through the series operator. .... Its structure is INITIAL, CALLABLE ... LIMIT, and repeatedly applies CALLABLE first to the initial value, and then to next value generated and so on, until it hits LIMIT. The limit here is *, a special term called Whatever, which in this means it's unlimited. So the sequence operator returns a lazy, potentially infinite list, and the tailing .head(6) call limits it to 6 values.

To find the most appropriate unit to print the size in, we have to iterate over both the values and in the indexes of the array, which for @steps.kv -> $idx, $step { .. } accomplishes. sprintf, known from other programming languages, does the actual formatting to one digit after the dot, and appends the unit.

Generating a Tree Map

One possible visualization of file and directory sizes is a tree map, which represents each directory as a rectangle, and a each file inside it as a rectangle inside directory's rectangle. The size of each rectangle is proportional to the size of the file or directory it represents.

We'll generate an SVG file containing all those rectangles. Modern browsers support displaying those files, and also show mouse-over texts for each rectangle. This alleviates the burden to actually label the rectangnles, which can be quite a hassle.

To generate the SVG, we'll use the SVG module, which you can install with

$ zef install SVG

or

$ panda install SVG

depending on the module installer you have available.

This module provides a single static method, to which you pass nested pairs. Pairs whose values are arrays are turned into XML tags, and other pairs are turned into attributes. For example this Perl 6 script

use SVG;
print SVG.serialize(
    :svg[
        width => 100,
        height => 20,
        title => [
            'example',
        ]
    ],
);

produces this output:

<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:svg="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     width="100"
     height="20">
        <title>example</title>
</svg>

(without the indention). The xmlns-tags are helpfully added by the SVG module, and are necessary for programs to recognize the file as SVG.

To return to the tree maps, a very simple way to lay out the rectangle is to recurse into areas, and for each area subdivide it either horizontally or vertically, depending on which axis is longer:

sub tree-map($tree, :$x1!, :$x2!, :$y1!, :$y2) {
    # do not produce rectangles for small files/dirs
    return if ($x2 - $x1) * ($y2 - $y1) < 20;

    # produce a rectangle for the current file or dir
    take 'rect' => [
        x      => $x1,
        y      => $y1,
        width  => $x2 - $x1,
        height => $y2 - $y1,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name],
    ];
    return if $tree ~~ File;

    if $x2 - $x1 > $y2 - $y1 {
        # split along the x axis
        my $base = ($x2 - $x1) / $tree.total-size;
        my $new-x = $x1;
        for $tree.children -> $child {
            my $increment = $base * $child.total-size;
            tree-map(
                $child,
                x1 => $new-x,
                x2 => $new-x + $increment,
                :$y1,
                :$y2,
            );
            $new-x += $increment;
        }
    }
    else {
        # split along the y axis
        my $base = ($y2 - $y1) / $tree.total-size;
        my $new-y = $y1;
        for $tree.children -> $child {
            my $increment = $base * $child.total-size;
            tree-map(
                $child,
                :$x1,
                :$x2,
                y1 => $new-y,
                y2 => $new-y + $increment,
            );
            $new-y += $increment;
        }
    }
}

sub random-color {
    return 'rgb(' ~ (1..3).map({ (^256).pick }).join(',') ~ ')';
}

sub MAIN($dir = '.', :$type="flame") {
    my $tree = tree($dir.IO);
    use SVG;
    my $width = 1024;
    my $height = 768;
    say SVG.serialize(
        :svg[
            :$width,
            :$height,
            | gather tree-map $tree, x1 => 0, x2 => $width, y1 => 0, y2 => $height
        ]
    );
}


Tree map of an example directory, with random colors and a mouse-over hover identifying one of the files.

The generated file is not pretty, due to the random colors, and due to some files being identified as very narrow rectangles. But it does make it obvious that there are a few big files, and many mostly small files in a directory (which happens to be the .git directory of a repository). Viewing a file in a browser shows the name of the file on mouse over.

How did we generate this file?

Sub tree-map calls take to adds elements to a result list, so it must be called in the context of a gather statement. gather { take 1; take 2 } returns a lazy list of two elements, 1, 2. But the take calls don't have to occur in the lexical scope of the gather, they can be in any code that's directly or indirectly called from the gather. We call that the dynamic scope.

The rest of sub tree-map is mostly straight-forward. For each direction in which the remaining rectangle can be split, we calculate a base unit that signifies how many pixels a byte should take up. This is used to split up the current canvas into smaller ones, and use those to recurse into tree-map.

The random color generation uses ^256 to create a range from 0 to 256 (exclusive), and .pick returns a random element from this range. The result is a random CSS color string like rgb(120,240,5).

In sub MAIN, the gather returns a list, which would normally be nested inside the outer array. The pipe symbol | in :svg[ ..., | gather ... ] before the gather prevents the normal nesting, and flattens the list into the outer array.

Flame Graphs

The disadvantage of tree maps as generated before is that the human brain isn't very good at comparing rectangle sizes of different aspect ratios, so if their ratio of width to height is very different. Flame graphs prevent this perception errors by showing file sizes as horizontal bars. The vertical arrangement indicates the nesting of directories and files inside other directories. The disadvantage is that less of the available space is used for visualizing the file sizes.

Generating flame graphs is easier than tree maps, because you only need to subdivide in one direction, whereas the height of each bar is fixed, here to 15 pixel:

sub flame-graph($tree, :$x1!, :$x2!, :$y!, :$height!) {
    return if $y >= $height;
    take 'rect' => [
        x      => $x1,
        y      => $y,
        width  => $x2 - $x1,
        height => 15,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
    ];
    return if $tree ~~ File;

    my $base = ($x2 - $x1) / $tree.total-size;
    my $new-x = $x1;

    for $tree.children -> $child {
        my $increment = $base * $child.total-size;
        flame-graph(
            $child,
            x1 => $new-x,
            x2 => $new-x + $increment,
            y => $y + 15,
            :$height,
        );
        $new-x += $increment;
    }
}

We can add a switch to sub MAIN to call either tree-map or flame-graph, depending on a command line option:

sub MAIN($dir = '.', :$type="flame") {
    my $tree = tree($dir.IO);
    use SVG;
    my $width = 1024;
    my $height = 768;
    my &grapher = $type eq 'flame'
            ?? { flame-graph $tree, x1 => 0, x2 => $width, y => 0, :$height }
            !! { tree-map    $tree, x1 => 0, x2 => $width, y1 => 0, y2 => $height }
    say SVG.serialize(
        :svg[
            :$width,
            :$height,
            | gather grapher()
        ]
    );
}

Since SVG's coordinate system places the zero of the vertical axis at the top, this actually produces an inverted flame graph, sometimes called icicle graph:


Inverted flame graph with random colors, where the width of each bar represents a file/directory size, and the vertical position the nesting inside a directory.

Summary

We've explored tree maps and flame graphs to visualize which files and directories use up how much disk space.

But the code contains quite some duplications. Next week we'll explore techniques from functional programming to reduce code duplication. We'll also try to make the resulting files a bit prettier.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 12 Feb 2017

Perl 6 By Example: Generating Good Parse Errors from a Parser


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Good error messages are paramount to the user experience of any product. Parsers are no exception to this. Consider the difference between the message "Square bracket [ on line 5 closed by curly bracket } on line 5", in contrast to Python's lazy and generic "SyntaxError: invalid syntax".

In addition to the textual message, knowing the location of the parse error helps tremendously in figuring out what's wrong.

We'll explore how to generate better parsing error messages from a Perl 6 grammar, using the INI file parse from the previous blog posts as an example.

Failure is Normal

Before we start, it's important to realize that in a grammar-based parser, it's normal for regex to fail to match. Even in an overall successful parse.

Let's recall a part of the parser:

token block { [<pair> | <comment>]* }
token section { <header> <block> }
token TOP { <block> <section>* }

When this grammar matches against the string

key=value
[header]
other=stuff

then TOP calls block, which calls both pair and comment. The pair match succeeds, the comment match fails. No big deal. But since there is a * quantifier in token block, it tries again to match pair or comment. neither succeeds, but the overall match of token block still succeeds.

A nice way to visualize passed and failed submatches is to install the Grammar::Tracer module (zef install Grammar::Tracer or panda install Grammar::Tracer), and simple add the statement use Grammar::Tracer before the grammar definition. This produces debug output showing which rules matched and which didn't:

TOP
|  block
|  |  pair
|  |  |  key
|  |  |  * MATCH "key"
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  value
|  |  |  * MATCH "value"
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  ws
|  |  |  * MATCH ""
|  |  * MATCH "key=value\n"
|  |  pair
|  |  |  key
|  |  |  * FAIL
|  |  * FAIL
|  |  comment
|  |  * FAIL
|  * MATCH "key=value\n"
|  section
...

Detecting Harmful Failure

To produce good parsing error messages, you must distinguish between expected and unexpected parse failures. As explained above, a match failure of a single regex or token is not generally an indication of a malformed input. But you can identify points where you know that once the regex engine got this far, the rest of the match must succeed.

If you recall pair:

rule pair { <key>  '='  <value> \n+ }

we know that if a key was parsed, we really expect the next character to be an equals sign. If not, the input is malformed.

In code, this looks like this:

rule pair {
    <key> 
    [ '=' || <expect('=')> ]
     <value> \n+
}

|| is a sequential alternative, which first tries to match the subregex on the left-hand side, and only executes the right-hand side if that failed. On the other hand, | executes all alternatives notionally in parallel, and takes the long match.

So now we have to define expect:

method expect($what) {
    die "Cannot parse input as INI file: Expected $what";
}

Yes, you can call methods just like regexes, because regexes really are methods under the hood. die throws an exception, so now the malformed input justakey produces the error

Cannot parse input as INI file: Expected =

followed by a backtrace. That's already better than "invalid syntax", though the position is still missing. Inside method expect, we can find the current parsing position through method pos, a method supplied by the implicit parent class Grammar that the grammar declaration brings with it.

We can use that to improve the error message a bit:

method expect($what) {
    die "Cannot parse input as INI file: Expected $what at character {self.pos}";
}

Providing Context

For larger inputs, we really want to print the line number. To calculate that, we need to get hold of the target string, which is available as method target:

method expect($what) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: Expected $what at line @lines.elems(), after '@lines[*-1]'";
}

This brings us from the "meh" realm of error messages to quite good.

IniFile.parse(q:to/EOI/);
key=value
[section]
key_without_value
more=key
EOI

now dies with

Cannot parse input as INI file: Expected = at line 3, after 'key_without_value'

You can refine method expect more, for example by providing context both before and after the position of the parse failure.

And of course you have to apply the [ thing || <expect('thing')> ] pattern at more places inside the regex to get better error messages.

Finally you can provide different kinds of error messages too. For example when parsing a section header, once the initial [ is parsed, you likely don't want an error message "expected rest of section header", but rather "malformed section header, at line ...":

rule pair {
    <key> 
    [ '=' || <expect('=')> ] 
    [ <value> || <expect('value')>]
     \n+
}
token header { 
     '[' 
     [ ( <-[ \[ \] \n ]>+ )  ']'
         || <error("malformed section header")> ]
}
...

method expect($what) {
    self.error("expected $what");
}

method error($msg) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: $msg at line @lines.elems(), after '@lines[*-1]'";
}

Since Rakudo Perl 6 uses grammars to parse Perl 6 input, you can use Rakudo's own grammar as source of inspiration for more ways to make error reporting even better.

Summary

To generate good error messages from a parser, you need to distinguish between expected and unexpected match failures. The sequential alternative || is a tool you can use to turn unexpected match failures into error messages by raising an exception from the second branch of the alternative.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 05 Feb 2017

Perl 6 By Example: Improved INI Parsing with Grammars


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Last week we've seen a collection of regexes that can parse a configuration file in the INI format that's popular in world of Microsoft Windows applications.

Here we'll explore grammars, a feature that groups regexes into a class-like structure, and how to extract structured data from a successful match.

Grammars

A grammar is class with some extra features that makes it suitable for parsing text. Along with methods and attributes you can put regexes into a grammar.

This is what the INI file parser looks like when formulated as a grammar:

grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

You can use it to parse some text by calling the parse method, which uses regex or token TOP as the entry point:

my $result = IniFile.parse($text);

Besides the standardized entry point, a grammar offers more advantages. You can inherit from it like from a normal class, thus bringing even more reusability to regexes. You can group extra functionality together with the regexes by adding methods to the grammar. And then there are some mechanisms in grammars that can make your life as a developer easier.

One of them is dealing with whitespace. In INI files, horizontal whitespace is generally considered to be insignificant, in that key=value and key = value lead to the same configuration of the application. So far we've dealt with that explicitly by adding \h* to token pair. But there are place we haven't actually considered. For example it's OK to have a comment that's not at start of the line.

The mechanism that grammars offer is that you can define a rule called ws, and when you declare a token with rule instead of token (or enable this feature in regex through the :sigspace modifier), Perl 6 inserts implicit <ws> calls for you where there is whitespace in the regex definition:

grammar IniFile {
    token ws { \h* }
    rule pair { <key>  '='  <value> \n+ }
    # rest as before
}

This might not be worth the effort for a single rule that needs to parse whitespace, but when there are more, this really pays off by keeping whitespace parsing in a singles space.

Note that you should only parse insignificant whitespace in token ws. For example for INI files, newlines are significant, so ws shouldn't match them.

Extracting Data from the Match

So far the IniFile grammar only checks whether a given input matches the grammar or not. But when it does match, we really want the result of the parse in a data structure that's easy to use. For example we could translate this example INI file:

key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff

Into this data structure of nested hashes:

{
    _ => {
        key1 => "value2"
    },
    section1 => {
        key2 => "value2",
        key3 => "with spaces"
    },
    section2 => {
        more => "stuff"
    }
}

Key-value pairs from outside of any section show up in the _ top-level key.

The result from the IniFile.parse call is a Match object that has (nearly) all the information necessary to extract the desired match. If you turn a Match object into a string, it becomes the matched string. But there's more. You can use it like a hash to extract the matches from named submatches. For example if the top-level match from

token TOP { <block> <section>* }

produces a Match object $m, then $m<block> is again a Match object, this one from the match of the call of token block´. And$m

is a list ofMatchobjects from the repeated calls to tokensection. So aMatch` is really a tree of matches.

We can walk this data structure to extract the nested hashes. Token header matches a string like "[section1]\n", and we're only interested in"section1". To get to the inner part, we can modify token header` by inserting a pair of round parenthesis around the subregex whose match we're interested in:

token header { '[' ( <-[ \[ \] \n ]>+ ) ']' \n+ }
#                  ^^^^^^^^^^^^^^^^^^^^  a capturing group

That's a capturing group, and we can get its match by using the top-level match for header as an array, and accessing its first element. This leads us to the full INI parser:

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input);
    unless $m {
        die "The input is not a valid INI file.";
    }

    sub block(Match $m) {
        my %result;
        for $m<block><pair> -> $pair {
            %result{ $pair<key>.Str } = $pair<value>.Str;
        }
        return %result;
    }

    my %result;
    %result<_> = hash-from-block($m);
    for $m<section> -> $section {
        %result{ $section<header>[0].Str } = hash-from-block($section);
    }
    return %result;
}

This top-down approach works, but it requires a very intimate understanding of the grammar's structure. Which means that if you change the structure during maintenance, you'll have a hard time figuring out how to change the data extraction code.

So Perl 6 offers a bottom-up approach as well. It allows you to write a data extraction or action method for each regex, token or rule. The grammar engine passes in the match object as the single argument, and the action method can call the routine make to attach a result to the match object. The result is available through the .made method on the match object.

This execution of action methods happens as soon as a regex matches successfully, which means that an action method for a regex can rely on the fact that the action methods for subregex calls have already run. For example when the rule pair { <key> '=' <value> \n+ } is being executed, first token key matches successfully, and its action method runs immediately afterwards. Then token value matches, and its action method runs too. Then finally rule pair itself can match successfully, so its action method can rely on $m<key>.made and $m<value>.made being available, assuming that the match result is stored in variable $m.

Speaking of variables, a regex match implicitly stores its result in the special variable $/, and it is custom to use $/ as parameter in action methods. And there is a shortcut for accessing named submatches: instead of writing $/<key>, you can write $<key>. With this convention in mind, the action class becomes:

class IniFile::Actions {
    method key($/)     { make $/.Str }
    method value($/)   { make $/.Str }
    method header($/)  { make $/[0].Str }
    method pair($/)    { make $<key>.made => $<value>.made }
    method block($/)   { make $<pair>.map({ .made }).hash }
    method section($/) { make $<header>.made => $<block>.made }
    method TOP($/)     {
        make {
            _ => $<block>.made,
            $<section>.map: { .made },
        }
    }
}

The first two action methods are really simple. The result of a key or value match is simply the string that matched. For a header, it's just the substring inside the brackets. Fittingly, a pair returns a Pair object, composed from key and value. Method block constructs a hash from all the lines in the block by iterating over each pair submatch, extracting the already attached Pair object. One level above that in the match tree, section takes that hash and pairs it with the name of section, extracted from $<header>.made. Finally the top-level action method gathers the sectionless key-value pairs under they key _ as well as all the sections, and returns them in a hash.

In each method of the action class, we only rely on the knowledge of the first level of regexes called directly from the regex that corresponds to the action method, and the data types that they .made. Thus when you refactor one regex, you also have to change only the corresponding action method. Nobody needs to be aware of the global structure of the grammar.

Now we just have to tell Perl 6 to actually use the action class:

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input, :actions(IniFile::Actions));
    unless $m {
        die "The input is not a valid INI file.";
    }

    return $m.made
}

If you want to start parsing with a different rule than TOP (which you might want to do in a test, for example), you can pass a named argument rule to method parse:

sub parse-ini(Str $input, :$rule = 'TOP') {
    my $m = IniFile.parse($input,
        :actions(IniFile::Actions),
        :$rule,
    );
    unless $m {
        die "The input is not a valid INI file.";
    }

    return $m.made
}

say parse-ini($ini).perl;

use Test;

is-deeply parse-ini("k = v\n", :rule<pair>), 'k' => 'v',
    'can parse a simple pair';
done-testing;

To better encapsulate all the parsing functionality within the grammar, we can turn parse-ini into a method:

grammar IniFile {
    # regexes/tokens unchanged as before

    method parse-ini(Str $input, :$rule = 'TOP') {
        my $m = self.parse($input,
            :actions(IniFile::Actions),
            :$rule,
        );
        unless $m {
            die "The input is not a valid INI file.";
        }

        return $m.made
    }
}

# Usage:

my $result = IniFile.parse-ini($text);

To make this work, the class IniFile::Actions either has to be declared before the grammar, or it needs to be pre-declared with class IniFile::Action { ... } at the top of the file (with literal three dots to mark it as a forward declaration).

Summary

Match objects are really a tree of matches, with nodes for each named submatch and for each capturing group. Action methods make it easy to decouple parsing from data extraction.

Next we'll explore how to generate better error messages from a failed parse.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 29 Jan 2017

Perl 6 By Example: Parsing INI files


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


You've probably seen .ini files before; they are quite common as configuration files on the Microsoft Windows platform, but are also in many other places like ODBC configuration files, Ansible's inventory files and so on.

This is what they look like:

key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff

Perl 6 offers regexes for parsing, and grammars for structuring and reusing regexes.

Regex Basics

A regex is a piece of code that acts as a pattern for strings with a common structure. It's derived from the computer science concept of a regular expression, but adopted to allow more constructs than pure regular expressions allow, and with some added features that make them easier to use.

We'll use named regexes to match the primitives, and then use regexes that call these named regexes to build a parser for the INI files. Since INI files have no universally accepted, formal grammar, we have to make stuff up a we go.

Let's start with parsing value pairs, like key1=value1. First the key. It may contain letters, digits and the underscore _. There's a shortcut for match such characters, \w, and matching more at least one works by appending a + character:

use v6;

my regex key { \w+ }

multi sub MAIN('test') {
    use Test;
    ok 'abc'    ~~ /^ <key> $/, '<key> matches a simple identifier';
    ok '[abc]' !~~ /^ <key> $/, '<key> does not match a section header';
    done-testing;
}

my regex key { \w+ } declares a lexically (my) scoped regex called key that matches one or more word character.

There is a long tradition in programming languages to support so-called Perl Compatible Regular Expressions, short PCRE. Many programming languages support some deviations from PCRE, including Perl itself, but common syntax elements remain throughout most implementations. Perl 6 still supports some of these elements, but deviates substantially in others.

Here \w+ is the same as in PCRE, but the fact that white space around the \w+ is ignored is not. In the testing routine, the slashes in 'abc' ~~ /^ <key> $/ delimit an anonymous regex. In this regex, ^ and $ stand for the start and the end of the matched string, respectively, which is familiar from PCRE. But then <key> calls the named regex key from earlier. This again is a Perl 6 extension. In PCRE, the < in a regex matches a literal <. In Perl 6 regexes, it introduces a subrule call.

In general, all non-word characters are reserved for "special" syntax, and you have to quote or backslash them to get the literal meaning. For example \< or '<' in a regex match a less-then sign. Word characters (letters, digits and the underscore) always match literally.

Parsing the INI primitives

Coming back the INI parsing, we have to think about what characters are allowed inside a value. Listing allowed characters seems to be like a futile exercise, since we are very likely to forget some. Instead, we should think about what's not allowed in a value. Newlines certainly aren't, because they introduce the next key/value pair or a section heading. Neither are semicolons, because they introduce a comment.

We can formulate this exclusion as a negated character class: <-[ \n ; ]> matches any single character that is neither a newline nor a semicolon. Note that inside a character class, nearly all characters lose their special meaning. Only backslash, whitespace and the closing bracket stand for anything other than themselves. Inside and outside of character classes alike, \n matches a single newline character, and \s and whitespace. The upper-case inverts that, so that for example \S matches any single character that is not whitespace.

This leads us to a version version of a regex for match a value in an ini file:

my regex value { <-[ \n ; ]>+ }

There is one problem with this regex: it also matches leading and trailing whitespace, which we don't want to consider as part of the value:

my regex value { <-[ \n ; ]>+ }
if ' abc ' ~~ /<value>/ {
    say "matched '$/'";         # matched ' abc '
}

If Perl 6 regexes were limited to a regular language in the Computer Science sense, we'd have to something like this:

my regex value { 
    # match a first non-whitespace character
    <-[ \s ; ]> 
    [
        # then arbitrarily many that can contain whitespace
        <-[ \n ; ]>* 
        # ... terminated by one non-whitespace 
        <-[ \s ; ]> 
    ]?  # and make it optional, in case the value is only
        # only one non-whitespace character
}

And now you know why people respond with "And now you have two problems" when proposing to solve problems with regexes. A simpler solution is to match a value as introduced first, and then introduce a constraint that neither the first nor the last character may be whitespace:

my regex value { <!before \s> <-[ \s ; ]>+ <!after \s> }

along with accompanying tests:

is ' abc ' ~~ /<value>/, 'abc', '<value> does not match leading or trailing whitespace';
is ' a' ~~ /<value>/, 'a', '<value> watches single non-whitespace too';
ok "a\nb" !~~ /^ <value> $/, '<valuee> does not match \n';

<!before regex> is a negated look-ahead, that is, the following text must not match the regex, and the text isn't consumed while matching. Unsurprisingly, <!after regex> is the negated look-behind, which tries to match text that's already been matched, and must not succeed in doing so for the whole match to be successful.

This being Perl 6, there is of course yet another way to approach this problem. If you formulate the requirements as "a value must not contain a newline or semicolon and start with a non-whitepsace and end with a non-whitespace", it become obvious that if we just had an AND operator in regexes, this could be easy. And it is:

my regex value { <-[ \s ; ]>+ & \S.* & .*\S }

The & operator delimits two or more smaller regex expressions that must all match the same string successfully for the whole match to succeed. \S.* matches any string that starts with a non-whitesspace character (\S), followed by any character (.) any number of times *. Likewise .*\S matches any string that ends with non-whitespace character.

Who would have thought that matching something as seemingly simple as a value in a configuration file could be so involved? Luckily, matching a pair of key and value is much simpler, now that we know how to match each on their own:

my regex pair { <key> '=' <value> }

And this works great, as long as there are no blanks surrounding the equality sign. If there is, we have to match it separately:

my regex pair { <key> \h* '=' \h* <value> }

\h matches a horizontal whitespace, that is, a blank, a tabulator character, or any other fancy space-like thing that Unicode has in store for us (for example also the non-breaking space), but not a newline.

Speaking of newlines, it's a good idea to match a newline at the end of regex pair, and since we ignore empty lines, let's match more than one too:

my regex pair { <key> \h* '=' \h* <value> \n+ }

Time to write some tests as well:

ok "key=vaule\n" ~~ /<pair>/, 'simple pair';
ok "key = value\n\n" ~~ /<pair>/, 'pair with blanks';
ok "key\n= value\n" !~~ /<pair>/, 'pair with newline before assignment';

A section header is a string in brackets, so the string itself shouldn't contain brackets or a newline:

my regex header { '[' <-[ \[ \] \n ]>+ ']' \n+ }

# and in multi sub MAIN('test'):
ok "[abc]\n"    ~~ /^ <header> $/, 'simple header';
ok "[a c]\n"    ~~ /^ <header> $/, 'header with spaces';
ok "[a [b]]\n" !~~ /^ <header> $/, 'cannot nest headers';
ok "[a\nb]\n"  !~~ /^ <header> $/, 'No newlines inside headers';

The last remaining primitive is the comment:

my regex comment { ';' \N*\n+ }

\N matches any character that's not a newline, so the comment is just a semicolon, and then anything until the end of the line.

Putting Things Together

A section of an INI file is a header followed by some key-value pairs or comment lines:

my regex section {
    <header>
    [ <pair> | <comment> ]*
}

[...] groups a part of a regex, so that the quantifier * after it applies to the whole group, not just to the last term.

The whole INI file consists of potentially some initial key/value pairs or comments followed by some sections:

my regex inifile {
    [ <pair> | <comment> ]*
    <section>*
}

The avid reader has noticed that the [ <pair> | <comment> ]* part of a regex has been used twice, so it's a good idea to extract it into a standalone regex:

my regex block   { [ <pair> | <comment> ]* }
my regex section { <header> <block> }
my regex inifile { <block> <section>* }

It's time for the "ultimate" test:

my $ini = q:to/EOI/;
key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff
EOI

ok $ini ~~ /^<inifile>$/, 'Can parse a full ini file';

Backtracking

Regex matching seems magical to many programmers. You just state the pattern, and the regex engine determines for you whether a string matches the pattern or not. While implementing a regex engine is a tricky business, the basics aren't too hard to understand.

The regex engine goes through the parts of a regex from left to right, trying to match each part of the regex. It keeps track of what part of the string it matched so far in a cursor. If a part of a regex can't find a match, the regex engine tries to alter the previous match to take up fewer characters, and the retry the failed match at the new position.

For example if you execute the regex match

'abc' ~~ /.* b/

the regex engine first evaluates the .*. The . matches any character. The * quantifier is greedy, which means it tries to match as many characters as it can. It ends up matching the whole string, abc. Then the regex engine tries to match the b, which is a literal. Since the previous match gobbled up the whole string, matching c against the remaining empty string fails. So the previous regex part, .*, must give up a character. It now matches ab, and the literal matcher for the b compares b and c, and fails again. So there is a final iteration where the .* once again gives up one character it matched, and now the b literal can match the second character in the string.

This back and forth between the parts of a regex is called backtracking. It's great feature when you search for a pattern in a string. But in a parser, it is usually not desirable. If for example the regex key matched the substring "key2" in the input"key2=value2`, you don't want it to match a shorter substring just because the next part of the regex can't match.

There are three major reasons why you don't want that. The first is that it makes debugging harder. When humans think about how a text is structured, they usually commit pretty quickly to basic tokenization, such as where a word or a sentence ends. Thus backtracking can be very uninuitive. If you generate error messages based on which regexes failed to match, backtracking basically always leads to the error message being pretty useless.

The second reason is that backtracking can lead to unexpected regex matches. For example you want to match two words, optionally separated by whitespaces, and you try to translate this directly to a regex:

say "two words" ~~ /\w+\s*\w+/;     # 「two words」

This seems to work: the first \w+ matches the first word, the seccond oen matches the second word, all fine and good. Until you find that it actually matches a single word too:

say "two" ~~ /\w+\s*\w+/;           # 「two」

How did that happen? Well, the first \w+ matched the whole word, \s* successfully matched the empty string, and then the second \w+ failed, forcing the previous to parts of the regex to match differently. So in the second iteration, the first \w+only matchestw, and the second\w+ matcheso`. And then you realize that if two words aren't delimtied by whitespace, how do you even tell where one word ends and the next one starts? With backtracking disabled, the regex fails to match instead of matching in an unintended way.

The third reason is performance. When you disable backtracking, the regex engine has to look at each character only once, or once for each branch it can take in the case of alternatives. With backtracking, the regex engine can be stuck in backtracking loops that take over-proportionally longer with increasing length of the input string.

To disable backtracking, you simply have to replace the word regex by token in the declaration, or by using the :ratchet modifier inside the regex.

In the INI file parser, only regex value needs backtracking (though other formualtions discussed above don't need it), all the other regexes can be switched over to tokens safely:

my token key     { \w+ }
my regex value   {  <!before \s> <-[\n;]>+ <!after \s> }
my token pair    { <key> \h* '=' \h* <value> \n+ }
my token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
my token comment { ';' \N*\n+  }
my token block { [ <pair> | <comment> ]* }
my token section { <header> <block> }
my token inifile { <block> <section>* }

Summary

Perl 6 allows regex reuse by treating them as first-class citizens, allowing them to be named and called like normal routines. Further clutter is removed by allowing whitespace inside regexes.

These features allows you to write regexes to parse proper file formats and even programming languages. So far we have only seen a binary decision about whether a string matches a regex or not. In the future, we'll explore ways to improve code reuse even more, extract structured data from the match, and give better error messages when the parse fails.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 22 Jan 2017

Perl 6 By Example: Perl 6 Review


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In the previous "Perl 6 by Example" blog posts we've discussed some examples interleaved with the Perl 6 mechanics that make them work. Here I want to summarize and deepen the Perl 6 knowledge that we've touched on so far, removed from the original examples.

Variables and Scoping

In Perl 6, variable names are made of a sigil, $, @, % or &, followed by an identifier. The sigil implies a type constraint, where $ is the most general one (no restriction by default), @ is for arrays, % for hashes (associative arrays/maps), and & for code objects.

Identifiers can contain - and ' characters, as long as the character after it is a letter. Identifiers must start with a letter or underscore.

Subroutines and variables declared with my are lexically scoped. They are visible from the point of the declaration to the end of the current {}-enclosed block (or the current file, in case the declaration is outside a block). Subroutine parameters are visible in the signature and block of the subroutine.

An optional twigil between the sigil and identifier can influence the scoping. The * twigil marks a dynamically scoped variable, thus lookup is performed in the current call stack. ! marks attributes, that is, a per-instance variable that's attached to an object.

Subroutines

A subroutine, or short sub, is a piece of code with its own scope and usually also a name. It has a signature which specifies what kind of values you have to pass in when you call it:

sub chunks(Str $s, Int $chars) {
#         ^^^^^^^^^^^^^^^^^^^^ signature
#   ^^^^^^ name
    gather for 0 .. $s.chars / $chars - 1 -> $idx {
        take substr($s, $idx * $chars, $chars);
    }
}

The variables used in the signature are called parameters, whereas we call the values that you pass in arguments.

To refer to a subroutine without calling it, put an ampersand character in front of it, for example

say &chunks.^name;      # Sub

to call it, simply use its name, followed by the list of arguments, which can optionally be in parentheses:

say chunks 'abcd', 2;   # (ab cd)
say chunks('abcd', 2);  # (ab cd)

You only need the parentheses if some other construct would otherwise interfere with the subroutine call. For example if you intend to write

say chunks(join('x', 'ab', 'c'), 2);

and you leave out the inner pair of parentheses:

say chunks(join 'x', 'ab', 'c', 2);

then all the arguments go to the join function, leaving only one argument to the chunks function. On the other hand it is fine to leave out the outer pair of parentheses and write

say chunks join('x', 'ab', 'c'), 2;

because there's no ambiguity here.

One case worth noting is that if you call a subroutine without arguments as the block of an if condition or a for loop (or similar constructs), you have to include the parentheses, because otherwise the block is parsed as an argument to the function.

sub random-choice() {
    Bool.pick;
}

# right way:
if random-choice() {
    say 'You were lucky.';
}

# wrong way:
if random-choice {
    say 'You were lucky.';
}

If you do happen to make this mistake, the Perl 6 compiler tries very hard to detect it. In the example above, it says

Function 'random-choice' needs parens to avoid gobbling block

and when it tries to parse the block for the if-statement, it doesn't find one:

Missing block (apparently claimed by 'random-choice')

When you have a sub called MAIN, Perl 6 uses its signature to parse the command line arguments and pass those command line arguments to MAIN.

multi subs are several subroutines with the same name but different signatures. The compiler decides at run time which of the candidates it calls based on the best match between arguments and parameters.

Classes and Objects

Class declarations follow the same syntactic schema as subroutine declarations: the keyword class, followed by the name, followed by the body in curly braces:

class OutputCapture {
    has @!lines;
    method print(\s) {
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}

By default, type names are scoped to the current namespace, however you can make it lexically scoped by adding a my in front of class:

my class OutputCapture { ... }

Creating a new instance generally works by calling the new method on the type object. The new method is inherited from the implicit parent class Any that all types get:

my $c = OutputCapture.new;

Per-instance state is stored in attributes, which are declared with the has keyword, as seen above in has @!lines. Attributes are always private, as indicated by the ! twigil. If you use the dot . twigil in the declaration instead, you have both the private attribute @!lines and a public, read-only accessor method:

my class OutputCapture {
    has @.lines;
    method print(\s) {
        # the private name with ! still works
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}
my $c = OutputCapture.new;
$c.print('42');
# use the `lines` accessor method:
say $c.lines;       # [42]

When you declare attributes with the dot twigil, you can also initialize the attributes from the constructor through named arguments, as in OutputCapture.new( lines => [42] ).

Private methods start with a ! and can only be called from inside the class body as self!private-method.

Methods are basically just subroutines, with two differences. The first is that they get an implicit parameter called self, which contains the object the method is called on (which we call the invocant). The second is that if you call a subroutine, the compiler searches for this subroutine in the current lexical scope, and outer scopes. On the other hand, the methods for a method calls are looked up in the class of the object and its superclasses.

Concurrency

Perl 6 provides high-level primitives for concurrency and parallel execution. Instead of explicitly spawning new threads, you are encouraged to run a computation with start, which returns a Promise. This is an object that promises that in the future the computation will yield a result. The status can thus be Planned, Kept or Broken. You can chain promises, combine them, and wait for them.

In the background, a scheduler distributes such computations to operating system level threads. The default scheduler is a thread pool scheduler with an upper limit to the number of threads to use.

Communication between parallel computations should happen through thread-safe data structures. Foremost among them are the Channel, a thread-safe queue, and Supply, Perl 6's implementation of the Observer Pattern. Supplies are very powerful, because you can transform them with methods such as map, grep, throttle or delayed, and use their actor semantic to ensure that a consumer is run in only one thread at a time.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 15 Jan 2017

Perl 6 By Example: Stateful Silent Cron


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In the last two installments we've looked at silent-cron, a wrapper around external programs that silences them in case their exit status is zero. But to make it really practical, it should also silence occasional failures.

External APIs fail, networks become congested, and other things happen that prevent a job from succeeding, so some kind of retry mechanism is desirable. In case of a cron job, cron already takes care of retrying a job on a regular basis, so silent-cron should just suppress occasional errors. On the other hand, if a job fails consistently, this is usually something that an operator or developer should look into, so it's a problem worth reporting.

To implement this functionality, silent-cron needs to store persistent state between separate runs. It needs to record the results from the current run and then analyze if the failure history qualifies as "occasional".

Persistent Storage

The storage backend needs to write and retrieve structured data, and protect concurrent access to the state file with locking. A good library for such a storage backend is SQLite, a zero-maintenance SQL engine that's available as a C library. It's public domain software and in use in most major browsers, operating systems and even some airliners.

Perl 6 gives you access to SQLite's functionality through DBIish, a generic database interface with backend drivers for SQLite, MySQL, PostgreSQL and Oracle DB. To use it, first make sure that SQLite3 is installed, including its header files. On a Debian-based Linux system, for example, you can achieve this with apt-get install libsqlite3-dev. If you are using the Rakudo Star distribution, DBIish is already available. If not, you can use one of the module installers to retrieve and install it: panda install DBIish or zef install DBIish.

To use the DBIish's SQLite backend, you first have to create a database handle by selecting the backend and supplying connection information:

use DBIish;
my $dbh = DBIish.connect('SQLite', :database('database-file.sqlite3'));

Connecting to a database file that does not yet exist creates that file.

One-off SQL statements can be executed directly on the database handle:

$dbh.do('INSERT INTO player (name) VALUES ?', 'John');

The ? in the SQL is a placeholder that is passed out-of-band as a separate argument to the do method, which avoids potential errors such as SQL injection vulnerabilities.

Queries tend to work by first preparing a statement which returns a statement handle. You can execute a statement once or multiple times, and retrieve result rows after each execute call:

my $sth = $dbh.prepare('SELECT id FROM player WHERE name = ?');

my %ids;
for <John Jack> -> $name {
    $sth.execute($name);
    %ids{ $name } = $sth.row[0];
}
$sth.finish;

Developing the Storage Backend

We shouldn't just stuff all the storage handling code into sub MAIN, we should instead carefully consider the creation of a useful API for the storage backend. At first, we need only two pieces of functionality: insert the result of a job execution; and retrieve the most recent results.

Since silent-cron can be used to guard multiple cron jobs on the same machine, we might need something to distinguish the different jobs so that one of them succeeding doesn't prevent error reporting for one that is constantly failing. For that we introduce a job name, which can default to the command (including arguments) being executed but which can be set explicitly on the command line.

The API for the storage backend could look something like this:

my $repo = ExecutionResultRepository.new(
    jobname   => 'refresh cache',
    statefile => 'silent-cron.sqlite3',
);
$repo.insert($result);
my @last-results = $repo.tail(5);

This API isn't specific to the SQLite backend at all; a storage backend that works with plain text files could have the exact same API.

Let's implement this API. First we need the class and the two attributes that should be obvious from the usage example above:

class ExecutionResultRepository {
    has $.jobname   is required;
    has $.statefile is required;
    # ... more code

To implement the insert method, we need to connect to the database and create the relevant table if it doesn't exist yet.

has $!db;
method !db() {
    return $!db if $!db;
    $!db = DBIish.connect('SQLite', :database($.statefile));
    self!create-schema();
    return $!db;
}

This code uses a private attribute $!db to cache the database handle and a private method !db to create the handle if it doesn't exist yet.

Private methods are declared like ordinary methods, except that the name starts with an exclamation mark. To call one, substitute the method call dot for the exclamation mark, in other words, use self!db() instead of self.db().

The !db method also calls the next private method, !create-schema, which creates the storage table and some indexes:

method !create-schema() {
    $!db.do(qq:to/SCHEMA/);
        CREATE TABLE IF NOT EXISTS $table (
            id          INTEGER PRIMARY KEY,
            jobname     VARCHAR NOT NULL,
            exitcode    INTEGER NOT NULL,
            timed_out   INTEGER NOT NULL,
            output      VARCHAR NOT NULL,
            executed    TIMESTAMP NOT NULL DEFAULT (DATETIME('NOW'))
        );
    SCHEMA
    $!db.do(qq:to/INDEX/);
        CREATE INDEX IF NOT EXISTS {$table}_jobname_exitcode ON $table ( jobname, exitcode );
    INDEX
    $!db.do(qq:to/INDEX/);
        CREATE INDEX IF NOT EXISTS {$table}_jobname_executed ON $table ( jobname, executed );
    INDEX
}

Multi-line string literals are best written with the heredoc syntax. qq:to/DELIMITER/ tells Perl 6 to finish parsing the current statement so that you can still close the method call parenthesis and add the statement-ending semicolon. The next line starts the string literal, which goes on until Perl 6 finds the delimiter on a line on its own. Leading whitespace is stripped from each line of the string literal by as much as the closing delimiter is indented.

For example

print q:to/EOS/;
    Not indented
        Indented four spaces
    EOS

Produces the output

Not indented
    Indented four spaces

Now that we have a working database connection and know that the database table exists, inserting a new record becomes easy:

method insert(ExecutionResult $r) {
    self!db.do(qq:to/INSERT/, $.jobname, $r.exitcode, $r.timed-out, $r.output);
        INSERT INTO $table (jobname, exitcode, timed_out, output)
        VALUES(?, ?, ?, ?)
    INSERT
}

Selecting the most recent records is a bit more work, partially because we need to convert the table rows into objects:

method tail(Int $count) {
    my $sth = self!db.prepare(qq:to/SELECT/);
        SELECT exitcode, timed_out, output
          FROM $table
         WHERE jobname = ?
      ORDER BY executed DESC
         LIMIT $count
    SELECT
    $sth.execute($.jobname);
    $sth.allrows(:array-of-hash).map: -> %h {
        ExecutionResult.new(
            exitcode  => %h<exitcode>,
            timed-out => ?%h<timed_out>,
            output    => %h<output>,
        );
    }
}

The last statement in the tail method deserves a bit of extra attention. $sth.allrows(:array-of-hash) produces the database rows as a list of hashes. This list is lazy, that is, it's generated on-demand. Lazy lists are a very convenient feature because they allow you to use iterators and lists with the same API. For example when reading lines from a file, you can write for $handle.lines -> $line { ... }, and the lines method doesn't have to load the whole file into memory; instead it can read a line whenever it is accessed.

$sth.allrows(...) is lazy, and so is the .map call that comes after it. map transforms a list one element at a time by calling the code object that's passed to it. And that is done lazily as well. So SQLite only retrieves rows from the database file when elements of the resulting list are actually accessed.

Using the Storage Backend

With the storage API in place, it's time to use it:

multi sub MAIN(*@cmd, :$timeout, :$jobname is copy,
               :$statefile='silent-cron.sqlite3', Int :$tries = 3) {
    $jobname //= @cmd.Str;
    my $result = run-with-timeout(@cmd, :$timeout);
    my $repo = ExecutionResultRepository.new(:$jobname, :$statefile);
    $repo.insert($result);

    my @runs = $repo.tail($tries);

    unless $result.is-success or @runs.grep({.is-success}) {
        say "The last @runs.elems() runs of @cmd[] all failed, the last execution ",
            $result.timed-out ?? "ran into a timeout"
                              !! "exited with code $result.exitcode()";

        print "Output:\n", $result.output if $result.output;
    }
    exit $result.exitcode // 2;
}

Now a job that succeeds a few times, and then fails up to two times in a row doesn't produce any error output, and only the third failed execution in a row produces output. You can override that on the command line with --tries=5.

Summary

We've discussed DBIish, a database API with pluggable backend, and explored using it with SQLite to store persistent data. In the process we also came across lazy lists and a new form of string literals called heredocs.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 08 Jan 2017

Perl 6 By Example: Testing Silent Cron


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


The previous blog post left us with a bare-bones silent-cron implementation, but without tests. I probably sound like a broken record for bringing this up time and again, but I really want some tests when I start refactoring or extending my programs. And this time, getting the tests in is a bit harder, so I think it's worth discussing how to do it.

Refactoring

As a short reminder, this is what the program looks like:

#!/usr/bin/env perl6

sub MAIN(*@cmd, :$timeout) {
    my $proc = Proc::Async.new(|@cmd);
    my $collector = Channel.new;
    for $proc.stdout, $proc.stderr -> $supply {
        $supply.tap: { $collector.send($_) }
    }
    my $promise = $proc.start;
    my $waitfor = $promise;
    $waitfor = Promise.anyof(Promise.in($timeout), $promise)
        if $timeout;
    $ = await $waitfor;

    $collector.close;
    my $output = $collector.list.join;

    if !$timeout || $promise.status ~~ Kept {
        my $exitcode = $promise.result.exitcode;
        if $exitcode != 0 {
            say "Program @cmd[] exited with code $exitcode";
            print "Output:\n", $output if $output;
        }
        exit $exitcode;
    }
    else {
        $proc.kill;
        say "Program @cmd[] did not finish after $timeout seconds";
        sleep 1 if $promise.status ~~ Planned;
        $proc.kill(9);
        $ = await $promise;
        exit 2;
    }
}

There's logic in there for executing external programs with a timeout, and then there's logic for dealing with two possible outcomes. In terms of both testability and for future extensions it makes sense to factor out the execution of external programs into a subroutine. The result of this code is not a single value, we're potentially interested in the output it produced, the exit code, and whether it ran into a timeout. We could write a subroutine that returns a list or a hash of these values, but here I chose to write a small class instead:

class ExecutionResult {
    has Int $.exitcode = -1;
    has Str $.output is required;
    has Bool $.timed-out = False;
    method is-success {
        !$.timed-out && $.exitcode == 0;
    }
}

We've seen classes before, but this one has a few new features. Attributes declared with the . twigil automatically get an accessor method, so

has Int $.exitcode;

is roughly the same as

has Int $!exitcode;
method exitcode() { $!exitcode }

So it allows a user of the class to access the value in the attribute from the outside. As a bonus, you can also initialize it from the standard constructor as a named argument, ExecutionResult.new( exitcode => 42 ). The exit code is not a required attribute, because we can't know the exit code of a program that has timed out. So with has Int $.exitcode = -1 we give it a default value that applies if the attribute hasn't been initialized.

The output is a required attribute, so we mark it as such with is required. That's a trait. Traits are pieces of code that modify the behavior of other things, here of an attribute. They crop up in several places, for example in subroutine signatures (is copy on a parameter), variable declarations and classes. If you try to call ExecutionResult.new() without specifying an output, you get such an error:

The attribute '$!output' is required, but you did not provide a value for it.

Mocking and Testing

Now that we have a convenient way to return more than one value from a hypothetical subroutine, let's look at what this subroutine might look like:

sub run-with-timeout(@cmd, :$timeout) {
    my $proc = Proc::Async.new(|@cmd);
    my $collector = Channel.new;
    for $proc.stdout, $proc.stderr -> $supply {
        $supply.tap: { $collector.send($_) }
    }
    my $promise = $proc.start;
    my $waitfor = $promise;
    $waitfor = Promise.anyof(Promise.in($timeout), $promise)
        if $timeout;
    $ = await $waitfor;

    $collector.close;
    my $output = $collector.list.join;

    if !$timeout || $promise.status ~~ Kept {
        say "No timeout";
        return ExecutionResult.new(
            :$output,
            :exitcode($promise.result.exitcode),
        );
    }
    else {
        $proc.kill;
        sleep 1 if $promise.status ~~ Planned;
        $proc.kill(9);
        $ = await $promise;
        return ExecutionResult.new(
            :$output,
            :timed-out,
        );
    }
}

The usage of Proc::Async has remained the same, but instead of printing this when an error occurs, the routine now returns ExecutionResult objects.

This simplifies the MAIN sub quite a bit:

multi sub MAIN(*@cmd, :$timeout) {
    my $result = run-with-timeout(@cmd, :$timeout);
    unless $result.is-success {
        say "Program @cmd[] ",
            $result.timed-out ?? "ran into a timeout"
                              !! "exited with code $result.exitcode()";

        print "Output:\n", $result.output if $result.output;
    }
    exit $result.exitcode // 2;
}

A new syntactic feature here is the ternary operator, CONDITION ?? TRUE-BRANCH !! FALSE-BRANCH, which you might know from other programming languages such as C or Perl 5 as CONDITION ? TRUE-BRANCH : FALSE-BRANCH.

Finally, the logical defined-or operator LEFT // RIGHT returns the LEFT side if it's defined, and if not, runs the RIGHT side and returns its value. It works like the || and or infix operators, except that those check for the boolean value of the left, not whether they are defined.

In Perl 6, we distinguish between defined and true values. By default, all instances are true and defined, and all type objects are false and undefined. Several built-in types override what they consider to be true. Numbers that equal 0 evaluate to False in a boolean context, as do empty strings and empty containers such as arrays, hashes and sets. On the other hand, only the built-in type Failure overrides definedness. You can override the truth value of a custom type by implementing a method Bool (which should return True or False), and the definedness with a method defined.

Now we could start testing the sub run-with-timeout by writing custom external commands with defined characteristics (output, run time, exit code), but that's rather fiddly to do so in a reliable, cross-platform way. So instead I want to replace Proc::Async with a mock implementation, and give the sub a way to inject that:

sub run-with-timeout(@cmd, :$timeout, :$executer = Proc::Async) {
    my $proc = $executer.defined ?? $executer !! $executer.new(|@cmd);
    # rest as before

Looking through sub run-with-timeout, we can make a quick list of methods that the stub Proc::Async implementation needs: stdout, stderr, start and kill. Both stdout and stderr need to return a Supply. The simplest thing that could possibly work is to return a Supply that will emit just a single value:

my class Mock::Proc::Async {
    has $.out = '';
    has $.err = '';
    method stdout {
        Supply.from-list($.out);
    }
    method stderr {
        Supply.from-list($.err);
    }

Supply.from-list returns a Supply that will emit all the arguments passed to it; in this case just a single string.

The simplest possible implementation of kill just does nothing:

    method kill($?) {}

$? in a signature is an optional argument ($foo?) without a name.

Only one method remains that needs to be stubbed: start. It's supposed to return a Promise that, after a defined number of seconds, returns a Proc object or a mock thereof. Since the code only calls the exitcode method on it, writing a stub for it is easy:

has $.exitcode = 0;
has $.execution-time = 1;
method start {
    Promise.in($.execution-time).then({
        (class {
            has $.exitcode;
        }).new(:$.exitcode);
    });
}

Since we don't need the class for the mock Proc anywhere else, we don't even need to give it a name. class { ... } creates an anonymous class, and the .new call on it creates a new object from it.

As mentioned before, a Proc with a non-zero exit code throws an exception when evaluated in void context, or sink context as we call it in Perl 6. We can emulate this behavior by extending the anonymous class a bit:

class {
    has $.exitcode;
    method sink() {
        die "mock Proc used in sink context";
    }
}

With all this preparation in place, we can finally write some tests:

multi sub MAIN('test') {
    use Test;

    my class Mock::Proc::Async {
        has $.exitcode = 0;
        has $.execution-time = 0;
        has $.out = '';
        has $.err = '';
        method kill($?) {}
        method stdout {
            Supply.from-list($.out);
        }
        method stderr {
            Supply.from-list($.err);
        }
        method start {
            Promise.in($.execution-time).then({
                (class {
                    has $.exitcode;
                    method sink() {
                        die "mock Proc used in sink context";
                    }
                }).new(:$.exitcode);
            });
        }
    }

    # no timeout, success
    my $result = run-with-timeout([],
        timeout => 2,
        executer => Mock::Proc::Async.new(
            out => 'mocked output',
        ),
    );
    isa-ok $result, ExecutionResult;
    is $result.exitcode, 0, 'exit code';
    is $result.output, 'mocked output', 'output';
    ok $result.is-success, 'success';

    # timeout
    $result = run-with-timeout([],
        timeout => 0.1,
        executer => Mock::Proc::Async.new(
            execution-time => 1,
            out => 'mocked output',
        ),
    );
    isa-ok $result, ExecutionResult;
    is $result.output, 'mocked output', 'output';
    ok $result.timed-out, 'timeout reported';
    nok $result.is-success, 'success';
}

This runs through two scenarios, one where a timeout is configured but not used (because the mocked external program exits first), and one where the timeout takes effect.

Improving Reliability and Timing

Relying on timing in tests is always unattractive. If the times are too short (or too slow together), you risk sporadic test failures on slow or heavily loaded machines. If you use more conservative temporal spacing of tests, the tests can become very slow.

There's a module (not distributed with Rakudo) to alleviate this pain: Test::Scheduler provides a thread scheduler with virtualized time, allowing you to write the tests like this:

use Test::Scheduler;
my $*SCHEDULER = Test::Scheduler.new;
my $result = start run-with-timeout([],
    timeout => 5,
    executer => Mock::Proc::Async.new(
        execution-time => 2,
        out => 'mocked output',
    ),
);
$*SCHEDULER.advance-by(5);
$result = $result.result;
isa-ok $result, ExecutionResult;
# more tests here

This installs the custom scheduler, and $*SCHEDULER.advance-by(5) instructs it to advance the virtual time by 5 seconds, without having to wait five actual seconds. At the time of writing (December 2016), Test::Scheduler is rather new module, and has a bug that prevents the second test case from working this way.

Installing a Module

If you want to try out Test::Scheduler, you need to install it first. If you run Rakudo Star, it has already provided you with the panda module installer. You can use that to download and install the module for you:

$ panda install Test::Scheduler

If you don't have panda available, you can instead bootstrap zef, an alternative module installer:

$ git clone https://github.com/ugexe/zef.git
$ cd zef
$ perl6 -Ilib bin/zef install .

and then use zef to install the module:

$ zef install Test::Scheduler

Summary

In this installment, we've seen attributes with accessors, the ternary operator and anonymous classes. Testing of threaded code has been discussed, and how a third-party module can help. Finally we had a very small glimpse at the two module installers, panda and zef.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 01 Jan 2017

Perl 6 By Example: Silent Cron, a Cron Wrapper


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


On Linux and UNIX-Like systems, a program called cron periodically executes user-defined commands in the background. It is used for system maintenance tasks such as refreshing or removing caches, rotating and deleting old log files and so on.

If such a command produces any output, cron typically sends an email containing the output so that an operator can look at it and judge if some action is required.

But not all command line programs are written for usage with cron. For example they might produce output even on successful execution, and indicate failure through a non-zero exit code. Or they might hang, or otherwise misbehave.

To deal with such commands, we'll develop a small program called silent-cron, which wraps such commands and suppresses output when the exit code is zero. It also allows you to specify a timeout that kills the wrapped program if it takes too long:

$ silent-cron -- command-that-might-fail args
$ silent-cron --timeout=5 -- command-that-might-hang

Running Commands Asynchronously

When you want to run external commands, Perl 6 gives you basically two choices: run, a simple, synchronous interface, and Proc::Async, an asynchronous and slightly more complex option. Even though we will omit the timeout in the first iteration, we need to be aware that implementing the timeout is easier in the asynchronous interface, so that's what we'll use:

#!/usr/bin/env perl6

sub MAIN(*@cmd) {
    my $proc = Proc::Async.new(@cmd);
    my $collector = Channel.new;
    for $proc.stdout, $proc.stderr -> $supply {
        $supply.tap: { $collector.send($_) }
    }
    my $result = $proc.start.result;
    $collector.close;
    my $output = $collector.list.join;

    my $exitcode = $result.exitcode;
    if $exitcode != 0 {
        say "Program @cmd[] exited with code $exitcode";
        print "Output:\n", $output if $output;
    }
    exit $exitcode;
}

There's a big chunk of new features and concepts in here, so let's go through the code bit by bit.

sub MAIN(*@cmd) {
    my $proc = Proc::Async.new(@cmd);

This collects all the command line arguments in the array variable @cmd, where the first element is the command to be executed, and any further elements are arguments passed to this command. The second line creates a new Proc::Async instance, but doesn't yet run the command.

We need to capture all output from the command; thus we capture the output of the STDOUT and STDERR streams (file handles 1 and 2 on Linux), and combine it into a single string. In the asynchronous API, STDOUT and STDERR are modeled as objects of type Supply, and hence are streams of events. Since supplies can emit events in parallel, we need a thread-safe data structure for collecting the result, and Perl 6 conveniently provides a Channel for that:

my $collector = Channel.new;

To actually get the output from the program, we need to tap into the STDOUT and STDERR streams:

for $proc.stdout, $proc.stderr -> $supply {
    $supply.tap: { $collector.send($_) }
}

Each supply executes the block { $collector.send($_) } for each string it receives. The string can be a character, a line or something larger if the stream is buffered. All we do with it is put the string into the channel $collector via the send method.

Now that the streams are tapped, we can start the program and wait for it to finish:

my $result = $proc.start.result;

Proc::Async.start executes the external process and returns a Promise. A promise wraps a piece of code that potentially runs on another thread, has a status (Planned, Kept or Broken), and once it's finished, a result. Accessing the result automatically waits for the wrapped code to finish. Here the code is the one that runs the external program and the result is an object of type Proc (which happens to be the same as the run() function from the synchronous interface).

After this line, we can be sure that the external command has terminated, and thus no more output will come from $proc.stdout and $proc.stderr. Hence we can safely close the channel and access all its elements through Channel.list:

$collector.close;
my $output = $collector.list.join;

Finally it's time to check if the external command was successful -- by checking its exit code -- and to exit the wrapper with the command's exit code:

my $exitcode = $result.exitcode;
if $exitcode != 0 {
    say "Program @cmd[] exited with code $exitcode";
    print "Output:\n", $output if $output;
}
exit $exitcode;

Implementing Timeouts

The idiomatic way to implement timeouts in Perl 6 is to use the Promise.anyof combinator together with a timer:

sub MAIN(*@cmd, :$timeout) {
    my $proc = Proc::Async.new(|@cmd);
    my $collector = Channel.new;
    for $proc.stdout, $proc.stderr -> $supply {
        $supply.tap: { $collector.send($_) }
    }
    my $promise = $proc.start;
    my $waitfor = $promise;
    $waitfor = Promise.anyof(Promise.in($timeout), $promise)
        if $timeout;
    await $waitfor;

The initialization of $proc hasn't changed. But instead of accessing $proc.start.result, we store the promise returned from $proc.start. If the user specified a timeout, we run this piece of code:

$waitfor = Promise.anyof(Promise.in($timeout), $promise)

Promise.in($seconds) returns a promise that will be fulfilled in $seconds seconds. It's basically the same as start { sleep $seconds }, but the scheduler can be a bit smarter about not allocating a whole thread just for sleeping.

Promise.anyof($p1, $p2) returns a promise that is fulfilled as soon as one of the arguments (which should also be promises) is fulfilled. So we wait either until the external program finished, or until the sleep promise is fulfilled.

With await $waitfor; the program waits for the promise to be fulfilled (or broken). When that is the case, we can't simply access $promise.result as before, because $promise (which is the promise for the external program) might not be fulfilled in the case of a timeout. So we have to check the status of the promise first and only then can we safely access $promise.result:

if !$timeout || $promise.status ~~ Kept {
    my $exitcode = $promise.result.exitcode;
    if $exitcode != 0 {
        say "Program @cmd[] exited with code $exitcode";
        print "Output:\n", $output if $output;
    }
    exit $exitcode;
}
else {
    ...
}

In the else { ... } branch, we need to handle the timeout case. This might be as simple as printing a statement that a timeout has occurred, and when silent-cron exits immediately afterwards, that might be acceptable. But we might want to do more in the future, so we should kill the external program. And if the program doesn't terminate after the friendly kill signal, it should receive a kill(9), which on UNIX systems forcefully terminates the program:

else {
    $proc.kill;
    say "Program @cmd[] did not finish after $timeout seconds";
    sleep 1 if $promise.status ~~ Planned;
    $proc.kill(9);
    await $promise;
    exit 2;
}

await $promise returns the result of the promise, so here a Proc object. Proc has a safety feature built in that if the command returned with a non-zero exit code, evaluating the object in void context throws an exception.

Since we explicitly handle the non-zero exit code in the code, we can suppress the generation of this exception by assigning the return value from await to a dummy variable:

my $dummy = await $promise

Since we don't need the value, we can also assign it to an anonymous variable instead:

$ = await $promise

More on Promises

If you have worked with concurrent or parallel programs in other languages, you might have come across threads, locks, mutexes, and other low-level constructs. These exist in Perl 6 too, but their direct usage is discouraged.

The problem with such low-level primitives is that they don't compose well. You can have two libraries that use threads and work fine on their own, but lead to deadlocks when combined within the same program. Or different components might launch threads on their own, which can lead to too many threads and high memory consumption when several such components come together in the same process.

Perl 6 provides higher-level primitives. Instead of spawning a thread, you use start to run code asynchronously and the scheduler decides which thread to run this on. If more start calls happen that ask for threads to schedule things on, some will run serially.

Here is a very simple example of running a computation in the background:

sub count-primes(Int $upto) {
    (1..$upto).grep(&is-prime).elems;
}

my $p = start count-primes 10_000;
say $p.status;
await $p;
say $p.result;

It gives this output:

Planned
1229

You can see that the main line of execution continued after the start call, and $p immediately had a value -- the promise, with status Planned.

As we've seen before, there are combinators for promises, anyof and allof. You can also chain actions to a promise using the then method:

sub count-primes(Int $upto) {
    (1..$upto).grep(&is-prime).elems;
}

my $p1 = start count-primes 10_000;
my $p2 = $p1.then({ say .result });
await $p2;

If an exception is thrown inside asynchronously executing code, the status of the promise becomes Broken, and calling its .result method re-throws the exception.

As a demonstration of the scheduler distributing tasks, let's consider a small Monte Carlo simulation to calculate an approximation for π. We generate a pair of random numbers between zero and one, and interpret them as dots in a square. A quarter circle with radius one covers the area of π/4, so the ratio of randomly placed dots within the quarter circle to the total number of dots approaches π/4, if we use enough dots.

sub pi-approx($iterations) {
    my $inside = 0;
    for 1..$iterations {
        my $x = 1.rand;
        my $y = 1.rand;
        $inside++ if $x * $x + $y * $y <= 1;
    }
    return ($inside / $iterations) * 4;
}
my @approximations = (1..1000).map({ start pi-approx(80) });
await @approximations;

say @approximations.map({.result}).sum / @approximations;

The program starts one thousand computations asynchronously, but if you look at a system monitoring tool while it runs, you'll observe only 16 threads running. This magic number comes from the default thread scheduler, and we can override it by providing our own instance of a scheduler above the previous code:

my $*SCHEDULER = ThreadPoolScheduler.new(:max_threads(3));

For CPU bound tasks like this Monte Carlo Simulation, it is a good idea to limit the number of threads roughly to the number of (possibly virtual) CPU cores; if many threads are stuck waiting for IO, a higher number of threads can yield better performance.

Possible Extensions

If you want to play with silent-cron, you could add a retry mechanism. If a command fails because of an external dependency (like an API or an NFS share), it might take time for that external dependency to recover. Hence you should add a quadratic or exponential backoff, that is, the wait time between retries should increase quadratically (1, 2, 4, 9, 16, ...) or exponentially (1, 2, 4, 8, 16, 32, ...).

Summary

We've seen an asynchronous API for running external programs and how to use Promises to implement timeouts. We've also discussed how promises are distributed to threads by a scheduler, allowing you to start an arbitrary number of promises without overloading your computer.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 25 Dec 2016

Perl 6 By Example: Testing the Say Function


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Testing say()

In the previous installment I changed some code so that it wouldn't produce output, and instead did the output in the MAIN sub, which conveniently went untested.

Changing code to make it easier to test is a legitimate practice. But if you do have to test code that produces output by calling say, there's a small trick you can use. say works on a file handle, and you can swap out the default file handle, which is connected to standard output. Instead, you can put a dummy file handle in its place that captures the lower-level commands issued to it, and record this for testing.

There's a ready-made module for that, IO::String, but for the sake of learning we'll look at how it works:

use v6;

# function to be tested
sub doublespeak($x) {
    say $x ~ $x;
}

use Test;
plan 1;

my class OutputCapture {
    has @!lines;
    method print(\s) {
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}

my $output = do {
    my $*OUT = OutputCapture.new;
    doublespeak(42);
    $*OUT.captured;
};

is $output, "4242\n", 'doublespeak works';

The first part of the code is the function we want to test, sub doublespeak. It concatenates its argument with itself using the ~ string concatenation operator. The result is passed to say.

Under the hood, say does a bit of formatting, and then looks up the variable $*OUT. The * after the sigil marks it as a dynamic variable. The lookup for the dynamic variable goes through the call stack, and in each stack frame looks for a declaration of the variable, taking the first it finds. say then calls the method print on that object.

Normally, $*OUT contains an object of type IO::Handle, but the say function doesn't really care about that, as long as it can call a print method on that object. That's called duck typing: we don't really care about the type of the object, as long as it can quack like a duck. Or in this case, print like a duck.

Then comes the loading of the test module, followed by the declaration of how many tests to run:

use Test;
plan 1;

You can leave out the second line, and instead call done-testing after your tests. But if there's a chance that the test code itself might be buggy, and not run tests it's supposed to, it's good to have an up-front declaration of the number of expected tests, so that the Test module or the test harness can catch such errors.

The next part of the example is the declaration of type which we can use to emulate the IO::Handle:

my class OutputCapture {
    has @!lines;
    method print(\s) {
        @!lines.append(s);
    }
    method captured() {
        @!lines.join;
    }
}

class introduces a class, and the my prefix makes the name lexically scoped, just like in a my $var declaration.

has @!lines declares an attribute, that is, a variable that exists separately for each instance of class OutputCapture. The ! marks it as an attribute. We could leave it out, but having it right there means you always know where the name comes from when reading a larger class.

The attribute @!lines starts with an @, not a $ as other variables we have seen so far. The @ is the sigil for an array variable.

You might be seeing a trend now: the first letter of a variable or attribute name denotes its rough type (scalar, array, & for routines, and later we'll learn about % for hashes), and if the second letter is not a letter, it specifies its scope. We call this second letter a twigil. So far we've seen * for dynamic variables, and ! for attributes. Stay tuned for more.

Then penultimate block of our example is this:

my $output = do {
    my $*OUT = OutputCapture.new;
    doublespeak(42);
    $*OUT.captured;
};

do { ... } just executes the code inside the curly braces and returns the value of the last statement. Like all code blocks in Perl 6, it also introduces a new lexical scope.

The new scope comes in handy in the next line, where my $*OUT declares a new dynamic variable $*OUT, which is however only valid in the scope of the block. It is initialized with OutputCapture.new, a new instance of the class declared earlier. new isn't magic, it's simply inherited from OutputCapture's superclass. We didn't declare one, but by default, classes get type Any as a superclass, which provides (among other things) the method new as a constructor.

The call to doublespeak calls say, which in turn calls $*OUT.print. And since $*OUT is an instance of OutputCapture in this dynamic scope, the string passed to say lands in OutputCapture's attribute @!lines, where $*OUT.captured can access it again.

The final line,

is $output, "4242\n", 'doublespeak works';

calls the is function from the Test module.

In good old testing tradition, this produces output in the TAP format:

1..1
ok 1 - doublespeak works

Summary

We've seen that say() uses a dynamically scoped variable, $*OUT, as its output file handle. For testing purposes, we can substitute that with an object of our making. Which made us stumble upon the first glimpses of how classes are written in Perl 6.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link