
Sun, 19 Feb 2017

Perl 6 By Example: A File and Directory Usage Graph



This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


A File and Directory Usage Graph

You bought a shiny new 2TB disk just a short while ago, and you're already getting low disk space warnings. What's taking up all that space?

To answer this question, and experiment a bit with data visualization, let's write a small tool that visualizes which files use up how much disk space.

To do that, we must first recursively read all directories and files in a given directory, and record their sizes. To get a listing of all elements in a directory, we can use the dir function, which returns a lazy list of IO::Path objects.

We distinguish between directories, which can have child entries, and files, which don't. Both can have a direct size, and in the case of directories also a total size, which includes files and subdirectories, recursively:

class File {
    has $.name;
    has $.size;
    method total-size() { $.size }
}

class Directory {
    has $.name;
    has $.size;
    has @.children;
    has $!total-size;
    method total-size() {
        $!total-size //= $.size + @.children.map({.total-size}).sum;
    }
}

sub tree(IO::Path $path) {
    if $path.d {
        return Directory.new(
            name     => $path.basename,
            size     => $path.s,
            children => dir($path).map(&tree),
        );
    }
    else {
        return File.new(
            name => $path.Str,
            size => $path.s,
        );
    }
}

Method total-size in class Directory uses the construct $var //= EXPR. The // stands for *defined-OR*: it returns the left-hand side if that has a defined value, and otherwise evaluates and returns the value of EXPR. Combined with the assignment operator, it evaluates the right-hand side only if the variable is undefined, and then stores the value of the expression in the variable. That's a short way to write a cache.
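As a standalone sketch (a hypothetical example, not from the code above), the defined-OR assignment runs its right-hand side at most once:

```raku
# //= assigns only when the variable is still undefined,
# so the expensive computation happens once and is then cached.
my $cache;
sub expensive() {
    $cache //= do { say 'computing...'; 42 };
}
say expensive();   # prints "computing..." once, then 42
say expensive();   # 42, straight from the cache
```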

The code for reading a file tree recursively uses the d and s methods on IO::Path. d returns True for directories, and False for files. s returns the size. (Note that .s on directories used to throw an exception in older Rakudo versions; you need Rakudo 2017.01-169 or newer for this to work. If you are stuck on an older version of Rakudo, you could hard-code the size of a directory to a typical block size, like 4096 bytes. That typically won't skew your results too much.)

Just to check that we've got a sensible data structure, we can write a short routine that prints it recursively, with indentation to indicate the nesting of directory entries:

sub print-tree($tree, Int $indent = 0) {
    say ' ' x $indent, format-size($tree.total-size), '  ', $tree.name;
    if $tree ~~ Directory {
        print-tree($_, $indent + 2) for $tree.children
    }
}

sub format-size(Int $bytes) {
    my @units = flat '', <k M G T P>;
    my @steps = (1, { $_ * 1024 } ... *).head(6);
    for @steps.kv -> $idx, $step {
        my $in-unit = $bytes / $step;
        if $in-unit < 1024 {
            return sprintf '%.1f%s', $in-unit, @units[$idx];
        }
    }
}

sub MAIN($dir = '.') {
    print-tree(tree($dir.IO));
}

The subroutine print-tree is pretty boring if you're used to recursion. It prints out the name and size of the current node, and if the current node is a directory, recurses into each child with increased indentation. The indentation is applied through the x string repetition operator, which, called as $string x $count, repeats $string $count times.

To get a human-readable representation of a size, format-size knows a list of six units: the empty string for one, k (kilo) for 1024, M (mega) for 1024*1024 and so on. This list is stored in the array @units. The multiplier associated with each unit is stored in @steps, which is initialized through the series operator .... Its structure is INITIAL, CALLABLE ... LIMIT, and it repeatedly applies CALLABLE, first to the initial value, then to the next value generated, and so on, until it hits LIMIT. The limit here is *, a special term called Whatever, which in this context means the sequence is unlimited. So the series operator returns a lazy, potentially infinite list, and the trailing .head(6) call limits it to 6 values.

To find the most appropriate unit to print the size in, we have to iterate over both the values and the indexes of the array, which for @steps.kv -> $idx, $step { ... } accomplishes. sprintf, known from other programming languages, does the actual formatting to one digit after the dot, and appends the unit.
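Both pieces are easy to try in isolation (a small sketch; the word list below is made up):

```raku
# The series operator builds the lazy powers-of-1024 list:
say (1, { $_ * 1024 } ... *).head(4);   # (1 1024 1048576 1073741824)

# .kv interleaves indexes and values, which the pointy block
# -> $idx, $word consumes two at a time:
for <zero one two>.kv -> $idx, $word {
    say "$idx => $word";
}
```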

Generating a Tree Map

One possible visualization of file and directory sizes is a tree map, which represents each directory as a rectangle, and each file inside it as a rectangle inside the directory's rectangle. The size of each rectangle is proportional to the size of the file or directory it represents.

We'll generate an SVG file containing all those rectangles. Modern browsers support displaying those files, and also show mouse-over texts for each rectangle. This relieves us of the burden of actually labeling the rectangles, which can be quite a hassle.

To generate the SVG, we'll use the SVG module, which you can install with

$ zef install SVG

or

$ panda install SVG

depending on the module installer you have available.

This module provides a single method, serialize, to which you pass nested pairs. Pairs whose values are arrays are turned into XML tags, and other pairs are turned into attributes. For example this Perl 6 script

use SVG;
print SVG.serialize(
    :svg[
        width => 100,
        height => 20,
        title => [
            'example',
        ]
    ],
);

produces this output:

<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:svg="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     width="100"
     height="20">
        <title>example</title>
</svg>

(without the indentation). The xmlns attributes are helpfully added by the SVG module, and are necessary for programs to recognize the file as SVG.

To return to the tree maps, a very simple way to lay out the rectangles is to recurse into areas, subdividing each area either horizontally or vertically, depending on which axis is longer:

sub tree-map($tree, :$x1!, :$x2!, :$y1!, :$y2!) {
    # do not produce rectangles for small files/dirs
    return if ($x2 - $x1) * ($y2 - $y1) < 20;

    # produce a rectangle for the current file or dir
    take 'rect' => [
        x      => $x1,
        y      => $y1,
        width  => $x2 - $x1,
        height => $y2 - $y1,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name],
    ];
    return if $tree ~~ File;

    if $x2 - $x1 > $y2 - $y1 {
        # split along the x axis
        my $base = ($x2 - $x1) / $tree.total-size;
        my $new-x = $x1;
        for $tree.children -> $child {
            my $increment = $base * $child.total-size;
            tree-map(
                $child,
                x1 => $new-x,
                x2 => $new-x + $increment,
                :$y1,
                :$y2,
            );
            $new-x += $increment;
        }
    }
    else {
        # split along the y axis
        my $base = ($y2 - $y1) / $tree.total-size;
        my $new-y = $y1;
        for $tree.children -> $child {
            my $increment = $base * $child.total-size;
            tree-map(
                $child,
                :$x1,
                :$x2,
                y1 => $new-y,
                y2 => $new-y + $increment,
            );
            $new-y += $increment;
        }
    }
}

sub random-color {
    return 'rgb(' ~ (1..3).map({ (^256).pick }).join(',') ~ ')';
}

sub MAIN($dir = '.') {
    my $tree = tree($dir.IO);
    use SVG;
    my $width = 1024;
    my $height = 768;
    say SVG.serialize(
        :svg[
            :$width,
            :$height,
            | gather tree-map $tree, x1 => 0, x2 => $width, y1 => 0, y2 => $height
        ]
    );
}


Tree map of an example directory, with random colors and a mouse-over hover identifying one of the files.

The generated file is not pretty, due to the random colors, and due to some files being identified as very narrow rectangles. But it does make it obvious that there are a few big files, and many mostly small files in a directory (which happens to be the .git directory of a repository). Viewing a file in a browser shows the name of the file on mouse over.

How did we generate this file?

Sub tree-map calls take to add elements to a result list, so it must be called in the context of a gather statement. gather { take 1; take 2 } returns a lazy list of the two elements 1 and 2. But the take calls don't have to occur in the lexical scope of the gather; they can be in any code that's directly or indirectly called from the gather. We call that the dynamic scope.
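A minimal sketch of that dynamic scoping (with a made-up helper sub):

```raku
# take works from any code called inside the gather,
# not just from the gather's own lexical block:
sub emit-squares($n) {
    take $_ ** 2 for 1 .. $n;
}
my @squares = gather emit-squares(4);
say @squares;   # [1 4 9 16]
```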

The rest of sub tree-map is mostly straightforward. For each direction in which the remaining rectangle can be split, we calculate a base unit that signifies how many pixels a byte should take up. This is used to split the current canvas into smaller ones, which we recurse into with tree-map.

The random color generation uses ^256 to create a range from 0 to 255 (256 excluded), and .pick returns a random element from this range. The result is a random CSS color string like rgb(120,240,5).

In sub MAIN, the gather returns a list, which would normally be nested inside the outer array. The pipe symbol | in :svg[ ..., | gather ... ] prevents the normal nesting, and flattens the list into the outer array.
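The flattening behavior of | is easy to see in isolation (an illustrative snippet):

```raku
my @inner    = 2, 3;
my @combined = 1, |@inner, 4;   # | slips @inner's elements into the list
say @combined;                  # [1 2 3 4]
say @combined.elems;            # 4
```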

Flame Graphs

The disadvantage of the tree maps generated above is that the human brain isn't very good at comparing the sizes of rectangles with different aspect ratios, that is, rectangles whose ratios of width to height differ greatly. Flame graphs avoid this perception error by showing file sizes as horizontal bars. The vertical arrangement indicates the nesting of directories and files inside other directories. The disadvantage is that less of the available space is used for visualizing the file sizes.

Generating flame graphs is easier than tree maps, because you only need to subdivide in one direction, whereas the height of each bar is fixed, here to 15 pixels:

sub flame-graph($tree, :$x1!, :$x2!, :$y!, :$height!) {
    return if $y >= $height;
    take 'rect' => [
        x      => $x1,
        y      => $y,
        width  => $x2 - $x1,
        height => 15,
        style  => "fill:" ~ random-color(),
        title  => [$tree.name ~ ', ' ~ format-size($tree.total-size)],
    ];
    return if $tree ~~ File;

    my $base = ($x2 - $x1) / $tree.total-size;
    my $new-x = $x1;

    for $tree.children -> $child {
        my $increment = $base * $child.total-size;
        flame-graph(
            $child,
            x1 => $new-x,
            x2 => $new-x + $increment,
            y => $y + 15,
            :$height,
        );
        $new-x += $increment;
    }
}

We can add a switch to sub MAIN to call either tree-map or flame-graph, depending on a command line option:

sub MAIN($dir = '.', :$type="flame") {
    my $tree = tree($dir.IO);
    use SVG;
    my $width = 1024;
    my $height = 768;
    my &grapher = $type eq 'flame'
            ?? { flame-graph $tree, x1 => 0, x2 => $width, y => 0, :$height }
            !! { tree-map    $tree, x1 => 0, x2 => $width, y1 => 0, y2 => $height };
    say SVG.serialize(
        :svg[
            :$width,
            :$height,
            | gather grapher()
        ]
    );
}

Since SVG's coordinate system places the zero of the vertical axis at the top, this actually produces an inverted flame graph, sometimes called icicle graph:


Inverted flame graph with random colors, where the width of each bar represents a file/directory size, and the vertical position the nesting inside a directory.

Summary

We've explored tree maps and flame graphs to visualize which files and directories use up how much disk space.

But the code contains quite some duplications. Next week we'll explore techniques from functional programming to reduce code duplication. We'll also try to make the resulting files a bit prettier.

Subscribe to the Perl 6 book mailing list



Sun, 12 Feb 2017

Perl 6 By Example: Generating Good Parse Errors from a Parser





Good error messages are paramount to the user experience of any product. Parsers are no exception. Consider the difference between a message like "Square bracket [ on line 5 closed by curly bracket } on line 5" and Python's lazy and generic "SyntaxError: invalid syntax".

In addition to the textual message, knowing the location of the parse error helps tremendously in figuring out what's wrong.

We'll explore how to generate better parsing error messages from a Perl 6 grammar, using the INI file parser from the previous blog posts as an example.

Failure is Normal

Before we start, it's important to realize that in a grammar-based parser, it's normal for a regex to fail to match, even during an overall successful parse.

Let's recall a part of the parser:

token block { [<pair> | <comment>]* }
token section { <header> <block> }
token TOP { <block> <section>* }

When this grammar matches against the string

key=value
[header]
other=stuff

then TOP calls block, which calls both pair and comment. The pair match succeeds, the comment match fails. No big deal. But since there is a * quantifier in token block, it tries again to match pair or comment. Neither succeeds, but the overall match of token block still succeeds.

A nice way to visualize successful and failed submatches is to install the Grammar::Tracer module (zef install Grammar::Tracer or panda install Grammar::Tracer), and simply add the statement use Grammar::Tracer before the grammar definition. This produces debug output showing which rules matched and which didn't:

TOP
|  block
|  |  pair
|  |  |  key
|  |  |  * MATCH "key"
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  value
|  |  |  * MATCH "value"
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  ws
|  |  |  * MATCH ""
|  |  * MATCH "key=value\n"
|  |  pair
|  |  |  key
|  |  |  * FAIL
|  |  * FAIL
|  |  comment
|  |  * FAIL
|  * MATCH "key=value\n"
|  section
...

Detecting Harmful Failure

To produce good parsing error messages, you must distinguish between expected and unexpected parse failures. As explained above, a match failure of a single regex or token is not generally an indication of a malformed input. But you can identify points where you know that once the regex engine got this far, the rest of the match must succeed.

If you recall pair:

rule pair { <key>  '='  <value> \n+ }

we know that if a key was parsed, we really expect the next character to be an equals sign. If not, the input is malformed.

In code, this looks like this:

rule pair {
    <key> 
    [ '=' || <expect('=')> ]
     <value> \n+
}

|| is a sequential alternative, which first tries to match the subregex on the left-hand side, and only attempts the right-hand side if that failed. In contrast, | notionally tries all alternatives in parallel, and takes the longest match.
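The difference shows up even on a short string (an illustrative sketch):

```raku
say 'abc' ~~ / ab || abc /;   # ｢ab｣  -- the first branch already succeeds
say 'abc' ~~ / ab | abc /;    # ｢abc｣ -- the longest alternative wins
```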

So now we have to define expect:

method expect($what) {
    die "Cannot parse input as INI file: Expected $what";
}

Yes, you can call methods just like regexes, because regexes really are methods under the hood. die throws an exception, so now the malformed input justakey produces the error

Cannot parse input as INI file: Expected =

followed by a backtrace. That's already better than "invalid syntax", though the position is still missing. Inside method expect, we can find the current parsing position through method pos, a method supplied by the implicit parent class Grammar that the grammar declaration brings with it.

We can use that to improve the error message a bit:

method expect($what) {
    die "Cannot parse input as INI file: Expected $what at character {self.pos}";
}

Providing Context

For larger inputs, we really want to print the line number. To calculate that, we need to get hold of the target string, which is available as method target:

method expect($what) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: Expected $what at line @lines.elems(), after '@lines[*-1]'";
}

This brings us from the "meh" realm of error messages to quite good.

IniFile.parse(q:to/EOI/);
key=value
[section]
key_without_value
more=key
EOI

now dies with

Cannot parse input as INI file: Expected = at line 3, after 'key_without_value'

You can refine method expect more, for example by providing context both before and after the position of the parse failure.
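One possible refinement, sketched here with a made-up toy grammar and arbitrarily chosen context widths (an assumption, not code from the article):

```raku
# Hypothetical variant of expect that shows text before and after the failure point
grammar Demo {
    token TOP { <key> [ '=' || <expect('=')> ] \w+ }
    token key { \w+ }
    method expect($what) {
        my $pos    = self.pos;
        my $target = self.target;
        # up to 10 characters of context on each side of the failure position
        my $before = $target.substr(0 max $pos - 10, 10 min $pos);
        my $after  = $target.substr($pos, 10 min $target.chars - $pos);
        die "Expected $what at position $pos, between '$before' and '$after'";
    }
}
Demo.parse('key;value');   # dies: Expected = at position 3, between 'key' and ';value'
```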

And of course you have to apply the [ thing || <expect('thing')> ] pattern at more places inside the regex to get better error messages.

Finally you can provide different kinds of error messages too. For example when parsing a section header, once the initial [ is parsed, you likely don't want an error message "expected rest of section header", but rather "malformed section header, at line ...":

rule pair {
    <key> 
    [ '=' || <expect('=')> ] 
    [ <value> || <expect('value')>]
     \n+
}
token header { 
     '[' 
     [ ( <-[ \[ \] \n ]>+ )  ']'
         || <error("malformed section header")> ]
}
...

method expect($what) {
    self.error("expected $what");
}

method error($msg) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: $msg at line @lines.elems(), after '@lines[*-1]'";
}

Since Rakudo Perl 6 uses grammars to parse Perl 6 input, you can use Rakudo's own grammar as source of inspiration for more ways to make error reporting even better.

Summary

To generate good error messages from a parser, you need to distinguish between expected and unexpected match failures. The sequential alternative || is a tool you can use to turn unexpected match failures into error messages by raising an exception from the second branch of the alternative.

Subscribe to the Perl 6 book mailing list



Sun, 05 Feb 2017

Perl 6 By Example: Improved INI Parsing with Grammars





Last week we saw a collection of regexes that can parse configuration files in the INI format that's popular in the world of Microsoft Windows applications.

Here we'll explore grammars, a feature that groups regexes into a class-like structure, and how to extract structured data from a successful match.

Grammars

A grammar is a class with some extra features that make it suitable for parsing text. Along with methods and attributes, you can put regexes into a grammar.

This is what the INI file parser looks like when formulated as a grammar:

grammar IniFile {
    token key     { \w+ }
    token value   { <!before \s> <-[\n;]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N*\n+  }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}

You can use it to parse some text by calling the parse method, which uses regex or token TOP as the entry point:

my $result = IniFile.parse($text);

Besides the standardized entry point, a grammar offers more advantages. You can inherit from it like from a normal class, thus bringing even more reusability to regexes. You can group extra functionality together with the regexes by adding methods to the grammar. And then there are some mechanisms in grammars that can make your life as a developer easier.
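For example, inheritance lets a derived grammar override a single token while reusing the rest (a small illustrative grammar, not from the article):

```raku
grammar Digits {
    token TOP { <num>+ % ',' }    # comma-separated numbers
    token num { \d+ }
}
# The subclass replaces just the num token; TOP is inherited unchanged
grammar HexDigits is Digits {
    token num { <[ 0..9 a..f A..F ]>+ }
}
say so Digits.parse('12,34');      # True
say so HexDigits.parse('ff,a0');   # True
say so Digits.parse('ff,a0');      # False: \d+ does not match 'ff'
```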

One of them is dealing with whitespace. In INI files, horizontal whitespace is generally considered insignificant, in that key=value and key = value lead to the same configuration of the application. So far we've dealt with that explicitly by adding \h* to token pair. But there are places we haven't actually considered. For example, it's OK to have a comment that doesn't start at the very beginning of a line.

The mechanism that grammars offer is that you can define a rule called ws, and when you declare a token with rule instead of token (or enable this feature in regex through the :sigspace modifier), Perl 6 inserts implicit <ws> calls for you where there is whitespace in the regex definition:

grammar IniFile {
    token ws { \h* }
    rule pair { <key>  '='  <value> \n+ }
    # rest as before
}

This might not seem worth the effort for a single rule that needs to parse whitespace, but when there are more, it really pays off by keeping all whitespace parsing in a single place.

Note that you should only parse insignificant whitespace in token ws. For example for INI files, newlines are significant, so ws shouldn't match them.
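Here is the mechanism in miniature (an assumed toy grammar, parsed with the :rule named argument explained later):

```raku
grammar Pairs {
    token ws    { \h* }                 # horizontal whitespace only, not \n
    rule  pair  { <key> '=' <value> }   # spaces here become implicit <ws> calls
    token key   { \w+ }
    token value { \w+ }
}
say so Pairs.parse('a = b', :rule<pair>);   # True: ws consumes the spaces
say so Pairs.parse('a=b',   :rule<pair>);   # True: \h* also matches the empty string
```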

Extracting Data from the Match

So far the IniFile grammar only checks whether a given input matches the grammar or not. But when it does match, we really want the result of the parse in a data structure that's easy to use. For example we could translate this example INI file:

key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff

Into this data structure of nested hashes:

{
    _ => {
        key1 => "value2"
    },
    section1 => {
        key2 => "value2",
        key3 => "with spaces"
    },
    section2 => {
        more => "stuff"
    }
}

Key-value pairs from outside of any section show up in the _ top-level key.

The result from the IniFile.parse call is a Match object that has (nearly) all the information necessary to extract the desired data. If you turn a Match object into a string, it becomes the matched string. But there's more. You can use it like a hash to extract the matches from named submatches. For example if the top-level match from

token TOP { <block> <section>* }

produces a Match object $m, then $m<block> is again a Match object, this one from the match of the call of token block. And $m<section> is a list of Match objects from the repeated calls to token section. So a Match is really a tree of matches.
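Accessing named submatches works on any match object (a quick standalone illustration using inline named captures):

```raku
my $m = 'key1=value1' ~~ / $<key>=[\w+] '=' $<value>=[\w+] /;
say $m<key>;          # ｢key1｣
say $m<value>.Str;    # value1
```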

We can walk this data structure to extract the nested hashes. Token header matches a string like "[section1]\n", and we're only interested in "section1". To get to the inner part, we can modify token header by inserting a pair of round parentheses around the subregex whose match we're interested in:

token header { '[' ( <-[ \[ \] \n ]>+ ) ']' \n+ }
#                  ^^^^^^^^^^^^^^^^^^^^  a capturing group

That's a capturing group, and we can get its match by using the top-level match for header as an array, and accessing its first element. This leads us to the full INI parser:

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input);
    unless $m {
        die "The input is not a valid INI file.";
    }

    sub hash-from-block(Match $m) {
        my %result;
        for $m<block><pair> -> $pair {
            %result{ $pair<key>.Str } = $pair<value>.Str;
        }
        return %result;
    }

    my %result;
    %result<_> = hash-from-block($m);
    for $m<section> -> $section {
        %result{ $section<header>[0].Str } = hash-from-block($section);
    }
    return %result;
}

This top-down approach works, but it requires a very intimate understanding of the grammar's structure. Which means that if you change the structure during maintenance, you'll have a hard time figuring out how to change the data extraction code.

So Perl 6 offers a bottom-up approach as well. It allows you to write a data extraction or action method for each regex, token or rule. The grammar engine passes in the match object as the single argument, and the action method can call the routine make to attach a result to the match object. The result is available through the .made method on the match object.

This execution of action methods happens as soon as a regex matches successfully, which means that an action method for a regex can rely on the fact that the action methods for subregex calls have already run. For example when the rule pair { <key> '=' <value> \n+ } is being executed, first token key matches successfully, and its action method runs immediately afterwards. Then token value matches, and its action method runs too. Then finally rule pair itself can match successfully, so its action method can rely on $m<key>.made and $m<value>.made being available, assuming that the match result is stored in variable $m.

Speaking of variables, a regex match implicitly stores its result in the special variable $/, and it is custom to use $/ as parameter in action methods. And there is a shortcut for accessing named submatches: instead of writing $/<key>, you can write $<key>. With this convention in mind, the action class becomes:

class IniFile::Actions {
    method key($/)     { make $/.Str }
    method value($/)   { make $/.Str }
    method header($/)  { make $/[0].Str }
    method pair($/)    { make $<key>.made => $<value>.made }
    method block($/)   { make $<pair>.map({ .made }).hash }
    method section($/) { make $<header>.made => $<block>.made }
    method TOP($/)     {
        make {
            _ => $<block>.made,
            $<section>.map: { .made },
        }
    }
}

The first two action methods are really simple. The result of a key or value match is simply the string that matched. For a header, it's just the substring inside the brackets. Fittingly, a pair returns a Pair object, composed from key and value. Method block constructs a hash from all the lines in the block by iterating over each pair submatch and extracting the already attached Pair object. One level above that in the match tree, section takes that hash and pairs it with the name of the section, extracted from $<header>.made. Finally the top-level action method gathers the sectionless key-value pairs under the key _, as well as all the sections, and returns them in a hash.

In each method of the action class, we only rely on knowledge of the regexes called directly from the regex that corresponds to the action method, and of the data types they .made. Thus when you refactor one regex, you only have to change the corresponding action method. Nobody needs to be aware of the global structure of the grammar.

Now we just have to tell Perl 6 to actually use the action class:

sub parse-ini(Str $input) {
    my $m = IniFile.parse($input, :actions(IniFile::Actions));
    unless $m {
        die "The input is not a valid INI file.";
    }

    return $m.made
}

If you want to start parsing with a different rule than TOP (which you might want to do in a test, for example), you can pass a named argument rule to method parse:

sub parse-ini(Str $input, :$rule = 'TOP') {
    my $m = IniFile.parse($input,
        :actions(IniFile::Actions),
        :$rule,
    );
    unless $m {
        die "The input is not a valid INI file.";
    }

    return $m.made
}

say parse-ini($ini).perl;

use Test;

is-deeply parse-ini("k = v\n", :rule<pair>), 'k' => 'v',
    'can parse a simple pair';
done-testing;

To better encapsulate all the parsing functionality within the grammar, we can turn parse-ini into a method:

grammar IniFile {
    # regexes/tokens unchanged as before

    method parse-ini(Str $input, :$rule = 'TOP') {
        my $m = self.parse($input,
            :actions(IniFile::Actions),
            :$rule,
        );
        unless $m {
            die "The input is not a valid INI file.";
        }

        return $m.made
    }
}

# Usage:

my $result = IniFile.parse-ini($text);

To make this work, the class IniFile::Actions either has to be declared before the grammar, or it needs to be pre-declared with class IniFile::Actions { ... } at the top of the file (with literal three dots to mark it as a forward declaration).

Summary

Match objects are really a tree of matches, with nodes for each named submatch and for each capturing group. Action methods make it easy to decouple parsing from data extraction.

Next we'll explore how to generate better error messages from a failed parse.

Subscribe to the Perl 6 book mailing list



Sun, 29 Jan 2017

Perl 6 By Example: Parsing INI files





You've probably seen .ini files before; they are quite common as configuration files on the Microsoft Windows platform, but are also in many other places like ODBC configuration files, Ansible's inventory files and so on.

This is what they look like:

key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff

Perl 6 offers regexes for parsing, and grammars for structuring and reusing regexes.

Regex Basics

A regex is a piece of code that acts as a pattern for strings with a common structure. It's derived from the computer science concept of a regular expression, but adapted to allow more constructs than pure regular expressions allow, and with some added features that make regexes easier to use.

We'll use named regexes to match the primitives, and then use regexes that call these named regexes to build a parser for the INI files. Since INI files have no universally accepted, formal grammar, we have to make things up as we go.

Let's start with parsing key/value pairs, like key1=value1. First the key: it may contain letters, digits and the underscore _. There's a shortcut for matching such characters, \w, and matching at least one of them works by appending a + character:

use v6;

my regex key { \w+ }

multi sub MAIN('test') {
    use Test;
    ok 'abc'    ~~ /^ <key> $/, '<key> matches a simple identifier';
    ok '[abc]' !~~ /^ <key> $/, '<key> does not match a section header';
    done-testing;
}

my regex key { \w+ } declares a lexically (my) scoped regex called key that matches one or more word characters.

There is a long tradition in programming languages of supporting so-called Perl Compatible Regular Expressions, PCRE for short. Many programming languages deviate from PCRE in some respects, including Perl itself, but common syntax elements remain throughout most implementations. Perl 6 still supports some of these elements, but deviates substantially in others.

Here \w+ is the same as in PCRE, but the fact that white space around the \w+ is ignored is not. In the testing routine, the slashes in 'abc' ~~ /^ <key> $/ delimit an anonymous regex. In this regex, ^ and $ stand for the start and the end of the matched string, respectively, which is familiar from PCRE. But then <key> calls the named regex key from earlier. This again is a Perl 6 extension. In PCRE, the < in a regex matches a literal <. In Perl 6 regexes, it introduces a subrule call.

In general, all non-word characters are reserved for "special" syntax, and you have to quote or backslash them to get the literal meaning. For example \< or '<' in a regex matches a less-than sign. Word characters (letters, digits and the underscore) always match literally.

Parsing the INI primitives

Coming back to INI parsing, we have to think about which characters are allowed inside a value. Listing the allowed characters seems like a futile exercise, since we are very likely to forget some. Instead, we should think about what's not allowed in a value. Newlines certainly aren't, because they introduce the next key/value pair or a section heading. Neither are semicolons, because they introduce a comment.

We can formulate this exclusion as a negated character class: <-[ \n ; ]> matches any single character that is neither a newline nor a semicolon. Note that inside a character class, nearly all characters lose their special meaning. Only the backslash, whitespace and the closing bracket stand for anything other than themselves. Inside and outside of character classes alike, \n matches a single newline character, and \s matches any whitespace character. The upper-case variants invert that, so that for example \S matches any single character that is not whitespace.
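The character class and backslash sequences in action, as a small sketch (the test strings are arbitrary):

```raku
# <-[ \n ; ]> matches one character that is neither newline nor semicolon
say so 'a' ~~ / <-[ \n ; ]> /;   # True
say so ';' ~~ / <-[ \n ; ]> /;   # False
# \S is the negation of \s
say so ' ' ~~ / \S /;            # False
say so 'x' ~~ / \s /;            # False
```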

This leads us to a first version of a regex for matching a value in an INI file:

my regex value { <-[ \n ; ]>+ }

There is one problem with this regex: it also matches leading and trailing whitespace, which we don't want to consider as part of the value:

my regex value { <-[ \n ; ]>+ }
if ' abc ' ~~ /<value>/ {
    say "matched '$/'";         # matched ' abc '
}

If Perl 6 regexes were limited to a regular language in the computer science sense, we'd have to write something like this:

my regex value { 
    # match a first non-whitespace character
    <-[ \s ; ]> 
    [
        # then arbitrarily many that can contain whitespace
        <-[ \n ; ]>* 
        # ... terminated by one non-whitespace 
        <-[ \s ; ]> 
    ]?  # and make it optional, in case the value is
        # only one non-whitespace character
}

And now you know why people respond with "And now you have two problems" when proposing to solve problems with regexes. A simpler solution is to match a value as introduced first, and then introduce a constraint that neither the first nor the last character may be whitespace:

my regex value { <!before \s> <-[ \n ; ]>+ <!after \s> }

along with accompanying tests:

is ' abc ' ~~ /<value>/, 'abc', '<value> does not match leading or trailing whitespace';
is ' a' ~~ /<value>/, 'a', '<value> matches single non-whitespace too';
ok "a\nb" !~~ /^ <value> $/, '<value> does not match \n';

<!before regex> is a negated look-ahead, that is, the following text must not match the regex, and the text isn't consumed while matching. Unsurprisingly, <!after regex> is the negated look-behind, which tries to match text that's already been matched, and must not succeed in doing so for the whole match to be successful.
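Here is a minimal sketch of both assertions in isolation; the digit check is just an illustrative assumption, not part of the INI grammar:

```raku
# <!before \d> fails if the next character is a digit,
# without consuming any input
say so 'abc' ~~ / ^ <!before \d> \w+ /;   # True
say so '1bc' ~~ / ^ <!before \d> \w+ /;   # False
# <!after \s> fails if the previously matched character was whitespace
say so 'ab ' ~~ / \w+ <!after \s> /;      # True: matches 'ab'
```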

This being Perl 6, there is of course yet another way to approach this problem. If you formulate the requirements as "a value must not contain a newline or semicolon, must start with a non-whitespace character and end with a non-whitespace character", it becomes obvious that if we just had an AND operator in regexes, this could be easy. And it is:

my regex value { <-[ \n ; ]>+ & \S.* & .*\S }

The & operator delimits two or more smaller regex expressions that must all match the same string successfully for the whole match to succeed. \S.* matches any string that starts with a non-whitespace character (\S), followed by any character (.) any number of times (*). Likewise .*\S matches any string that ends with a non-whitespace character.

Who would have thought that matching something as seemingly simple as a value in a configuration file could be so involved? Luckily, matching a pair of key and value is much simpler, now that we know how to match each on its own:

my regex pair { <key> '=' <value> }

And this works great, as long as there are no blanks surrounding the equals sign. If there are, we have to match them separately:

my regex pair { <key> \h* '=' \h* <value> }

\h matches horizontal whitespace, that is, a blank, a tab character, or any other fancy space-like thing that Unicode has in store for us (for example the non-breaking space), but not a newline.

Speaking of newlines, it's a good idea to match a newline at the end of regex pair, and since we ignore empty lines, let's match more than one too:

my regex pair { <key> \h* '=' \h* <value> \n+ }

Time to write some tests as well:

ok "key=value\n" ~~ /<pair>/, 'simple pair';
ok "key = value\n\n" ~~ /<pair>/, 'pair with blanks';
ok "key\n= value\n" !~~ /<pair>/, 'pair with newline before assignment';

A section header is a string in brackets, so the string itself shouldn't contain brackets or a newline:

my regex header { '[' <-[ \[ \] \n ]>+ ']' \n+ }

# and in multi sub MAIN('test'):
ok "[abc]\n"    ~~ /^ <header> $/, 'simple header';
ok "[a c]\n"    ~~ /^ <header> $/, 'header with spaces';
ok "[a [b]]\n" !~~ /^ <header> $/, 'cannot nest headers';
ok "[a\nb]\n"  !~~ /^ <header> $/, 'No newlines inside headers';

The last remaining primitive is the comment:

my regex comment { ';' \N*\n+ }

\N matches any character that's not a newline, so the comment is just a semicolon, and then anything until the end of the line.

Putting Things Together

A section of an INI file is a header followed by some key-value pairs or comment lines:

my regex section {
    <header>
    [ <pair> | <comment> ]*
}

[...] groups a part of a regex, so that the quantifier * after it applies to the whole group, not just to the last term.
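A small sketch of the difference the group makes (the letters here are just an assumed example):

```raku
# the * quantifier applies to the whole group [ a b ]
say so 'abab' ~~ / ^ [ a b ]* $ /;   # True
say so 'abba' ~~ / ^ [ a b ]* $ /;   # False
# without the group, * applies only to the b
say so 'abbb' ~~ / ^ a b* $ /;       # True
```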

The whole INI file consists of potentially some initial key/value pairs or comments followed by some sections:

my regex inifile {
    [ <pair> | <comment> ]*
    <section>*
}

The avid reader has noticed that the [ <pair> | <comment> ]* part of a regex has been used twice, so it's a good idea to extract it into a standalone regex:

my regex block   { [ <pair> | <comment> ]* }
my regex section { <header> <block> }
my regex inifile { <block> <section>* }

It's time for the "ultimate" test:

my $ini = q:to/EOI/;
key1=value2

[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser

[section2]
more=stuff
EOI

ok $ini ~~ /^<inifile>$/, 'Can parse a full ini file';

Backtracking

Regex matching seems magical to many programmers. You just state the pattern, and the regex engine determines for you whether a string matches the pattern or not. While implementing a regex engine is a tricky business, the basics aren't too hard to understand.

The regex engine goes through the parts of a regex from left to right, trying to match each part of the regex. It keeps track of what part of the string it has matched so far in a cursor. If a part of a regex can't find a match, the regex engine tries to alter the previous match to take up fewer characters, and then retries the failed match at the new position.

For example if you execute the regex match

'abc' ~~ /.* b/

the regex engine first evaluates the .*. The . matches any character. The * quantifier is greedy, which means it tries to match as many characters as it can. It ends up matching the whole string, abc. Then the regex engine tries to match the b, which is a literal. Since the previous match gobbled up the whole string, matching b against the remaining empty string fails. So the previous regex part, .*, must give up a character. It now matches ab, and the literal matcher for the b compares b and c, and fails again. So there is a final iteration where the .* once again gives up one character it matched, and now the b literal can match the second character in the string.

This back and forth between the parts of a regex is called backtracking. It's a great feature when you search for a pattern in a string. But in a parser, it is usually not desirable. If, for example, the regex key matched the substring "key2" in the input "key2=value2", you don't want it to match a shorter substring just because the next part of the regex can't match.

There are three major reasons why you don't want that. The first is that it makes debugging harder. When humans think about how a text is structured, they usually commit pretty quickly to a basic tokenization, such as where a word or a sentence ends. Thus backtracking can be very unintuitive. If you generate error messages based on which regexes failed to match, backtracking basically always leads to the error message being pretty useless.

The second reason is that backtracking can lead to unexpected regex matches. For example, you want to match two words, optionally separated by whitespace, and you try to translate this directly to a regex:

say "two words" ~~ /\w+\s*\w+/;     # 「two words」

This seems to work: the first \w+ matches the first word, the second one matches the second word, all fine and good. Until you find that it actually matches a single word too:

say "two" ~~ /\w+\s*\w+/;           # 「two」

How did that happen? Well, the first \w+ matched the whole word, \s* successfully matched the empty string, and then the second \w+ failed, forcing the previous two parts of the regex to match differently. So in the second iteration, the first \w+ only matches tw, and the second \w+ matches o. And then you realize that if two words aren't delimited by whitespace, how do you even tell where one word ends and the next one starts? With backtracking disabled, the regex fails to match instead of matching in an unintended way.

The third reason is performance. When you disable backtracking, the regex engine has to look at each character only once, or once for each branch it can take in the case of alternatives. With backtracking, the regex engine can be stuck in backtracking loops that take over-proportionally longer with increasing length of the input string.

To disable backtracking, you simply have to replace the word regex with token in the declaration, or use the :ratchet modifier inside the regex.
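Applied to the two-word example from above, disabling backtracking makes the unintended single-word match go away (a sketch, assuming a recent Rakudo):

```raku
# with backtracking, a single word matches unintentionally
say so 'two' ~~ / \w+ \s* \w+ /;                 # True
# with :ratchet (which token implies), it fails instead
say so 'two' ~~ / :ratchet \w+ \s* \w+ /;        # False
say so 'two words' ~~ / :ratchet \w+ \s* \w+ /;  # True
```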

In the INI file parser, only regex value needs backtracking (though the other formulations discussed above don't need it); all the other regexes can be switched over to tokens safely:

my token key     { \w+ }
my regex value   {  <!before \s> <-[\n;]>+ <!after \s> }
my token pair    { <key> \h* '=' \h* <value> \n+ }
my token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
my token comment { ';' \N*\n+  }
my token block   { [ <pair> | <comment> ]* }
my token section { <header> <block> }
my token inifile { <block> <section>* }

Summary

Perl 6 makes regexes reusable by treating them as first-class citizens that can be named and called like normal routines. Further clutter is removed by allowing whitespace inside regexes.

These features allow you to write regexes that parse proper file formats and even programming languages. So far we have only seen a binary decision about whether a string matches a regex or not. In the future, we'll explore ways to improve code reuse even more, extract structured data from the match, and give better error messages when the parse fails.

Subscribe to the Perl 6 book mailing list



Sun, 22 Jan 2017

Perl 6 By Example: Perl 6 Review





In the previous "Perl 6 by Example" blog posts we've discussed some examples interleaved with the Perl 6 mechanics that make them work. Here I want to summarize and deepen the Perl 6 knowledge that we've touched on so far, removed from the original examples.

Variables and Scoping

In Perl 6, variable names are made of a sigil, $, @, % or &, followed by an identifier. The sigil implies a type constraint, where $ is the most general one (no restriction by default), @ is for arrays, % for hashes (associative arrays/maps), and & for code objects.

Identifiers can contain - and ' characters, as long as the character after each is a letter. Identifiers must start with a letter or underscore.
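For example (variable names made up for illustration):

```raku
my $disk-size = 2048;    # hyphen is fine when followed by a letter
my $isn't     = True;    # so is an apostrophe
say $disk-size;          # 2048
say $isn't;              # True
```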

Subroutines and variables declared with my are lexically scoped. They are visible from the point of the declaration to the end of the current {}-enclosed block (or the current file, in case the declaration is outside a block). Subroutine parameters are visible in the signature and block of the subroutine.

An optional twigil between the sigil and identifier can influence the scoping. The * twigil marks a dynamically scoped variable, for which lookup is performed in the current call stack. ! marks attributes, that is, per-instance variables attached to an object.

Subroutines

A subroutine, or short sub, is a piece of code with its own scope and usually also a name. It has a signature which specifies what kind of values you have to pass in when you call it:

sub chunks(Str $s, Int $chars) {
#         ^^^^^^^^^^^^^^^^^^^^ signature
#   ^^^^^^ name
    gather for 0 .. $s.chars / $chars - 1 -> $idx {
        take substr($s, $idx * $chars, $chars);
    }
}

The variables used in the signature are called parameters, whereas we call the values that you pass in arguments.

To refer to a subroutine without calling it, put an ampersand character in front of it, for example

say &chunks.^name;      # Sub

to call it, simply use its name, followed by the list of arguments, which can optionally be in parentheses:

say chunks 'abcd', 2;   # (ab cd)
say chunks('abcd', 2);  # (ab cd)

You only need the parentheses if some other construct would otherwise interfere with the subroutine call. For example if you intend to write

say chunks(join('x', 'ab', 'c'), 2);

and you leave out the inner pair of parentheses:

say chunks(join 'x', 'ab', 'c', 2);

then all the arguments go to the join function, leaving only one argument to the chunks function. On the other hand it is fine to leave out the outer pair of parentheses and write

say chunks join('x', 'ab', 'c'), 2;

because there's no ambiguity here.

One case worth noting is that if you call a subroutine without arguments as the block of an if condition or a for loop (or similar constructs), you have to include the parentheses, because otherwise the block is parsed as an argument to the function.

sub random-choice() {
    Bool.pick;
}

# right way:
if random-choice() {
    say 'You were lucky.';
}

# wrong way:
if random-choice {
    say 'You were lucky.';
}

If you do happen to make this mistake, the Perl 6 compiler tries very hard to detect it. In the example above, it says

Function 'random-choice' needs parens to avoid gobbling block

and when it tries to parse the block for the if-statement, it doesn't find one:

Missing block (apparently claimed by 'random-choice')

When you have a sub called MAIN, Perl 6 uses its signature to parse the command line arguments and pass those command line arguments to MAIN.

multi subs are several subroutines with the same name but different signatures. The compiler decides at run time which of the candidates it calls based on the best match between arguments and parameters.
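A minimal sketch of multi dispatch (the sub name describe is made up for illustration):

```raku
multi sub describe(Int $x) { "an integer: $x" }
multi sub describe(Str $x) { "a string: $x" }

# the candidate is chosen based on the argument's type
say describe(42);     # an integer: 42
say describe('hi');   # a string: hi
```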

Classes and Objects

Class declarations follow the same syntactic schema as subroutine declarations: the keyword class, followed by the name, followed by the body in curly braces:

class OutputCapture {
    has @!lines;
    method print(\s) {
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}

By default, type names are scoped to the current namespace, however you can make it lexically scoped by adding a my in front of class:

my class OutputCapture { ... }

Creating a new instance generally works by calling the new method on the type object. The new method is inherited from the implicit parent class Any that all types get:

my $c = OutputCapture.new;

Per-instance state is stored in attributes, which are declared with the has keyword, as seen above in has @!lines. Attributes are always private, as indicated by the ! twigil. If you use the dot . twigil in the declaration instead, you have both the private attribute @!lines and a public, read-only accessor method:

my class OutputCapture {
    has @.lines;
    method print(\s) {
        # the private name with ! still works
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}
my $c = OutputCapture.new;
$c.print('42');
# use the `lines` accessor method:
say $c.lines;       # [42]

When you declare attributes with the dot twigil, you can also initialize the attributes from the constructor through named arguments, as in OutputCapture.new( lines => [42] ).

Private methods start with a ! and can only be called from inside the class body as self!private-method.

Methods are basically just subroutines, with two differences. The first is that they get an implicit parameter called self, which contains the object the method is called on (which we call the invocant). The second is that if you call a subroutine, the compiler searches for this subroutine in the current lexical scope, and in outer scopes. In contrast, the method for a method call is looked up in the class of the object and its superclasses.

Concurrency

Perl 6 provides high-level primitives for concurrency and parallel execution. Instead of explicitly spawning new threads, you are encouraged to run a computation with start, which returns a Promise. This is an object that promises that in the future the computation will yield a result. The status can thus be Planned, Kept or Broken. You can chain promises, combine them, and wait for them.

In the background, a scheduler distributes such computations to operating system level threads. The default scheduler is a thread pool scheduler with an upper limit to the number of threads to use.

Communication between parallel computations should happen through thread-safe data structures. Foremost among them are the Channel, a thread-safe queue, and Supply, Perl 6's implementation of the Observer Pattern. Supplies are very powerful, because you can transform them with methods such as map, grep, throttle or delayed, and use their actor semantic to ensure that a consumer is run in only one thread at a time.
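A small sketch of these primitives working together; the numbers are arbitrary:

```raku
# run two computations in parallel; each start returns a Promise
my $first  = start { [+] 1 .. 1000 };
my $second = start { [+] 1001 .. 2000 };
say await($first) + await($second);   # 2001000

# a Channel as a thread-safe queue between producer and consumer
my $ch = Channel.new;
start { $ch.send($_) for 1 .. 3; $ch.close }
say $ch.list.sum;                     # 6
```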

Subscribe to the Perl 6 book mailing list



Sun, 15 Jan 2017

Perl 6 By Example: Stateful Silent Cron





In the last two installments we've looked at silent-cron, a wrapper around external programs that silences them in case their exit status is zero. But to make it really practical, it should also silence occasional failures.

External APIs fail, networks become congested, and other things happen that prevent a job from succeeding, so some kind of retry mechanism is desirable. In case of a cron job, cron already takes care of retrying a job on a regular basis, so silent-cron should just suppress occasional errors. On the other hand, if a job fails consistently, this is usually something that an operator or developer should look into, so it's a problem worth reporting.

To implement this functionality, silent-cron needs to store persistent state between separate runs. It needs to record the results from the current run and then analyze if the failure history qualifies as "occasional".

Persistent Storage

The storage backend needs to write and retrieve structured data, and protect concurrent access to the state file with locking. A good library for such a storage backend is SQLite, a zero-maintenance SQL engine that's available as a C library. It's public domain software and in use in most major browsers, operating systems and even some airliners.

Perl 6 gives you access to SQLite's functionality through DBIish, a generic database interface with backend drivers for SQLite, MySQL, PostgreSQL and Oracle DB. To use it, first make sure that SQLite3 is installed, including its header files. On a Debian-based Linux system, for example, you can achieve this with apt-get install libsqlite3-dev. If you are using the Rakudo Star distribution, DBIish is already available. If not, you can use one of the module installers to retrieve and install it: panda install DBIish or zef install DBIish.

To use the DBIish's SQLite backend, you first have to create a database handle by selecting the backend and supplying connection information:

use DBIish;
my $dbh = DBIish.connect('SQLite', :database('database-file.sqlite3'));

Connecting to a database file that does not yet exist creates that file.

One-off SQL statements can be executed directly on the database handle:

$dbh.do('INSERT INTO player (name) VALUES (?)', 'John');

The ? in the SQL is a placeholder that is passed out-of-band as a separate argument to the do method, which avoids potential errors such as SQL injection vulnerabilities.

Queries tend to work by first preparing a statement which returns a statement handle. You can execute a statement once or multiple times, and retrieve result rows after each execute call:

my $sth = $dbh.prepare('SELECT id FROM player WHERE name = ?');

my %ids;
for <John Jack> -> $name {
    $sth.execute($name);
    %ids{ $name } = $sth.row[0];
}
$sth.finish;

Developing the Storage Backend

We shouldn't just stuff all the storage handling code into sub MAIN; instead, we should carefully design a useful API for the storage backend. For now, we need only two pieces of functionality: insert the result of a job execution, and retrieve the most recent results.

Since silent-cron can be used to guard multiple cron jobs on the same machine, we might need something to distinguish the different jobs so that one of them succeeding doesn't prevent error reporting for one that is constantly failing. For that we introduce a job name, which can default to the command (including arguments) being executed but which can be set explicitly on the command line.

The API for the storage backend could look something like this:

my $repo = ExecutionResultRepository.new(
    jobname   => 'refresh cache',
    statefile => 'silent-cron.sqlite3',
);
$repo.insert($result);
my @last-results = $repo.tail(5);

This API isn't specific to the SQLite backend at all; a storage backend that works with plain text files could have the exact same API.

Let's implement this API. First we need the class and the two attributes that should be obvious from the usage example above:

class ExecutionResultRepository {
    has $.jobname   is required;
    has $.statefile is required;
    # ... more code

To implement the insert method, we need to connect to the database and create the relevant table if it doesn't exist yet.

has $!db;
method !db() {
    return $!db if $!db;
    $!db = DBIish.connect('SQLite', :database($.statefile));
    self!create-schema();
    return $!db;
}

This code uses a private attribute $!db to cache the database handle and a private method !db to create the handle if it doesn't exist yet.

Private methods are declared like ordinary methods, except that the name starts with an exclamation mark. To call one, substitute the method call dot for the exclamation mark, in other words, use self!db() instead of self.db().

The !db method also calls the next private method, !create-schema, which creates the storage table and some indexes:

method !create-schema() {
    $!db.do(qq:to/SCHEMA/);
        CREATE TABLE IF NOT EXISTS $table (
            id          INTEGER PRIMARY KEY,
            jobname     VARCHAR NOT NULL,
            exitcode    INTEGER NOT NULL,
            timed_out   INTEGER NOT NULL,
            output      VARCHAR NOT NULL,
            executed    TIMESTAMP NOT NULL DEFAULT (DATETIME('NOW'))
        );
    SCHEMA
    $!db.do(qq:to/INDEX/);
        CREATE INDEX IF NOT EXISTS {$table}_jobname_exitcode ON $table ( jobname, exitcode );
    INDEX
    $!db.do(qq:to/INDEX/);
        CREATE INDEX IF NOT EXISTS {$table}_jobname_executed ON $table ( jobname, executed );
    INDEX
}

Multi-line string literals are best written with the heredoc syntax. qq:to/DELIMITER/ tells Perl 6 to finish parsing the current statement so that you can still close the method call parenthesis and add the statement-ending semicolon. The next line starts the string literal, which goes on until Perl 6 finds the delimiter on a line on its own. Leading whitespace is stripped from each line of the string literal by as much as the closing delimiter is indented.

For example

print q:to/EOS/;
    Not indented
        Indented four spaces
    EOS

Produces the output

Not indented
    Indented four spaces

Now that we have a working database connection and know that the database table exists, inserting a new record becomes easy:

method insert(ExecutionResult $r) {
    self!db.do(qq:to/INSERT/, $.jobname, $r.exitcode, $r.timed-out, $r.output);
        INSERT INTO $table (jobname, exitcode, timed_out, output)
        VALUES(?, ?, ?, ?)
    INSERT
}

Selecting the most recent records is a bit more work, partially because we need to convert the table rows into objects:

method tail(Int $count) {
    my $sth = self!db.prepare(qq:to/SELECT/);
        SELECT exitcode, timed_out, output
          FROM $table
         WHERE jobname = ?
      ORDER BY executed DESC
         LIMIT $count
    SELECT
    $sth.execute($.jobname);
    $sth.allrows(:array-of-hash).map: -> %h {
        ExecutionResult.new(
            exitcode  => %h<exitcode>,
            timed-out => ?%h<timed_out>,
            output    => %h<output>,
        );
    }
}

The last statement in the tail method deserves a bit of extra attention. $sth.allrows(:array-of-hash) produces the database rows as a list of hashes. This list is lazy, that is, it's generated on-demand. Lazy lists are a very convenient feature because they allow you to use iterators and lists with the same API. For example when reading lines from a file, you can write for $handle.lines -> $line { ... }, and the lines method doesn't have to load the whole file into memory; instead it can read a line whenever it is accessed.

$sth.allrows(...) is lazy, and so is the .map call that comes after it. map transforms a list one element at a time by calling the code object that's passed to it. And that is done lazily as well. So SQLite only retrieves rows from the database file when elements of the resulting list are actually accessed.
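Laziness also lets you work with infinite sequences, as in this small sketch:

```raku
# an infinite lazy sequence; elements are computed only when accessed
my $squares := (1 .. Inf).map({ $_ ** 2 });
say $squares[^5];    # (1 4 9 16 25)
say $squares[99];    # 10000
```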

Using the Storage Backend

With the storage API in place, it's time to use it:

multi sub MAIN(*@cmd, :$timeout, :$jobname is copy,
               :$statefile='silent-cron.sqlite3', Int :$tries = 3) {
    $jobname //= @cmd.Str;
    my $result = run-with-timeout(@cmd, :$timeout);
    my $repo = ExecutionResultRepository.new(:$jobname, :$statefile);
    $repo.insert($result);

    my @runs = $repo.tail($tries);

    unless $result.is-success or @runs.grep({.is-success}) {
        say "The last @runs.elems() runs of @cmd[] all failed, the last execution ",
            $result.timed-out ?? "ran into a timeout"
                              !! "exited with code $result.exitcode()";

        print "Output:\n", $result.output if $result.output;
    }
    exit $result.exitcode // 2;
}

Now a job that succeeds a few times, and then fails up to two times in a row doesn't produce any error output, and only the third failed execution in a row produces output. You can override that on the command line with --tries=5.

Summary

We've discussed DBIish, a database API with pluggable backend, and explored using it with SQLite to store persistent data. In the process we also came across lazy lists and a new form of string literals called heredocs.

Subscribe to the Perl 6 book mailing list



Sun, 08 Jan 2017

Perl 6 By Example: Testing Silent Cron





The previous blog post left us with a bare-bones silent-cron implementation, but without tests. I probably sound like a broken record for bringing this up time and again, but I really want some tests when I start refactoring or extending my programs. And this time, getting the tests in is a bit harder, so I think it's worth discussing how to do it.

Refactoring

As a short reminder, this is what the program looks like:

#!/usr/bin/env perl6

sub MAIN(*@cmd, :$timeout) {
    my $proc = Proc::Async.new(|@cmd);
    my $collector = Channel.new;
    for $proc.stdout, $proc.stderr -> $supply {
        $supply.tap: { $collector.send($_) }
    }
    my $promise = $proc.start;
    my $waitfor = $promise;
    $waitfor = Promise.anyof(Promise.in($timeout), $promise)
        if $timeout;
    $ = await $waitfor;

    $collector.close;
    my $output = $collector.list.join;

    if !$timeout || $promise.status ~~ Kept {
        my $exitcode = $promise.result.exitcode;
        if $exitcode != 0 {
            say "Program @cmd[] exited with code $exitcode";
            print "Output:\n", $output if $output;
        }
        exit $exitcode;
    }
    else {
        $proc.kill;
        say "Program @cmd[] did not finish after $timeout seconds";
        sleep 1 if $promise.status ~~ Planned;
        $proc.kill(9);
        $ = await $promise;
        exit 2;
    }
}

There's logic in there for executing external programs with a timeout, and then there's logic for dealing with two possible outcomes. In terms of both testability and future extensions, it makes sense to factor out the execution of external programs into a subroutine. The result of this code is not a single value: we're potentially interested in the output it produced, the exit code, and whether it ran into a timeout. We could write a subroutine that returns a list or a hash of these values, but here I chose to write a small class instead:

class ExecutionResult {
    has Int $.exitcode = -1;
    has Str $.output is required;
    has Bool $.timed-out = False;
    method is-success {
        !$.timed-out && $.exitcode == 0;
    }
}

We've seen classes before, but this one has a few new features. Attributes declared with the . twigil automatically get an accessor method, so

has Int $.exitcode;

is roughly the same as

has Int $!exitcode;
method exitcode() { $!exitcode }

So it allows a user of the class to access the value in the attribute from the outside. As a bonus, you can also initialize it from the standard constructor as a named argument, ExecutionResult.new( exitcode => 42 ). The exit code is not a required attribute, because we can't know the exit code of a program that has timed out. So with has Int $.exitcode = -1 we give it a default value that applies if the attribute hasn't been initialized.

The output is a required attribute, so we mark it as such with is required. That's a trait. Traits are pieces of code that modify the behavior of other things, here of an attribute. They crop up in several places, for example in subroutine signatures (is copy on a parameter), variable declarations and classes. If you try to call ExecutionResult.new() without specifying an output, you get an error like this:

The attribute '$!output' is required, but you did not provide a value for it.

Mocking and Testing

Now that we have a convenient way to return more than one value from a hypothetical subroutine, let's look at what this subroutine might look like:

sub run-with-timeout(@cmd, :$timeout) {
    my $proc = Proc::Async.new(|@cmd);
    my $collector = Channel.new;
    for $proc.stdout, $proc.stderr -> $supply {
        $supply.tap: { $collector.send($_) }
    }
    my $promise = $proc.start;
    my $waitfor = $promise;
    $waitfor = Promise.anyof(Promise.in($timeout), $promise)
        if $timeout;
    $ = await $waitfor;

    $collector.close;
    my $output = $collector.list.join;

    if !$timeout || $promise.status ~~ Kept {
        return ExecutionResult.new(
            :$output,
            :exitcode($promise.result.exitcode),
        );
    }
    else {
        $proc.kill;
        sleep 1 if $promise.status ~~ Planned;
        $proc.kill(9);
        $ = await $promise;
        return ExecutionResult.new(
            :$output,
            :timed-out,
        );
    }
}

The usage of Proc::Async has remained the same, but instead of printing a report when an error occurs, the routine now returns an ExecutionResult object.

This simplifies the MAIN sub quite a bit:

multi sub MAIN(*@cmd, :$timeout) {
    my $result = run-with-timeout(@cmd, :$timeout);
    unless $result.is-success {
        say "Program @cmd[] ",
            $result.timed-out ?? "ran into a timeout"
                              !! "exited with code $result.exitcode()";

        print "Output:\n", $result.output if $result.output;
    }
    exit $result.exitcode // 2;
}

A new syntactic feature here is the ternary operator, CONDITION ?? TRUE-BRANCH !! FALSE-BRANCH, which you might know from other programming languages such as C or Perl 5 as CONDITION ? TRUE-BRANCH : FALSE-BRANCH.
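For example:

my $exitcode = 0;
say $exitcode == 0 ?? 'success' !! 'failure';    # success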

Finally, the logical defined-or operator LEFT // RIGHT returns the LEFT side if it's defined, and if not, runs the RIGHT side and returns its value. It works like the || and or infix operators, except that those check for the boolean value of the left, not whether they are defined.

In Perl 6, we distinguish between defined and true values. By default, all instances are true and defined, and all type objects are false and undefined. Several built-in types override what they consider to be true. Numbers that equal 0 evaluate to False in a boolean context, as do empty strings and empty containers such as arrays, hashes and sets. On the other hand, only the built-in type Failure overrides definedness. You can override the truth value of a custom type by implementing a method Bool (which should return True or False), and the definedness with a method defined.
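A short snippet illustrates both distinctions:

my $count = 0;
say $count // 'no value';    # 0, because 0 is defined
say $count || 'no value';    # no value, because 0 is false

class AlwaysFalse {
    method Bool() { False }
}
say AlwaysFalse.new ?? 'true' !! 'false';    # false, despite being a defined instance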

Now we could start testing the sub run-with-timeout by writing custom external commands with defined characteristics (output, run time, exit code), but that's rather fiddly to do in a reliable, cross-platform way. So instead I want to replace Proc::Async with a mock implementation, and give the sub a way to inject that:

sub run-with-timeout(@cmd, :$timeout, :$executer = Proc::Async) {
    my $proc = $executer.defined ?? $executer !! $executer.new(|@cmd);
    # rest as before

Looking through sub run-with-timeout, we can make a quick list of methods that the stub Proc::Async implementation needs: stdout, stderr, start and kill. Both stdout and stderr need to return a Supply. The simplest thing that could possibly work is to return a Supply that will emit just a single value:

my class Mock::Proc::Async {
    has $.out = '';
    has $.err = '';
    method stdout {
        Supply.from-list($.out);
    }
    method stderr {
        Supply.from-list($.err);
    }

Supply.from-list returns a Supply that will emit all the arguments passed to it; in this case just a single string.

The simplest possible implementation of kill just does nothing:

    method kill($?) {}

A lone $? in a signature is an anonymous optional parameter: like $foo?, but without a name you could use to refer to it.
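A subroutine declared with such a parameter accepts an optional argument, but cannot do anything with its value:

sub ignore($?) { say 'called' }
ignore();      # called
ignore(42);    # called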

Only one method remains that needs to be stubbed: start. It's supposed to return a Promise that, after a defined number of seconds, returns a Proc object or a mock thereof. Since the code only calls the exitcode method on it, writing a stub for it is easy:

has $.exitcode = 0;
has $.execution-time = 1;
method start {
    Promise.in($.execution-time).then({
        (class {
            has $.exitcode;
        }).new(:$.exitcode);
    });
}

Since we don't need the class for the mock Proc anywhere else, we don't even need to give it a name. class { ... } creates an anonymous class, and the .new call on it creates a new object from it.

As mentioned before, a Proc with a non-zero exit code throws an exception when evaluated in void context, or sink context as we call it in Perl 6. We can emulate this behavior by extending the anonymous class a bit:

class {
    has $.exitcode;
    method sink() {
        die "mock Proc used in sink context";
    }
}

With all this preparation in place, we can finally write some tests:

multi sub MAIN('test') {
    use Test;

    my class Mock::Proc::Async {
        has $.exitcode = 0;
        has $.execution-time = 0;
        has $.out = '';
        has $.err = '';
        method kill($?) {}
        method stdout {
            Supply.from-list($.out);
        }
        method stderr {
            Supply.from-list($.err);
        }
        method start {
            Promise.in($.execution-time).then({
                (class {
                    has $.exitcode;
                    method sink() {
                        die "mock Proc used in sink context";
                    }
                }).new(:$.exitcode);
            });
        }
    }

    # no timeout, success
    my $result = run-with-timeout([],
        timeout => 2,
        executer => Mock::Proc::Async.new(
            out => 'mocked output',
        ),
    );
    isa-ok $result, ExecutionResult;
    is $result.exitcode, 0, 'exit code';
    is $result.output, 'mocked output', 'output';
    ok $result.is-success, 'success';

    # timeout
    $result = run-with-timeout([],
        timeout => 0.1,
        executer => Mock::Proc::Async.new(
            execution-time => 1,
            out => 'mocked output',
        ),
    );
    isa-ok $result, ExecutionResult;
    is $result.output, 'mocked output', 'output';
    ok $result.timed-out, 'timeout reported';
    nok $result.is-success, 'success';
}

This runs through two scenarios, one where a timeout is configured but not used (because the mocked external program exits first), and one where the timeout takes effect.

Improving Reliability and Timing

Relying on timing in tests is always unattractive. If the times are too short (or too close together), you risk sporadic test failures on slow or heavily loaded machines. If you use more conservative temporal spacing of tests, the tests can become very slow.

There's a module (not distributed with Rakudo) to alleviate this pain: Test::Scheduler provides a thread scheduler with virtualized time, allowing you to write the tests like this:

use Test::Scheduler;
my $*SCHEDULER = Test::Scheduler.new;
my $result = start run-with-timeout([],
    timeout => 5,
    executer => Mock::Proc::Async.new(
        execution-time => 2,
        out => 'mocked output',
    ),
);
$*SCHEDULER.advance-by(5);
$result = $result.result;
isa-ok $result, ExecutionResult;
# more tests here

This installs the custom scheduler, and $*SCHEDULER.advance-by(5) instructs it to advance the virtual time by 5 seconds, without having to wait five actual seconds. At the time of writing (December 2016), Test::Scheduler is a rather new module, and has a bug that prevents the second test case from working this way.

Installing a Module

If you want to try out Test::Scheduler, you need to install it first. If you run Rakudo Star, it has already provided you with the panda module installer. You can use that to download and install the module for you:

$ panda install Test::Scheduler

If you don't have panda available, you can instead bootstrap zef, an alternative module installer:

$ git clone https://github.com/ugexe/zef.git
$ cd zef
$ perl6 -Ilib bin/zef install .

and then use zef to install the module:

$ zef install Test::Scheduler

Summary

In this installment, we've seen attributes with accessors, the ternary operator and anonymous classes. Testing of threaded code has been discussed, and how a third-party module can help. Finally we had a very small glimpse at the two module installers, panda and zef.

Subscribe to the Perl 6 book mailing list

[/perl-6] Permanent link

Sun, 01 Jan 2017

Perl 6 By Example: Silent Cron, a Cron Wrapper



This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


On Linux and UNIX-Like systems, a program called cron periodically executes user-defined commands in the background. It is used for system maintenance tasks such as refreshing or removing caches, rotating and deleting old log files and so on.

If such a command produces any output, cron typically sends an email containing the output so that an operator can look at it and judge if some action is required.

But not all command line programs are written for usage with cron. For example they might produce output even on successful execution, and indicate failure through a non-zero exit code. Or they might hang, or otherwise misbehave.

To deal with such commands, we'll develop a small program called silent-cron, which wraps such commands and suppresses output when the exit code is zero. It also allows you to specify a timeout that kills the wrapped program if it takes too long:

$ silent-cron -- command-that-might-fail args
$ silent-cron --timeout=5 -- command-that-might-hang

Running Commands Asynchronously

When you want to run external commands, Perl 6 gives you basically two choices: run, a simple, synchronous interface, and Proc::Async, an asynchronous and slightly more complex option. Even though we will omit the timeout in the first iteration, we need to be aware that implementing the timeout is easier in the asynchronous interface, so that's what we'll use:

#!/usr/bin/env perl6

sub MAIN(*@cmd) {
    my $proc = Proc::Async.new(@cmd);
    my $collector = Channel.new;
    for $proc.stdout, $proc.stderr -> $supply {
        $supply.tap: { $collector.send($_) }
    }
    my $result = $proc.start.result;
    $collector.close;
    my $output = $collector.list.join;

    my $exitcode = $result.exitcode;
    if $exitcode != 0 {
        say "Program @cmd[] exited with code $exitcode";
        print "Output:\n", $output if $output;
    }
    exit $exitcode;
}

There's a big chunk of new features and concepts in here, so let's go through the code bit by bit.

sub MAIN(*@cmd) {
    my $proc = Proc::Async.new(@cmd);

This collects all the command line arguments in the array variable @cmd, where the first element is the command to be executed, and any further elements are arguments passed to this command. The second line creates a new Proc::Async instance, but doesn't yet run the command.

We need to capture all output from the command, so we tap its STDOUT and STDERR streams (file handles 1 and 2 on Linux) and combine them into a single string. In the asynchronous API, STDOUT and STDERR are modeled as objects of type Supply, and hence are streams of events. Since supplies can emit events in parallel, we need a thread-safe data structure for collecting the result, and Perl 6 conveniently provides a Channel for that:

my $collector = Channel.new;

To actually get the output from the program, we need to tap into the STDOUT and STDERR streams:

for $proc.stdout, $proc.stderr -> $supply {
    $supply.tap: { $collector.send($_) }
}

Each supply executes the block { $collector.send($_) } for each string it receives. The string can be a character, a line or something larger if the stream is buffered. All we do with it is put the string into the channel $collector via the send method.

Now that the streams are tapped, we can start the program and wait for it to finish:

my $result = $proc.start.result;

Proc::Async.start executes the external process and returns a Promise. A promise wraps a piece of code that potentially runs on another thread, has a status (Planned, Kept or Broken), and once it's finished, a result. Accessing the result automatically waits for the wrapped code to finish. Here the code is the one that runs the external program, and the result is an object of type Proc (the same type that the run() function from the synchronous interface returns).

After this line, we can be sure that the external command has terminated, and thus no more output will come from $proc.stdout and $proc.stderr. Hence we can safely close the channel and access all its elements through Channel.list:

$collector.close;
my $output = $collector.list.join;

Finally it's time to check if the external command was successful -- by checking its exit code -- and to exit the wrapper with the command's exit code:

my $exitcode = $result.exitcode;
if $exitcode != 0 {
    say "Program @cmd[] exited with code $exitcode";
    print "Output:\n", $output if $output;
}
exit $exitcode;

Implementing Timeouts

The idiomatic way to implement timeouts in Perl 6 is to use the Promise.anyof combinator together with a timer:

sub MAIN(*@cmd, :$timeout) {
    my $proc = Proc::Async.new(|@cmd);
    my $collector = Channel.new;
    for $proc.stdout, $proc.stderr -> $supply {
        $supply.tap: { $collector.send($_) }
    }
    my $promise = $proc.start;
    my $waitfor = $promise;
    $waitfor = Promise.anyof(Promise.in($timeout), $promise)
        if $timeout;
    await $waitfor;

The initialization of $proc hasn't changed. But instead of accessing $proc.start.result, we store the promise returned from $proc.start. If the user specified a timeout, we run this piece of code:

$waitfor = Promise.anyof(Promise.in($timeout), $promise)

Promise.in($seconds) returns a promise that will be fulfilled in $seconds seconds. It's basically the same as start { sleep $seconds }, but the scheduler can be a bit smarter about not allocating a whole thread just for sleeping.

Promise.anyof($p1, $p2) returns a promise that is fulfilled as soon as one of the arguments (which should also be promises) is fulfilled. So we wait either until the external program finished, or until the sleep promise is fulfilled.

With await $waitfor; the program waits for the promise to be fulfilled (or broken). When that is the case, we can't simply access $promise.result as before, because $promise (which is the promise for the external program) might not be fulfilled in the case of a timeout. So we have to check the status of the promise first and only then can we safely access $promise.result:

if !$timeout || $promise.status ~~ Kept {
    my $exitcode = $promise.result.exitcode;
    if $exitcode != 0 {
        say "Program @cmd[] exited with code $exitcode";
        print "Output:\n", $output if $output;
    }
    exit $exitcode;
}
else {
    ...
}

In the else { ... } branch, we need to handle the timeout case. This might be as simple as printing a statement that a timeout has occurred, and when silent-cron exits immediately afterwards, that might be acceptable. But we might want to do more in the future, so we should kill the external program. And if the program doesn't terminate after the friendly kill signal, it should receive a kill(9), which on UNIX systems forcefully terminates the program:

else {
    $proc.kill;
    say "Program @cmd[] did not finish after $timeout seconds";
    sleep 1 if $promise.status ~~ Planned;
    $proc.kill(9);
    await $promise;
    exit 2;
}

await $promise returns the result of the promise, so here a Proc object. Proc has a safety feature built in that if the command returned with a non-zero exit code, evaluating the object in void context throws an exception.

Since we explicitly handle the non-zero exit code in the code, we can suppress the generation of this exception by assigning the return value from await to a dummy variable:

my $dummy = await $promise

Since we don't need the value, we can also assign it to an anonymous variable instead:

$ = await $promise

More on Promises

If you have worked with concurrent or parallel programs in other languages, you might have come across threads, locks, mutexes, and other low-level constructs. These exist in Perl 6 too, but their direct usage is discouraged.

The problem with such low-level primitives is that they don't compose well. You can have two libraries that use threads and work fine on their own, but lead to deadlocks when combined within the same program. Or different components might launch threads on their own, which can lead to too many threads and high memory consumption when several such components come together in the same process.

Perl 6 provides higher-level primitives. Instead of spawning a thread, you use start to run code asynchronously, and the scheduler decides which thread to run it on. If more start calls ask for code to be run than there are threads available, some of the code blocks are queued up and run serially.

Here is a very simple example of running a computation in the background:

sub count-primes(Int $upto) {
    (1..$upto).grep(&is-prime).elems;
}

my $p = start count-primes 10_000;
say $p.status;
await $p;
say $p.result;

It gives this output:

Planned
1229

You can see that the main line of execution continued after the start call, and $p immediately had a value -- the promise, with status Planned.

As we've seen before, there are combinators for promises, anyof and allof. You can also chain actions to a promise using the then method:

sub count-primes(Int $upto) {
    (1..$upto).grep(&is-prime).elems;
}

my $p1 = start count-primes 10_000;
my $p2 = $p1.then({ say .result });
await $p2;

If an exception is thrown inside asynchronously executing code, the status of the promise becomes Broken, and calling its .result method re-throws the exception.

As a demonstration of the scheduler distributing tasks, let's consider a small Monte Carlo simulation to calculate an approximation for π. We generate a pair of random numbers between zero and one, and interpret them as dots in a square. A quarter circle with radius one covers the area of π/4, so the ratio of randomly placed dots within the quarter circle to the total number of dots approaches π/4, if we use enough dots.

sub pi-approx($iterations) {
    my $inside = 0;
    for 1..$iterations {
        my $x = 1.rand;
        my $y = 1.rand;
        $inside++ if $x * $x + $y * $y <= 1;
    }
    return ($inside / $iterations) * 4;
}
my @approximations = (1..1000).map({ start pi-approx(80) });
await @approximations;

say @approximations.map({.result}).sum / @approximations;

The program starts one thousand computations asynchronously, but if you look at a system monitoring tool while it runs, you'll observe only 16 threads running. This magic number comes from the default thread scheduler, and we can override it by installing our own scheduler instance before the previous code runs:

my $*SCHEDULER = ThreadPoolScheduler.new(:max_threads(3));

For CPU-bound tasks like this Monte Carlo simulation, it is a good idea to limit the number of threads to roughly the number of (possibly virtual) CPU cores; if many threads are stuck waiting for IO, a higher number of threads can yield better performance.

Possible Extensions

If you want to play with silent-cron, you could add a retry mechanism. If a command fails because of an external dependency (like an API or an NFS share), it might take time for that external dependency to recover. Hence you should add a quadratic or exponential backoff, that is, the wait time between retries should increase quadratically (1, 4, 9, 16, 25, ...) or exponentially (1, 2, 4, 8, 16, 32, ...).
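A sketch of such a retry loop with exponential backoff, reusing run-with-timeout from above (the number of attempts is an arbitrary choice):

sub run-with-retries(@cmd, :$timeout, :$attempts = 5) {
    for ^$attempts -> $i {
        my $result = run-with-timeout(@cmd, :$timeout);
        return $result if $result.is-success;
        # wait 1, 2, 4, 8, ... seconds before the next attempt
        sleep 2 ** $i if $i + 1 < $attempts;
    }
    return Nil;    # all attempts failed
}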

Summary

We've seen an asynchronous API for running external programs and how to use Promises to implement timeouts. We've also discussed how promises are distributed to threads by a scheduler, allowing you to start an arbitrary number of promises without overloading your computer.

Subscribe to the Perl 6 book mailing list


Sun, 25 Dec 2016

Perl 6 By Example: Testing the Say Function



This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Testing say()

In the previous installment I changed some code so that it wouldn't produce output, and instead did the output in the MAIN sub, which conveniently went untested.

Changing code to make it easier to test is a legitimate practice. But if you do have to test code that produces output by calling say, there's a small trick you can use. say works on a file handle, and you can swap out the default file handle, which is connected to standard output. Instead, you can put a dummy file handle in its place that captures the lower-level commands issued to it, and record this for testing.

There's a ready-made module for that, IO::String, but for the sake of learning we'll look at how it works:

use v6;

# function to be tested
sub doublespeak($x) {
    say $x ~ $x;
}

use Test;
plan 1;

my class OutputCapture {
    has @!lines;
    method print(\s) {
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}

my $output = do {
    my $*OUT = OutputCapture.new;
    doublespeak(42);
    $*OUT.captured;
};

is $output, "4242\n", 'doublespeak works';

The first part of the code is the function we want to test, sub doublespeak. It concatenates its argument with itself using the ~ string concatenation operator. The result is passed to say.

Under the hood, say does a bit of formatting, and then looks up the variable $*OUT. The * after the sigil marks it as a dynamic variable. The lookup for the dynamic variable goes through the call stack, and in each stack frame looks for a declaration of the variable, taking the first it finds. say then calls the method print on that object.

Normally, $*OUT contains an object of type IO::Handle, but the say function doesn't really care about that, as long as it can call a print method on that object. That's called duck typing: we don't really care about the type of the object, as long as it can quack like a duck. Or in this case, print like a duck.

Then comes the loading of the test module, followed by the declaration of how many tests to run:

use Test;
plan 1;

You can leave out the second line, and instead call done-testing after your tests. But if there's a chance that the test code itself might be buggy, and not run tests it's supposed to, it's good to have an up-front declaration of the number of expected tests, so that the Test module or the test harness can catch such errors.

The next part of the example is the declaration of a type which we can use to emulate the IO::Handle:

my class OutputCapture {
    has @!lines;
    method print(\s) {
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}

class introduces a class, and the my prefix makes the name lexically scoped, just like in a my $var declaration.

has @!lines declares an attribute, that is, a variable that exists separately for each instance of class OutputCapture. The ! marks it as an attribute. We could leave it out, but having it right there means you always know where the name comes from when reading a larger class.

The attribute @!lines starts with an @, not a $ as other variables we have seen so far. The @ is the sigil for an array variable.

You might be seeing a trend now: the first character of a variable or attribute name, the sigil, denotes its rough type ($ for scalars, @ for arrays, & for routines, and later we'll learn about % for hashes). If the second character is not a letter, it specifies the variable's scope. We call this second character a twigil. So far we've seen * for dynamic variables, and ! for attributes. Stay tuned for more.
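A few declarations that illustrate the pattern:

my $scalar;            # sigil $, no twigil: a lexical scalar
my @array;             # sigil @: an array
my &callback = &say;   # sigil &: a routine
my $*LOG-LEVEL = 3;    # twigil *: a dynamic variable
class Example {
    has $!hidden;      # twigil !: a private attribute
    has $.visible;     # twigil .: an attribute with a public accessor
}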

The penultimate block of our example is this:

my $output = do {
    my $*OUT = OutputCapture.new;
    doublespeak(42);
    $*OUT.captured;
};

do { ... } just executes the code inside the curly braces and returns the value of the last statement. Like all code blocks in Perl 6, it also introduces a new lexical scope.

The new scope comes in handy in the next line, where my $*OUT declares a new dynamic variable $*OUT, which is however only valid in the scope of the block. It is initialized with OutputCapture.new, a new instance of the class declared earlier. new isn't magic, it's simply inherited from OutputCapture's superclass. We didn't declare one, but by default, classes get type Any as a superclass, which provides (among other things) the method new as a constructor.

The call to doublespeak calls say, which in turn calls $*OUT.print. And since $*OUT is an instance of OutputCapture in this dynamic scope, the string passed to say lands in OutputCapture's attribute @!lines, where $*OUT.captured can access it again.

The final line,

is $output, "4242\n", 'doublespeak works';

calls the is function from the Test module.

In good old testing tradition, this produces output in the TAP format:

1..1
ok 1 - doublespeak works

Summary

We've seen that say() uses a dynamically scoped variable, $*OUT, as its output file handle. For testing purposes, we can substitute that with an object of our making. Which made us stumble upon the first glimpses of how classes are written in Perl 6.

Subscribe to the Perl 6 book mailing list


Wed, 21 Dec 2016

Progress in Icinga2 Land



Last month I blogged about my negative experiences with the Icinga2 API. Since then, stuff has happened in Icinga2 land, including a very productive and friendly meeting with the Icinga2 product manager and two of the core developers.

My first issue was that objects created through the API sometimes don't show up in the web interface. We learned that this can be avoided by explicitly specifying the zone attribute, and that Icinga 2.6 doesn't require this anymore.

The second issue, CREATE + UPDATE != CREATE isn't easy to fix in Icinga2. Apply rules can contain basically arbitrary code, and tracking which ones to run, possibly run in reverse etc. for an update is not easy, and while the Icinga2 developers want to fix this eventually, the fix isn't yet in sight. The Icinga Director has a workaround, but it involves restarting the Icinga daemon for each change or batch of changes, an operational characteristic we'd like to avoid. The inability to write templates through the API stems from the same underlying problem, so we won't see write support for templates soon.

The API quirks will likely remain. With support from the Icinga2 developers, I was able to get selection by element of an array variable working, though the process did involve finding at least one more bug in Icinga. The developers are working on that one, though :-).

Even though my bug report requesting more documentation has been closed, the Netways folks have improved the documentation thoroughly, and there are plans for even more improvements, for example making it easier to find the relevant docs by more cross-linking.

All in all, there is a definitive upwards trend in Icinga2's quality, and in my perception of it. I'd like to thank the Icinga2 folks at Netways for their work, and for taking the time to talk to a very vocal critic.

[/misc] Permanent link

Sun, 18 Dec 2016

Perl 6 By Example: Testing the Timestamp Converter



This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In the previous installment, we've seen some code go through several iterations of refactoring. Refactoring without automated tests tends to make me uneasy, so I actually had a small shell script that called the script under development with several different argument combinations and compared it to an expected result.

Let's now look at a way to write test code in Perl 6 itself.

As a reminder, this is what the code looked like when we left it:

#!/usr/bin/env perl6

#| Convert timestamp to ISO date
multi sub MAIN(Int \timestamp) {
    sub formatter($_) {
        sprintf '%04d-%02d-%02d %02d:%02d:%02d',
                .year, .month,  .day,
                .hour, .minute, .second,
    }
    given DateTime.new(+timestamp, :&formatter) {
        when .Date.DateTime == $_ { say .Date }
        default { .say }
    }
}

#| Convert ISO date to timestamp
multi sub MAIN(Str $date where { try Date.new($_) }, Str $time?) {
    my $d = Date.new($date);
    if $time {
        my ( $hour, $minute, $second ) = $time.split(':');
        say DateTime.new(date => $d, :$hour, :$minute, :$second).posix;
    }
    else {
        say $d.DateTime.posix;
    }
}

In the Perl community it's common to move logic into modules to make it easier to test with external test scripts. In Perl 6, that's still common, but for small tools such as this, I prefer to stick with a single file containing code and tests, and to run the tests via a separate test command.

To make testing easier, let's first separate I/O from the application logic:

#!/usr/bin/env perl6

sub from-timestamp(Int \timestamp) {
    sub formatter($_) {
        sprintf '%04d-%02d-%02d %02d:%02d:%02d',
                .year, .month,  .day,
                .hour, .minute, .second,
    }
    given DateTime.new(+timestamp, :&formatter) {
        when .Date.DateTime == $_ { return .Date }
        default { return $_ }
    }
}

sub from-date-string(Str $date, Str $time?) {
    my $d = Date.new($date);
    if $time {
        my ( $hour, $minute, $second ) = $time.split(':');
        return DateTime.new(date => $d, :$hour, :$minute, :$second);
    }
    else {
        return $d.DateTime;
    }
}

#| Convert timestamp to ISO date
multi sub MAIN(Int \timestamp) {
    say from-timestamp(+timestamp);
}

#| Convert ISO date to timestamp
multi sub MAIN(Str $date where { try Date.new($_) }, Str $time?) {
    say from-date-string($date, $time).posix;
}

With this small refactoring out of the way, let's add some tests:

#| Run internal tests
multi sub MAIN('test') {
    use Test;
    plan 4;
    is-deeply from-timestamp(1450915200), Date.new('2015-12-24'),
        'Timestamp to Date';

    my $dt = from-timestamp(1450915201);
    is $dt, "2015-12-24 00:00:01",
        'Timestamp to DateTime with string formatting';

    is from-date-string('2015-12-24').posix, 1450915200,
        'from-date-string, one argument';
    is from-date-string('2015-12-24', '00:00:01').posix, 1450915201,
        'from-date-string, two arguments';
}

And you can run it:

./autotime test
1..4
ok 1 - Timestamp to Date
ok 2 - Timestamp to DateTime with string formatting
ok 3 - from-date-string, one argument
ok 4 - from-date-string, two arguments

The output format is that of the Test Anything Protocol (TAP), which is the de facto standard in the Perl community, but is now also used in other communities. For larger test suites it is a good idea to run the tests through a test harness. For our four lines of test output, this isn't yet necessary, but if you want to do that anyway, you can use the prove program that's shipped with Perl 5:

$ prove -e "" "./autotime test"
./autotime test .. ok
All tests successful.
Files=1, Tests=4,  0 wallclock secs ( 0.02 usr  0.01 sys +  0.23 cusr  0.02 csys =  0.28 CPU)
Result: PASS

In a terminal, this even colors the "All tests successful" output in green, to make it easier to spot. Test failures are marked up in red.

How does the testing work? The first line of code uses a new feature we haven't seen yet:

multi sub MAIN('test') {

What's that, a literal instead of a parameter in the subroutine signature? That's right. And it's a shortcut for

multi sub MAIN(Str $anon where {$anon eq 'test'}) {

except that it does not declare the variable $anon. So it's a multi candidate that you can only call by supplying the string 'test' as the sole argument.

The next line, use Test;, loads the test module that's shipped with Rakudo Perl 6. It also imports into the current lexical scope all the symbols that Test exports by default. This includes the functions plan, is and is-deeply that are used later on.

plan 4 declares that we want to run four tests. This is useful for detecting unplanned, early exits from the test code, or errors in looping logic in the test code that leads to running fewer tests than planned. If you can't be bothered to count your tests in advance, you can leave out the plan call, and instead call done-testing after your tests are done.

Both is-deeply and is expect the value to be tested as the first argument, the expected value as the second argument, and an optional test label string as the third argument. The difference is that is() compares the first two arguments as strings, whereas is-deeply uses a deep equality comparison logic using the eqv operator. Such tests only pass if the two arguments are of the same type, and recursively are (or contain) the same values.
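
To see the distinction in isolation, here is a small standalone snippet (not part of the autotime tool). The first test passes because both sides stringify to "42", while the second fails because eqv also compares types:

```perl6
use Test;
plan 2;

# is() compares the string representations, so Int 42 and Str "42" match:
is 42, "42", 'is() compares as strings';

# is-deeply() uses eqv, which requires the same type, so this test fails:
is-deeply 42, "42", 'is-deeply also compares types';
```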

More testing functions are available, like ok(), which succeeds for a true argument, and nok(), which expects a false argument. You can also nest tests with subtest:

#| Run internal tests
multi sub MAIN('test') {
    use Test;
    plan 2;
    subtest 'timestamp', {
        plan 2;
        is-deeply from-timestamp(1450915200), Date.new('2015-12-24'),
            'Date';

        my $dt = from-timestamp(1450915201);
        is $dt, "2015-12-24 00:00:01",
            'DateTime with string formatting';
    };

    subtest 'from-date-string', {
        plan 2;
        is from-date-string('2015-12-24').posix, 1450915200,
            'one argument';
        is from-date-string('2015-12-24', '00:00:01').posix, 1450915201,
            'two arguments';
    };
}

Each call to subtest counts as a single test to the outer test run, so plan 4; has become plan 2;. The subtest call itself has a test label, and inside each subtest there is again a plan, followed by calls to test functions as before. This is very useful when writing custom test functions that execute a variable number of individual tests.

The output from the nested tests looks like this:

1..2
    1..2
    ok 1 - Date
    ok 2 - DateTime with string formatting
ok 1 - timestamp
    1..2
    ok 1 - one argument
    ok 2 - two arguments
ok 2 - from-date-string

The test harness now reports just the two top-level tests as the number of tests run (and passed).

And yes, you can nest subtests within subtests, should you really feel the urge to do so.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 11 Dec 2016

Perl 6 By Example: Datetime Conversion for the Command Line


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Occasionally I work with a database that stores dates and datetimes as UNIX timestamps, aka the number of seconds since midnight 1970-01-01. Unlike the original author of the database and surrounding code, I cannot convert between UNIX timestamps and human readable date formats in my head, so I write tools for that.

Our goal here is to write a small tool that converts back and forth between UNIX timestamps and dates/times:

$ autotime 2015-12-24
1450915200
$ autotime 2015-12-24 11:23:00
1450956180
$ autotime 1450915200
2015-12-24
$ autotime 1450956180
2015-12-24 11:23:00

Libraries To The Rescue

Date and time arithmetic is surprisingly hard to get right, and at the same time rather boring, hence I'm happy to delegate that part to libraries.

Perl 6 ships with DateTime (somewhat inspired by the Perl 5 module of the same name) and Date (mostly blatantly stolen from Perl 5's Date::Simple module) in the core library. Those two will do the actual conversion, so we can focus on the input and output, and detecting the formats to decide in which direction to convert.

For the conversion from a UNIX timestamp to a date or datetime, the DateTime.new constructor comes in handy. It has a variant that accepts a single integer as a UNIX timestamp:

$ perl6 -e "say DateTime.new(1450915200)"
2015-12-24T00:00:00Z

Looks like we're almost done with one direction, right?

#!/usr/bin/env perl6
sub MAIN(Int $timestamp) {
    say DateTime.new($timestamp)
}

Let's run it:

$ autotime 1450915200
Invalid DateTime string '1450915200'; use an ISO 8601 timestamp (yyyy-mm-ddThh:mm:ssZ or yyyy-mm-ddThh:mm:ss+01:00) instead
  in sub MAIN at autotime line 2
  in block <unit> at autotime line 2

Oh my, what happened? It seems that the DateTime constructor views the argument as a string, even though the parameter to sub MAIN is declared as an Int. How can that be? Let's add some debugging output:

#!/usr/bin/env perl6
sub MAIN(Int $timestamp) {
    say $timestamp.^name;
    say DateTime.new($timestamp)
}

Running it now with the same invocation as before, there's an extra line of output before the error:

IntStr

$thing.^name is a call to a method of the meta class of $thing, and name asks it for its name. In other words, the name of the class. IntStr is a subclass of both Int and Str, which is why the DateTime constructor legitimately considers it a Str. The mechanism that parses command line arguments before they are passed on to MAIN converts the string from the command line to IntStr instead of Str, in order to not lose information in case we do want to treat it as a string.
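
You can observe the allomorph behavior directly; the core val routine produces the same kind of dual-typed values (this snippet is just an illustration, not part of autotime):

```perl6
my $arg = val("42");    # val() parses a string into an allomorph
say $arg.^name;         # IntStr
say $arg ~~ Int;        # True
say $arg ~~ Str;        # True -- it matches both types
```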

Cutting a long story short, we can force the argument into a "real" integer by adding a + prefix, which is the general mechanism for conversion to a numeric value:

#!/usr/bin/env perl6
sub MAIN(Int $timestamp) {
    say DateTime.new(+$timestamp)
}

A quick test shows that it now works:

$ ./autotime-01.p6 1450915200
2015-12-24T00:00:00Z

The output is in the ISO 8601 timestamp format, which might not be the easiest on the eye. For a date (when hour, minute and second are zero), we really want just the date:

#!/usr/bin/env perl6
sub MAIN(Int $timestamp) {
    my $dt = DateTime.new(+$timestamp);
    if $dt.hour == 0 && $dt.minute == 0 && $dt.second == 0 {
        say $dt.Date;
    }
    else {
        say $dt;
    }
}

Better:

$ ./autotime 1450915200
2015-12-24

But the conditional is a bit clunky. Really, three comparisons to 0?

Perl 6 has a neat little feature that lets you write this more compactly:

if all($dt.hour, $dt.minute, $dt.second) == 0 {
    say $dt.Date;
}

all(...) creates a Junction, a composite value of several other values, that also stores a logical mode. When you compare a junction to another value, that comparison automatically applies to all the values in the junction. The if statement evaluates the junction in a boolean context, and in this case only returns True if all comparisons returned True as well.

Other types of junctions exist: any, all, none and one. Considering that 0 is the only integer that is false in a boolean context, we could even write the statement above as:

if none($dt.hour, $dt.minute, $dt.second) {
    say $dt.Date;
}

Neat, right?
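
A few standalone one-liners show how junctions collapse in boolean context (the so prefix forces boolean context):

```perl6
say so all(2, 4, 6) %% 2;     # True: every element is divisible by 2
say so any(1, 2, 3) > 2;      # True: at least one element is greater
say so none(1, 2, 3) > 5;     # True: no element is greater than 5
```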

But you don't always need fancy language constructs to write concise programs. In this case, approaching the problem from a slightly different angle yields even shorter and clearer code. If the DateTime object round-trips a conversion to Date and back to DateTime without loss of information, it's clearly a Date:

if $dt.Date.DateTime == $dt {
    say $dt.Date;
}
else {
    say $dt;
}

DateTime Formatting

For a timestamp that doesn't resolve to a full day, the output from our script currently looks like this:

2015-12-24T00:00:01Z

where "Z" indicates the UTC or "Zulu" timezone.

Instead I'd like it to be

2015-12-24 00:00:01

The DateTime class supports custom formatters, so let's write one:

sub MAIN(Int $timestamp) {
    my $dt = DateTime.new(+$timestamp, formatter => sub ($o) {
            sprintf '%04d-%02d-%02d %02d:%02d:%02d',
                    $o.year, $o.month,  $o.day,
                    $o.hour, $o.minute, $o.second,
    });
    if $dt.Date.DateTime == $dt {
        say $dt.Date;
    }
    else {
        say $dt.Str;
    }
}

Now the output looks better:

$ ./autotime 1450915201
2015-12-24 00:00:01

The syntax formatter => ... in the context of an argument denotes a named argument, which means the name and not position in the argument list decides which parameter to bind to. This is very handy if there are a bunch of parameters.

I don't like the code anymore, because the formatter is inline in the DateTime.new(...) call, which I find unclear.

Let's make this a separate routine:

#!/usr/bin/env perl6
sub MAIN(Int $timestamp) {
    sub formatter($o) {
        sprintf '%04d-%02d-%02d %02d:%02d:%02d',
                $o.year, $o.month,  $o.day,
                $o.hour, $o.minute, $o.second,
    }
    my $dt = DateTime.new(+$timestamp, formatter => &formatter);
    if $dt.Date.DateTime == $dt {
        say $dt.Date;
    }
    else {
        say $dt.Str;
    }
}

Yes, you can put a subroutine declaration inside the body of another subroutine declaration; a subroutine is just an ordinary lexical symbol, like a variable declared with my.

In the line my $dt = DateTime.new(+$timestamp, formatter => &formatter);, the syntax &formatter refers to the subroutine as an object, without calling it.

This being Perl 6, formatter => &formatter has a shortcut: :&formatter. As a general rule, if you want to fill a named parameter whose name is the name of a variable, and whose value is the value of the variable, you can create it by writing :$variable. And as an extension, :thing is short for thing => True.
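
Here is a small, self-contained illustration of those shortcuts (the greet subroutine is made up for this example):

```perl6
sub greet(:$name, :$loud) {
    my $msg = "Hello, $name!";
    say $loud ?? $msg.uc !! $msg;
}

my $name = 'Rakudo';
greet(:$name);           # same as greet(name => $name): Hello, Rakudo!
greet(:$name, :loud);    # :loud is short for loud => True: HELLO, RAKUDO!
```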

Looking the Other Way

Now that the conversion from timestamps to dates and times works fine, let's look in the other direction. Our small tool needs to parse the input, and decide whether the input is a timestamp, or a date and optionally a time.

The boring way would be to use a conditional:

sub MAIN($input) {
    if $input ~~ / ^ \d+ $ / {
        # convert from timestamp to date/datetime
    }
    else {
        # convert from date to timestamp

    }
}

But I hate boring, so I want to look at a more exciting (and extensible) approach.

Perl 6 supports multiple dispatch. That means you can have multiple subroutines with the same name, but different signatures. And Perl 6 automatically decides which one to call. You have to explicitly enable this feature by writing multi sub instead of sub, so that Perl 6 can catch accidental redeclaration for you.

Let's see it in action:

#!/usr/bin/env perl6

multi sub MAIN(Int $timestamp) {
    sub formatter($o) {
        sprintf '%04d-%02d-%02d %02d:%02d:%02d',
                $o.year, $o.month,  $o.day,
                $o.hour, $o.minute, $o.second,
    }
    my $dt = DateTime.new(+$timestamp, :&formatter);
    if $dt.Date.DateTime == $dt {
        say $dt.Date;
    }
    else {
        say $dt.Str;
    }
}


multi sub MAIN(Str $date) {
    say Date.new($date).DateTime.posix
}

Let's see it in action:

$ ./autotime 2015-12-24
1450915200
$ ./autotime 1450915200
Ambiguous call to 'MAIN'; these signatures all match:
:(Int $timestamp)
:(Str $date)
  in block <unit> at ./autotime line 17

Not quite what I had envisioned. The problem is again that the integer argument is converted automatically to IntStr, and both the Int and the Str multi (or candidate) accept that as an argument.

The easiest approach to avoiding this error is narrowing down the kinds of strings that the Str candidate accepts. The classical approach would be to have a regex that roughly validates the incoming argument:

multi sub MAIN(Str $date where /^ \d+ \- \d+ \- \d+ $ /) {
    say Date.new($date).DateTime.posix
}

And indeed it works, but why duplicate the logic that Date.new already has for validating date strings? If you pass a string argument that doesn't look like a date, you get such an error:

Invalid Date string 'foobar'; use yyyy-mm-dd instead

We can use this behavior in constraining the string parameter of the MAIN multi candidate:

multi sub MAIN(Str $date where { try Date.new($_) }) {
    say Date.new($date).DateTime.posix
}

The additional try in here is because subtype constraints behind a where are not supposed to throw an exception, just return a false value.
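
You can see the effect in a small standalone example (date-only is made up for illustration). Without the try, a malformed date string would make the where block die; with it, the candidate simply fails to match:

```perl6
sub date-only(Str $d where { try Date.new($_) }) {
    say "Looks like a date: $d";
}

date-only('2015-12-24');   # Looks like a date: 2015-12-24
# date-only('foobar');     # would fail the constraint check,
                           # instead of raising the Date error
```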

And now it works as intended:

$ ./autotime 2015-12-24
1450915200
$ ./autotime 1450915200
2015-12-24

Dealing With Time

The only feature left to implement is conversion of date and time to a timestamp. In other words, we want to handle calls like autotime 2015-12-24 11:23:00:

multi sub MAIN(Str $date where { try Date.new($_) }, Str $time?) {
    my $d = Date.new($date);
    if $time {
        my ( $hour, $minute, $second ) = $time.split(':');
        say DateTime.new(date => $d, :$hour, :$minute, :$second).posix;
    }
    else {
        say $d.DateTime.posix;
    }
}

The new second argument is optional by virtue of the trailing ?. If it is present, we split the time string on the colon to get hour, minute and second. My first instinct while writing this code was to use shorter variable names, my ($h, $m, $s) = $time.split(':'), but then the call to the DateTime constructor would have looked like this:

DateTime.new(date => $d, hour => $h, minute => $m, second => $s);

So the named arguments to the constructor made me choose more self-explanatory variable names.

So, this works:

$ ./autotime 2015-12-24 11:23:00
1450956180

And we can check that it round-trips:

$ ./autotime 1450956180
2015-12-24 11:23:00

Tighten Your Seat Belt

Now that the program is feature complete, we should strive to remove some clutter, and explore a few more awesome Perl 6 features.

The first feature that I want to exploit is that of an implicit variable or topic. A quick demonstration:

for 1..3 {
    .say
}

produces the output

1
2
3

There is no explicit iteration variable, so Perl implicitly binds the current value of the loop to a variable called $_. The method call .say is a shortcut for $_.say. And since our formatter subroutine calls six methods on the same variable, binding that variable to $_ makes for a nice visual optimization:

sub formatter($_) {
    sprintf '%04d-%02d-%02d %02d:%02d:%02d',
            .year, .month,  .day,
            .hour, .minute, .second,
}

If you want to set $_ in a lexical scope without resorting to a function definition, you can use the given VALUE BLOCK construct:

given DateTime.new(+$timestamp, :&formatter) {
    if .Date.DateTime == $_ {
        say .Date;
    }
    else {
        .say;
    }
}

And Perl 6 also offers a shortcut for conditionals on the $_ variable, which can be used as a generalized switch statement:

given DateTime.new(+$timestamp, :&formatter) {
    when .Date.DateTime == $_ { say .Date }
    default { .say }
}

If you have a read-only variable or parameter, you can do without the $ sigil, though you have to use a backslash at declaration time:

multi sub MAIN(Int \timestamp) {
    ...
    given DateTime.new(+timestamp, :&formatter) {
    ...
    }
}

So now the full code looks like this:

#!/usr/bin/env perl6

multi sub MAIN(Int \timestamp) {
    sub formatter($_) {
        sprintf '%04d-%02d-%02d %02d:%02d:%02d',
                .year, .month,  .day,
                .hour, .minute, .second,
    }
    given DateTime.new(+timestamp, :&formatter) {
        when .Date.DateTime == $_ { say .Date }
        default { .say }
    }
}

multi sub MAIN(Str $date where { try Date.new($_) }, Str $time?) {
    my $d = Date.new($date);
    if $time {
        my ( $hour, $minute, $second ) = $time.split(':');
        say DateTime.new(date => $d, :$hour, :$minute, :$second).posix;
    }
    else {
        say $d.DateTime.posix;
    }
}

MAIN magic

The magic that calls sub MAIN for us also provides us with an automagic usage message if we call it with arguments that don't fit any multi, for example with no arguments at all:

$ ./autotime
Usage:
  ./autotime <timestamp>
  ./autotime <date> [<time>]

We can add a short description to these usage lines by adding semantic comments before the MAIN subs:

#!/usr/bin/env perl6

#| Convert timestamp to ISO date
multi sub MAIN(Int \timestamp) {
    ...
}

#| Convert ISO date to timestamp
multi sub MAIN(Str $date where { try Date.new($_) }, Str $time?) {
    ...
}

Now the usage message becomes:

$ ./autotime
Usage:
  ./autotime <timestamp> -- Convert timestamp to ISO date
  ./autotime <date> [<time>] -- Convert ISO date to timestamp

Summary

We've seen a bit of Date and DateTime arithmetic, but the exciting part is multiple dispatch, named arguments, subtype constraints with where clauses, given/when and the implicit $_ variable, and some serious magic when it comes to MAIN subs.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 04 Dec 2016

Perl 6 By Example: Formatting a Sudoku Puzzle


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


As a gentle introduction to Perl 6, let's consider a small task that I recently encountered while pursuing one of my hobbies.

Sudoku is a number-placement puzzle played on a grid of 9x9 cells, subdivided into blocks of 3x3. Some of the cells are filled out with numbers from 1 to 9, some are empty. The objective of the game is to fill out the empty cells so that in each row, column and 3x3 block, each digit from 1 to 9 occurs exactly once.

An efficient storage format for a Sudoku is simply a string of 81 characters, with 0 for empty cells and the digits 1 to 9 for pre-filled cells. The task I want to solve is to bring this into a friendlier format.

The input could be:

000000075000080094000500600010000200000900057006003040001000023080000006063240000

On to our first Perl 6 program:

# file sudoku.p6
use v6;
my $sudoku = '000000075000080094000500600010000200000900057006003040001000023080000006063240000';
for 0..8 -> $line-number {
    say substr $sudoku, $line-number * 9, 9;
}

You can run it like this:

$ perl6 sudoku.p6
000000075
000080094
000500600
010000200
000900057
006003040
001000023
080000006
063240000

There's not much magic in there, but let's go through the code one line at a time.

The first line, starting with a #, is a comment that extends to the end of the line.

use v6;

This line is not strictly necessary, but good practice anyway. It declares the Perl version you are using, here v6, so any version of the Perl 6 language. We could be more specific and say use v6.c; to require exactly the version discussed here. If you ever accidentally run a Perl 6 program through Perl 5, you'll be glad you included this line, because it'll tell you:

$ perl sudoku.p6
Perl v6.0.0 required--this is only v5.22.1, stopped at sudoku.p6 line 1.
BEGIN failed--compilation aborted at sudoku.p6 line 1.

instead of the much more cryptic

syntax error at sudoku.p6 line 4, near "for 0"
Execution of sudoku.p6 aborted due to compilation errors.

The first interesting line is

my $sudoku = '00000007500...';

my declares a lexical variable. It is visible from the point of the declaration to the end of the current scope, which means either to the end of the current block delimited by curly braces, or to the end of the file if it's outside any block. As it is in this example.

Variables start with a sigil, here a '$'. Sigils are what gave Perl the reputation of being line noise, but there is signal in the noise. The $ looks like an S, which stands for scalar. If you know some math, you know that a scalar is just a single value, as opposed to a vector or even a matrix.

The variable doesn't start its life empty, because there's an initialization right next to it. The value it starts with is a string literal, as indicated by the quotes.

Note that there is no need to declare the type of the variable beyond the very vague "it's a scalar" implied by the sigil. If we wanted, we could add a type constraint:

my Str $sudoku = '00000007500...';

But when quickly prototyping, I tend to forego type constraints, because I often don't know yet how exactly the code will work out.

The actual logic happens in the next lines, by iterating over the line numbers 0 to 8:

for 0..8 -> $line-number {
    ...
}

The for loop has the general structure for ITERABLE BLOCK. Here the iterable is a range, and the block is a pointy block. The block starts with ->, which introduces a signature. The signature tells the compiler what arguments the blocks expects, here a single scalar called $line-number.

Perl 6 allows you to use a dash - or a single quote ' to join multiple simple identifiers into a larger identifier. That means you can use them inside an identifier as long as the character that follows is a letter or an underscore.

Again, type constraints are optional. If you choose to include them, it would be for 0..8 -> Int $line-number { ... }.

$line-number is again a lexical variable, and visible inside the block that comes after the signature. Blocks are delimited by curly braces.

say substr $sudoku, $line-number * 9, 9;

Both say and substr are functions provided by the Perl 6 standard library. substr($string, $start, $chars) extracts a substring of (up to) $chars characters length from $string, starting from index $start. Oh, and indexes are zero-based in Perl 6.

say then prints this substring, followed by a line break.
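
A few quick calls show how the indexing works (the string here is arbitrary):

```perl6
say substr 'abcdefghi', 0, 3;   # abc -- indexes start at 0
say substr 'abcdefghi', 3, 3;   # def
say substr 'abcdefghi', 6, 9;   # ghi -- only three characters remain
```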

As you can see from the example, function invocations don't need parenthesis, though you can add them if you want:

say substr($sudoku, $line-number * 9, 9);

or even

say(substr($sudoku, $line-number * 9, 9));

Making the Sudoku playable

As the output of our script stands now, you can't play the resulting Sudoku even if you printed it, because all those pesky zeros get in your way of actually entering the numbers you carefully deduce while solving the puzzle.

So, let's substitute each 0 with a blank:

# file sudoku.p6
use v6;

my $sudoku = '000000075000080094000500600010000200000900057006003040001000023080000006063240000';
$sudoku = $sudoku.trans('0' => ' ');

for 0..8 -> $line-number {
    say substr $sudoku, $line-number * 9, 9;
}

trans is a method of the Str class. Its argument is a Pair. The boring way to create a Pair would be Pair.new('0', ' '), but since it's so commonly used, there is a shortcut in the form of the fat arrow, =>. The method trans replaces each occurrence of the pair's key with the pair's value, and returns the resulting string.

Speaking of shortcuts, you can also shorten $sudoku = $sudoku.trans(...) to $sudoku.=trans(...). This is a general pattern that turns methods that return a result into mutators.
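
A quick illustration of the .= pattern, using a throwaway variable:

```perl6
my $word = 'hello';
$word .= uc;                 # same as $word = $word.uc
say $word;                   # HELLO
$word .= trans('L' => '*');  # works with trans, too
say $word;                   # HE**O
```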

With the new string substitution, the result is playable, but ugly:

$ perl6 sudoku.p6
       75
    8  94
   5  6  
 1    2  
   9   57
  6  3 4 
  1    23
 8      6
 6324    

A bit of ASCII art makes it bearable:

+---+---+---+
|   | 1 |   |
|   |   |79 |
| 9 |   | 4 |
+---+---+---+
|   |  4|  5|
|   |   | 2 |
|3  | 29|18 |
+---+---+---+
|  4| 87|2  |
|  7|  2|95 |
| 5 |  3|  8|
+---+---+---+

To get the vertical dividing lines, we need to sub-divide the lines into smaller chunks. And since we already have one occurrence of dividing a string into smaller strings of a fixed size, it's time to encapsulate it into a function:

sub chunks(Str $s, Int $chars) {
    gather for 0 .. $s.chars / $chars - 1 -> $idx {
        take substr($s, $idx * $chars, $chars);
    }
}

for chunks($sudoku, 9) -> $line {
    say chunks($line, 3).join('|');
}

The output is:

$ perl6 sudoku.p6
   |   | 75
   | 8 | 94
   |5  |6  
 1 |   |2  
   |9  | 57
  6|  3| 4 
  1|   | 23
 8 |   |  6
 63|24 |   

But how did it work? Well, sub (SIGNATURE) BLOCK declares a subroutine, or sub for short. Here I declare it to take two arguments, and since I tend to confuse the order of arguments to functions I call, I've added type constraints that make it very likely that Perl 6 catches the error for me.

gather and take work together to create a list. gather is the entry point, and each execution of take adds one element to the list. So

gather {
    take 1;
    take 2;
}

would return the list 1, 2. Here gather acts as a statement prefix, which means it collects all takes from within the for loop.

A subroutine returns the value from the last expression, which here is the gather for ... thing discussed above.

Coming back to the program, the for-loop now looks like this:

for chunks($sudoku, 9) -> $line {
    say chunks($line, 3).join('|');
}

So first the program chops up the full Sudoku string into lines of nine characters, and then for each line, again into a list of three strings of three characters length. The join method turns it back into a string, but with pipe symbols inserted between the chunks.

There are still vertical bars missing at the start and end of the line, which can easily be hard-coded by changing the last line:

    say '|', chunks($line, 3).join('|'), '|';

Now the output is

|   |   | 75|
|   | 8 | 94|
|   |5  |6  |
| 1 |   |2  |
|   |9  | 57|
|  6|  3| 4 |
|  1|   | 23|
| 8 |   |  6|
| 63|24 |   |

Only the horizontal lines are missing, which aren't too hard to add:

my $separator = '+---+---+---+';
my $index = 0;
for chunks($sudoku, 9) -> $line {
    if $index++ %% 3 {
        say $separator;
    }
    say '|', chunks($line, 3).join('|'), '|';
}
say $separator;

Et voila:

+---+---+---+
|   |   | 75|
|   | 8 | 94|
|   |5  |6  |
+---+---+---+
| 1 |   |2  |
|   |9  | 57|
|  6|  3| 4 |
+---+---+---+
|  1|   | 23|
| 8 |   |  6|
| 63|24 |   |
+---+---+---+

There are two new aspects here: the if conditional, which structurally very much resembles the for loop. The second new aspect is the divisibility operator, %%. From other programming languages you probably know % for modulo, but since $number % $divisor == 0 is such a common pattern, $number %% $divisor is Perl 6's shortcut for it.
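
A few one-liners contrast the two operators:

```perl6
say 9  %% 3;    # True
say 10 %% 3;    # False
say 10 %  3;    # 1 -- plain modulo
say 'divisible' if 12 %% 4;   # divisible
```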

Shortcuts, Constants, and more Shortcuts

Perl 6 is modeled after human languages, which have some kind of compression scheme built in, where commonly used words tend to be short, and common constructs have shortcuts.

As such, there are lots of ways to write the code more succinctly. The first is basically cheating, because the sub chunks can be replaced by a built-in method in the Str class, comb:

# file sudoku.p6
use v6;

my $sudoku = '000000075000080094000500600010000200000900057006003040001000023080000006063240000';
$sudoku = $sudoku.trans: '0' => ' ';

my $separator = '+---+---+---+';
my $index = 0;
for $sudoku.comb(9) -> $line {
    if $index++ %% 3 {
        say $separator;
    }
    say '|', $line.comb(3).join('|'), '|';
}
say $separator;

The if conditional can be applied as a statement postfix:

say $separator if $index++ %% 3;

Except for the initialization, the variable $index is used only once, so there's no need to give it a name. Yes, Perl 6 has anonymous variables:

my $separator = '+---+---+---+';
for $sudoku.comb(9) -> $line {
    say $separator if $++ %% 3;
    say '|', $line.comb(3).join('|'), '|';
}
say $separator;

Since $separator is a constant, we can declare it as one:

constant $separator = '+---+---+---+';

If you want to reduce the line noise factor, you can also forego the sigil, so constant separator = '...'.

Finally there is a another syntax for method calls with arguments: instead of $obj.method(args) you can say $obj.method: args, which brings us to the idiomatic form of the small Sudoku formatter:

# file sudoku.p6
use v6;

my $sudoku = '000000075000080094000500600010000200000900057006003040001000023080000006063240000';
$sudoku = $sudoku.trans: '0' => ' ';

constant separator = '+---+---+---+';
for $sudoku.comb(9) -> $line {
    say separator if $++ %% 3;
    say '|', $line.comb(3).join('|'), '|';
}
say separator;

IO and other Tragedies

A practical script doesn't contain its input as a hard-coded string literal, but reads it from the command line, standard input or a file.

If you want to read the Sudoku from the command line, you can declare a subroutine called MAIN, which gets all command line arguments passed in:

# file sudoku.p6
use v6;

constant separator = '+---+---+---+';

sub MAIN($sudoku) {
    my $substituted = $sudoku.trans: '0' => ' ';

    for $substituted.comb(9) -> $line {
        say separator if $++ %% 3;
        say '|', $line.comb(3).join('|'), '|';
    }
    say separator;
}

This is how it's called:

$ perl6-m sudoku-format-08.p6 000000075000080094000500600010000200000900057006003040001000023080000006063240000
+---+---+---+
|   |   | 75|
|   | 8 | 94|
|   |5  |6  |
+---+---+---+
| 1 |   |2  |
|   |9  | 57|
|  6|  3| 4 |
+---+---+---+
|  1|   | 23|
| 8 |   |  6|
| 63|24 |   |
+---+---+---+

And you even get a usage message for free if you use it wrongly, for example by omitting the argument:

$ perl6-m sudoku.p6 
Usage:
  sudoku.p6 <sudoku> 

You might have noticed that the last example uses a separate variable for the substituted Sudoku string. This is because function parameters (aka variables declared in a signature) are read-only by default. Instead of creating a new variable, I could have also written sub MAIN($sudoku is copy) { ... }.

Classic UNIX programs, such as cat and wc, follow the convention of reading their input from file names given on the command line, or from the standard input if no file names are given on the command line.

If you want your program to follow this convention, lines() provides a stream of lines from either of these sources:

# file sudoku.p6
use v6;

constant separator = '+---+---+---+';

for lines() -> $sudoku {
    my $substituted = $sudoku.trans: '0' => ' ';

    for $substituted.comb(9) -> $line {
        say separator if $++ %% 3;
        say '|', $line.comb(3).join('|'), '|';
    }
    say separator;
}

Get Creative!

You won't learn a programming language from reading a blog, you have to actually use it, tinker with it. If you want to expand on the examples discussed earlier, I'd encourage you to try to produce Sudokus in different output formats.

SVG offers a good ratio of result to effort. This is the rough skeleton of an SVG file for a Sudoku:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="304" height="304" version="1.1"
xmlns="http://www.w3.org/2000/svg">
    <line x1="0" x2="300" y1="33.3333" y2="33.3333" style="stroke:grey" />
    <line x1="0" x2="300" y1="66.6667" y2="66.6667" style="stroke:grey" />
    <line x1="0" x2="303" y1="100" y2="100" style="stroke:black;stroke-width:2" />
    <line x1="0" x2="300" y1="133.333" y2="133.333" style="stroke:grey" />
    <!-- more horizontal lines here -->

    <line y1="0" y2="300" x1="33.3333" x2="33.3333" style="stroke:grey" />
    <!-- more vertical lines here -->


    <text x="43.7333" y="124.5"> 1 </text>
    <text x="43.7333" y="257.833"> 8 </text>
    <!-- more cells go here -->
    <rect width="304" height="304" style="fill:none;stroke-width:1;stroke:black;stroke-width:6"/>
</svg>

If you have a Firefox or Chrome browser, you can use it to open the SVG file.

If you are adventurous, you could also write a Perl 6 program that renders the Sudoku as a PostScript (PS) or Encapsulated PostScript (EPS) document. It's also a text-based format.

Subscribe to the Perl 6 book mailing list

* indicates required

[/perl-6] Permanent link

Sun, 27 Nov 2016

Perl 6 By Example: Running Rakudo


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Before we start exploring Perl 6, you should have an environment where you can run Perl 6 code. So you need to install Rakudo Perl 6, currently the only actively developed Perl 6 compiler. Or even better, install Rakudo Star, which is a distribution that includes Rakudo itself, a few useful modules, and an installer that can help you install more modules.

Below, a few options for installing Rakudo Star are discussed. Choose whichever works for you.

The examples here use Rakudo Star 2016.10.

Installers

You can download installers from http://rakudo.org/downloads/star/ for Mac OS (.dmg) and Windows (.msi). After download, you can launch them, and they walk you through the installation process.

Note that Rakudo is not relocatable, which means you have to install to a fixed location that was decided by the creator of the installer. Moving the installation to a different directory afterwards will break it.

On Windows, the installer offers to add C:\rakudo\bin and C:\rakudo\share\perl6\site\bin to your PATH environment variable. You should choose that option, as it allows you to call rakudo (and programs that the module installer installs on your behalf) without specifying full paths.

Docker

On platforms that support Docker, you can pull an existing container image from Docker Hub:

$ docker pull rakudo-star

Then you can get an interactive Rakudo shell with this command:

$ docker run -it rakudo-star perl6

But that won't work for executing scripts, because the container has its own, separate file system. To make scripts available inside the container, you need to tell Docker to make the current directory available to the container:

$ docker run -v $PWD:/perl6 -w /perl6 -it rakudo-star perl6

The option -v $PWD:/perl6 instructs Docker to mount the current working directory ($PWD) into the container, where it'll be available as /perl6. To make relative paths work, -w /perl6 instructs Docker to set the working directory of the rakudo process to /perl6.

Since this command line starts to get unwieldy, I created an alias (this is Bash syntax; other shells might have slightly different alias mechanisms):

alias p6d='docker run -v $PWD:/perl6 -w /perl6 -it rakudo-star perl6'

I put this line into my ~/.bashrc file, so new bash instances have a p6d command, short for "Perl 6 docker".

As a short test to see if it works, you can run

$ p6d -e 'say "hi"'
hi

If you go the Docker route, just use the p6d alias instead of perl6 to run scripts.

Building from Source

To build Rakudo Star from source, you need make, gcc or clang, and Perl 5 installed. This example installs into $HOME/opt/rakudo-star:

$ wget http://rakudo.org/downloads/star/rakudo-star-2016.10.tar.gz
$ tar xzf rakudo-star-2016.10.tar.gz
$ cd rakudo-star-2016.10/
$ perl Configure.pl --prefix=$HOME/opt/rakudo-star --gen-moar
$ make install

You should have about 2GB of RAM available for the last step; building a compiler is a resource-intensive task.

You need to add paths to two directories to your PATH environment variable, one for Rakudo itself, one for programs installed by the module installer:

PATH=$PATH:$HOME/opt/rakudo-star/bin/:$HOME/opt/rakudo-star/share/perl6/site/bin

If you are a Bash user, you can put that line into your ~/.bashrc file to make it available in new Bash processes.
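If you source your ~/.bashrc repeatedly, a plain PATH=$PATH:... line appends the same directories over and over. A slightly more careful variant guards against that; this is a sketch, and the helper name append_path is made up for illustration:

```shell
# Append a directory to PATH only when it is not already there, so
# sourcing ~/.bashrc multiple times does not grow PATH without bound.
append_path() {
    case ":$PATH:" in
        *":$1:"*) ;;             # already present: do nothing
        *) PATH="$PATH:$1" ;;    # otherwise append
    esac
}
append_path "$HOME/opt/rakudo-star/bin"
append_path "$HOME/opt/rakudo-star/share/perl6/site/bin"
export PATH
```

The case statement checks for the directory between two colons, so partial matches against longer paths don't count.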

Testing your Rakudo Star Installation

You should now be able to run Perl 6 programs from the command line, and ask Rakudo for its version:

$ perl6 --version
This is Rakudo version 2016.10-2-gb744de3 built on MoarVM version 2016.09-39-g688796b
implementing Perl 6.c.

$ perl6 -e "say <hi>"
hi

If, against all odds, all of these approaches have failed to produce a usable Rakudo installation, you should describe your problem to the friendly Perl 6 community, which can usually provide some help. http://perl6.org/community/ describes ways to interact with the community.

Next week we'll take a look at the first proper Perl 6 example, so stay tuned for updates!

Subscribe to the Perl 6 book mailing list


[/perl-6] Permanent link

Sun, 20 Nov 2016

What is Perl 6?


Permanent link

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Perl 6 is a programming language. It is designed to be easily learned, read and written by humans, and it is inspired by natural language. It allows the beginner to write in "baby Perl", while giving the experienced programmer freedom of expression, from concise to poetic.

Perl 6 is gradually typed. It mostly follows the paradigm of dynamically typed languages in that it accepts programs whose type safety it can't guarantee during compilation. Unlike many dynamic languages, it accepts and enforces type constraints. Where possible, the compiler uses type annotations to make decisions at compile time that would otherwise only be possible at run time.
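As a small illustration of what such a type constraint buys you (a sketch; the subroutine name is made up for this example):

```perl6
sub half(Int $x) { $x div 2 }

say half(42);        # fine: 21
say half("pizza");   # fails the Int type check instead of silently misbehaving
```

Leaving the parameter untyped would also work; adding the constraint later is exactly the gradual part.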

Many programming paradigms have influenced Perl 6. You can write imperative, object-oriented and functional programs in Perl 6. Declarative programming is supported through the regex and grammar engine.

Most lookups in Perl 6 are lexical, and the language avoids global state. This makes parallel and concurrent execution of programs easier, as does Perl 6's focus on high-level concurrency primitives. Instead of threads and locks, you tend to think about promises and message queues when you don't want to be limited to one CPU core.

Perl 6 as a language is not opinionated about whether Perl 6 programs should be compiled or interpreted. Rakudo Perl 6, the main implementation, precompiles modules on the fly, and interprets scripts.

Perl 5, the Older Sister

Around the year 2000, Perl 5 development faced major strain from the conflicting desires to evolve and to keep backwards compatibility.

Perl 6 was the valve to release this tension. All the extension proposals that required a break in backwards compatibility were channeled into Perl 6, leaving it in a dreamlike state where everything was possible and nothing was fixed. It took several years of hard work to get into a more solid state.

During this time, Perl 5 also evolved, and the two languages are now different enough that most Perl 5 developers no longer consider Perl 6 a natural upgrade path. Perl 6 does not try to obsolete Perl 5 (at least not more than it tries to obsolete any other programming language :-), and the first stable release of Perl 6 in 2015 does not indicate any lapse in support for Perl 5.

Library Availability

Being a relatively young language, Perl 6 lacks the mature module ecosystem that languages such as Perl 5 and Python provide.

To bridge this gap, interfaces exist that allow you to call into libraries written in C, Python, Perl 5 and Ruby. The Perl 5 and Python interfaces are sophisticated enough that you can write a Perl 6 class that subclasses one written in either language, and the other way around.

So if you like a particular Python library, for example, you can simply load it into your Perl 6 program through the Inline::Python module.

Why Should I Use Perl 6?

If you like the quick prototyping experience from dynamically typed programming languages, but you also want enough safety features to build big, reliable applications, Perl 6 is a good fit for you. Its gradual typing allows you to write code without having a full picture of the types involved, and later introduce type constraints to guard against future misuse of your internal and external APIs.

Perl has a long history of making text processing via regular expressions (regexes) very easy, but more complicated regexes have acquired a reputation of being hard to read and maintain. Perl 6 solves this by putting regexes on the same level as code, allowing you to name them like subroutines, and even to use object-oriented features such as class inheritance and role composition to manage code and regex reuse. The resulting grammars are very powerful and easy to read. In fact, the Rakudo Perl 6 compiler parses Perl 6 source code with a Perl 6 grammar!
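To sketch what named regexes look like in practice, here is a tiny grammar for signed integers; the grammar and token names are illustrative, not taken from the book:

```perl6
grammar Integer {
    token TOP    { <sign>? <digits> }
    token sign   { '+' | '-' }
    token digits { \d+ }
}

say Integer.parse('-42');   # a match object with $<sign> and $<digits> captures
```

Each token reads like a small, named subroutine, and the grammar composes them into a parser.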

Speaking of text, Perl 6 has amazing Unicode support. If you ask your user for a number, and they enter it with digits that don't happen to be the Arabic digits from the ASCII range, Perl 6 still has you covered. And if you deal with graphemes that cannot be expressed as a single Unicode code point, Perl 6 still presents them as single characters.
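Both claims are easy to check in the Rakudo REPL; the output comments assume Rakudo's usual grapheme-based string handling:

```perl6
say +"٤٢";    # Arabic-Indic digits still numify to 42

# One grapheme, even though it consists of two code points:
say "e\c[COMBINING ACUTE ACCENT]".chars;   # 1
```

The .chars method counts graphemes, not code points, which is usually what a human would call "characters".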

There are more technical benefits that I could list, but more importantly, the language is designed to be fun to use. An important aspect of that is good error messages. Have you ever been annoyed at Python for typically giving just SyntaxError: invalid syntax when something's wrong? This error could come from forgetting a closing parenthesis, for example. In this case, a Perl 6 compiler says

Unable to parse expression in argument list; couldn't find final ')'

which actually tells you what's wrong. But this is just the tip of the iceberg. The compiler catches common mistakes and points out possible solutions, and even suggests fixes for spelling mistakes.

Finally, Perl 6 gives you the freedom to express your problem domain and solution in different ways and with different programming paradigms. And if the options provided by the core language are not enough, Perl 6 is designed with extensibility in mind, allowing you to introduce both new semantics for object oriented code and new syntax.

Subscribe to the Perl 6 book mailing list


[/perl-6] Permanent link