Thu, 12 Aug 2010

What is "Modern Perl"?


Permanent link

These days you often hear term Modern Perl, as something new(ish), and much improved over the old ways.

But what is it exactly? Well, there's no proper definition, but here is what that term means to me:

It's a set of tools, ideas and attitudes that help you to write better Perl programs, and allows you to have more fun while writing them.

Here are some aspects of Modern Perl

  • Testing: Most modern Perl modules have extensive test suites, that make development sane and robust
  • Some built-ins now come with safer forms: the three-argument form of open() allows you to open files safely with arbitrary characters in them, without any extra precautions. Lexical file handles make things safer and easier too.
  • use strict; use warnings;
  • Proper OO: with Perl 6 and with Moose in Perl 5, we have good object systems, that require less low-level fiddling than the standard Perl 5 object system
  • Following Best Practices
  • (For open source projects) Liberally handing out commit privileges. The source is stored in a version control system anyway, so low-quality changes or vandalism can simply be reverted (but that doesn't happen often in practice).
  • Caring about marketing: do tell people that you built something cool and useful
  • Small handy modules such as List::Util and Try::Tiny
  • Development tools such as Devel::Cover and Devel::NYTProf
  • (update) perlbrew and local::lib to help maintain your own perl installation and locally installed modules.

All of these techniques help to write scalable Perl programs by making proper encapsulation much easier, or by avoiding common errors, identifying performance bottlenecks etc.

Update: after watching some discussions about this post in various media, I should add a few more tools that I forgot about earlier:

[/perl-tips] Permanent link

comments / trackbacks

Tue, 23 Mar 2010

In search of an exponential time regex


Permanent link

Somebody asked on IRC if it was dangerous to interpolate a user-supplied string into a regular expression in Perl. Since by defaul the re 'eval' pragma is disabled, there is no direct danger of malicious code execution.

But still there is danger: since the Perl regex engine does backtracking, it is possible to craft malicious regexes that take up much time. Very much time. So much time that if you don't do anything against it, it will eat up all your CPU - a denial of service attack.

Such a thing is best illustrated with an example - and I wanted one to demonstrate exponential matching time. Easy, you'd think; the literature has one: match the regex /(a*)*b/ against the string "aaaa...a". It fails, but first tries to do search for many possible combinations of how many a's to much with the first capture. You'd think. A quick test reveals that this is not the case:

$ time perl -e ' ("a" x shift)  ~~ /(a*)*b/' 100000000

real    0m0.286s
user    0m0.194s
sys     0m0.093s

This constructs a string of 10 million a's, and then matches the regex against it - and it takes less than the third of a second. That's quite fast. But why? The answer is that Perl does some clever optimizations. You can learn a bit about the applied optimizations with the re 'debug'; pragma:

$ time perl -Mre=debug -e ' ("a" x shift)  ~~ /(a*)*b/' 100000000
Compiling REx "(a*)*b"
synthetic stclass "ANYOF[ab]".
Final program:
   1: CURLYX[0] {0,32767} (11)
   3:   OPEN1 (5)
   5:     STAR (8)
   6:       EXACT <a> (0)
   8:   CLOSE1 (10)
  10: WHILEM[1/1] (0)
  11: NOTHING (12)
  12: EXACT <b> (14)
  14: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF[ab] minlen 1 
Guessing start of match in sv for REx "(a*)*b" against "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...
Did not find floating substr "b"...
Match rejected by optimizer
Freeing REx: "(a*)*b"

real    0m0.361s
user    0m0.223s
sys     0m0.101s

This means the regex engine has found a literal b in the regex, and is clever enough to search for that in the string. Which is quite fast, it just has to look at each character once. It doesn't find it, so it knows that it can short-circuit all this expensive backtracking.

So how to fool the optimizer? By not using a literal string, but something that's not so easy to detect -- a character class: /(a*)*[bc]/.

This fools the optimizer, and the regex runs slower - half a second for one million characters. Still not very impressive.

What's up here? If you look at the debugging output, and try to read the structure of the compiled regex, you'll find that it actually corresponds to /(a*)[bc]/. The optimizer has dropped the second star, because it prefers the longest possible match with the star, so never needs to quantify the whole group more than once anyway.

So again I had to fool the optimizer somehow: for example by requiring at least two matches of the first group: (a*){2,}[bc]. However the experiment shows that it still didn't backtrack exponentially.

The next trick finally worked: also give it an upper limit. That defies the optimizer for some reasons not quite transparent to me:

$ time perl -e ' ("a" x shift)  ~~ /(.*){1,32000}[bc]/' 25

real    0m8.856s
user    0m8.841s
sys     0m0.008s

That's nearly 9 seconds for a 25 character string, extending the string length to 27 already takes 35 seconds, 30 characters took 276 seconds. I did not have the patience to try it with any higher numbers - but it's clearly growing fast enough to satisfy my curiosity.

So after some fiddling I did find a malicious regex, but it was not quite a trivial task.

[/perl-tips] Permanent link

comments / trackbacks

Tue, 16 Jun 2009

Data driven programming


Permanent link

The other days somebody asked on IRC for help with this question:

He was looking for all 9 digit numbers that didn't contain a zero digit anywhere, and the first digit should be divisible by one, the number formed from the first two digits should be divisible by 2, the number formed from the first three digits should be divisible by 3 etc.

The first such number is 1232525616, because 1 can be divided by 1, 12 can be divided by 2, 123 can be divided by 3 etc.

Since finding one isn't really a challenge, let's focus on finding them all. The easiest approach is, of course, brute force. There are 109 numbers with 9 digits, that's quite a manageable number for a modern computer. On my laptop it takes perl about 45 seconds to loop over all 109 numbers, doing nothing else.

So all you've got to do is iterate over all numbers, test if they meet the criterion stated above, and print them out if they do:

use strict;
use warnings;
sub test {
    my $z = shift;
    for (2..9) {
        return if (substr($z, 0, $_) % $_ != 0);
    }
    return 1;
}

for (my $i = 1 x 9; length($i) == 9; $i += 9) {
    next if $i =~ /0/;
    if (test($i)) {
        print $i, $/;
    }
}

This uses a small trick: the first number that doesn't contain any zero is 111111111, or short 1 x 9.

It takes 4 minutes and 20 seconds, produces the right answer, and we're happy.

But it wastes a lot of resources, and wouldn't scale for larger numbers. So for the sake of fun I tried a few different optimizations.

The first one is quite simple: the second digit must always be even, otherwise the number consisting of the first two digits could not be even. Likewise the fifths digit must be 5 or 0 to ensure that it can be divided by 5. Since 0 is forbidden anyway, it has to be 5. So let's skip the expensive check if those conditions aren't fulfilled:

for (my $i = 1 x 9; length($i) == 9; $i += 9) {
    next if $i =~ /0/;
    next if $i =~ /^.[13579]/;
    next unless $i =~ /^....5/;
    if (test($i)) {
        print $i, $/;
    }
}

This speeds up the computation to roughly a minute. Since 40 seconds are minimum for the iteration alone, it's nearly as good as it gets with this approach.

But of course there's still room for improvement: when the second digit is odd, the program iterates over a hundred million of numbers without finding one. Instead of skipping them each time, it would be much more efficient to generate the number digit by digit, checking divisibility at each step of the way.

for my $a (1..9) {
    for my $b (2, 4, 6, 8) {
        for my $c (1..9) {
            next if ($a + $b + $c) % 3;
            for my $d (2, 4, 6, 8) {
                next if (10 * $c + $d) % 4;
                my $e = 5;
                for my $f (2, 4, 6, 8) {
                    next if ($a + $b + $c + $d + $e + $f) % 3;
                    for my $g (1..9) {
                        my $so_far = "$a$b$c$d$e$f$g";
                        next if $so_far % 7;
                        for my $h (2, 4, 6, 8) {
                            next if ($so_far . $h) % 8;
                            for my $i (1..9) {
                                my $so_far = $so_far . "$h$i";
                                next if $so_far % 9;
                                print "$so_far$h\n";
                            }
                        }
                    }
                }
            }
        }
    }
}

Wow, that's ugly code, but it works and it's fast. Very fast. 27 milliseconds, or more than 2000 times faster as the previous version.

Bug it contains lots of duplicated code, and again it wouldn't scale for finding larger numbers, this time because for digit a loop needs to be written.

Whenever I find myself repeating some piece of code a few times, but in slightly different forms, I try to pack the code into a data structure instead.

For each digit position there needs to a list of digit to try, and a test that determines if the newly added digit violates its divisibility rule.

use strict;
use warnings;

my @config = (
    [[1..9],        sub { 0 }],
    [[2, 4, 6, 8],  sub { 0 }],
    [[1..9],        sub { $_[0] % 3}],
    [[2, 4, 6, 8],  sub { $_[0] % 4 }],
    [[5],           sub { 0 }],
    [[2, 4, 6, 8],  sub { $_[0] % 6 }],
    [[1..9],        sub { $_[0] % 7 }],
    [[2, 4, 6, 8],  sub { $_[0] % 8 }],
    [[1..9],        sub { $_[0] % 9 }],
);

The nested loops from the previous script can be emulated by recursion, passing the previous digits along (and the configuration for all digit positions that still need to be tested).

sub f {
    my ($so_far, $config) = @_; 
    $config = [ @$config ];
    if (!@$config) {
        print "$so_far\n";
    } else {
        my $c = shift @$config;
        for my $current (@{$c->[0]}) {
            next if $c->[1]->($so_far . $current);
            f($so_far . $current, $config);
        }
    }
}

f('', \@config);

With 46ms runtime it's still acceptably fast, and much leaner than the nested loops.

Actually a rough sketch of the configuration table can also be generated automatically, and then optimized manually:

my @config;

for my $n (1..9) {
    push @config, [
        ($n % 2 ? [1..9] : [2, 4, 6, 8]),
        sub { $_[0] % $n },
    ]
}
$config[4] = [[5], sub { 0 } ];

Now the script contains as little duplication as possible, is reasonably fast, and I'm happy.

(Just one final note: there's no scalability problem in this particular task, because it can't be extended to more than nine digits: if no zero is allowed, there won't be any numbers divisible by 10.)

[/perl-tips] Permanent link

comments / trackbacks

Tue, 09 Jun 2009

Small Perl Helpers


Permanent link

CoreList

You want to use a module, but you're only allowed to use core modules? Or you want to recommend a module to somebody, and you know it's more likely that he'll use a module if it's in core (and thus he won't hand-roll his crappy CGI parser, and open a wide door to spammers)?

Module::CoreList is the answer, and it comes with a very handy script called corelist:

$ corelist Unicode::Normalize
Unicode::Normalize was first released with perl 5.007003

$ corelist DBI
DBI was not in CORE (or so I think)

# search with regexes
$ corelist /Tie/
Pod::Simple::TiedOutFH was first released with perl 5.009003
Tie::Array was first released with perl 5.005
Tie::File was first released with perl 5.007003
Tie::Handle was first released with perl 5.00405
Tie::Hash was first released with perl 5.002
Tie::Hash::NamedCapture was first released with perl 5.009005
Tie::Memoize was first released with perl 5.007003
Tie::RefHash was first released with perl 5.004
Tie::Scalar was first released with perl 5.002
Tie::StdHandle was first released with perl 5.01
Tie::SubstrHash was first released with perl 5.002
TieHash was first released with perl 5

(empty lines sanitized; corelist emits an empty line after each module which is a bit annoying)

Timestamps

Not from CPAN, but a tiny script I wrote:

#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timelocal);


if (@ARGV) {
    my $date = shift @ARGV;
    if ($date =~ m/^(\d{4})-(\d\d)-(\d\d)$/){
        my ($year, $month, $mday) = ($1, $2, $3);
        my ($hr, $min, $sec) = (0, 0, 0);
        my $time;
        if ($time = shift(@ARGV)
            and $time =~ m/^(\d{1,2}):(\d\d)(?::(\d\d))?$/) {

            ($hr, $min, $sec) = ($1, $2, $3||0);
        }
        print timelocal($sec, $min, $hr, $mday, $month - 1, $year - 1900), $/;
    }

} else {
    print time, $/;
}

If called with no argument it will print the current time as a Unix timestamp. If called with one or two arguments, it will interpret these as a date in the YYYY-MM-DD and a time in HH:MM:SS format and print the corresponding timestamp.

[/perl-tips] Permanent link

comments / trackbacks

Tue, 29 Jul 2008

Magic ARGV (considered harmful)


Permanent link

A bunch of threads on the perl5-porters mailing list made me aware of a rather obscure perl 5 feature.

I was quite familiar with the basic form, which goes like this:

while ( <> ) {
    print;
}

This either reads from all files in turn that were given on the command line, or from STDIN if no file name was provided.

Now <> actually uses open internally, specifically the 2 argument form. Which means that anything that is valid as the second argument to open also works in @ARGV:

local @ARGV = ('fortune |');
print while <>;

Instead of trying to open a file called fortune |, it actually executes the command fortune and uses it output for the <> "diamond operator".

This feature can be used for quite some tricky and shiny stuff, but it's also dangerous. If your perl program uses <>, it is vulnerable to the command line argument rm -rf * | and similar things, also known as "arbitrary code execution".

Therefore the perl porters are discussing about disabling this feature, and making the old, magic behaviour available with a command line switch or a pragma.

For me the consequence is "don't use <> with possibly untrusted input".

[/perl-tips] Permanent link

comments / trackbacks

Sat, 26 Jul 2008

Dumping UTF-8 Data


Permanent link

The other day I wrote Perl6::Str, and a small script that I called utf8-dump helped a lot during debugging:

$ echo Überhacker | utf8-dump
\N{LATIN CAPITAL LETTER U WITH DIAERESIS}berhacker

It replaces all non-ASCII-characters with their Unicode name, in a form that can be used in Perl 5 double quoted strings if use charnames qw(:full) is loaded first.

And this is how the script looks:

#!/usr/bin/perl
use strict;
use warnings;
use charnames ();
use Encode qw(decode_utf8);

while (<>){
    $_ = decode_utf8($_);
    s{([^\0-\177])}{N_escape($1)}eg;
    print;
}

sub N_escape {
    my $n = charnames::viacode(ord($_[0]));
    return defined($n) ? "\\N{$n}" : sprintf('\x{%x}', ord($_[0]));
}

(Update 2010-04-19:) Added \x{...} escapes for characters which viacode doesn't like.

[/perl-tips] Permanent link

comments / trackbacks