Tue, 23 Mar 2010

In search of an exponential time regex

Permanent link

Somebody asked on IRC if it was dangerous to interpolate a user-supplied string into a regular expression in Perl. Since by defaul the re 'eval' pragma is disabled, there is no direct danger of malicious code execution.

But still there is danger: since the Perl regex engine does backtracking, it is possible to craft malicious regexes that take up much time. Very much time. So much time that if you don't do anything against it, it will eat up all your CPU - a denial of service attack.

Such a thing is best illustrated with an example - and I wanted one to demonstrate exponential matching time. Easy, you'd think; the literature has one: match the regex /(a*)*b/ against the string "aaaa...a". It fails, but first tries to do search for many possible combinations of how many a's to much with the first capture. You'd think. A quick test reveals that this is not the case:

$ time perl -e ' ("a" x shift)  ~~ /(a*)*b/' 100000000

real    0m0.286s
user    0m0.194s
sys     0m0.093s

This constructs a string of 10 million a's, and then matches the regex against it - and it takes less than the third of a second. That's quite fast. But why? The answer is that Perl does some clever optimizations. You can learn a bit about the applied optimizations with the re 'debug'; pragma:

$ time perl -Mre=debug -e ' ("a" x shift)  ~~ /(a*)*b/' 100000000
Compiling REx "(a*)*b"
synthetic stclass "ANYOF[ab]".
Final program:
   1: CURLYX[0] {0,32767} (11)
   3:   OPEN1 (5)
   5:     STAR (8)
   6:       EXACT <a> (0)
   8:   CLOSE1 (10)
  10: WHILEM[1/1] (0)
  11: NOTHING (12)
  12: EXACT <b> (14)
  14: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF[ab] minlen 1 
Guessing start of match in sv for REx "(a*)*b" against "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...
Did not find floating substr "b"...
Match rejected by optimizer
Freeing REx: "(a*)*b"

real    0m0.361s
user    0m0.223s
sys     0m0.101s

This means the regex engine has found a literal b in the regex, and is clever enough to search for that in the string. Which is quite fast, it just has to look at each character once. It doesn't find it, so it knows that it can short-circuit all this expensive backtracking.

So how to fool the optimizer? By not using a literal string, but something that's not so easy to detect -- a character class: /(a*)*[bc]/.

This fools the optimizer, and the regex runs slower - half a second for one million characters. Still not very impressive.

What's up here? If you look at the debugging output, and try to read the structure of the compiled regex, you'll find that it actually corresponds to /(a*)[bc]/. The optimizer has dropped the second star, because it prefers the longest possible match with the star, so never needs to quantify the whole group more than once anyway.

So again I had to fool the optimizer somehow: for example by requiring at least two matches of the first group: (a*){2,}[bc]. However the experiment shows that it still didn't backtrack exponentially.

The next trick finally worked: also give it an upper limit. That defies the optimizer for some reasons not quite transparent to me:

$ time perl -e ' ("a" x shift)  ~~ /(.*){1,32000}[bc]/' 25

real    0m8.856s
user    0m8.841s
sys     0m0.008s

That's nearly 9 seconds for a 25 character string, extending the string length to 27 already takes 35 seconds, 30 characters took 276 seconds. I did not have the patience to try it with any higher numbers - but it's clearly growing fast enough to satisfy my curiosity.

So after some fiddling I did find a malicious regex, but it was not quite a trivial task.

[/perl-tips] Permanent link