Tue, 25 Jul 2017

Writing a Systemd Supervised Service with Perl


There are many ways in which server software can fail. There are crashes, where the server process exits with an error. Program supervisors can catch those easily, and you can monitor for the presence of a server process.

But recently I had to deal with some processes that didn't just crash; they got stuck. It happens only rarely, which makes debugging harder. It involves AnyEvent, forking, the boundaries between synchronous and asynchronous code, and runloops getting stuck. I know the problem needs a much deeper solution, which will take weeks to implement.

So I needed a much faster way to at least detect the stuck service, and possibly even restart it. And even once the underlying problem is fixed, some monitoring won't hurt.

Heartbeats

The standard approach to checking the liveness of a process (or a connection) is a heartbeat. A heartbeat is a periodic action that a process performs; if the process fails to perform that action, a supervisor can pick up on that cue and take appropriate action, such as restarting the process or closing a TCP connection.

So, for a server process, what's a good heartbeat? The most basic approach is writing to a log file, or touching a file. The supervisor can then check how recently the file was last updated.
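
A minimal sketch of the file-touching variant, using only core Perl. The function names beat and is_stale are made up for this illustration, and the caller chooses the file path and the maximum age:

```perl
use strict;
use warnings;

# Worker side: record a heartbeat by updating the file's modification time.
sub beat {
    my ($file) = @_;
    # Opening in append mode creates the file if it does not exist yet.
    open my $fh, '>>', $file or die "Cannot open $file: $!";
    close $fh;
    my $now = time();
    utime $now, $now, $file or die "Cannot touch $file: $!";
    return;
}

# Supervisor side: returns 1 if the last heartbeat is older than
# $max_age seconds (or if there is no heartbeat file at all), else 0.
sub is_stale {
    my ($file, $max_age) = @_;
    my $mtime = (stat($file))[9];
    return 1 unless defined $mtime;
    return (time() - $mtime > $max_age) ? 1 : 0;
}
```

The worker would call beat() from its main loop, and a separate supervisor (a cron job, for example) would call is_stale() and restart the service when it returns true.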

Systemd and Heartbeats

Since I already used systemd for managing the service, I wanted to see if it supported heartbeats. It does, and this superuser post gives a great overview. In the context of systemd, a watchdog'ed service needs to call the sd_notify C function, which lives in the libsystemd.so library. This communicates through some mysterious, unknowable mechanism (actually just a UNIX socket) with systemd. To allow that communication channel, the systemd unit file should include the line NotifyAccess=main, which allows the main process of the server to communicate with systemd (systemd implies this setting for Type=notify units, but being explicit does no harm), or NotifyAccess=all, which allows subprocesses to also use sd_notify.

The Systemd::Daemon module gives you access to sd_notify in Perl.

A minimal Perl program that can be watchdog'ed looks like this:

#!/usr/bin/env perl
use 5.020;
use warnings;
use strict;
use Time::HiRes qw(usleep);
use Systemd::Daemon qw( -hard notify );

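# Send heartbeats twice per watchdog period; fall back to one per second
# (half of 2 seconds) if systemd did not set WATCHDOG_USEC.
my $sleep = ($ENV{WATCHDOG_USEC} // 2_000_000) / 2;
$| = 1;    # unbuffered output, so messages reach the journal immediately
notify( READY => 1 );    # tell systemd that startup is complete

while (1) {
    usleep $sleep;
    say "watchdog";
    notify( WATCHDOG => 1 );    # the heartbeat itself
}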

If you forget the READY notification, a systemctl start $service hangs (until it runs into a timeout), and systemctl status $service says Active: activating (start) since .... The normal state is Active: active (running) since.

If the service misses its heartbeat, it looks like this in the log (journalctl -u $service; timestamps and hostname stripped):

systemd[1]: testdaemon.service: Watchdog timeout (limit 10s)!
systemd[1]: testdaemon.service: Main process exited, code=dumped, status=6/ABRT
systemd[1]: testdaemon.service: Unit entered failed state.
systemd[1]: testdaemon.service: Failed with result 'core-dump'.
systemd[1]: testdaemon.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped Testdaemon.
systemd[1]: Starting Testdaemon...
systemd[1]: Started Testdaemon.

And this is the corresponding unit file:

[Unit]
Description=Testdaemon
After=syslog.target network.target

[Service]
Type=notify
NotifyAccess=main
Restart=always
WatchdogSec=10

User=moritz
Group=moritz
ExecStart=/home/moritz/testdaemon.pl

[Install]
WantedBy=multi-user.target

Relevant here are Type=notify, which makes systemd wait for the READY notification before it considers the service started, Restart=always as the restart policy, and WatchdogSec=10, which enables the watchdog with a 10-second period after which the service is restarted if no sd_notify of type WATCHDOG occurred.

Systemd makes the WatchdogSec setting available as the environment variable WATCHDOG_USEC, converted to microseconds (so multiplied by one million). If the server process aims to report heartbeats twice as often as that wait period, small timing errors should not lead to a missed heartbeat.

In my case, the WATCHDOG notification happens in an AnyEvent->timer callback, so if this doesn't happen, either the event loop got stuck, or a blocking operation prevents the event loop from running. The latter should not happen (blocking operations are meant to run in forked processes), so this adequately detects the error I want to detect.
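
A rough sketch of how such a timer could be wired up. AnyEvent timers take seconds rather than microseconds, so WATCHDOG_USEC is converted first. The names heartbeat_interval and start_heartbeat are hypothetical, the 2-second fallback mirrors the example program above, and the $notify callback is assumed to send the WATCHDOG notification (for instance through Systemd::Daemon's notify):

```perl
use strict;
use warnings;

# Convert the watchdog period (microseconds) into a heartbeat interval in
# seconds, aiming for two heartbeats per period. Falls back to a 2 second
# period if WATCHDOG_USEC is not set.
sub heartbeat_interval {
    my ($watchdog_usec) = @_;
    $watchdog_usec //= 2_000_000;
    return $watchdog_usec / 2 / 1_000_000;
}

# Install a repeating AnyEvent timer that calls $notify on every tick.
# The caller must keep the returned watcher in a variable; the timer is
# cancelled as soon as the watcher goes out of scope.
sub start_heartbeat {
    my ($notify) = @_;
    require AnyEvent;    # loaded lazily; only needed when a timer is started
    my $interval = heartbeat_interval($ENV{WATCHDOG_USEC});
    return AnyEvent->timer(
        after    => $interval,
        interval => $interval,
        cb       => sub { $notify->() },
    );
}

# Possible usage inside the server:
#   my $watcher = start_heartbeat(sub { notify( WATCHDOG => 1 ) });
```

Because the callback only runs when the event loop gets around to it, a stuck loop or a blocking call in the main process means no heartbeat, which is exactly the condition the watchdog should catch.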

For the little functionality that I use, Systemd::Daemon is a pretty heavy dependency (using XS and quite a few build dependencies). After looking at a reimplementation of the notify() protocol in Python, I wonder if talking to the socket directly would have been less work than packaging Systemd::Daemon.
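
For illustration, this is roughly what talking to the socket directly might look like with only core Perl modules. It is an untested-in-production sketch, not a replacement for Systemd::Daemon. It assumes the protocol as documented for sd_notify: systemd passes the socket path in the NOTIFY_SOCKET environment variable, a leading @ denotes a Linux abstract socket, and the message is a datagram of newline-separated KEY=value pairs. The helper name notify_systemd is hypothetical:

```perl
use strict;
use warnings;
use Socket qw(SOCK_DGRAM);
use IO::Socket::UNIX;

# Hypothetical helper: send state such as (READY => 1) or (WATCHDOG => 1)
# to systemd. Returns 1 if the datagram was sent, 0 otherwise (including
# the case where the process is not running under systemd).
sub notify_systemd {
    my (%state) = @_;
    my $path = $ENV{NOTIFY_SOCKET};
    return 0 unless defined $path && length $path;

    # A leading '@' marks an abstract socket, addressed with a NUL byte.
    $path =~ s/^@/\0/;

    # Connected datagram socket to the notification endpoint.
    my $sock = IO::Socket::UNIX->new(
        Type => SOCK_DGRAM,
        Peer => $path,
    ) or return 0;

    my $msg  = join "\n", map { "$_=$state{$_}" } sort keys %state;
    my $sent = $sock->send($msg);
    return (defined $sent && $sent == length $msg) ? 1 : 0;
}

# Possible usage, mirroring the Systemd::Daemon calls above:
#   notify_systemd( READY    => 1 );
#   notify_systemd( WATCHDOG => 1 );
```

Returning 0 instead of dying when NOTIFY_SOCKET is missing keeps the same program usable outside of systemd, for example when started from a shell during development.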

Summary

Systemd offers a heartbeat supervisor for the processes that it manages. It can automatically restart processes that fail to check in regularly via calls to sd_notify, or via the equivalent action on a socket. Perl's Systemd::Daemon module gives you access to sd_notify in a Perl server process.
