Tue, 25 Jul 2017
Writing a Systemd Supervised Service with Perl
There are many ways in which server software can fail. There are crashes, where the server process exits with an error. Program supervisors can catch those easily, and you can monitor for the presence of a server process.
But recently I had to deal with some processes that didn't just crash; they got stuck. It happens only rarely, which makes debugging harder. It involves AnyEvent, forking, the boundaries between synchronous and asynchronous code, and runloops getting stuck. I know the problem needs a much deeper solution, which will take weeks to implement.
So I needed a much faster way to at least detect the stuck service, and possibly even restart it. And even once the underlying problem is fixed, some monitoring won't hurt.
Heartbeats
The standard approach to checking the aliveness of a process (or a connection) is a heartbeat. A heartbeat is a periodic action that a process performs; if the process fails to perform that action, a supervisor can pick up on that cue and take an appropriate action, such as restarting the process or closing a TCP connection.
So, for a server process, what's a good heartbeat? The most basic approach is writing to a log file, or touching a file. The supervisor can then check how recently the file was modified.
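As a rough sketch of the file-based approach: the service touches a file on every heartbeat, and the supervisor compares the file's modification time against a maximum age. The helper names touch_heartbeat and heartbeat_is_fresh are made up for this illustration; they are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper for the service: refresh the heartbeat file's
# modification time, creating the file first if it does not exist.
sub touch_heartbeat {
    my ($file) = @_;
    open(my $fh, '>>', $file) or die "Cannot open $file: $!";
    close $fh;
    my $now = time;
    utime($now, $now, $file) or die "Cannot touch $file: $!";
    return $now;
}

# Hypothetical helper for the supervisor: returns 1 if the file was
# modified within the last $max_age seconds, 0 otherwise (including
# when the file is missing).
sub heartbeat_is_fresh {
    my ($file, $max_age, $now) = @_;
    $now //= time;
    my $mtime = (stat $file)[9];
    return 0 unless defined $mtime;
    return ($now - $mtime) <= $max_age ? 1 : 0;
}
```

The service would call touch_heartbeat from its main loop, and the supervisor would call heartbeat_is_fresh with a threshold comfortably larger than the heartbeat interval.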
Systemd and Heartbeats
Since I already used Systemd for managing the service, I wanted to see if systemd supported any heartbeats. It does, and this superuser post gives a great overview. In the context of systemd, a watchdog needs to call the sd_notify C function, which seems to live in the libsystemd.so library. This communicates through some mysterious, unknowable mechanism (actually just a UNIX socket) with systemd. To allow that communication channel, the systemd unit file must include the line NotifyAccess=main, which allows the main process of the server to communicate with systemd, or NotifyAccess=all, which allows subprocesses to also use sd_notify.
The Systemd::Daemon module gives you access to sd_notify in Perl.
A minimal Perl program that can be watchdog'ed looks like this:
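    #!/usr/bin/env perl
    use 5.020;
    use warnings;
    use strict;
    use Time::HiRes qw(usleep);
    use Systemd::Daemon qw( -hard notify );

    # Sleep for half the watchdog period (in microseconds);
    # assume a 2-second period if WATCHDOG_USEC is not set.
    my $sleep = ($ENV{WATCHDOG_USEC} // 2_000_000) / 2;

    # Unbuffered STDOUT, so output shows up in the journal right away.
    $| = 1;

    # Tell systemd that startup has finished.
    notify( READY => 1 );

    while (1) {
        usleep $sleep;
        say "watchdog";
        # The heartbeat itself.
        notify( WATCHDOG => 1 );
    }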
If you forget the READY notification, a systemctl start $service hangs (until it runs into a timeout), and systemctl status $service says "Active: activating (start) since ...". The normal state is "Active: active (running) since ...".
If the service misses its heartbeat, it looks like this in the log (journalctl -u $service; timestamps and hostname stripped):

    systemd[1]: testdaemon.service: Watchdog timeout (limit 10s)!
    systemd[1]: testdaemon.service: Main process exited, code=dumped, status=6/ABRT
    systemd[1]: testdaemon.service: Unit entered failed state.
    systemd[1]: testdaemon.service: Failed with result 'core-dump'.
    systemd[1]: testdaemon.service: Service hold-off time over, scheduling restart.
    systemd[1]: Stopped Testdaemon.
    systemd[1]: Starting Testdaemon...
    systemd[1]: Started Testdaemon.
And this is the corresponding unit file:
    [Unit]
    Description=Testdaemon
    After=syslog.target network.target

    [Service]
    Type=notify
    NotifyAccess=main
    Restart=always
    WatchdogSec=10
    User=moritz
    Group=moritz
    ExecStart=/home/moritz/testdaemon.pl

    [Install]
    WantedBy=multi-user.target
Relevant here are Type=notify, which makes systemd wait for the READY notification before it considers the service started, Restart=always as the restart policy, and WatchdogSec=10, which enables the watchdog with a 10 second period after which the service is restarted if no sd_notify of type WATCHDOG occurred.
Systemd makes the WatchdogSec setting available as the environment variable WATCHDOG_USEC, converted to microseconds (so multiplied by one million). If the server process aims to send two heartbeats per watchdog period, small timing errors should not lead to a missed heartbeat.
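To make the arithmetic concrete, here is a small sketch of how the pause between heartbeats could be derived from WATCHDOG_USEC. The helper name heartbeat_interval_usec and the 2-second fallback are assumptions for this illustration (the fallback mirrors the default in the minimal program above).

```perl
use strict;
use warnings;

# Hypothetical helper: derive the pause between heartbeats (in
# microseconds) from the watchdog limit passed in WATCHDOG_USEC.
sub heartbeat_interval_usec {
    my ($watchdog_usec, $fallback_usec) = @_;
    $fallback_usec //= 2_000_000;   # assumed period if no watchdog is configured

    # Use the fallback unless the value is a positive integer.
    my $limit = (defined $watchdog_usec && $watchdog_usec =~ /\A[1-9][0-9]*\z/)
        ? $watchdog_usec
        : $fallback_usec;

    return int($limit / 2);         # two heartbeats per watchdog period
}

# WatchdogSec=10 arrives as 10_000_000 microseconds, giving a pause of
# 5_000_000 microseconds (5 seconds) between heartbeats.
my $interval = heartbeat_interval_usec($ENV{WATCHDOG_USEC});
```

Validating the value before dividing is a defensive choice; whether it is needed depends on how much you trust the environment the service runs in.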
In my case, the WATCHDOG notification happens in an AnyEvent->timer callback, so if this doesn't happen, either the event loop got stuck, or a blocking operation prevents the event loop from running. The latter should not happen (blocking operations are meant to run in forked processes), so this adequately detects the error I want to detect.
For the little functionality that I use, Systemd::Daemon is a pretty heavy dependency (using XS and quite a few build dependencies). After looking at a reimplementation of the notify() protocol in Python, I wonder if talking to the socket directly would have been less work than packaging Systemd::Daemon.
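For illustration, talking to the socket directly could look roughly like the sketch below. It assumes the usual sd_notify convention: systemd passes the socket address in the NOTIFY_SOCKET environment variable, and each notification is a single datagram such as READY=1 or WATCHDOG=1. The function name notify_systemd is made up for this example; this is not how Systemd::Daemon is implemented, and I have not used it in production.

```perl
use strict;
use warnings;
use Socket qw(AF_UNIX SOCK_DGRAM pack_sockaddr_un);

# Hypothetical helper: send one notification datagram to systemd.
# Returns 1 if the message was sent completely, 0 if no notify socket
# is configured or sending failed.
sub notify_systemd {
    my ($message) = @_;
    my $path = $ENV{NOTIFY_SOCKET};
    return 0 unless defined $path && length $path;

    # A leading '@' denotes a Linux abstract socket, whose real name
    # starts with a NUL byte. This assumes that the installed Socket
    # module packs such addresses correctly.
    $path =~ s/\A@/\0/;

    socket(my $sock, AF_UNIX, SOCK_DGRAM, 0) or return 0;
    my $sent = send($sock, $message, 0, pack_sockaddr_un($path));
    close $sock;
    return (defined $sent && $sent == length $message) ? 1 : 0;
}
```

With this in place, notify_systemd("READY=1") and notify_systemd("WATCHDOG=1") would take the place of notify( READY => 1 ) and notify( WATCHDOG => 1 ). Returning 0 when NOTIFY_SOCKET is unset keeps the program usable when it is started by hand, outside of systemd.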
Summary
Systemd offers a heartbeat supervisor for the processes that it manages. It can automatically restart processes that fail to check in regularly via calls to sd_notify, or by doing the equivalent action on a socket. Perl's Systemd::Daemon module gives you access to sd_notify in a Perl server process.