Tue, 08 Nov 2016

Icinga2, the Monitoring System with the API from Hell


Permanent link

Update 2016-12: We've met the Icinga2 developers, and talked through some of the issues. While not all could be resolved, the outlook seems much more positive after this. Please see my update for more information on the status quo.

At my employer, we have a project to switch some monitoring infrastructure from Nagios to Icinga2. As part of this project, we also change the way the store monitoring configuration. Instead of the previous assortment of manually maintained and generated config files, all monitoring configuration should now come from the CMDB, and changes are propagated to the monitoring through the Icinga2 REST API.

Well, that was the plan. And as you know, no plan survives contact with the enemy.

Call Me, Maybe?

Update 2016-12: This bug has been fixed in Version 2.6

We created our synchronization to Icinga2, and used in our staging environment for a while. And soon got some reports that it wasn't working. Some hosts had monitoring configuration in our CMDB, but Icinga's web interface wouldn't show them. Normally, the web interface immediately shows changes to objects you're viewing, but in this case, not even a reload showed them.

So, we reported that as a bug to the team that operates our Icinga instances, but by the time they got to look at it, the web interface did show the missing hosts.

For a time, we tried to dismiss it as an unfortunate timing accident, but in later testing, it showed up again and again. The logs clearly showed that creating the host objects through the REST API produce a status code of 200, and a subsequent GET listed the object. Just the web interface (which happens to be the primary interface for our users) stubbornly refused to show them, until somebody restarted Icinga.

Turns out, it's a known bug.

I know, distributed systems are hard. Maybe talk to Kyle aka Aphyr some day?

CREATE + UPDATE != CREATE

If you create an object through the Icinga API, and then update it to a different state, you get a different result than if you created it like that in the first place. Or in other words, the update operation is incomplete. Or to put it plainly, you cannot rely on it.

Which means you cannot rely on updates. You have to delete the resource and recreate it. Unfortunately, that implies you lose history, and downtimes scheduled for the host or service.

API Quirks

Desiging APIs is hard. I guess that's why the Icinga2 REST API has some quirks. For example, if a PUT request fails, sometimes the response isn't JSON, but plain text. If the error response is indeed JSON, it duplicates the HTTP status code, but as a float. No real harm, but really, WAT?

The next is probably debatable, but we use Python, specifically the requests library, to talk to Icinga. And requests insists on URL-encoding a space as a + instead of %20, and Icinga insists on not decoding a + as a space. You can probably dig up RFCs to support both points of view, so I won't assign any blame. It's just annoying, OK?

In the same category of annoying, but not a show-stopper, is the fact that the API distinguishes between singular and plural. You can filter for a single host with host=thename, but if you filter by multiple hosts, it's hosts=name1&host2=name2. I understand the desire to support cool, human-like names, but it forces the user to maintain a list of both singular and plural names of each API object they work with. And not every plural can be built by appending an s to the singular. (Oh, and in case you were wondering, you can't always use the plural either. For example when referring to an attribute of an object, you need to use the singular).

Another puzzling fact is that when the API returns a list of services, the response might look like this:

{
    "results": [
        {
            "attrs": {
                "check_command": "ping4",
                "name": "ping4"
            },
            "joins": {},
            "meta": {},
            "name": "example.localdomain!ping4",
            "type": "Service"
        },
    ]
}

Notice how the "name" and attrs["name"] attribute are different? A service is always attached to the host, so the "name" attribute seems to be the fully qualified name in the format <hostname>!<servicename>, and attrs["name"] is just service name.

So, where can I use which? What's the philosophy behind having "name" twice, but with different meaning? Sadly, the docs are quiet about it. I remain unenlightened.

State Your Case!

Behind the scene, Icinga stores its configuration in files that are named after the object name. So when your run Icinga on a case sensitive file system, you can have both a service example.com!ssh and example.com!SSH at the same time. As a programmer, I'm used to case sensitivity, and don't have an issue with it.

What I have an issue with is when parts of the system are case sensitive, and others aren't. Like the match() function that the API docs like to use. Is there a way to make it case sensitive? I don't know. Which brings me to my next point.

Documentation (or the Lack Thereof)

I wasn't able to find actual documentation for the match() function. Possibly because there is none. Who knows?

Selection Is Hard

Update 2016.12: The in operator works after all, if you get it right. Using the script debugger in combination with an API request with filter=debugger is a neat way to debug such issues.

For our use case, we have some tags in the our CMDB, and a host can have zero, one or more tags. And we want to provide the user with the ability to create a downtime for all hosts that have tag.

Sounds simple, eh? The API supports creating a downtime for the result of an arbitrary filter. But that pre-supposes that you actually can create an appropriate filter. I have my doubts. In several hours of experimenting, I haven't found a reliable way to filter by membership of array variables.

Maybe I'm just too dumb, or maybe the documentation is lacking. No, not maybe. The documentation is lacking. I made a point about the match() function earlier. Are there more functions like match()? Are there more operators than the ==, in, &&, || and ! that the examples use?

Templates

We want to have some standards for monitoring certain types of hosts. For example Debian and RHEL machines have slightly different defaults and probes.

That's where templates comes in. You define a template for each case, and simply assign this template to the host.

By now, you should have realized that every "simply" comes with a "but". But it doesn't work.

That's right. Icinga has templates, but you can't create or update them through the API. When we wanted to implement templating support, API support for templates was on the roadmap for the next Icinga2 release, so we waited. And got read-only support.

Which means we had to reimplement templating outside of Icinga, with all the scaling problems that come with it. If you update a template that's emulated outside of Icinga, you need to update each instance, of which there can be many. Aside from this unfortunate scaling issue, it makes a correct implementation much harder. What do you do if the first 50 hosts updated correctly, and the 51st didn't? You can try to undo the previous changes, but that could also fail. Tough luck.

Dealing with Bug Reports

As the result of my negative experiences, I've submitted two bug reports. Both have been rejected the next morning. Let's look into it.

In No API documentation for match() I complained about the lack of discoverable documentation for the match() function. The rejection pointed to this, which is half a line:

match(pattern, text)    Returns true if the wildcard pattern matches the text, false otherwise.

What is a "wildcard pattern"? What happens if the second argument isn't a string, but an array or a dictionary? What about the case sensitivity question? Not answered.

Also, the lack of discoverability hasn't been addressed. The situation could easily be improved by linking to this section from the API docs.

So, somebody's incentive seems to be the number of closed or rejected issues, not making Icinga better usable.

To Be Or Not To Be

After experiencing the glitches described above, I feel an intense dislike whenever I have to work with the Icinga API. And after we discovered the consistency problem, my dislike spread to all of Icinga.

Not all of it was Icinga's fault. We had some stability issues with our own software and integration (for example using HTTP keep-alive for talking to a cluster behind a load balancer turned out to be a bad idea). Having both our own instability and the weirdness and bugs from Icinga made hard and work intensive to isolate the bugs. It was much less straight forward than the listing of issues in this rant might suggest.

While I've worked with Icinga for a while now, from your perspective, the sample size is still one. Giving advice is hard in that situation. Maybe the takeaway is that, should you consider to use Icinga, you would do well to evaluate extra carefully whether it can actually do what you need. And if using it is worth the pain.

[/misc] Permanent link