Your monitoring platform is as susceptible to failure as any other component of your infrastructure, so monitoring the monitoring itself deserves some of your time. In this article we cover how you can use Wishbone to cross-check your monitoring setup and alert an external service such as PagerDuty in case of downtime.
A poll-based architecture
Poll-based monitoring is (arguably) the most common monitoring architecture. Polling means the initiative of validating the state of some endpoint lies with a central scheduler which repeatedly executes a check to determine the state of that endpoint.
Icinga, Nagios, Shinken, Naemon and OP5 are all examples of monitoring solutions relying on some form of a central scheduler executing checks.
The plan
A good start is to have an external monitoring application validate that checks are actually being scheduled and executed by the monitoring application. On top of that, we also want to know whether that external monitoring application is itself functioning properly.
It's important to construct the setup in such a way that both applications cross-monitor one another.
If the monitoring application can successfully POST data over HTTP (check_http) to the external monitoring application every minute, it knows the external monitoring application is up and running. If it cannot, the monitoring application should alert accordingly.
If the external monitoring application does not receive at least one POST over HTTP (check_http) within 2 minutes, it knows the monitoring application is not scheduling and executing checks and therefore generates an alert.
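The timeout rule on the receiving side can be sketched in a few lines of Python. This is only an illustration of the logic, not Wishbone code: the class name `HeartbeatWatchdog` and its methods are hypothetical, and the injectable clock exists purely so the behaviour can be exercised without waiting two minutes.

```python
import time


class HeartbeatWatchdog:
    """Tracks the most recent heartbeat and reports when it is overdue.

    Hypothetical helper for illustration; not part of Wishbone.
    """

    def __init__(self, timeout_seconds=120, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        # Start the window at construction time so that a monitoring
        # application which never reports in is detected as well.
        self._last_seen = clock()

    def heartbeat(self):
        """Record an incoming POST from the monitoring application."""
        self._last_seen = self._clock()

    def is_overdue(self):
        """Return True when no heartbeat arrived within the timeout."""
        return self._clock() - self._last_seen > self.timeout_seconds
```

An HTTP handler would call `heartbeat()` on every incoming POST, while a periodic timer would call `is_overdue()` and raise the alert when it returns `True`.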
Wishbone
We can use Wishbone to construct an external monitoring application which expects to receive data at least once every 2 minutes via HTTP POST.
2 key Wishbone modules to achieve this are:
Bootstrapping the server
To install Wishbone, including the external modules required for this setup, execute:
$ pip install wishbone wishbone-input-httpserver wishbone-output-http
The bootstrap file:
Running the server is really simple:
$ wishbone start --config bootstrap.yaml
check_http command
The check_http command scheduled and executed every minute by the monitoring application should be doing something similar to:
$ /usr/lib64/nagios/plugins/check_http -H wishbone-server-001.company.local -p 19283 -P hello
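If check_http is not available, the same heartbeat can be approximated with the Python standard library. The sketch below is an assumption-laden stand-in rather than a replacement for the plugin: the function name `post_heartbeat` is hypothetical, and it only distinguishes OK from CRITICAL, whereas check_http has more nuanced status handling.

```python
import http.client
import urllib.request

# Nagios plugin exit codes used by this sketch.
OK, CRITICAL = 0, 2


def post_heartbeat(url, payload=b"hello", timeout=10):
    """POST payload to url and return (exit_code, message).

    Hypothetical helper: any connection problem or HTTP error
    status is reported as CRITICAL.
    """
    request = urllib.request.Request(url, data=payload, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return OK, "HTTP OK: status %d" % response.status
    except (OSError, http.client.HTTPException) as error:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return CRITICAL, "HTTP CRITICAL: %s" % error
```

A wrapper script scheduled by the monitoring application would call `post_heartbeat("http://wishbone-server-001.company.local:19283/")`, print the message and exit with the returned code.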
Final words
Using this setup we have achieved the desired result of 2 applications monitoring each other's availability. In case the Wishbone server goes down, the monitoring application itself will alert. In case the monitoring application is down, Wishbone will not receive any incoming data and will therefore trigger a PagerDuty alert.
The described setup will not cover all possible failure scenarios, but it is a good first step towards having your monitoring setup monitored.
Please go ahead and give Wishbone a try. I'd greatly welcome feedback and ideas.