Diagnostics-TK Part 2: An incident starts with answers

Tags:
observability | sre | diagnostics-tk

Introduction

Whenever an outage occurs and the incident handling process begins, the first responders and incident commander should be able to start their work as well-informed as possible. Clarity about the situation is key to staying in control of the incident from the get-go, and clarity enables good decision making.

A previous article detailed how to use Diagnostics-TK to define a list of re-usable diagnostic tests. In this article we will explore how to present these tests as questions and answers that everyone involved can understand.

Outputs

By default, Diagnostics-TK logs the results from each test it executes:

2023-05-07 12:09:39,690 - INFO - MyCompanyService::MyCompanyServiceTwitter(twitter-api)::test_hostname_resolvable - OK
2023-05-07 12:09:39,690 - INFO - MyCompanyService::MyCompanyServiceInstagram(instagram-api)::test_hostname_resolvable - OK
2023-05-07 12:09:39,706 - INFO - MyCompanyService::MyCompanyServiceInstagram(instagram-api)::test_host_up - OK
2023-05-07 12:09:40,512 - INFO - MyCompanyService::MyCompanyServiceTwitter(twitter-api)::test_host_up - OK

Whilst this might be informative for the people who wrote the tests, it probably is not for everybody else involved during or after the incident. How can we improve on that?

Diagnostics-TK has the concept of outputs. An output is nothing more than a class which accepts the results of each executed test for further processing. One such class, ConsoleTable, is included by default; it displays the test results in a table-like structure.
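To make this concrete, here is a minimal, hypothetical sketch of what a custom output class could look like. The `Output` base class and the `handle_result` callback signature shown here are assumptions for illustration only; the real interface in `diagnostics_tk.output` may differ.

```python
class Output:
    """Assumed base class: subclasses receive each test result via handle_result()."""

    def handle_result(self, service, test_name, passed, reason=None):
        raise NotImplementedError


class FailureCounter(Output):
    """Example custom output: counts failed checks per registered service."""

    def __init__(self):
        self.failures = {}

    def handle_result(self, service, test_name, passed, reason=None):
        # Only failed checks are of interest here; passing ones are ignored.
        if not passed:
            self.failures[service] = self.failures.get(service, 0) + 1


counter = FailureCounter()
counter.handle_result("twitter-api", "test_host_up", True)
counter.handle_result("facebook-api", "test_hostname_resolvable", False, "timeout")
print(counter.failures)  # {'facebook-api': 1}
```

The point is simply that an output is passive: the runner pushes results into it, and the class decides what to do with them.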

We can extend the example from the previous article to include the ConsoleTable output as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from diagnostics_tk import DiagnosticsRunner
from diagnostics_tk.tools import exec_cli
from diagnostics_tk.output import ConsoleTable

class MyCompanyServiceSocialMedia:
    def __init__(self, hostname):
        self.hostname = hostname

    def test_host_up(self):
        """ """
        result, reason = exec_cli(
            f"nmap -sP {self.hostname}",
            exit_code=0,
            stdout_pattern="Host is up",
        )
        assert result, reason

    def test_hostname_resolvable(self):
        """ """
        result, reason = exec_cli(
            f"dig {self.hostname} @1.1.1.1",
            stdout_pattern="status: NOERROR",
            timeout=5,
        )
        assert result, reason


def main():
    with DiagnosticsRunner(name="my_infra", workers=5) as runner:
        runner.register(
            "twitter-api",
            MyCompanyServiceSocialMedia(hostname="api.twitter.com"),
        )
        runner.register(
            "instagram-api",
            MyCompanyServiceSocialMedia(hostname="api.instagram.com"),
        )
        runner.register(
            "facebook-api",
            MyCompanyServiceSocialMedia(hostname="api.facebook.com"),
        )

        runner.register("table", ConsoleTable(title="My Infra"))


if __name__ == "__main__":
    main()

The example shows that we can register a ConsoleTable instance just like a test class [1].

This yields the following table overview:

[screenshot: ConsoleTable overview with an empty description column]

which is already a bit more pleasant to look at, but not much more informative than before.

Docstrings

As seen in the above screenshot, the description for each entry is still empty. We can provide content by defining a docstring for each test_ method. The docstring is treated as a template using Python's str.format(), which Diagnostics-TK attempts to render using any instance attributes it can find.
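The rendering itself can be approximated with plain Python. Assuming Diagnostics-TK simply calls str.format() on the docstring with the instance's attributes, the behaviour looks roughly like this; `render_description` is a hypothetical helper for illustration, not the library's actual code:

```python
import inspect


class ExampleCheck:
    """Tiny stand-in test class with a templated docstring."""

    def __init__(self, hostname, dns_server):
        self.hostname = hostname
        self.dns_server = dns_server

    def test_hostname_resolvable(self):
        """Can DNS server `{dns_server}` resolve host `{hostname}`?"""


def render_description(instance, method_name):
    # Fetch the docstring and fill its {placeholders} from the instance's
    # attributes, approximating what Diagnostics-TK does internally.
    doc = inspect.getdoc(getattr(instance, method_name)) or ""
    return doc.format(**vars(instance))


check = ExampleCheck(hostname="api.twitter.com", dns_server="1.1.1.1")
print(render_description(check, "test_hostname_resolvable"))
# Can DNS server `1.1.1.1` resolve host `api.twitter.com`?
```

Because the placeholders are filled per instance, the same test class registered multiple times produces a distinct, concrete question for each host.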

What style of description works best depends on the specific use case and is up to the reader to determine. However, when running diagnostic tests during an outage, we could phrase what we are testing as a human-readable question. Consider the following updated example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from diagnostics_tk import DiagnosticsRunner
from diagnostics_tk.output import ConsoleTable
from diagnostics_tk.tools import exec_cli


class SocialMediaAPI:
    def __init__(self, hostname, dns_server):
        self.hostname = hostname
        self.dns_server = dns_server

    def test_host_up(self):
        """
        Can host `{hostname}` be reached using an ICMP ping?
        """
        result, reason = exec_cli(
            f"nmap -sP {self.hostname}",
            exit_code=0,
            stdout_pattern="Host is up",
        )
        assert result, reason

    def test_hostname_resolvable(self):
        """
        Can DNS server `{dns_server}` resolve host `{hostname}`?
        """
        result, reason = exec_cli(
            f"dig {self.hostname} @{self.dns_server}",
            stdout_pattern="status: NOERROR",
            timeout=5,
        )
        assert result, reason


def main():
    with DiagnosticsRunner(name="my_infra", workers=5) as runner:
        runner.register(
            "twitter-api",
            SocialMediaAPI(hostname="api.twitter.com", dns_server="1.1.1.1"),
        )
        runner.register(
            "instagram-api",
            SocialMediaAPI(hostname="api.instagram.com", dns_server="1.1.1.1"),
        )
        runner.register(
            "facebook-api",
            SocialMediaAPI(hostname="api.facebook.com", dns_server="1.5.1.1"),
        )

        runner.register("table", ConsoleTable(title="My Infra"))


if __name__ == "__main__":
    main()

This yields the following output:

[screenshot: ConsoleTable overview with each description rendered as a question]

Final Notes

In a previous article we motivated defining the diagnostic checks we execute manually during an incident as part of a Diagnostics-TK based setup. Turning manually executed, ad-hoc tests into a set of re-usable tests helps us save time during an incident. By encapsulating past incident experience in a programmatic way, we can build up a library of checks which validate the usual suspects that could cause, or already have caused, an outage.

The names of diagnostic checks can become quite cryptic, requiring domain-specific knowledge to understand their meaning and what they represent. In this article we covered how Diagnostics-TK uses docstrings, and how we can use them to our advantage by presenting each validation as a question and answer that everyone involved during and after the incident can understand.

Keep in mind that the above examples, although useful, are simple in nature and already easy to understand for many familiar with the topic. The sort of checks required to validate everything your service needs to run pass into the realm of the cryptic pretty quickly.

In the next article of the Diagnostics-TK series we will focus on how to create an output class and send the check results to Slack.

If you have any feedback or suggestions, don't hesitate to get in touch on Twitter.

Footnotes


  1. Diagnostics-TK knows a class is an output class if it subclasses the Output class.