Exploring Prometheus metrics using Xpectd

Tags:
observability | sre | xpectd

When exploring and experimenting with Prometheus it's often hard to tell with certainty whether the outcomes of your queries are actually correct. One reason is that it's not always straightforward to have a predictable set of metrics at hand. Such a data set would at least give a point of reference to compare the query results against. For example: if the response time of each request is between 1 and 2 seconds for the last 5 minutes, the average has to be around 1.5 seconds.

Xpectd is a web service which responds to requests according to a pre-defined test plan. This test plan defines some properties of the individual requests such as response times, return codes and return values.

Graphing response times with Traefik, Prometheus and Xpectd

Traefik is a popular reverse proxy and ingress controller used to expose services (among other things) in Kubernetes and Docker. Because it sits in front of those services, it is also able to collect metrics about them and expose these for Prometheus to scrape.
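
As an illustration, a setup like the one described here could look roughly like the following docker-compose sketch, with Xpectd behind Traefik's Docker provider (which is what yields a service name such as xpectd@docker in the queries below). Image names, ports and labels are assumptions rather than the exact setup:

services:
  traefik:
    image: traefik:v2.10
    command:
      - --providers.docker=true
      - --entrypoints.web.address=:80
      - --entrypoints.metrics.address=:8082
      - --metrics.prometheus=true
      - --metrics.prometheus.entryPoint=metrics
    ports:
      - "80:80"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  xpectd:
    image: xpectd:latest   # hypothetical image name
    labels:
      - traefik.http.routers.xpectd.rule=PathPrefix(`/xpectd`)
      - traefik.http.middlewares.xpectd-strip.stripprefix.prefixes=/xpectd
      - traefik.http.routers.xpectd.middlewares=xpectd-strip
      - traefik.http.services.xpectd.loadbalancer.server.port=8080   # assumed container port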

Once Prometheus scrapes the metrics Traefik exposes, histogram metrics such as traefik_service_request_duration_seconds_bucket can be used to graph the response times of the requests made to Xpectd.
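
Prometheus then needs a scrape job pointing at Traefik's metrics endpoint. A minimal sketch, assuming the dedicated metrics entrypoint on port 8082 from the compose sketch above:

scrape_configs:
  - job_name: traefik
    static_configs:
      - targets: ["traefik:8082"]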

Validating query results

Consider the following Xpectd configuration:

scenarios:
  - endpoint: hello_world
    response:
      status: 200
      payload: Hello world!
      min_time: 0        # respond within 0 to 0.5 seconds
      max_time: 0.5
    outage:
      schedule: "*/1 * * * *"
      duration: 0
      response:
        - percentage: 100
          status: 200
          payload: (╯°□°)╯︵ ┻━┻
          min_time: 2
          max_time: 3

Once deployed, the /hello_world endpoint will always respond with a 200 between 0 and 0.5 seconds (plus some overhead); the scheduled outage never takes effect because its duration is 0.

Any web client such as curl, wget, ab, JMeter, etc. can be used to generate requests to Xpectd in a similar fashion:

@http http://localhost:80/xpectd/hello_world
Return code: 200  last: 0.2737  min: 0.2737   max: 0.2737   avg: 0.0274   req/s: 5   Output: Hello world!
Return code: 200  last: 0.1280  min: 0.1280   max: 0.2737   avg: 0.0402   req/s: 5   Output: Hello world!
Return code: 200  last: 0.1829  min: 0.1280   max: 0.2737   avg: 0.0585   req/s: 5   Output: Hello world!
Return code: 200  last: 0.3479  min: 0.1280   max: 0.3479   avg: 0.0932   req/s: 6   Output: Hello world!
Return code: 200  last: 0.3675  min: 0.1280   max: 0.3675   avg: 0.1300   req/s: 4   Output: Hello world!
... snip ...

Median

The following query calculates the median of the request duration over the last 5 minutes:

histogram_quantile(0.50, sum(rate(traefik_service_request_duration_seconds_bucket{service="xpectd@docker"}[5m])) by (le))

This results in the following graph:

(☞゚∀゚)☞

As expected, the results hover a little above 0.25, roughly the midpoint of the configured range given that min_time is 0 and max_time is 0.5.

99th percentile

The following query calculates the 99th percentile of request duration over the last 5 minutes:

histogram_quantile(0.99, sum(rate(traefik_service_request_duration_seconds_bucket{service="xpectd@docker"}[5m])) by (le))

This results in the following graph:

ಠ_ಠ

This graph indicates that 99% of all requests complete in under 1.17 seconds, which starts to become a little confusing given that Xpectd's max_time value is 0.5. What gives?

Traefik, histograms and bucket sizes

The reason for these unintuitive results is that Prometheus histograms do not record exact values: each individual response time is assigned to a set of pre-defined buckets, and a counter is incremented for every bucket the value falls into.1

Traefik comes with a number of pre-defined default histogram buckets, which at the time of writing this article are: [0.1, 0.3, 1.2, 5.0].

This means that when a request's response time is at or below 0.1, the 0.1 bucket's counter is incremented, along with those of all larger buckets, since Prometheus histogram buckets are cumulative.1 Prometheus histograms also automatically include a le="+Inf" bucket which acts as an overflow for all values higher than the largest defined bucket.

When the response time is 0.5, the 1.2 bucket, along with all larger buckets, is incremented by 1. This might look puzzling at first glance, but it is correct: it merely records that the request took at most 1.2 seconds. histogram_quantile then estimates the requested quantile by linearly interpolating between bucket boundaries, which is where the 1.17 comes from: assuming response times are spread roughly evenly between 0 and 0.5 seconds, about 60% of observations fall at or below the 0.3 bucket and all of them at or below the 1.2 bucket, so the 99th percentile is interpolated as 0.3 + (1.2 - 0.3) × (0.99 - 0.6) / (1.0 - 0.6) ≈ 1.18.

The following query supports that finding:

histogram_quantile(1, sum(rate(traefik_service_request_duration_seconds_bucket{service="xpectd@docker"}[5m])) by (le))

(☞゚∀゚)☞

The value returned by this query is 1.2, indicating that all requests took less than or equal to 1.2 seconds. That's correct: since Xpectd's min_time is 0 and max_time is 0.5, some response times exceed the pre-defined 0.3 bucket, and the smallest bucket that still contains all observations is 1.2.

Final words

When using Prometheus histograms it's important to realize that the choice of bucket distribution determines the granularity of your measurements, which might lead to unexpected results. In the case of Traefik it might be useful to at least review the defaults and possibly override them with more granular buckets.
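
For example, Traefik's defaults can be overridden in its static configuration (or via the equivalent --metrics.prometheus.buckets CLI option). The following sketch uses finer buckets around the 0 to 0.5 second range from this article; the exact values are an illustration, not a recommendation:

metrics:
  prometheus:
    buckets:
      - 0.05
      - 0.1
      - 0.25
      - 0.5
      - 1.0
      - 2.5
      - 5.0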

If you have any questions or remarks you can reach out on Twitter.

Footnotes