15Five Engineering Blog

Managing health checks at scale

Caleb Collins-Parks ・ 5 min read

At 15Five, like many other companies, we use cron to schedule regularly occurring jobs. If a job fails to complete at the expected time or exits with an error code, we get an alert via Healthchecks.io.

failure healthcheck alert

For a while this worked pretty well.

But soon we faced a challenge. We wanted to expand health checks to all of our testing environments, but that would require creating hundreds and hundreds of new checks. That's a lot of manual work, and why work when you can automate?

I pondered different solutions: writing a script to copy the checks from one project to another, writing a script to parse all the ping endpoints into a format that could easily be inserted into our codebase, or managing the health checks via Terraform. But none of these solutions was perfect. I wanted something to completely automate the process.

Fortunately, after talking with the Healthchecks maintainer, I learned they already had a basic script for that. Every time a job completes, it makes a simple call to the Healthchecks API, which creates the check if it doesn't already exist and returns the check's ping endpoint. Ping the endpoint, and you're done! Easy peasy, no human work involved.

#!/bin/bash

API_KEY=your-api-key-here

# Check parameters. This example uses the system's hostname as the check's name.
PAYLOAD='{"name": "'"$(hostname)"'", "timeout": 60, "grace": 60, "unique": ["name"]}'

# Create the check if it does not exist.
# Grab the ping_url from the JSON response using the jq utility:
URL=$(curl -s https://healthchecks.io/api/v1/checks/ -H "X-Api-Key: $API_KEY" -d "$PAYLOAD" | jq -r .ping_url)

# Finally, send a ping (quoted so an empty or odd URL is not word-split):
curl -m 10 --retry 5 "$URL"

I expanded on this to work with alert channels, measure start and end times, and to record failures.
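
The expanded script is not reproduced in this text, so what follows is only a minimal sketch of what such a wrapper might look like, not the actual 15Five script. It relies on three Healthchecks.io features: `"channels": "*"` in the create call to attach all existing alert channels, a `/start` suffix on the ping URL to mark the start of a run, and a `/fail` suffix to record a failure. The function names, the check naming scheme, and the timeout and grace values are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper sketch. Usage: healthcheck.sh <command> [args...]

# Placeholder named in the article; replace it with your real API key.
API_KEY="healthchecks_api_key"
API_URL="https://healthchecks.io/api/v1/checks/"

# Assumed defaults for a daily job; tune them to your own schedule.
TIMEOUT="${HC_TIMEOUT:-86400}"
GRACE="${HC_GRACE:-3600}"

# Send one ping. A monitoring hiccup must never fail the job itself.
ping_check() {
    curl -fsS -m 10 --retry 5 "$1" >/dev/null || true
}

run_with_healthcheck() {
    local name payload url status

    # Assumption: name the check after the host and the command being run.
    name="$(hostname): $*"

    # Build the JSON with jq so quotes in the command are escaped correctly.
    # "channels": "*" assigns every existing alert channel to a new check.
    payload=$(jq -n --arg name "$name" \
        --argjson timeout "$TIMEOUT" --argjson grace "$GRACE" \
        '{name: $name, timeout: $timeout, grace: $grace, channels: "*", unique: ["name"]}')

    # Create the check if it does not exist yet and read back its ping URL.
    url=$(curl -fsS -m 10 --retry 5 "$API_URL" \
        -H "X-Api-Key: $API_KEY" -d "$payload" | jq -r .ping_url)

    # If the API call failed, still run the job, just without pings.
    if [ -z "$url" ] || [ "$url" = "null" ]; then
        echo "healthcheck: could not get a ping URL, running unmonitored" >&2
        "$@"
        return $?
    fi

    ping_check "$url/start"          # records the start time
    status=0
    "$@" || status=$?                # run the actual cron command
    if [ "$status" -eq 0 ]; then
        ping_check "$url"            # success: records the end time
    else
        ping_check "$url/fail"       # failure: triggers the alert
    fi
    return "$status"
}

# Only run when a command was passed on the command line.
[ "$#" -eq 0 ] || run_with_healthcheck "$@"
```

With a wrapper like this saved at a hypothetical path such as /opt/scripts/healthcheck.sh, a crontab entry would put the wrapper in front of the existing command, for example `0 3 * * * /opt/scripts/healthcheck.sh /usr/local/bin/nightly-report`. The wrapper returns the job's own exit code, so anything else that watches the job's status keeps working.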

Install jq, replace healthchecks_api_key with your API key and ship the file with your cron jobs. Now you can pass the cron command to the script and it will do all your work for you. Creating a new environment? Just update the API key and you're good to go!

We use Ansible to install jq and to template the crontab and healthcheck file. Feel free to use Puppet instead, or manual distribution, or Docker, or heck, redstone blocks in Minecraft. Who am I to judge?

This has been working well in our production environment for months now. As a bonus, when we go into Healthchecks.io we can see a detailed history of job start and end times:

Image of start and end pings with timing

It's great to be able to set up a new datacenter and take comfort in a vibrant field of solid green checks, and the only thing I had to do to set it up was to update an API key!
