Health checks are often a last-minute addition to your application stack, if they are even included at all. Advanced Site Reliability Engineering (SRE) practices try to push best practices (like health checks) forward so they are included early before applications are deployed. Many engineers know intuitively that health checks are important, but getting them implemented correctly—and keeping them up to date—is very hard. This article tries to document best practices for health checks, application development including SRE tenets, and how to improve the stability and even performance of your application when it runs in production.
Do you know how much it costs for your application to go offline? Don’t worry if you don’t (or can’t) know the exact figure: the important thing is to go through the mental process of estimating how much an outage or degradation of your application would “cost.” Costs are not measured only in currency; you also need to consider the impact on your brand, your Net Promoter Score (NPS), chatter online and on social media among customers and potential customers, and even negative reactions in the public media.
I have worked in Site Reliability and DevOps my whole career, at many different companies whose responses to downtime ranged from the casual “our site will be back up eventually and we’ll be fine” to “we have lost $XXX per minute in revenue and we need to investigate methods for replacing that revenue.” No matter the response, my team and I did the best job we could muster to keep the application and infrastructure services alive and well. There will always be bugs and issues with the code that is deployed and how it runs. However, if a problem occurs at a lower level in the application stack or in the infrastructure itself, then the application simply has no hope of servicing the needs of the consumers who visit your site.
The metaphor that I used often was one of cars driving on the motorway: if the roads are wet and slippery, then the cars will be unsafe and dangerous. When and if a crash occurs, then the roads will also be blocked and traffic will stop while the crash is cleaned up. It’s true that the cars may run out of gasoline, the drivers may get lost and go to the wrong destination, or the cars may not have good horsepower to drive quickly, but all of those factors are a higher order concern in the traffic stack. In this way, I saw my team’s and my job as keeping the roads as clear and safe and uncongested as possible so that the cars could operate at the best possible level.
Very early in the internet industry, the best practices for application stability were primitive and reactive. Site reliability involved a manual post-mortem approach: finding out what happened and then applying monitoring and alerting on that behavior to tell an operator that something was wrong. The best practices at the time involved a team of on-call engineers and operators who would literally watch an application 24 hours a day, 365 days a year (one extra day for leap years) and respond within a certain timeframe (usually less than fifteen minutes) to manually investigate and fix any issues that came up. In some cases the “team” was actually one poor person tasked with the impossible job of being on-call indefinitely.
There are several drawbacks to this approach, not the least of which is the human toll such manual response takes and the unsustainable pace. The cost of the team, the cost of staff turnover and training, the losses due to turnaround time and missed calls, and the impact to end users were all huge reasons to implement a better solution.
One key initiative that came about in the early aughts was the concept of a health check in the load balancer. I was part of a team that worked with several major load balancer manufacturers to implement a way to not only route traffic to services in our application, but to add monitors and tests (even then we called them “health checks”) to the endpoints which would allow us to add or remove services that were not responding or were unhealthy. The concept was that a web application would respond on a well-known port and respond with a well-known response that proved the application was ready to serve traffic.
For example, we might query the backend service at http://192.168.0.10/health-check and expect the service to respond with a string like 200 OK. This trivial example doesn’t sound like much until you realize that our end goal was to perform some internal checks in the application, which would allow us to do more than respond with a static string. For example, the application might check that the database responds to a sample query confirming that a table exists, and then check that CPU usage is at some nominal value. The health check could therefore be expanded to something like:
    HTTP/1.0 200 OK
    Checking DB… ok
    Checking CPU… ok
    Checking User cache… ok
Conversely, if something went wrong, the application could respond something like this:
    HTTP/1.0 500 CRITICAL
    Checking DB… ok
    Checking CPU… ok
    Checking User cache… CORRUPTED
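The application side of such a health check can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the function name `run_health_checks`, the convention that each check returns the string "ok" when healthy, and the plain ASCII "..." in the report are all hypothetical, not the code we actually ran.

```python
from typing import Callable, Dict, Tuple

# Assumption: a check is a zero-argument callable returning "ok" when healthy
# and any other short status word (for example "CORRUPTED") when not.
Check = Callable[[], str]


def run_health_checks(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Run every check and build a (status_code, body) pair for the endpoint."""
    lines = []
    healthy = True
    for name, check in checks.items():
        try:
            result = check()
        except Exception as exc:
            # A check that crashes is reported as a failure, not propagated.
            result = f"ERROR ({exc})"
        if result != "ok":
            healthy = False
        lines.append(f"Checking {name}... {result}")
    status_line = "200 OK" if healthy else "500 CRITICAL"
    return (200 if healthy else 500), "\n".join([status_line] + lines)


if __name__ == "__main__":
    status, body = run_health_checks({
        "DB": lambda: "ok",
        "CPU": lambda: "ok",
        "User cache": lambda: "CORRUPTED",
    })
    print(status)
    print(body)
```

Every check runs even after one has failed, so the report always shows the full picture rather than only the first problem.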
By checking for a response code of 200 and looking for the string “OK” (for example), the load balancer was able to remove an unhealthy service from the backend pool, allowing other servers to accept requests and steering traffic away from servers that would otherwise return an error. We could also set a timeout so that the load balancer would treat no response as an error. In this way, we could remove traffic from servers that were not responsive. The beauty of the system we were designing was that we would be able to monitor errors proactively and directly at the origin. The servers would be removed before they became a problem.
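The load balancer's side of this logic might look roughly like the sketch below. The function names, the two-second default timeout, and the "OK" marker are assumptions chosen for illustration; real load balancers expose this behavior as configuration rather than code.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List, Optional


def response_is_healthy(status: Optional[int], body: str, expected: str = "OK") -> bool:
    """Healthy means: the server answered 200 AND the body contains the marker.

    A status of None stands for "no response at all".
    """
    return status == 200 and expected in body


def probe_backend(url: str, timeout_seconds: float = 2.0) -> bool:
    """Fetch the health-check URL; any error or timeout counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_is_healthy(resp.status, body)
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (raised for 4xx/5xx) and socket timeouts are
        # all OSError subclasses, so a 500 response lands here as well.
        return False


def healthy_pool(backends: Iterable[str],
                 probe: Callable[[str], bool] = probe_backend) -> List[str]:
    """Return only the backends whose health check currently passes."""
    return [backend for backend in backends if probe(backend)]
```

Passing the probe function in as a parameter keeps the pool logic easy to exercise without a network, for example `healthy_pool(urls, probe=lambda url: url != bad_url)`.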
We would also use the same health check in our monitoring and alerting systems that we had perfected over the previous decades by manually watching them and using them for diagnosis. The difference is that we had more information about what was going wrong, and simultaneously we had more time to respond and properly diagnose the problems without affecting customers at all. Imagine the relief at not having to respond to every alert at 2AM within 15 minutes, but being able to automatically open a ticket to have a technician during the graveyard shift respond within the hour and restart the server and add it back to the pool.
Even better, we were able to convince the load balancer manufacturers to implement an inline retry policy based on the same idea. For example, if a live service request to a backend server failed with a 500 error code, the load balancer could not only remove the server from the pool, but also retry the request one more time on a healthy server. With this technology, the load balancer could try to resolve the situation before the customer even noticed anything was wrong, and no human could be quicker.
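A minimal sketch of that inline retry idea follows. Everything here is hypothetical: `route_with_retry`, the `send` callable standing in for the real network call, and the choice to always try the first server in the pool are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

# Assumption: send(backend, request) performs the real call and returns
# (status_code, body). It is injected so the routing logic stays testable.
Send = Callable[[str, str], Tuple[int, str]]


def route_with_retry(request: str, pool: List[str], send: Send,
                     max_attempts: int = 2) -> Tuple[int, str]:
    """Send the request; on a 5xx, eject that backend and retry on the next one."""
    status, body = 503, "no healthy backends"
    for _ in range(max_attempts):
        if not pool:
            break
        backend = pool[0]
        status, body = send(backend, request)
        if status < 500:
            return status, body
        # Eject the failing server; a later health check can add it back.
        pool.remove(backend)
    return status, body
```

With the default of two attempts, this matches the "retry one more time" behavior described above. A retry like this is only safe for requests that can be repeated without side effects, so in practice it would usually be limited to idempotent requests.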
We did better than save human labor and effort in monitoring the systems and responding to problems. After implementing health checks on the load balancers and inside the application, we were able to reduce errors and outages to the point where we actually raised our traffic levels by a double-digit percentage, and also increased actual revenue by a measurable amount. Users who might have encountered an error and navigated away after a Google search were staying around to browse and (more importantly) make purchases. By tweaking the load balancing algorithms to favor healthier (or faster) servers, we increased this beneficial business result even further. Steady growth over time occurred as well, because Google saw improved signals from users and fewer errors and therefore moved the site up in rankings. This was a stunning and unexpected outcome that was attributed to removing errors and downtime from our application running with this infrastructure and to utilising SRE practices (long before the term was coined).
With the advent of Kubernetes, the lessons learnt the hard way over the past few decades have been carried forward in architecting a resilient and reliable design for complex service interactions. Kubernetes uses the concept of a probe to test the application for liveness and readiness (there is a third probe that tests for startup delay, but we’re skipping that for the purposes of this article). With these two probes, we can implement a solution that makes applications far more stable and reduces downtime and manual intervention.
The first solution is the liveness probe, which has the job of figuring out whether a service is responding properly and within a certain time frame so that it can be considered to be running properly. If the probe fails repeatedly, the container is considered “dead”: Kubernetes kills it and restarts it in place, subject to the pod’s restart policy. For example, a web server may have a memory leak and stop responding after a certain amount of time or number of requests. Another example might be a database that fills up a temporary disk space area and is unable to process further transactions until the space is cleared out.
You may be saying to yourself that these seem like errors that should be corrected and dealt with properly rather than simply killing the container and waiting for it to be restarted. You would be absolutely correct, but let me counter with a rhetorical question: “Given this error condition, what do you want me to do at 2AM when no one is available?” Liveness probes can be excellent at monitoring non-responsive servers without state, but may not be so great at monitoring and restarting services with state, like the database example I gave above. So we recommend using the liveness probe only if you feel it would help more than it would hurt. We also take extra care to ensure that the liveness probes are very forgiving so they do not trigger on false positives.
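What "forgiving" means in practice is mostly a matter of requiring several consecutive failures before acting, which is what a failure threshold on a probe expresses. The small Python class below is a sketch of that counting rule only; the class name `ProbeTracker` and its default threshold are illustrative assumptions, not Kubernetes code.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Counts consecutive probe failures against a threshold."""
    failure_threshold: int = 3
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True once a restart is warranted."""
        if success:
            # A single success resets the count, so isolated blips are forgiven.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


if __name__ == "__main__":
    tracker = ProbeTracker(failure_threshold=3)
    for ok in [False, False, True, False, False, False]:
        print(ok, tracker.record(ok))
```

In this run only the final result triggers a restart: the success in the middle resets the count, so the first two failures are forgiven.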
Another counter-argument to the “fix it” stance has to do with direct or indirect engineering costs and with third-party or open-source code. Allocating resources to fully diagnose an intermittent problem, much less attempting to fix it, can be difficult. With third-party software or an open-source project, getting upstream fixes submitted, prioritized, approved, tested, and pulled back downstream can be enormously expensive and time-consuming. Sometimes the answer really is “just restart it”.
The second solution is the readiness probe, which is much like the solution I described earlier in this article with the load balancers. Indeed, the readiness probe does exactly what I’ve described: Kubernetes will periodically run a check (a command, an HTTP request, or a TCP connection) against the service running inside the container to gauge proper and timely responses. The ingress (just a fancy name for the load balancer) will not send traffic to this pod unless and until the readiness probe states that the service is ready for correct operation.
This helps in scenarios where a web server hits a threshold in connections or traffic levels and slows down or stops responding to new requests. It may be the case that the application simply cannot handle more than a certain number of transactions, so Kubernetes can use this signal to route traffic to another pod that is less busy. If this slowdown or refusal to respond can be correlated with other metrics (like traffic volume, CPU utilization, etc.), then the horizontal pod autoscaler could trigger more resources to be added to the service.
In fact, we believe that readiness probes are so important to the correct functioning of applications that we strongly recommend all services have a health check of some kind enabled and tested. We feel so strongly about this that we have considered making it a warning condition when no health check is configured on a running service in any of your environments at Release. Specifically, we could make readiness probes an opt-out requirement rather than an opt-in nicety.
Here are some actual examples of health checks that we have implemented for our customers. These examples are generic enough to be applied almost anywhere.
In this example we do a simple Nginx check on port 80 to ensure that the application is responding before we send traffic to the proxy.
    readiness_probe:
      exec:
        command:
        - curl
        - "--fail"
        - http://localhost:80/
      period_seconds: 2
      timeout_seconds: 2
      failure_threshold: 30
In this example we perform a health check against an Elasticsearch node to ensure that the cluster is healthy before accepting traffic (which presumably cannot be processed yet). This is a straight port from the Docker Compose examples in the open source repositories.
    - name: elasticsearch
      image: docker.elastic.co/elasticsearch/elasticsearch:7.9.2
      ports:
      - type: node_port
        target_port: '9200'
        port: '9200'
      liveness_probe:
        exec:
          command:
          - curl
          - "--fail"
          - localhost:9200/_cluster/health
        timeout_seconds: 2
        failure_threshold: 3
        period_seconds: 30
This example shows how a non-HTTP check for a PostgreSQL database can be used to ensure the database is up and responding to requests. Note that if this database is not clustered, then application database requests can fail when the health check fails. Your application will need to respond accordingly (either fail in turn to cascade a failover at a higher level, or perform some sort of mitigation so that a graceful failure happens). Recall that if this were a liveness probe, the PostgreSQL container would be killed and restarted, which may not be what you want at all.
    readiness_probe:
      exec:
        command:
        - psql
        - "-h"
        - localhost
        - "-c"
        - SELECT 1
      period_seconds: 2
      timeout_seconds: 2
      failure_threshold: 30
By implementing either (or both!) of these health checks, you can not only reduce the amount of time humans have to spend monitoring and intervening in applications, but you can even dramatically improve your traffic levels, response times, and performance. In some cases, you might even be able to measure the impact on your customers’ NPS and/or your company’s top and bottom line.