This post originally appeared on jhall.io.
One habit that I think every software developer, and practically every professional in any field, can benefit from is solving every problem twice.
Fix everything two ways
Almost every tech support problem has two solutions. The superficial and immediate solution is just to solve the customer’s problem. But when you think a little harder you can usually find a deeper solution: a way to prevent this particular problem from ever happening again.
Obviously, I believe this principle applies to more than just customer service.
A related concept comes out of Toyota: the Five Whys. Quoting from Wikipedia:
Five whys is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem. The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question "Why?". Each answer forms the basis of the next question.
When tackling an observed problem, whether it be in code, business processes, or potentially even a leaky sink, I like to combine these principles into a technique I call "Solve Every Problem Twice".
But as with the Five Whys, don't take "twice" too literally. In practice, this technique should always yield a bare minimum of two solutions, but will often result in 5 or more practical solutions.
The steps I follow are:
1. Use the Five Whys to determine the multiple causes of the observed problem.
2. Apply Joel Spolsky's advice of solving each cause at least once. Each cause should have an immediate fix, and most will also have at least one deeper solution.
3. Go through the first two steps again, this time for the process of solving the original observed problem.
To illustrate the technique in practice, let me describe a problem I ran into recently.
I was about to update one of the web sites I own, MinimalPairs.net, when I ran into a problem. I host the code for this web site on GitLab, where I use GitLab-CI for my continuous integration and deployment. I have GitLab-CI configured to create a review environment for me whenever a merge request is created.
When I recently pushed a change, I discovered that the review environment was not working: Chrome showed its famous "Your connection is not private" warning, which appears when an SSL certificate is broken.
I use Let's Encrypt, which I've written about before, to manage my SSL certificates for me. Sometimes it can take a few minutes to get a new certificate, so I was patient. But half an hour later it was still not working, so I knew I had a legitimate problem.
With a little digging through my Kubernetes logs, I found the cause of the problem:
Status:
  Acme:
    Uri:
  Conditions:
    Last Transaction Time:  2019-11-18T08:41:49Z
    Message:                Failed to verify ACME account: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
    Reason:                 ErrRegisterACMEAccount
    Status:                 False
    Type:                   Ready
I then looked in the configuration for my Kubernetes cluster and found that I was requiring version 0.5.2 of the cert-manager package.
helm install stable/cert-manager \
    --name cert-manager \
    --version 0.5.2 \
    --set ingressShim.defaultIssuerName=letsencrypt-prod \
    --set ingressShim.defaultIssuerKind=ClusterIssuer \
    --namespace kube-system \
    --tls
At the time, the latest version was 0.11.0, so clearly an upgrade was in order.
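Version numbers like these are easy to misorder: compared as plain text, 0.11.0 sorts before 0.5.2. As a small sketch, assuming GNU `sort` with its `-V` (version sort) option is available, the helpers below check whether a pinned version lags behind a known newer one. The names `newest_version` and `is_outdated` are made up for illustration, and how the latest version number is obtained is deliberately left out.

```shell
# Hypothetical helpers for illustration; not part of helm or cert-manager.

# Print the newer of two dotted version strings.
# Assumes GNU sort, whose -V flag orders 0.5.2 before 0.11.0.
newest_version() {
  printf '%s\n' "$1" "$2" | sort -V | tail -n 1
}

# Succeed (exit status 0) only when the pinned version is older than the latest.
is_outdated() {
  pinned=$1
  latest=$2
  [ "$pinned" != "$latest" ] && [ "$(newest_version "$pinned" "$latest")" = "$latest" ]
}
```

For example, `is_outdated 0.5.2 0.11.0 && echo "pinned version is behind"` prints the warning, while passing the same version twice prints nothing.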
With the immediate and root causes determined, let's go through the steps outlined above.
The observed problem was that the SSL certificate was broken, which leads to our first why: why was the certificate broken?
The reason, as discovered above, is that I was using an old, unsupported version of the cert-manager package. This leads to the second why: why was I using an old version?
As you may recall from above, I was explicitly requesting version 0.5.2 of the cert-manager package. Perhaps it would be reasonable to always install the latest version.
Now I can go through the two problems I identified above, and resolve to solve each at least once.
Upgrading cert-manager will solve the immediate, superficial problem and get my web site working again.
No longer pinning a specific version of cert-manager will prevent the problem from recurring. Of course, this may open up my system to a new risk, in case a new version of cert-manager somehow breaks something, but it may be a risk worth taking.
But don't forget the final step! Repeat for the problem-solving process itself.
In my example, I found two areas where I believe I could have improved the process of fixing the problem.
I don't update MinimalPairs.net very often. For all I know, this problem may have been lying in wait for weeks before I attempted an update and noticed.
Two possible solutions come to mind for this problem. The first is to use a simple monitoring service to alert me when the web site's SSL certificate is no longer working.
Second, and more proactively, I could take the same error logs I used to debug the problem and send them to a service such as Sentry.io, which can notify me immediately whenever a problem occurs.
In the spirit of solving each problem twice, I should do both of these.
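As a minimal sketch of both ideas, assuming `curl` and `grep` are available: `check_site` and `acme_errors` are hypothetical helper names, not part of any monitoring product, and actually delivering the alert (by email, Sentry, or anything else) is left out. The first function relies on curl exiting with status 60 when it cannot verify a server's certificate. The second looks for the `urn:ietf:params:acme:error:` prefix that appears in the log message shown earlier.

```shell
# Hypothetical helpers; names and messages are illustrative only.

# Monitoring: probe a URL and report whether it is served over valid HTTPS.
# curl exits with status 60 when certificate verification fails.
check_site() {
  url=$1
  status=0
  curl --silent --fail --max-time 10 --output /dev/null "$url" || status=$?
  case $status in
    0)  echo "OK: $url" ;;
    60) echo "ALERT: certificate verification failed for $url" ;;
    *)  echo "ALERT: $url is not healthy (curl exit code $status)" ;;
  esac
  return "$status"
}

# Alerting: read log or status text on stdin and print every line that
# carries an ACME error. Exits 0 when at least one such line is found.
acme_errors() {
  grep -F 'urn:ietf:params:acme:error:'
}
```

Run on a schedule (from cron, for example), a check like `check_site` would flag a broken certificate without waiting for the next deploy, and piping status or log text through `acme_errors` gives a simple trigger for whatever notification service is in use.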
Once the problem was identified, debugging it took longer than necessary. This was largely because Kubernetes doesn't keep all logs in a centralized location. This could be solved by setting up a centralized logging system. I already use Loggly for most of my logging, so I can just set it up to track my Kubernetes logs as well.
Using my technique, I came up with five potential solutions to a simple SSL certificate problem:
- Upgrade the certificate manager
- Don't depend on a specific version of the certificate manager
- Set up monitoring for the web site
- Set up error alerting
- Set up better logging
By applying all five of these solutions, I not only solve the immediate problem but also improve the overall health of my entire system, so the next problem, no matter where it happens in the technology stack, will be that much easier to solve.