Bugs will happen. Are you prepared?

#programming #design #devjournal #eventdriven

It's very common for developers (and managers!) to focus on new features, performance, or bug fixing when coding. But when a project is complex enough, there's a hard truth that we all must take into account: bugs will happen. Period. I have never seen a new project go live without issues. In fact, I don't think I've ever seen a project stay bug-free in production for more than a month straight. Software engineering is imperfect, and our code must be prepared to deal with it. This post is about what you can do to make life easier for you and your team.

A couple of years ago, a bunch of us at The Agile Monkeys started this huge project for a bank. We rebuilt a critical part of their system, with an event-driven architecture that would be processing millions of monetary (and non-monetary) transactions daily. The team did a fantastic job and it's still in production and working well to this day. However, it was not a bed of roses. We faced some important challenges that I believe made us all much better developers today.

As usual with any big project, there was a lot of technical learning involved, from do's and don'ts in event-driven architectures to communication both inside and across teams. There were loads of lessons learned, but one takeaway stood out for me above the rest: we dedicated way too much time to code optimization, bug fixing, and new features, and not enough to defensive programming.

What's defensive programming?

Quoting Wikipedia directly:

Defensive programming is a form of defensive design intended to ensure the continuing function of a piece of software under unforeseen circumstances. Defensive programming practices are often used where high availability, safety, or security is needed.

The emphasis here is on unforeseen circumstances. Bad Things™ will happen, no matter what you do. So you have to prepare for them.

At our first launch, we thought that we had a very robust application. It was performant and well tested. But, again, Bad Things™ can and will happen. In our case, it was a rate limit we didn't know about on our vendor's production API, which temporarily shut our application account down. Then, even though our services had a retry mechanism in place, they all ended up retrying simultaneously. As a result, a large majority of our transactions failed. Recovering from those errors required a lot of work from all of us. We were clearly not prepared for an issue at such a scale. We found ourselves doing a lot of manual, tedious work to resolve the transactions stuck in pending status. It took many hours of hard work, and it naturally grew into some automatic tooling for reconciliation. That was tooling we should have implemented from the start, instead of dedicating so much time to optimizing performance, fixing bugs, or even adding new features.
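A common defense against services retrying in lockstep is to add random jitter to an exponential backoff, so that each client waits a different amount of time. The snippet below is only a minimal sketch of that idea, not what we actually ran in the project: `backoff_delay` and `call_with_retries` are hypothetical names, and the `base`, `cap`, and `max_attempts` values are arbitrary assumptions.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, rng=random):
    """Run operation(), retrying on any exception with a jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller record the failure
            # Random wait, so that many clients do not retry at the same instant.
            sleep(backoff_delay(attempt, rng=rng))
```

Passing `sleep` and `rng` as parameters is a design choice that keeps the helper easy to test without really waiting. Note that jitter only spreads the retries out; it would not have saved us from the account shutdown itself, which is why the reconciliation tooling still matters.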

Reconciliation tooling

As for the tooling that you should have in place, it will greatly depend on the needs of your system. But for an event-driven architecture like the one that we were working with, we found it incredibly useful to have an Admin API (be careful with the security on this one!) with some automatic checks and reconciliation logic. The methods in this API would be able to replay events, look for inconsistent values in the database or, in the most extreme scenarios, even update some database fields directly.
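To give an idea of what one of those automatic checks could look like, here is a minimal sketch of a consistency check that compares the status stored in the database with the status of the latest event for each transaction. Everything in it is an assumption made for illustration: the `Event` shape, the status names, and the `find_inconsistencies` function are hypothetical, and a real Admin API would expose this behind proper authentication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    transaction_id: str
    status: str    # e.g. "pending", "completed", "failed" (assumed names)
    sequence: int  # order in which the event was received


def find_inconsistencies(db_statuses, events):
    """Return {transaction_id: (db_status, expected_status)} for mismatches.

    The expected status is the one carried by the latest event of each
    transaction; a transaction missing from the database shows up as None.
    """
    latest = {}
    for event in events:
        current = latest.get(event.transaction_id)
        if current is None or event.sequence > current.sequence:
            latest[event.transaction_id] = event
    return {
        tx_id: (db_statuses.get(tx_id), event.status)
        for tx_id, event in latest.items()
        if db_statuses.get(tx_id) != event.status
    }
```

The result is a report, not a fix. Keeping detection separate from correction lets someone review the mismatches before any event is replayed or any field is updated by hand.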

That, combined with an event store where every event received is kept (especially if not completed successfully), will allow your team to work wonders. It's important that the developers are able to safely and consistently replay the events. One of the issues that we faced in our project was that, even though our log aggregator worked very well for grabbing event data, it was not reliable enough at our scale. Only a small percentage of events (below 1%) were lost, but that equated to thousands of events that needed to be searched for manually. Having a more complete and reliable event store for our needs was a huge relief for our reconciliation process.
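As a rough illustration of "safely and consistently replay the events", the sketch below stores every event before processing it, tracks whether it completed, and makes replays skip anything that already succeeded. It is an in-memory stand-in with hypothetical names (`StoredEvent`, `EventStore`, `replay_incomplete`); a real store would be durable, and this is not the implementation we used.

```python
from dataclasses import dataclass


@dataclass
class StoredEvent:
    event_id: str
    payload: dict
    status: str = "received"  # becomes "completed" or "failed"
    attempts: int = 0


class EventStore:
    """In-memory stand-in for a durable event store (illustration only)."""

    def __init__(self):
        self._events = {}

    def append(self, event_id, payload):
        # Keep every event received; a duplicate delivery is ignored.
        self._events.setdefault(event_id, StoredEvent(event_id, payload))

    def process(self, event_id, handler):
        event = self._events[event_id]
        if event.status == "completed":
            return  # already handled, so replaying it is a no-op
        event.attempts += 1
        try:
            handler(event.payload)
            event.status = "completed"
        except Exception:
            event.status = "failed"  # kept in the store for a later replay

    def replay_incomplete(self, handler):
        """Re-run the handler for every event that is not completed."""
        pending = [e.event_id for e in self._events.values()
                   if e.status != "completed"]
        for event_id in pending:
            self.process(event_id, handler)
        return pending
```

The important property is that a replay can be run as many times as needed: completed events are never applied twice, which matters a lot when the events are monetary transactions.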

There's another very good benefit that you'll get from all of this tooling: you can automate those checks and processes to run periodically once you've verified their robustness. For instance, you could run a reconciliation task daily on all of those events that failed or stalled for more than an hour. Once the tooling is in place, there are many advantages that you can enjoy!
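The selection step of such a daily task could be as small as the sketch below. The event dictionaries, the `status` values, and the `updated_at` field are assumptions made for the example, and the scheduling itself (cron, a cloud scheduler, or whatever your platform offers) is left out.

```python
from datetime import datetime, timedelta, timezone

STALL_THRESHOLD = timedelta(hours=1)  # "stalled for more than an hour"


def needs_reconciliation(event, now):
    """True for failed events, or pending ones older than the threshold."""
    if event["status"] == "failed":
        return True
    return (event["status"] == "pending"
            and now - event["updated_at"] > STALL_THRESHOLD)


def select_for_reconciliation(events, now=None):
    """Return the ids of the events the periodic task should pick up."""
    now = now or datetime.now(timezone.utc)
    return [e["id"] for e in events if needs_reconciliation(e, now)]
```

The ids returned here would then be fed to whatever replay mechanism you trust, ideally the same one your team already uses manually through the admin tooling.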

So, for every reader out there, I'll be proud if you can just carry home the following takeaway: bugs will happen. Are you prepared to deal with them in production? If not, start dedicating some time to it, even before bug fixing or optimizing your code. Invest in a calm mind. Future You will thank you when enjoying the spare time with your friends instead of dealing with on-call issues!