Itai Katz

Posted on • Originally published at swimm.io

Incident management & the wet floor sign

In software engineering, incidents are occasions where critical bugs and issues are exposed in the production environment. They can be found by a user, an automated test, or an engineer. Either way, an incident is a critical issue that reached production without the engineering team, a test, or any automation catching it beforehand.

Incidents are always a bit of a sensitive topic - whether it’s because people don’t want to admit there are issues with their code, or because they tend to expose flaws and issues in our processes that we usually don’t like to discuss.

This is even more the case with repeat incidents, as these could potentially have been avoided altogether had we simply known about them. In other words, we wouldn’t have slipped on the wet floor if someone had put a warning sign up for us.

When talking about incidents, the legal concept of “Errors and Omissions Excepted” comes to mind. And as engineers, we know that mistakes happen, and they can happen again; this is quite common in the software engineering industry. We understand that people are fallible and make mistakes, and of course we should expect that. And we also assume that they have good intentions; we hope and expect that mistakes will be corrected.

Think about it: how many times have you refactored code to avoid a bug, only to have another engineer make a similar mistake sometime later on? Or how many times have you created a hot-fix to address a certain issue, promising you’ll refactor and turn the patch into a permanent solution?

That is why we should fix these issues, add tests and processes to ensure they won’t happen again, and also document what happened and share the knowledge.

Why should developers document incidents?

The legal concept of “Errors and Omissions Excepted” is essentially a disclaimer for situations where information is changing so rapidly that it is hard to obtain and thoroughly review an accurate snapshot of it. And code is precisely that - a rapidly changing, hard-to-review set of information, whether it’s the code itself or its environment.

One approach is to write tests to make sure that an incident never happens again. But while tests are important and helpful, they usually look at an incident through the pinhole view of a bug in a certain line of code - which means a test will not contribute any information about how an incident happened, what needs to be done outside of the code to avoid a similar incident, and what processes and changes are needed to prevent it from happening again.

And that is why even though tests are helpful, we also need documentation.

But documentation is not free from potential issues. Even if documentation is created about a particular issue (and that is a big if), it is usually outdated, detached from the code, or hard to locate or even know about in the first place.

Here at Swimm, we’ve encountered similar issues and have used some of our own Swimm tools to try to solve them. We simply add a “Wet Floor Sign” to code paths that might be problematic to change.

Best practices for incident response documentation

There are many ways to avoid repeating incidents, and as developers, we are highly motivated to improve our best practices for incident management.

Encourage documentation
Step 1 is to make sure to create documentation once a certain issue is discovered and fixed.

I know - encouraging engineers to create documentation is hard to do, but Swimm makes it a lot easier with the option of adding a template for Incident Reports with Swimm’s Templates.
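To make the idea of an incident report template concrete, here is a minimal sketch in Python. It is not Swimm’s template format; the `IncidentReport` fields and the `render_markdown` helper are hypothetical names chosen for illustration. The sections are based on what this post says documentation should capture: what happened, how it was fixed, and what needs to change outside of the code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    """Hypothetical incident report fields (an assumption, not Swimm's template)."""
    title: str
    what_happened: str
    how_it_was_fixed: str
    follow_ups: List[str] = field(default_factory=list)      # process changes outside the code
    related_paths: List[str] = field(default_factory=list)   # code paths that carry the "wet floor sign"


def render_markdown(report: IncidentReport) -> str:
    """Render the report as a markdown document that can be stored next to the code."""
    lines = [
        f"# Hotfix: {report.title}",
        "",
        "## What happened",
        report.what_happened,
        "",
        "## How we fixed it",
        report.how_it_was_fixed,
        "",
        "## Follow-ups outside the code",
    ]
    lines += [f"- {item}" for item in report.follow_ups] or ["- None recorded"]
    lines += ["", "## Related code paths"]
    lines += [f"- `{path}`" for path in report.related_paths] or ["- None recorded"]
    return "\n".join(lines) + "\n"
```

Putting “Hotfix” in the title is a deliberate choice here: it keeps the report easy to find later with a plain text search.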

templates

And we have also added documentation to our Definition of Done for our hot-fix release. For example, in version 0.7.0, we had an issue where our release notes assets were not added to the correct path, and we created documentation to summarize what happened and how we fixed it.

hot fix

Attach documentation to code
As part of our Definition of Done for hot-fix releases, we also recommend adding a snippet from the code that fixes the issue to the documentation. At Swimm, it’s not a requirement, but a strong recommendation.

As you can see below, we did exactly that by adding our Release Notes to the document describing the hot-fix. This is the heads-up for us, so we remember our fix and don’t slip next time.

snippets

Maintaining documentation
By adding a code snippet to an Incident Report document, you can track whether someone changes the lines of code that fixed a specific issue. Specifically, Swimm’s GitHub App will automatically flag the document as Auto-synced (when the change is insignificant) or outdated (when the snippet needs to be reselected).

In our case, if someone significantly changed a code path related to the incident, they would be asked to read the doc and edit it accordingly, bringing more attention to the sensitivity of the issue. Likewise, if someone refactors the structure of our release notes, they should update the documentation containing our warning and move it to the new location.

out of sync docs

And with more minor changes, Swimm can automatically synchronize your document and commit it directly to your Pull Request.
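The distinction between a change that can be synced automatically and one that makes a document outdated can be sketched roughly as follows. This is an illustration only and not Swimm’s algorithm: the `classify_snippet` function, its three labels, and the rule that whitespace-only differences count as insignificant are all assumptions made for this example.

```python
from typing import List


def _normalize(line: str) -> str:
    # Collapse whitespace so indentation or spacing changes count as insignificant.
    return " ".join(line.split())


def classify_snippet(snippet: List[str], current_file: List[str]) -> str:
    """Return 'in_sync', 'auto_syncable', or 'outdated' for a documented snippet.

    Assumed rules (hypothetical):
    - the snippet appears verbatim somewhere in the file -> 'in_sync'
    - it appears with whitespace-only differences        -> 'auto_syncable'
    - otherwise a person has to reselect the snippet     -> 'outdated'
    """
    n = len(snippet)
    if n == 0:
        return "outdated"
    # Every run of n consecutive lines in the current file.
    windows = [current_file[i:i + n] for i in range(len(current_file) - n + 1)]
    if snippet in windows:
        return "in_sync"
    wanted = [_normalize(line) for line in snippet]
    for window in windows:
        if [_normalize(line) for line in window] == wanted:
            return "auto_syncable"
    return "outdated"
```

A check like this could run in CI on every Pull Request, so that a change to the lines behind a past hot-fix is surfaced before it is merged.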

Locate your documentation easily, right when you need it
Because a snippet is attached to the Incident Report document, Swimm makes the document discoverable through our IDE plugin while you write code. As a result, everyone who adds a new release note sees the hot-fix documentation tagged to this object.

Furthermore, since Swimm stores your documentation as markdown files together with your code, you can globally search for phrases such as “hotfix” or “incident,” and you’ll be able to find and read the relevant documentation without leaving the IDE. So every time someone adds a new item to our releaseNotes array, they will see there is a related doc with “Hotfix” in its title.
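Because the docs are plain markdown files in the repository, the same search also works outside the IDE. Here is a minimal sketch in Python; the `find_incident_docs` function and the idea of scanning every `*.md` file under a root folder are assumptions made for illustration, not part of Swimm.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def find_incident_docs(
    root: Path, phrases: Iterable[str] = ("hotfix", "incident")
) -> List[Tuple[Path, str]]:
    """Return (path, matched phrase) for each markdown file mentioning a phrase.

    Matching is case-insensitive and checks both the file name and its contents.
    """
    needles = [phrase.lower() for phrase in phrases]
    hits: List[Tuple[Path, str]] = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        name = path.name.lower()
        for needle in needles:
            if needle in text or needle in name:
                hits.append((path, needle))
                break  # one hit per file is enough
    return hits
```

For example, `find_incident_docs(Path("."))` lists every markdown doc in the current repository that mentions a hot-fix or an incident.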

screenshot of ide

We also have an Incident Reports Playlist that gathers all of our reports and can help others catch up when onboarding or when encountering a familiar issue. It also gives managers a way to keep track of whether their teams have read and understood past incidents and fixes.

incident playlist

Also, by having our New Document Notification emails sent to Slack, you can get more people to look at new documentation that’s been added.

screenshot: doc recommendation

Bottom Line

While some incidents are unavoidable, with Swimm, you will be less likely to have the same incidents repeating themselves later on down the road and more likely to fix issues or adjust areas that have been lingering for a while.

Swimm’s platform utilizes numerous tools such as our Web App, GitHub App, and IDE extensions that make it easier to work on code-coupled documentation, maintain it, and find it when you most need it.

If you’re interested in the assurance that Swimm’s platform will handle your wet floor signs for you, learn more about Swimm and join our open beta today. Avoiding preventable slips and falls saves countless hours of time and aggravation. And you’ll be able to give away your wet floor signs.
