TLDR: Discussing coding guidelines and making them explicit can give your development team a sound foundation to work on, and can lead to a significant improvement in the quality of your code, the system performance and stability, personal development and in the end developer happiness.
The coding guidelines we came up with as a team are:
- We are Boyscouts
- Sharing is Caring
- We keep documentation alive
- We learn from our mistakes
- We know our systems
If you've been working in software engineering for a while, you have most likely encountered one or more of the following situations:
The environment you're working in has piled up technical debt, slowing down your daily work, without a clear process of how to get rid of it.
Systems you work on every day surprise you with bugs or outages that you missed completely, because of incomplete alerting and/or a system that spams you with meaningless errors.
You're starting on a project that is in dire need of documentation and leaves you guessing on fundamental topics.
Certain applications have only one knowledge carrier, without whom no changes or fixes can be made.
Different teams are more or less isolated from each other, without a lot of knowledge exchange, so the same lessons are learned multiple times, (at least) once per team.
...and of course, the list could go on.
A year ago, my team checked all of these boxes. For perfectly good reasons. It had been understaffed for ages and still managed to write a reasonably complex application, with a good number of features and quite a high throughput. All the while delivering value to the company.
Still, the situation had become unmaintainable.
So, let's talk!
We had started to talk loosely within the team about the issues we saw, and what we could change in our process to improve the situation, when two temporary project teams joined our setup.
For us this felt like a great opportunity to start to collaborate and incorporate fresh ideas. We introduced a weekly meeting with all backend developers within our department, as a platform to present ideas, exchange experiences and discuss.
The topics can be anything from "Hey, I have this really hard problem and I could use some advice." through "Look at this horrible code and how I refactored it!" or "Last week we had a devastating crash and here are our takeaways" to "Which guidelines do we all agree on, to make our experience as developers better?". Anybody can contribute, and every topic and its outcomes are documented.
Over the course of the last half year, we have already benefitted a lot from the knowledge exchange in this meeting, easily saving more development time than the meeting cost us. We spent a lot of that time actively discussing the guidelines we should follow in our daily work.
The coding guidelines
We had a look at the problems we were currently facing. We wanted to pave the way to continuously improve the situation, by defining some guidelines that every developer should follow. This poster is what we came up with:
We are Boyscouts
"Leave the campground cleaner than you found it!"
In the past, removing technical debt had always been strictly separated from building features, and subsequently was deprioritized. This slowed us down significantly in the end, since any change was made harder by the amassed technical debt.
It also frustrated the developers and made discussions with the Product Owner more emotional than needed.
In our joint discussions, we decided that any refactoring that removes technical debt can, up to a certain complexity, be done within the scope of an existing ticket, regardless of whether it is a feature ticket or not.
We decided to timebox these refactorings to a maximum of one day of development. Anything exceeding this scope has to become its own ticket.
We agreed with the Product Owner to try this. It quickly turned out that our system benefited greatly from it, since we also solved a lot of performance issues, so we never looked back.
On the contrary: as a follow-up experiment, we decided to separate feature development and our technical backlog into separate swim lanes on our Kanban board. The prioritisation of the maintenance backlog is now taken over by the engineers. This allows the Product Owner to focus on the product view, and allows the developers to decide which topics are worth investing part of their time in, while documenting the issues in a backlog.
- A shared mindset of continuous improvement.
- Doubling of the test coverage from 34% to 68%.
- Improvement of our code quality index from 64 to 75.
- A documented backlog of technical debt that is maintained by the engineers in a self-organised way.
- Clarity on how much cleanup can be done in the scope of a ticket, without it having to be discussed.
Sharing is Caring
"You're not a lead dev if you're not helping teammates level up."
One of the prerequisites for being (and staying) a successful software developer is to keep learning and improving. In our discussions, we agreed that the easiest way to learn is to get feedback from other developers. Luckily, everyone in our group is more than happy to give and receive constructive feedback. An environment with knowledge silos, where each developer works on only one app, is counterproductive, though.
To break these silos up, we decided that we would create a pull request for every change, and have it reviewed by two reviewers: one from within the team, and one additional cross-team reviewer. We're using Danger to support our PR process and, for example, to have the reviewers chosen randomly.
This seems like a minor footnote, but it turned out to be much easier to establish the process automatically, instead of relying on the developers to do it manually.
To prevent roadblocks, only the first review is mandatory; the second is optional.
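The selection rule described above can be sketched in a few lines. This is only an illustration of the idea, not our Danger setup: the function name `pick_reviewers`, its parameters and the returned keys are all hypothetical.

```python
import random
from typing import Optional, Sequence


def pick_reviewers(
    author: str,
    own_team: Sequence[str],
    other_teams: Sequence[str],
    rng: Optional[random.Random] = None,
) -> dict:
    """Pick one mandatory in-team reviewer and one optional cross-team reviewer.

    All names here are hypothetical; this is not part of Danger's API.
    """
    rng = rng or random.Random()
    # The author never reviews their own pull request.
    team_candidates = [dev for dev in own_team if dev != author]
    cross_candidates = [dev for dev in other_teams if dev != author]
    if not team_candidates:
        raise ValueError("no in-team reviewer available")
    return {
        "mandatory": rng.choice(team_candidates),
        # The cross-team review is optional, so a missing candidate is fine.
        "optional": rng.choice(cross_candidates) if cross_candidates else None,
    }
```

Passing a seeded `random.Random` makes the choice reproducible, which helps when testing the rule itself.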
We also want to do more pair programming, and are encouraging people to do so. We decided against enforcing this with a rigid pairing schedule. Instead, we tried reducing the WIP limit for ongoing development, thereby "forcing" people to ask others to pair on things. While this definitely leads to more conversation and pairing, it's sometimes perceived as being too restrictive. So, we're still trying to find a feasible way to motivate ourselves to pair, without actually impairing our productivity (I'd be grateful for recommendations).
Additionally, to reduce differences between the apps, we recently introduced a shared library including e.g. all code quality tools. We're aiming for the highest standards. When including the new library, every app had to exclude a couple of the quality checks, so the continuous integration wouldn't fail. The exclusions are made explicit in a file, so we can now work on getting rid of the exclusions, setting a clear common goal to be reached.
- Fewer and fewer knowledge silos; knowledge is spread broadly across the team.
- Accelerated professional growth.
- Every app has moved closer to the common quality benchmark set by the shared library, although not all of them have fully reached it (so far).
We keep documentation alive
"Weeks of programming can save you hours of documentation!"
Like many companies, we're using a company-wide wiki, to document all the things. Unfortunately, we're running into well-known problems, like out-of-date entries, duplicate topics with different information, or entries that don't show up when searching for them, for whatever reason.
In the end, that makes the documentation wiki a lot less helpful.
We discussed this, and came up with three main points of information that we were lacking:
- The context of commits.
- App specific information, regarding e.g. the setup.
- Company-wide information, regarding e.g. the infrastructure.
To address these issues, we agreed to focus on the quality of the commit history and the information in the pull requests. We decided to always add references to tickets in our ticketing system in the pull request, and references to pull requests in the commits. Every commit should do exactly one thing. Commits with meaningless messages, such as "WIP" or "Small Fix", should be rebased interactively into meaningful commits before merging.
Now, when looking at a specific change in the code base, it is easy to figure out which pull request and which ticket were responsible for the change. As an added benefit, the git log is clean and a joy to read!
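A check like this is easy to automate. The following is a minimal sketch, assuming pull requests are referenced in the form `#123`; the list of meaningless subjects, the reference pattern and the function name are assumptions for illustration, not our actual tooling.

```python
import re

# Hypothetical conventions: adjust the patterns to your own tracker and host.
PR_REFERENCE = re.compile(r"#\d+")
MEANINGLESS = {"wip", "small fix", "fix", "tmp"}


def commit_message_problems(message: str) -> list:
    """Return a list of human-readable problems for one commit message."""
    problems = []
    lines = message.strip().splitlines()
    subject = lines[0].strip() if lines else ""
    if not subject:
        problems.append("empty subject line")
    elif subject.lower().rstrip(".!") in MEANINGLESS:
        problems.append(f"meaningless subject: {subject!r}")
    if not PR_REFERENCE.search(message):
        problems.append("no pull request reference such as '#123'")
    return problems
```

An empty list means the message passes. Such a function could run in a commit hook or as part of the pull request checks.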
Also, we took some time to add information to the README for each project, and agreed on updating it anytime we notice that something is missing.
We haven't really figured out how to address the problem of the company-wide information "bottom-up" so far.
- Clean git history
- Living documentation
We learn from our mistakes
"Insanity is making the same mistake over and over again and expecting a different result"
We have a rather complex system and oftentimes, when we had acute problems, the immediate actions focussed on making the system run again, without looking deeper into the causes or the effects of the outage.
We decided to change this and introduced Post Mortems, which we write after every incident that affects users.
We're using a template containing four sections:
- Top-level summary with effects on KPIs: "What was the effect on the product?"
- Detailed analysis of problem cause
- Documentation of relevant analysis data (screenshots, discussions etc.)
- Follow-up Tasks / Conclusions
We keep these Post Mortems checked into our respective repositories. We review them like we would any pull request and also invite affected stakeholders to give feedback. Subsequently, we share them in our Tech Update meeting.
We then create tickets for the follow-up tasks and prioritise them.
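Since the Post Mortems live in the repository, an empty skeleton can be generated from the four sections listed above. This is a rough sketch only: the Markdown layout, the `post-mortems/` file name pattern and both function names are assumptions.

```python
from datetime import date

# The four sections follow the template described above; everything else
# (Markdown headings, file naming, function names) is a hypothetical choice.
SECTIONS = [
    "Summary: what was the effect on the product and its KPIs?",
    "Detailed analysis of the cause",
    "Relevant analysis data (screenshots, discussions etc.)",
    "Follow-up tasks and conclusions",
]


def post_mortem_skeleton(title: str, incident_date: date) -> str:
    """Build an empty document with one heading per template section."""
    lines = [f"# Post-mortem: {title} ({incident_date.isoformat()})", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


def post_mortem_filename(title: str, incident_date: date) -> str:
    """Derive a sortable file name from the incident date and title."""
    slug = "-".join(title.lower().split())
    return f"post-mortems/{incident_date.isoformat()}-{slug}.md"
```

Starting every write-up from the same skeleton keeps the documents comparable and makes it harder to skip the follow-up section.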
- Better understanding of the effect we have on the company's KPIs.
- We can inform other teams and departments about problems we caused, before they notice it themselves.
- Concrete follow-up tasks that led to multiple improvements in process, performance, alerting and reporting.
We know our systems
"We are drowning in information, but starved for knowledge."
We have a lot of tools in our company that can give us information on how our system is running, how our machines are handling the load and how our KPIs are doing.
Unfortunately, we felt we were still missing a lot of information:
- Due to the complexity of our system, we didn't have a good sense of how well it was performing.
- We always had to pull information actively from our tools. When something went wrong, there were very few alerts in place.
- We had a view of the errors that were currently occurring in each app. Unfortunately, at peak times, there was a lot of non-critical error noise, clouding our view of the really critical errors.
- We were lacking a process on how to deal with errors.
As an experiment, we set up Sentry, an error reporting and processing tool that allows us to conveniently monitor errors and set up a process for dealing with new and/or recurring errors. Additionally, it has quite convenient email alerting.
This helps us with several of the points above, mainly by giving us a better organised view of the errors that occur and by alerting us when something is going wrong.
Our technology stack is built on a service-oriented architecture. To get a better sense of the latency of messages going through some of these systems, we wrote a library that allows us to track the latency of specific paths from trigger to finished processing. We send this data to Grafana dashboards, giving us detailed information by time of day.
Additionally, we defined alerts on these dashboards, so we get warned when the numbers are unusual.
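To illustrate the idea of tracking a path from trigger to finished processing, here is a minimal in-process sketch. It is not our library: the class `LatencyTracker`, its methods and the fixed-threshold alert rule are assumptions, and in a real service-oriented setup the start timestamp would have to travel with the message and the samples would be shipped to a metrics backend instead of being kept in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean


class LatencyTracker:
    """Collects latencies per named path; a stand-in for a real metrics client."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the tracker can be tested deterministically.
        self._clock = clock
        self._samples = defaultdict(list)

    @contextmanager
    def track(self, path: str):
        started = self._clock()
        try:
            yield
        finally:
            # Record the duration even if processing raised an exception.
            self._samples[path].append(self._clock() - started)

    def average(self, path: str) -> float:
        return mean(self._samples[path])

    def is_unusual(self, path: str, threshold_seconds: float) -> bool:
        """Very simple alert rule: the latest sample exceeds a fixed threshold."""
        samples = self._samples.get(path, [])
        return bool(samples) and samples[-1] > threshold_seconds
```

Usage is a `with tracker.track("some-path"):` block around the processing step. A monotonic clock is used so that system clock adjustments cannot produce negative durations.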
- Our reaction time to newly occurring errors or dips in performance has improved greatly. By now we feel that we have most of the needed numbers and alerts in place.
- The new dashboards pointed out performance bottlenecks we managed to solve, improving performance quite considerably.
As mentioned at the start, since we discussed our coding guidelines and made them explicit, we have improved in many ways.
I strongly believe that it's the discussions about them that made them stick. I don't believe that these guidelines could have been introduced as effectively by a top-down decision.
It has also helped a lot to automate the decisions we took wherever possible. This not only supports the developers on the team and keeps them from forgetting about a rule, but it also makes changes to the process subject to discussion.
I would highly recommend trying this out in your team(s) when you're facing one or more of the problems described in the beginning!
Thank you for reading :)