What follows is a written version of a work-in-progress practice that we’ve been following in my domain for a while, both for new engineering work and for architectural refactoring:
- Engage with relevant stakeholders as early as you can to understand the scope and context of the problem space.
- Try to come up with up to 3 draft solutions, each with pros and cons, so that we can make informed decisions rather than rushing ahead with the first idea.
- Outline potential risks/failure modes inherent in these options.
- Engage with the business stakeholders and users to assess the impact of these failures on the business and its users, and identify the top risks to mitigate. Don’t assume anything! These discussions take the form of “what-if/what-when” questions (as highlighted in the previous section). This pushes the stakeholders and the users to think more deeply about the problem and give us pragmatic and honest responses.
- Based on these conversations about risks, fine-tune the architectural options for the desired quality attributes. For example, for reliable messaging we might go for the TRANSACTIONAL OUTBOX pattern; if users have to be kept up to date with data changes, we might opt for the PUSH NOTIFICATION pattern using one of the server push technologies.
- Engineering teams might perceive additional risks as well, e.g. component complexity, duplication of behaviour, ownership, security, maintainability and testability. We address these risks too. It’s important that these mitigations don’t adversely affect the observable business behaviour of the system. For example, suppose we choose the SERVERLESS FUNCTION style, in an effort to reduce infrastructure maintenance overhead, to build a solution that either has to be limited in concurrency or could be a long-running operation. That choice could result in the operation failing mid-stream due to a timeout, or could produce incorrect results. Both outcomes affect the observable behaviour of the system from a stakeholder point of view, so this is not the right style to apply in that case. We strive for a balance between engineering and business stakeholder expectations because both are important.
- When we are uncertain or want a second opinion, we reach out to other teams in the organisation for feedback on our design. I’ve created an Architecture Working Group in my organisation for exactly this kind of cross-team collaboration, cross-pollination of ideas and sharing of learnings. We’ve had good feedback so far from the teams that have participated in these sessions, and their understanding is better for it.
- Once we have mitigation plans for the most important risks, we pick the solution that minimises most of them. If two solution options tie, we pick the one that has the lowest implementation complexity and/or financial cost.
- Document all the discussed risks and mitigations in ADRs (Architecture Decision Records) and start the engineering iteration.
- Repeat for each major product increment or design refactoring.
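To make one of the patterns named in the steps above more concrete, here is a minimal sketch of the TRANSACTIONAL OUTBOX idea. It is an illustration under assumptions, not a description of our actual implementation: SQLite stands in for the real database, and the `orders`/`outbox` tables, `place_order`, `relay_outbox` and the `publish` callable are all hypothetical names.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one business table plus the outbox table.
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn, order_id, item):
    # The business row and the outbox row are written in ONE transaction:
    # either both are committed or neither is.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "item": item})),
        )


def relay_outbox(conn, publish):
    # A separate relay step reads unpublished rows in insertion order and
    # hands them to the broker. `publish` is an assumed callable.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Note that a relay like this gives at-least-once delivery: if the process stops after `publish` succeeds but before the row is marked as published, the message is sent again on the next run. That is exactly the kind of failure mode worth raising in a what-if/what-when conversation.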
The goal is never to think up and address all possible edge cases upfront (that’s just not possible) but to think of and address the most pressing ones from both engineering and business points of view. One useful heuristic for uncovering failure modes is to look at the lines connecting the boxes in an architecture diagram, as opposed to focussing just on the boxes, and to ask yourself what-when/what-if questions like “what happens when this connection fails?”, “what happens when this message gets delivered multiple times and/or out of order?”, “what happens when 2 out of these 3 operations fail?”, “what happens when the box on the other side is not available?” or “what kinds of security risks are we exposed to by exposing an API to third parties?”
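As an illustration of where one of these questions can lead, the sketch below shows one possible answer to “what happens when this message gets delivered multiple times and/or out of order?”: a consumer that ignores duplicates and stale updates. The `Message` fields (`message_id`, `entity_id`, `version`) and the `Consumer` class are assumptions made for the example, and the state is kept in memory purely for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str   # unique per message, used for de-duplication
    entity_id: str    # which business entity the message is about
    version: int      # increases with every change to that entity
    payload: dict


@dataclass
class Consumer:
    seen_ids: set = field(default_factory=set)
    latest: dict = field(default_factory=dict)  # entity_id -> (version, payload)

    def handle(self, msg: Message) -> str:
        # Duplicate delivery: this exact message was already processed.
        if msg.message_id in self.seen_ids:
            return "duplicate"
        self.seen_ids.add(msg.message_id)

        # Out-of-order delivery: a newer version has already been applied.
        current = self.latest.get(msg.entity_id)
        if current is not None and msg.version <= current[0]:
            return "stale"

        self.latest[msg.entity_id] = (msg.version, msg.payload)
        return "applied"


consumer = Consumer()
older = Message("a", "order-1", 1, {"status": "placed"})
newer = Message("b", "order-1", 2, {"status": "shipped"})
results = [consumer.handle(newer), consumer.handle(older), consumer.handle(newer)]
```

Dropping stale versions only suits messages that carry the full current state of an entity. If messages carry deltas, they would have to be buffered and reordered instead, which is a trade-off to discuss with stakeholders rather than assume.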
This practice has helped us build more pragmatic designs, in which we have been able to reduce accidental complexity whilst designing for the essential complexity and critical risks. As a result, we’ve also improved our communication with our stakeholders considerably, which has led to a lot of good business and technical learning for both business stakeholders and engineers. With all this learning and a mindset of continuous improvement, we also hope to keep improving the technical and strategic quality of the products we build. In the end, this is what agility is all about!
- Understand the business context, partner with stakeholders early and often, and make architectural decisions proportionate to the most critical risks involved.
- Engage in modelling exercises like Event Storming often to build up an understanding of the business process and map it to the software process.
- Assume nothing! Push back against overly confident predictions and assumptions, no matter who they come from.
- A Product Manager’s job is the what and the why, an engineer’s job is the how, and both have to collaborate on the when.
- Build for known knowns, plan for known unknowns, and adapt when unknown unknowns hit.
- Document your decisions, trade-offs and risks diligently and regularly review them to find improvement opportunities.
- Rinse and repeat. You are in this for the long haul!