The tech industry has a misguided tenet of 'move fast, break things', along with a habit of always reaching for the 'hottest' stack or language you see on Hacker News. Unfortunately, that rarely translates into customers being happy with your product or service.
Over the years I've come across some key concepts while engineering systems for customer happiness and reliability:
- Be able to identify who your customers are.
- Customers are part of your system.
- Understand what 'customer impact' means in an outage.
Many of these points are illustrated as questions to ask yourself and your team.
What do we build?
- Do we build a website with public consumers as customers?
- Do we build a platform that businesses build on?
- Do we build internal tools for our colleagues?
You need to put yourself in the shoes of your customers (and perhaps their customers too!). Once you do, you can start to shape your engineering tenets around how customers use what you build.
- What are the tenets of your service or product?
If you build a platform that businesses use, you'll likely discover that reliability and robustness matter more than 'moving fast and breaking things'. This doesn't just apply to large enterprises; if you're a startup, you probably don't have a user base large enough that you can afford to lose many customers.
- Can our customers tolerate outages, errors, delays? Do we need to build for robustness, speed, uptime, or all of the above?
These aren't points to define once; they should evolve with your product, much like features and designs do. As your product evolves, customers will use it differently, and you should adapt as necessary.
In most block diagrams of a system I typically see some cloud labeled 'internet', but I rarely see 'customers'.
You know very well how your application reacts to latency, but how do your customers react to latency, errors, and retries?
I've seen many outages caused by failing to account for how customers interact with a service. While you might have great retry behavior and timeouts throughout your system, it's easy to overlook how your customers deal with an unreliable service (see: thundering herd).
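One common way to soften a thundering herd is for clients to retry with capped exponential backoff plus random jitter, so that retries spread out instead of arriving in synchronized waves. The sketch below is only an illustration of that idea, not a prescription: `backoff_delays` and `call_with_retries` are hypothetical helper names, and the `base`, `cap`, and `max_attempts` values are arbitrary assumptions you would tune for your own service.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0, rng=None):
    """Return one sleep duration (in seconds) per retry.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    the "full jitter" variant of exponential backoff.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


def call_with_retries(operation, max_attempts=5, sleep=None, rng=None):
    """Call operation(); on failure, wait a jittered delay and try again.

    Re-raises the last exception once max_attempts calls have failed.
    `sleep` is injectable so the behavior can be tested without waiting.
    """
    if sleep is None:
        sleep = time.sleep
    delays = backoff_delays(max_attempts - 1, rng=rng)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(delays[attempt])
```

If you ship a client library or SDK, building this kind of behavior in (and documenting it) means your customers don't each have to rediscover it during your outage.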
Make sure you have customer-appropriate throttles, monitoring, error pages, and documentation.
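As one possible shape for a customer-appropriate throttle, here is a minimal per-customer token bucket. It is a sketch under assumptions: the class names `TokenBucket` and `CustomerThrottle`, the `rate` and `capacity` parameters, and the idea of keying buckets by a `customer_id` string are all illustrative choices, and a real deployment would likely need shared state across servers.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilling at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so new customers can burst
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CustomerThrottle:
    """Keeps one independent bucket per customer (hypothetical customer_id key)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, customer_id):
        bucket = self._buckets.get(customer_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self._buckets[customer_id] = bucket
        return bucket.allow()
```

Keeping a separate bucket per customer means one customer's retry storm exhausts only their own budget rather than degrading the service for everyone else.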
In many post-mortems you'll hear statements like 'our webservers served 65,000 HTTP 500 errors from 10:04 to 10:30', which fails to tell the story of the customers who received those errors.
- Did that result in them retrying a purchase and then giving up?
- Did we lose customers?
- Did that break our customers' applications and businesses?
If you can't answer those questions then there are gaps in your system's monitoring.
It's fundamental to have system-level metrics for all of your components, but going beyond that, you need to be able to measure customer impact.
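To make the difference between an error count and customer impact concrete, here is a small sketch that turns request records into a per-customer summary. Everything about the input is an assumption: it presumes each record is a `(timestamp, customer_id, status)` tuple, and it uses a rough heuristic that a customer "gave up" if their last request in the window was still a 5xx. The function name `summarize_customer_impact` is hypothetical.

```python
from collections import defaultdict


def summarize_customer_impact(requests):
    """Summarize which customers saw server errors in a window of requests.

    `requests` is an iterable of (timestamp, customer_id, status) tuples
    in any order; this record shape is assumed for illustration.
    """
    by_customer = defaultdict(list)
    for timestamp, customer_id, status in requests:
        by_customer[customer_id].append((timestamp, status))

    total_errors = 0
    affected = set()
    gave_up = set()
    for customer_id, events in by_customer.items():
        events.sort()  # order each customer's requests by time
        errors = sum(1 for _, status in events if status >= 500)
        if errors == 0:
            continue
        total_errors += errors
        affected.add(customer_id)
        # Heuristic: the last thing this customer saw was an error,
        # so they never got a successful response in this window.
        if events[-1][1] >= 500:
            gave_up.add(customer_id)

    return {
        "total_errors": total_errors,
        "customers_affected": sorted(affected),
        "customers_gave_up": sorted(gave_up),
    }
```

A summary like this lets a post-mortem say how many customers were affected and how many never got through, rather than only how many HTTP 500s were served.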
If you build a platform that your customers build their business on, consider being proactive in helping them. It's meaningful to reach out to customers while they're experiencing an outage, even if it's not your fault. You know your system best, and there may be advice you can share so they can make better use of it during their outage.
If you notice that a customer of yours had a bad outage, are there things you can build that they could use to prevent future outages?
I've used the word customer 30 times in this post because at the end of the day most businesses require paying customers to exist. Understanding and engineering for your customers is incredibly important.
"The single most important thing is to make people happy. If you are making people happy, as a side effect, they will be happy to open up their wallets and pay you." - Derek Sivers