Lessons I learned from achieving a 99.99% platform uptime

Voice123 is the first (and arguably foremost) open marketplace for voice actors.

Today, Voice123 has more than 250,000 registered voice actors, 60,000 active users, and 1.5 million leads generated per month. The platform has handled more than 350 million requests in a single month and manages 4 TB of data. Besides operating the site, the engineering team also releases new features and improvements daily.

By the end of Q3 2020, our platform achieved an elusive 99.995% uptime. Over the last two years, it went from 21.98 minutes of downtime per month (99.95%) to 4.38 minutes (99.99%).

It seemed like an achievement worth celebrating — until we realized that the magical 99.995% figure was actually a kind of vanity metric, a benchmark set with insufficient justification. The costs of the implementation were around 30 times the impact of having downtime!

In truth, we’d spent more than 3,000 engineering hours to reduce downtime by roughly 18 minutes per month. We estimate the cost of one minute of downtime at about $10, so those 18 minutes translate to roughly $180 per month, against estimated engineering costs of about $5,000 per month.

Of course, this is an over-simplification of the economic impact of having the site down.

However, this story has a bright side: We realized that the most valuable part of all of this was not the elusive percentage, but the process of achieving it!

Why?

Because the process made our platform much more robust in many places, notably: platform and infrastructure knowledge, mechanical sympathy, development practices, DevOps culture, and project management.

'Mechanical sympathy?'

Let me explain!

Mechanical Sympathy

"You don't have to be an engineer to be a racing driver, but you do have to have mechanical sympathy." Jackie Stewart, legendary British Formula One ace.

Simply put, mechanical sympathy is when you use a tool or system with an understanding of how it operates at its best.

The platform is composed of several technologies and components with millions of daily interactions between them. The databases (SQL and NoSQL), proprietary and third-party services, and utilities must interact in a complex, coordinated, traceable, and predictable way. The system’s operation has to fulfill all those requirements, and our first realization was that the engineering team easily got lost in the heaps of information and tools available.

Consequently, here are the three main areas (and the tools) that enabled us to understand the platform and start being sympathetic.

Logging

Collecting data is the starting point for any effort that pertains to system knowledge. Without accurate and meaningful data, any optimization effort will become useless.

The platform collects all the operational data through a logging mechanism. Access and error logs and performance metrics should be stored in a categorized, centralized, and structured way.
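
For illustration only, here’s a minimal sketch of what structured, centralized-friendly logging can look like using Python’s standard library. The component and field names are hypothetical, not our exact setup:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON line, ready for a log aggregator."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "component": record.name,          # e.g. "search-api" (hypothetical component name)
            "message": record.getMessage(),
        }
        # Attach structured context if the caller passed it via `extra=`.
        if hasattr(record, "context"):
            payload["context"] = record.context
        return json.dumps(payload)

logger = logging.getLogger("search-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("request served", extra={"context": {"endpoint": "/projects", "status": 200, "ms": 42}})
```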

Error reporting

Logs can be useful for tracking issues and detecting anomalies, but they are not enough to solve or manage them. Error-monitoring tools like Sentry enabled us to manage and track errors in real time. Every time an error occurs — independently of the application component — the error is reported and the engineering team is notified via a Slack integration.

Error reporting is not only about collecting data. It’s also about how the reported issues are tackled and how the engineering team is notified. The team struggled with the overwhelming amount of notifications it was receiving. The system only became useful and practical for bug squashing after we set up rules for incident reporting based on frequency, how mission-critical the affected component is, and priority.
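
As a rough sketch of the kind of filtering that tamed the noise, here’s how low-priority errors could be dropped before they ever reach Sentry (and, via its integration, Slack). This assumes Sentry’s Python SDK; the ignore rules and DSN are placeholders, not our production configuration:

```python
import sentry_sdk

# Example of error types we might treat as noise (illustrative only).
IGNORED = ("BrokenPipeError", "ConnectionResetError")

def before_send(event, hint):
    """Drop events considered noise before they are sent to Sentry."""
    exc_info = hint.get("exc_info")
    if exc_info and exc_info[0].__name__ in IGNORED:
        return None  # returning None discards the event
    return event

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    environment="production",
    before_send=before_send,
)
```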

Monitoring

Data must serve a purpose, and that’s where monitoring comes in.

The first step in the monitoring process is to extract indicators from the available data. For a better understanding, we categorized the indicators into four general categories: Resource consumption, Performance, Business transactions, and Operational indicators.

Resource consumption (Computational resources usage):

  • Free memory per component
  • CPU usage per component
  • Active connections per component
  • Emails sent
  • Bounce rate
  • Disk usage
  • Network usage and traffic
  • Etc.

Performance (How fast and reliable the platform is. Anything that contributes to a better user experience from a technical standpoint):

  • Availability: 2xx_requests / (2xx_requests + 5xx_requests) (a short worked sketch follows this list)
  • Uptime (Daily, Weekly, etc.)
  • Median response time for critical endpoints
  • Average page load speed
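
To make the availability formula above concrete, here’s a tiny sketch; the request counts are made up:

```python
def availability(ok_2xx: int, errors_5xx: int) -> float:
    """Availability = successful requests / (successful + server-error requests)."""
    return ok_2xx / (ok_2xx + errors_5xx)

MINUTES_PER_MONTH = 30.44 * 24 * 60  # roughly 43,830 minutes in an average month

a = availability(ok_2xx=9_999_000, errors_5xx=1_000)   # -> 0.9999, i.e. 99.99%
downtime_budget = (1 - a) * MINUTES_PER_MONTH           # -> about 4.38 minutes per month
print(f"{a:.4%} availability = {downtime_budget:.2f} min/month of downtime budget")
```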

Business transactions (How the system serves the business goals):

  • The number of meaningful interactions per period (hour, day, week, etc.)
    • Projects posted
    • Orders processed
    • Number of searches per period
    • Payments received, etc.

Operational indicators (Service quality provided by the technical operations team):

  • Median resolution time for bugs
  • Bugs backlog growth rate

After setting up all the meaningful indicators, the engineering team must set a benchmark for each of them. It’s advisable to implement a notification system that triggers an alert when a measurement goes above or below a defined threshold. The big challenge here is preventing false alarms that make the communication channels noisy and thus reduce their effectiveness.
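
One simple way to cut false alarms is to require a threshold breach to persist across several consecutive checks before alerting. Here’s a minimal sketch; the metric, threshold, and webhook URL are placeholders, not our actual configuration:

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/PLACEHOLDER"  # placeholder URL
THRESHOLD_MS = 800        # example: alert if median response time exceeds 800 ms...
REQUIRED_BREACHES = 3     # ...for 3 consecutive checks, to avoid one-off spikes

breaches = 0

def check(median_response_ms: float) -> None:
    """Call this from a scheduled job with the latest measurement."""
    global breaches
    if median_response_ms > THRESHOLD_MS:
        breaches += 1
    else:
        breaches = 0
    if breaches == REQUIRED_BREACHES:  # alert once, when the streak is reached
        requests.post(SLACK_WEBHOOK, json={
            "text": f"Median response time {median_response_ms:.0f} ms "
                    f"above {THRESHOLD_MS} ms for {REQUIRED_BREACHES} consecutive checks",
        })
```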

Development practices

TDD and Integration tests

Test-Driven Development (TDD) is a development methodology that software developers acknowledge as fundamental to delivering good-quality software. TDD is hard to implement in practice because tight deadlines eclipse it. Many engineers also see it as overhead that reduces release velocity as well as personal productivity.

TDD’s critics could be right in the short term, but our experience has demonstrated the opposite in the mid and long term. We have gone through several refactors and upgrades of package dependencies without significant (or even any) downtime. Unit tests gave us the confidence to implement those changes with reasonable certainty that the platform would remain stable, reducing both the maintenance burden and the elevated cost of unexpected critical issues.

Unit testing is only the base of TDD. The challenge is to create a development culture around the benefits of TDD. Here’s the roadmap the Voice123 team followed:

  1. Open the TDD discussion even when the platform is already well advanced. A tad late, but good enough to create a remediation plan.
  2. Identify critical components/functionalities of the system, prioritize, and plan the creation of unit tests for them.
  3. Implement the agreed tests, focusing on meaningful use cases and extensions rather than on lines of code covered (a small example follows this list).
  4. Integrate unit test execution as a continuous integration step.
  5. Run several iterations until there's comprehensive code coverage — 80% could be an acceptable benchmark. (Avoid the code-coverage mindset, because it can lead to bogus tests written only to hit the number.) At some point, the team will pursue that number, but the discussion will be about how well-tested the system is, instead of reaching the threshold by ‘cheating’ or manipulating tests.
  6. At this point, the developers are familiar with unit testing and start adopting TDD as the safest and most efficient way to implement changes.
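
For what it’s worth, the kind of unit test that pays off is one that encodes a meaningful business rule rather than one written to chase a coverage number. A hypothetical example with pytest (the project_fee helper is invented for illustration):

```python
import pytest

def project_fee(budget: float, commission_rate: float = 0.20) -> float:
    """Hypothetical helper: the marketplace fee charged on a posted project."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return round(budget * commission_rate, 2)

def test_fee_for_typical_project():
    assert project_fee(500.00) == 100.00

def test_fee_rejects_non_positive_budget():
    with pytest.raises(ValueError):
        project_fee(0)
```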

Support mindset

There’s a well-documented rivalry between support and development teams. Support has to react promptly to issues and manage the stress of dealing with the emotional outbursts of angry or concerned users. When the development team is distanced from the platform’s user-facing front line, it tends to underestimate both the impact of the issues and the pressure the support team is under. Conversely, the support crowd tends to think of developers as divas who don’t want to be disturbed by those annoying bugs!

Our experience of regularly assigning developers to support tasks has been beneficial for both the platform and the development process. Engineers are more connected with user needs and better understand the impact of what they’ve implemented. At the same time, they gain a deeper understanding of the system by exploring and learning about areas of the platform they never touch — and sometimes don’t even know existed.

DevOps culture

Continuous Integration (CI) and Continuous Delivery (CD)

CI or Continuous Integration is an engineering practice in which team members integrate their code at a very high frequency. Teams implementing CI aim to integrate code daily or, in some cases, hourly.

CD or Continuous Delivery is the practice of ensuring that code is always in a deployable state. This means that all code changes, such as new features, bug fixes, experiments, and configuration changes, are always ready for deployment to a production environment (https://www.browserstack.com/guide/ci-cd-vs-agile-vs-devops).

An exemplary CI/CD implementation will:

  • Improve the team’s agility and speed of response
  • Reduce the number of regression errors by executing as many automated tests as possible
  • Allow quick rollback and recovery from many disaster situations introduced by new releases.

Release only once per day

This might seem counterintuitive when set against the continuous integration principle. But in practice, most features (and even bug fixes) can wait until the following day to be published. At Voice123, we have implemented a simple mechanism that allows a daily release: all the changes are merged into a RELEASE branch that is rebased and published to the production environment every weekday morning.
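
As a rough sketch, that morning routine could look like the following if scripted; the branch names and the deploy step are assumptions, not our exact pipeline:

```python
import subprocess

def run(*cmd: str) -> None:
    """Run a git command and fail loudly if it doesn't succeed."""
    subprocess.run(cmd, check=True)

def publish_daily_release() -> None:
    run("git", "fetch", "origin")
    run("git", "checkout", "RELEASE")
    run("git", "rebase", "origin/main")                      # assumed integration branch name
    run("git", "push", "origin", "RELEASE", "--force-with-lease")
    # ...the CI/CD pipeline would then deploy the RELEASE branch to production

if __name__ == "__main__":
    publish_daily_release()
```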

Panic button (Expedited release)

Yup: shit happens. Some releases crash the platform or create a huge mess.

Having an expedited release process that skips some validation steps helps minimize the damage from catastrophic deployments. Just make sure the panic button is only used in emergencies and doesn’t become the standard way of fixing quality issues.

Project management

The challenge is to align the engineering team’s voracious appetite for refactoring and exploring new technologies with a clear and measurable goal: create a culture of thinking about impact instead of just delivering.

Here are some questions that help the engineering team make decisions about refactoring or migrating to new technologies:

  1. Is the issue affecting the user experience?
  2. What are the pros of our current technology? What are the cons of the new technology? Ask those specific questions in that order and contrast the answers.
  3. Is it affecting the team’s execution speed? By how much? Can you give specific and practical examples?
  4. What happens if we don't do the refactor? How long can we operate the platform without doing the refactor?

Do periodic upgrades of the packages and dependencies the platform requires. Many modern tools and frameworks include package audit tools. Identify the dependencies that are critical or high priority and schedule regular maintenance routines — at least once per quarter.

What's next?

Keep following the good practices that allow the platform to be robust and reliable — and never pursue the five nines (99.999%) uptime! Reliability: checked! Time to work on performance.

"When a measure becomes a target, it ceases to be a good measure." Goodhart's law


A big thank you to the engineering team at Voice123 (Carel Frans, Luis Perichón, Josephine Tse, and Carlos Beltrán) for your contributions and help in writing this post.

Top comments (3)

Akinwande Kolawole

Thank you very much for this!!! I would like to know what tool you use to measure uptime.

germangonzo

We use the site24x7.com/ service. Pingdom could be another affordable alternative.

muzammilaalpha

Good one!