Robertino

Improving Our Platform Resiliency and Upcoming Holiday Plans

We are committed to delivering a Tier 0 service for our customers.


Over the last few months, there has been an increase in service degradation and outages that has shaken our customers' confidence in Auth0. Our number one mission is to provide you with the highest level of service and reliability at all times, for the full spectrum of customers from free users to enterprise. The production environments in US-1 and EU have been impacted most frequently, and in extreme cases these incidents have caused downtime for some of our customers. As the Auth0 CPO, I sincerely apologize for the pain this caused you and your customers. We take ownership of the failures and recognize how disappointed you are.

We know that identity plays a critical role in your company, and it is our responsibility to make our service reliable. It is also our responsibility to provide guidance to our customers on appropriate architecture patterns. When we experience failures or degradation of our service, we correct them as swiftly as possible. Today, I want to share the active measures we are taking to prevent similar issues from occurring in the future.

First, What Caused the Outages?

Before we explore all of the actions we have taken, it's good to understand the patterns we have observed in these environments:

Noisy neighbors in our large multi-tenant environments create throughput bottlenecks

The noisy neighbor phenomenon is a well-known issue in multi-tenant environments. In our case, one co-tenant may experience a spike in traffic that leads it to monopolize the available resources and thus reduce the throughput (requests per second) of the environment for the other co-tenants.

Most of our customers in the US-1 and EU environments have enjoyed many years of uninterrupted service. These environments have grown significantly and are now the longest-running and largest, which exacerbates the noisy neighbor impact.

We have not strictly enforced restrictions on the frequency of customer API calls (rate limits) and at times have allowed a 10X increase in requests per second (RPS) for certain tenants in a single environment. Typically, our architecture would allow us to absorb these spikes, as is the case in most of our environments, but given the volume and size of the customers in US-1 and EU, these concurrent requests restrict throughput for other tenants. We do not fault our customers for their peak traffic needs; it is on us to put in place the right protections and guardrails around each tenant. Unfortunately, we are not able to actively load shed today (i.e., automatically move tenants to other environments such as US-3), because we don't currently have the migration tooling, nor can we do it without any downtime. There are some customer scenarios where a planned migration by a customer is possible, but it requires time and effort on your part. I will cover migration scenarios and timing in the 'Looking beyond the next 60 days' section below.
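To make the idea of per-tenant guardrails concrete, here is a minimal sketch of a token-bucket rate limiter keyed by tenant. It is an illustration only, not a description of how Auth0 enforces rate limits: the class names, the `rate` and `capacity` values, and the tenant identifiers are all hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical bucket: refills at `rate` tokens/second, holds at most `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start full so a new tenant can burst
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantRateLimiter:
    """Hypothetical limiter: one independent bucket per tenant, so a spike
    from one tenant cannot consume another tenant's allowance."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, tenant_id, now=None):
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[tenant_id] = bucket
        return bucket.allow(now)


# Assumed limits for illustration: 10 requests/second with bursts of up to 20.
limiter = TenantRateLimiter(rate=10, capacity=20)

# A noisy tenant sends 100 requests in the same instant; a quiet tenant sends one.
noisy_allowed = sum(limiter.allow("noisy-tenant", now=0.0) for _ in range(100))
quiet_allowed = limiter.allow("quiet-tenant", now=0.0)
```

Because each tenant draws from its own bucket, the noisy tenant's burst is capped at its own capacity while the quiet tenant's request still goes through. A production limiter would also need shared state across servers and eviction of idle buckets, which this sketch leaves out.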

Region Mean Time To Recovery (MTTR) outside of our 15-min target

The other item we are solving is meeting our 15-minute target for region failover. We currently miss this target because of dependencies on an underlying managed service that doesn't support automated failover within our target timeframe. Our teams are in the final stages of completing the switch to a managed service that allows us to fail over automatically within 15 minutes, and we expect to finish in Q1 2022.

We can successfully fail over Availability Zones (AZ) today within our 1-minute target window, as validated by quarterly testing.

What Actions Have We Taken Already?

Slowing down and invoking our change freeze earlier

Effective Sept. 30, 2021, we've significantly reduced the number and frequency of changes that we're introducing into production for the remainder of the year. This allows for increased testing and soak time in pre-production environments. Heading into the holiday season, our protocols may change slightly; however, we anticipate continuing a conservative posture to maintain the stability of our production environments. We are mostly allowing changes that are related to our resiliency initiatives or security patches. Being an authentication platform, security-related changes are non-negotiable, as is our mission to protect you and your customers.

New staging environment

As of early October, we have a staging environment that has the same load as US-1, so that we can simulate test cases reproducing production traffic at large scale. The volume and high load-related issues described above can now be more easily detected.
