Or Hillel for IO River

Posted on Nov 14, 2023 • Originally published at ioriver.io on Jun 14, 2023

Understanding the Importance of 5 Nines Availability

What is 5 Nines Availability?

In an age where nearly all the services that businesses provide to their customers run on computing technology, it is crucial that companies understand the importance of providing reliable access to their systems.

In determining a business's value to its clients, the level of service it provides is often a key metric. Service quality can be assessed based on various factors like ease of use, accessibility, security, and reliability, all of which contribute to consumer satisfaction. Of these, availability is often the factor consumers prioritize. There are several recognized ways to measure the availability of a service or system, and the most common is to express it as a percentage.

"Five Nines" (99.999%) - The gold standard

Typically, availability is measured from "one nine" (90%) up to "nine nines" (99.9999999%). Within this range, five nines availability is often considered the gold standard for critical systems. This level equates to only about 5.26 minutes of downtime per year, which means the system is almost always operational. Here are a few reasons why 99.999% is considered the gold standard:

Reduction of Downtime: By achieving five nines, organizations significantly reduce their risk of prolonged service outages that can have a major impact on operations.
Customer Expectation: In today's digital age, customers expect services to be available whenever they need them. This is especially true for online businesses, where customers may quickly switch to a competitor if they experience service unavailability.
Business Continuity: Many businesses depend on their IT services for day-to-day operations. High availability helps ensure that these operations can continue with minimal interruption.
Competitive Advantage: Companies that can achieve and maintain a high level of availability may have a competitive advantage over companies that have more frequent and longer-lasting service outages. These days, users are very sensitive to lagging user experiences and don't hesitate to find alternatives when they are unsatisfied.
Revenue Protection: System availability directly impacts revenue by influencing customer satisfaction and retention. Consistent accessibility leads to increased engagement and revenue, while downtime or unavailability results in frustrated customers and lost sales.

This article delves into the concept of five nines availability, sheds light on its significance, and explains what businesses can do to achieve this level of availability.

The Significance of Availability

We often consider availability one of the most important characteristics when designing a service or system, and neglecting availability can have catastrophic consequences for an organization's operations, reputation, and overall performance.

Some of the potential consequences of not considering availability are:

Customer Satisfaction

Availability directly influences customer satisfaction. In a highly competitive landscape, customers expect seamless service access and minimal downtime. When organizations neglect availability, customers experience frustration and inconvenience and may seek alternative solutions. This leads to a loss of trust and loyalty, resulting in decreased customer retention and potential revenue loss.

Airlines rely significantly on digital technology and online platforms to provide customers with seamless booking experiences, flight information, and other critical services. Delta Air Lines experienced a severe system outage in 2017, resulting in flight cancellations and delays across their network. This interruption caused customer discontent, inconvenience, and a major loss of trust in the airline's capacity to provide dependable services.

Revenue Generation

Downtime wreaks havoc on a business, affecting revenue, transactions, and customer engagement. Whether it's an e-commerce platform, a banking system, or a software-as-a-service (SaaS) provider, downtime can result in lost sales, dissatisfied consumers, and financial losses.

As one of the world's largest online retailers, Amazon relies heavily on its website and digital infrastructure to facilitate sales and generate revenue. In 2013, Amazon experienced a brief outage that lasted approximately 30 minutes. Despite the short duration, the downtime resulted in an estimated loss of $66,240 per minute, or roughly $2 million in potential revenue over the full outage.

Brand Reputation and Credibility

Long downtime or frequent service interruptions can harm a brand's trust and reputation. Negative experiences can swiftly spread through social media and other platforms, worsening the impact on the brand's image. Regaining trust and a favorable reputation is a difficult and time-consuming process.

In July 2020, Twitter experienced a significant outage that lasted for several hours, rendering the platform inaccessible to millions of users worldwide. This outage not only disrupted users' ability to access and engage with the platform but also raised concerns about the platform's reliability and stability.

As Twitter serves as a vital communication channel for individuals, businesses, and even public figures, the outage attracted widespread attention and negative feedback on social media and in the news. The incident directly impacted Twitter's brand reputation, with users expressing frustration and disappointment over the lack of availability.

Competitive Advantage

In a crowded market, availability can be a significant differentiator. Prioritizing and achieving high availability gives organizations a competitive advantage. Customers who value uninterrupted access to services prefer reliability, which becomes a selling factor. Businesses can outperform competitors and establish themselves as industry leaders by providing a superior client experience.

Netflix, a leading player in the streaming industry, has built its reputation on providing uninterrupted access to a vast library of movies and TV shows. By investing in robust infrastructure and implementing a multi-CDN strategy, Netflix ensures the high availability of its streaming service across various devices and regions. In 2018, when a major competitor, Hulu, experienced a significant outage during a highly anticipated live event broadcast, Netflix capitalized on the situation.

Leveraging its reputation for reliability and availability, Netflix cleverly promoted its service with messages like "Still streaming, not buffering" and offered free trial subscriptions during the outage period. This strategic response showcased Netflix's ability to maintain uninterrupted service and positioned the company as a more dependable choice for streaming entertainment.

Contract Violations and Penalties

When companies enter into agreements with vendors, they commonly establish specific uptime requirements or SLAs. These legally binding agreements hold the vendors accountable for meeting the agreed-upon obligations. To ensure reliability, these contracts often include provisions for financial penalties imposed on vendors in the event of contract violations. This can take various forms, such as monetary fines, service credits, or compensatory measures, aiming to offset the revenue losses incurred by the company due to the vendor's failure to provide a dependable service.
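To make an uptime requirement concrete, it helps to translate the SLA percentage into a downtime budget for the billing period. The sketch below is only an illustration: the function names, the 30-day period, and the 99.9% target are assumptions chosen for the example, not terms from any real contract.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget, in minutes, that an SLA permits over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100)


def sla_breached(sla_percent: float, observed_downtime_minutes: float,
                 period_days: int = 30) -> bool:
    """True when observed downtime exceeds the budget the SLA allows."""
    return observed_downtime_minutes > allowed_downtime_minutes(sla_percent, period_days)


# Hypothetical 99.9% SLA measured over a 30-day period.
budget = allowed_downtime_minutes(99.9)
print(f"Downtime budget: {budget:.1f} minutes")  # Downtime budget: 43.2 minutes
print(sla_breached(99.9, 60))                    # True: 60 minutes exceeds the budget
```

How a breach then turns into fines or service credits depends entirely on the wording of the individual agreement.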

One example is the British Airways (BA) IT system failure in 2017. The failure resulted in the cancellation and delay of numerous flights, causing significant disruptions for thousands of passengers.

As a result of this incident, BA faced legal action and potential penalties from affected passengers and regulatory authorities. The company had to compensate customers for their losses, including reimbursement for flights, accommodation, and $135,000 over tarmac delays.

What Is The Process For Calculating Availability?

By knowing how to calculate the availability of a service or system, we can understand its reliability. The calculation expresses how much of the time the system was actually usable, which allows organizations to identify and address issues that keep it from its optimal working state.

The following formula can be used to calculate the availability:

Availability = ((Total Available Time - Total Downtime) / Total Available Time) * 100

As an example, assume that a system experienced a total downtime of 20 hours over one year. Using the formula, we can calculate the availability of this system.

Number of hours in a year: 8,760 hours

Total downtime of the system: 20 hours

Availability = ((8,760 - 20) / 8,760) * 100

Availability = 99.77%

Therefore the availability of this system is 99.77%.
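The same calculation can be expressed as a small Python function. This is a minimal sketch of the formula above; the function name and the input checks are illustrative additions.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, as in the example above


def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage of the total available time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    return (total_hours - downtime_hours) / total_hours * 100


result = availability_percent(HOURS_PER_YEAR, 20)
print(f"Availability = {result:.2f}%")  # Availability = 99.77%
```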

Contrasting 5 Nines Availability Against Other Levels of Availability

As discussed in this article, various levels of availability correspond to specific durations during which a service or system is expected to experience downtime. It is essential to identify what each of these levels entails before building any service or system since the level of availability required directly translates into the effort and measures put in place to ensure its availability. It is also crucial to understand that not all systems require the highest level of availability, and organizations must consider an appropriate level of availability during evaluation.

The table below shows the downtime expected for each availability level ranging from 90% to 99.999%.

[Table: How to calculate availability]

The lowest level of availability mentioned within this table is "one nine" or 90% availability. However, the dynamic and competitive nature of businesses today makes running a system with an approximate downtime of over a month per year unacceptable.

Therefore, the lowest level of availability that is usually acceptable is "two nines" or 99% availability. This level means that the service or system encounters approximately 3.65 days of downtime annually. Even though this is significant downtime for a critical service or system, some businesses may be able to afford it for one of their non-critical systems.

Moving up the availability levels, we encounter the "five nines" availability level. At this level, the service or system undergoes only about 5 minutes and 15 seconds of downtime annually. It significantly lowers the downtime for the service or system and ensures it is operational throughout the year. This level of availability is crucial when running systems or services that support sensitive operations such as payment processing or managing critical infrastructure.
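The downtime figures for each level follow directly from the availability percentage. The short sketch below reproduces them, assuming a 365-day year; the helper names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Expected downtime per year, in minutes, for an availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def format_duration(minutes: float) -> str:
    """Render a duration in the largest convenient unit."""
    if minutes >= 24 * 60:
        return f"{minutes / (24 * 60):.2f} days"
    if minutes >= 60:
        return f"{minutes / 60:.2f} hours"
    return f"{minutes:.2f} minutes"


LEVELS = [
    ("one nine", 90.0),
    ("two nines", 99.0),
    ("three nines", 99.9),
    ("four nines", 99.99),
    ("five nines", 99.999),
]

for name, percent in LEVELS:
    downtime = format_duration(annual_downtime_minutes(percent))
    print(f"{name:<12} {percent:>7}%  {downtime}")
```

Run as written, this lists 36.50 days for 90%, 3.65 days for 99%, 8.76 hours for 99.9%, 52.56 minutes for 99.99%, and 5.26 minutes for 99.999%.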

Industries That Demand High Levels of Availability

Industries that rely heavily on continuous operation and minimal downtime demand high availability to ensure their crucial systems and services run smoothly. Let us look at some notable examples of industries that place a high priority on availability:

E-commerce and Retail

The increasing digitization of commerce means consumers expect seamless, around-the-clock shopping experiences. High availability is a business imperative in this sector. Every second of downtime not only equates to lost sales but it can also harm a company's reputation and customer trust. The stakes are even higher during high-traffic periods such as Black Friday or Cyber Monday. Outages during these periods can turn potential peak revenue periods into public relations nightmares. In addition, digital inventory management and point-of-sale systems rely on high availability to ensure accurate stock numbers and smooth transactions, preventing stock-outs or overselling, which can lead to customer dissatisfaction and logistical challenges.

Gaming

With millions of players worldwide often playing simultaneously, online gaming companies cannot afford significant downtime without risking player satisfaction and potential revenue. Many games, like Fortnite or World of Warcraft, have worldwide fanbases that expect the ability to play at any time. Even short periods of unavailability can lead to significant backlash from the player base, negatively impacting brand reputation. The importance of high availability is further underscored in the rapidly growing esports sector, where significant prize money is often at stake, and any downtime can have substantial ramifications. In massively multiplayer online games (MMOs), where players can trade virtual goods, downtime can even have real-world financial implications for players. As a result, gaming companies invest heavily in infrastructure to ensure five nines availability, using technologies such as distributed systems and failover mechanisms.

SaaS (Software as a Service)

SaaS is not limited to productivity tools and CRM systems. Another category that forms a critical part of many businesses' operations is monitoring tools. Companies like Datadog and New Relic provide real-time monitoring and analytics for IT infrastructure and application performance, helping companies quickly identify and rectify issues before they can cause significant harm. Given these tools' crucial role in maintaining system health and preventing outages, their availability becomes paramount. If these monitoring tools face downtime, businesses could be left in the dark about the status and performance of their systems, preventing them from detecting and addressing issues promptly. This blind spot can lead to longer and more harmful system outages, which is why it is vital for such SaaS providers to strive for five nines availability. Furthermore, since many businesses today operate globally and around the clock, these monitoring services must be available 24/7 to support their clients. Any lapse in monitoring could result in unnoticed system issues, potentially disrupting business operations and leading to revenue and reputation losses.

Travel & Leisure

Today's travel industry relies heavily on online platforms, from flight and hotel bookings to experiential reservations. These platforms cater to global users across different time zones, making 24/7 availability crucial. Downtime can lead to immediate loss of bookings and revenue and disrupt travelers' plans, leading to a poor customer experience and potential reputational damage. The need for high availability becomes even more critical during peak travel seasons or events. For instance, an outage during a ticket launch for a major event could cause significant customer dissatisfaction and potential revenue loss.

Best Practices for Achieving 5-Nines with Multi-CDN Architecture

Implementing a Multi-CDN architecture is regarded as a best practice for service providers that want to achieve five nines (99.999%) availability, as it is virtually impossible to reach this level with only a single CDN service. The key benefit of a Multi-CDN strategy is its increased reliability and redundancy. Dispersing traffic across multiple CDNs reduces the impact of any single outage, resulting in a smoother user experience.
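A rough calculation shows why redundancy helps. The sketch below estimates the combined availability of several CDNs under two simplifying assumptions that rarely hold perfectly in practice: outages at different CDNs are independent, and failover between them is instant and flawless. The 99.9% figure for a single CDN is an illustrative value, not a measurement of any provider.

```python
from math import prod


def combined_availability(availabilities: list[float]) -> float:
    """Availability (0..1) of redundant providers where one survivor is enough.

    Assumes independent failures and perfect failover: the service is down
    only when every provider is down at the same time.
    """
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
    return 1.0 - prod(1.0 - a for a in availabilities)


single = 0.999  # illustrative: one CDN at "three nines"
dual = combined_availability([single, single])
print(f"{dual * 100:.4f}%")  # 99.9999%
```

Real-world results will be lower than this idealized figure, because failures can be correlated and detection and failover take time, which is where the monitoring practices below come in.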

Adopting an Active-Active policy is a critical component of a successful Multi-CDN approach. In contrast to an Active-Passive strategy, in which one CDN serves all traffic, Active-Active distribution spreads traffic across two or more CDNs at all times. Because every CDN continuously serves real traffic, each configuration's stability and capacity are proven in production, which gives confidence that any one CDN can manage the traffic demand if another fails.
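One way to picture Active-Active distribution is as a weighted choice among the CDNs that are currently healthy. The following is a minimal sketch rather than a production traffic router: the `Cdn` class, the CDN names, and the 50/50 weights are hypothetical, and real deployments usually make this decision at the DNS or load-balancing layer.

```python
import random
from dataclasses import dataclass


@dataclass
class Cdn:
    name: str
    weight: float        # share of traffic this CDN should receive
    healthy: bool = True


def pick_cdn(cdns: list[Cdn], rng: random.Random) -> Cdn:
    """Pick a CDN for one request, weighted across the healthy ones only."""
    candidates = [c for c in cdns if c.healthy and c.weight > 0]
    if not candidates:
        raise RuntimeError("no healthy CDN available")
    weights = [c.weight for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(42)
cdns = [Cdn("cdn-a", 0.5), Cdn("cdn-b", 0.5)]

# Normal operation: both CDNs receive a share of the requests.
served = {pick_cdn(cdns, rng).name for _ in range(1000)}
print(sorted(served))  # ['cdn-a', 'cdn-b']

# If cdn-a is marked unhealthy, all traffic shifts to cdn-b.
cdns[0].healthy = False
print(pick_cdn(cdns, rng).name)  # cdn-b
```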

Monitoring is crucial to achieving acceptable availability levels. It is critical to detect and respond quickly to local or global outages. Local outages might cause traffic to be routed to non-local Points of Presence (PoPs), considerably degrading performance and usability. By using monitoring tools that sample traffic performance from the client side, organizations can detect problems immediately and trigger failover procedures that divert traffic to an alternate CDN.

Furthermore, having a monitoring system that provides real-time alerts and insights allows for immediate action to reduce possible interruptions. Proactive monitoring aids in detecting performance bottlenecks, latency issues, and other anomalies that may influence availability. Organizations can use these insights to optimize their Multi-CDN arrangement, providing consistent and reliable performance for end users.
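The monitoring idea described above can be sketched as a sliding window of client-side request outcomes per CDN, with a CDN treated as unhealthy once its recent error rate crosses a threshold. The class name, window size, 5% threshold, and minimum sample count below are all assumptions chosen for illustration; suitable values depend on traffic volume and tolerance for false alarms.

```python
from collections import deque


class CdnHealthMonitor:
    """Tracks recent client-side request outcomes for one CDN."""

    def __init__(self, window_size: int = 100, max_error_rate: float = 0.05,
                 min_samples: int = 20) -> None:
        self.samples = deque(maxlen=window_size)  # True = success, False = failure
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, success: bool) -> None:
        """Add one sampled request outcome; the oldest sample drops off."""
        self.samples.append(success)

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        return self.samples.count(False) / len(self.samples)

    def is_healthy(self) -> bool:
        # With too little data, avoid failing over on noise.
        if len(self.samples) < self.min_samples:
            return True
        return self.error_rate() <= self.max_error_rate


def healthy_cdns(monitors: dict[str, CdnHealthMonitor]) -> list[str]:
    """Names of the CDNs that should keep receiving traffic."""
    return [name for name, monitor in monitors.items() if monitor.is_healthy()]


monitors = {"cdn-a": CdnHealthMonitor(), "cdn-b": CdnHealthMonitor()}
for _ in range(30):
    monitors["cdn-a"].record(False)  # simulated outage on cdn-a
    monitors["cdn-b"].record(True)

print(healthy_cdns(monitors))  # ['cdn-b']
```

Feeding the result of a check like this into the traffic-distribution step is what makes failover automatic rather than dependent on someone noticing an alert.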

Relying on a manual failover backup plan is risky. In discussions we have had with dozens of DevOps and IT managers, we have found that manual backup plans are difficult to execute and can introduce numerous unpredictable issues. We strongly recommend avoiding that approach.

Conclusion

Organizations in different industries rely on their systems being reliable and available for their consumers to access immediately. Even though varying levels of availability are defined, we need to ensure that appropriate analysis is conducted before selecting an appropriate level for our service or system.

While networks and connectivity have become crucial aspects in providing the availability of a system or service, it is essential to understand the role that CDNs play within this area. CDNs allow users to connect seamlessly to applications through their vast array of edge locations, thus allowing failover and traffic management capabilities to maintain high levels of availability.

In conclusion, availability deserves significant weight in the decision-making process, since poor availability can have severe repercussions for both the systems and the organizations that run them.