DEV Community

miyuki_samitani


What is Design for Failure?

Impression before studying

Is it something about designing for failures?

Study

What is Design for Failure?

Design for Failure refers to the concept of designing a system based on the assumption that failures will occur.
Servers can fail, and so can an entire Availability Zone (AZ) or region.
The idea is to improve the availability of the service by preparing countermeasures in advance, so that it keeps running when a failure does occur.

How to realize Design for Failure

  • Elimination of SPOFs

To improve availability, it is important not to create a single point of failure (SPOF).
SPOFs can be avoided by building the system in a high-availability (HA) configuration or, in the case of AWS, in a multi-AZ or multi-region configuration.
With redundancy at that level, the service can continue to operate even if an entire AZ (multi-AZ) or an entire region (multi-region) fails.
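As a rough illustration of why redundancy removes a SPOF, here is a minimal sketch in Python. It is not an AWS API: `make_replica`, `call_with_failover`, and the zone names `az-a` and `az-b` are hypothetical names invented for this example. Each replica stands in for a copy of the service in a different AZ, and the caller simply moves on to the next replica when one is down.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica failed to handle the request."""


def make_replica(zone: str, healthy: bool) -> Callable[[str], str]:
    """Build a fake replica for the given zone (hypothetical helper)."""

    def handle(request: str) -> str:
        if not healthy:
            # Simulate the whole zone being unreachable.
            raise ConnectionError(f"{zone} is unavailable")
        return f"{zone} handled {request}"

    return handle


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # This replica is down; remember the error and try the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


if __name__ == "__main__":
    # az-a is down, but the request still succeeds because az-b is up.
    replicas = [make_replica("az-a", healthy=False), make_replica("az-b", healthy=True)]
    print(call_with_failover(replicas, "GET /"))
```

With only one replica in the list, a single failure takes the service down, which is exactly what a SPOF is. In practice this kind of routing is usually done by a load balancer or DNS rather than by application code.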

  • Monitoring

Monitoring includes service monitoring, which detects service failures early, and resource monitoring, which detects performance degradation.
To detect and recover from failures early, it is necessary to continuously monitor the service's logs and metrics.
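The two kinds of monitoring can be sketched as a simple check over one set of measurements. This is only an illustration under assumptions: the `Sample` fields, the `evaluate` function, and the threshold values (80% CPU, 5% error rate) are made up for this example and are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One set of measurements for a service (hypothetical fields)."""

    service_up: bool    # result of a health check (service monitoring)
    cpu_percent: float  # CPU usage (resource monitoring)
    error_rate: float   # share of failed requests, from 0.0 to 1.0


def evaluate(sample: Sample, cpu_limit: float = 80.0, error_limit: float = 0.05) -> List[str]:
    """Return an alert message for every check the sample violates."""
    alerts: List[str] = []
    if not sample.service_up:
        alerts.append("service: health check failed")
    if sample.cpu_percent > cpu_limit:
        alerts.append(f"resource: CPU at {sample.cpu_percent:.0f}% exceeds {cpu_limit:.0f}%")
    if sample.error_rate > error_limit:
        alerts.append(f"service: error rate {sample.error_rate:.1%} exceeds {error_limit:.1%}")
    return alerts


if __name__ == "__main__":
    for alert in evaluate(Sample(service_up=True, cpu_percent=92.0, error_rate=0.01)):
        print(alert)
```

A real setup would run checks like these continuously and send the alerts to whoever is on call, instead of printing them.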

  • Recovery Methods

When a failure occurs and clients are affected, how do you recover?
Deciding in advance who does what within the organization and which recovery methods to use makes early recovery possible.
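One way to picture "deciding in advance" is a runbook that maps each kind of failure to an owner and a list of steps. The sketch below is purely illustrative: the failure types, owners, and steps in `RUNBOOK` are hypothetical and would differ for every organization.

```python
from typing import Dict, List, Union

Plan = Dict[str, Union[str, List[str]]]

# Hypothetical runbook: who acts and what they do for each known failure type.
RUNBOOK: Dict[str, Plan] = {
    "server_down": {
        "owner": "on-call engineer",
        "steps": ["remove the server from rotation", "launch a replacement"],
    },
    "az_outage": {
        "owner": "infrastructure team",
        "steps": ["shift traffic to the healthy AZ", "notify affected clients"],
    },
}

# Fallback plan for failures nobody anticipated.
DEFAULT_PLAN: Plan = {
    "owner": "on-call engineer",
    "steps": ["escalate to the incident manager"],
}


def recovery_plan(failure_type: str) -> Plan:
    """Look up the agreed plan, falling back to escalation if none exists."""
    return RUNBOOK.get(failure_type, DEFAULT_PLAN)


if __name__ == "__main__":
    plan = recovery_plan("az_outage")
    print(plan["owner"])
    for step in plan["steps"]:
        print("-", step)
```

The point is not the code itself but that the decisions already exist before the failure happens, so nobody has to work them out during an incident.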

Impression after studying

SPOF is indeed ・・・・.
Basically, it means you have to treat a server as something that will eventually die.
