DEV Community


How to avoid cascading failures in distributed systems

Arpit Mohan ・3 min read

TL;DR notes from articles I read today.

How to avoid cascading failures in distributed systems

  • Cascading failures in distributed systems typically involve a feedback loop where an event causes a reduction in capacity, an increase in latency, or a spike in errors which then becomes a vicious cycle due to the responses of other parts of the system. You need to design your system thoughtfully to avoid them.
  • Set a limit on incoming requests for each instance of your service, along with load shedding at the load balancer, so that the client fails fast and can retry, or receives an error message early on.
  • Moderate client requests to limit dangerous retry behaviours: impose an exponentially increasing backoff between retries and add a little jitter, and make the number of retries and the wait times application-specific. User-facing applications should degrade or fail fast, while batch or asynchronous processing can take longer. Also, use a circuit breaker to track failures and successes so that a sequence of failed calls to an external service trips the breaker.
  • Ensure bad input does not become a query of death, crashing the service: write your program to quit only if the internal state seems incorrect. Use fuzz testing to help detect programs that crash from malformed input.
  • Avoid failover plans based on proximity, where the failure of a data center or zone pushes its load onto the next closest resource; that resource is likely to be just as busy, so the extra load can cause a domino effect. Balance the load geographically instead, pushing it to the data centers with the most available capacity.
  • Reduce, limit or delay the work your server does in response to a failure, such as data replication, using a token bucket algorithm, and wait a while to see if the system can recover.
  • Reduce startup times caused by reading or caching a lot of data up front. Slow startup makes autoscaling difficult, you may not detect a problem until the instance has finished starting, and recovery takes just as long if you need to restart.
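
The retry advice above (exponential backoff plus a little jitter, with an application-specific retry count) can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the names `backoff_delays` and `call_with_retries` and the default values for `base`, `cap` and `max_retries` are assumptions made for the example.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, max_retries=5, rng=random.random):
    """Yield one wait time per retry: exponential growth, capped, with jitter.

    All defaults are illustrative assumptions; tune them per application.
    """
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter: wait a random fraction of the ceiling so that many clients
        # do not all retry at the same instant.
        yield rng() * ceiling


def call_with_retries(operation, sleep=time.sleep, **backoff_options):
    """Call operation(); on failure wait and retry until retries run out."""
    delays = backoff_delays(**backoff_options)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

A user-facing caller might pass a small `max_retries` and `cap` so it fails fast, while a batch job can afford larger values.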

Full post here, 14 mins read
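
The circuit breaker mentioned in these notes could look something like the sketch below: it counts consecutive failures, trips after a threshold, rejects calls while open, and lets a trial call through after a timeout. The class name, the `failure_threshold` and `reset_timeout` parameters, and their defaults are hypothetical choices for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected unattempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        # Thresholds are illustrative assumptions, not recommended values.
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # time the breaker tripped; None while closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: allow one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                # Trip the breaker, or re-trip it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to an external service as `breaker.call(lambda: fetch())` (with `fetch` standing in for the real call) means that once the breaker trips, callers get an immediate `CircuitOpenError` instead of adding load to a struggling dependency.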

The API security maturity model

  • The API Security Maturity Model is a corollary to the Richardson Maturity Model, which describes four levels of REST compliance in RESTful API design. The security model describes cumulative levels of security, complexity, and efficiency.
  • Level 0 uses API keys and basic authentication, which is fundamentally insecure as it assumes whoever has the key is the rightful owner of it. There is basically no separate authorization process.
  • Level 1 uses token-based authentication but still conflates authentication and authorization, or produces quasi-authentication where the token acts as an ID card. This remains vulnerable because you assume that possession of the token is itself a guarantee against malicious intent.
  • Level 2 uses token-based authorization, where authentication tokens allow entry but access and privileges are regulated by a system such as OAuth. Permissions can be designed to match a token’s lifespan and purpose, or set so that tokens age out of use. However, these systems are designed to be authoritative, so you need to ask whether you can trust the system the token comes from. Also consider the reliability of data in transit: tokens can pick up additional data, and that data can be altered, as they pass through the system, so you need to monitor who adds data and what sort.
  • Level 3 uses claims for a centralized trust system, which gathers context and verifies information about the subject rather than simply trusting the caller, API gateway or token issuer; to achieve this, you need an asserting party you trust to verify the context and subject attributes for each claim with signed tokens (using private and public keys).
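
To make the idea of signed claims concrete, here is a small sketch of an asserting party signing a set of claims and an API verifying them before trusting their contents. Note the substitution: the notes describe private and public keys, but Python's standard library has no asymmetric signing, so this example uses a shared-secret HMAC as a stand-in. The function names and token layout are assumptions for illustration; a real system would more likely use asymmetric signatures through an established token library.

```python
import base64
import hashlib
import hmac
import json


def sign_claims(claims, key):
    """Asserting party: serialize the claims and append an HMAC-SHA256 tag.

    HMAC with a shared key is a stand-in for private-key signing.
    """
    payload = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode()
    )
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_claims(token, key):
    """API side: return the claims only if the signature checks out."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise ValueError("claims were altered or signed by an untrusted party")
    return json.loads(base64.urlsafe_b64decode(payload))
```

The point of the check is that the API trusts the subject's attributes only because the asserting party vouched for them, not merely because the caller presented a token.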

Full post here, 10 mins read

Get these notes directly in your inbox every weekday by signing up for my newsletter, in.snippets().

