DEV Community

Designing a rate limiter | Comprehensive

Rate limiting is a technique that limits the number of requests sent to a server or an API endpoint in a specified time frame. It helps prevent overloading the system, ensuring that it remains responsive and reliable, and improves performance. A rate limiter is a key component in any distributed system architecture. It provides the ability to control the rate at which requests are made to a system.

In today’s fast-paced digital world, many applications require a rate limiter to maintain their stability and ensure a smooth user experience. In this article, we will explore various factors that must be taken into account while designing a rate limiter.

The high-level design of a rate limiter can be divided into three main components: the configuration component, the rate-limiting component, and the monitoring component.

Configuration Component

The configuration component is responsible for storing and managing the configuration of the rate limiter. It includes parameters such as the maximum rate at which requests can be processed, the number of requests that can be processed simultaneously, and the rate at which the limit resets.

Rate Limiting Component

The rate-limiting component is responsible for enforcing the rate limit on incoming requests. It does this by tracking the number of requests made in a given time period and rejecting requests that exceed the rate limit. There are several algorithms that can be used to implement rate limiting, including token buckets, leaky buckets, and fixed windows.

IP-based rate limiting: This type of rate limiting limits the number of requests sent from an IP address. It is useful in preventing denial of service (DoS) attacks and protecting the system from excessive traffic.

A token bucket is a common algorithm used for rate limiting. It involves the use of a token bucket that refills at a fixed rate. A token bucket has a fixed capacity, and tokens are added to it at a fixed rate. Every time a request is made, a token is removed from the bucket. When the bucket is empty, the rate limiter rejects additional requests until new tokens are added to the bucket. This algorithm provides more flexibility than a fixed window as it allows for bursts of requests.

A fixed window is the simplest algorithm used for rate limiting. It involves dividing time into windows of fixed duration and allowing a fixed number of requests within each window. This algorithm is less flexible than token buckets and leaky buckets as it doesn’t allow for bursts of requests.

*A leaky bucket * is another algorithm used for rate limiting. It involves a bucket that leaks at a fixed rate. Each incoming request adds to the bucket, and if the bucket overflows, the request is rejected. This algorithm is more strict than the token bucket as it doesn’t allow for bursts of requests.

A Sliding Window algorithm counts the number of requests within a sliding time window, such as the last 1 minute. This allows for a more fine-grained rate limit policy but requires more computational overhead.

Distributed rate limiting

: In this type of rate limiting, multiple rate limiters are deployed across a distributed system. The rate limiters work together to ensure that the system’s overall capacity is not exceeded.

Monitoring Component

The monitoring component is responsible for monitoring the rate limiter and generating alerts when the rate limit is breached. It provides visibility into the rate at which requests are being processed and ensures that any issues are identified and resolved quickly.

Factors to Consider

The context in which the rate limiter is being used is important to consider, as different situations may require different rate limits. The rate limiter may be implemented inside an application or as a separate service to protect multiple microservices or APIs. If it is a separate service, all APIs that need protection must be in sync with the rate limiter.

The rate limiter is important for security reasons and to prevent availability issues, such as overloading storage or crashing the system.

Capacity and scalability: The rate limiter’s capacity must be large enough to handle the expected load. If the capacity is too small, it will result in a high error rate. As the number of users and requests increase, the rate limiter must be able to scale horizontally and handle a larger volume of requests.

Expiration: The rate limiter should have a mechanism for removing expired entries to avoid memory leaks.

Error handling and Fault Tolerance: The rate limiter should be fault-tolerant and able to handle errors gracefully, such as network failures, database errors, or other unexpected events, and recover quickly without losing any data.

Granularity and Latency: The rate limiter’s granularity determines the frequency at which it checks for rate limits. A more granular rate limiter can detect and block requests more quickly. It must respond quickly and accurately to prevent unnecessary delays.

Configuration: The rate limiter should be configurable to allow for the customization of parameters such as capacity, rate limit, and duration. It should be able to set different limits for different types of requests and users.

Security: The rate limiter should be designed to prevent abuse and protect against malicious attacks. It should be able to detect and block requests from attackers or bots.

Monitoring: The rate limiter should be monitored and logged to provide insights into its usage and performance. It should be able to track and report on the number of requests, the limit violations, and other metrics.

High-level request flow

Here’s a high-level overview of the flow of a user request through an application with a rate limiter:

  • The user sends a request to the application through an API gateway or load balancer.
  • The request is routed to a server that hosts the application.
  • The request is checked by the rate limiter, which looks up the user’s IP address or other identifiers in the cache or database to determine if they have exceeded their rate limit.
  • If the user has exceeded their rate limit, the rate limiter rejects the request with a 429 Too Many Requests response.
  • If the user has not exceeded their rate limit, the request is forwarded to the application’s business logic, which processes the request and returns a response.
  • The response is sent back to the user through the API gateway or load balancer.

Image description

API Gateway: An API Gateway acts as the entry point for all incoming API requests. It is responsible for routing requests to the appropriate backend services, including the rate limiter service.

Rate Limiter Service: This is the core component of the rate limiter system. It is responsible for enforcing the rate limit policies and ensuring that incoming requests are either allowed or denied based on the current usage rate.

Cache: A cache can be used to store information about the rate limit usage of each user or client. This can help to reduce latency and improve the overall performance of the system.

Database: A database can be used to store more persistent data about the rate limit usage of each user or client. This can help to ensure that rate limit data is not lost if the cache is cleared or fails.

Load Balancer: A load balancer can be used to distribute incoming requests across multiple instances of the rate limiter service. This can help to ensure high availability and fault tolerance.

Image description

Below is a sample algorithm implemented for a rate-limiter system in python and java.

import time

class RateLimiter:

    def __init__(self, max_tokens, refill_rate):
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate
        self.tokens = max_tokens
        self.last_refill_time = time.monotonic()

    def consume(self, tokens):
        # Refill the bucket based on the elapsed time
        now = time.monotonic()
        time_since_last_refill = now - self.last_refill_time
        self.tokens = min(self.max_tokens, self.tokens + time_since_last_refill * self.refill_rate)
        self.last_refill_time = now

        # Check if there are enough tokens in the bucket
        if tokens > self.tokens:
            return False

        # Consume the tokens and update the bucket
        self.tokens -= tokens
        return True

Enter fullscreen mode Exit fullscreen mode

This rate limiter class initializes a token bucket with a maximum number of tokens and a refill rate and provides a consume() method to check if a certain number of tokens can be consumed. The algorithm tracks the time since the last refill to ensure that the bucket is properly refilled based on the refill rate.

Below is the sample java implementation.

import java.util.concurrent.atomic.AtomicLong;

public class RateLimiter {
    private final long capacity;
    private final AtomicLong tokens;
    private final long ratePerMillis;
    private long lastRefillTime;
    public RateLimiter(long capacity, long ratePerSecond) {
        this.capacity = capacity;
        this.tokens = new AtomicLong(capacity);
        this.ratePerMillis = ratePerSecond * 1000;
        this.lastRefillTime = System.currentTimeMillis();

    public boolean tryAcquire() {
        return tokens.getAndUpdate(tokens -> tokens > 0 ? tokens - 1 : tokens) > 0;
    private void refill() {
        long currentTime = System.currentTimeMillis();
        long elapsedTime = currentTime - lastRefillTime;
        long tokensToAdd = elapsedTime * ratePerMillis / 1000;
        if (tokensToAdd > 0) {
            lastRefillTime = currentTime;
            tokens.updateAndGet(tokens -> Math.min(capacity, tokens + tokensToAdd));
Enter fullscreen mode Exit fullscreen mode

In this implementation, the rate limiter is initialized with a maximum capacity and a rate per second. The tryAcquire the method is called to attempt to acquire a token, and returns true if a token is available, or false if the rate limit has been exceeded.

The token bucket algorithm is used to implement the rate limiter. The bucket starts out full and tokens are consumed as they are requested. The bucket is refilled at a constant rate based on the rate per second. The refill method is called on each call to tryAcquire, to ensure that the bucket is refilled with tokens based on the elapsed time since the last refill.


In conclusion, a rate limiter is an essential component of a distributed system that ensures stability and optimal performance under heavy traffic loads. It allows system components to communicate with each other efficiently and helps prevent overload or abuse by regulating the flow of requests. By implementing an effective rate limiter, we can ensure that the system can handle increased traffic and maintain the desired level of service. It is crucial to consider various factors such as scalability, latency, accuracy, and security while designing a rate limiter. With the right architecture, the rate limiter can provide the necessary protection to the system while maintaining a high level of performance. A well-designed rate limiter can be a critical element in building a reliable and robust system that meets the needs of the users.

If you found this article helpful, please like and share it with your peers. Feel free to follow me on LinkedIn for more system design write-ups and technical articles.

Top comments (0)