Imagine you are on your way to the office during the morning rush hour. The highways are filled with slow-moving vehicles stuck in heavy traffic. When a traffic light turns green, thousands of vehicles try to move at once, making the jam even worse. This situation is similar to what is called the "thundering herd problem" in the world of computing.
What is the Thundering Herd Problem?
Thundering Herd is a term in distributed computing and large-scale systems that refers to a situation where multiple processes or threads try to obtain the same resource simultaneously after a certain delay or event.
This issue often occurs in scenarios where a large number of processes are waiting for a limited resource, such as a database, shared cache, network service, or remote service.
When this resource finally becomes available, all waiting processes will try to access it simultaneously, causing a spike in activity that can overload or even disrupt the service.
A thundering herd can cause performance degradation, delays, or even service failure. To overcome this problem, mechanisms such as scheduling, throttling, or design patterns such as the Circuit Breaker are needed to limit the number of simultaneous requests to the limited resource.
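To make the idea of limiting simultaneous requests concrete, here is a minimal, self-contained Go sketch (it is not tied to the example later in this article) that uses a buffered channel as a counting semaphore so that only a handful of goroutines touch the shared resource at any one time:
package main

import (
    "fmt"
    "sync"
    "time"
)

// maxConcurrent caps how many goroutines may use the shared resource at once.
const maxConcurrent = 5

func main() {
    // A buffered channel works as a counting semaphore.
    sem := make(chan struct{}, maxConcurrent)
    var wg sync.WaitGroup

    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot; blocks while all slots are taken
            defer func() { <-sem }() // release the slot when done

            // Simulate a call to the limited resource.
            time.Sleep(50 * time.Millisecond)
            fmt.Printf("request %d served\n", id)
        }(i)
    }
    wg.Wait()
}
The same idea applies to database connections, outbound HTTP calls, or any other limited resource: the semaphore spreads a burst of requests out over time instead of letting every waiter hit the resource at once.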
Case Example
Suppose you have a web application that fetches user data from a MySQL database and stores it in a Redis cache to speed up access. When the user data is not found in the cache, many requests try to retrieve it from the database at the same time, putting excessive load on the database and reducing overall application performance.
Look at the code below:
package main
import (
"database/sql"
"encoding/json"
"fmt"
"github.com/go-redis/redis"
_ "github.com/go-sql-driver/mysql"
"log"
"math/rand"
"time"
)
type User struct {
Id int
Username string
Email string
}
type Config struct {
redisClient *redis.Client
dbClient *sql.DB
}
func NewUser(dbClient *sql.DB, redisClient *redis.Client) *Config {
return &Config{
dbClient: dbClient,
redisClient: redisClient,
}
}
func (e *Config) getDataFromMysql(username string) (*User, error) {
row := e.dbClient.QueryRow("SELECT * FROM users WHERE username = ?", username)
user := &User{}
err := row.Scan(&user.Id, &user.Username, &user.Email)
if err != nil {
return nil, err
}
err = e.saveToRedis(username, user)
if err != nil {
return nil, err
}
return user, nil
}
func (e *Config) getDataFromRedis(username string) (*User, error) {
val, err := e.redisClient.Get(username).Result()
if err != nil {
log.Printf("failed to get redis with key [%s], err: %v", username, err)
log.Println("get data from mysql")
user, err := e.getDataFromMysql(username)
if err != nil {
return nil, err
}
return user, nil
}
user := &User{}
err = json.Unmarshal([]byte(val), &user)
if err != nil {
return nil, err
}
return user, nil
}
func (e *Config) saveToRedis(key string, data *User) error {
jsonData, err := json.Marshal(data)
if err != nil {
return err
}
ttl := 10 * time.Second
err = e.redisClient.Set(key, jsonData, ttl).Err()
if err != nil {
return err
}
return nil
}
func main() {
mysqlConn, err := sql.Open("mysql", "root:root@tcp(localhost:3306)/employees")
if err != nil {
panic(err)
}
redisConn := redis.NewClient(&redis.Options{
Addr: "127.0.0.1:6379",
Password: "",
DB: 0,
})
client := NewUser(mysqlConn, redisConn)
username := "jhon_doe"
// Simulate concurrent requests that miss the cache
for i := 0; i < 100; i++ {
go func() {
time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
val, err := client.getDataFromRedis(username)
if err != nil {
fmt.Println(err)
} else {
fmt.Printf("Got value: %v\n", val)
}
}()
}
select {}
}
Now let's run the code above, and in the middle of the run we will shut down our MySQL database.
In this run we can see that when the data is not found in the Redis cache, every request tries to retrieve it from the MySQL database at the same time. Now imagine that there are far more than a hundred such requests. This is the thundering herd problem: the database is overloaded with excessive requests, and when it recovers it is immediately hit again by the requests that are still being sent.
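If you want to see the stampede in numbers rather than only in the log output, one option is to count how many calls actually reach MySQL. The snippet below is a hypothetical addition to the example above (the dbHits counter and the getDataFromMysqlCounted wrapper are not part of the original code) and assumes "sync/atomic" is added to the imports:
// dbHits counts how many requests fall through the cache and hit MySQL.
var dbHits int64

// getDataFromMysqlCounted wraps getDataFromMysql so that every database call is counted.
func (e *Config) getDataFromMysqlCounted(username string) (*User, error) {
    atomic.AddInt64(&dbHits, 1)
    return e.getDataFromMysql(username)
}

// After the goroutines have run, print the total, for example:
// log.Printf("MySQL was hit %d times", atomic.LoadInt64(&dbHits))
On a cold cache, or while MySQL is down, most of the 100 goroutines will reach the database, which is exactly the burst the circuit breaker below is meant to contain.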
Circuit Breaker Pattern
There are many ways to deal with the thundering herd problem. One of them is the Circuit Breaker Pattern. Before we apply it to the code above, let's first look at what the Circuit Breaker Pattern is. In software, the Circuit Breaker Pattern is a design pattern used to prevent system failure caused by unresponsive or failing external components. It works much like the electrical circuit breaker in your home.
In everyday life, we often use electrical appliances at home, such as irons, toasters or washing machines. If too many appliances are used at the same time, this can place an excessive load on your home's electrical system. To prevent damage or fire, an electrical safety device called a "Circuit Breaker" is installed.
Likewise, in a distributed system, components such as databases, web services, or other external systems can fail or become unresponsive for various reasons, such as excessive load, network problems, or internal errors. If your application continually tries to access unresponsive components, this can lead to excessive resource usage, a backlog of requests, and ultimately a complete system failure.
The Circuit Breaker Pattern addresses this problem by monitoring the success and failure of requests to an external component. If too many failures occur within a certain time period, the Circuit Breaker "opens" and temporarily blocks new requests to that component. While the Circuit Breaker is open, new requests are rejected or redirected to a safe fallback mechanism, such as returning cached data or a default response.
After a certain delay, the Circuit Breaker tries to close itself again and allows new requests to the external component. If those requests succeed, the Circuit Breaker stays closed. However, if failures occur again, the Circuit Breaker opens again and the cycle repeats. By using the Circuit Breaker Pattern, your application becomes more resilient to external component failures: it avoids wasting resources on requests that are bound to fail and gives the external component time to recover before accepting new requests. This improves overall system stability, reliability, and robustness.
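As a small illustration of the fallback idea, here is a hypothetical sketch (it is not part of the example that follows and only assumes a breaker with an IsOpen method, like the one implemented later in this article) of a call that returns a safe default while the breaker is open:
// getUserWithFallback returns fresh data while the breaker is closed and a safe
// default while it is open, instead of letting the caller fail outright.
func getUserWithFallback(cb *CircuitBreaker, fetch func() (*User, error)) (*User, error) {
    if cb.IsOpen() {
        // Fallback: a default (or previously cached) value keeps the caller working
        // while the external component recovers.
        return &User{Username: "guest"}, nil
    }
    return fetch()
}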
Implementing the Solution with the Circuit Breaker Pattern
Now we will use the Circuit Breaker Pattern to overcome the thundering herd problem above. The Circuit Breaker acts as a safeguard that limits access to a resource (in this case, the MySQL database) when a series of failures occurs. If too many failures occur within a certain time period, the Circuit Breaker "opens" and blocks new requests to the failing resource. Here's the complete code:
package main
import (
"database/sql"
"encoding/json"
"fmt"
"github.com/go-redis/redis"
_ "github.com/go-sql-driver/mysql"
"log"
"math/rand"
"time"
)
type User struct {
Id int
Username string
Email string
}
type Config struct {
redisClient *redis.Client
dbClient *sql.DB
circuit *CircuitBreaker
}
type CircuitBreaker struct {
failureThreshold int
consecutiveFailure int
open bool
openedAt time.Time
}
const circuitBreakerResetTimeout = 2 * time.Second
func NewCircuitBreaker(failureThreshold int) *CircuitBreaker {
return &CircuitBreaker{
failureThreshold: failureThreshold,
consecutiveFailure: 0,
open: false,
}
}
func NewUser(dbClient *sql.DB, redisClient *redis.Client, failureThreshold int) *Config {
return &Config{
dbClient: dbClient,
redisClient: redisClient,
circuit: NewCircuitBreaker(failureThreshold),
}
}
func (cb *CircuitBreaker) IsOpen() bool {
if cb.open {
// Check last time opened
if time.Since(cb.openedAt) >= circuitBreakerResetTimeout {
cb.open = false
cb.consecutiveFailure = 0
log.Println("Circuit Breaker closed")
} else {
return true
}
}
return false
}
func (cb *CircuitBreaker) IncrementConsecutiveFailure() {
cb.consecutiveFailure++
if cb.consecutiveFailure >= cb.failureThreshold {
cb.open = true
cb.openedAt = time.Now()
log.Println("Circuit Breaker opened")
}
}
func (e *Config) getDataFromMysql(username string) (*User, error) {
row := e.dbClient.QueryRow("SELECT * FROM users WHERE username = ?", username)
user := &User{}
err := row.Scan(&user.Id, &user.Username, &user.Email)
if err != nil {
return nil, err
}
err = e.saveToRedis(username, user)
if err != nil {
return nil, err
}
return user, nil
}
func (e *Config) getDataFromRedis(username string) (*User, error) {
if e.circuit.IsOpen() {
return nil, fmt.Errorf("circuit breaker is open")
}
val, err := e.redisClient.Get(username).Result()
if err != nil {
log.Printf("failed to get redis with key [%s], err: %v", username, err)
log.Println("get data from mysql")
user, err := e.getDataFromMysql(username)
if err != nil {
e.circuit.IncrementConsecutiveFailure()
return nil, err
}
return user, nil
}
user := &User{}
err = json.Unmarshal([]byte(val), &user)
if err != nil {
return nil, err
}
return user, nil
}
func (e *Config) saveToRedis(key string, data *User) error {
jsonData, err := json.Marshal(data)
if err != nil {
return err
}
// NOTE: with a 2 ms TTL, cached entries expire almost immediately, so most requests in this demo will miss the cache.
ttl := 2 * time.Millisecond
err = e.redisClient.Set(key, jsonData, ttl).Err()
if err != nil {
return err
}
return nil
}
func main() {
mysqlConn, err := sql.Open("mysql", "root:mysqlsecret@tcp(localhost:3306)/employees")
if err != nil {
panic(err)
}
redisConn := redis.NewClient(&redis.Options{
Addr: "127.0.0.1:6379",
Password: "",
DB: 0,
})
failureThreshold := 3
client := NewUser(mysqlConn, redisConn, failureThreshold)
username := "user20001"
// Simulate concurrent requests that miss the cache
for i := 0; i < 100; i++ {
go func() {
time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
val, err := client.getDataFromRedis(username)
if err != nil {
fmt.Println(err)
} else {
fmt.Printf("Got value: %v\n", val)
}
}()
}
select {}
}
Now let's walk through the changes to the code above. The first change is to the Config struct, where we add a circuit field that holds a CircuitBreaker object:
type Config struct {
redisClient *redis.Client
dbClient *sql.DB
circuit *CircuitBreaker
}
With the addition of this circuit field, the Config object now also has a CircuitBreaker object that will be used to apply the Circuit Breaker pattern. Next, there is the addition of a new CircuitBreaker struct and a NewCircuitBreaker function:
type CircuitBreaker struct {
failureThreshold int
consecutiveFailure int
open bool
openedAt time.Time
}
The CircuitBreaker struct has four fields:
- failureThreshold: the maximum number of consecutive failures allowed before the Circuit Breaker opens.
- consecutiveFailure: the number of consecutive failures that have occurred so far.
- open: whether the Circuit Breaker is currently open.
- openedAt: the time at which the Circuit Breaker was opened.
func NewCircuitBreaker(failureThreshold int) *CircuitBreaker {
return &CircuitBreaker{
failureThreshold: failureThreshold,
consecutiveFailure: 0,
open: false,
}
}
The NewCircuitBreaker function is a constructor to create a new CircuitBreaker object with the given failureThreshold, consecutiveFailure set to 0, and open set to false (the Circuit Breaker is in the closed state initially).
Next, the NewUser function that initializes the Config object is modified to register the circuit breaker so that it can be used later:
func NewUser(dbClient *sql.DB, redisClient *redis.Client, failureThreshold int) *Config {
return &Config{
dbClient: dbClient,
redisClient: redisClient,
circuit: NewCircuitBreaker(failureThreshold),
}
}
We also add the IsOpen method:
func (cb *CircuitBreaker) IsOpen() bool {
if cb.open {
// Check last time opened
if time.Since(cb.openedAt) >= circuitBreakerResetTimeout {
cb.open = false
cb.consecutiveFailure = 0
log.Println("Circuit Breaker closed")
} else {
return true
}
}
return false
}
The IsOpen function will first check whether the Circuit Breaker is open (cb.open). If open, the function will check whether the time that has passed since the Circuit Breaker opened (time.Since(cb.openedAt)) has reached or exceeded the circuitBreakerResetTimeout.
If the elapsed time has reached the waiting time limit, the Circuit Breaker will be closed again (cb.open = false), and consecutiveFailure is set back to 0.
The function will also record a log message "Circuit Breaker closed" to notify you that the Circuit Breaker has closed again. If the elapsed time has not reached the waiting time limit, the function will return true to indicate that the Circuit Breaker is still open.
If the Circuit Breaker is not open, the function will return false.
Don't forget to add the circuitBreakerResetTimeout constant, which determines how long the Circuit Breaker stays open before the IsOpen function above closes it again:
const circuitBreakerResetTimeout = 2 * time.Second
The IncrementConsecutiveFailure function is another method on the CircuitBreaker struct. It is responsible for counting the consecutive failures that occur when trying to access the external resource.
func (cb *CircuitBreaker) IncrementConsecutiveFailure() {
cb.consecutiveFailure++
if cb.consecutiveFailure >= cb.failureThreshold {
cb.open = true
cb.openedAt = time.Now()
log.Println("Circuit Breaker opened")
}
}
This function first increments consecutiveFailure (the number of consecutive failures) by 1. It then checks whether consecutiveFailure has reached or exceeded failureThreshold (the maximum number of failures allowed).
If consecutiveFailure reaches or exceeds failureThreshold, the Circuit Breaker switches to the open state (open = true) and records the time it opened, which is later compared against the timeout in the IsOpen function above. While the Circuit Breaker is open, all new requests to the external resource are rejected or redirected to a safe fallback mechanism. The function also logs the message "Circuit Breaker opened" to notify you that the Circuit Breaker has opened.
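As a quick sanity check, the breaker can be exercised in isolation. The short snippet below reuses the CircuitBreaker type, NewCircuitBreaker, and circuitBreakerResetTimeout defined above; it is only an illustration and not part of the article's main program:
// demoCircuitBreaker shows the open/close cycle of the breaker defined above.
func demoCircuitBreaker() {
    cb := NewCircuitBreaker(3)

    // Three consecutive failures reach the threshold and open the breaker.
    for i := 0; i < 3; i++ {
        cb.IncrementConsecutiveFailure()
    }
    fmt.Println(cb.IsOpen()) // true: new requests should be rejected or sent to a fallback

    // Once the reset timeout has elapsed, IsOpen closes the breaker again.
    time.Sleep(circuitBreakerResetTimeout)
    fmt.Println(cb.IsOpen()) // false: requests are allowed through again
}
The first IsOpen call returns true because three consecutive failures reached the threshold; the second returns false because the reset timeout has elapsed and IsOpen closes the breaker again.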
Once all the necessary configuration and functions are in place, we can put the pieces together to handle the thundering herd problem. First, in the main function, we set failureThreshold, which in this case allows at most 3 consecutive failures:
func main() {
// other code ...
failureThreshold := 3
client := NewUser(mysqlConn, redisConn, failureThreshold)
// other code ...
}
The last piece of the circuit breaker implementation is in the getDataFromRedis function. Let's walk through it.
func (e *Config) getDataFromRedis(username string) (*User, error) {
if e.circuit.IsOpen() {
return nil, fmt.Errorf("circuit breaker is open")
}
// other code ...
First, in the getDataFromRedis function we add a check on whether the Circuit Breaker is open by calling e.circuit.IsOpen(). If the Circuit Breaker is open, the function immediately returns the error "circuit breaker is open" and does not continue retrieving data from Redis or MySQL. This prevents new requests to the external resource while the Circuit Breaker is open.
If the Circuit Breaker is not open, the function tries to retrieve the data from Redis using e.redisClient.Get(username).Result(). If an error occurs while retrieving data from Redis, the function logs the error and tries to retrieve the data from MySQL by calling e.getDataFromMysql(username).
// other code ...
val, err := e.redisClient.Get(username).Result()
if err != nil {
log.Printf("failed to get redis with key [%s], err: %v", username, err)
log.Println("get data from mysql")
user, err := e.getDataFromMysql(username)
if err != nil {
e.circuit.IncrementConsecutiveFailure() // <-- call function IncrementConsecutiveFailure
return nil, err
}
return user, nil
}
// other code ...
If an error occurs while retrieving data from MySQL, the function calls e.circuit.IncrementConsecutiveFailure() to increase the number of consecutive failures recorded by the Circuit Breaker. If the number of consecutive failures reaches the limit (failureThreshold), the Circuit Breaker opens. The function then returns the error and a nil user. You can see this happen when you run the code above and shut down the MySQL database.
After several consecutive failures to retrieve data from MySQL, the Circuit Breaker opens and no further attempts are made to reach MySQL. At this stage all incoming requests are rejected, or we can serve an appropriate fallback to the user instead. The Circuit Breaker stays open until the time limit we specified in the circuitBreakerResetTimeout constant has elapsed, after which it closes again.
With the Circuit Breaker Pattern in place, the getDataFromRedis function prevents new requests to the external resource (MySQL) while the Circuit Breaker is open. It also monitors failures when retrieving data from MySQL and opens the Circuit Breaker once the failure threshold is reached. This helps prevent the thundering herd problem and gives the external resource time to recover before it has to accept new requests again.
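One caveat: the CircuitBreaker above is shared by 100 goroutines, but its fields are read and written without any synchronization, so the example as written contains a data race. A minimal way to make it goroutine-safe, shown here only as a sketch, is to guard the fields with a sync.Mutex (this also requires adding "sync" to the imports):
type CircuitBreaker struct {
    mu                 sync.Mutex
    failureThreshold   int
    consecutiveFailure int
    open               bool
    openedAt           time.Time
}

func (cb *CircuitBreaker) IsOpen() bool {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if cb.open {
        // Close the breaker again once the reset timeout has passed.
        if time.Since(cb.openedAt) >= circuitBreakerResetTimeout {
            cb.open = false
            cb.consecutiveFailure = 0
            log.Println("Circuit Breaker closed")
            return false
        }
        return true
    }
    return false
}

func (cb *CircuitBreaker) IncrementConsecutiveFailure() {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    cb.consecutiveFailure++
    if cb.consecutiveFailure >= cb.failureThreshold {
        cb.open = true
        cb.openedAt = time.Now()
        log.Println("Circuit Breaker opened")
    }
}
The behavior is the same as before; the mutex only makes sure that concurrent goroutines see a consistent view of open, openedAt, and consecutiveFailure.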
Conclusion
The thundering herd problem is common in distributed systems and can cause significant performance degradation. By using design patterns like the Circuit Breaker, you can mitigate it and keep your applications running smoothly and efficiently. Always consider the potential for a thundering herd when designing systems that depend on limited resources, and implement appropriate safeguards to prevent it.