DEV Community

Comiscience

Using Feature Flags and Observability Tools for Gradual and Safe Database Migration

Scenario Description

Many companies encounter situations requiring database upgrades or migrations, particularly when transitioning from self-hosted database services to cloud-based services, from on-premises data centers to cloud data centers, and from old databases to new ones. Throughout the migration process, it's vital to ensure stability, prevent data loss, and avoid service downtime. One of the most common migration methods is the "Dual Write Database Migration."

Solution

The process of dual write migration is as follows, illustrated in the diagram below:

  1. Initial Stage - The application interacts (reads/writes) with only the old database.
  2. Alongside the existing code that reads from and writes to the old database, we add code that reads from and writes to the new database. For example, when inserting a record into a table, the data must be inserted into both the old and new databases. Typically, these two insert operations are executed in parallel so that the original request handling time is preserved as much as possible.
  3. When a database write request comes in, it is written to the old database, and a small percentage of the traffic is also written to the new database.
  4. The percentage of traffic written to the new database is gradually increased until it reaches 100%. If any issues arise during this process, we can promptly roll back without impacting the production environment.
  5. After the write migration is complete, we gradually increase the share of reads served by the new database, initially allowing 10% of the traffic to perform read operations on it. We measure performance and compare results during this process. If any issues are encountered, we can quickly roll back the new database read traffic without impacting the production environment.
  6. Once the new database has been handling 100% of the read/write operations without any issues for a certain period, we can decommission the old database and its related code services.
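
Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from the article or from any SDK: `DualWriter`, the `writeOld` and `writeNew` delegates, and `shouldWriteNew` (which stands in for a feature flag evaluation such as `BoolVariation`) are hypothetical names. The old database stays the source of truth, so a failure on the new side is reported rather than propagated to the caller.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper illustrating a flag-controlled dual write.
public sealed class DualWriter<T>
{
    private readonly Func<T, Task> _writeOld;
    private readonly Func<T, Task> _writeNew;
    private readonly Func<bool> _shouldWriteNew;      // stands in for a feature flag evaluation
    private readonly Action<Exception> _onNewDbError; // e.g. report to an observability tool

    public DualWriter(
        Func<T, Task> writeOld,
        Func<T, Task> writeNew,
        Func<bool> shouldWriteNew,
        Action<Exception> onNewDbError)
    {
        _writeOld = writeOld;
        _writeNew = writeNew;
        _shouldWriteNew = shouldWriteNew;
        _onNewDbError = onNewDbError;
    }

    public async Task WriteAsync(T record)
    {
        // The old database is always written: it remains the source of truth.
        Task oldWrite = _writeOld(record);

        // The new write is started in parallel, but only for flagged traffic.
        Task newWrite = _shouldWriteNew() ? SafeWriteNewAsync(record) : Task.CompletedTask;

        // A failure of the old write still surfaces to the caller here.
        await Task.WhenAll(oldWrite, newWrite);
    }

    private async Task SafeWriteNewAsync(T record)
    {
        try
        {
            await _writeNew(record);
        }
        catch (Exception ex)
        {
            // Errors from the new database are reported, never thrown.
            _onNewDbError(ex);
        }
    }
}
```

With this shape, turning the flag off is the rollback: the request path goes back to writing only the old database without a redeploy.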

Image description

In practice, it is not only the traffic to the old and new databases that needs to be opened up gradually. The code that reads from and writes to the new database also needs to be rolled out to the production services step by step to ensure a stable, iterative migration.

Practical Methods and Tools

Throughout the process, besides the design of the system architecture itself, two specific tools play crucial roles:

  1. Feature Flags service, responsible for flexible, real-time, stable traffic ramp-up and rollbacks. In this article, we used FeatBit as a feature flag service.
  2. Observability service, providing comprehensive monitoring of service anomalies and timely alarms throughout the process. In this article, we used GuanceCloud as an observability tool.

Using FeatBit to implement real-time database migration request traffic control

The pseudo-code below shows how to split the database read operations for a particular service:

  • The first if statement calls _fbService.BoolVariation("read-sport-olddb") to obtain the traffic control value. If it is true, the query that reads from the old database is added to the list of parallel tasks.
  • The second if statement calls _fbService.BoolVariation("read-sport-newdb") to obtain the traffic control value. If it is true, the query that reads from the new database is added to the list of parallel tasks.
  • The final statement calls _fbService.RunAndCompareDbTasksAsync to run the two database read operations in parallel, compare and verify the results, return the appropriate value depending on the outcome, and send any exception data to the observability tool.
public async Task<List<Sport>> GetSportsByCityAsync(int cityId, int pageIndex, int pageSize)
{
    var tasks = new List<Task<List<Sport>>>();

    // When the feature flag for reading the old database for Sport-related services 
    // returns true, we add a read task to the execution task queue.
    if (_fbService.BoolVariation("read-sport-olddb"))
    {
        tasks.Add(GetSportsByCityQueryAsync(_oldDbContext, cityId, pageIndex, pageSize));
    }

    // When the feature flag for reading the new database for Sport-related services 
    // returns true, we also add a read task to the execution task queue.
    if (_fbService.BoolVariation("read-sport-newdb"))
    {
        tasks.Add(GetSportsByCityQueryAsync(_newDbContext, cityId, pageIndex, pageSize));
    }

    // Two read operations are executed simultaneously (to avoid increasing request 
    // time due to new data read) and the results are compared and returned.
    // If the results are inconsistent, we return the old database read result
    // and record the discrepancy.
    return await _fbService.RunAndCompareDbTasksAsync(
                    tasks,
                    timeoutDelayForNewDB: 3000, // Set the maximum wait time for the new database to avoid a bad user experience
                    (timeoutInfo) => { }, // When the new database call times out, send a message to observability tool
                    (unMatchInfo) => { }, // When the returned results are inconsistent, send a message to observability tool
                    (exception) => { } // When an exception occurs, send a message to observability tool
                );
}
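
The body of RunAndCompareDbTasksAsync is not shown in this article. The sketch below is one way such a helper could behave, written under assumptions: `DualReadComparer`, its parameter names, and the choice to pass the old and new reads as two separate tasks (instead of a list) are all hypothetical. It always returns the old database result and only reports timeouts, mismatches, and exceptions from the new database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical sketch of a "run both reads and compare" helper.
// It is not part of any SDK; all names here are illustrative.
public static class DualReadComparer
{
    public static async Task<List<T>> RunAndCompareAsync<T>(
        Task<List<T>> oldDbRead,
        Task<List<T>> newDbRead,
        TimeSpan newDbTimeout,
        Action onTimeout,
        Action onMismatch,
        Action<Exception> onNewDbError)
    {
        // Both reads were started by the caller, so they already run in parallel.
        // The old database result is what the caller receives in every case.
        List<T> oldResult = await oldDbRead;

        try
        {
            // Wait for the new read up to the timeout, counted from the
            // moment the old read has finished.
            Task finished = await Task.WhenAny(newDbRead, Task.Delay(newDbTimeout));
            if (finished != newDbRead)
            {
                onTimeout();
            }
            else if (!oldResult.SequenceEqual(await newDbRead))
            {
                onMismatch();
            }
        }
        catch (Exception ex)
        {
            // A failing new database must not break the request.
            onNewDbError(ex);
        }

        return oldResult;
    }
}
```

SequenceEqual relies on the default equality of T, so an entity type such as Sport would need value-based equality (or a custom comparer) for the comparison to be meaningful.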

After integrating code similar to the above into your project, you can use the feature flag service's UI to scale the dual-write and dual-read traffic of the database migration. For example, we can initially set the traffic percentage of the feature flag read-sport-newdb to 5%. If no anomalies are observed in the observability tool over a certain period, we can increase the percentage to 10% (as shown in the figure below).

Image description
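
For readers wondering how a percentage setting can translate into a true or false value, one common technique is deterministic bucketing: hash a stable key and compare the resulting bucket to the configured percentage. The sketch below illustrates that general technique only. It is an assumption for illustration, not a description of how FeatBit evaluates flags, and `PercentageRollout` and its parameters are hypothetical names.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical illustration of percentage-based rollout by hashing a key.
public static class PercentageRollout
{
    // Returns true when the key falls into the first `percentage` percent
    // of 10,000 buckets. The same key always lands in the same bucket.
    public static bool IsEnabled(string flagKey, string userKey, double percentage)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(flagKey + ":" + userKey));
        uint bucket = BitConverter.ToUInt32(hash, 0) % 10000; // 0..9999
        return bucket < percentage * 100;
    }
}
```

Because the bucket of a key never changes, raising the percentage from 5% to 10% keeps every key that was already enabled and adds new ones, and lowering it again removes only the most recently added keys.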

Use an Observability Tool to Monitor the Entire Migration Process and Discover Potential Problems Promptly

During the entire data migration process, automated and timely error detection followed by rollback is extremely important. It helps us handle situations such as:

  • If operating the new database consumes significant system resources, we need to know immediately and roll back through the feature flag system.
  • When the number of write or read operations that time out exceeds the estimated threshold, we can quickly locate the problem, roll back, and repair it. This speeds up the migration.
  • When a write or read operation produces errors (such as inconsistent results, excessive request time, or program exceptions), we can find the specific error information in the observability system, which speeds up debugging.
  • And so on.
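
The threshold idea in the list above can be sketched as a small counter that the timeout, mismatch, and exception callbacks would feed. This is a minimal sketch under assumptions: `MigrationHealth` and its members are hypothetical, and in a real setup these signals would be sent to the observability tool as metrics, with alerting configured there, instead of being held in process memory.

```csharp
using System.Threading;

// Hypothetical in-process counter for new-database call outcomes.
public sealed class MigrationHealth
{
    private long _total;
    private long _failed;

    public void RecordSuccess() => Interlocked.Increment(ref _total);

    public void RecordFailure()
    {
        Interlocked.Increment(ref _total);
        Interlocked.Increment(ref _failed);
    }

    // True when at least minSamples calls have been seen and the
    // failure ratio is above maxFailureRatio.
    public bool ShouldRollBack(double maxFailureRatio, long minSamples)
    {
        long total = Interlocked.Read(ref _total);
        long failed = Interlocked.Read(ref _failed);
        return total >= minSamples && (double)failed / total > maxFailureRatio;
    }
}
```

The minimum sample count avoids reacting to a single early failure when only a handful of requests have reached the new database.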

We can use an observability system that integrates Application Performance Monitoring (APM), Real User Monitoring (RUM), and metrics capabilities to monitor abnormal data and system behavior. Examples include Datadog and Guance.one (used in this article).

Quickly Locate Migration Errors through "Traces" and "Error Tracking"

On the APM/Traces page of the observability tool, we find that some red items (i.e., Errors) have occurred during the migration process. Through the Resource column, we can easily see that errors occurred in our read operations on the new database, as shown below:

Image description

By clicking on the corresponding Error, we can quickly view its associated call chain flame graph. The flame graph can be read as follows:

  1. The Span marked with circle 1 in the diagram below shows that a Timeout error occurred during the database migration, i.e., the read from the new database took longer than our acceptable response time threshold.
  2. The position marked with circle 2 shows that the error occurred when the feature flag read-sport-newdb was true. This tells us which feature flag we may need to roll back or turn off to avoid migration risks.
  3. The Span marked with circle 3 identifies the server-side API service where the timeout occurred. The captured parameters and headers of the API can help us debug and resolve the issue later.

Image description

Roll Back Read Operations in Real Time Using Feature Flags to Avoid Timeouts

Based on the Traces and Error Tracking information above, we have located the abnormal database read operation. Now all we need to do is return to the FeatBit UI, find the feature flag read-sport-newdb identified above, and roll its percentage back to the previous state. As shown in the diagram below, we decrease the percentage of traffic allocated to true from 10% back to 5%, where no read anomalies were previously observed.

Image description

After the rollback, as shown in the code below, the return value of _fbService.BoolVariation("read-sport-newdb") will only have a 5% chance of being true.


// When the feature flag for reading from the new database related to the 
// Sport service returns true, add the read task to the execution queue.
if (_fbService.BoolVariation("read-sport-newdb"))
{
    tasks.Add(GetSportsByCityQueryAsync(_newDbContext, cityId, pageIndex, pageSize));
}

Conclusion and Next Steps

This article introduces a basic method to reduce database migration risk by using an observability tool and a feature flag tool to implement a dual-write, dual-read operation mode. In real-world operation, there may be a large number of business services to migrate, and manual intervention can be slow for various reasons. In subsequent articles, we will introduce more content, such as:

  • Using the Metrics service from an observability tool and the Trigger service from FeatBit to implement automatic real-time rollback, disaster avoidance, and alarm solutions during migration.
  • Using the Metrics service from an observability tool and the Scheduler service from FeatBit to implement automated scaling and rollback solutions.
  • And more.

Original article: https://www.featbit.co/blogs/feature-flags-observation-tool-gradual-db-migration
