Systems evolve and so does their data. Instead of making our code endlessly complex just to be able to work with old data models and new ones, we can simply massage the existing data into a new shape to better fit the newest domain models. This is the definition of data migration I’m going with today.
In this article I’m using Visualizer, my pet project, to show how to update existing data to conform to a new data model.
I’m going to make use of another Redis module, called RedisGears, to perform the data transformation itself. Another cool thing that I’m going to use is Redlock, a distributed lock algorithm for Redis.
Background
I talked about Visualizer in my Redis as a Database series, so I encourage you to read the other articles as well. The gist of it is that Visualizer ingests tweets directly from Twitter, stores them in Redis, offers a GraphQL API to search and retrieve them, and exposes multiple GraphQL subscriptions for live updates.
You don’t really need to read up on Visualizer to understand this article, but if you’re enjoying this one, you’d enjoy the others as well.
Current Data Model
We’re not going to look at the entire data model of the application. Instead, we’re going to look at how a single tweet is modeled and stored.
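Here’s a trimmed-down sketch of such a Redis OM model; the property names and the key prefix are illustrative and may differ from Visualizer’s actual code:

```csharp
using Redis.OM.Modeling;

// Trimmed-down tweet model; property names and prefix are illustrative.
[Document(StorageType = StorageType.Json, Prefixes = new[] { "tweet" })]
public class TweetModel
{
    [RedisIdField]
    [Indexed]
    public string Id { get; set; }

    [Searchable]
    public string Text { get; set; }

    [Indexed]
    public string Username { get; set; }

    [Indexed]
    public string Lang { get; set; }
}
```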
You may notice the Document attribute. That’s because I’m using Redis OM to store and retrieve data. The experience is similar to using Entity Framework.
More importantly, the class defines properties for a unique tweet ID, the tweet’s text, the author’s username and the tweet’s language.
All of these properties carry either the Indexed or the Searchable attribute, making tweets easily and efficiently searchable in Redis.
This is how it ends up in Redis: each tweet is stored as a JSON document, and the attributed properties are indexed by RediSearch.
Target Data Model
My intention is to extend the data model with a property that contains the ‘sentiment’ of the tweet. Think machine learning based sentiment analysis.
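Sketched out, the addition could look like this; the enum members and the exact attribute combination are assumptions based on the description below:

```csharp
using Newtonsoft.Json;
using Newtonsoft.Json.Converters;
using Redis.OM.Modeling;

// Illustrative enum members; Visualizer's actual values may differ.
public enum TweetSentiment
{
    Unknown,
    Negative,
    Neutral,
    Positive
}

public class TweetModel
{
    // ... existing properties from above ...

    // Indexed for fast queries. The converter tells Newtonsoft.Json
    // (used by Redis OM under the hood) to serialize the enum as a string.
    [Indexed]
    [JsonConverter(typeof(StringEnumConverter))]
    public TweetSentiment Sentiment { get; set; }
}
```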
Please note the new property called Sentiment. It is indexed as well, for performant searching, and it carries a hint for Redis OM (which uses Newtonsoft.Json under the hood) about how serialization should be handled.
The TweetSentiment itself is just an enum.
Migration Strategy
Schema
So now the TweetModel has the Sentiment property. We need to tell Redis about this.
The way I’m doing it in Visualizer is by dropping the existing Redis OM (or rather RediSearch) index and recreating an updated version of it.
I’m making use of an IHostedService from ASP.NET, which runs at startup, before the application accepts network requests.
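A minimal sketch of such a hosted service, assuming Redis OM’s RedisConnectionProvider and its DropIndex/CreateIndex helpers (the class name is made up):

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Redis.OM;

public class IndexInitializerService : IHostedService
{
    private readonly RedisConnectionProvider _provider;

    public IndexInitializerService(RedisConnectionProvider provider) => _provider = provider;

    public Task StartAsync(CancellationToken cancellationToken)
    {
        // Rebuild only the RediSearch index; the stored documents themselves are untouched.
        _provider.Connection.DropIndex(typeof(TweetModel));
        _provider.Connection.CreateIndex(typeof(TweetModel));
        return Task.CompletedTask;
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}
```

Registered via builder.Services.AddHostedService, it runs to completion before the app starts listening for requests.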
The first part is done. Our data model is now aware of the Sentiment property and so is the RediSearch module. Whenever a new tweet is added, its Text and Lang properties can be passed to an ML model to perform sentiment analysis and the result can be stored in the Sentiment property. Thanks to the updated index, queries that include the Sentiment should be blazingly fast 😉.
Data
At this point, new tweets are stored together with their estimated Sentiment. They’re also searchable. But what about those tens of thousands of tweets that are already in Redis? Those were stored before the new fancy sentiment analysis feature was released, and thus have no value in their Sentiment property.
We might try to programmatically load every tweet that has no Sentiment value, then compute and store it. That would be slow though, and would probably increase the monthly cloud bill because of all the data shoved back and forth. If only there were a way to do this more efficiently.
Here is where RedisGears comes into play. It allows us to run arbitrary Python code (no C# yet 😔) directly on Redis.
My approach is inspired by DbUp. In my application I have a folder where I place all my data migration scripts. The scripts are numbered so that the order of their execution is clear. Whenever a script is executed successfully, its name is permanently recorded. Scripts are only executed when the application starts up, and only those whose name was not previously recorded.
I think that the data should be migrated before the schema migration is performed. So, right before starting the ASP.NET application, I’m triggering the data migration from Program.cs.
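Roughly, that call looks like this; the extension method name MigrateDataAsync is made up for illustration:

```csharp
var builder = WebApplication.CreateBuilder(args);
// ... register services, including IDataMigrationService ...
var app = builder.Build();

// Migrate the data before the app starts accepting requests.
await app.MigrateDataAsync();

app.Run();
```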
The extension method is there just to keep Program.cs clean. Instead of doing everything in it, I’m resolving IDataMigrationService to benefit from dependency injection.
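A sketch of that extension method, assuming MigrateData() returns a Task; the class and method names are illustrative:

```csharp
using Microsoft.Extensions.DependencyInjection;

public static class DataMigrationExtensions
{
    public static async Task MigrateDataAsync(this WebApplication app)
    {
        // Resolve the service from DI inside a scope so that its own
        // dependencies (Redis connection, lock factory, ...) get injected.
        using var scope = app.Services.CreateScope();
        var dataMigrationService = scope.ServiceProvider.GetRequiredService<IDataMigrationService>();
        await dataMigrationService.MigrateData();
    }
}
```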
Instead of pasting the entire implementation of the IDataMigrationService here, I want to ease you into it.
First, have a look at the data migration project and the folder with the data migration scripts.
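The layout is roughly this; the project and file names are illustrative:

```text
Visualizer.DataMigration/
├── DataMigration/
│   ├── 0001_add_tweet_sentiment.py
│   └── 0002_... (future scripts)
├── DataMigrationExtensions.cs
└── DataMigratorService.cs
```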
Notice the extension method class, the numbered scripts and the DataMigratorService class.
Now, the DataMigratorService needs to determine which of the available scripts have been executed already and which are new ones. To get the list of available scripts it just looks into the DataMigration directory and enumerates the files. They all have their build action set to Content. The already executed scripts can be found in a set (data structure) in Redis, under the PerformedMigrationsKey.
Then, it returns the new scripts, sorted alphanumerically in ascending order.
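Here’s a sketch of that logic inside the service, assuming the scripts live in a DataMigration folder next to the binaries and that StackExchange.Redis is used for the set lookup (the key name is illustrative):

```csharp
// Inside DataMigratorService; illustrative key name.
private const string PerformedMigrationsKey = "visualizer:performed-migrations";

private async Task<List<string>> GetNewScriptFilesAsync(IDatabase db)
{
    // All scripts shipped with the app (build action: Content).
    var scriptsDir = Path.Combine(AppContext.BaseDirectory, "DataMigration");
    var availableScripts = Directory.EnumerateFiles(scriptsDir, "*.py");

    // Scripts that were already executed, recorded in a Redis set.
    var performed = (await db.SetMembersAsync(PerformedMigrationsKey))
        .Select(member => member.ToString())
        .ToHashSet();

    // Keep only the new ones, sorted alphanumerically in ascending order.
    return availableScripts
        .Where(path => !performed.Contains(Path.GetFileName(path)))
        .OrderBy(path => Path.GetFileName(path), StringComparer.Ordinal)
        .ToList();
}
```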
Next, if any new scripts were detected, they have to be executed one by one and their names stored in the Redis set from before.
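Running a script boils down to sending its content to Redis via the RG.PYEXECUTE command and, if that succeeds, remembering the file name. Again a sketch, using StackExchange.Redis:

```csharp
private async Task ExecuteScriptsAsync(IDatabase db, IEnumerable<string> newScriptFiles)
{
    foreach (var scriptFile in newScriptFiles)
    {
        // RG.PYEXECUTE lets RedisGears run the Python code directly on the server.
        var scriptContent = await File.ReadAllTextAsync(scriptFile);
        await db.ExecuteAsync("RG.PYEXECUTE", scriptContent);

        // Record the script name so it's never executed again.
        await db.SetAddAsync(PerformedMigrationsKey, Path.GetFileName(scriptFile));
    }
}
```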
The new scripts have been executed and the application can now proceed to start. There’s one small problem though. Depending on the deployment strategy, multiple instances of the application might start up simultaneously and thus try to perform the data migration concurrently. If the data migration scripts aren’t idempotent (that’s up to you), then you’ll have unexpected/corrupted data.
To account for this, I’m making use of RedLock, a distributed lock algorithm that is guaranteed to be deadlock free.
The first thing that the DataMigratorService tries to do is to acquire a named distributed lock. I named mine PerformedMigrationsLock. It will continuously retry to acquire the lock and, when it succeeds, it gives the lock an expiry time. The expiry time is pushed ahead for as long as the application lives, which guarantees that the lock is released even when the service crashes and fails to explicitly release it.
Thus, MigrateData() acquires the lock, determines the new scripts and runs them.
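A sketch of how that could look, assuming RedLock.net’s IDistributedLockFactory and a StackExchange.Redis multiplexer are injected into the service (the time spans are illustrative):

```csharp
public async Task MigrateData()
{
    // Retry until the lock is acquired or the wait time runs out. While held,
    // RedLock.net keeps extending the expiry; disposing the lock releases it.
    var expiry = TimeSpan.FromSeconds(30);
    var wait = TimeSpan.FromMinutes(5);
    var retry = TimeSpan.FromSeconds(1);

    using var migrationLock = await _redLockFactory.CreateLockAsync(
        "PerformedMigrationsLock", expiry, wait, retry);

    if (!migrationLock.IsAcquired)
    {
        // Another instance holds the lock (or just finished migrating).
        return;
    }

    var db = _connectionMultiplexer.GetDatabase();
    var newScripts = await GetNewScriptFilesAsync(db);
    await ExecuteScriptsAsync(db, newScripts);
}
```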
What do the scripts look like?
My script populates the Sentiment field of the stored JSON documents.
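A minimal sketch of such a Gears script; the key pattern, the JSON path and the sentiment values are assumptions, so check the repo for Visualizer’s actual script:

```python
from random import choice

# Illustrative sentiment values; JSON.SET expects valid JSON, hence the inner quotes.
SENTIMENTS = ['"Negative"', '"Neutral"', '"Positive"']


def set_random_sentiment(record):
    # record['key'] holds the Redis key of the matched JSON document.
    execute('JSON.SET', record['key'], '$.Sentiment', choice(SENTIMENTS))


# KeysReader feeds every key matching the pattern into the pipeline.
gb = GearsBuilder('KeysReader')
gb.foreach(set_random_sentiment)
gb.run('tweet:*')
```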
Most RedisGears Python scripts start with a GearsBuilder instance. On that instance you define the actual transformations to perform on each loaded key, and a pattern that decides which keys to consider in the first place.
Notice that I’m using random values to populate the Sentiment field. You might compute the values for your fields based on other fields, or actually use an ML model to perform the transformation. E.g. you could use m2cgen to transform trained models into pure Python code and load them into RedisGears to be executed in a GearsBuilder instance. Another option is to pull out the big guns and go straight to RedisAI.
Pro tip: if you’re following a zero downtime deployment strategy, you will have old and new service instances running in parallel. The new ones might perform schema and data migrations while starting up. For this reason it is best to do a release where you only add new fields to your data model. After all old service instances have been replaced with new ones, you can perform a new deployment in which you remove unused fields from the data model, if necessary.
Conclusion
You saw how I’m doing the schema and data transformations in Visualizer. My project is open source, so if you prefer looking at code to reading an article, here you go: https://github.com/mariusmuntean/Visualizer
I hope that the concepts are generic and can be transferred to other projects without too much work. Let me know how you do data migration in Redis and how I can improve this article. The best ways to get in touch with me are Twitter (my handle is @MunteanMarius) and GitHub issues.
Give the article a ❤️ and follow me for more Redis content.