I recently came across this gem data-migrate which allows you to do data migrations like you do schema migrations.
One reason I am wary about doing data migrations in schema migrations is that they can sometimes take a long time, and if your schema migrations are executed inline with your deploy pipeline, that can block a deploy. At DEV we use DataUpdateScripts, which is pretty similar, but it runs asynchronously in the background.
Set Up Framework For Running Data Update Scripts #6025
What type of PR is this? (check all applicable)
Description
While hooking up our first Elasticsearch model, Tags, I realized that in order for search to work I would have to manually reindex all of the tags in our database before the search code went live. This is not a huge deal for us, but it means an extra undocumented step that others using this codebase might miss. This framework would give us the ability to run scripts like this the same way we run migrations.
A script like this would be deployed ahead of using the new data. When the app deploys, or a local environment is updated, the `DataUpdateWorker` looks at all the files in the `data_update_scripts` folder. For any file that has not been run, i.e. is not in our database, it creates a record and then calls `run` on that class.

We used something like this at my prior company, where we had to keep data in sync across 5+ VPCs, and it worked out really well.
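The flow described above can be sketched in plain Ruby. Everything here is an assumption for illustration (the class names, the script registry, and the in-memory `RUN_SCRIPTS` set standing in for the database table); the real implementation lives in the dev.to codebase:

```ruby
require "set"

# Stands in for the database table that records which scripts have run.
RUN_SCRIPTS = Set.new

# A data update script: one class per file in data_update_scripts/.
# This example class and its behavior are hypothetical.
class ReindexTags
  def run
    # In a real script this would reindex every Tag into Elasticsearch.
    puts "Reindexing tags..."
  end
end

class DataUpdateWorker
  # Map of script file names to their classes. In the real app this
  # would be discovered by scanning the data_update_scripts folder.
  SCRIPTS = { "20200501000000_reindex_tags.rb" => ReindexTags }

  def perform
    SCRIPTS.each do |file_name, klass|
      next if RUN_SCRIPTS.include?(file_name) # already run, skip it

      RUN_SCRIPTS.add(file_name)              # record it, like a migration row
      klass.new.run                           # then call run on that class
    end
  end
end

DataUpdateWorker.new.perform
```

Because each completed script is recorded, running the worker again is a no-op for anything already in the database, which is what makes it safe to kick off on every deploy.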
Why not use migrations?

The reason we may want to separate this from migrations is so that it is not tied to our deploy process. In the past, I have had scripts that take hours to run because they touch a lot of data, and you don't want that holding up a deploy. This way a deploy goes out, kicks off a worker, and that worker does its thing in the background for however long it needs.
THOUGHTS?!
Added to documentation?

- [x] readme
If people are on board with this approach, I will add the necessary documentation to this branch as well.
Awesome!!!
But how do you keep track of the order in which scripts ran? Do the script file names get a timestamp attached to them, similar to regular migrations?
Yep! That is exactly how it works, same as migrations. Here you can see a list of our scripts: github.com/thepracticaldev/dev.to/...
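The ordering described in this exchange falls out of the naming convention: a `YYYYMMDDHHMMSS` timestamp prefix (the same convention Rails migrations use) means a plain lexicographic sort is also a chronological sort. The file names below are made up for illustration:

```ruby
# Hypothetical data_update_scripts file names with timestamp prefixes,
# deliberately listed out of order here.
files = [
  "20200511000000_migrate_tag_colors.rb",
  "20200427000000_reindex_tags.rb",
  "20200503120000_backfill_user_settings.rb",
]

# Because the prefix is fixed-width YYYYMMDDHHMMSS, sorting by name
# recovers the order the scripts were created in.
puts files.sort
```

The oldest script (`20200427...`) sorts first, so the worker can simply process the sorted file list to run scripts in the order they were written.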