Cover image Credit: @thesollers
Assumptions
- you understand rails framework
- you have duplicates in your database already
- you have existing records in your database that are duplicates and in violation of
ActiveRecord::UniqueRecord
Problem:
We found that a method in one of our controllers was using find_or_create_by
, and according to this article this method does not use locking, so what can happen if two writes are happening simultaneously, then duplicates are created in the database.
This was conflicting with our business logic, but were harmless, but nonetheless bloating our tables
# for brevity this table will be called table_1
# Indexes
#
# index_on_specific_table_1 (something_id)
#
# Foreign Keys
#
# fk_rails_... (something_id => random_record.id)
#
As you can see in this contrived example we have a index already on the column something_id
, as well as an index.
Solution
The solution as laid out by my boss was two fold.
a. create a migration that would delete all duplicate instances
b. add a UNIQUE
index field to the record
c. rescue
the exception that was thrown for ActiveRecord::RecordNotUnique
, and retry
Method
Inside of the migration the first task was to find a way to destroy all duplicates before running applying the UNIQUE
index key to the column in the table, because if we tried to remove the index , the existing duplicates would throw an error and fail the migration.
so in order to clean up duplicates in the tables , i found this nifty sQL command that took care of all the heavy lifting, from a Wiki Page of all places!
ActiveRecord::Base.connection.execute('
DELETE FROM table_1
WHERE id IN (
SELECT
id
FROM (
SELECT
id,
ROW_NUMBER() OVER w AS rnum
FROM table_1
WINDOW w AS (
PARTITION BY something_id
ORDER BY id
)
) t
WHERE t.rnum > 1);
')
This SQL statement will delete all rows in violation of not having a unique something_id
field. you can also add more columns to this by comma seperation, if you have multiple fields, it would be :
PARTITION BY something_id, something2_id, something3_id
Now the next step is to add a unique index, but we have some issues , because episode_id
already has an index on place on it. Not only that but it also has a foreign key assignment.
You cannot modify an index on a table if a foreign key is on that column , and if you currently have an index on a column, you have to remove it before adding a new UNIQUE
index.
so in order to add the unique index we must first remove the foreign key assignment from the column, which we will add back later:
remove_foreign_key :table_1, column: :something_id
then after the foreign key is removed we can now remove the existing (non-unique) index that is on episode_
But instead of just removing it, we will rename it to be extra safe.
rename_index :table_1, :index_on_specific_table_1, :temp_non_unique_index
so in the above code you can see that we are renaming :index_on_specific_table_1
to :temp_non_unique_index
.
the reason why we want to rename, instead of straight removing at this point is because the add_index
will take a long time, and it could sometimes fail depending on the size of the table.
if the table size is huge it is possible it could add_index
could fail during migration, and if we had used remove_index
before it , then we would be left with a column with no index on it at all, and this could change a request that takes only milliseconds, to possible taking minutes.
then, after we rename the table, we add an index to the table:
add_index :table_1, :something_id, unique: true, name: 'unque_index_table_1`
this will add a unique index to the specific column in the specific table and then it will give it . a specific name: unique_index_table_
After this is all said and done, we will then remove our temporary index: unique_index_table_
remove_index(:table_1, :name => 'temp_non_uniq_index')
add_index :table_1, :episode_id, unique: true
now all togther it should looks something like this:
ActiveRecord::Base.connection.execute('
DELETE FROM table_1
WHERE id IN (
SELECT
id
FROM (
SELECT
id,
ROW_NUMBER() OVER w AS rnum
FROM table_1
WINDOW w AS (
PARTITION BY something_id
ORDER BY id
)
) t
WHERE t.rnum > 1);
')
remove_foreign_key :table_i1, column: :something_id
rename_index :table_1, :index_on_episode_id, :temp_non_uniq_index
add_index :table_1, :something_id, unique: true, name: 'unique_index_on_table_1_episode_id'
remove_index(:table_1, :name => 'temp_non_uniq_index')
add_foreign_key :table_1, :episodes
add_index :table_1, :something_id, unique: true
Sources: (all my open tabs)
https://thoughtbot.com/blog/the-perils-of-uniqueness-validations
https://wiki.postgresql.org/wiki/Deleting_duplicates
https://thoughtbot.com/blog/how-to-create-postgres-indexes-concurrently-in
https://medium.com/@craigsheen/rails-migrations-can-do-more-than-you-think-bf1e40404714
Top comments (1)
Thank you! The snippet for finding duplicate rows is super helpful 🙇