The process of deleting production data is completely different from a regular development task. We should treat it as its own kind of work and think about it in its own way. The coding examples in this article are based on Ruby on Rails (RoR).
Data deletion work is difficult to estimate without investigating first, so do not estimate it with your usual development-style estimation process. As a rough guide, it can take at least two weeks to complete the whole process, because we need to answer the questions listed below in advance. Furthermore, performance can become a problem even after you finish writing the code. I recommend investigating carefully and discussing the plan with everyone involved, including QA engineers, before starting the work. In the end, that is the fastest way.
- How much data do we have to delete?
- How can we test the deletion?
- What is the data to delete?
- How many related resources belong to the target data?
- How can we estimate the deletion time?
- Is there an alternative approach if deleting all of the data turns out to be too difficult?
The examples are written in Ruby on Rails (RoR) code, but they should translate easily to other languages and frameworks.
Let's say you are working on an e-commerce app that has orders, where order_items belong to each order.
If the amount of target data is huge, naive iteration can cause terrible performance problems.

```ruby
Order.where(user_id: 10).each do |order|
  order.order_items.each do |order_item|
    # ...
  end
end
```

If there are millions of orders, this issues one `order_items` query per order: a significant N+1 problem. We should consider approaches that do not rely on ORM magic.
In RoR, there are basically two methods to delete records: `delete` and `destroy`. `destroy` runs the callbacks defined on the model before deleting the object. `delete`, on the other hand, ignores the callbacks set up on the model and simply issues a DELETE SQL statement. As a result, `delete_all` is much faster than `destroy_all`, and the difference is significant when the data set is large. It also means that any work we defined in `after_destroy` callbacks must be taken care of manually.
Once it comes to deleting even thousands of records, we should handle the work carefully so that database resources are used sensibly; a single `delete_all` call is probably not enough. In that case, use something like the built-in batching methods (`in_batches`, `find_in_batches`) or the `delete_in_batches` gem. The batch size should be around a few thousand to avoid overwhelming the DB; the default is 1,000 for `find_in_batches` and 10,000 for `delete_in_batches`.

```ruby
Order.where(user_id: 10).in_batches(of: 5_000) do |batch|
  batch.delete_all
  sleep(1) # throttle down the DB load
end
```

As in the example above, it is a good idea to throttle between batches with `sleep`; the Rails API documentation for `in_batches` shows this very pattern.
You may not discover this kind of problem until you test against a full, pruned copy of production data, which is why using pruned production data is important. Anyhow, in many cases the cause of a slow query is, as you probably know, a full table scan due to a missing index.
```sql
DELETE FROM users WHERE email = 'something@localhost';
```
The database looks for the target records by email, but there is no index on email, so it does a full table scan. That case is simple and you are probably familiar with it. A less obvious variant involves foreign keys:
```
Table "addresses"
  Column   | Type | Modifiers
-----------+------+------------------------------
 id        | int  | not null default nextval(...
 street_no | int  | not null
 ...
Indexes:
    "idx_primary" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "users" CONSTRAINT "fk_users_addresses" FOREIGN KEY (address_id) REFERENCES addresses(id) NOT VALID

Table "users"
   Column   |  Type  | Modifiers
------------+--------+--------------------------------
 id         | bigint | not null default nextval(...
 address_id | int    |
 ...
Indexes:
    "idx_primary" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_users_addresses" FOREIGN KEY (address_id) REFERENCES addresses(id) NOT VALID
```
Here the SQL deletes rows from addresses, but addresses is referenced by the users table through a foreign key constraint on address_id. Because address_id has no index, the database must do a full table scan of users every time rows are deleted from addresses.
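The usual fix is to index the referencing column. A hypothetical Rails migration sketch (the class name and Rails version are illustrative); on PostgreSQL, `algorithm: :concurrently` avoids locking the table while the index builds, and requires `disable_ddl_transaction!`:

```ruby
class AddIndexToUsersAddressId < ActiveRecord::Migration[7.0]
  # Concurrent index creation cannot run inside a transaction on PostgreSQL.
  disable_ddl_transaction!

  def change
    add_index :users, :address_id, algorithm: :concurrently
  end
end
```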
If the target data has many associations, we may consider deleting only some of them for now. For example, an order has order_items, purchased_users, addresses, items, etc. In this case, though, we will create orphan data that no longer has its parent (e.g., order_items left behind when only the orders are deleted). This is technical debt, so if we take this approach, we should make sure everyone involved understands and agrees to it.
To delete only the required data now and clean up the orphan data later, consider creating "to_be_deleted" tables that hold the ids of the records targeted for deletion. Such a table looks like this:
```
Table "orders_to_be_deleted"
  Column  |  Type   | Modifiers
----------+---------+-----------
 order_id | integer | not null
Indexes:
    "index_orders_to_be_deleted_on_order_id" btree (order_id)
```
With these id tables, we can still refer to the deleted ids after the orders themselves are gone, for example.
Thanks for this tip Raphael!
No matter how carefully we prepare, unexpected situations will come up. Make sure you and your team have backups, and a way to restore them, before you start.
Those are the things to consider when deleting data. I hope this helps someone avoid the mistakes I made while working on data deletion projects.