Yesterday, I wrote about the importance of planning your data purges. Today, let’s look at the simplest requirement for purging data: Keep data for X days after it’s created. There are tons of cases where this is the rule. Log data, telemetry data, even transactional data is often retained based on X days since it is created.
Given how common this scenario is, let’s talk in a bit more detail about I like to go about deleting it. Deleting data is easy, right?
DELETE t FROM dbo.Transactions AS t WHERE CreateDate <= DATEADD(dd,-90,GETDATE());
That "One Big Delete" statement is easy to write, but for most of us might cause some problems if it's deleting more than a few rows. I'm talking about deletes, but the same applies to all DML--inserts, updates, and deletes.
The alternative is to do batched deletes (or batched inserts or batched updates). Batching is more effort, and you're as lazy as I am, you'll want to justify that extra effort…so off the top of my head, here are five reasons that have forced me to do batching
Everyone loves to hate replication, but I'm a pretty big fan of it. One of the annoyances of Replication is that a big delete on the Publisher turns into RBAR deletes on the subscriber. Doing One Big Delete of, say, 1 million rows will jam up your Replication Distribution with 1 million separate
DELETE statements, jamming up your Replication distribution, slowing down everything else that needs to be replicated.
If you batch your deletes, you'll still have 1 million RBAR deletes on the subscriber, but the other stuff will have an opportunity to be interspersed with other commands, and will keep data moving to the subscribers--spreading out the delays, rather than having your One Big Delete create One Big Logjam
On an Availability Group, transactions are committed to the AG secondary either in synchronous or asynchronous mode, depending on your configuration. That commit hardens the transaction to the secondary's transaction log--not the data file. The replay of those transactions from the log into the data files is always an asynchronous process.
Long running transactions can case the secondary redo to fall behind. If your secondary redo falls behind, failover will take longer. Say that One Big Delete takes 20 minutes to run. At minute 19, there's a failover of the AG. The secondary will have more transaction log that it needs to reconcile on failover before the secondary is available. Even if it fails over just after your Big Delete commits, the secondary will be busy catching up. The result is that your failover will take longer. Longer failovers mean longer downtime during failovers. If your failover takes too long, you risk violating your uptime SLAs during a failover.
Long-running transactions cause the transaction log to grow. SQL Server can only clear the log that is older than the oldest active transaction. What's that mean? If your longest running transaction takes 60 minutes, then you can't clear the transaction log for that full 60-minute period--even if you take transaction log backups, or are in Simple recovery.
By batching your large operations into smaller chunks, you can manage the transaction log a bit better and prevent it from growing so large, allowing the existing log to be reused.
What happens if your purge process doesn't finish. Maybe it hits an error, or you have to kill it for some reason. If you're doing One Big Delete, then the whole thing rolls back. That means tomorrow or next week, you have to do the whole thing over again, plus the new data that aged out in the mean time.
If you're deleting in batches, only the current batch has to rollback; when you're ready to resume you pick up where you left off.
Long transactions are more likely to cause blocking. The quicker you can get in, make your data modification, and get out, the less likely you are to cause problems related to blocking. Depending on your isolation level, the actual impact will vary.
Even if you're purging from one "end" of the clustered index, and the active data is at the other "end," you can still see blocking from a long-running delete. Even when you're pruning the inactive part of the clustered index, there are probably some non-clustered indexes that you're modifying all throughout the index.
In between each batched operation, you let other sessions in to do their work. Ideally, all those blocked sessions can jump in and do their work between batches. Sometimes, you might need to artificially throttle your purge to increase time between deletes. The goal is to ensure that any blocking is short in duration, and resolves completely, rather than continuing to pile up over time.
Batching into chunks helps alleviate all these problems caused by long-running transactions. Determining your batch size is a balancing act. Doing One Big Delete will be the fastest, most efficient way to do the delete, except for the possible drawbacks listed above. Doing single row deletes will be the slowest, least efficient way to do the delete, except that it solves all of those drawbacks. The solution is to find the batch size that is large enough to be efficient, but small enough to minimize problems. Depending on your environment, your data, and your SLAs, that could be batches of 5,000 rows, or batches of 5,000,000 rows.
Between each batch, you may even want to pause and wait for a few milliseconds. Taking this quick breath between batches helps further alleviate problems. During that short wait, blocked transactions can proceed, replication can catch up, and it slows down the rate of data change to keep you from filling the transaction log too quickly. How long should your wait be? I can't tell you--this is part of the same balancing act as setting batch size. You'll need to balance how fast you need to go with how much you need to hold back to let other processes go. It will all depend on what you're doing in your environment.
Tomorrow, we'll look at one pattern I like to use to do batched deletes.