Today, let’s look at just one scenario, and how I like to handle it. In my experience, this is one of the most common purge requirements (and probably the easiest), so first, let’s look at the business requirements.
In this scenario, we’re going to keep the data for X days after it’s created. Then we delete it. That’s it. X could be 3 days or 3 years–it doesn’t matter, we’ll follow the same design pattern.
In today’s world, we generate loads of log data, sensor data, telemetry data, etc. All that data is super duper valuable. But only for a while. Eventually, all that granular data becomes less useful, and isn’t worth keeping around. Maybe it gets aggregated, summarized, or maybe it just gets thrown out.
You’ll have a lot of data with more complex requirements, but I think you’ll also be surprised at how much data has simple date-based retention based on its creation.
When I design a table for this type of business requirement, I always like to use the same pattern, unless there’s some reason it won’t work:
1) Use an Identity column. If it’s log data, there’s not really a natural key, so using a surrogate key makes sense. Even if there does seem to be a natural key, it’s not a great choice here. The data life cycle is an important part of the data’s identity (see what I did there?), and the fact that the data is only being retained for X days is a good reason to skip the natural key, and use a surrogate key in the form of an Identity column. You should make the ID column the primary key & clustered index.
2) Include the date & time the data was inserted. We’re retaining data based on X days after it’s created…so…we kinda need to know when it was created. Seems obvious, but I’ve seen a lot of tables where the cleanup retention wasn’t considered when the table was created, and the retention requirements only get attention once the table is huge. You should create a non-clustered index on the creation date.
Both of these columns will be ever-increasing. That means that while data gets inserted to the “end” of the table, we can delete from the “beginning” of the table–and we’ll have two indexes to help us identify what to delete & what to keep. Why don’t I just use the datetime column, and skip the identity column? Date/time makes for a terrible primary key. Sometimes two rows get inserted at exactly the same time, and then your uniqueness goes to hell. ID columns solve that.
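Put together, a minimal table following this pattern might look like the sketch below. The table and column names here are placeholders I made up for illustration; the important parts are the identity primary key, the insert-time column with a default, and the supporting index.

```sql
-- A minimal sketch of the pattern: surrogate identity PK,
-- plus an insert-time column with a default & a supporting index.
CREATE TABLE dbo.SensorLog (
    ID            bigint IDENTITY(1,1) NOT NULL,
    SensorReading decimal(18,4) NOT NULL,
    CreateTime    datetime2(3) NOT NULL
        CONSTRAINT DF_SensorLog_CreateTime DEFAULT (SYSDATETIME()),
    CONSTRAINT PK_SensorLog PRIMARY KEY CLUSTERED (ID)
);
GO
-- Non-clustered index to support the purge's range scan on CreateTime
CREATE NONCLUSTERED INDEX IX_SensorLog_CreateTime
    ON dbo.SensorLog (CreateTime);
```

The default constraint means nobody has to remember to populate the creation time; every insert gets stamped automatically.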
If you’re looking for a real-world example of someone smarter than me using this design pattern, check out dbo.CommandLog from Ola Hallengren’s SQL Server Maintenance Solution.
CREATE TABLE [dbo].[CommandLog](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [DatabaseName] [sysname] NULL,
    [SchemaName] [sysname] NULL,
    [ObjectName] [sysname] NULL,
    [ObjectType] [char](2) NULL,
    [IndexName] [sysname] NULL,
    [IndexType] [tinyint] NULL,
    [StatisticsName] [sysname] NULL,
    [PartitionNumber] [int] NULL,
    [ExtendedInfo] [xml] NULL,
    [Command] [nvarchar](max) NOT NULL,
    [CommandType] [nvarchar](60) NOT NULL,
    [StartTime] [datetime] NOT NULL,
    [EndTime] [datetime] NULL,
    [ErrorNumber] [int] NULL,
    [ErrorMessage] [nvarchar](max) NULL,
    CONSTRAINT [PK_CommandLog] PRIMARY KEY CLUSTERED ([ID] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
);
You’ll notice the ID column for a surrogate key, and the StartTime column for the data creation time. Ola doesn’t include a non-clustered index on StartTime, but I add one to assist in my cleanup. Ola’s code ensures StartTime gets populated reliably, but depending on your situation, you might want to have a default constraint or trigger to ensure it’s populated reliably.
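For example, the supporting index (and an optional default constraint) might look like this. The index and constraint names are my own convention; they aren’t part of Ola’s solution.

```sql
-- Non-clustered index on StartTime to support the purge's range predicate
CREATE NONCLUSTERED INDEX IX_CommandLog_StartTime
    ON dbo.CommandLog (StartTime);
GO
-- Optional: a default so StartTime is populated even on ad hoc inserts
ALTER TABLE dbo.CommandLog
    ADD CONSTRAINT DF_CommandLog_StartTime
    DEFAULT (GETDATE()) FOR StartTime;
```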
Let’s look at the algorithm we want to use to do the cleanup. For the sake of demonstration, let’s use Ola’s CommandLog table. And let’s assume we need to retain data in the log for 60 days. Let’s also assume we’ve done some testing and determined that we want to delete in batches of 5000 rows. And like I discussed yesterday, we want to take a break between batches to prevent blowing up the transaction log, so we’ll wait for half a second between each delete.
Step 1) Find the range of ID values we want to delete. We’ll use that date/time column to find the range of IDs.
DECLARE @StartID int,
        @MaxID   int;

--StartID is where we start, and MaxID is the maximum ID that we want to delete
SELECT @StartID = MIN(ID),
       @MaxID   = MAX(ID)
FROM dbo.CommandLog
WHERE StartTime < DATEADD(DAY, -60, GETDATE());
Step 2) Next, let’s set up our loop to do the batches. This is the hard part, so I like to get the logic sorted out before I figure out what the actual work/delete statement is.
--We'll start at the start, and increment up until we've covered @MaxID
WHILE (@StartID <= @MaxID)
BEGIN
    --increment @StartID by our batch size
    SELECT @StartID = @StartID + 5000;

    --Delete stuff

    --Wait for half a second before looping again
    WAITFOR DELAY '00:00:00.5';
END;
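If you want to sanity-check the loop logic before wiring in the real work, you can dry-run it with PRINT statements. The starting values below are made up just to exercise the loop:

```sql
-- Dry run of the batching loop with made-up ID values
DECLARE @StartID int = 1,
        @MaxID   int = 12000;

WHILE (@StartID <= @MaxID)
BEGIN
    SELECT @StartID = @StartID + 5000;
    PRINT CONCAT('Would delete rows with ID < ', @StartID);
    WAITFOR DELAY '00:00:00.5';
END;
```

With these values you’d see three passes, which matches 12000 rows at 5000 per batch.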
Step 3) OK, now we need to write a DELETE statement. That --Delete stuff comment looks great, but it’s not very functional.
DELETE TOP (5000) x
FROM dbo.CommandLog AS x
WHERE x.ID < @StartID
  AND x.ID <= @MaxID;
And now we just have to stitch that all together…
DECLARE @StartID int,
        @MaxID   int;

--StartID is where we start, and MaxID is the maximum ID that we want to delete
SELECT @StartID = MIN(ID),
       @MaxID   = MAX(ID)
FROM dbo.CommandLog
WHERE StartTime < DATEADD(DAY, -60, GETDATE());

--We'll start at the start, and increment up until we've covered @MaxID
WHILE (@StartID <= @MaxID)
BEGIN
    --increment @StartID by our batch size
    SELECT @StartID = @StartID + 5000;

    --Delete stuff
    DELETE TOP (5000) x
    FROM dbo.CommandLog AS x
    WHERE x.ID < @StartID
      AND x.ID <= @MaxID;

    --Wait for half a second before looping again
    WAITFOR DELAY '00:00:00.5';
END;
Now you just have to schedule it. Set up a SQL Agent job to run once a week and delete that data.
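If you’d rather script the Agent job than click through the GUI, a sketch might look like the following. The job and schedule names are my own, and the @command placeholder stands in for the batched delete script; adjust the database and schedule to taste.

```sql
USE msdb;
GO
EXEC dbo.sp_add_job
    @job_name = N'Purge - CommandLog';
EXEC dbo.sp_add_jobstep
    @job_name      = N'Purge - CommandLog',
    @step_name     = N'Delete rows older than 60 days',
    @subsystem     = N'TSQL',
    @database_name = N'DBA',
    @command       = N'/* the batched delete script from above */';
EXEC dbo.sp_add_jobschedule
    @job_name               = N'Purge - CommandLog',
    @name                   = N'Weekly - Sunday 2am',
    @freq_type              = 8,   -- weekly
    @freq_interval          = 1,   -- Sunday
    @freq_recurrence_factor = 1,
    @active_start_time      = 020000;
EXEC dbo.sp_add_jobserver
    @job_name = N'Purge - CommandLog';
```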
I found myself writing code that copies this pattern to purge data over and over and over again. Eventually, I automated it. In my DBA database on GitHub, I’ve created a stored procedure that will handle deletes for this exact scenario.
Instead of writing code specific to the table, I just need to execute my stored procedure:

EXEC dbo.Cleanup_TableByID
    @DbName         = 'DBA',
    @SchemaName     = 'dbo',
    @TableName      = 'CommandLog',
    @DateColumnName = 'StartTime',
    @IDColumnName   = 'ID',
    @RetainDays     = 60,
    @ChunkSize      = 5000,
    @LoopWaitTime   = '00:00:00.5',
    @Debug          = 0;