
Andy Mallon for AM2.co

Originally published at am2.co

Code to purge data based on creation date

Only one thing can spark more joy than cleaning up old data. Doggos.

Yesterday, I talked about batching deletes, and the day before I talked about how much I like to delete data, and why it’s necessary to do it, despite being a data packrat.

Today, let’s look at just one scenario, and how I like to handle it. In my experience, this is one of the most common purge requirements (and probably the easiest), so first, let’s look at the business requirements.

Purge requirements

In this scenario, we’re going to keep the data for X days after it’s created. Then we delete it. That’s it. X could be 3 days or 3 years–it doesn’t matter, we’ll follow the same design pattern.

In today’s world, we generate loads of log data, sensor data, telemetry data, etc. All that data is super duper valuable. But only for a while. Eventually, all that granular data becomes less useful, and isn’t worth keeping around. Maybe it gets aggregated, summarized, or maybe it just gets thrown out.

You’ll have a lot of data with more complex requirements, but I think you’ll also be surprised at how much data has simple date-based retention tied to its creation.

Table design

When I design a table for this type of business requirement, I always like to use the same pattern, unless there’s some reason this design pattern won’t work:

1) Use an Identity column. If it’s log data, there’s not really a natural key, so using a surrogate key makes sense. Even if there does seem to be a natural key, it’s not a great choice here. The data life cycle is an important part of the data’s identity (see what I did there?), and the fact that the data is only being retained for X days is a good reason to skip the natural key, and use a surrogate key in the form of an Identity column. You should make the ID column the primary key & clustered index.

2) Include the date & time the data was inserted. We’re retaining data based on X days after it’s created….so…we kinda need to know when it was created. Seems obvious, but I’ve seen a lot of tables where the cleanup retention wasn’t considered when the table was created, and only when the table is huge do the retention requirements get considered. You should create a non-clustered index on the creation date.

Both of these columns will be ever-increasing. That means that while data gets inserted to the “end” of the table, we can delete from the “beginning” of the table–and we’ll have two indexes to help us identify what to delete & what to keep. Why don’t I just use the datetime column, and skip the identity column? Date/time makes for a terrible primary key. Sometimes two rows get inserted at exactly the same time, and then your uniqueness goes to hell. ID columns solve that.

If you’re looking for a real world example of someone smarter than me using the design pattern, check out dbo.CommandLog from Ola Hallengren’s SQL Server Maintenance Solution.

CREATE TABLE [dbo].[CommandLog](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [DatabaseName] [sysname] NULL,
    [SchemaName] [sysname] NULL,
    [ObjectName] [sysname] NULL,
    [ObjectType] [char](2) NULL,
    [IndexName] [sysname] NULL,
    [IndexType] [tinyint] NULL,
    [StatisticsName] [sysname] NULL,
    [PartitionNumber] [int] NULL,
    [ExtendedInfo] [xml] NULL,
    [Command] [nvarchar](max) NOT NULL,
    [CommandType] [nvarchar](60) NOT NULL,
    [StartTime] [datetime] NOT NULL,
    [EndTime] [datetime] NULL,
    [ErrorNumber] [int] NULL,
    [ErrorMessage] [nvarchar](max) NULL,
    CONSTRAINT [PK_CommandLog] PRIMARY KEY CLUSTERED ([ID] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
);

You’ll notice the ID column for a surrogate key, and the StartTime column for the data creation time. Ola doesn’t include a non-clustered index on StartTime, but I add one to assist in my cleanup. Ola’s code populates StartTime reliably, but depending on your situation, you might want a default constraint or trigger to ensure it’s always populated.
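
Here’s a minimal sketch of what that might look like; the index and constraint names are my own, not part of Ola’s solution:

--Non-clustered index to support the date-based purge
CREATE NONCLUSTERED INDEX IX_CommandLog_StartTime
    ON dbo.CommandLog (StartTime);

--Optional: default constraint so StartTime is always populated,
--even if a row gets inserted outside of Ola's procedures
ALTER TABLE dbo.CommandLog
    ADD CONSTRAINT DF_CommandLog_StartTime DEFAULT (GETDATE()) FOR StartTime;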

Time to clean up

Let’s look at the algorithm we want to use to do the cleanup. For the sake of demonstration, let’s use Ola’s CommandLog table. And let’s assume we need to retain data in the log for 60 days. Let’s also assume we’ve done some testing and determined that we want to delete in batches of 5000 rows. And like I discussed yesterday, we want to take a break between batches to prevent blowing up the transaction log, so we’ll wait for half a second between each delete.

Step 1) Find the range of ID values we want to delete. We’ll use that date/time column to find the range of IDs.

--StartID is where we start, and MaxID is the maximum ID that we want to delete
DECLARE @StartID int, @MaxID int;

SELECT @StartID = MIN(ID), @MaxID = MAX(ID)
FROM dbo.CommandLog
WHERE StartTime < DATEADD(DAY,-60,GETDATE());

Step 2) First, let’s set up our loop to do the batches. This is the hard part, so I like to get the logic sorted out before I figure out what the actual work/delete statement is.

--We'll start at the start, and increment up until we hit that MaxID
WHILE (@StartID < @MaxID)
    BEGIN 
        --increment @StartID by our batch size
        SELECT @StartID = @StartID + 5000;
        --Delete stuff
        --Wait for half a second before looping again
        WAITFOR DELAY '00:00:00.5'
    END;

Step 3) OK, now we need to write a DELETE statement. That --Delete stuff comment looks great, but it’s not very functional.

DELETE TOP (5000) x 
FROM dbo.CommandLog AS x 
WHERE x.ID < @StartID AND x.ID <= @MaxID;

And now we just have to stitch that all together…

--StartID is where we start, and MaxID is the maximum ID that we want to delete
DECLARE @StartID int, @MaxID int;

SELECT @StartID = MIN(ID), @MaxID = MAX(ID)
FROM dbo.CommandLog
WHERE StartTime < DATEADD(DAY,-60,GETDATE());

--We'll start at the start, and increment up until we hit that MaxID
WHILE (@StartID < @MaxID)
    BEGIN 
        --increment @StartID by our batch size
        SELECT @StartID = @StartID + 5000;
        --Delete stuff
        DELETE TOP (5000) x 
        FROM dbo.CommandLog AS x 
        WHERE x.ID < @StartID AND x.ID <= @MaxID;
        --Wait for half a second before looping again
        WAITFOR DELAY '00:00:00.5'
    END;

Now you just have to schedule it. Set up a SQL Agent job to run once a week and delete that data.
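
If you want a starting point for that Agent job, here’s a rough sketch using the msdb job procedures; the job name, schedule, and target database are placeholders, so adjust them for your environment:

USE msdb;
GO
--Create the job (name is a placeholder)
EXEC dbo.sp_add_job
    @job_name = N'Purge - CommandLog';

--The job step runs the batched delete script from above
EXEC dbo.sp_add_jobstep
    @job_name = N'Purge - CommandLog',
    @step_name = N'Delete CommandLog rows older than 60 days',
    @subsystem = N'TSQL',
    @database_name = N'DBA',
    @command = N'/* paste the batched delete script here */';

--Run once a week, early Sunday morning
EXEC dbo.sp_add_jobschedule
    @job_name = N'Purge - CommandLog',
    @name = N'Weekly purge',
    @freq_type = 8,                 --weekly
    @freq_interval = 1,             --Sunday
    @freq_recurrence_factor = 1,    --every week
    @active_start_time = 020000;    --02:00:00

--Target the local server
EXEC dbo.sp_add_jobserver
    @job_name = N'Purge - CommandLog';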

I have an easy button

I found myself writing code that copies this pattern to purge data over and over and over again. Eventually, I automated it. In my DBA database on GitHub, I’ve created a stored procedure that will handle deletes for this exact scenario.

Instead of writing code specific to the table, I just need to execute my Cleanup_TableByID procedure:

EXEC dbo.Cleanup_TableByID
    @DbName = 'DBA',
    @SchemaName = 'dbo',
    @TableName = 'CommandLog',
    @DateColumnName = 'StartTime',
    @IDColumnName = 'ID',
    @RetainDays = 60,
    @ChunkSize = 5000,
    @LoopWaitTime = '00:00:00.5',
    @Debug = 0;

Take a look at my Cleanup_TableByID procedure, and install my DBA database. Installation is easy (directions are in the readme), and it’s free (it’s published under the GPL).



