SQL for detecting contiguous and non-contiguous date spans

#sql

This is an example of using SQL to work out where there are gaps between date ranges in a table.

It assumes:

that we have a source table of DbsNm.TblNm with columns for:
some categories - in this example there are three: Category_A , Category_B & Category_C but there could be one or more ;
a pair of columns giving a date span, in the form of "from" and "upto" - Src_From_Dt and Src_Upto_Dt

Here, we'll assume a simple set of "from" and "upto" dates, each of which defines a valid period of some kind of state - e.g. "registered".

Here's an example for one category and two date span records:

Cat	From	Upto
Eric	11 Feb 2007	27 Oct 2007
Eric	28 Oct 2007	31 Dec 2009

As 28 Oct 2007 is the next day after 27 Oct 2007 these two records actually represent contiguous entries for Eric

Here's another date span example:

Cat	From	Upto
Erin	01 Apr 2007	30 Sep 2007
Erin	01 Apr 2008	30 Sep 2008

There, we have a distinct gap between the two records - to the effect that Erin wasn't registered for a period of 2007-2008 and there are 2 non-contiguous registrations.

Once this kind of thing is possible, then we can expect to also see combinations of these occuring. For example:

Cat	From	Upto
Erni	01 Jan 2000	31 Dec 2003
Erni	01 Jan 2004	31 Dec 2006
Erni	01 Jab 2009	31 Dec 2020

in which there are two distinct spans covered by those records.

So what we want to pull out of the data is:

what are the sets of contiguous ranges and where are there gaps?

Note:

This idea of a pair of span dates is quite common in "registrations" data, where each registration record applies for a span of inclusive dates. As a topic this can get quite complicated. See the end Notes section below for more about this.

While I'm coyly saying "category" here, most often the category value is a customer/client specific identifier - as it is usually their registrations being handled. But any concept where there are runs of dates that neatly follow each other is amenable to this method.

The SQL I'm giving here derives identifiers for the subsequences - as GrpNum - and then uses that for an overall summary of them and their boundary points in a final SELECT and GROUP BY.

It also "builds in" some small allowances for overlooking small gaps in the date spans, such as may happen due to weekends and pubilc holidays.

Anyway, here's the SQL

WITH
D AS (
    SELECT
        MAX( Src_From_Dt ) OVER (
            PARTITION BY
                Category_A ,
                Category_B ,
                Category_C
            ORDER BY
                Src_Upto_Dt ,
                Src_From_Dt
            ROWS BETWEEN
                1 PRECEDING
                AND
                1 PRECEDING
            ) AS Prev_From_Dt ,  -- not used, here for inspections
        MAX( Src_Upto_Dt ) OVER (
            PARTITION BY
                Category_A ,
                Category_B ,
                Category_C
            ORDER BY
                Src_Upto_Dt ,
                Src_From_Dt
            ROWS BETWEEN
                1 PRECEDING
                AND
                1 PRECEDING
            ) AS Prev_Upto_Dt ,
    A.*
    FROM
        DbsNm.TblNm AS A -- AS SrcTbl
    ) ,
E AS (
    SELECT
        CASE
            WHEN
                Prev_Upto_Dt IS NULL
                OR
                ( Src_From_Dt - Prev_Upto_Dt ) IN ( 0, 1, 2, 3 ) THEN 'C' -- Continuous
            ELSE 'D'  -- Discontinuous
            END AS Continuity ,
        ( --
            ROW_NUMBER() OVER (
                PARTITION BY
                    Category_A ,
                    Category_B ,
                    Category_C
                ORDER BY                  
                    Src_From_Dt )
            -
            ROW_NUMBER() OVER (
                PARTITION BY
                    Category_A ,
                    Category_B ,
                    Category_C ,
                    Continuity
                ORDER BY
                    Src_From_Dt )
            ) AS GrpNum ,
        D.*
    FROM
        D
    ) ,
F AS (
    SELECT
        Category_A ,
        Category_B ,
        Category_C ,
        GrpNum ,
        MIN( Src_From_Dt  ) AS CntntyGrp_From_Dt ,
        MAX( Src_Upto_Dt  ) AS CntntyGrp_Upto_Dt , -- not used, here for inspections
        COUNT(*) AS Rw_Cnt
    FROM
        E
    GROUP BY
        Category_A ,
        Category_B ,
        Category_C ,
        GrpNum
    ) ,
G AS (
    SELECT
        Category_A ,
        Category_B ,
        Category_C ,
        COUNT( DISTINCT CntntyGrp_From_Dt ) AS CntntyGrp_From_Dt_Ctd  ,
        COUNT(*) AS Rw_Cnt
    FROM
        F
    GROUP BY
    Category_A ,
    Category_B ,
    Category_C
    ) 
SELECT
    CntntyGrp_From_Dt_Ctd  ,
    MIN( Rw_Cnt ) AS Min_RwsPerSince ,
    MAX( Rw_Cnt ) AS Max_RwsPerSince ,
    COUNT(*) AS CategoryCombo_Cnt ,
    MIN( Category_A ) AS Min_Category_A ,
    MAX( Category_A ) AS Max_Category_A
FROM
    G
GROUP BY
    CntntyGrp_From_Dt_Ctd
ORDER BY
    CategoryCombo_Cnt DESC ,
    Max_RwsPerSince DESC
;

Notes

Date Span Complications

I wasn't kidding about this! Among the issues that arise, and which depend on quite what data is being stored are:

whether the dates are inclusive or not (some programmers/designers/architects get it into their head that non-inclusive dates are clever) ;
NULL replacement dates - e.g. 31 Dec 9999 for the "upto" date ;
overlapping spans left in the data ;
not storing as span pairs but as starts/ends of spans in separate rows ;
cancellations of records ;
odd, partial or complex status concepts - e.g. submitted but not approved - do you treat that as "registered" or not? I've deliberately left that kind of thing out of this example.

Luckily, regardless of what one finds, it is usually possble to write a View that handles those complications and gives a simpler set such as presumed by this example.

I could probably do a post just about handling all of these issues if people are interested.

Performance

Using multiple ROW_NUMBER calls tends to give data system admins the heebie-geebies but sensible use is usually ok.

When I ran this on a table of 470 million rows, it took 5 minutes, and returned 54 patterns. Those patterns ranged from approximately 400 million with just one run of contiguous date spans, up to single instances with as many as 96 contiguous date spans.

For this kind of "do we have a problem?" analysis I'd call that a fair investment.

Most likely the 400 million probably had just one date span record each and so didn't really need to be handled anyway. It can therefore be a good idea to restrict the "source table" feed to just those known to have multiple records. As ever with these things, your mileage may vary and you don't usually know when the extra coding was worthwhile until after you have your proof/disproof.

Background

This SQL came about as I was asked to assist with some analysis and in passing I observed that it was being assumed that all the dated records were without any gaps. That led me to show that whether this was a valid assumption (or not) could be checked by a method I'd used long ago.

Having thus put myself on the spot I then tackled proving there was such a method and thereby tested the assumption in the analysis.

At first my memory struggled but an Internet search led me to the following posting that reminded me well enough of the method for me to reconstruct it in full.
Resetting Row number according to record data change
As that spared me from combing my archives, I'd like to give the link here in appreciation.

As it happens, the "old method" I'd recalled was more about finding state changes in date sequences - but I was able to adapt it to the date span problem at hand. With that done, I felt I'd better ensure I had the state change sequences idea rendered in modern CTE style and so took some time to write out and test it. With that coded I felt I could make it into a generic form - and is what I've shared at SQL to derive subsequences

Then, I figured I should do the same with the code for the date span issue - and thus was that SQL adapted and this post written.