DEV Community

geraldew
geraldew

Posted on • Updated on

SQL to derive subsequences

This is an example of using SQL to derive subsequences according to when some value in a table changes or stays the same.

It assumes:

  • that we have a source table of DbsNm.TblNm ;
  • that there is some value ( ImportantSequentialValue ) that always indicates which records come before others ;
  • that there is one or more category columns - given here as two columns: Category_A & Category_B ;
  • that there is a column or an expression which is the thing that variously changes or stays the same: ExprValuThatMayChange .

The rough idea is to use multiple ROW_NUMBER calls:

  • one for the whole category ;
  • one for the changes within the category
  • another one for each sequential instance of change within the category

and then use the differences between these to generate a unique ID for each subsequence.

The SQL I'm giving here derives identifiers for the subsequences - as ArbitraryChangeGroupId - and then uses that for an overall summary of them and their boundary points in a final SELECT and GROUP BY.

There is a display of example data further below.

With that preamble, here is the full example SQL:

WITH
Round_1 AS (
    SELECT
        SrcTbl.* ,
        ( ExprValuThatMayChange ) AS SubsetMarker ,
        ( -- Derive CounterPerCategory
            ROW_NUMBER() OVER (
                PARTITION BY
                    Category_A ,
                    Category_B
                ORDER BY
                    ImportantSequentialValue )
            ) AS CounterPerCategory ,
        ( -- Derive CounterPerSubsetCategory
            ROW_NUMBER() OVER (
                PARTITION BY
                    Category_A ,
                    Category_B ,
                    SubsetMarker
                ORDER BY
                    ImportantSequentialValue )
            ) AS CounterPerSubsetCategory ,
        ( -- Derive SubGrpNum
            CounterPerCategory
            -
            CounterPerSubsetCategory
            ) AS SubGrpNum
    FROM
        DbsNm.TblNm AS SrcTbl
    ) ,
Round_2 AS (
    SELECT
        Round_1.* ,
        ( -- Derive ResetCounterPerCategorySubset
            ROW_NUMBER() OVER (
                PARTITION BY
                    Category_A ,
                    Category_B ,
                    SubGrpNum
                ORDER BY
                    ImportantSequentialValue )
            ) AS ResetCounterPerCategorySubset ,
        ( -- Derive ArbitraryChangeGroupId
            CounterPerSubsetCategory
            -
            ResetCounterPerCategorySubset
            ) AS ArbitraryChangeGroupId
    FROM
        Round_1
    )
-- Example summary by the Arbitrary Change Groups SELECT
SELECT
    Category_A ,
    Category_B ,
    ArbitraryChangeGroupId ,
    MIN( ImportantSequentialValue ) AS SubGrp_MinSeqVal ,
    MAX( ImportantSequentialValue ) AS SubGrp_MaxSeqVal ,
    COUNT(*) AS SubGrp_Rw_Cnt
FROM
    Round_2
GROUP BY
    Category_A ,
    Category_B ,
    ArbitraryChangeGroupId
ORDER BY
    SubGrp_MinSeqVal
;
Enter fullscreen mode Exit fullscreen mode

And here are all the partial steps plotted out to show quite why it works.

  • note that I've separated out, that which becomes column G, as E & F so you can see how the values in G come about;
  • for simplicity I've given just one "category" column and we're only looking at one value in it.
Cat Seq Valu D E F G H I J
Eric ab This 1 1 1 0 1 0
Eric bc This 2 2 2 0 2 0
Eric cd That 3 1 1 2 1 2
Eric de That 4 2 2 2 2 2
Eric ef That 5 3 3 2 3 2
Eric fg This 6 3 3 3 1 5
Eric gh This 7 4 4 3 2 5
Eric hi That 8 4 4 4 1 7
Eric jk This 9 5 5 4 1 8
Eric kl This 10 6 6 4 2 8
Eric lm This 11 7 7 4 3 8

About those columns:

Column In source Meaning or derivation Calc
Cat Y Category A one or more such columns
Seq Y ImportantSequentialValue is often a timestamp
Valu Y ExprValuThatMayChange
D - CounterPerCategory row#
E - CounterPerSubsetCategory for This see G
F - CounterPerSubsetCategory for That see G
G - CounterPerSubsetCategory for Either row#
H - SubGrpNum D minus G
I - ResetCounterPerCategorySubset row#
J - ArbitraryChangeGroupId D minus I

Notes

You will see that the ArbitraryChangeGroupId values are not contiguous, but that for this purpose, that simply doesn't matter.

Do note that in the SQL I've separated out all the steps so that each partial result is present as a named column. In practice, only some of those need to be made explicit, the others can just exist as expressions within the derivations.

How it is best to do that will vary with the query engine being used and its approaches to optimisation. Some will just work around the named columns anyway, conversely some will become inefficient if unable to recognise that identical uses of ROW_NUMBER are happening. As the saying goes: your mileage may vary.

In my work, I have an SQL script processing tool that lets me store the script as a pro forma and then have text substitutions made at run time. Thus I can affirm that the above SQL really works, as I used it via such substitutions for an actual run on real data.

Background

I originally wrote a version of this method a great many years ago. Of course, old hacks like me tend to say that kind of thing, and I probably do have the code somewhere in my archive.

I can't quite remember whether I had access to ROW_NUMBER as a window function at the time, or had to derive something similar the hard way. I can say that the dialect I was using then didn't support CTE syntax, so I would have entirely used the derived table syntax.

Recently the problem of doing this kind of thing came up when I was asked for help out with some analysis. While I knew I had a method somewhere in my past, I couldn't quite remember how it went.

So, like anyone, I net searched and it was this posting that reminded me of the method.
Resetting Row number according to record data change
That only covers the reset counter aspect but it was enough to jog my memory for me to write out and test the full solution - so I'd like to give the link here in appreciation.

Teradata Calendar substitution

One of the test substitutions that I ran (see Notes section) was to apply the method to the Teradata System Calendar. Therefore this is only of use to those to those using a Teradata, but hey, I've no reason not to share this too. It was interesting coming up with a simple made-up category for the "change" values.

WITH
Round_1 AS (
    SELECT
        SrcTbl.* ,
        ( ( day_of_week + day_of_month + day_of_year + day_of_calendar + weekday_of_month ) MOD 2 ) AS SubsetMarker ,
        ( -- Derive CounterPerCategory
            ROW_NUMBER() OVER (
                PARTITION BY
                    year_of_calendar ,
                    quarter_of_calendar
                ORDER BY
                    calendar_date )
            ) AS CounterPerCategory ,
        ( -- Derive CounterPerSubsetCategory
            ROW_NUMBER() OVER (
                PARTITION BY
                    year_of_calendar ,
                    quarter_of_calendar ,
                    SubsetMarker
                ORDER BY
                    calendar_date )
            ) AS CounterPerSubsetCategory ,
        ( -- Derive SubGrpNum
            CounterPerCategory
            -
            CounterPerSubsetCategory
            ) AS SubGrpNum
    FROM
        sys_calendar.calendar AS SrcTbl
    ) ,
Round_2 AS (
    SELECT
        Round_1.* ,
        ( -- Derive
            ROW_NUMBER() OVER (
                PARTITION BY
                    year_of_calendar ,
                    quarter_of_calendar ,
                    SubGrpNum
                ORDER BY
                    calendar_date )
            ) AS ResetCounterPerCategorySubset ,
        ( CounterPerSubsetCategory
            -
            ResetCounterPerCategorySubset
            ) AS ArbitraryChangeGroupId
    FROM
        Round_1
    )
-- Example summary by the Arbitrary Change Groups SELECT
SELECT
    year_of_calendar ,
    quarter_of_calendar ,
    ArbitraryChangeGroupId ,
    MIN( calendar_date  ) AS SubGrp_MinSeqVal ,
    MAX( calendar_date  ) AS SubGrp_MaxSeqVal ,
    COUNT(*) AS SubGrp_Rw_Cnt
FROM
    Round_2
GROUP BY
    year_of_calendar ,
    quarter_of_calendar ,
    ArbitraryChangeGroupId
ORDER BY
    SubGrp_MinSeqVal
;
Enter fullscreen mode Exit fullscreen mode

Top comments (0)