This is an example of using SQL to derive subsequences according to when some value in a table changes or stays the same.
- that we have a source table of
- that there is some value (
ImportantSequentialValue) that always indicates which records come before others ;
- that there is one or more category columns - given here as two columns:
- that there is a column or an expression which is the thing that variously changes or stays the same:
The rough idea is to use multiple ROW_NUMBER calls:
- one for the whole category ;
- one for the changes within the category
- another one for each sequential instance of change within the category
and then use the differences between these to generate a unique ID for each subsequence.
The SQL I'm giving here derives identifiers for the subsequences - as
ArbitraryChangeGroupId - and then uses that for an overall summary of them and their boundary points in a final SELECT and GROUP BY.
There is a display of example data further below.
With that preamble, here is the full example SQL:
WITH Round_1 AS ( SELECT SrcTbl.* , ( ExprValuThatMayChange ) AS SubsetMarker , ( -- Derive CounterPerCategory ROW_NUMBER() OVER ( PARTITION BY Category_A , Category_B ORDER BY ImportantSequentialValue ) ) AS CounterPerCategory , ( -- Derive CounterPerSubsetCategory ROW_NUMBER() OVER ( PARTITION BY Category_A , Category_B , SubsetMarker ORDER BY ImportantSequentialValue ) ) AS CounterPerSubsetCategory , ( -- Derive SubGrpNum CounterPerCategory - CounterPerSubsetCategory ) AS SubGrpNum FROM DbsNm.TblNm AS SrcTbl ) , Round_2 AS ( SELECT Round_1.* , ( -- Derive ResetCounterPerCategorySubset ROW_NUMBER() OVER ( PARTITION BY Category_A , Category_B , SubGrpNum ORDER BY ImportantSequentialValue ) ) AS ResetCounterPerCategorySubset , ( -- Derive ArbitraryChangeGroupId CounterPerSubsetCategory - ResetCounterPerCategorySubset ) AS ArbitraryChangeGroupId FROM Round_1 ) -- Example summary by the Arbitrary Change Groups SELECT SELECT Category_A , Category_B , ArbitraryChangeGroupId , MIN( ImportantSequentialValue ) AS SubGrp_MinSeqVal , MAX( ImportantSequentialValue ) AS SubGrp_MaxSeqVal , COUNT(*) AS SubGrp_Rw_Cnt FROM Round_2 GROUP BY Category_A , Category_B , ArbitraryChangeGroupId ORDER BY SubGrp_MinSeqVal ;
And here are all the partial steps plotted out to show quite why it works.
- note that I've separated out, that which becomes column G, as E & F so you can see how the values in G come about;
- for simplicity I've given just one "category" column and we're only looking at one value in it.
About those columns:
|Column||In source||Meaning or derivation||Calc|
|Cat||Y||Category A||one or more such columns|
|Seq||Y||ImportantSequentialValue||is often a timestamp|
|E||-||CounterPerSubsetCategory for This||see G|
|F||-||CounterPerSubsetCategory for That||see G|
|G||-||CounterPerSubsetCategory for Either||row#|
|H||-||SubGrpNum||D minus G|
|J||-||ArbitraryChangeGroupId||D minus I|
You will see that the
ArbitraryChangeGroupId values are not contiguous, but that for this purpose, that simply doesn't matter.
Do note that in the SQL I've separated out all the steps so that each partial result is present as a named column. In practice, only some of those need to be made explicit, the others can just exist as expressions within the derivations.
How it is best to do that will vary with the query engine being used and its approaches to optimisation. Some will just work around the named columns anyway, conversely some will become inefficient if unable to recognise that identical uses of ROW_NUMBER are happening. As the saying goes: your mileage may vary.
In my work, I have an SQL script processing tool that lets me store the script as a pro forma and then have text substitutions made at run time. Thus I can affirm that the above SQL really works, as I used it via such substitutions for an actual run on real data.
I originally wrote a version of this method a great many years ago. Of course, old hacks like me tend to say that kind of thing, and I probably do have the code somewhere in my archive.
I can't quite remember whether I had access to ROW_NUMBER as a window function at the time, or had to derive something similar the hard way. I can say that the dialect I was using then didn't support CTE syntax, so I would have entirely used the derived table syntax.
Recently the problem of doing this kind of thing came up when I was asked for help out with some analysis. While I knew I had a method somewhere in my past, I couldn't quite remember how it went.
So, like anyone, I net searched and it was this posting that reminded me of the method.
Resetting Row number according to record data change
That only covers the reset counter aspect but it was enough to jog my memory for me to write out and test the full solution - so I'd like to give the link here in appreciation.
One of the test substitutions that I ran (see Notes section) was to apply the method to the Teradata System Calendar. Therefore this is only of use to those to those using a Teradata, but hey, I've no reason not to share this too. It was interesting coming up with a simple made-up category for the "change" values.
WITH Round_1 AS ( SELECT SrcTbl.* , ( ( day_of_week + day_of_month + day_of_year + day_of_calendar + weekday_of_month ) MOD 2 ) AS SubsetMarker , ( -- Derive CounterPerCategory ROW_NUMBER() OVER ( PARTITION BY year_of_calendar , quarter_of_calendar ORDER BY calendar_date ) ) AS CounterPerCategory , ( -- Derive CounterPerSubsetCategory ROW_NUMBER() OVER ( PARTITION BY year_of_calendar , quarter_of_calendar , SubsetMarker ORDER BY calendar_date ) ) AS CounterPerSubsetCategory , ( -- Derive SubGrpNum CounterPerCategory - CounterPerSubsetCategory ) AS SubGrpNum FROM sys_calendar.calendar AS SrcTbl ) , Round_2 AS ( SELECT Round_1.* , ( -- Derive ROW_NUMBER() OVER ( PARTITION BY year_of_calendar , quarter_of_calendar , SubGrpNum ORDER BY calendar_date ) ) AS ResetCounterPerCategorySubset , ( CounterPerSubsetCategory - ResetCounterPerCategorySubset ) AS ArbitraryChangeGroupId FROM Round_1 ) -- Example summary by the Arbitrary Change Groups SELECT SELECT year_of_calendar , quarter_of_calendar , ArbitraryChangeGroupId , MIN( calendar_date ) AS SubGrp_MinSeqVal , MAX( calendar_date ) AS SubGrp_MaxSeqVal , COUNT(*) AS SubGrp_Rw_Cnt FROM Round_2 GROUP BY year_of_calendar , quarter_of_calendar , ArbitraryChangeGroupId ORDER BY SubGrp_MinSeqVal ;