Using SQL to discover problem dates in your data

The Issue

When you work on a data warehouse, it's enough of an issue that there are probably a great many tables to be used in your analysis.

Lots of tables means lots of columns, and most likely a lot of those columns will hold date data types. Anywhere there are dates, there is the potential for absurd values to show up in them.

Where I work, a longstanding policy is to not allow vital date columns to support NULL values. Nevertheless, NULL values are needed, often for an end-date field, as a way to indicate that the record is still current. As a consequence, there is a practice of using the maximum possible date of 31st December 9999 as the pseudo-NULL value. Inevitably though, it seems that not all processes get written using the exact correct value, and so numerous instances of 1st January 9999 show up, along with other misunderstandings of what should be used.

While I have a fairly stock method for handling these things when I'm analysing end-date values - as per the NULLIF expression:

SELECT
    NULLIF( Cln_Dt, ( '31/12/9999' (DATE, FORMAT 'dd/mm/yyyy') ) ) AS EnNulled_Cln_Dt

and which can be easily extended to handle a few such values:

SELECT
    NULLIF( 
        NULLIF( 
            NULLIF( 
                Cln_Dt, 
                ( '31/12/9999' (DATE, FORMAT 'dd/mm/yyyy') ) 
                ),
            ( '01/01/9999' (DATE, FORMAT 'dd/mm/yyyy') ) 
            ),
        ( '31/12/9000' (DATE, FORMAT 'dd/mm/yyyy') ) 
        )
        AS EnNulled_Cln_Dt

but ultimately the problem becomes one of knowing exactly which values to write the code to explicitly handle.

Besides the specific date values, there is also the issue of knowing how often this kind of thing is occurring.
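
For a single column, both questions can be answered with an ad-hoc frequency query. Here is a minimal sketch - the names SomeDbs.SomeTable and Cln_Dt are placeholders, and the 100-year cutoff is just an arbitrary choice for illustration:

SELECT
  Cln_Dt AS DateValue ,
  COUNT(*) AS PresenceCount
FROM
  SomeDbs.SomeTable
WHERE
  EXTRACT( YEAR FROM Cln_Dt ) > EXTRACT( YEAR FROM CURRENT_DATE ) + 100
GROUP BY
  Cln_Dt
ORDER BY
  PresenceCount DESC
;

Writing that once is easy; the rest of this article is about doing it systematically across thousands of columns.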

The Absent Solution

Alas, the most likely thing you're thinking as you read this is: surely any organisation with this kind of problem would acquire some kind of Data Quality tool and tackle it head-on?

Yes, you are probably right to think that, and I'm certainly not going to argue against the idea. However I can honestly say that in over 20 years of doing data analysis work I've never known any actual progress to have been made on issues like this. Or maybe it has, but new processes for messing up data keep springing up.

  • I could write a whole book on that topic, don't tempt me!

So what I present here is something that can be done by anyone with the ability to write SQL and sufficient access permissions to create and populate a holding table. I don't claim it is super-clever - it is just something I wrote in half an afternoon and confirmed that it works.

What you see is the SQL I wrote - slightly changed to not use names specific to my workplace - and then some annotations added to describe the code.

The Method

As I work on a Teradata system, I'll be using its syntax and features, namely:

  • making a couple of tables
  • making a number of views
  • making a number of macros

Also, I will be using the way that Teradata provides a "data dictionary" - i.e. a way to query the system itself to find out what databases, tables and columns are on the system. In Teradata, this is done via a pseudo database called DBC and various views within it. For our purpose here we'll just use one of those - one that lists all the columns - which is unsurprisingly called DBC.ColumnsV.
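
To give a feel for what DBC.ColumnsV holds, here is a quick sketch that lists the columns of everything in a placeholder database, along with the type code that DBC records for them:

SELECT
  TRIM( DatabaseName ) AS Dbn ,
  TRIM( TableName ) AS Tbn ,
  TRIM( ColumnName ) AS Cln ,
  ColumnType
FROM
  DBC.ColumnsV
WHERE
  DatabaseName = 'SpecificDbs'
ORDER BY
  Tbn ,
  Cln
;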

One other thing that I use for this is that my SQL tool allows me to run an SQL statement - one that has been written to generate yet more SQL as a set of return values - and then execute those return values. If you don't have that kind of feature in your tooling, that's OK - because you can usually manually copy the generated SQL over from being a result to being submitted as more statements to execute.
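
If the idea of SQL that writes SQL is unfamiliar, here is a trivially small sketch of the pattern - each returned row is itself a runnable statement, in this case a row count for every base table in a placeholder database:

SELECT
  ( 'SELECT COUNT(*) FROM ' ||
    TRIM( DatabaseName ) || '.' || TRIM( TableName ) ||
    ' ;' ) AS SqlStr
FROM
  DBC.TablesV
WHERE
  DatabaseName = 'SpecificDbs'
  AND
  TableKind = 'T'  /* base tables only */
;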

Corresponding to that, there are some issues about how to execute very large numbers of SQL statements and how to abandon them partway through. Those issues will not be covered in this article - but the method is structured to deal with the after effects of such interruptions.

Similarly, in other systems or dialects the implementation may need to use differently-named features instead - e.g. "stored procedures" - and this article won't attempt to cover what those might be.

The SQL

Create Two Tables - Bank and Run

I will want two tables to hold findings:

  • one for use during a run
  • and one for storing the latest findings across all runs.

While I work on Teradata quite a lot, there can be slight variations among setups, so I have a stock method of using a "CREATE AS" to generate a prototype table. I then do a SHOW TABLE command to see what the DDL looks like and then adapt it for the real table.

Here's my prototype table maker:

CREATE TABLE
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
AS
(
SELECT
  DatabaseName ,
  TableName ,
  ColumnName ,
  CAST( NULL AS DATE) AS DateValue ,
  CAST( NULL AS INTEGER ) AS PresenceCount ,
  CAST( NULL AS DATE) AS LastCheckedAtDate
FROM
  DBC.ColumnsV
WHERE
  DatabaseName IS NULL
  AND
  TableName IS NULL
  AND
  ColumnName IS NULL
)
WITH
  NO DATA
;

Next is how we see the DDL for the prototype table.

SHOW TABLE
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
;

I clip the output of that to a text editor and make some small changes.

Next we drop the bank table - in case you're wondering, Teradata (still) does not provide a "drop if exists" statement in its SQL dialect.

DROP TABLE
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
;

Next we make the real bank table - this is my customisation of the DDL from the prototype table. In this case I just added the UPI (unique primary index), which is the Teradata equivalent of a primary key setting. Another Teradata feature is SET TABLE, for which the other option is MULTISET TABLE, but I'm not going to explain the distinction here. I think "set tables" are a matter of good practice.

CREATE SET TABLE 
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank ,
    FALLBACK ,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL,
    CHECKSUM = DEFAULT,
    DEFAULT MERGEBLOCKRATIO,
    MAP = TD_MAP1
(
  DatabaseName VARCHAR(128) CHARACTER SET UNICODE NOT CASESPECIFIC,
  TableName VARCHAR(128) CHARACTER SET UNICODE NOT CASESPECIFIC,
  ColumnName VARCHAR(128) CHARACTER SET UNICODE NOT CASESPECIFIC,
  DateValue DATE FORMAT 'yyyy-mm-dd',
  PresenceCount INTEGER,
  LastCheckedAtDate DATE FORMAT 'yyyy-mm-dd'
  )
UNIQUE PRIMARY INDEX ( 
  DatabaseName ,
  TableName ,
  ColumnName ,
  DateValue
)
;

Next I have a statement to drop the per-run table, which I would only need to do here if I was re-doing this sequence, because otherwise the table doesn't yet exist.

DROP TABLE
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
;

Next we make the per-run table - this is the same as for the bank table and Teradata lets us do a simple clone of the structure.

CREATE TABLE
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
AS
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
WITH
  NO DATA
;

Define Views and Macros

Next we define a View that exists just to give us a constant value of the number of days to wait before refreshing the check for extreme dates on a specific column. Teradata allows for a SELECT with no FROM clause, which makes this quite simple.

  • If you're wondering why this is a View and not merely a matter of having the value quoted as a literal where it gets used, this method allows us to independently re-write this view to change the behaviour of our little system. While this is a small example, in other situations this approach of putting definitions in views can allow for quite sophisticated method variations.

REPLACE VIEW
  myuid.Prefix_DQ_C_DaystoRefresh
AS
(
SELECT
  150 AS DaystoRefresh
)
;

Next we define a View for how to get a list of date columns from the system. Here we are using knowledge about the DBC database on a Teradata system and its view holding metadata for the columns in every view and table.

REPLACE VIEW
  myuid.Prefix_DQ_C_DTC_DateColumns_Potential
AS
(
SELECT
  TRIM( C_V.DatabaseName ) AS Dbn ,
  TRIM( C_V.TableName ) AS Tbn ,
  TRIM( C_V.ColumnName ) AS Cln
FROM
  DBC.ColumnsV AS C_V
WHERE
  /* -- later need to find a way to get the real data types for columns in views
  C_V.Type IN ( )
  AND
  */
  C_V.DatabaseName IN ( 'SpecificDbs' )
  /* AND
  TableName IS NULL */
  AND
  C_V.ColumnName LIKE '%_Dt'
)
;

Note that I've left myself some code in a bracketed comment in case I wanted to filter the columns by actually having a DATE data type. In my case, the specific database I wanted to run analyses on was a space of only Views, and this resource does not hold their data types. Hence I've filtered them by assuming that their names would all end with "_Dt".
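
For what it's worth, when the target database holds base tables rather than views, DBC.ColumnsV does record the declared type, so the commented-out filter could be fleshed out along these lines - a sketch only, since it wouldn't help for the view-only database in my case:

WHERE
  C_V.ColumnType = 'DA'  /* 'DA' is the DBC type code for DATE columns in base tables */
  AND
  C_V.DatabaseName IN ( 'SpecificDbs' )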

Next we define a View for how to get a list of date columns to inspect this time around. While the base of this is the view we just defined for fetching the metadata, my data warehouse is so large that this would give me thousands of columns to analyse. While I do want to analyse them all, on any day when I run this, I won't want to re-analyse ones that I've done somewhat recently.

To that end, I compare to the Bank table and look at the dates a column was last analysed. Similarly, if I had to interrupt during a run then I don't want to re-analyse ones that I've just done, but which are pending my end-of-run step to merge into the Bank table.

Both of those checks are simply LEFT OUTER JOINs and testing for a match through them.

REPLACE VIEW
  myuid.Prefix_DQ_C_DTC_DateColumns_BeyondDaystoRefresh
AS
(
SELECT
  GetDateCols.Dbn ,
  GetDateCols.Tbn ,
  GetDateCols.Cln ,
  ( CURRENT_DATE - Overdue_Holdings.Max_LastCheckedAtDate ) AS DaysSinceLastCheck ,
  CASE 
    WHEN DaysSinceLastCheck IS NULL THEN 'A'
    ELSE 'B' 
    END AS LastCheckCategory
FROM
  myuid.Prefix_DQ_C_DaystoRefresh AS D_T_R
  CROSS JOIN
  myuid.Prefix_DQ_C_DTC_DateColumns_Potential AS GetDateCols
  LEFT OUTER JOIN
  ( -- Overdue_Holdings
    SELECT
      DatabaseName ,
      TableName ,
      ColumnName ,
      MAX( LastCheckedAtDate ) AS Max_LastCheckedAtDate
    FROM
      TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
    GROUP BY
      DatabaseName ,
      TableName ,
      ColumnName
    ) AS Overdue_Holdings ON
    Overdue_Holdings.DatabaseName = GetDateCols.Dbn
    AND
    Overdue_Holdings.TableName = GetDateCols.Tbn
    AND
    Overdue_Holdings.ColumnName = GetDateCols.Cln
  LEFT OUTER JOIN
  ( -- During_Run = this is here to allow for the meta-SQL generation step to be stopped and re-started
    SELECT
      DatabaseName ,
      TableName ,
      ColumnName ,
      MAX( LastCheckedAtDate ) AS Max_LastCheckedAtDate
    FROM
      TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
    GROUP BY
      DatabaseName ,
      TableName ,
      ColumnName
    ) AS During_Run ON
    During_Run.DatabaseName = GetDateCols.Dbn
    AND
    During_Run.TableName = GetDateCols.Tbn
    AND
    During_Run.ColumnName = GetDateCols.Cln
WHERE
  ( -- due when never banked, or last checked more than DaystoRefresh days ago
    Overdue_Holdings.Max_LastCheckedAtDate IS NULL
    OR
    Overdue_Holdings.Max_LastCheckedAtDate + D_T_R.DaystoRefresh <= CURRENT_DATE
    )
  AND
  During_Run.Max_LastCheckedAtDate IS NULL
)
;

Next we define a View to provide a constant value for the number of years after the current year to treat as an extreme date. The setting here is quite arbitrary, but you should think carefully before setting it to zero or one - you might be surprised to find out how often your data validly includes "next year" or even "next decade" values in various places.

REPLACE VIEW
  myuid.Prefix_DQ_MetaSQL_V_YearsAfterCurrent
AS
(
SELECT
  150 AS YearsToSkip_Int ,
  TRIM( CAST( YearsToSkip_Int AS VARCHAR(8) ) ) AS YearsToSkip_Str
)
;

Next we define the View that will be the SQL script builder for analysing the columns. I've been writing this kind of thing for many years. While it's always a bit messy doing this, hopefully you can read through the construction aspects and get a feel for what the SQL it generates will look like.

REPLACE VIEW
  myuid.Prefix_DQ_MetaSQL_V_ExtremeDatesInColumns
AS
(
SELECT
  ( 
    'INSERT INTO ' || 
      'TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run ' || 
      '( DatabaseName, TableName, ColumnName, DateValue, PresenceCount, LastCheckedAtDate ) ' ||
    'SELECT ' ||
      '''' || Dbn || ''' AS Db_N , ' ||
      '''' || Tbn || ''' AS Tb_N , ' ||
      '''' || Cln || ''' AS Cl_N , ' ||
      Cln || ' AS DateValue , ' ||
      'COUNT( ' || Cln || ') AS PresenceCount , ' ||
      'CURRENT_DATE AS LastCheckedAtDate ' ||
    'FROM ' ||
      Dbn || '.' || Tbn || ' ' ||
    'WHERE ' ||
      'EXTRACT( YEAR FROM ' || Cln || ') > EXTRACT( YEAR FROM CURRENT_DATE ) + ' || Y_A_C.YearsToSkip_Str || ' ' ||
    'GROUP BY ' ||
      Cln || ' ' ||
    ';'
    ) AS SqlStr ,
  ( LastCheckCategory || '#' ||
    CAST( ( DaysSinceLastCheck (FORMAT '99999') ) AS CHAR(2) ) || ':' ||
    ( Dbn || '.' || Tbn || '.' || Cln ) 
    ) AS Sorter
FROM
  myuid.Prefix_DQ_MetaSQL_V_YearsAfterCurrent AS Y_A_C
  CROSS JOIN
  myuid.Prefix_DQ_C_DTC_DateColumns_BeyondDaystoRefresh AS GetDateCols
)
;

While the sorter column is both optional and arbitrary, the way I've constructed it here will let us prioritise columns that have not been analysed yet and then those done longest ago.

Do note that Teradata Views are not allowed to have an ORDER BY clause, so while we can provide the sorting column, actually applying it will have to wait for a Macro.

Next we have (surprise!) a Teradata Macro, which really doesn't do anything other than select the generated SQL strings, in a prudent order.

Also note that the generated SQL will all be INSERTs into the Run table.

REPLACE MACRO
  myuid.Prefix_DQ_MetaSQL_M_ExtremeDatesInColumns
AS
(
SELECT
  SqlStr
FROM
  myuid.Prefix_DQ_MetaSQL_V_ExtremeDatesInColumns
ORDER BY
  Sorter
;
)
;

Being a SELECT, that macro will return rows when it gets run.
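
To show the shape of what comes back, each returned row will be a single string along the following lines - here wrapped over several lines for readability, and with hypothetical table and column names Client_Accounts and End_Dt standing in for real ones:

INSERT INTO TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run ( DatabaseName, TableName, ColumnName, DateValue, PresenceCount, LastCheckedAtDate )
SELECT 'SpecificDbs' AS Db_N , 'Client_Accounts' AS Tb_N , 'End_Dt' AS Cl_N ,
  End_Dt AS DateValue , COUNT( End_Dt ) AS PresenceCount , CURRENT_DATE AS LastCheckedAtDate
FROM SpecificDbs.Client_Accounts
WHERE EXTRACT( YEAR FROM End_Dt ) > EXTRACT( YEAR FROM CURRENT_DATE ) + 150
GROUP BY End_Dt ;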

Next we define a Macro for deleting all the rows in the Run table.

REPLACE MACRO
  myuid.Prefix_DQ_R_ExtremeDatesInColumns_ClearRun
AS
(
DELETE FROM
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
;
)
;

Next we define a Macro to perform the merge of the findings that were inserted into the Run table.

To old people like me, this uses the "new" MERGE INTO statement, but really it's been in Teradata for quite a long time now. For any Teradata users not familiar with it, you should note that its internal execution was written in a way that is much more efficient than either the older UPDATE or INSERT commands (but you'd need to read the Teradata doco to see why that is). The syntax and run-time logic of this command can take some getting used to - but in this situation it fits the bill perfectly, and hopefully is fairly apparent.

REPLACE MACRO
  myuid.Prefix_DQ_R_ExtremeDatesInColumns_MergeRun
AS
(
MERGE INTO
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank AS T_Bnk
USING
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run AS T_Run ON
    T_Run.DatabaseName = T_Bnk.DatabaseName
    AND
    T_Run.TableName = T_Bnk.TableName
    AND
    T_Run.ColumnName = T_Bnk.ColumnName
    AND
    T_Run.DateValue = T_Bnk.DateValue
WHEN MATCHED THEN UPDATE SET 
  PresenceCount = T_Run.PresenceCount ,
  LastCheckedAtDate = T_Run.LastCheckedAtDate
WHEN NOT MATCHED THEN INSERT 
  (
    DatabaseName ,
    TableName ,
    ColumnName ,
    DateValue ,
    PresenceCount ,
    LastCheckedAtDate )
   VALUES (
    T_Run.DatabaseName ,
    T_Run.TableName ,
    T_Run.ColumnName ,
    T_Run.DateValue ,
    T_Run.PresenceCount ,
    T_Run.LastCheckedAtDate )
;
)
;

Perform a Run

Now, having created our tables, and defined all the views and macros, we can actually do a run.

First, clear the Run table by running the macro for that.

EXECUTE
  myuid.Prefix_DQ_R_ExtremeDatesInColumns_ClearRun
;

Next we run the macro that generates the SQL.

EXECUTE
  myuid.Prefix_DQ_MetaSQL_M_ExtremeDatesInColumns
;

Assuming you don't have a tool that can automate this, you can just copy all that output as text and re-paste it back into wherever you submit your SQL.
Perhaps try just the first statement on its own to start with.

At this point, the design of our little system doesn't care how many of those rows you transplant over to being submitted. Perhaps keep doing a few until at least one indicates that it actually inserted something into the Run table.
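
If you want to check how the run is tracking, a quick peek at the Run table will do it - a simple sketch:

SELECT
  COUNT(*) AS ExtremeDateRows ,
  COUNT( DISTINCT TRIM( DatabaseName ) || '.' || TRIM( TableName ) || '.' || TRIM( ColumnName ) ) AS ColumnsWithFindings
FROM
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Run
;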

Next, after we've run as many of those as we feel like doing, we run the macro to merge the Run discoveries into the Bank table.

EXECUTE
  myuid.Prefix_DQ_R_ExtremeDatesInColumns_MergeRun
;

With all that done, we inspect the results now in the Bank table.

SELECT
  *
FROM
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
ORDER BY
  DatabaseName ,
  TableName ,
  ColumnName ,
  PresenceCount DESC ,
  DateValue DESC
;  
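
Depending on how much has been covered so far, a rolled-up summary can also be handy for spotting the worst offending columns - a sketch:

SELECT
  DatabaseName ,
  TableName ,
  ColumnName ,
  COUNT(*) AS DistinctExtremeDates ,
  SUM( PresenceCount ) AS TotalExtremeRows ,
  MAX( LastCheckedAtDate ) AS LastCheckedAtDate
FROM
  TempDbs.Prefix_DQ_ExtremeDatesInColumns_Bank
GROUP BY
  DatabaseName ,
  TableName ,
  ColumnName
ORDER BY
  TotalExtremeRows DESC
;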

Addendum

As it happened, I wrote all the above and ran it and it chugged along nicely, taking different amounts of time for various tables and the date columns in them.

However, when I inspected the results I realised they included something that I really wasn't interested in - namely that it had counted all the occurrences of the correct dummy date of 31 December 9999.

Because really, my interest was to show various people my findings of unusual values - for discussion about when to treat them as erroneous data and when to just treat them as alternate "NULL" end dates.

The 31 December 9999 dates being in the results aren't themselves a problem, but if I want to avoid having them in there at all, then here are a couple of extra lines to insert into the SQL builder - replacing the WHERE portion shown earlier - so that it doesn't bother collecting those values.

   'WHERE ' ||
      'NULLIF( ' || Cln || ' , ( ''31/12/9999'' (DATE, FORMAT ''dd/mm/yyyy'') ) ) IS NOT NULL ' ||
      'AND ' ||
      'EXTRACT( YEAR FROM ' || Cln || ') > EXTRACT( YEAR FROM CURRENT_DATE ) + ' || Y_A_C.YearsToSkip_Str || ' ' ||

Summary

In a better world, problem dates would find enthusiastic people and tools for correcting errant values and managing data in organised ways. But while you wait for that to spontaneously happen (perhaps don't hold your breath) you can at least use your own skills to get a handle on the scale of the problem, with just plain old SQL.
