DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’»

DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’» is a community of 963,673 amazing developers

We're a place where coders share, stay up-to-date and grow their careers.

Create account Log in
geraldew
geraldew

Posted on • Updated on

A Data Spider in Recursive SQL

Overview

This is an example of recursive querying following chains of linked records. Of note, is that it implements a concept of controlled following by link types. You don't need recursion to implement this, but it does let us do it all in a single query.

Background

I was recently reminded me of some of my data spider methods from long ago. Indeed, from so long enough ago that many features we can now take for granted in SQL were not then generally available. As a consequence, my early methods required the use of several tables and a process of repeatedly executing a subset of the script. That method included the controllable link following that I've brought through into this example.

A question that came up at my workplace made me curious as to how I might re-code that method with today's SQL features. So that's what I've done here. Do bear in mind, that this is therefore a thought experiment - I'm not suggesting this is practical code. I can vouch that I checked it really works, although I've had to anonymise and abstract the explicit table references for sharing here. My workplace uses Teradata so the example is that dialect, but should be adaptable to other similar dialects.

I've made some attempt at explaining the concepts involved, but I can't really guess how well that will read. I will describe the recursion in detail but I am not writing this to introduce a reader to recursion in SQL.

Prelude

Some things are worth mentioning before looking at the code.

The overall structure is:

  • a series of Common Table Expressions (CTEs)
  • a SELECT FROM using those CTEs

As it happens, a Common Table Expression is how SQL implements the recursive aspect - so just one of those is recursive.

Here I’ve also used CTEs to abstract the fetching of the real-named-tables – so that the main body of the query doesn’t need to be reworked if/when we change the tables we’re really pulling from.

That is why there is a CTE called BaseFetch and a CTE called LinkFetch and both of these have some pseudo-references that you'll need to replace from your real table and column names to make this code actually run.

The pseudo-references are for a link source table of YourDb.YourTableOfLinks with columns YourThisId YourThatId and YourLinkTypeId

In this example, to make sure I always get records that will link, I’ve sourced data for BaseFetch from the same place as where all the links are, and filtered for just the links I’m going to be following - with a GROUP BY to ensure unique starting points.

You will notice that there’s also a CTE - LinkRules - at the very front with lots of CASE expressions. I’ll also cover that one later - along with the Teradata-specific calendar trick it uses to produce control data out of thin air.

Enough prelude - here's the script - then I'll try to describe and explain it.

Script

-- Example of Recursive Link Spider Fetch with a controllable following of link types
WITH
  LinkRules AS
(
  SELECT
    CASE
      WHEN day_of_calendar = 1 THEN 11 -- A to B
      WHEN day_of_calendar = 2 THEN 12 -- C to D
      WHEN day_of_calendar = 3 THEN 13 -- E to F
      WHEN day_of_calendar = 4 THEN 14 -- G to H
      END AS Lnkd_TypId ,
    CASE
      WHEN day_of_calendar = 1 THEN 2 -- Follow forward
      WHEN day_of_calendar = 2 THEN 2 -- Follow forward
      WHEN day_of_calendar = 3 THEN 1 -- Link to but stop
      WHEN day_of_calendar = 4 THEN 0 -- Don't use at all
      END AS Lnk_Go_Ind
  FROM
    sys_calendar.calendar
  WHERE
    day_of_calendar < 5
   ) ,
  BaseFetch AS  
(
  SELECT 
    YourThisId AS Root_Id
  FROM
    YourDb.YourTableOfLinks
  WHERE
    YourLinkTypeId IN (
      SELECT
        Lnkd_TypId 
      FROM
        LinkRules )
    GROUP BY
        Root_Id
  ) ,
  LinkFetch AS  
(
  SELECT 
    L_A.YourThisId AS From_Id ,
    L_A.YourThatId AS Onto_Id ,
    L_A.YourLinkTypeId AS Lnkd_TypId ,
    R_A.Lnk_Go_Ind
  FROM
    YourDb.YourTableOfLinks AS L_A
    INNER JOIN
    LinkRules AS R_A ON
      L_A. = R_A.Lnkd_TypId 
      AND
      R_A.Lnk_Go_Ind > 0
  ) ,
  RECURSIVE RecurseFetch AS
(
  SELECT
    A_F.Root_Id ,
    A_F.Root_Id AS From_Id ,
    A_F.Root_Id AS Onto_Id ,
    CAST( NULL AS SMALLINT ) AS Lnkd_TypId ,
    CAST( 2 AS SMALLINT ) AS Lnk_Go_Ind ,
    0 AS LvlNum ,
    CAST( 
      ( '{"Id":' ||
        TRIM( CAST( ( Root_Id ) AS CHAR(13)) ) ||
        '}'
        ) AS VARCHAR( 300) ) AS Lnk_Seq
  FROM
    BaseFetch AS A_F
  UNION ALL
  SELECT
    R_F.Root_Id ,
    L_F.From_Id ,
    L_F.Onto_Id ,
    L_F.Lnkd_TypId ,
    L_F.Lnk_Go_Ind ,
    R_F.LvlNum + 1 AS LvlNum ,
    ( R_F.Lnk_Seq || 
      ( ',{"Typ":' ||
        TRIM( CAST( ( L_F.Lnkd_TypId ) AS CHAR(5)) ) ||
        ',"Id":' ||
        TRIM( CAST( ( L_F.Onto_Id ) AS CHAR(13)) ) ||
        '}'
        ) 
      ) AS Lnk_Seq
  FROM
    RecurseFetch AS R_F
    INNER JOIN
    LinkFetch AS L_F ON
      L_F.From_Id = R_F.Onto_Id
      AND
      L_F.Onto_Id <> R_F.From_Id
      AND
      L_F.Onto_Id <> R_F.Root_Id
      AND
      R_F.LvlNum < 4
      AND
      R_F.Lnk_Go_Ind > 1
      AND
      R_F.Lnk_Seq NOT LIKE
        '%' || '"Id":' || TRIM( CAST( ( L_F.Onto_Id ) AS CHAR(13)) ) || '}' || '%'
    INNER JOIN
    LinkRules AS L_R ON
      L_R.Lnkd_TypId = L_F.Lnkd_TypId
  )
-- FlatFetch
SELECT
  R.* ,
  '[' || R.Lnk_Seq || ']' AS Lnk_Seq_JSON
FROM
  RecurseFetch AS R
WHERE
  -- Filter out the unlinked
  LvlNum > 1
ORDER BY
  Root_Id ,
  LvlNum ,
  Lnkd_TypId ,
  From_Id ,
  Onto_Id
;
Enter fullscreen mode Exit fullscreen mode

Explanations

The Link Types

We'll come back to this, but first, another abstraction was needed for the link type encoding. In the real script, these came from values and meanings specific to my workplace. Here instead, I'm setting some arbitrary values with link ID values of:

  • 11 for the A to B type of link
  • 12 for the C to D type of link
  • 13 for the E to F type of link
  • 14 for the G to H type of link

These are quite arbitrary and merely need to match control entries in the LinkRules CTE definition.

Recursion

The core of this will be the ability of modern SQL to do a recursive query, which is to say, a query that uses itself in its own FROM clause.

It neither helps nor hinders if you've encountered "recursion" in programming languages, as in SQL it's a slightly different concept. So, let's tackle that – how does this work?

The main non-obvious thing, is that the recursive query has to be a UNION query. This has a reason, and a consequence that we should make explicit.

  • the Reason - is that the first half of the UNION provides the seed records for the recursion : if this part produces no rows then nothing ever happens;
  • the Consequence - is that each layer of run-time recursion appends more rows - and no part of the recursive process can ever remove any of those rows. Dealing with this reality is vital to understanding what will happen.

Ah, but what about the second half of the UNION ? Well, it is designed to join to the rows we have so far, so as to produce more rows.

When the recursive CTE runs, it will run multiple times - as if in a loop - until a run adds no more rows. That's the second non-obvious thing - that you don't ever write anything specific to tell it when to stop! Instead, it’s up to you to ensure that this running out of new rows will eventually happen.

The first time it runs, only the first half of the UNION will deliver rows – the second simply cannot because it has no rows to join to.

The second time it runs, the second half of the UNION can now generate rows by joining something to the result of the first run.

  • You might well ask why the first half of the UNION doesn't run the second and subsequent times. As far as I know, the under-the-hood implementation of recursive SQL is that the parser works out which part of the union is which and only runs the non-recursive part once. You're welcome to remark that this seems non-obvious and want SQL syntax to be more logical. The queue for such complaints is long and perhaps as old as SQL itself. If there are awards for great language design, SQL is never going to the ceremony.

Spider

So much for the logic of the recursion. The rest of what’s been written into this example, is all about making it do a β€œspider” crawl through the data in a way that we want.

What we're going to have the spider "crawl" around is a table of links that really just provides triples of:

  • Entity1 ->- link type ->- Entity2

where each link type has some meaning to us.

Because I can't share the meanings from my real SQL script, instead let's imagine links for these ideas:

  • A is a child of B
  • C is a parent of D
  • E is a student of F
  • G is a teacher of H

Let's imagine that we want to "spider" sets of these links so that we can connect all the people who would know what books each other might like reading, say to be able to choose a book as a present for someone in the chain of links.

So the thinking is that:

  • we can think it's worth asking family members about each other's book likes;
  • we might ask the teacher about one student - that we've linked via the family - but not ask them about all their other students.

As I've implemented things here, I’ve chosen to:

  • fully follow two pairs of link types A->-B and C->-D
  • do follow E->-F but then stop
  • don't follow from G->-H

You can see these control ideas encoded in the LinkRules CTE.

With this kind of link information we will get a link from row J to row K, store that, then treat row K as the new row from which to find a link to the next row L. As per this join clause:

  RecurseFetch AS R_F
  INNER JOIN
  LinkFetch AS L_F ON
    L_F.From_Id = R_F.Onto_Id
Enter fullscreen mode Exit fullscreen mode

In this case, the links should probably be from one ID to a different ID – as someone can't be their own child – but nonetheless, in the JOIN for the second half of the UNION we ensure that we don’t make a new record that is a step from row J again to row J, or a step back to the root of the chain. Hence we have this in the join clause:

  L_F.Onto_Id <> R_F.From_Id
  AND
  L_F.Onto_Id <> R_F.Root_Id
Enter fullscreen mode Exit fullscreen mode

That's how we crawl around, but what do we want to store as move each step?

What I’ve chosen to do, is to start building a set of records so that each makes for part of a chain of discovery, notably including:

  • for each chain, the β€œroot” that we started with;
  • for each place in the chain, the from and to (J and K) and the type of link;
  • a counter of how many steps we’ve come from the root;
  • a character string showing the sequence of steps getting from the root to K

But this would still allow for loops to appear, say after three links there becomes a link to the second in the chain. So how do we detect – and prevent - that?

Well, that’s what the string is for. To make it work, a form of encoding is used, such that an ID can be recognised unambiguously. As it happens, and because it is good enough for the purpose, the encoding I've used is a subset of JSON. To keep the detection part simple, the β€œId” element must be the second in each added pair (of link type and id) so as to ensure there is a closing curly brace.

So now we come to the part of the method where we control which kinds of links are being followed.

The original version of this, where there was no recursion, had a sequence of SQL steps that were done as each β€œround” manually. But what it did also have, was the logic for deciding which types of links were to be followed, and which were to lead to a dead end. That distinction proved to be a very powerful combination.

The key idea is to have a way of deciding what to do based on the link type. The solution is two-fold:

  • Add a kind of control table, that says which links to follow and in what way;
  • Add a control field to the recursion that can act as a brake on link following.

In the real example, I might store those controls per link type as a lookup table. Here instead, the control table has been implemented in the first Common Table Expression. Here it is on its own:

LinkRules AS
(
  SELECT
    CASE
      WHEN day_of_calendar = 1 THEN 11 -- A to B
      WHEN day_of_calendar = 2 THEN 12 -- C to D
      WHEN day_of_calendar = 3 THEN 13 -- E to F
      WHEN day_of_calendar = 4 THEN 14 -- G to H
      END AS Lnkd_TypId ,
    CASE
      WHEN day_of_calendar = 1 THEN 2 -- Follow forward
      WHEN day_of_calendar = 2 THEN 2 -- Follow forward
      WHEN day_of_calendar = 3 THEN 1 -- Link to but stop
      WHEN day_of_calendar = 4 THEN 0 -- Don't use at all
      END AS Lnk_Go_Ind
  FROM
    sys_calendar.calendar
  WHERE
    day_of_calendar < 5
   )
Enter fullscreen mode Exit fullscreen mode

This is using a Teradata trick of calling to its system calendar to create four rows out of nowhere.

  • My understanding is that Teradata generates this "table" on the fly as needed - but frankly it wouldn't matter if it's a real table that is simply always available.

As something that generates, it has a column day_of_calendar that has values starting from one (up to what ever number it has for 31 Dec 9999). In this example it's just a convenient way to get myself four numbered rows. Then I use some CASE expressions to set what is in each of those four rows.

Once you take that mechanism for granted, what you're seeing is a table of two columns.

  • one column is the link type value being controlled, the other holds a setting of three possible control values;
  • you can see their meanings in the end-of-line comments.

I've used three different numbers as it makes a test clause simpler, but it could have been a set of three different character flags.

If you trace back to where these values get used elsewhere in the query, you'll see that:

  • being present in the list determines if the link is followed at all
  • the Go Indicator is used to control whether we record a link
  • the Go Indicator is used to control whether we try link again from a link

Failings

There are some things that this method isn't quite dealing with.

Recursion Limitations

There is a reason, I've billed this as a "thought experiment".

That reason is that SQL engines aren't generally very good at coping with recursion.

It's quite likely that the data engine will use some form of internal temporary tables under the hood, those tables will have capacity limits, and that recursion can quickly reach such limits. Such issues come with many platforms and vary in detail. For those who know their Teradata, I'll just say "spool limit" and some will know what I mean.

This represents a paradox for those of us who work with large datasets. We want to use our big powerful tools on them, but find some things only work if we ensure we can keep the row counts small.

  • Note: that's not a dig at Teradata - and I've had more trouble of this kind with Spark on Hadoop.

Output Oversupply

The glaring flaw is that it produces all the valid chains from every possibly root point. That means the output includes rows that are merely sub-chains of longer chains in other rows - i.e. chains that start earlier.

It’s certainly not impossible to reduce the results to remove the sub-chains. The main questions about that are:

  • Do we really want to do that? In many situations its quite desirable to see the links heading out from any part of the chain treated as a root.
  • What is the most efficient way to do that?

For the latter, we can choose between two main methods:

  • One that uses the sequence strings
  • One that uses the Onto_Id for a self-join test

For both, there is the additional decision of whether to do that virtually, or whether to first store the current output into a table, and to then post-analyse that for the unique-but-comprehensive chains.

Many of these choices get determined by the ideas being sought.

An interesting one to consider is a way to identify the distinct β€œislands” of connected IDs – e.g. to count them and do some statistics on them.

Another is to work out what the ultimate β€œroot” IDs are (usually for time based connections which the example here was not).

Another is to work out which IDs are the common-links in many chains – i.e. who it is that has a finger in every pie?

In short, the result of this recursive query is a good base table of link sequences from which to perform various analyses of interest.

I didn't add such things to the example as there were too many to choose from.

Afterword

I hope you found this to be of interest. I will be keen to see how implementations in other SQL dialects might look.

Top comments (0)

🌚 Friends don't let friends browse without dark mode.

Sorry, it's true.