Notes on PostgreSQL: Common Table Expressions

#postgres #sql #database

Introduction

You might already have heard about common table expressions, in short CTEs - if not, these notes might also be helpful for furthering your understanding on the topic. It's one of these very nice features that come with working with PostgreSQL, especially if you were switching over from MySQL prior to version 8.0. These notes will help in understanding why CTEs are sometimes referred to as "WITH queries", how to leverage multiple CTEs and even a little bit of recursion.

Basics

To get a good understanding of the very basics of CTEs, let's begin with the syntax. To construct a common table expression we leverage the WITH keyword.
NOTE: CTEs are also referred to as WITH queries due to the WITH keyword they are based on.
Consulting the PostgreSQL documentation we can see what a basic structure of a CTE looks like:

WITH CTE_NAME (OPTIONAL_COLUMN_NAMES) AS (CTE_SELECT_QUERY)
SELECT * FROM CTE_NAME;

The WITH clause is followed by the name of the CTE and an optional list of column names. Finally the selection query is defined after the AS keyword.
Once the CTE is defined, it can be accessed in any SELECT, INSERT, UPDATE and even DELETE statement.

Let's take a look at an example to get a better idea. The following examples assume we have a list of vinyl records, including the artist name, year of release, label and format. Furthermore we have a list of all records sold, including date of sale and the corresponding price. Check the schema here.
In the first example we would like to find the most sold records via querying the sales table:

WITH most_sold_records (record_id, sold_items) AS(
  SELECT   record,
           count(id)
  FROM     sales
  GROUP BY record
)

SELECT   *
FROM     most_sold_records
ORDER BY sold_items DESC, record_id ASC;

The above example demonstrates how to construct a CTE and then accessing it in the follow-up statement. The result of the query is a list containing any items sold, ordered by the number of items sold. From the query we can see that we have information about the record id, so our next step would be to join the newly created CTE with the records table, to display information like title, label and format in addition to the number of sold items.

WITH most_sold_records (record_id, sold_items) AS(
  SELECT   record,
           count(id)
  FROM     sales
  GROUP BY record
)

SELECT     record_id,
           name,
           artist,
           label,
           format,
           sold_items
FROM       most_sold_records
INNER JOIN records ON records.id = record_id
ORDER BY   record_id DESC, sold_items ASC;

Let's take a look at another basic example before getting into more advanced territory. Our vinyl catalog is comprised of different formats like 12", 10" or 7" single vinyl or single or double LPs (albums). We would like to retrieve a list where a vinyl record should either be defined as a single or album.

WITH record_catalog AS(
  SELECT id,
         name,
         label,
         CASE
           WHEN format IN ('12', '10', '7')
            THEN 'Single'
           ELSE 'Album'
         END AS type
  FROM   records
)

SELECT * from record_catalog;

This would return a list of vinyl records where type is either a single record or an album. Now we can use the record_catalog CTE to filter out all singles for example.

WITH record_catalog AS(
  SELECT id,
         name,
         label,
         CASE
           WHEN format IN ('12', '10', '7')
             THEN 'Single'
           ELSE 'Album'
         END AS type
  FROM   records
)

SELECT *
FROM   record_catalog
WHERE  type = 'Album';

Up until this point, we should now have a very good understanding of the basic structure of common table expressions in PostgreSQL. In the following section we will try to see what further possible combinations are possible.

Advanced

The WITH clause accepts multiple CTEs:

WITH 
    CTE_NAME_1 (OPTIONAL_COLUMN_NAMES) AS (CTE_SELECT_QUERY_1),
    CTE_NAME_2 (OPTIONAL_COLUMN_NAMES) AS (CTE_SELECT_QUERY_2)

SELECT * FROM CTE_NAME_1 
UNION 
SELECT * FROM CTE_NAME_2

Being able to define multiple CTEs opens up more possibilities to leverage. For example we might want to display a list showing the number of units and the highest and lowest price sold per record as well as the highest price a record was sold by that artist.
The first step is to break it up into two queries. A regular query for finding the max, min and count for every record inside the sales table:

SELECT   count(*) AS total_units_sold, 
         MAX(price) AS highest_price, 
         MIN(price) AS lowest_price,
         a.id As artist_id,
         a.name As artist_name,
         r.name AS record_title
FROM     sales 
JOIN     records r ON (record = r.id)
JOIN     artists a ON (r.artist = a.id)
GROUP BY r.name, a.id;

This would return us a list with all the needed record sales information. For the second part of the query we can change the GROUP BY column to retrieve the highest price on an artist basis.

SELECT     MAX(price) AS highest_overall_price,
           a.id AS artist_id
FROM       sales
INNER JOIN records r ON (record = r.id)
INNER JOIN artists a ON (r.artist = a.id)
GROUP BY   a.id
ORDER BY   a.id;

What is left is now to take these two queries, convert them to CTEs and then join them together.

WITH record_sales AS (
  -- Find the amount sold, highest and lowest price per record
  SELECT     COUNT(*) AS total_units_sold, 
             MAX(price) AS highest_price, 
             MIN(price) AS lowest_price,
             a.id As artist_id,
             a.name As artist_name,
             r.name AS record_title
  FROM       sales 
  INNER JOIN records r ON (record = r.id)
  INNER JOIN artists a ON (r.artist = a.id)
  GROUP BY   r.name, a.id
  ORDER BY   a.id 
),
artist_sales AS (
  -- Find the highest price per artist
  SELECT     MAX(price) AS highest_overall_price,
             a.id AS artist_id
  FROM       sales
  INNER JOIN records r ON (record = r.id)
  INNER JOIN artists a ON (r.artist = a.id)
  GROUP BY   a.id
  ORDER BY   a.id
)

SELECT       * 
FROM         record_sales
INNER JOIN   artist_sales USING (artist_id);

One more interesting aspect, if we recall the format example we previously created, is that by defining a record to either be of type Single or Album, we could extend it to filter out based on said type in a second CTE. The means that we can access CTEs within other CTEs, check the following adaption:

WITH record_catalog AS(
  SELECT id,
         name,
         label,
         CASE
            WHEN format IN ('12', '10', '7')
             THEN 'Single'
            ELSE 'Album'
          END
         AS type
  FROM records
),
album AS (SELECT *
          FROM record_catalog
          WHERE type = 'Album')

SELECT *
FROM   album;

Recursive Common Table Expressions

Sometimes we need to work through some data that has some sort of hierarchy and we would like to go through the data starting from the root and then traversing level by level, finding references to the previous level. For example in our online vinyl record store, we have an artists table that contains a field alias. The alias is used when an artist also uses a different name and sometimes even this alias might have another alias associated with it.
Normally, to find the root artist, we would check to see which artist entries have no alias references.

SELECT *
FROM   artists
WHERE  alias IS NULL;

The standard definition of a WITH RECURSIVE in PostgreSQL has the following structure:

WITH RECURSIVE CTE_NAME (OPTIONAL_COLUMN_NAMES) AS(
    CTE_NON_RECURSIVE_SELECT_QUERY
    UNION ALL
    CTE_RECURSIVE_SELECT_QUERY
)
SELECT * FROM CTE_NAME;

The documentation summarises this nicely: "The general form of a recursive WITH query is always a non-recursive term, then UNION (or UNION ALL), then a recursive term, where only the recursive term can contain a reference to the query's own output."
So we can derive that the non-recursive term is the base result set and the the recursive term, which can consist of one to n CTEs are joined by an UNION [ALL]. The recursion halts if the previous iteration doesn't return any new rows.

Now that we have a basic idea of the structure, the next step is to convert this top level artist lookup into a CTE:

WITH RECURSIVE root_aritst AS (
  SELECT *
  FROM   artists
  WHERE  alias IS NULL
)

SELECT * FROM root_aritst;

By using the WITH RECURSIVE clause we can rebuild our query to go through level by level and return a hierarchical structure representing an artist's aliases and the aliases for these aliases etc.

WITH RECURSIVE leveln AS (
  SELECT *, '' AS alias
  FROM   artists
  WHERE  alias IS NULL
  UNION ALL
  SELECT artists.*, leveln.name alias FROM artists
  JOIN leveln on leveln.id = artists.alias
)

SELECT * FROM leveln;

We start with the root level, the base artist representations, and then start looking for artist entries that reference these representations. Once we have the 1st level aliases, our query will try to find all aliases referencing these 1st level aliases and so forth. By designing our query this way, we can go n levels deep.

NOTE: We need to ensure that we don't run into some endless recursion.
A follow-up post will focus on optimisations and how to avoid pitfalls.

Summary

These notes on CTEs should have helped with better understanding common table expressions. Especially how a CTE is constructed and how to combine multiple CTEs for more complex operations as well as how to use recursion for specific cases.
As a round up, we can see that CTEs come with some nice benefits:

They can be more readable than regular subqueries and very helpful when it comes to reuse.
Multiple CTEs can be constructed and reused and combined in a later statement, enabling us to use the expressions in a more flexible manner.
Leveraging recursive CTEs, we can do more complicated calculations while keeping these complex operations readable.
CTEs can be joined with other tables and window functions. This can help us to filter out specific data step by step. Refining our result set along the way.

If you have any questions or feedback please leave a comment here or connect via Twitter: A. Sharif