DEV Community

Lawrence Cooke
Lawrence Cooke

Posted on

SQL Common Table Expressions and Window functions

Sometimes when writing queries for reporting purposes, the queries required to manipulate the data can get complex.

If you are able to use PHP or Python to manipulate the data after running a query, then the query required to get the desired result is much simpler.

But sometimes you just have to find a way to fetch the data you need using only a SQL query.

For the purposes of this article, the desired outcome from the query is to fetch the total amount transacted each day and display both the daily amount and a running tally for a 7 day period.

Date Amount Tally
2023-01-01 28.50 28.50
2023-01-02 40.50 69.00
2023-01-03 15.50 84.50
2023-01-04 48.00 132.50
2023-01-05 0.00 132.50
2023-01-06 0.00 132.50
2023-01-07 30.00 162.50

The data queried to get this output is

transaction_date amount
2023-01-01 12:55:19 18.00
2023-01-01 23:05:43 10.50
2023-01-02 10:29:23 10.50
2023-01-02 11:40:19 30.00
2023-01-03 11:08:33 10.50
2023-01-03 11:38:20 5.00
2023-01-04 09:32:00 18.00
2023-01-04 10:52:01 30.00
2023-01-07 16:14:27 30.00

Getting Started

With any complex query, the trick is to break it down into its parts and work on each part until you are ready to put it all together.

Here I would run a query to fetch the data for the date range just so I can see what I need to do.

SELECT transaction_date, amount
FROM subscriptions
WHERE transaction_date 
BETWEEN '2023-01-01 00:00:00' 
AND '2023-01-07 23:59:59'
Enter fullscreen mode Exit fullscreen mode
transaction_date amount
2023-01-01 12:55:19 18.00
2023-01-01 23:05:43 10.50
2023-01-02 10:29:23 10.50
2023-01-02 11:40:19 30.00
2023-01-03 11:08:33 10.50
2023-01-03 11:38:20 5.00
2023-01-04 09:32:00 18.00
2023-01-04 10:52:01 30.00
2023-01-07 16:14:27 30.00

Now I need to group the data by day and SUM the amount transacted each day.

SELECT DATE(transaction_date) transaction_date, 
SUM(amount) amount
FROM subscriptions
WHERE transaction_date 
BETWEEN '2023-01-01 00:00:00' 
AND '2023-01-07 23:59:59'
GROUP BY DATE(transaction_date)

Enter fullscreen mode Exit fullscreen mode
transaction_date amount
2023-01-01 28.50
2023-01-02 40.50
2023-01-03 15.50
2023-01-04 48.00
2023-01-07 30.00

The date is in the right format, the amounts are grouped and summed correctly, but we are missing days.

There are many ways in SQL to fill in the gaps, some are fairly simple and others are overly complex.

For me, my preferred method is to use Common Table Expressions.

What are Common Table Expressions (CTE)

CTEs are reusable temporary result sets. In their simplest form they can replace a sub query or a view, but they have some unique characteristics which makes them powerful.

CTEs are reusable, which means that a query is run once, but the resulting data set can be used/queried multiple times.

Defining a CTE

CTEs are defined using WITH

WITH counter AS (...query...)
Enter fullscreen mode Exit fullscreen mode

This creates a temporary result set that can then be used in a query

WITH counter AS (SELECT 1)
SELECT * FROM counter

Enter fullscreen mode Exit fullscreen mode

If needed, the field name returned from a CTE can be defined

WITH counter (number) AS (SELECT 1)
SELECT * FROM counter

Enter fullscreen mode Exit fullscreen mode

Its also possible to define multiple CTEs

WITH counter (number) AS (SELECT 1),
WITH total (total) AS (SELECT 2)
SELECT * FROM counter

Enter fullscreen mode Exit fullscreen mode

A slightly more complex CTE might look like

WITH counter (number) AS (
SELECT 1
UNION ALL
SELECT 2
)
SELECT * FROM counter

Enter fullscreen mode Exit fullscreen mode
number
1
2

Where CTEs become powerful is in their ability to reference themselves recursively

If I wanted to return the numbers 1 to 6 I could use a recursive CTE to generate the numbers

WITH RECURSIVE counter (number) AS (
SELECT 1
UNION ALL
SELECT number + 1 FROM counter WHERE number < 6
)
SELECT * FROM counter

Enter fullscreen mode Exit fullscreen mode

This will produce

number
1
2
3
4
5
6

Using a CTE to generate a date range

For our original request, all dates within the date range were required even for dates where there were no transactions.

To do this, a CTE can be created to return the date range required.

WITH RECURSIVE dates (day) AS (
SELECT '2023-01-01'
UNION ALL
SELECT day + INTERVAL 1 day FROM dates WHERE day < '2023-01-07'
)
SELECT * FROM dates

Enter fullscreen mode Exit fullscreen mode
day
2023-01-01
2023-01-02
2023-01-03
2023-01-04
2023-01-05
2023-01-06
2023-01-07

Joining CTEs In Queries

For the purposes of the original requirements, I am adding the query that fetched the transactions into its own CTE so I can manipulate the data a little later on

WITH RECURSIVE dates (day) AS (
SELECT '2023-01-01'
UNION ALL
SELECT day + INTERVAL 1 day FROM dates WHERE day < '2023-01-07'
),
transactions AS (
SELECT DATE(transaction_date)
 transaction_date, 
SUM(amount) amount
FROM subscriptions
WHERE transaction_date 
BETWEEN '2023-01-01 00:00:00' 
AND '2023-01-07 23:59:59'
GROUP BY DATE(transaction_date))

SELECT * FROM transactions

Enter fullscreen mode Exit fullscreen mode

A feature of CTEs is the ability to check the result from each CTE before putting it all together

SELECT * FROM transactions OR SELECT * FROM dates will help me verify that the data the queries produce are correct.

While this query works as-is, if the date range needs to change, both CTEs would need to be updated.

A CTE can reference another CTE. To change the query so only one CTE needs to be updated in order to change the date range, we can reference the dates CTE from within the transactions CTE.

WHERE transaction_date 
BETWEEN (SELECT MIN(day) FROM dates)
AND (SELECT MAX(day) + INTERVAL 1 DAY FROM dates) 

Enter fullscreen mode Exit fullscreen mode

Now to join the two CTEs together to see all the data together

WITH RECURSIVE dates (day) AS (
SELECT '2023-01-01'
UNION ALL
SELECT day + INTERVAL 1 day FROM dates WHERE day < '2023-01-07'
),

transactions AS (
SELECT DATE(transaction_date)
 transaction_date, 
SUM(amount) amount
FROM subscriptions
WHERE transaction_date 
BETWEEN (SELECT MIN(day) FROM dates)
AND (SELECT MAX(day) + INTERVAL 1 DAY FROM dates)
GROUP BY DATE(transaction_date))

SELECT day,amount FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day

Enter fullscreen mode Exit fullscreen mode
day amount
2023-01-01 28.50
2023-01-02 40.50
2023-01-03 15.50
2023-01-04 48.00
2023-01-05 NULL
2023-01-06 NULL
2023-01-07 30.00

Other than the NULLs, the data is looking good so far. To remove the NULL values, we can use the COALESCE function.

What Does COALESCE Do?

COALESCE replaces NULL values with the first non null value in a list of values

COALEASCE(value,0) will replace the value with 0 if the value is NULL. Equally COALESCE(value, null,null,0) will also replace with 0 as its the first non null value in the list of values.

SELECT day,COALESCE(amount,0) amount FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
Enter fullscreen mode Exit fullscreen mode
day amount
2023-01-01 28.50
2023-01-02 40.50
2023-01-03 15.50
2023-01-04 48.00
2023-01-05 0.00
2023-01-06 0.00
2023-01-07 30.00

The last part of the request was a running tally. To return a running tally, SQL Window functions are required.

What Are Window Functions?

Window functions are functions that are run across the entire result set. They allow you to fetch a running tally, or data from a previous row, or from the rows ahead of the current row.

There are quite a few window function, some of them are LEAD, LAG, NTILE

LEAD - return a value from a row ahead of the current row
LAG - return a value from a previous row
NTILE - break rows out into sections, this would allow you to only return the top 25% of results for example.

You can also use aggregating functions in a window (SUM,AVG etc)

We will use the SUM function to fetch the running tally.

The format of a window function is

function(field) OVER (clause)
Enter fullscreen mode Exit fullscreen mode

The PARTITION BY clause is similar to a GROUP BY but not the same.

In a window function the function is run over every row, it won't group rows together, but it will produce a result that grouped values together in its calculation.

ORDER BY will order the results in a specific way that is then used to calculate a result.

Fetching A Running Tally

To fetch the running tally required, we need to add a window function field into the query

SUM(amount) OVER (ORDER BY day) tally
Enter fullscreen mode Exit fullscreen mode

Predefining a Window

While the method above is fine to use. If you find yourself in a situation where multiple window functions are required, it is possible to pre define the window and reference it.

In our current query

SELECT day,COALESCE(amount,0) amount,
SUM(amount) OVER (ORDER BY day) tally
FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
Enter fullscreen mode Exit fullscreen mode

This could be changed to

SELECT day,COALESCE(amount,0) amount,
SUM(amount) OVER w tally
FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
WINDOW w AS (ORDER BY day)
Enter fullscreen mode Exit fullscreen mode

Final Query

The final query using the window functions and CTEs looks like

WITH RECURSIVE dates (day) AS (
SELECT '2023-01-01'
UNION ALL
SELECT day + INTERVAL 1 day FROM dates WHERE day < '2023-01-07'
),

transactions AS (
SELECT DATE(transaction_date) 
transaction_date, 
SUM(amount) amount
FROM subscriptions
WHERE transaction_date 
BETWEEN (SELECT MIN(day) FROM dates)
AND (SELECT MAX(day) + INTERVAL 1 DAY FROM dates)
GROUP BY DATE(transaction_date)
)

SELECT day,COALESCE(amount,0) amount,
SUM(amount) OVER w tally
FROM dates d
LEFT JOIN transactions t ON t.transaction_date = d.day
WINDOW w AS (ORDER BY day)

Enter fullscreen mode Exit fullscreen mode
day amount tally
2023-01-01 28.50 28.50
2023-01-02 40.50 69.00
2023-01-03 15.50 84.50
2023-01-04 48.00 132.50
2023-01-05 0.00 132.50
2023-01-06 0.00 132.50
2023-01-07 30.00 162.50

Common Table Expressions and Windows open up SQL for better data processing especially for reporting and data analysis.

CTEs contain powerful features that simplify tasks, they also add readability making building or debugging complex queries easier.

Being reusable means that where you might have to run a subquery more than once in a query, the CTE will only run the query once and use the result set instead of running multiple subqueries, which can add efficiency to the query in some cases.

Window functions are a great way to provide analysis over the result set, this is evident in accounting type reports, but not limited to just accounting, it has a lot of use cases where using window functions allow you to gather additional information needed to produce the results required.

Top comments (1)

Collapse
 
aarone4 profile image
Aaron Reese

Excellent article. For me the key advantages of CTEs are:

  • Code simplification by avoiding sub queries.

  • code simplification by avoiding complex join rules (prep the data in a CTE instead)

  • unit testability of each CTE separately

  • being able to reference the same CTE multiple times

  • recursion

  • preparation of any data that uses variables or configuration instead of 'hiding' it in the main body of code

Be aware that sometimes using lots of CTEs will slow down your code because it is all done in memory. If this happens I tend to convert the CTEs to #tmp tables which can be magnitudes faster and allow you to add indexes if necessary, just remember to be a good boy scout and explicitly drop them when you are done.