DEV Community

Chidiebere Ogujeiofor
Chidiebere Ogujeiofor

Posted on • Edited on

Whats WITH CTE

This is sort of a continuation of this

You can generate the data I used in this blogpost from this GitHub Gist.

In that post, I talked about subqueries and how they work. How you can use them to answer complex questions. In this post, I would extend on that and talk about Common Table Expression(CTEs).

What are they

According to PostgreSQL Tutorial:

A common table expression is a temporary result set which you can reference within another SQL statement including SELECT, INSERT, UPDATE or DELETE. Common Table Expressions are temporary in the sense that they only exist during the execution of the query.

Once created the CTE can then be treated as a table thus you can perform JOINs, put them in a FROM, etc.

Compared to nested subqueries, CTEs have the following advantages:

  • Since they store data temporarily, they can improve performance than subqueries
  • They are more readable than nested subqueries

Single CTE in a query syntax

CTEs are created using the WITH keyword and are immediately followed by the main query. Below is an overall syntax for CTEs:

WITH <cte-name> AS (
   <the CTE query> 

)
<Final query>
Enter fullscreen mode Exit fullscreen mode

For example, in the post on subqueries, we answered the question

In which courses did the student with an ID of 9 scores above his overall average? What was his score in those courses?

We can answer that question, using CTEs as following:

WITH student_average AS (

   SELECT ROUND(AVG(score),2) AS avg FROM results
     WHERE student_id=9
)

SELECT course, score, student_average.avg as average
    FROM results
JOIN student_average
    on TRUE
WHERE student_id=9 AND score > student_average.avg

ORDER BY course;

Enter fullscreen mode Exit fullscreen mode
course score average
Biology 84 82.08
Computer Science 93 82.08
History 84 82.08
Physics 85 82.08
Statistics 85 82.08

With similar results as when we used subqueries in the previous post.

Multiple CTEs

It is possible to declare multiple CTEs such that one CTE may or may not call other CTEs. Again this technique helps make queries faster and readable as parts of the query are named.

Below is the overall structure of using multiple CTEs:

WITH cte1 AS (
   <SOME-sql-statement>
),

cte2 AS (
   <some-sql-that-may-or-may-not-use-cte1>
),

.
.
.,
cteN AS (
   <some-sql-that-may-or-may-not-previous-ctes>

)
<main/final-sql-statement>
;

Enter fullscreen mode Exit fullscreen mode

For example, say we want to answer the following question:

What was the class average and which students' average was higher than the class average? What was each of those students' average?

We can answer that question by retrieving a table with student_ids, their average, the class average and filter only when the student_average is more than the class average. We can do this using the following query:

WITH 

each_student_avg  AS (
 SELECT student_id, ROUND(AVG(score),2) as student_overall_avg
        FROM results 
        GROUP BY results.student_id

), 
class_average AS (
 SELECT ROUND(AVG(score),2) as class_avg
        FROM results 
)

SELECT student_id, student_overall_avg, cls.class_avg
FROM each_student_avg AS esa
JOIN class_average  AS cls
    ON TRUE
WHERE student_overall_avg > class_avg  
;

Enter fullscreen mode Exit fullscreen mode

This gives the following result:

student_id student_overall_avg class_avg
4 65.62 65.46
18 66.23 65.46
8 66.92 65.46
13 68.62 65.46
16 68.69 65.46
2 69.62 65.46
5 69.62 65.46
10 70.38 65.46
19 71.77 65.46
17 72.00 65.46
7 75.77 65.46
14 75.85 65.46
9 82.08 65.46
24 86.00 65.46

With this ResultSet we can see that the class average is 65.46 and the ids of the students whose average is above the class average.

NB

After each CTE is declared (class_average and each_student_avg), their result is stored in memory. Thus, the calls to class_average and each_student_avg in the main query performs operations on the ResultSet stored in each of the CTE and not the whole results table.

Readability powers

Without CTEs here's how the previous query would look like:

SELECT student_id, student_overall_avg, cls.class_avg
FROM (
 SELECT student_id, ROUND(AVG(score),2) as student_overall_avg
        FROM results 
        GROUP BY results.student_id

) AS esa
JOIN (
 SELECT ROUND(AVG(score),2) as class_avg
        FROM results 
)  AS cls
    ON TRUE
WHERE student_overall_avg > class_avg; 

Enter fullscreen mode Exit fullscreen mode

You can agree that the query with CTEs improves the readability of the query.

A more complex question

CTEs are particularly handy in answering questions that are very complex and involve a lot of aggregation/reshaping of the data.

For example, say our school has a policy of placing students on Probation or Withdrawing their admission based on poor performance.

When a student gets below 60 in any course in a particular semester and they are not on probation from the previous term, then they are placed on probation for the next semester.

Also, if they get below 60 in any course and they were on probation from the previous term then their admission is withdrawn. In summary:

When a student gets below 60 in any course in the current semester and he got below 60 in any course in the previous semester then the admission of such a student should be Withdrawn

When a student gets below 60 in any course in the current semester and passed all his courses in the previous semester then such a student should be placed on probation for the next semester

Assuming that Semester 2 started from '2019-08-01' and ended in '2019-10-30' while Semester 1 started from '2019-05-01' and ended in '2019-07-30':

Which students that should be placed on placed on probation and which should have their admission withdrawn at the end of Semester 2? Which courses did they fail? What was their score in those courses and when did they take them?

We can answer the above question using a table that has student ID, a status that can be either 'Withdrawal' or 'Probation' and a column that contains all the courses the student got below 60 in both semesters, the date the course was taken and their score in a comma-separated list. Something like:

student_id status course_info
{id-1} Probation {score1} {course1} {date-taken1}, ...,{scoreN} {courseN} {date-takenN}
{id-2} Withdrawal {score1} {course1} {date-taken1}, ...,{scoreN} {courseN} {date-takenN}
. . .
. . .
. . .

Using CTEs, we begin with the WITH keyword:

WITH

Enter fullscreen mode Exit fullscreen mode

Then we get all those that scored below 60 in semester 1 like so

semester_1_lt_60 AS (
--Retrieves all the data for semester 1 that got below 60
    SELECT student_id, course_end_date,score FROM results
    WHERE  
        score < 60 AND 
        course_end_date BETWEEN '2019-05-01'  AND '2019-07-30' 

),


Enter fullscreen mode Exit fullscreen mode

Then we get all those that scored below 60 in semester 2 like so


semester_2_lt_60 AS (
--Retrieves all the data for semester 2 that got below 60
    SELECT student_id, course_end_date,score FROM results
    WHERE  
        score < 60 AND 
        course_end_date BETWEEN '2019-08-01' AND '2019-10-30' 

),

Enter fullscreen mode Exit fullscreen mode

We can now get all those due for withdrawals as those who failed courses in both semester 1 and 2

withdrawals as (
-- Retrieves all those that got below 60 in both semester 1 and 2
    SELECT s1.student_id
    FROM semester_1_lt_60 as s1
    JOIN semester_2_lt_60 AS s2
        ON s1.student_id = s2.student_id

),

Enter fullscreen mode Exit fullscreen mode

Next, we can retrieve our Probation candidates as those that scored below 60 in semester 2 that are not in the withdrawal CTE:


probation AS (
-- Retrieves all those that failed only in semester 2 but not in both semester 1 and 2
-- This is essentially removing all those ids in  semester_2_lt_60 that are not in withdrawals 
    SELECT student_id
        FROM semester_2_lt_60
    WHERE student_id NOT IN
    (
        SELECT student_id 
        FROM withdrawals
    )
),

Enter fullscreen mode Exit fullscreen mode

Now we create our final CTE which joins both Probation and Withdrawal CTEs:


id_with_status AS (
--Adds retrieves both data with the student status for the term
    SELECT 
        DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
    CASE 
        WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
        WHEN p.student_id IS NOT NULL THEN 'Probation'
    END AS status   FROM withdrawals as w
    FULL JOIN probation as p
        ON p.student_id = w.student_id 
    ORDER BY status

)

Enter fullscreen mode Exit fullscreen mode

Finally, our main query simply queries the results table JOINs id_with_status and performs a string aggregation that returns all the courses that each of the students failed in both semesters with the name of the course, their score and the date the course ended


SELECT r.student_id,  id_status.status, STRING_AGG(
        CONCAT(r.score, ' ',r.course, ' ',  r.course_end_date), ','
    ) as course_info

FROM results AS r
JOIN id_with_status AS id_status
    ON r.student_id = id_status.student_id
WHERE
    r.score < 60 AND
    r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30' 
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;

Enter fullscreen mode Exit fullscreen mode

Here is how it looks all together


WITH 


semester_1_lt_60 AS (
--Retrieves all the data for semester 1 that got below 60
    SELECT student_id, course_end_date,score FROM results
    WHERE  
        score < 60 AND 
        course_end_date BETWEEN '2019-05-01'  AND '2019-07-30' 

),

semester_2_lt_60 AS (
--Retrieves all the data for semester 2 that got below 60
    SELECT student_id, course_end_date,score FROM results
    WHERE  
        score < 60 AND 
        course_end_date BETWEEN '2019-08-01' AND '2019-10-30' 

),

withdrawals as (
-- Retrieves all those that got below 60 in both semester 1 and 2
    SELECT s1.student_id
    FROM semester_1_lt_60 as s1
    JOIN semester_2_lt_60 AS s2
        ON s1.student_id = s2.student_id

),

probation AS (
-- Retrieves all those that failed only in semester 2 but not in both semester 1 and 2
-- This is essentially removing all those ids in  semester_2_lt_60 that are not in withdrawals 
    SELECT student_id
        FROM semester_2_lt_60
    WHERE student_id NOT IN
    (
        SELECT student_id 
        FROM withdrawals
    )
),

id_with_status AS (
--Adds retrieves both data with the student status for the term
    SELECT 
        DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
    CASE 
        WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
        WHEN p.student_id IS NOT NULL THEN 'Probation'
    END AS status   FROM withdrawals as w
    FULL JOIN probation as p
        ON p.student_id = w.student_id 
    ORDER BY status

)

SELECT r.student_id,  id_status.status, STRING_AGG(
        CONCAT(r.score, ' ',r.course, ' ',  r.course_end_date), ','
    ) as course_info

FROM results AS r
JOIN id_with_status AS id_status
    ON r.student_id = id_status.student_id
WHERE
    r.score < 60 AND
    r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30' 
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;

Enter fullscreen mode Exit fullscreen mode

Below is the final result:

student_id status course_info
6 Probation 53 Statistics 2019-10-05
13 Probation 58 Statistics 2019-10-05
15 Probation 57 Statistics 2019-10-05
18 Probation 0 History 2019-10-05
20 Probation 40 Chemistry 2019-10-05,54 Statistics 2019-10-05,0 History 2019-10-05
23 Probation 36 Chemistry 2019-10-05,57 Statistics 2019-10-05
1 Withdrawal 0 Chemistry 2019-10-05,0 Mathematics 2019-05-09,14 Statistics 2019-10-05
12 Withdrawal 52 Statistics 2019-10-05,40 Computer Science 2019-05-09,44 Commerce 2019-05-09,47 Mathematics 2019-05-09
21 Withdrawal 0 Chemistry 2019-10-05,35 Computer Science 2019-05-09,48 Statistics 2019-10-05
22 Withdrawal 55 Commerce 2019-05-09,49 Statistics 2019-10-05,41 Chemistry 2019-10-05

Looking at the SQL query above you can easily see how we are storing data in memory and re-using them in subsequent CTEs and the main query thus breaking down the problem into smaller bits until we have enough info to generate/return our final ResultSet. This greatly improves performance as opposed to running through the table several times.

The query is also quite readable as we are naming parts of the query while using the CTE names.

I would like to make one last attempt to show you how readable CTEs are by showing you the nested subquery equivalent( Brace yourself, comrades!⛈⚡️🎇):



SELECT r.student_id,  id_status.status, STRING_AGG(
        CONCAT(r.score, ' ',r.course, ' ',  r.course_end_date), ','
    ) as course_info

FROM results AS r
JOIN (
    SELECT 
        DISTINCT COALESCE(w.student_id, p.student_id) AS student_id,
    CASE 
        WHEN w.student_id IS NOT NULL THEN 'Withdrawal'
        WHEN p.student_id IS NOT NULL THEN 'Probation'
    END AS status   FROM (
        SELECT s1.student_id
        FROM (
            SELECT student_id, course_end_date,score FROM results
            WHERE  
                score < 60 AND 
                course_end_date BETWEEN '2019-05-01'  AND '2019-07-30' 
        ) as s1
        JOIN (
            SELECT student_id, course_end_date,score FROM results
            WHERE  
                score < 60 AND 
                course_end_date BETWEEN '2019-08-01' AND '2019-10-30'
        ) AS s2
            ON s1.student_id = s2.student_id

    ) as w
    FULL JOIN (
        SELECT student_id
        FROM (
            SELECT student_id, course_end_date,score FROM results
            WHERE  
                score < 60 AND 
                course_end_date BETWEEN '2019-08-01' AND '2019-10-30' 
        ) AS lt_60_in_semester_1
        WHERE student_id NOT IN
        (
            SELECT student_id 
            FROM (
                SELECT s1.student_id
                FROM (
                    SELECT student_id, course_end_date,score FROM results
                    WHERE  
                        score < 60 AND 
                        course_end_date BETWEEN '2019-05-01'  AND '2019-07-30' 
                ) as s1
                JOIN (
                    SELECT student_id, course_end_date,score FROM results
                    WHERE  
                        score < 60 AND 
                        course_end_date BETWEEN '2019-08-01' AND '2019-10-30' 
                ) AS s2
                    ON s1.student_id = s2.student_id
            ) AS  withdrawal
        )

     )as p
        ON p.student_id = w.student_id 
    ORDER BY status
) AS id_status
    ON r.student_id = id_status.student_id
WHERE
    r.score < 60 AND
    r.course_end_date BETWEEN '2019-05-01' AND '2019-10-30' 
GROUP BY r.student_id, id_status.status
ORDER BY id_status.status ;
Enter fullscreen mode Exit fullscreen mode

I believe you now get the point! 😉

Conclusion

In this blogpost:

  • we have looked at CTEs what they are and how they work
  • we have seen that CTEs are more readable than nested subqueries
  • we have also seen that CTEs can improve performance.
  • finally, that CTEs can be used to efficiently answer complex questions

I would like to conclude guys by urging you to look at the query that used CTEs to answer the question in the Complex Question section. How would you make it more efficient?

Alt text of image

Folks its time to throw away those unreadable/inefficient nested subqueries

Alt text of image

Top comments (3)

Collapse
 
aarone4 profile image
Aaron Reese

Great article. One key advantage of CTEs not mentioned is that you can effectively debug/ unit test them by just running select * from CTE immediately after the CTE and check the results rather than having to comment out the outer queries.
The first example could also have been achieved using SUM() OVER () which I think is a cleaner syntax but as this is an article about CTEs we can let it go.
Instead of writing two CTEs, one for each semester, you could have had a single CTE with the semester in the dataset and then joined to it twice using the semester column as a filter in the join clause.
One other trick with CTEs us that normally you can't use windowed functions like RANK() in the WHERE clause, but if you wrap the initial query as a CTE and give the RANK() result a column name you can use it in the main query WHERE clause.
All fields returned by a CTE have to have a unique column name.
CTEs are not always the most performant solution, at least on MSSQL, in my experience very large CTE results sets seem to put pressure on memory and can also stop the query optimiser from using indexes and table statistics when generating a query plan. As there are no indexes bon CTEs you can often get better performance by persisting the data to temp tables and building indexes on them

Collapse
 
chidioguejiofor profile image
Chidiebere Ogujeiofor

Makes sense. Great comment

Collapse
 
chidioguejiofor profile image
Chidiebere Ogujeiofor

edited this to have syntax highlighting. eg

SELECT

Some comments may only be visible to logged-in visitors. Sign in to view all comments.