I have recently been helping out my brother on a very exciting project that requires managing student data and we recently implemented some features using subqueries and CTEs.
Thus, I decided to write this post about subqueries and another about CTEs.
You can generate the data I used in this blogpost from this GitHub Gist.
SQL challenges are all about answering questions. To this effect, each example in the blog post asks a question and then answers it using an SQL statement
A subquery is essentially a
SELECT statement that is within another
SELECT statement. It may be placed in SELECT, WHERE or FROM clause of their parent SELECT statement.
A subquery in the SELECT clause is useful when we want to add a column to our data that cannot be easily gotten from the ResultSet. These subqueries must return a single value.
For example, consider the following ResultSet for the student with an id of 9.
SELECT course, score FROM results WHERE student_id=9 ORDER BY course;
To get the student's average score per course:
SELECT ROUND(AVG(score),2) AS average FROM results WHERE student_id=9;
We get an average of
Now say we want to answer the question:
How well did the student with ID of 9 do in each course compared to his/her overall average?
The following query gives us an answer:
SELECT course, score, ( SELECT ROUND(AVG(score),2) FROM results WHERE student_id =9 )AS student_avg FROM results WHERE student_id=9 ORDER BY course ;
We could also decide to perform a mathematical operation such that we see the difference between the student's score in each course and the student's overall average as follows:
SELECT course, score, ROUND(score - ( SELECT AVG(score) FROM results WHERE student_id =9 ),2) AS difference_between_score_and_student_average FROM results WHERE student_id=9 ORDER BY difference_between_score_and_student_average DESC;
When writing subqueries in the SELECT clause it is important to note the following:
- The subquery must return a single value. If it returns a table there would be no way of putting in one column thus an error is thrown
- The WHERE clause of the outer query does not affect the subquery in any way. For instance, if we had omitted the student_id filter in our subquery, that is
SELECT course, score, ( SELECT ROUND(AVG(score),2) FROM results )AS student_avg FROM results WHERE student_id=9 ORDER BY course;
We would have gotten totally different results:
So the WHERE clause in the outer query does not affect the subquery.
Adding a subquery in the WHERE clause is useful when we want to filter a ResultSet by some value from another query.
For example, to answer the following question:
In which courses did the student with an ID of 9 scores above his overall average? What was his score in those courses?
If we know that the student average is
82.08 and we can answer that question with the following query:
SELECT course, score FROM results WHERE student_id=9 AND score > 82.08 ORDER BY course;
And get the following:
But if we don't know the average beforehand, we can use a subquery to calculate the average and use the value in the where clause immediately:
SELECT course, score FROM results WHERE student_id=9 AND score > ( SELECT AVG(score) FROM results WHERE student_id=9 ) ORDER BY course;
with the same results but a more reusable query.
When using subqueries in the WHERE clause, we could also return a list of values and filter using the IN operator. For example, we could retrieve courses taken in the first quarter of 2019:
SELECT course, score FROM results WHERE student_id=9 AND course IN ( SELECT course Where course_end_date >= '2019-01-01' AND course_end_date <= '2019-03-31' ) ORDER BY course ;
Note that the above example is used to demonstrate using IN clause with a subquery. A more efficient way would have been to just do
WHERE student_id=9 AND course_end_date >= '2019-01-01' AND course_end_date <= '2019-03-31'
When using subqueries in WHERE clause:
- the subqueries would have to return a single column
- if you want to compare a single value ensure that the subquery returns a single row/value.
- if you want to compare a list of values, ensure that the subquery returns a single column.
Let us say we want to answer the following question:
How did each student perform in Literature compared to the student's average on all the courses
One way we can answer this is by using a subquery to get each student's average in each course, that is:
SELECT student_id, ROUND(AVG(score),2) as student_avg FROM results GROUP BY results.student_id
and then join that to our results table like so:
SELECT main_result.student_id, main_result.course, score, student_av.student_avg FROM results AS main_result JOIN ( SELECT student_id, ROUND(AVG(score),2) as student_avg FROM results GROUP BY results.student_id ) as student_av ON student_av.student_id = main_result.student_id WHERE course = 'Literature' ORDER BY score;
Now we can take it one step further by getting the difference between the student Overall average and his Literature score
SELECT course, main_result.student_id, score, student_avg_sub.student_avg, score - student_avg_sub.student_avg AS score_diff FROM results AS main_result JOIN ( SELECT student_id, ROUND(AVG(score),2) as student_avg FROM results GROUP BY results.student_id ) as student_avg_sub ON student_avg_sub.student_id = main_result.student_id WHERE course = 'Literature' ORDER BY score_diff DESC;
With this, we can see that student with the
id of 21 exceeded his overall average the most. Thus we can conclude that this is one of his best courses.
Subqueries can also be used in more than one part of the main query. That is, they can be used in the WHERE or FROM or SELECT clauses or a combination of 2 or all.
For example, consider the question:
What was the average score of the student with an ID of 9 and in which courses did he score above this average?
We could run the following:
SELECT course, score, ( SELECT ROUND(AVG(score),2) FROM results WHERE student_id=9 ) AS average_score FROM results WHERE student_id = 9 AND score > ( SELECT AVG(score) FROM results WHERE student_id=9 ) ORDER BY score;
Looking at the above ResultSet we can see the student's average score and all the courses where he performed above his average.
Here are some extra features of subqueries:
- you can have a subquery within another subquery. This is known as a nested subquery.
- subqueries can access data from the main query. This is known as a correlated subquery.
All tools/techniques have their downside and subqueries are not excluded from this. Here are some downsides of using subqueries
- subqueries are executed as separate SELECT statements thus it increases the amount of time our query takes to run. For this reason, it is best to use it in a few places as possible
- subqueries are quite unreadable and can be difficult to maintain(well we could add a comment that says "don't touch the query below" 😀).
- if we have two subqueries that do the same thing, they would be executed twice. For example, the query from the previous section actually computes the average of the student twice. This can be very inefficient in large tables. One way around this is to use CTEs
In this blog post, we have seen that:
- subqueries are SELECT statements within another statement.
- subqueries are useful for reshaping and filtering our data
- we can use subqueries in the SELECT, FROM or WHERE clauses of our outer query or a combination of 2 or more.
- subqueries are executed as separate SELECT statements. For this reason, multiple subqueries can re-compute the same value.
- it is best to use subqueries sparingly as they can be quite unreadable and difficult to maintain