We often need to deal with long and unreadable SQL statements. There are multiple ways to increase their readability and maintainability in a longer run. Let’s start with some examples:
- UPPERCASE for keywords - we should write all keywords (like
SELECT
,FROM
,JOIN
) all uppercase to distinguish them from other parts of the query - snake_case for identifiers and names - name your table
customer_data
instead ofCustomerData
(so use lowercase and underscore) - Aliases - just rename your columns and tables in the query to improve the readability, especially if you join the same table many times
- Indentation - indent column names in SELECT part, indent subqueries, indent
WHERE
filters - Use
JOIN ON
instead ofJOIN
+WHERE
(even these two are equivalent)
There are probably other best practices that we can follow, especially depending on where we work and what software we use. However, all these tricks do not let us avoid repetition and make our queries significantly shorter.
In order to fix that, we can use two different approaches. They will make our code more readable and easier to maintain in the long run.
Common Table Expressions
Common Table Expression (CTE), also known as WITH
expression, allows us to write an auxiliary statement to later refer to it in a larger query. It’s like a temporary table that exists just for one query. CTE can contain SELECT
, INSERT
, UPDATE
, or DELETE
, however, most often we use it with SELECT
statements.
Let’s see an example. Imagine that we store data about orders in a given region.
CREATE TABLE orders(region VARCHAR(100), amount INT, product VARCHAR(100));
INSERT INTO orders(region, amount, product)
VALUES ('EU', 10, 'Product1'), ('EU', 20, 'Product2'), ('US', 1, 'XYZ'), ('JP', 10, 'ABC')
We would like to show orders from regions that constituted at least 10% of the total sales. Let’s take this query:
SELECT region, product, (SELECT SUM(amount) AS total_sales FROM orders AS o2 WHERE o2.region = o.region GROUP BY region)
FROM orders AS o
WHERE region IN (
SELECT region
FROM orders
GROUP BY region
HAVING SUM(amount) > (
SELECT SUM(amount)
FROM orders
) / 10
)
+---------+------------+-------------+
| region | product | total_sales |
+---------+------------+-------------+
| EU | Product 1 | 30 |
| EU | Product 2 | 30 |
| JP | ABC | 10 |
+---------+------------+-------------+
We show region, product, and the total number of products sold in that region. We filter regions based on the condition that the sum of orders in a given region must be at least 10% of the sum of all orders.
However, we can see that this query isn’t very straightforward. It mixes where things are calculated and filtered, calculates sum many times, and doesn’t provide a clear understanding of how we calculate the threshold. Let’s now rewrite it with the WITH statement:
WITH best_sales_threshold AS (
SELECT SUM(amount) / 10 AS threshold
FROM orders
), sales_by_region AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
), top_regions AS (
SELECT DISTINCT(region)
FROM sales_by_region
WHERE total_sales > (SELECT threshold FROM best_sales_threshold)
)
SELECT o.region, o.product, o2.total_sales
FROM orders AS o
JOIN sales_by_region AS o2 ON o2.region = o.region
WHERE o.region in (SELECT region FROM top_regions)
We first calculate the threshold as a temporary table with one column. Next, we calculate another temporary table with the sum of sales per region. We then find regions that performed the best. Finally, we just present the data.
We can see the query is much more readable and easier to understand. We can see that it’s not nested multiple times, things are "linear" and we can understand what’s going on just by following the query and meaningful names.
However, CTEs can allow us things that are not otherwise possible in standard SQL. CTEs can be recursive, so the temporary table can refer to itself to fill the content. Let’s see that in action:
WITH RECURSIVE numbers(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM numbers WHERE n < 100
)
SELECT n FROM numbers;
We define a recursive CTE that has one column. We set the initial value to 1
. Next, we use the UNION ALL
to introduce the recursive term that can refer to the query’s own output. So we take all the rows that are below one hundred, and we create new rows with values increased by one. This generates numbers from 1 to 100.
Window Functions
A window function calculates values across a set of table rows related to the current row. This is very similar to aggregation, but window functions do not group rows into a single row. The rows are still independent, however, they can contain values grouped together.
Let’s take the example from the previous section but ignore the top region part:
SELECT region, product, (SELECT SUM(amount) AS total_sales FROM orders AS o2 WHERE o2.region = o.region GROUP BY region)
FROM orders AS o
+---------+------------+-------------+
| region | product | total_sales |
+---------+------------+-------------+
| EU | Product 1 | 30 |
| EU | Product 2 | 30 |
| US | XYZ | 1 |
| JP | ABC | 10 |
+---------+------------+-------------+
So we take the region, product, and the total sales for a given region. With window function we can do the following:
SELECT region, product, SUM(amount) OVER (PARTITION BY region) AS total_sales
FROM orders AS o
We simply specify that the value for total_sales
column should take all the rows for which the region
column has the same value as the region
value of the current row, and then sum all of them.
We can also include the ordering. Let’s say that we wanted to rank the products within a region. We can use the following:
SELECT region, product, SUM(amount) OVER (PARTITION BY region) AS total_sales, rank() OVER (PARTITION BY region ORDER BY amount DESC)
FROM orders AS o
We once again include rows from the same region, but this time we also order them by the decreasing amount, and we calculate the value with rank()
function
+---------+------------+--------------+------------+
| region | product | total_sales | sales_rank |
+---------+------------+--------------+------------+
| EU | Product 1 | 30 | 1 |
| EU | Product 2 | 30 | 2 |
| US | XYZ | 1 | 1 |
| JP | ABC | 10 | 1 |
+---------+------------+--------------+------------+
There are many more functions that we can use:
-
FIRST_VALUE
- returns value of the first row in the partition -
LAST_VALUE
- returns value of the last row in the partition -
LAG
- returns value of the row before the row in the partition -
LEAD
- returns value of the row after the row in the partition -
NTH_VALUE
- returns value of the n-th row in the partition
We can also use the regular aggregate functions like SUM
, MIN
, MAX
, etc.
Practical examples
Let us now see some practical examples.
Choosing n-th element from a group
Let’s say that we would like to group rows and select top three rows in each group:
WITH numbered_orders AS (
SELECT region, product, rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rank
FROM orders AS o
)
SELECT *
FROM numbered_orders
WHERE rank < 4
We use CTE to number orders in a given region based on their rank. Then, we select rows for which rank is less than 4.
Running sums
Let’s say that we would like to calculate the running sum, so sum the value of a given column for all the rows before the current row:
SELECT region, product, SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running_sum
FROM orders AS o
Difference between neighboring rows
Let’s say that we would like to show by how much the product is performing better than the previous one. We can do that like this:
SELECT region, product, amount, amount - LAG(amount, 1) OVER (PARTITION BY region ORDER BY amount)
FROM orders AS o
So we take the current amount and subtract the amount of the product that was one row earlier using the LAG function.
Performance
It may seem that CTE and window functions bring benefits only. That’s not necessarily the case. Let’s take these two queries:
WITH cte_performance AS (
SELECT *, MD5(MD5(ticket_no)) AS double_hash
FROM boarding_passes
)
SELECT COUNT(*)
FROM cte_performance AS C1
JOIN cte_performance AS C2 ON C2.ticket_no = C1.ticket_no
JOIN cte_performance AS C# ON C3.ticket_no = C1.ticket_no
WHERE
C1.double_hash = 'HASH'
AND C2.double_hash = 'HASH'
AND C3.double_hash = 'HASH'
and
SELECT COUNT(*)
FROM boarding_passes AS C1
JOIN boarding_passes AS C2 ON C2.ticket_no = C1.ticket_no
JOIN boarding_passes AS C# ON C3.ticket_no = C1.ticket_no
WHERE
MD5(MD5(C1.ticket_no)) = 'HASH'
AND MD5(MD5(C2.ticket_no)) = 'HASH'
AND MD5(MD5(C3.ticket_no)) = 'HASH'
They are equivalent. The boarding_passes table has 8 million rows in total. The former query takes 13 seconds to complete, while the latter takes 8. They return the same result, but their performance differs greatly.
Summary
Common Table Expressions and window functions let you make your queries much more readable and maintainable. They also make some queries doable directly in the SQL. It’s always a good idea to explore whether they improve the readability in the specific case.
Top comments (0)