DEV Community

Cover image for What Are Window Functions in SQL & How To Use Them
Adam Furmanek for Metis

Posted on • Originally published at metisdata.io

What Are Window Functions in SQL & How To Use Them

We often need to deal with long and unreadable SQL statements. There are multiple ways to increase their readability and maintainability in a longer run. Let’s start with some examples:

  • UPPERCASE for keywords - we should write all keywords (like SELECT, FROM, JOIN) all uppercase to distinguish them from other parts of the query
  • snake_case for identifiers and names - name your table customer_data instead of CustomerData (so use lowercase and underscore)
  • Aliases - just rename your columns and tables in the query to improve the readability, especially if you join the same table many times
  • Indentation - indent column names in SELECT part, indent subqueries, indent WHERE filters
  • Use JOIN ON instead of JOIN + WHERE (even these two are equivalent)

There are probably other best practices that we can follow, especially depending on where we work and what software we use. However, all these tricks do not let us avoid repetition and make our queries significantly shorter.

In order to fix that, we can use two different approaches. They will make our code more readable and easier to maintain in the long run.

Common Table Expressions

Common Table Expression (CTE), also known as WITH expression, allows us to write an auxiliary statement to later refer to it in a larger query. It’s like a temporary table that exists just for one query. CTE can contain SELECT, INSERT, UPDATE, or DELETE, however, most often we use it with SELECT statements.

Let’s see an example. Imagine that we store data about orders in a given region.

CREATE TABLE orders(region VARCHAR(100), amount INT, product VARCHAR(100));
Enter fullscreen mode Exit fullscreen mode
INSERT INTO orders(region, amount, product)
VALUES ('EU', 10, 'Product1'), ('EU', 20, 'Product2'), ('US', 1, 'XYZ'), ('JP', 10, 'ABC')
Enter fullscreen mode Exit fullscreen mode

We would like to show orders from regions that constituted at least 10% of the total sales. Let’s take this query:

SELECT region, product, (SELECT SUM(amount) AS total_sales FROM orders AS o2 WHERE o2.region = o.region GROUP BY region)
FROM orders AS o
WHERE region IN (
    SELECT region
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > (
     SELECT SUM(amount)
     FROM orders
    ) / 10
)
Enter fullscreen mode Exit fullscreen mode

+---------+------------+-------------+
| region  |  product   | total_sales |
+---------+------------+-------------+
| EU      | Product 1  |          30 |
| EU      | Product 2  |          30 |
| JP      | ABC        |          10 |
+---------+------------+-------------+
Enter fullscreen mode Exit fullscreen mode

We show region, product, and the total number of products sold in that region. We filter regions based on the condition that the sum of orders in a given region must be at least 10% of the sum of all orders.

However, we can see that this query isn’t very straightforward. It mixes where things are calculated and filtered, calculates sum many times, and doesn’t provide a clear understanding of how we calculate the threshold. Let’s now rewrite it with the WITH statement:

WITH best_sales_threshold AS (
    SELECT SUM(amount) / 10 AS threshold
    FROM orders
), sales_by_region AS (
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
), top_regions AS (
    SELECT DISTINCT(region)
    FROM sales_by_region
    WHERE total_sales > (SELECT threshold FROM best_sales_threshold)
)
SELECT o.region, o.product, o2.total_sales
FROM orders AS o
JOIN sales_by_region AS o2 ON o2.region = o.region
WHERE o.region in (SELECT region FROM top_regions)
Enter fullscreen mode Exit fullscreen mode

We first calculate the threshold as a temporary table with one column. Next, we calculate another temporary table with the sum of sales per region. We then find regions that performed the best. Finally, we just present the data.

We can see the query is much more readable and easier to understand. We can see that it’s not nested multiple times, things are "linear" and we can understand what’s going on just by following the query and meaningful names.

However, CTEs can allow us things that are not otherwise possible in standard SQL. CTEs can be recursive, so the temporary table can refer to itself to fill the content. Let’s see that in action:

WITH RECURSIVE numbers(n) AS (
        VALUES (1)
        UNION ALL
        SELECT n+1 FROM numbers WHERE n < 100
)
SELECT n FROM numbers;
Enter fullscreen mode Exit fullscreen mode

We define a recursive CTE that has one column. We set the initial value to 1. Next, we use the UNION ALL to introduce the recursive term that can refer to the query’s own output. So we take all the rows that are below one hundred, and we create new rows with values increased by one. This generates numbers from 1 to 100.

Window Functions

A window function calculates values across a set of table rows related to the current row. This is very similar to aggregation, but window functions do not group rows into a single row. The rows are still independent, however, they can contain values grouped together.

Let’s take the example from the previous section but ignore the top region part:

SELECT region, product, (SELECT SUM(amount) AS total_sales FROM orders AS o2 WHERE o2.region = o.region GROUP BY region)
FROM orders AS o
Enter fullscreen mode Exit fullscreen mode
+---------+------------+-------------+
| region  |  product   | total_sales |
+---------+------------+-------------+
| EU      | Product 1  |          30 |
| EU      | Product 2  |          30 |
| US      | XYZ        |           1 |
| JP      | ABC        |          10 |
+---------+------------+-------------+
Enter fullscreen mode Exit fullscreen mode

So we take the region, product, and the total sales for a given region. With window function we can do the following:

SELECT region, product, SUM(amount) OVER (PARTITION BY region) AS total_sales
FROM orders AS o
Enter fullscreen mode Exit fullscreen mode

We simply specify that the value for total_sales column should take all the rows for which the region column has the same value as the region value of the current row, and then sum all of them.

We can also include the ordering. Let’s say that we wanted to rank the products within a region. We can use the following:

SELECT region, product, SUM(amount) OVER (PARTITION BY region) AS total_sales, rank() OVER (PARTITION BY region ORDER BY amount DESC)
FROM orders AS o
Enter fullscreen mode Exit fullscreen mode

We once again include rows from the same region, but this time we also order them by the decreasing amount, and we calculate the value with rank() function

+---------+------------+--------------+------------+
| region  |  product   | total_sales  | sales_rank |
+---------+------------+--------------+------------+
| EU      | Product 1  |          30  |          1 |
| EU      | Product 2  |          30  |          2 |
| US      | XYZ        |           1  |          1 |
| JP      | ABC        |          10  |          1 |
+---------+------------+--------------+------------+
Enter fullscreen mode Exit fullscreen mode

There are many more functions that we can use:

  • FIRST_VALUE - returns value of the first row in the partition
  • LAST_VALUE - returns value of the last row in the partition
  • LAG - returns value of the row before the row in the partition
  • LEAD - returns value of the row after the row in the partition
  • NTH_VALUE - returns value of the n-th row in the partition

We can also use the regular aggregate functions like SUM, MIN, MAX, etc.

Practical examples

Let us now see some practical examples.

Choosing n-th element from a group

Let’s say that we would like to group rows and select top three rows in each group:

WITH numbered_orders AS (
        SELECT region, product, rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rank
        FROM orders AS o
)
SELECT *
FROM numbered_orders
WHERE rank < 4
Enter fullscreen mode Exit fullscreen mode

We use CTE to number orders in a given region based on their rank. Then, we select rows for which rank is less than 4.

Running sums

Let’s say that we would like to calculate the running sum, so sum the value of a given column for all the rows before the current row:

SELECT region, product, SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running_sum
FROM orders AS o
Enter fullscreen mode Exit fullscreen mode

Difference between neighboring rows

Let’s say that we would like to show by how much the product is performing better than the previous one. We can do that like this:

SELECT region, product, amount, amount - LAG(amount, 1) OVER (PARTITION BY region ORDER BY amount)
FROM orders AS o
Enter fullscreen mode Exit fullscreen mode

So we take the current amount and subtract the amount of the product that was one row earlier using the LAG function.

Performance

It may seem that CTE and window functions bring benefits only. That’s not necessarily the case. Let’s take these two queries:

WITH cte_performance AS (
    SELECT *, MD5(MD5(ticket_no)) AS double_hash
    FROM boarding_passes
)
SELECT COUNT(*)
FROM cte_performance AS C1
JOIN cte_performance AS C2 ON C2.ticket_no = C1.ticket_no
JOIN cte_performance AS C# ON C3.ticket_no = C1.ticket_no
WHERE
    C1.double_hash = 'HASH'
    AND C2.double_hash = 'HASH'
    AND C3.double_hash = 'HASH'
Enter fullscreen mode Exit fullscreen mode

and

SELECT COUNT(*)
FROM boarding_passes AS C1
JOIN boarding_passes AS C2 ON C2.ticket_no = C1.ticket_no
JOIN boarding_passes AS C# ON C3.ticket_no = C1.ticket_no
WHERE
    MD5(MD5(C1.ticket_no)) = 'HASH'
    AND MD5(MD5(C2.ticket_no)) = 'HASH'
    AND MD5(MD5(C3.ticket_no)) = 'HASH'
Enter fullscreen mode Exit fullscreen mode

They are equivalent. The boarding_passes table has 8 million rows in total. The former query takes 13 seconds to complete, while the latter takes 8. They return the same result, but their performance differs greatly.

Summary

Common Table Expressions and window functions let you make your queries much more readable and maintainable. They also make some queries doable directly in the SQL. It’s always a good idea to explore whether they improve the readability in the specific case.

Top comments (0)