One of the good practices of database design is normalization: we decompose data into multiple tables to avoid duplication and keep storage organized. As a consequence, we need to join tables when querying the data. The SQL engine has to compute the result of each join, and it can choose between multiple join strategies (algorithms). In this blog post we'll walk through the typical join algorithms and when they're used.
Overview
We're going to use the demodb database; we at Metis provide a Docker container with it. Specifically, we'll only use two tables: ticket_flights with ~8 million rows and flights with ~200 thousand rows. We can take a look at the schema:
Table flights has a primary key on the flight_id column. This primary key is backed by a B-tree index, which keeps the flight_id values in sorted order. Similarly, ticket_flights has a primary key on the tuple (ticket_no, flight_id).
Before moving on, let’s also set parallel scans to 0 with the following query:
SET max_parallel_workers_per_gather = 0;
Parallel scans don't change the join algorithms themselves, so we can ignore them for the purposes of this article.
Nested Loop Join
The first and simplest join strategy is called Nested Loop Join. It can be depicted with the following pseudocode:
For row1 in table1:
    For row2 in table2:
        If join_condition(row1, row2):
            Add_to_result(row1, row2)
We iterate over both tables with two nested loops and join them naively. The time complexity is O(size(table1) * size(table2)), and the memory complexity is O(1).
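As a minimal sketch (not how PostgreSQL implements it), the algorithm can be written like this, with small made-up stand-ins for the demo tables:

```python
def nested_loop_join(table1, table2, condition):
    """Join two lists of rows by checking every pair:
    O(n * m) time, O(1) extra memory."""
    result = []
    for row1 in table1:
        for row2 in table2:
            if condition(row1, row2):
                result.append((row1, row2))
    return result

# Toy stand-ins for the demo tables.
flights = [{"flight_id": 1}, {"flight_id": 2}, {"flight_id": 3}]
ticket_flights = [{"ticket_no": "A", "flight_id": 2},
                  {"ticket_no": "B", "flight_id": 3}]

# The inequality join from the query below: f.flight_id < tf.flight_id
pairs = nested_loop_join(flights, ticket_flights,
                         lambda f, tf: f["flight_id"] < tf["flight_id"])
print(len(pairs))  # 3 matching pairs
```

Note that the condition can be anything, including inequalities, which is why Nested Loop Join is the fallback strategy when no other algorithm applies.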
Let’s now see that in action. Take this query:
EXPLAIN
SELECT *
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id < tf.flight_id
And here is the plan we obtained:
Nested Loop (cost=0.42..16532324955.82 rows=601044021228 width=95)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
-> Index Scan using flights_pkey on flights f (cost=0.42..1253.81 rows=71622 width=63)
Index Cond: (flight_id < tf.flight_id)
We can see that the engine decided to use an index to scan the flights table. No index was used to scan the ticket_flights table, which is bad: scanning the whole table means reading every row, which amounts to a lot of data. When filtering, we generally want to avoid scanning the whole table and read as little data as possible. Both of these scans are then joined with the Nested Loop algorithm.
Let’s see if changing the order of joins matters. Take this query:
EXPLAIN
SELECT *
FROM flights AS f
JOIN ticket_flights AS tf ON f.flight_id < tf.flight_id
And here is the plan we get:
Nested Loop (cost=0.42..16532324955.82 rows=601044021228 width=95)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
-> Index Scan using flights_pkey on flights f (cost=0.42..1253.81 rows=71622 width=63)
Index Cond: (flight_id < tf.flight_id)
The result is the same.
However, we can also see that changing the output aggregation can change the way tables are scanned, although it doesn't change the join algorithm. Let's take this query:
EXPLAIN
SELECT COUNT(*)
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id < tf.flight_id
We get the following plan:
Aggregate (cost=18034924512.89..18034924512.90 rows=1 width=8)
-> Nested Loop (cost=0.42..16532314459.82 rows=601044021228 width=0)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=4)
-> Index Only Scan using flights_pkey on flights f (cost=0.42..1253.81 rows=71622 width=4)
Index Cond: (flight_id < tf.flight_id)
We can see that now the flights table is scanned with an Index Only Scan. This operation doesn't even need to fetch rows from the table; it can get everything from the index, which makes it even faster than a regular Index Scan. The scans are once again joined with the Nested Loop operation, and finally the Aggregate operation computes the count of the rows.
Hash Join
The next strategy is called Hash Join. The Hash Join algorithm consists of two phases. In the first phase we build a hash table from one of the tables we want to join. In the second phase we iterate over the rows of the other table and look for a match in the hash table. The algorithm looks like this:
For row1 in table1:
    Hashtable.add(row1.id, row1)
For row2 in table2:
    Row1 = hashtable.get(row2.id)
    If row1 is not Null:
        Add_to_result(row1, row2)
The time complexity is O(size(table1) + size(table2)), assuming the hashing algorithm is good and we have O(1) lookup time. The memory complexity is O(size(table1)), so the order matters: the engine generally prefers to hash the smaller table.
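Here's a minimal sketch of the two phases (again a simplification of what PostgreSQL does, with toy stand-in tables); note that the hash table maps each key to a list of rows so that duplicate keys still produce every matching pair:

```python
def hash_join(build_table, probe_table, key):
    """Equi-join: build a hash table from one table, then probe it
    with the other. O(n + m) time, O(size(build_table)) extra memory."""
    # Phase 1: build. Keep a list per key to handle duplicates.
    hashtable = {}
    for row in build_table:
        hashtable.setdefault(row[key], []).append(row)
    # Phase 2: probe.
    result = []
    for row in probe_table:
        for match in hashtable.get(row[key], []):
            result.append((match, row))
    return result

# Toy stand-ins: build on the smaller table, probe with the bigger one.
flights = [{"flight_id": 1}, {"flight_id": 2}]
ticket_flights = [{"ticket_no": "A", "flight_id": 2},
                  {"ticket_no": "B", "flight_id": 2}]
print(len(hash_join(flights, ticket_flights, "flight_id")))  # 2
```

Because the lookup is a hash-table probe on a key, Hash Join only works for equality conditions; an inequality join like the one in the previous section cannot use it.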
Let’s see that in action:
EXPLAIN
SELECT *
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id = tf.flight_id
This is the plan:
Hash Join (cost=9767.51..302691.07 rows=8391852 width=95)
Hash Cond: (tf.flight_id = f.flight_id)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
-> Hash (cost=4772.67..4772.67 rows=214867 width=63)
-> Seq Scan on flights f (cost=0.00..4772.67 rows=214867 width=63)
We can see that the engine decided to scan the flights table and build a hash table out of it. Next, it iterates over the ticket_flights table and matches the rows based on the condition.
If we swap the join order in the SQL query like this:
EXPLAIN
SELECT *
FROM flights AS f
JOIN ticket_flights AS tf ON f.flight_id = tf.flight_id
then we get exactly the same plan:
Hash Join (cost=9767.51..302691.07 rows=8391852 width=95)
Hash Cond: (tf.flight_id = f.flight_id)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
-> Hash (cost=4772.67..4772.67 rows=214867 width=63)
-> Seq Scan on flights f (cost=0.00..4772.67 rows=214867 width=63)
The engine is allowed to do so. SQL queries are declarative, so they define what the result is, but they don’t dictate how the result is calculated.
However, if we add the aggregation:
EXPLAIN
SELECT COUNT(*)
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id = tf.flight_id
Then we get this plan:
Aggregate (cost=271560.70..271560.71 rows=1 width=8)
-> Hash Join (cost=8298.51..250581.07 rows=8391852 width=0)
Hash Cond: (tf.flight_id = f.flight_id)
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=4)
-> Hash (cost=4772.67..4772.67 rows=214867 width=4)
-> Seq Scan on flights f (cost=0.00..4772.67 rows=214867 width=4)
You can see that the flights table is still scanned sequentially as before; there is no index scan this time.
Merge Join
The Merge Join algorithm is used when we can iterate over the rows of both tables in join-key order. It works like this:
Table1_sorted = table1.sort()
Table2_sorted = table2.sort()
Row1 = table1_sorted.first()
Row2 = table2_sorted.first()
While row1 is not Null and row2 is not Null:
    If row1 < row2:
        Row1 = table1_sorted.next()
    Else if row1 > row2:
        Row2 = table2_sorted.next()
    Else:
        Add_to_result(row1, row2)
        Advance past the matching rows (emitting every pair of equal keys)
The time complexity is O(size(table1)*log(size(table1)) + size(table2)*log(size(table2)) + size(table1) + size(table2)), and the memory complexity is O(size(table1) + size(table2)). However, if the data is already ordered, then we get O(size(table1) + size(table2)) time complexity and O(1) memory complexity.
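The merge phase can be sketched like this (a simplified illustration with toy stand-in tables, not PostgreSQL's implementation); the inner loops handle runs of duplicate keys so that every matching pair is emitted:

```python
def merge_join(table1, table2, key):
    """Equi-join two lists already sorted on `key` by advancing
    two cursors. O(n + m) time once both inputs are sorted."""
    result, i, j = [], 0, 0
    while i < len(table1) and j < len(table2):
        k1, k2 = table1[i][key], table2[j][key]
        if k1 < k2:
            i += 1
        elif k1 > k2:
            j += 1
        else:
            # Emit the cross product of the two runs of equal keys.
            j_start = j
            while i < len(table1) and table1[i][key] == k1:
                j = j_start
                while j < len(table2) and table2[j][key] == k1:
                    result.append((table1[i], table2[j]))
                    j += 1
                i += 1
    return result

flights = [{"flight_id": 1}, {"flight_id": 2}]  # sorted on flight_id
ticket_flights = [{"flight_id": 1}, {"flight_id": 2}, {"flight_id": 2}]
print(len(merge_join(flights, ticket_flights, "flight_id")))  # 3
```

Like Hash Join, this relies on an equality condition; the gain over Hash Join is that no hash table has to be built, provided both inputs can be produced in sorted order.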
Let’s see that in action. First, disable the hash join strategy:
SET enable_hashjoin = off;
And then run this query:
EXPLAIN
SELECT *
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id = tf.flight_id
We get the following plan:
Merge Join (cost=1520511.52..1676140.76 rows=8391852 width=95)
Merge Cond: (f.flight_id = tf.flight_id)
-> Index Scan using flights_pkey on flights f (cost=0.42..8245.57 rows=214867 width=63)
-> Materialize (cost=1520506.91..1562466.17 rows=8391852 width=32)
-> Sort (cost=1520506.91..1541486.54 rows=8391852 width=32)
Sort Key: tf.flight_id
-> Seq Scan on ticket_flights tf (cost=0.00..153851.52 rows=8391852 width=32)
We can see that the engine had to sort the ticket_flights table after scanning it. The flights table, however, could be read in sorted order right away via the B-tree index backing its primary key.
The ticket_flights table needs to be sorted because its primary key is the tuple (ticket_no, flight_id). The order of the fields matters: the index orders rows by ticket_no first, so the flight_id values are not stored in order.
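A tiny Python sketch (with made-up rows) illustrates the point: rows ordered by the composite key (ticket_no, flight_id) are generally not ordered by flight_id alone.

```python
# Toy rows listed in (ticket_no, flight_id) order,
# the way the composite primary-key index stores them.
rows = [
    {"ticket_no": "A1", "flight_id": 7},
    {"ticket_no": "A2", "flight_id": 3},
    {"ticket_no": "A3", "flight_id": 9},
]

# The composite-key order holds...
assert rows == sorted(rows, key=lambda r: (r["ticket_no"], r["flight_id"]))

# ...but flight_id alone is out of order, so a merge join
# on flight_id must sort these rows first.
flight_ids = [r["flight_id"] for r in rows]
print(flight_ids == sorted(flight_ids))  # False
```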
Summary
The engine can choose how to calculate the join of two tables. The various algorithms have different time and memory complexities, so it's useful to understand how we can speed things up: by adding indexes, or by writing conditions that allow the engine to use the more efficient join algorithms.