Chun Fei Lung

Posted on Aug 18, 2021 • Edited on Dec 4, 2021 • Originally published at chuniversiteit.nl

When and how to JOIN a table with itself

#sql

“They’re the same table.”

Relational databases make it very easy to join data from different tables, but did you know you can also JOIN data within the same table?

Relational databases make it easy to JOIN (combine) data from different tables. However, you can also JOIN a table with itself. This is especially handy when a table already contains all the data you need, but not in the right format.

References to the same table

One of the most common reasons to JOIN a table with itself, is because you have a table that contains a foreign key reference to itself.

In the example below we have an employees table with employees. The table includes a manager_id column that refers to an employee in that same employees table.

employees
id	name	manager_id
1	Denholm
2	Richmond
3	Roy	5
4	Maurice	5
5	Jen	1

If we want to create an overview that lists the name of each employee along with the name of their manager, we need to join the employees table with itself.

When we do this, we need to assign an alias | Think of these as variable names for your tables! to one (or both) of them so that the database can distinguish between the two instances. It doesn’t really matter what names you use, as long as they’re unique and make sense to you.

Since we have two tables with identical column names, the database no longer understands what you mean when you say you want something like a name. You therefore have to explicitly tell it that you want the name of some row in the employees or the managers version of the table. This can be done by prepending column names with a table alias, followed by a .:

SELECT 
  employees.id,
  employees.name,
  managers.name AS manager -- Note: the “AS” here is optional
FROM
  employees
LEFT JOIN -- Use an INNER JOIN to exclude employees without a manager
  employees AS managers    -- Note: this “AS” is optional as well!
    ON employees.manager_id = managers.id;

This query gives us the following result:

+----+----------+---------+
| id | name     | manager |
+----+----------+---------+
|  1 | Denholm  | NULL    |
|  2 | Richmond | NULL    |
|  3 | Roy      | Jen     |
|  4 | Maurice  | Jen     |
|  5 | Jen      | Denholm |
+----+----------+---------+

Computing durations between changes

Another common reason to use self joins is to calculate some duration or distance between pairs of rows within a table.

For example, the order_state table below keeps track of state changes that happen to orders at a webshop. Each record in this table includes a reference to the order, the name of the new state, and shows when the order changed to its new state. Can we calculate how long each of these state transitions took?

order_state
id	order_id	state	created_at
1	1	placed	2021-06-01 00:01:00
2	1	packaged	2021-06-01 00:58:00
3	1	despatched	2021-06-01 02:20:00
4	1	delivered	2021-06-01 15:20:00
5	2	placed	2021-06-02 14:10:00
6	2	packaged	2021-06-02 18:55:00
7	3	placed	2021-06-02 19:00:00
8	3	packaged	2021-06-02 20:30:00
9	3	despatched	2021-06-03 01:40:00
10	2	despatched	2021-06-03 01:40:00
11	2	delivered	2021-06-03 08:00:00
12	3	delivered	2021-06-03 16:10:00

Of course we can! The query below shows how.

Once again, we have one table, order_state, that we want to JOIN with itself. We name one of these from_state and the other to_state, to represent the collection of states before and after the transition.

Obviously, the two states involved in the transition should belong to the same order, so we add the condition that from_state.order_id must be equal to to_state.order_id. Moreover, the to_state should occur later than the from_state, so we’ll also add a condition that the to_state.id must be
higher than from_state (*).

(*) In real applications might also or only want to look at the timestamp. Be careful if multiple transitions can happen at the same time, e.g. within the same second!

Finally, we want to make sure that we only see direct state transitions, like those from placed to packages, and not indirect ones, like those from placed to delivered. This is done by first grouping by from_state.id, which ensures that the “from” state appears only once in the result, and then JOINING each from_state with the closest to_state record that exists, i.e. the record with the lowest to_state.id.

SELECT 
  from_state.order_id,
  from_state.state,
  to_state.state,
  to_state.created_at AS changed_at,
  TIMEDIFF(to_state.created_at, from_state.created_at) AS duration
FROM
  order_state from_state
    INNER JOIN
  order_state to_state
    ON 1 -- This “1” here makes it possible to nicely align our JOIN conditions
      AND from_state.order_id = to_state.order_id
      AND from_state.id < to_state.id
GROUP BY
  from_state.id
HAVING
  MIN(to_state.id)
ORDER BY
  order_id,
  from_state.id;

+----------+------------+------------+---------------------+----------+
| order_id | state      | state      | changed_at          | duration |
+----------+------------+------------+---------------------+----------+
|        1 | placed     | packaged   | 2021-06-01 00:58:00 | 00:57:00 |
|        1 | packaged   | despatched | 2021-06-01 02:20:00 | 01:22:00 |
|        1 | despatched | delivered  | 2021-06-01 15:20:00 | 13:00:00 |
|        2 | placed     | packaged   | 2021-06-02 18:55:00 | 04:45:00 |
|        2 | packaged   | despatched | 2021-06-03 01:40:00 | 06:45:00 |
|        2 | despatched | delivered  | 2021-06-03 08:00:00 | 06:20:00 |
|        3 | placed     | packaged   | 2021-06-02 20:30:00 | 01:30:00 |
|        3 | packaged   | despatched | 2021-06-03 01:40:00 | 05:10:00 |
|        3 | despatched | delivered  | 2021-06-03 16:10:00 | 14:30:00 |
+----------+------------+------------+---------------------+----------+

Creating pairs of data

Self joins can also be used to easily create large or unique pairs of data from a relatively small set of data.

Cartesian product

You can join a table with itself without any JOIN conditions to create a result set that includes every possible combination of values within a single table, for instance when you want to know how many possible configurations exist for a product or when you wish to discover which combinations are common or rare.

In the example below, we combine multiple instances of a digit table to form larger numbers.

digit
id
0
1
2
3
4
5
6
7
8
9

SELECT 
  CONCAT(B.id, A.id) AS number
FROM
  digit A
    JOIN
  digit B;

+--------+
| number |
+--------+
| 00     |
| 01     |
| 02     |
| 03     |
| 04     |
| 05     |
| 06     |
| 07     |
| 08     |
| 09     |
| 10     |
| 11     |
| 12     |
| 13     |
| 14     |
| 15     |
| 16     |
| 17     |
| 18     |
| 19     |
| 20     |
| 21     |
| 22     |
| 23     |
| 24     |
| 25     |
| 26     |
| 27     |
| 28     |
| 29     |
| 30     |
| 31     |
| 32     |
| 33     |
| 34     |
| 35     |
| 36     |
| 37     |
| 38     |
| 39     |
| 40     |
| 41     |
| 42     |
| 43     |
| 44     |
| 45     |
| 46     |
| 47     |
| 48     |
| 49     |
| 50     |
| 51     |
| 52     |
| 53     |
| 54     |
| 55     |
| 56     |
| 57     |
| 58     |
| 59     |
| 60     |
| 61     |
| 62     |
| 63     |
| 64     |
| 65     |
| 66     |
| 67     |
| 68     |
| 69     |
| 70     |
| 71     |
| 72     |
| 73     |
| 74     |
| 75     |
| 76     |
| 77     |
| 78     |
| 79     |
| 80     |
| 81     |
| 82     |
| 83     |
| 84     |
| 85     |
| 86     |
| 87     |
| 88     |
| 89     |
| 90     |
| 91     |
| 92     |
| 93     |
| 94     |
| 95     |
| 96     |
| 97     |
| 98     |
| 99     |
+--------+

I once used a very similar method to generate lists of dates in a legacy business intelligence application that was built on top of a MySQL database:

SELECT 
  DATE_ADD(
    CONCAT(
      YEAR(CURDATE()), -- Get the current year
      '-01-01'         -- Append “-01-01” to it so that we get “2021-01-01”
    ),
    INTERVAL CONCAT(B.id, A.id) DAY -- Show 2021-01-01 + x days
  ) AS date
FROM
  digit A
    JOIN
  digit B;

+------------+
| date       |
+------------+
| 2021-01-01 |
| 2021-01-02 |
| 2021-01-03 |
| 2021-01-04 |
| 2021-01-05 |
| 2021-01-06 |
| 2021-01-07 |
| 2021-01-08 |
| 2021-01-09 |
| 2021-01-10 |
| 2021-01-11 |
| 2021-01-12 |
| 2021-01-13 |
| 2021-01-14 |
| 2021-01-15 |
| 2021-01-16 |
| 2021-01-17 |
| 2021-01-18 |
| 2021-01-19 |
| 2021-01-20 |
| 2021-01-21 |
| 2021-01-22 |
| 2021-01-23 |
| 2021-01-24 |
| 2021-01-25 |
| 2021-01-26 |
| 2021-01-27 |
| 2021-01-28 |
| 2021-01-29 |
| 2021-01-30 |
| 2021-01-31 |
| 2021-02-01 |
| 2021-02-02 |
| 2021-02-03 |
| 2021-02-04 |
| 2021-02-05 |
| 2021-02-06 |
| 2021-02-07 |
| 2021-02-08 |
| 2021-02-09 |
| 2021-02-10 |
| 2021-02-11 |
| 2021-02-12 |
| 2021-02-13 |
| 2021-02-14 |
| 2021-02-15 |
| 2021-02-16 |
| 2021-02-17 |
| 2021-02-18 |
| 2021-02-19 |
| 2021-02-20 |
| 2021-02-21 |
| 2021-02-22 |
| 2021-02-23 |
| 2021-02-24 |
| 2021-02-25 |
| 2021-02-26 |
| 2021-02-27 |
| 2021-02-28 |
| 2021-03-01 |
| 2021-03-02 |
| 2021-03-03 |
| 2021-03-04 |
| 2021-03-05 |
| 2021-03-06 |
| 2021-03-07 |
| 2021-03-08 |
| 2021-03-09 |
| 2021-03-10 |
| 2021-03-11 |
| 2021-03-12 |
| 2021-03-13 |
| 2021-03-14 |
| 2021-03-15 |
| 2021-03-16 |
| 2021-03-17 |
| 2021-03-18 |
| 2021-03-19 |
| 2021-03-20 |
| 2021-03-21 |
| 2021-03-22 |
| 2021-03-23 |
| 2021-03-24 |
| 2021-03-25 |
| 2021-03-26 |
| 2021-03-27 |
| 2021-03-28 |
| 2021-03-29 |
| 2021-03-30 |
| 2021-03-31 |
| 2021-04-01 |
| 2021-04-02 |
| 2021-04-03 |
| 2021-04-04 |
| 2021-04-05 |
| 2021-04-06 |
| 2021-04-07 |
| 2021-04-08 |
| 2021-04-09 |
| 2021-04-10 |
+------------+

Unique pairs

Sometimes you’re only interested in unique combinations. For example when you need to create pairs of people or want to list edges in an undirected graph (**): it doesn’t matter which value is listed first or second – what matters is that the first and second values are listed together exactly once.

(**) Yes, you can store graphs in a relational database!

The example below shows a simple person table that contains the names of six people.

person
id	name
1	Rachel
2	Monica
3	Phoebe
4	Joey
5	Chandler
6	Ross

We can create a list of possible pairings by adding a simple condition to the JOIN clause:

SELECT 
  A.name, B.name
FROM
  person A
    JOIN
  person B ON A.id < B.id;

+----------+----------+
| name     | name     |
+----------+----------+
| Rachel   | Monica   |
| Rachel   | Phoebe   |
| Rachel   | Joey     |
| Rachel   | Chandler |
| Rachel   | Ross     |
| Monica   | Phoebe   |
| Monica   | Joey     |
| Monica   | Chandler |
| Monica   | Ross     |
| Phoebe   | Joey     |
| Phoebe   | Chandler |
| Phoebe   | Ross     |
| Joey     | Chandler |
| Joey     | Ross     |
| Chandler | Ross     |
+----------+----------+

Ranking results

Self-JOINs can also be used to compute ranks of query results, although nowadays there are much better ways to do this.

DEV Community

When and how to JOIN a table with itself

References to the same table

Computing durations between changes

Creating pairs of data

Cartesian product

Unique pairs

Ranking results

Top comments (0)

Read next

Video: How to show title & tooltip when text overflows

Running a JAX Program from Dart Using C++ FFI

Building Data pipelines

Válendy: Pohodlí a Styl pro Každý Domov