Andrej Kirejeŭ

Posted on Dec 25, 2024

Trees in SQL

#sql #database #tutorial #computerscience

Business applications often need to represent and store information such as a company's organizational hierarchy, a country's administrative divisions, an inventory structure, or simply a set of folders and files. Despite their apparent differences, these examples share a common feature— they are all trees. Not the green, wood-made kind that provide pleasant shade on a sunny day, but, mathematically speaking, connected acyclic graphs. In computer science, the links between nodes are described in terms of parent-child relationships. Each node in the tree, except the root, has exactly one parent and zero, one, or many children.

The figure above shows an example of a tree. Vertex A is called the root of the tree. Nodes without subtrees — C, D, F, G, and H — are called leaves. Edges of the tree show parent-child relationships between nodes. A rooted tree is k-ary if each node has no more than k children. A binary tree is a well-known particular case of a k-ary tree where k = 2.

One of the common tasks performed on tree data is the retrieval of all child nodes at any nesting level, given a parent node. For example, calculating the total turnover of an account involves determining the turnovers of all its subaccounts, or compiling a list of goods belonging to a given goods group includes lists from all the nested subgroups.

In imperative programming languages, we usually use a graph traversal algorithm to find all subnodes. Although most modern database servers support SQL queries with recursive common table expressions, using the recursive part inside complex queries on large amounts of data can significantly degrade performance.

The reason is that SQL deals primarily with sets, not graphs, so one should implement dedicated data structures to represent nodes and edges as nested sets for efficient data selection.

In this article, we will consider four ways of storing tree-like structures inside a relational database and evaluate them in terms of query performance and the space needed for storing the underlying data.

Link to the parent

Perhaps the most intuitive and easiest way to represent tree-like data is to use the adjacency matrix in a relational table. Every record in the table corresponds to a node in the tree and holds the unique ID of the node (primary key) and the ID of its parent (for root nodes, the parent is null).

Below is a DDL query that creates the test1 table:

CREATE TABLE test1 (
  id INTEGER NOT NULL,
  parent INTEGER,
  name VARCHAR(60) NOT NULL,
  PRIMARY KEY (id),
  FOREIGN KEY (parent) REFERENCES test1 (id)
    ON UPDATE CASCADE
    ON DELETE CASCADE
)

Pay attention to the ON DELETE CASCADE clause of the FOREIGN KEY constraint declared on the PARENT field. It guarantees that the entire subtree will be deleted at once if the parent node is deleted. To protect the subtree from accidental deletion, the rule should be changed from CASCADE to RESTRICT.

Corresponding data for the tree shown at the beginning of the article looks like:

id	parent	name
1	NULL	A
2	1	B
3	1	E
4	1	H
5	2	C
6	2	D
7	3	F
8	3	G

Operations on tree data

Addition of a new tree node is done by the INSERT INTO operator:

INSERT INTO test1 (id, parent)
VALUES (<unique id of the node>, 
  <id of the parent node or NULL for a root node>)

To move the node from one parent to another, a value of the PARENT field has to be changed:

UPDATE test1
SET parent = <id of the new parent>
WHERE id = <id of the node being moved>

Obviously, to delete the node, one should invoke DELETE FROM operator providing ID of the node.

DELETE FROM test1
WHERE id = <id of the node>

Selecting whole subtree of the given node

With such an organization of the data, a simple SELECT operator allows us to fetch only one level of nesting of node's descendants.

SELECT id
FROM test1
WHERE parent = :P

So, what should we do if the task is to extract all subnodes, regardless of how deep they lie in the subtree?

An unconventional solution would be to write a query for every possible level of nesting and then merge their outputs using the UNION operator. Fortunately, we can use either recursive stored procedures or recursive common table expressions to assist us.

The procedure below returns the IDs of all subnodes of a given node. The second parameter specifies whether the parent node will be included in the resulting set.

CREATE PROCEDURE RecTest1(AnID INTEGER, Self INTEGER)
  RETURNS (ID INTEGER)
AS
BEGIN
  IF (:Self <> 0) THEN
  BEGIN
    ID = :AnID;
    SUSPEND;
  END

  FOR SELECT id FROM test1
  WHERE parent = :AnID INTO :ID
    DO FOR SELECT id FROM RecTest1(:ID, 1) INTO :ID
      DO SUSPEND;
END;

We can use the procedure on its own or JOIN its output with the tree data to select the required columns:

SELECT t.*
FROM test1 t
  JOIN RecTest1(:P, 0) r ON t.id = r.id

If a specific SQL server supports recursive common table expressions, we can write a query that selects all nested subnodes of the given node:

WITH RECURSIVE
  tree AS (
    SELECT id, parent
    FROM test1
    WHERE parent = <id of the given node>

    UNION ALL

    SELECT g.id, g.parent
    FROM test1 g JOIN tree h
      ON g.parent = h.id
  )
SELECT
  gt.id,
  gt.parent
FROM
  tree gt

Storing auxiliary links

In the previous example, we placed an id-parent link in the table with tree data. Since we only know about the relationship between a given node and its children, we need recursive queries to scan the tree in depth, which is not efficient when working with large amounts of data.

To overcome this limitation, an auxiliary table should be introduced. Such a table holds pairs of connected nodes along with the distance between them, where a distance of 1 indicates a parent-child relationship, and distances greater than 1 correspond to deeper levels of ancestry.

CREATE TABLE test2(
  id INTEGER NOT NULL,
  name VARCHAR(60) NOT NULL,
  PRIMARY KEY (id)
);

CREATE TABLE test2_link(
  id_from INTEGER NOT NULL,
  id_to INTEGER NOT NULL,
  distance INTEGER NOT NULL,

  PRIMARY KEY (id_from, id_to),
  FOREIGN KEY (id_from) REFERENCES test2 (id)
    ON UPDATE CASCADE
    ON DELETE CASCADE,
  FOREIGN KEY (id_to) REFERENCES test2 (id)
    ON UPDATE CASCADE
    ON DELETE CASCADE
);

Here is data for the tree:

id_from	id_to	distance
B	A	1
C	B	1
D	B	1
C	A	2
D	A	2
E	A	1
F	E	1
G	E	1
F	A	2
G	A	2
H	A	1

Let's get back to our business. Selection of the subtree now done much easier. No recursion required.

SELECT
  t.*
FROM
  test2 t JOIN test2_link l
    ON l.id_from = t.id
WHERE
  l.id_to = :P

Using distance field, we can arrange the resulting set by node's closeness to the given root node.

Manipulating Tree Data

While we have managed to avoid recursion when retrieving data, this has significantly complicated the processes of inserting and modifying records.

Adding a Node

To add a node, you need to perform two operations: insert into the tree nodes table and the links table.

For example, to add a child node with an identifier 2 to a parent node with an identifier 1, execute the following commands:

INSERT INTO test2 (id) VALUES (2);

INSERT INTO test2_link (id_from, id_to, distance)
VALUES (2, 1, 1);

Creating Triggers

To automate the maintenance of tree relationships, it's advisable to create two triggers.

The first trigger fires after inserting a record into the test2 table and adds a self-referential link with a distance of 0 into test2_link:

CREATE TRIGGER test2_ai FOR test2
AFTER INSERT
POSITION 32000
AS
BEGIN
  INSERT INTO test2_link (id_from, id_to, distance)
  VALUES (NEW.id, NEW.id, 0);
END;

The second trigger fires after inserting a record into the test2_link table.

Its purpose is to create links between the new node and its descendants (if adding a subtree) and all ancestors of the parent node:

CREATE TRIGGER test2_link_ai FOR test2_link
AFTER INSERT
POSITION 32000
AS
DECLARE VARIABLE AnID INTEGER;
DECLARE VARIABLE ADistance INTEGER;
BEGIN
  IF (NEW.distance = 1) THEN
  BEGIN
    INSERT INTO test2_link (id_from, id_to, distance)
    SELECT down.id_from, up.id_to,
           down.distance + up.distance + 1
    FROM test2_link up, test2_link down
    WHERE up.id_from = NEW.id_to
      AND down.id_to = NEW.id_from
      AND down.distance + up.distance > 0;
  END
END;

Moving a Node

To move a node (or subtree) from one parent to another, follow these steps:

Remove the link to the current parent, detaching the subtree:

   DELETE FROM test2_link
   WHERE id_from = <ID of the node or subtree>
     AND id_to = <ID of the current parent>
     AND distance = 1;

Add the subtree to the new parent:

   INSERT INTO test2_link (id_from, id_to, distance)
   VALUES (
     <ID of the node or subtree>,
     <ID of the new parent>,
     1
   );

The previously created insert trigger will handle the new relationships automatically.

Create a trigger that, upon deleting a link with a distance of 1, removes all links between the nodes of the subtree and the nodes along the path from the parent node to the tree root:

CREATE TRIGGER test2_link_bd FOR test2_link
BEFORE DELETE
POSITION 32000
AS
DECLARE VARIABLE AnIDFrom INTEGER;
DECLARE VARIABLE AnIDTo INTEGER;
BEGIN
  IF (OLD.distance = 1) THEN
  BEGIN
    FOR
      SELECT down.id_from, up.id_to
      FROM test2_link down
      JOIN test2_link up
        ON up.id_from = OLD.id_from
       AND up.distance + down.distance > 1
       AND down.id_to = OLD.id_from
      INTO :AnIDFrom, :AnIDTo
    DO
      DELETE FROM test2_link
      WHERE id_from = :AnIDFrom
        AND id_to = :AnIDTo;
  END
END;

Deleting a Node

Deleting a node doesn't require special handling, as foreign keys in the test2_link table are configured for cascading deletions.

When a node is deleted from the test2 table, all related records in test2_link are automatically removed:

DELETE FROM test2 WHERE id = <Node ID>;

Ensuring Data Integrity

To maintain logical data integrity and adherence to our model, create two additional triggers.

The first trigger prevents editing records in the test2_link table:

CREATE EXCEPTION tree_e_incorrect_operation
  'Incorrect operation';

CREATE TRIGGER test2_link_au FOR test2_link
BEFORE UPDATE
POSITION 32000
AS
BEGIN
  EXCEPTION tree_e_incorrect_operation;
END;

The second trigger deletes a node if a record with a distance of 0 is removed from the links table:

CREATE TRIGGER test2_link_ad FOR test2_link
AFTER DELETE
POSITION 32000
AS
BEGIN
  IF (OLD.distance = 0) THEN
  BEGIN
    DELETE FROM test2 WHERE id = OLD.id_from;
  END
END;

These measures will help maintain data integrity and ensure the correct tree structure within the database.

Composite Path

It's not necessary to use a separate table to store the list of nodes leading to the root of a tree. Instead, you can concatenate the ordered list of node identifiers into a string, using a chosen delimiter, and store it directly within the node records table. For example:

CREATE TABLE test3 (
  id INTEGER NOT NULL,
  parent INTEGER,
  path VARCHAR(60) NOT NULL,

  PRIMARY KEY (id),
  FOREIGN KEY (parent) REFERENCES test3 (id)
    ON UPDATE CASCADE
    ON DELETE CASCADE
);

To optimize retrieval operations using the STARTING WITH operator, it's essential to create the following index to enhance performance:

CREATE INDEX test3_x_path ON test3 (path);

If you choose a forward slash (/) as the delimiter, the test3 table for the tree depicted in the figure above would be populated as follows:

id	parent	path
A	NULL	/
B	A	/A/
C	B	/A/B/
D	B	/A/B/
E	A	/A/
F	E	/A/E/
G	E	/A/E/
H	A	/A/

Retrieving Subtree Nodes

To retrieve all nodes in the subtree rooted at node P, execute the following query:

SELECT *
FROM test3
WHERE path STARTING WITH
  (SELECT path FROM test3 WHERE id = :P);

If you wish to exclude the root node itself from the results, modify the query accordingly:

SELECT *
FROM test3
WHERE path STARTING WITH
  (SELECT path FROM test3 WHERE id = :P)
  AND id <> :P;

Manipulating Tree Data

Similar to the simplest method of organizing hierarchical data discussed earlier, adding, moving, or deleting a node can be accomplished using standard SQL commands: INSERT, UPDATE, or DELETE.

The automatic generation of the path string can be managed by implementing the following triggers:

CREATE TRIGGER test3_bi FOR test3
  BEFORE INSERT
  POSITION 32000
AS
  DECLARE VARIABLE P INTEGER;
BEGIN
  NEW.path = '';
  P = NEW.parent;
  WHILE (NOT :P IS NULL) DO
  BEGIN
    NEW.path = CAST(:P AS VARCHAR(60)) || '/'
      || NEW.path;
    SELECT parent
    FROM test3
    WHERE id = :P
    INTO :P;
  END
  NEW.path = '/' || NEW.path;
END;

CREATE EXCEPTION tree_e_incorrect_operation2
  'Incorrect operation';

CREATE TRIGGER test3_bu FOR test3
  BEFORE UPDATE
  POSITION 32000
AS
  DECLARE VARIABLE P INTEGER;
  DECLARE VARIABLE T INTEGER;
  DECLARE VARIABLE NP VARCHAR(60);
  DECLARE VARIABLE OP VARCHAR(60);
BEGIN
  IF((NEW.parent <> OLD.parent)
    OR (NEW.parent IS NULL AND NOT OLD.parent IS NULL)
    OR (NOT NEW.parent IS NULL
      AND OLD.parent IS NULL)) THEN
  BEGIN
    NEW.path = '';
    P = NEW.parent;
    WHILE (NOT :P IS NULL) DO
    BEGIN
      NEW.path = CAST(:P AS VARCHAR(60)) || '/'
        || NEW.path;
      SELECT parent
      FROM test3
      WHERE id = :P
      INTO :P;
    END
    NEW.path = '/' || NEW.path;

    /* Check for circular references */
    IF (POSITION(('/' ||
      CAST(NEW.id AS VARCHAR(60)) || '/')
      IN NEW.path) > 0) THEN
      EXCEPTION tree_e_incorrect_operation2;

    NP = NEW.path ||
      CAST(NEW.id AS VARCHAR(60)) || '/';
    OP = OLD.path ||
      CAST(OLD.id AS VARCHAR(60)) || '/';
    T = LENGTH(OP) + 1;

    UPDATE test3
    SET path = :NP || SUBSTRING(path FROM :T FOR 60)
    WHERE path STARTING WITH :OP;
  END
END;

Storing the composite path from the tree root to the current node imposes a limitation on the maximum tree depth, approximately equal to L div (K + 1), where L is the length of the string field used to store the path, and K is the average length of the string representation of the record key.

Go to the part 2...

DEV Community

Trees in SQL

Link to the parent

Operations on tree data

Selecting whole subtree of the given node

Storing auxiliary links

Manipulating Tree Data

Adding a Node

Creating Triggers

Moving a Node

Deleting a Node

Ensuring Data Integrity

Composite Path

Retrieving Subtree Nodes

Manipulating Tree Data

Top comments (0)

Read next

Automated Load Testing in CI/CD Pipelines: Lessons from a Custom Bash Script

🔥10 Git Features You Might Not Know About

Dynamic Svelte Components

Understanding the Data Link and Physical Layers: The Final Step in Mastering OSI Model and Computer Networking