DEV Community

Rafsun Masud
Rafsun Masud

Posted on

How Apache AGE turns a Relational DBMS into a Graph DBMS

You may be wondering how it is possible to use tables to store and query a graph efficiently. Well, AGE has found its way of doing it.

In case you have not heard of AGE yet, it is an open source graph database system built as an extension of Postgres. Checkout their GitHub page and official website.

As someone who contributed to this project, I will explain how AGE does it and share some key insights on performance.

Little background on Graph DBMS

You will see two most common object in a AGE graph: vertex and edge. Generally, they are called entities or nodes.

Each entity is made of a label and some properties. Properties are simply a collection of key-value pairs (similar to a JSON object). But, what is a label? Why is label a separate thing? Why can't it be part of the properties?

Label is meant to provide categories (like students, professors), while properties are for details (like id, name, role etc.) Although you could still put category information inside properties, but using labels to categorise gives you performance advantage. In fact, labels do influence AGE's internal data structure, and you will see how.

Graphs are schemas

AGE creates and uses a separate schema (a schema is basically a namespace in Postgres) for each graph. Let's see by examples.

Run Postgres and load AGE (you'll find how to do that in the GitHub link above).

First, create a graph:

SELECT ag_catalog.create_graph('mygraph');
Enter fullscreen mode Exit fullscreen mode

Next, run the command \dn to print all schemas:

test=# \dn
    List of schemas
    Name    |  Owner
------------+----------
 ag_catalog | rafsun42
 mygraph    | rafsun42
 public     | rafsun42
(3 rows)
Enter fullscreen mode Exit fullscreen mode

You can see AGE just created two schemas: ag_catalog and mygraph.

Besides using a schema for each graph, AGE uses one catalog schema, the ag_catalog, to hold references (by means of schema name) to all AGE graphs created within the database.

In the rest of the article, we will explore the tables within the mygraph schema to understand AGE's graph data structure and performance.

Labels are tables

Building the graph with Cypher

Let's first build the graph by adding vertices and edges (AGE implements openCypher specification as its query language):

SELECT * FROM ag_catalog.cypher('mygraph',
$$
    CREATE 
        (prof: Professor {degree: 'Bioloy'} ),
        (class {code: 'BIOL 1001'}),
        (prof)-[teach: Teaches {day: 'Mon'}]->(class)
    RETURN prof, teach, class
$$)
as (prof agtype, teach agtype, class agtype);
Enter fullscreen mode Exit fullscreen mode

Here, we created two vertices and a one-directional relation between them. Together, it represents that a professor teaches a biology class.

Notice how Cypher works. All three entities have properties (looks like JSON objects) and variables (to be used in the RETURN clause). But, only the prof vertex and the teach edge have label (right after the colon).

Tables and their schemes

Now, lets list the tables within the schema by running the commmand \dt mygraph.*:

test=# \dt mygraph.*
               List of relations
 Schema  |       Name       | Type  |  Owner
---------+------------------+-------+----------
 mygraph | Professor        | table | rafsun42
 mygraph | Teaches          | table | rafsun42
 mygraph | _ag_label_edge   | table | rafsun42
 mygraph | _ag_label_vertex | table | rafsun42
(4 rows)
Enter fullscreen mode Exit fullscreen mode

As you can see, every label has its own table. Plus, two extra tables, called defaults, for entities with no label. Entities are stored in their respective label's table as a row.

I encourage you to see the schema of the Professor and Teaches tables by running the following two command:

test=# \d mygraph."Professor" 
                Table "mygraph.Professor"
   Column   |  Type   | Collation | Nullable |     Default                                                       
------------+---------+-----------+----------+------------------
 id         | graphid |           | not null | ...
 properties | agtype  |           | not null | ...
Inherits: mygraph._ag_label_vertex


test=# \d mygraph."Teaches" 
                Table "mygraph.Teaches"
   Column   |  Type   | Collation | Nullable |      Default                                                     
------------+---------+-----------+----------+-------------------
 id         | graphid |           | not null | ...
 start_id   | graphid |           | not null | 
 end_id     | graphid |           | not null | 
 properties | agtype  |           | not null | ...
Inherits: mygraph._ag_label_edge
Enter fullscreen mode Exit fullscreen mode

Why these two tables has different schema? Because one represents vertex label, and the other represents edge label. Edge entities require two additional attributes start_id and end_id to store the endpoints.

Performance

I hope by now you understand how AGE stores graph and its entities in tables. Let's discuss why each label has separate table and how it affects performance.

With this design, AGE takes advantage of partitioning with Postgres' inheritance feature. If you are not familiar with either partitioning or inheritance in Postgres, I recommend checking out my articles on these topic first.

AGE puts all vertices in a single table at logical level and partition them by label at physical level. Notice the output of the \d commands again. All vertex label tables inherit the default _ag_label_vertex table, and all edge label tables inherit the default _ag_label_edge table.

When I mentioned before that _ag_label_vertex contains only vertices with no label, it was true only at physical level. Each label's table has its own physical file on disk. However, at logical level, the default tables, like _ag_label_vertex, is a table that contains all vertices. Because other vertex label tables are children of it. If you scan the _ag_label_vertex table, Postgres will fetch all vertices unless you specify not to.

Let's see how that helps in AGE's use cases. In a typical query, you would like to lookup some entities. Think about the two ways of looking up an entity in a graph (using Cypher):

# Example-1: By label
# Returns all professors
MATCH (x: Professor) RETURN x

# Example-2: By a property
# Returns all people with a biology
# degree- could be anyone
MATCH (x {degree: 'Biology'}) RETURN x
Enter fullscreen mode Exit fullscreen mode

The performance of a lookup query greatly depends on how closely relevant data are stored on the physical disk. Since all rows of a relation are stored contiguously on disk, with this design, all nodes with same label will be put together in the same place. When executing the MATCH in the first example, all professor nodes can be fecthed to the main memory from the disk with minimum I\O calls.

For the second example, we don't know the label. So we don't know which table to scan. Because of inheritance, we don't have to manually scan each vertex label table, instead we can only scan _ag_label_vertex to look for any vertices regardless of its label.

Overall, this approach makes matching by label faster, while keeping matching by properties simple.

Summary

To summarize, each graph is a schema. Each label in a graph has its own table. For no-label entities, there are two default tables. All vertices and edges are stored in the respective table.

As for performance, entities are put in the default table at logical level. But at physical level they are partioned by their label. As a result of partitioning, lookup query performance improves.

Top comments (0)