Spring, SchemaSpy DB docs, and GitHub Pages

#java #docker #spring #tutorial

In this article, I'm telling you:

How to generate docs of database schema structure with SchemaSpy? Why you need it, if you use a relational database (PostgreSQL/Oracle/MySQL etc.) in your application?
How to run SchemaSpy with Testcontainers?
How to host the generated database documentation on GitHub Pages?

The stack I'm using in this project consists of:

You can find the entire repository with code examples by this link. The test that generates the SchemaSpy docs is available by this link.

What is SchemaSpy?

SchemaSpy is a standalone application that connects to the database, scans its tables and schemas, and generates nicely composed HTML documentation. You can check out an example sample by this link but I'm showing you just one screenshot to clarify my point.

SchemaSpy visualizes the database structure detailedly. Besides, you can also deep dive into relations, constraints, and table/column comments (i.e. COMMENT ON COLUMN/TABLE statements).

Such documentation is helpful for a variety of specialists:

System analysts want to understand the data processing and storage principles behind the business logic.
When QA engineers face a bug, the database structure documentation helps to investigate the reason that cause the problem deeply. As a matter of fact, they can also attach additional details to the ticket to make its fixing more transparent.
Data engineers have to be aware of tables structure to deal with Change Data Capture events correctly.

So, database documentation that is always relevant (because it's generated) is excessively beneficent.

Database tables

Look at the DDL operations that create database tables. This is our application domain we're going to document.

CREATE TABLE users
(
    id   BIGSERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL
);

CREATE TABLE community
(
    id   BIGSERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL
);

CREATE TABLE post
(
    id           BIGSERIAL PRIMARY KEY,
    name         VARCHAR(200)                     NOT NULL,
    community_id BIGINT REFERENCES community (id) NOT NULL
);

CREATE TABLE community_role
(
    id           BIGSERIAL PRIMARY KEY,
    user_id      BIGINT REFERENCES users (id)     NOT NULL,
    community_id BIGINT REFERENCES community (id) NOT NULL,
    type         VARCHAR(50)                      NOT NULL,
    UNIQUE (user_id, community_id, type)
);

CREATE TABLE post_role
(
    id      BIGSERIAL PRIMARY KEY,
    user_id BIGINT REFERENCES users (id) NOT NULL,
    post_id BIGINT REFERENCES post (id)  NOT NULL,
    type    VARCHAR(50)                  NOT NULL,
    UNIQUE (user_id, post_id, type)
);

As you can see, there are 3 core tables (users, community, and role) and 2 linking tables (community_role, post_role).

Running SchemaSpy with Testcontainers

The easiest way to keep documentation in sync with the current database structure is updating it on each merged Pull Request. Therefore, we need somehow to run SchemaSpy during tests execution.

I'm showing you the algorithm step by step. But you can check out the entire suite by this link.

Firstly, we need to define the SchemaSpy container itself. Look at the code snippet below.

class SchemaSpyTest extends AbstractControllerTest {

  @Test
  @SneakyThrows
  void schemaSpy() {
    @Cleanup final var schemaSpy =
        new GenericContainer<>(DockerImageName.parse("schemaspy/schemaspy:6.1.0"))
            .withNetworkAliases("schemaspy")
            .withNetwork(NETWORK)
            .withLogConsumer(new Slf4jLogConsumer(LoggerFactory.getLogger("SchemaSpy")))
            .withCreateContainerCmdModifier(cmd -> cmd.withEntrypoint(""))
            .withCommand("sleep 500000");
    ...
}

The AbstractControllerTest contains PostgreSQL container configuration. You can see its source code by this link.

The Cleanup annotation comes from the Lombok project. It generates try-finally statement.

I want to point out some important details here.

The withNetwork clause assigns the container to the existing Testcontainers NETWORK. This value inherits from the AbstractControllerTest and the PostgreSQL runs with this network as well. It’s crucial because otherwise SchemaSpy won’t be able to connect to PostgreSQL.
Log consumer applies a logger to push logs from the container. It’s useful to track bugs and errors.
The withCreateContainerCmdModifier is the primary part. By default, SchemaSpy container tries to connect to a database immediately and generate documentation. Then a container terminates. However, that’s not an acceptable behaviour, because the generation result remains inside container. Therefore, we need to copy it in the OS directory. But if a container has already stopped, it’s impossible. So, we have to override a default entry-point to make a container run (almost) indefinitely. That’s why I put the sleep 500000 command. Container will hang and do nothing on its start.

Now we need to trigger the generation process. Look at the code block below.

schemaSpy.start();
schemaSpy.execInContainer(
    "java",
    "-jar", "/schemaspy-6.1.0.jar",
    "-t", "pgsql11",
    "-db", POSTGRES.getDatabaseName(),
    "-host", "postgres",
    "-u", POSTGRES.getUsername(),
    "-p", POSTGRES.getPassword(),
    "-o", "/output",
    "-dp", "/drivers_inc",
    "-debug"
);
schemaSpy.execInContainer("tar", "-czvf", "/output/output.tar.gz", "/output");

Here is what happens:

I start the container (remember that it hangs and does nothing).
Then I execute the command that generates the documentation. The host equals to postgres because that’s the PostgreSQL container’s network alias (the withNetworkAliases method). The command execution happens in a separate process inside the container. So, the sleep command is not terminated.
Finally, we put the directory with generated contents (HTML, CSS, JS) into a tarball. Testcontainers library allows to copy files from a container to the OS but not directories. That’s why we need an archive inside the SchemaSpy container.

It’s time to copy the result documentation in the OS directory and unpack the changes. Look at the code snippet below.

final var buildFolderPath =
    Path.of(getClass().getResource("/").toURI()).toAbsolutePath();

schemaSpy.copyFileFromContainer(
    "/output/output.tar.gz",
    buildFolderPath.resolve("output.tar.gz").toString()
);
schemaSpy.stop();

final var archiver = ArchiverFactory.createArchiver("tar", "gz");
archiver.extract(
    buildFolderPath.resolve("output.tar.gz").toFile(),
    buildFolderPath.toFile()
);

The steps are:

Define the buildFolderPath that points to the build/classes/java/test directory (I use Gradle in this project).
Then I copy the tarball with documentation from the container to the buildFolderPath directory.
And finally, I unpack the archive contents to the same directory (I use jararchivelib library here).

In the end, we have a pretty documentation generated on the database schema structure. Look at the screenshot below.

Hosting result on GitHub Pages

We got generated documentation based in build/classes/java/test folder. Anyway, it's not that useful yet. We need to host it on GitHub Pages and update accordingly.

Look at the pipeline definition below.

name: Java CI with Gradle

on:
  push:
    branches: [ "master" ]

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build-and-deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up JDK 17
        uses: actions/setup-java@v3
        with:
          java-version: '17'
          distribution: 'temurin'
      - name: Build with Gradle
        run: ./gradlew build
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          path: build/classes/java/test/
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v1

The actions are:

Build the project and run tests with Set up JDK 17 and Build with Gradle steps.
Then the Upload artifact step uploads the build/classes/java/test/ directory folder to GitHub registry (it contains the documentation).
And finally, deploy the artefact created on the previous step to the GitHub Pages.

That's basically it. Now the SchemaSpy documentation is available on GitHub Pages and being updated automatically on each merged Pull Request.

Conclusion

That’s all I wanted to tell you about generating database schema documentation and hosting the result on GitHub Pages. Do you have any docs generation automations in your project? Is it SchemaSpy or something else? Do you host the result on GitHub or GitLab Pages? Tell your story in the comments. It’ll be interesting to discuss your experience as well.

Thanks for reading! If you like that piece, press the like button and share the link with your friends.