Franck Pachot for YugabyteDB

Posted on Nov 3, 2021 • Edited on Aug 17, 2024

Open-source🍃MongoDB API to 🚀YugabyteDB with 🥭MangoDB proxy

#yugabytedb #mongodb #ferretdb #database

Please note that MangoDB is now called FerretDB and has moved to:

There is no doubt that one great thing about MongoDB is the API. Many developers love it. I'm a big fan of SQL, but we need to listen to all users, and they have use-cases for it. Another thing we expect from NoSQL is scalability.

In this example we get both, in a stack that is open source, ACID and resilient, with:

MangoDB proxy between MongoDB and PostgreSQL protocols
YugabyteDB with its PostgreSQL compatible API on top of the fully consistent distributed storage

This MangoDB project is new, and when looking for it you will see Google still trying to tell you that you made a typo. So this post is a first quick test to validate how it works. Things will probably change with contributions to https://github.com/MangoDB-io/MangoDB

MangoDB example

I'll take the demo application from:

git clone https://github.com/MangoDB-io/example
cd example

and install it on my laptop because I recently became a big fan of Docker on Windows.

The docker-compose.yml starts a PostgreSQL database and I'll replace that with a YugabyteDB one. This is easy.

postgres service

First, I remove the postgres service and add the yb-master and yb-tserver ones from the YugabyteDB docker-compose.yml.

I didn't change any parameters, and I keep the defaults:

host name is yb-tserver (in a distributed database you can connect to any server)
port is 5433 (this is our default, rather than the 5432 default for PostgreSQL)
user is yugabyte (and password is the same)
database is yugabyte (you can create a dedicated one of course)
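These defaults combine into a standard PostgreSQL connection URL. Here is a minimal sketch in Python, using only the standard library. The helper name `ysql_url` is hypothetical, and the values are simply the defaults listed above:

```python
from urllib.parse import quote


def ysql_url(host="yb-tserver", port=5433, user="yugabyte",
             password="yugabyte", database="yugabyte"):
    """Build a PostgreSQL-style connection URL from the defaults above.

    Hypothetical helper for illustration; it is not part of any driver.
    """
    # quote() percent-encodes characters that are not safe inside a URL part
    return (f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{quote(database, safe='')}")


# Inside the docker-compose network the host is the service name
print(ysql_url())
# From the laptop, the exposed port is reached through localhost
print(ysql_url(host="localhost"))
```

Only the host changes between the two calls: the containers reach the database by service name, while a client on the laptop goes through the published port.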

setup service

I apply these settings in the setup service command, which runs only to create the schema for the application (a "todo" list in this example):

psql -h yb-tserver -p 5433 -U yugabyte -d yugabyte -c 'CREATE SCHEMA IF NOT EXISTS todo'

and the docker-compose dependency is set to yb-tserver instead of postgres.

mangodb service

The MangoDB proxy also needs the connection string. It uses the PostgreSQL driver, as it is the same protocol, and I change the host, port and database name only:

    command:
      [
        '-listen-addr=:27017',
        '-postgresql-url=postgres://yugabyte@yb-tserver:5433/yugabyte',
      ]

Start the application

Here is my final docker-compose.yml:

version: "3"

volumes:
  yb-master-data-1:
  yb-tserver-data-1:

services:
  client:
    build: ./app/client
    hostname: 'todo_client'
    container_name: 'todo_client'
    stdin_open: true
  api:
    build: ./app/api
    hostname: 'todo_api'
    container_name: 'todo_api'
  nginx:
    image: nginx
    hostname: 'nginx'
    container_name: 'nginx'
    ports:
      - 8888:8888
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
  yb-master:
      image: yugabytedb/yugabyte:latest
      container_name: yb-master-n1
      volumes:
      - yb-master-data-1:/mnt/master
      command: [ "/home/yugabyte/bin/yb-master",
                "--fs_data_dirs=/mnt/master",
                "--master_addresses=yb-master-n1:7100",
                "--rpc_bind_addresses=yb-master-n1:7100",
                "--replication_factor=1"]
      ports:
      - "7000:7000"
      environment:
        SERVICE_7000_NAME: yb-master
  yb-tserver:
      image: yugabytedb/yugabyte:latest
      container_name: yb-tserver-n1
      volumes:
      - yb-tserver-data-1:/mnt/tserver
      command: [ "/home/yugabyte/bin/yb-tserver",
                "--fs_data_dirs=/mnt/tserver",
                "--start_pgsql_proxy",
                "--rpc_bind_addresses=yb-tserver-n1:9100",
                "--tserver_master_addrs=yb-master-n1:7100"]
      ports:
      - "9042:9042"
      - "5433:5433"
      - "9000:9000"
      environment:
        SERVICE_5433_NAME: ysql
        SERVICE_9042_NAME: ycql
        SERVICE_6379_NAME: yedis
        SERVICE_9000_NAME: yb-tserver
      depends_on:
      - yb-master
  mangodb:
    image: ghcr.io/mangodb-io/mangodb:latest
    hostname: 'mangodb'
    container_name: 'mangodb'
    command:
      [
        '-listen-addr=:27017',
        '-postgresql-url=postgres://yugabyte@yb-tserver:5433/yugabyte',
      ]
    ports:
      - 27017:27017
  setup:
    image: postgres:14.0
    hostname: 'setup'
    container_name: 'setup'
    restart: 'on-failure'
    depends_on:
      - 'yb-tserver'
    entrypoint: ["sh", "-c", "psql -h yb-tserver -p 5433 -U yugabyte -d yugabyte -c 'CREATE SCHEMA IF NOT EXISTS todo'"]

I pull the images, create the containers and run the services:

docker-compose up

Here is the start from command line:

Visible in Docker Desktop:

The YugabyteDB console is available on: http://localhost:7000

The example application:

The application is accessible on http://localhost:8888/ and we can add items to the To-Do list:

This calls the db.collection.insertOne() MongoDB function:

Check the database

This MongoDB call is translated to SQL by the MangoDB proxy. The collection is a table:

$ psql postgres://yugabyte:yugabyte@localhost:5433/yugabyte
psql (12.7, server 11.2-YB-2.9.0.0-b0)
Type "help" for help.

yugabyte=# \dn

  List of schemas
  Name  |  Owner
--------+----------
 public | postgres
 todo   | yugabyte
(2 rows)

yugabyte=# set schema 'todo';
SET

yugabyte=# \d
         List of relations
 Schema | Name  | Type  |  Owner
--------+-------+-------+----------
 todo   | tasks | table | yugabyte
(1 row)

yugabyte=# \d+ todo.tasks
                                   Table "todo.tasks"
 Column | Type  | Collation | Nullable | Default | Storage  | Stats target | Description
--------+-------+-----------+----------+---------+----------+--------------+-------------
 _jsonb | jsonb |           |          |         | extended |              |

yugabyte=# select * from todo.tasks;
                                                                    _jsonb
-----------------------------------------------------------------------------------------------------------------------------------------------
 {"$k": ["description", "completed", "_id"], "_id": {"$o": "6182627a17462641b80439d4"}, "completed": false, "description": "Play 😎"}
 {"$k": ["description", "completed", "_id"], "_id": {"$o": "6182627017462641b80439d3"}, "completed": false, "description": "Start MangoDB"}
 {"$k": ["description", "completed", "_id"], "_id": {"$o": "6182626817462641b80439d2"}, "completed": false, "description": "Start YugabyteDB"}
(3 rows)

yugabyte=#

The storage is very simple: one table with one JSONB column.
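To make the layout of the stored rows easier to read, here is a small Python sketch that rebuilds it from a plain document. This is my reading of the output above, not MangoDB's actual code: the function name `to_stored_layout` is hypothetical, I assume that "$k" records the original field order (PostgreSQL jsonb does not preserve key order), and I treat every `_id` as an ObjectId stored as {"$o": hex string}.

```python
import json


def to_stored_layout(doc):
    """Wrap a MongoDB-style document in the layout seen in the _jsonb column.

    Hypothetical helper, inferred from the rows shown above.
    """
    # "$k" keeps the field order, which jsonb itself would not preserve
    stored = {"$k": list(doc.keys())}
    for key, value in doc.items():
        if key == "_id":
            # Assumption: the ObjectId is stored as its hex string under "$o"
            stored[key] = {"$o": value}
        else:
            stored[key] = value
    return stored


task = {
    "description": "Start YugabyteDB",
    "completed": False,
    "_id": "6182626817462641b80439d2",
}
print(json.dumps(to_stored_layout(task)))
```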

SQL Statements

I'll track the statements used with the pg_stat_statements extension which is enabled by default in YugabyteDB. Just resetting in my lab:

yugabyte=# select pg_stat_statements_reset();
 pg_stat_statements_reset
--------------------------

(1 row)

In the application I refresh, mark "Play" as completed, insert a new task, delete it, and refresh multiple times.

yugabyte=# select calls,query from pg_stat_statements;
 calls |                                                        query
-------+----------------------------------------------------------------------------------------------------------------------
     1 | select query from pg_stat_statements
     1 | INSERT INTO "todo"."tasks" (_jsonb) VALUES ($1)
     1 | DELETE FROM "todo"."tasks" WHERE _jsonb->$1 = $2
     1 | SELECT _jsonb FROM "todo"."tasks" WHERE _jsonb->$1 = $2
     1 | UPDATE "todo"."tasks" SET _jsonb = $1 WHERE _jsonb->'_id' = $2
     7 | SELECT _jsonb FROM "todo"."tasks"
     1 | select pg_stat_statements_reset()
    11 | SELECT COUNT(*) > 0 FROM information_schema.columns WHERE column_name = $1 AND table_schema = $2 AND table_name = $3
    11 | SELECT COUNT(*) > 0 FROM information_schema.tables WHERE table_schema = $1 AND table_name = $2
     1 | select * from pg_stat_statements
(10 rows)

There are many things to optimize here. Reading the information_schema many times is not the most efficient. We need an index on the ID (which is in the JSON document). And updates rewrite the whole document. I'll think about this and probably contribute to this open-source project. The ID should probably be in another column, which we can properly index and shard, rather than scanning the whole table or adding an additional index.
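The captured statements suggest a simple mapping from a one-field MongoDB filter to a parameterized SQL query. The Python sketch below only reproduces the shape of the SELECT seen in pg_stat_statements. The function name `find_by_field_sql` is hypothetical, and how the proxy really encodes its parameters is an assumption on my side:

```python
import json


def find_by_field_sql(schema, table, field, value):
    """Return (sql, params) shaped like the SELECT captured above.

    Hypothetical helper: $1 is the field name and $2 the JSON value to match.
    """
    sql = f'SELECT _jsonb FROM "{schema}"."{table}" WHERE _jsonb->$1 = $2'
    # Assumption: the value to compare is passed as JSON text
    params = [field, json.dumps(value)]
    return sql, params


sql, params = find_by_field_sql(
    "todo", "tasks", "_id", {"$o": "6182626817462641b80439d2"}
)
print(sql)
print(params)
```

The predicate applies the -> operator to the whole document with the field name as a parameter, which is why no index can help until one is created on the extracted value.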

JSONB indexing

As YugabyteDB plugs the distributed storage into a full PostgreSQL query layer, we can even index this. Here is the table with just one JSONB column (it was created with CREATE TABLE "todo"."tasks" (_jsonb jsonb);):

yugabyte=# select * from todo.tasks;
                                                              _jsonb
----------------------------------------------------------------------------------------------------------------------------------
 {"$k": ["description", "completed", "_id"], "_id": {"$o": "618282aea9a2a141efa3c401"}, "completed": false, "description": "Franck"}
(1 row)

If I select one key, it has to scan the whole table:

yugabyte=# explain analyze select * from todo.tasks where _jsonb->>'_id' = '{"$o": "618282aea9a2a141efa3c401"}';
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on tasks  (cost=0.00..105.00 rows=1000 width=32) (actual time=0.815..0.817 rows=1 loops=1)
   Filter: ((_jsonb ->> '_id'::text) = '{"$o": "618282aea9a2a141efa3c401"}'::text)
 Planning Time: 0.048 ms
 Execution Time: 0.874 ms
(4 rows)

But I can create a unique index on it:

yugabyte=# create unique index task_pk ON todo.tasks
           ((_jsonb ->> '_id'::text) hash);
CREATE INDEX

Do not forget the double parentheses (this is the PostgreSQL syntax):

one for the list of columns to index,
and one because it is not directly a column but a value derived from the JSON document.

The HASH modifier is optional here because hash sharding is the default on the first column. And this is what we want on this generated identifier. But if you have range scans, you could change it to ASC or DESC.
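To picture why hash sharding suits a generated identifier, here is a toy Python model. It assumes the 16-bit hash space (0x0000 to 0xFFFF) that YugabyteDB splits among tablets. The function name `tablet_for` is hypothetical, and zlib.crc32 is only a stand-in, not the hash function YugabyteDB uses:

```python
import zlib

HASH_SPACE = 0x10000  # 16-bit hash space: 0x0000..0xFFFF


def tablet_for(key, num_tablets):
    """Map a key to a tablet by hashing it into the 16-bit hash space.

    Toy model: zlib.crc32 is a stand-in for the real hash function.
    """
    hash_code = zlib.crc32(key.encode("utf-8")) % HASH_SPACE
    # Each tablet owns one contiguous slice of the hash space
    return hash_code * num_tablets // HASH_SPACE


ids = [
    "6182626817462641b80439d2",
    "6182627017462641b80439d3",
    "6182627a17462641b80439d4",
]
for doc_id in ids:
    print(doc_id, "-> tablet", tablet_for(doc_id, 3))
```

Identifiers generated one after the other are close in sort order, so a range-sharded index would send them all to the same tablet. Hashing them first tends to spread them, at the price of losing efficient range scans on the key.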

Now I have fast access to the document:

yugabyte=# explain analyze select * from todo.tasks where _jsonb->>'_id' = '{"$o": "618282aea9a2a141efa3c401"}';

                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Index Scan using task_pk on tasks  (cost=0.00..4.12 rows=1 width=32) (actual time=13.617..13.622 rows=1 loops=1)
   Index Cond: ((_jsonb ->> '_id'::text) = '{"$o": "618282aea9a2a141efa3c401"}'::text)
 Planning Time: 11.593 ms
 Execution Time: 13.706 ms
(4 rows)

This means that a query with the key will go to the right tablet (the tables and indexes are automatically sharded in YugabyteDB) and to the right row. We are ready to scale out and keep the low latency.

Index Only Indexes

The previous execution plan may require two RPCs on a scale-out database: one to the index and one to the table. This is because, for better agility, all indexes are global in YugabyteDB, with no compromise on strong consistency. An Index Only Scan would be better. It is easy to achieve by creating the index again with the document included, after dropping the previous one since the name is reused (I explained this in How a Distributed SQL Database Boosts Secondary Index Queries with Index Only Scan):

yugabyte=# create unique index task_pk ON todo.tasks
           ((_jsonb ->> '_id'::text) hash) include (_jsonb);
CREATE INDEX

And here is the fastest access you can have to a document on a distributed SQL database, still with the full agility of a JSON document:

yugabyte=# explain analyze select * from todo.tasks where _jsonb->>'_id' = '{"$o": "618282aea9a2a141efa3c401"}';
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Only Scan using task_pk on tasks  (cost=0.00..4.12 rows=1 width=32) (actual time=2.559..2.561 rows=1 loops=1)
   Index Cond: (((_jsonb ->> '_id'::text)) = '{"$o": "618282aea9a2a141efa3c401"}'::text)
   Heap Fetches: 0
 Planning Time: 5.229 ms
 Execution Time: 2.607 ms
(5 rows)
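The difference between the two plans can be pictured with a toy Python model where each dictionary lookup stands for one RPC to a tablet. This is only an illustration under that assumption: the class `ToyCluster` and its methods are hypothetical and say nothing about how YugabyteDB is implemented.

```python
class ToyCluster:
    """Toy model: each dictionary lookup stands for one RPC to a tablet."""

    def __init__(self):
        self.table = {}           # row id -> document
        self.index = {}           # key -> row id (plain secondary index)
        self.covering_index = {}  # key -> document (index with INCLUDE)
        self.rpcs = 0

    def insert(self, row_id, key, doc):
        self.table[row_id] = doc
        self.index[key] = row_id
        self.covering_index[key] = doc

    def index_scan(self, key):
        self.rpcs += 1            # read the index entry
        row_id = self.index[key]
        self.rpcs += 1            # then fetch the row from the table
        return self.table[row_id]

    def index_only_scan(self, key):
        self.rpcs += 1            # the index entry already holds the document
        return self.covering_index[key]


cluster = ToyCluster()
cluster.insert(1, "618282aea9a2a141efa3c401", {"description": "Franck"})

cluster.index_scan("618282aea9a2a141efa3c401")
print("RPCs after Index Scan:", cluster.rpcs)

cluster.rpcs = 0
cluster.index_only_scan("618282aea9a2a141efa3c401")
print("RPCs after Index Only Scan:", cluster.rpcs)
```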

Now everything is in the index and the table does not need to be read at all. In PostgreSQL you have no choice, as you need to maintain the heap table. But YugabyteDB stores tables in LSM trees where rows are organized for fast access on the primary key. When storing documents in a SQL table, it is better to have the identifier in its own column, an integer or uuid, to really have a (key uuid, value jsonb) schema. I'll suggest that to the MangoDB project, as well as some other optimizations for PostgreSQL or YugabyteDB. But the essence is there: a simple MongoDB API on a distributed SQL database.
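As a rough sketch of that key/value idea, the Python function below splits a stored document into the two values such a table would hold. The name `to_key_value` is hypothetical, and I keep the ObjectId hex string as the key for simplicity, where a real schema might use an integer or uuid as suggested above:

```python
import json


def to_key_value(stored):
    """Split a stored document into a (key, value) pair.

    Hypothetical helper: the key is the ObjectId hex string taken from
    {"$o": ...} and the value is the whole document serialized as JSON.
    """
    key = stored["_id"]["$o"]
    # ensure_ascii=False keeps non-ASCII text (emoji, accents) readable
    value = json.dumps(stored, ensure_ascii=False)
    return key, value


stored = {
    "$k": ["description", "completed", "_id"],
    "_id": {"$o": "618282aea9a2a141efa3c401"},
    "completed": False,
    "description": "Franck",
}
key, value = to_key_value(stored)
print(key)
print(value)
```

With the key in its own column, it can be the primary key of the table, and the document is then found directly in the table without any secondary index.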
