FerretDB + YugabyteDB on Kubernetes (Amazon EKS): a MongoDB API to Distributed SQL, at scale

#mongodb #yugabytedb #postgres #eks

To prepare a demo for VoxxedDays Ticino I have created a Kubernetes cluster on Amazon EKS with YugabyteDB, Open Source Distributed SQL, installed as in this post.

YugabyteDB has a Cassandra-compatible API (called YCQL) and a PostgreSQL-compatible one (called YSQL). What if you have an application written for MongoDB that you want to move to a SQL database? You probably want to rewrite it to benefit from SQL features but, rather than re-writing all at the same time, you can add a MongoDB-compatible API on top of PostgreSQL thanks to FerretDB.

Let's add that to my Kubernetes cluster. This is a test. FerretDB is still in early stage. But maybe this combination of a MongoDB proxy and a Distributed SQL database can be an open source alternative to MongoDB or DocumentDB in the future.

Run FerretDB connected to YugabyteDB

To test this possibility I'll start a single pod using the ferretdb/ferretdb image whish runs ferretdb. I set its endpoint with --listen-addr and the PostgreSQL connection to YugabyteDB with --postgresql-url:

$ kubectl run --image ferretdb/ferretdb ferretdb -- \
 --listen-addr=0.0.0.0:27017 \
 --postgresql-url="postgres://yugabyte:yugabyte@yb-tservers.yb-demo-eu-west-1a.svc.cluster.local:5433/yugabyte"

pod/ferretdb created

It connects to the ClusterIP service yb-tservers that distributes the connections to all YugabyteDB Servers in one AZ eu-west-1a.

I'm testing this from a mongosh shell:


$ kubectl get pods -o wide

NAME       READY   STATUS    RESTARTS   AGE   IP              NODE                                           NOMINATED NODE   READINESS GATES
ferretdb   1/1     Running   0          11m   192.168.50.78   ip-192-168-48-228.eu-west-1.compute.internal   <none>           <none>

$ kubectl run -it --rm --restart=Never --image mongo mongosh -- \
  mongosh mongodb://192.168.50.78:27017/mongodb

I can create a collection

mongodb> db.createCollection('demo');
{ ok: 1 }

The corresponding table is created on YugabyteDB:

It is automatically distributed (sharded and replicated):

I got the IP address from the pod, here. I'll create a service.

Run a Load Balancer for the FerretDB pods

Going though a proxy adds some latency. Then you probably need want to have more pods. I'll create a ReplicaSet for it later but for the moment let's create a LoadBalancer to valudate this idea:

$ kubectl expose pod ferretdb --port=27017 --target-port=27017 \
        --name=ferretdb-service --type=LoadBalancer

service/ferretdb-service exposed

I can test the connection through this service:

$ kubectl run -it --rm --restart=Never --image mongo mongosh -- \
 mongosh mongodb://ferretdb-service.default.svc.cluster.local:27017/mongodb --quiet

If you don't see a command prompt, try pressing enter.
mongodb> show collections;
demo
mongodb>

I can now use the FerretDB endpoint from the internet, getting the EXTERNAL-IP address:

$ kubectl get services --field-selector "metadata.name=ferretdb-service" -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'

abebf50dee88a462ab90d371dfdaa44f-1345053103.eu-west-1.elb.amazonaws.com

I can connect with mongosh from my laptop:

$ mongosh mongodb://abebf50dee88a462ab90d371dfdaa44f-1345053103.eu-west-1.elb.amazonaws.com:27017/mongodb --quiet

mongodb> db.demo.insertOne({ "message": "Hello World" });
{
  acknowledged: true,
  insertedId: ObjectId("63c85f7aa3db61ccd4324a04")
}
mongodb> db.demo.findOne();
{ _id: ObjectId("63c85f7aa3db61ccd4324a04"), message: 'Hello World' }
mongodb>

Deployement as ReplicaSet

Now that I have tested by manually creating a pod and service, I'll automate that. I remove those created above:

$ kubectl delete service ferretdb-service
service "ferretdb-service" deleted

$ kubectl delete pod ferretdb 
pod "ferretdb" deleted

Here is the YAML declaration for a ReplicaSet of FerretDB:

cat > ferretdb-rs.yaml <<'YAML'
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: ferretdb
  namespace: yb-demo-eu-west-1a
  labels:
    run: ferretdb
spec:
  replicas: 3
  selector:
    matchLabels:
      run: ferretdb
  template:
    metadata:
      labels:
        run: ferretdb
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - eu-west-1a
      containers:
      - name: ferretdb
        image: ferretdb/ferretdb
        args:
         - --listen-addr=0.0.0.0:27017
         - --postgresql-url=postgres://yugabyte:yugabyte@yb-tservers.yb-demo-eu-west-1a.svc.cluster.local:5433/yugabyte
YAML
kubectl apply -f ferretdb-rs.yaml

And the LoadBalancer:

cat > ferretdb-lb.yaml <<'YAML'
apiVersion: v1
kind: Service
metadata:
  name: ferretdb-service
  namespace: yb-demo-eu-west-1a
  labels:
    run: ferretdb
spec:
  type: LoadBalancer
  selector:
   run: ferretdb
  ports:
    - protocol: TCP
      port: 27017
      targetPort: 27017
YAML
kubectl apply -f ferretdb-lb.yaml

I have created those in one AZ (eu-west-1a), connecting to the YugabyteDB Tablet Servers in the same AZ.

Connect with Mongo or Postgres clients

I get the mongodb endpoint address:

$ kubectl get services \
-o jsonpath='mongodb://{.items[0].status.loadBalancer.ingress[0].hostname}:{.items[0].spec.ports[0].port}/mongodb' \
--field-selector "metadata.name=ferretdb-service" \
-n yb-demo-eu-west-1a

mongodb://a4ee051c7cf844538b1bc90fbd3ef56a-565835635.eu-west-1.elb.amazonaws.com:27017/mongodb

Using this, I load some sample data:

$ curl -s https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.json |
  jq -c ".[]" |
  while read json
do
 echo "db.covid.insertOne( $json )" 
done | mongosh mongodb://a4ee051c7cf844538b1bc90fbd3ef56a-565835635.eu-west-1.elb.amazonaws.com:27017/mongodb

Now getting the PostgreSQL db address:

$ kubectl get services \
-o jsonpath='postgres://{.items[0].status.loadBalancer.ingress[0].hostname}:{.items[0].spec.ports[?(@.name=="tcp-ysql-port")].port}/yugabyte' \
--field-selector "metadata.name=yb-tserver-service" \
-n yb-demo-eu-west-1a

postgres://a4ee051c7cf844538b1bc90fbd3ef56a-565835635.eu-west-1.elb.amazonaws.com:5433/yugabyte

I can connect with psql:

]$ psql postgres://a62cf0d1ce2c94b6a8fb7d169da3b75b-2104684727.eu-west-1.elb.amazonaws.com:5433
Pager usage is off.
psql (13.7, server 11.2-YB-2.17.0.0-b0)
Type "help" for help.

yugabyte=# select "_jsonb"->>'_id' as collection, "_jsonb"->>'table' as tablename 
           from mongodb."_ferretdb_database_metadata";

 collection |   tablename
------------+----------------
 demo       | demo_aefd3536
 covid      | covid_e0d7cdae
(2 rows)

yugabyte=# select count(*) from mongodb.covid_e0d7cdae;
 count
-------
   237
(1 row)

yugabyte=# select "_jsonb"->>'location' as location 
           ,("_jsonb"->>'new_cases_per_million')::numeric as new_cases_per_million 
           from mongodb.covid_e0d7cdae 
           where ("_jsonb"->>'last_updated_date')::date > now()- interval '5 days'
           order by 2 desc nulls last limit 10;

  location  | new_cases_per_million
------------+-----------------------
 Brunei     |              5449.864
 Cyprus     |              3742.158
 San Marino |              2374.592
 Italy      |              1423.841
 Uruguay    |              1096.764
 Japan      |              1022.971
 Tonga      |               963.815
 Taiwan     |               835.377
 Mauritius  |               783.391
 Ireland    |               781.986
(10 rows)

There is still a lot to do to get the right performance. FerretDB creates a table with a single JSONB column with no primary key and adds an index on the object ID attribute of the JSON document. Then YugabyteDB creates an internal ID to distribute on it. It would be better to have it a the sharding key but that must be out of the JSON document.

There is still work in progress in FerretDB, like Create primary key index for _id automatically #1384
. If a Primary Key is added with with the ID, it will become the sharding key in YugabyteDB. Another optimization will be to avoid reading information_schema.columns which is slow on YugabyteDB (the catalog must be shared by all nodes). This will be optimized on YugabyteDB (#7745). If it is still a scalability issue, there's also the possibility to fork the PostgreSQL handler to optimize it for YugabyteDB. All this is open source 🤩 YugabyteDB and FerretDB are Apache License 2.0