Originally published on Shippable
Shippable was founded almost 5 years ago. What started as simple CI for Docker has evolved into a full-blown DevOps automation platform that supports a plethora of tools and languages, popular third party tools, and very soon, multiple Operating systems. Today we deploy 50+ micro services, along with the core API and UI. We have also grown considerably in terms of employees and we have a mix of folks who have built massive web scale apps and also who are relative new to enterprise grade applications.
A couple of years ago, we moved our code base to a monorepo, which helped us scale tremendously in terms of code reuse and overall speed of development. We are extremely proud of our ability to run a resilient service that has 99.99% availability with zero downtime upgrades.
It started with small problems...
Even though we had the ability to add features at a lightening pace, we started seeing occasional downtimes which always seemed to come down to MongoDB. For instance:
We were very happy to have 24x7 availability with primary and secondary instances of MongoDB. However, our perf suddenly deteriorated one day and retrieval started taking more than a second per document. We tried using many tools and profilers, but could not figure out what was happening. Finally, we rebuilt a new server, switched that over as primary, and rebuilt our secondary. Retrieval times dropped to 150ms again. This is still an unsolved mystery!
Our Mongo instance reached 4TB and we were proud of our growing adoption. Due to the lack of tooling around managing large DBs, we relied on indexes to keep the search times low. When NoSQL DBs first became popular, there was no way to create uniqueness, so these features were built as an afterthought. Some of the bloating of our MongoDB was actually due to indexes, but rebuilding them was primitive and the entire DB would lock down.
At one point, we needed to reboot our DB server and it took MongoDB 4 hours to come back online. This led to an extended downtime for our service, and we had very little visibility into the MongoDB process or status.
And then came the knockout punch!
The biggest advantage, and disadvantage, of MongoDB is that it has a flexible schema. This means that documents in the same collection (aka table in the old world) do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data. In a nutshell, there are no strict schema rules and this opens it up to a lot of cowboy tinkering.
While many developers love the flexibility, it also puts a very high degree of responsibility on their shoulders to get things right.
For example, let's consider a simple schema that stores information about a Git repository:
field name added on
As you can guess, the schema is updated over a period of time as fields are added to meet new requirements of an evolving product. This means that depending on when a repository was added to this document, it might or might not have the isPrivate and hasTeams fields. Our backend and frontend services needed to handle both cases gracefully, which led to code iike this:
Every single place where repo.hasTeams is used, we needed to add this code. With many microservices, many schemas and 40+ developers adding these blocks all over the place, our codebase was starting to look ugly. Also, every time we saw failures across our system, it was always a spectacular crash with no easy way to recover. I would wager that 90% of our crashes were due to the fact that some piece of code expected a field that didn't exist for that document. Internal brainstorming brought up the idea of building a schema validator and all sorts of hacks, but isn't this what a Database should provide out of the box? One big black mark against Mongo (or any equivalent NoSQL database)!
The straw that broke the camel's back was when we introduced a critical field that absolutely needed to be present for each document in our most important collection. To ensure that every document included the field, we had to retrive every single document one by one, update it, and then put it back. With millions of documents in the collection, this process caused the database performance to degrade to an unacceptable degree and we had to accept 4+ hours of downtime.
And that's when I decided that NoSql wasn't going to work for us. To each his own, and we were done struggling with it and causing our customers (and ourselves) unnecessary heartache.
After this last incident which happened about a year ago, we migrated to PostgreSQL. I can explain the step by step process of migration in another blog, if you're interested. We have no regrets and the following factors have greatly improved our availability and resiliency:
Postgres has a strongly typed schema that leaves very little room for errors. You first create the schema for a table and then add rows to the table. You can also define relationships between different tables with rules so that you can store related data across several tables and avoid data duplication. All this means someone on the team can act as a 'Database architect' and control the schema which acts as a standard for everyone else to follow.
You can change tables in PostgreSQL without requiring to lock it for every operation. For example, you can add a column and set a default value quickly without locking the entire table. This ensures that every row in a table has the column and your codebase remains clean without needing to check if the column exists at every stage. It is also much quicker to update every row since Postgres doesn't need to retrieve each row, update, and put it back.
Postgres also supports JSONB, which lets you create unstructured data, but with data constraint and validation functions to help ensure that JSON documents are more meaningful. The folks at Sisense have written a great blog with a detailed comparison of Postgres vs MongoDB for JSON documents.
Our database size reduced by 10x since Postgres stores information more efficiently and data isn't unnecessarily duplicated across tables.
As was shown in previous studies, we found that Postgres performed much better for indexes and joins and our service became faster and snappier as a result.
We have been very happy with Postgres since we migrated and we are no longer struggling with managing our database. As a result, we have seen our NPS go up significantly as customers are happier with a platform that "always works"..
Hurrah to 99.99% availability!