Nebiyu Elias Talefe

Scaling Databases in the Tech World: Best Practices Unveiled

Introduction

To build a successful digital product, you need to store lots and lots of data. Databases have become significantly better over the years at storing massive amounts of data, but as your product grows in popularity, so does the workload on your database. A poorly performing database can become a bottleneck for your whole application, so you need to start thinking about strategies to deal with this growing demand.

In this article, we will look at how popular platforms we use daily scale their databases to meet ever-increasing demand. As users of these platforms, we take things for granted and rarely think about what their engineers have to do to keep the service available. It’s no easy feat and requires a lot of ingenuity.

Database Scaling Strategies

Whenever I think about scaling systems, I like to use an analogy. Think of the road network in big cities around the world. As more and more people move into big cities to find better jobs and opportunities, the roads become heavily crowded. You need to scale the transportation system to deal with this growing demand.

One strategy is to build more roads, but that is costly. Another is to widen the existing roads by adding lanes. This approach is a lot cheaper, but you can’t keep adding lanes to an existing road network. You will be limited.

There are different database scaling strategies, each with its advantages and disadvantages. The approach you choose will depend on your current situation; there is no one-size-fits-all solution. Here are the popular strategies used to scale a database:

Vertical Scaling (Scaling Up)

Vertical scaling involves increasing the resources (CPU, RAM, storage) of a single server to handle growing demand. This is probably the easiest approach when you first encounter performance issues with your database. But you can’t simply keep adding more CPU or RAM to a single server; there is a limit. It’s also worth mentioning that your expenses will climb as you keep adding resources.

In the case of the road scaling example mentioned earlier, this would be equivalent to making already existing roads wider by extension.

Replication

To do this, you make replicas (duplicates) of your database. Instead of a single database handling all reads and writes, you have one primary database and multiple read replicas. Writes always go to the primary first and are then propagated to all the read replicas. Any of the read replicas can serve read queries.

This helps scale applications where users read far more data than they write, which is the common pattern. There are different approaches to implementing replication (a code sketch of the read/write split follows the list):

  • Asynchronous replication: Whenever there is an update request from clients, the main database does not wait until the data is replicated on all the other read replicas. It simply acknowledges to the client that the update was successful and another process would do the replication in the background later.
  • Synchronous replication: Whenever there is an update request from clients, the main database waits until the data is replicated on all the other read replicas. The update will not be complete until the replication is done.
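
To make the primary/replica split concrete, here is a minimal sketch of read/write routing in Python. The connection objects and their `execute` method are hypothetical placeholders, not the API of any particular library:

```python
import random

class ReplicatedDatabase:
    def __init__(self, primary, replicas):
        self.primary = primary    # all writes go here first
        self.replicas = replicas  # read-only copies of the primary

    def write(self, sql, params=()):
        # Writes always hit the primary. Whether this call blocks until
        # every replica confirms (synchronous replication) or returns
        # immediately (asynchronous) is exactly the trade-off above.
        return self.primary.execute(sql, params)

    def read(self, sql, params=()):
        # Any replica can serve a read, spreading the query load.
        return random.choice(self.replicas).execute(sql, params)
```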

Sharding

Imagine a gigantic table with hundreds of millions of rows. Working with such a table becomes increasingly difficult. With sharding, you partition the table rows into smaller chunks, called logical shards. Sharding is a horizontal scaling technique (often called scaling out) that allows you to distribute your data across multiple database nodes. The following diagram illustrates how one big table can be partitioned into smaller shards (chunks).

[Diagram: one large table partitioned into smaller shards]

Sharding Techniques

There are different ways of partitioning your data horizontally. Let’s look at some common techniques (a combined code sketch follows the list):

  • Key-based partitioning: For this, you need a hash function: a function that maps each input to an output and always produces the same output for the same input. With key-based partitioning, you feed one of your table’s columns (something like a user ID) to the hash function. This column is usually called the shard key. The hash function’s output decides which partition each row goes to.

[Diagram: key-based partitioning with a hash function]

  • Range-based sharding: For this, you select a column you can use to categorize your data. For example, if you have a table that stores items in a supermarket, you can use the price column and create ranges such as under $50, $50 to under $100, and $100 and above. Your data is then partitioned based on the value of the price column: items priced under $50 go to one shard, items priced from $50 up to $100 go to another shard, and so forth.

[Diagram: range-based sharding on the price column]

  • Lookup-table-based sharding: This technique is also called directory-based sharding. You keep a directory (lookup table) that records which shard each shard key lives on. When a new shard key arrives, you can assign it to any shard, but you have to remember that assignment for future queries, which is exactly what the lookup table is for. It’s like a phone directory.

[Diagram: lookup-table-based (directory-based) sharding]
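
Here is the combined sketch referenced above: toy Python versions of all three techniques, assuming four shards. The function and variable names are illustrative, not taken from any library:

```python
import hashlib

NUM_SHARDS = 4

def key_based_shard(shard_key: str) -> int:
    """Key-based: hash the shard key, then map the digest to a shard."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS  # same key -> same shard, always

# Range-based: price boundaries decide the shard.
PRICE_RANGES = [(0, 50), (50, 100), (100, float("inf"))]

def range_based_shard(price: float) -> int:
    for shard, (low, high) in enumerate(PRICE_RANGES):
        if low <= price < high:
            return shard
    raise ValueError(f"no shard covers price {price}")

# Lookup-table-based: an explicit directory remembers each key's shard.
directory: dict[str, int] = {}

def lookup_based_shard(shard_key: str) -> int:
    if shard_key not in directory:
        # Assign once (round-robin here); the directory remembers it.
        directory[shard_key] = len(directory) % NUM_SHARDS
    return directory[shard_key]

print(key_based_shard("user_42"))     # deterministic for the same key
print(range_based_shard(75.0))        # lands in the $50-$100 shard
print(lookup_based_shard("user_42"))  # first call assigns, later calls reuse
```

Note that with the naive modulo in key_based_shard, changing NUM_SHARDS remaps almost every key; real systems often use consistent hashing to avoid that.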

How Did Popular Platforms Handle Scalability Issues?

Now let’s look at how engineers use their ingenuity to solve database performance problems.

Notion

Notion, a popular note-taking app, hit 1 million users in 2019. That’s a significant milestone for any startup. But people started complaining on X (formerly Twitter) about the app’s slow performance.

[Screenshot: a user complaining on X about Notion’s slow performance]

Before hitting 1 million users, Notion was running on a PostgreSQL monolith. Notion’s engineers decided to shard that monolith into multiple distributed database nodes to improve the app’s performance.

The breaking point came when a PostgreSQL process called VACUUM started to stall. In PostgreSQL, an update does not simply overwrite the old record; the old version is kept on disk, because it’s not safe to modify data in place while other transactions may still be reading it. At a later point, the VACUUM process deletes the old, outdated versions and reclaims the disk space. In Notion’s case, this process became extremely slow. That’s when they decided to shard their database.
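
To illustrate the idea, here is a toy model in Python of multi-version rows and a vacuum pass. This is purely conceptual, not PostgreSQL’s actual implementation:

```python
# Each row keeps a list of (value, is_dead) versions.
rows: dict[int, list[tuple[str, bool]]] = {}

def update(row_id: int, new_value: str) -> None:
    """Append a new version instead of overwriting the old one,
    since a concurrent reader may still need the old version."""
    versions = rows.setdefault(row_id, [])
    if versions:
        value, _ = versions[-1]
        versions[-1] = (value, True)  # old version is now dead, but still stored
    versions.append((new_value, False))

def vacuum() -> int:
    """Drop dead versions nobody can see anymore; return how many."""
    reclaimed = 0
    for row_id, versions in rows.items():
        live = [v for v in versions if not v[1]]
        reclaimed += len(versions) - len(live)
        rows[row_id] = live
    return reclaimed

update(1, "draft")
update(1, "published")  # "draft" becomes a dead version
print(vacuum())         # -> 1 dead version reclaimed
```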

After deciding to shard, they had to move the data from their old database to the new distributed database nodes. As you can imagine, this is not easy, because you don’t want to shut down the app while users are on it. Here is what they did to migrate the database while the app kept running (a code sketch of the double-write phase follows the list):
1. Double-write: Incoming writes are applied to both the old and new databases.
2. Backfill: Migrate the old data to the new database.
3. Verification: Ensure the integrity of data in the new database.
4. Switch-over: Actually switch to the new database. This can be done incrementally, e.g. double-reads, then migrate all reads.
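
Here is a minimal sketch of the double-write and verification steps. The `old_db` and `new_db` connection objects and their `execute`/`query` methods are hypothetical, not Notion’s actual code:

```python
INSERT_SQL = "INSERT INTO blocks (id, body) VALUES (%s, %s)"
SELECT_SQL = "SELECT body FROM blocks WHERE id = %s"

def write_block(old_db, new_db, block_id: str, body: str) -> None:
    """Step 1 (double-write): apply every incoming write to both
    databases so the new cluster stays in sync during the backfill."""
    old_db.execute(INSERT_SQL, (block_id, body))  # old DB stays the source of truth
    new_db.execute(INSERT_SQL, (block_id, body))

def verify_block(old_db, new_db, block_id: str) -> bool:
    """Step 3 (verification): spot-check that a row matches across
    the old and new databases before switching reads over."""
    return old_db.query(SELECT_SQL, (block_id,)) == new_db.query(SELECT_SQL, (block_id,))
```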

As a result of this work, people started noticing the app’s improved performance. It’s always nice to see people appreciate your hard work.

[Screenshot: a user praising Notion’s improved performance on X]

Quora

Quora is a social platform where users can post and answer questions on anything. The website receives more than 600 million visits per month.

Quora uses MySQL for its database needs and initially relied on read replicas to handle the large volume of requests it was getting. But as the site became more popular, that was no longer enough. The data stored in MySQL is on the order of tens of terabytes, not counting replicas, and queries per second (QPS) are on the order of hundreds of thousands.

Here are some of the ways the Quora engineers tried to handle the increasing demand:

Vertical Sharding

They partitioned their database by moving some tables to different MySQL hosts. If a certain table was getting very large or receiving lots of traffic, they created a new partition for that table. Each partition consists of a master node and replica nodes.

The mapping from a partition to the list of tables in that partition is stored in ZooKeeper. Vertical sharding enabled them to scale out, rather than always needing to scale up as their data size and QPS increased. But the approach has some disadvantages:

  • When moving large tables from one partition to another, there is replication lag on the replicas of the destination partition.
  • It is impossible to join tables that live in different partitions. (Joins inside MySQL were strongly discouraged in Quora’s codebase anyway.)
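
To make the partition-to-tables mapping concrete, here is a minimal sketch of vertical-shard routing. The in-memory dictionary stands in for the mapping Quora keeps in ZooKeeper, and the host and table names are illustrative:

```python
PARTITION_TABLES = {
    "mysql-host-1": ["users", "follows"],
    "mysql-host-2": ["questions", "answers"],  # hot tables get their own host
}

# Invert the mapping so a query can find the host that owns a table.
TABLE_TO_HOST = {
    table: host
    for host, tables in PARTITION_TABLES.items()
    for table in tables
}

def host_for(table: str) -> str:
    """Route a query against `table` to the partition that owns it."""
    return TABLE_TO_HOST[table]

print(host_for("answers"))  # -> mysql-host-2
```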

Horizontal Sharding

Vertical sharding was a good first step to scale the usage of MySQL, but individual tables can still grow very large. The engineers also decided to use horizontal sharding.

Horizontal sharding involves splitting a logical table into multiple physical tables, which are then referred to as shards of the table.
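
In other words, one logical table (say, questions) becomes several physical tables. A tiny sketch of one common naming scheme, with illustrative names:

```python
NUM_SHARDS = 4

def physical_table(logical_table: str, shard_key: int) -> str:
    """Map a logical table plus a shard key to its physical shard table."""
    return f"{logical_table}_{shard_key % NUM_SHARDS}"

print(physical_table("questions", 1234))  # -> questions_2
```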

When implementing sharding, engineers at Quora had to make quite a few decisions:

  • Build vs. Buy: Quora decided to build an in-house solution rather than use a third-party MySQL sharding solution (Vitess for example).
  • Range-based sharding vs. hash-based sharding: Quora makes frequent use of range queries, so they decided to use range-based sharding; hash-based sharding performs poorly for range queries.

GitHub

Until recently, GitHub was built around one MySQL database cluster that housed a large portion of the data used for repositories, issues, pull requests, user profiles, etc. This MySQL cluster was called mysql1.

In 2019, GitHub set up a plan to improve their ability to partition their relational databases. Here are the steps they used to achieve this:

Virtual Partitions

Before database tables are physically partitioned, they need to be virtually partitioned in the application layer. You can’t have SQL queries that span partitioned (or soon to be partitioned) database tables.

To implement virtual partitioning, GitHub first created schema domains. A schema domain describes a tightly coupled set of database tables that are frequently used together in queries.

An example of a schema domain is the gists schema domain, which consists of the tables gists, gist_comments, and starred_gists. These tables would remain together after a partition.

GitHub stored a list of all the schema domains in a YAML configuration file. Here’s an example of a YAML file with the gists, repositories and users schema domains and their respective tables.

[YAML configuration listing the gists, repositories, and users schema domains with their tables]

Next, GitHub needed to enforce these schema domains: application code must not contain SQL queries or transactions that span schema domains. They enforce this with SQL linters, as in the sketch below.
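
Here is a sketch of the kind of check such a linter performs. The gists tables are the ones listed earlier; the users domain and its tables are illustrative, and this is not GitHub’s actual linter:

```python
SCHEMA_DOMAINS = {
    "gists": {"gists", "gist_comments", "starred_gists"},
    "users": {"users", "user_emails"},  # illustrative domain and tables
}

TABLE_TO_DOMAIN = {
    table: domain
    for domain, tables in SCHEMA_DOMAINS.items()
    for table in tables
}

def lint_query(tables_in_query: set[str]) -> None:
    """Fail if a single query touches tables from more than one domain."""
    domains = {TABLE_TO_DOMAIN[t] for t in tables_in_query}
    if len(domains) > 1:
        raise ValueError(f"query spans schema domains: {sorted(domains)}")

lint_query({"gists", "gist_comments"})  # OK: both live in the gists domain
# lint_query({"gists", "users"})        # would raise: spans two domains
```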

Moving Data without Downtime

Now that GitHub has virtually isolated schema domains, they can physically move their schema domains to separate database clusters.

To do this on the fly, GitHub uses Vitess. Vitess is an open-source database clustering system for MySQL that was originally developed at YouTube.

GitHub uses Vitess’ vertical sharding feature to move sets of tables together in production without downtime. To do that, GitHub uses Vitess’ VTGate proxies as the endpoint for applications to connect to instead of direct connections to MySQL. Vitess handles the rest.
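
From the application’s point of view this is transparent: it opens an ordinary MySQL connection, but the address points at a VTGate proxy instead of a MySQL host. A hedged sketch, assuming the PyMySQL client and illustrative host, port, and keyspace names:

```python
import pymysql  # any MySQL driver would do; PyMySQL is an assumption here

conn = pymysql.connect(
    host="vtgate.internal",  # a VTGate proxy, not a MySQL server (illustrative)
    port=15306,              # VTGate's MySQL-protocol port; deployment-specific
    user="app",
    password="...",
    database="gists",        # a Vitess keyspace, e.g. one schema domain
)
with conn.cursor() as cur:
    # VTGate routes the query to whichever cluster holds the gists tables.
    cur.execute("SELECT body FROM gists WHERE id = %s", (42,))
    print(cur.fetchone())
```

The application does not need to know which physical cluster owns the tables; the proxy owns the routing, which is what "Vitess handles the rest" means in practice.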

Since implementing these changes, GitHub has seen a significant decrease in load on the main database cluster.

In 2019, mysql1 answered 950,000 queries per second on average, 900,000 queries per second on replicas, and 50,000 queries per second on the primary.

In 2021, the same database tables were partitioned over several database clusters. The average load on each host halved, even though total traffic grew to 1,200,000 queries per second.

For more details, you can read the full blog post here.

Conclusion

It’s truly a big achievement to win users for your app, but once you have lots of them, performance issues will follow. One thing the popular platforms have in common is that they started small and addressed performance issues only once growth made them visible. Keep that in mind: follow the KISS (keep it simple, stupid) principle and tackle performance problems when you actually notice them. Trying to optimize everything from the start is a form of overthinking, and a waste of engineering resources in the early stages of a startup.

“Premature optimization is the root of all evil” - Donald Knuth
