If you want to reach your full potential as a developer you need to have at least a basic understanding of how all levels of an application work.
At the higher levels, engineering is judged more on your ability to design a system than on pure coding ability. Knowing this material is essential if you want to advance your career.
In this post I'm going to cover database sharding in depth, as well as some other basic concepts related to database scaling.
- What is Database Sharding
- Why is Database Sharding Used
- Pros and Cons of Sharding
- Database Sharding vs Database Partitions
- Database Sharding Architectures
- What causes databases to slow down?
- Database Sharding alternatives
If you prefer watching a video and would also like to learn more about scaling databases beyond just sharding, check out the video below:
In simple terms, database sharding means taking your data and breaking it up into "shards": different pieces that together hold all of your data. This allows applications to scale far beyond the constraints of a single traditional database.
In human terms, think about owning a successful bakery. Eventually you get too much work to handle on your own so you hire employees for the tasks of making cookies, cakes, and brownies. You've sharded your work so that your business can get more done than what you could do alone.
Databases are sharded for two main reasons: replication and handling large amounts of data.
Sharding allows for replication because we can copy each shard of data onto multiple servers, which makes our application more reliable. If one of those machines goes down, we still have additional machines with the data that can serve requests. With a sharded database your application no longer has a single point of failure and is more fault-tolerant.
Big data requires sharding for the simple fact that at large scale a single machine can't hold the entire dataset. Imagine Facebook trying to store its petabytes of data in a single database on a single server. It would be impossible. At massive scale, the data has to be sharded into manageable pieces and then distributed over a large number of servers.
Without sharding your only option for improving the performance of your database would be vertical scaling. That means adding more disk space, RAM, and faster CPUs to a single machine to try and handle the increasing amount of data your application is handling.
With sharding you break your data into equal-sized, manageable chunks that can be spread across a number of cheap commodity servers and scaled out easily.
A side effect of having all those smaller servers is that your app is more resilient to failure, since each shard can be stored on multiple servers/nodes.
This means that even if one server fails, your app can still retrieve the data it needs. If you were relying on a single massive database server you would be in big trouble if it went down.
Because your data is broken into smaller pieces, queries only have to search smaller amounts of data. This speeds up database performance and response times.
You also have the option of distributing those shards across multiple data centers around the world, which means reduced latency and a better overall experience for your users.
Sharding comes at a price: the additional complexity of managing data spread across all those servers. Sharding should always be a last resort when it comes to scaling your database; alternatives like read replicas and caching should be implemented first because they are much easier to put in place.
When you do have to shard a database, your goal should be to abstract it away from your application developers. To your engineers it should seem like they are just writing to the database like they always have; behind the scenes, a service or library should do the heavy lifting of making sure data is routed to the proper shard.
Some databases have built-in sharding support and handle this outside the application layer, which reduces the complexity for your team.
Keeping data consistent across nodes is one example of additional complexity that comes with sharding. You now have to worry about things like how to update all nodes of a particular shard when data is updated.
In some cases you could end up serving stale data to your users if something is updated but those changes are not propagated to all the shards serving data to other users.
One of the biggest decisions when sharding your database is how you decide to break up your data. The goal should be to distribute the load equally across all the shards.
For example, sharding your database into equal sized chunks based on User ID sounds pretty smart and like an ideal solution.
However, you may run into a similar issue as Instagram, where Justin Bieber's account got significantly higher rates of traffic compared to normal accounts. This resulted in that particular database shard being overloaded constantly despite having the same total number of user accounts partitioned to it. Every time Justin Bieber made a post, the entire Instagram app would slow down for users across the world.
This problem is known as having a "hotspot" or "hot key" in your database. Your goal when picking how to shard your database is to prevent issues like hotspots from impacting your application performance.
There are several different methodologies for sharding a database. In this section I'll cover a few of the most common strategies.
Key based sharding works by taking a value from a newly submitted piece of data and running it through a hashing function, which determines the specific shard the data will be written to. The value chosen from the data is known as the shard key.
A downside of this strategy is the need to remap data to hash values when servers are added or removed.
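A minimal sketch of key-based routing in Python. The shard count, key format, and choice of hash are all hypothetical, just enough to show the mechanism:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical fixed shard count


def shard_for(shard_key: str) -> int:
    """Hash the shard key and map the digest onto one of the shards."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# The same key always routes to the same shard, so reads find
# the data that writes stored.
assert shard_for("user_42") == shard_for("user_42")
```

Note how the modulo ties every key's placement to `NUM_SHARDS`: change the shard count and most keys suddenly map to a different shard, which is exactly the remapping downside described above.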
Range based sharding involves sharding data based on defined ranges of the values being written. Some examples:
- Items with a price below $10 and items above $10 get put into different shards
- Items submitted each week are put into a shard for that week
This type of sharding is intuitive and easy to implement, but could easily result in hotspots and imbalanced data distribution.
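The price example above can be sketched as an ordered list of ranges; the boundaries and shard names here are hypothetical:

```python
# Hypothetical price ranges: (upper bound, shard name), checked in order.
PRICE_RANGES = [
    (10.0, "shard_under_10"),
    (float("inf"), "shard_10_and_up"),
]


def shard_for_price(price: float) -> str:
    """Route an item to the shard whose range contains its price."""
    for upper_bound, shard in PRICE_RANGES:
        if price < upper_bound:
            return shard
```

The hotspot risk is easy to see here: if most of your items cost under $10, one shard absorbs almost all of the traffic.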
Directory based sharding uses a lookup service that maps each shard key to the shard holding its data. The benefit of this is that the underlying sharding scheme can be changed without impacting application code. The downside is that the lookup service can become a single point of failure.
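The lookup-service approach (often called directory based sharding) boils down to a table mapping shard keys to shards. All names here are hypothetical:

```python
# Directory mapping each shard key (here, a region) to a physical shard.
directory = {
    "us-east": "shard_1",
    "us-west": "shard_2",
    "europe": "shard_3",
}


def shard_for(region: str) -> str:
    """Look up which shard holds the data for this key."""
    return directory[region]


# Re-pointing a key is a single directory update;
# application code that calls shard_for() never changes.
directory["europe"] = "shard_4"
```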
The terms "sharding" and "partitioning" get thrown around a lot when talking about databases. For me this was one of the most confusing aspects of learning this stuff because they are often used interchangeably and there is a certain amount of overlap between the terms.
In this section I'll try to clear up some of the confusion and save you the pain of searching around online trying to find an answer.
Partitioning is a general term for breaking up a single dataset into multiple pieces.
Database sharding is technically a subset of database partitioning, specifically a horizontal partition of your data. When you shard, you effectively replicate your table's column schema across every shard and then divide the rows among the shards based on the shard key.
A vertical partition involves splitting a database table on columns. An example could be breaking a single User data table into several different tables like personal information and address/location data.
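As a toy illustration (the column names are made up), a vertical partition splits one wide row into two narrower tables that share the same id:

```python
# One wide User row...
user_row = {"id": 1, "name": "Alice", "email": "alice@example.com",
            "street": "1 Main St", "city": "Springfield"}

# ...split by column into two vertically partitioned tables.
PERSONAL_COLUMNS = ("id", "name", "email")
ADDRESS_COLUMNS = ("id", "street", "city")

personal = {col: user_row[col] for col in PERSONAL_COLUMNS}
address = {col: user_row[col] for col in ADDRESS_COLUMNS}
```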
Visualizing horizontal sharding is pretty easy. Imagine having an Excel spreadsheet with a huge number of rows and a set number of columns.
Horizontally sharding this Excel spreadsheet would be done by choosing a column as the shard key, and then choosing how many keys you want to create. For this example we'll choose the "name" column as our shard key.
You could split the data in half by creating 2 shards, names A-M and N-Z. If the amount of data is so large that 2 shards isn't enough, you could break it into 26 shards, 1 for each letter.
In production, sharding on the starting letter alone would be considered a naïve implementation, but the core concept is the same: you choose a column of data to shard on and then choose how to group that data.
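The A-M / N-Z split described above fits in a few lines (the shard names are made up):

```python
def shard_for_name(name: str) -> str:
    """Two shards split on the first letter of the name column."""
    first_letter = name[0].upper()
    return "shard_A_M" if first_letter <= "M" else "shard_N_Z"
```

Extending this to 26 shards, one per letter, would just mean returning a shard name derived from `first_letter` directly.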
Vertical sharding involves splitting your data by column rather than rows. This is less commonly used but can be powerful in certain situations.
A real world example of vertical sharding is Instagram. When popular accounts posted new images the like count would increase extremely fast and cause performance issues. One part of solving this problem involved splitting off the like count from the rest of the photo metadata.
A good signal that you should vertically partition a table is when you notice lots of queries only requesting a few of the columns at a time.
Another confusing aspect of learning about database sharding is the distinction between logical and physical shards.
Logical shards are the chunks of data, for example your users chunked by the starting letter of their name A-Z.
These logical shards are then distributed across physical shards: the actual database instances running software like PostgreSQL, MongoDB, or MySQL on your servers.
There can be multiple copies of each logical shard spread across multiple physical instances. The reason for this is reliability. Even if 1 server containing the data we need fails, our application can still retrieve a duplicate copy from another physical shard/database server.
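The logical-to-physical mapping can be pictured as a small table; the shard and node names below are hypothetical:

```python
# Hypothetical mapping: each logical shard is replicated on two physical nodes.
LOGICAL_TO_PHYSICAL = {
    "users_A_M": ["db-node-1", "db-node-2"],
    "users_N_Z": ["db-node-2", "db-node-3"],
}


def available_replicas(logical_shard, healthy_nodes):
    """Return the replicas that are still up; any one of them can serve reads."""
    return [node for node in LOGICAL_TO_PHYSICAL[logical_shard]
            if node in healthy_nodes]


# Even with db-node-1 down, users_A_M is still served by db-node-2.
```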
In this section I will go over what exactly causes databases to slow down. There isn't one single reason and depending on the type of application your database will struggle for various reasons like CPU, RAM, or storage limits.
Running out of storage is probably the easiest to understand reason a database would require sharding. In this case you simply have too much data to fit on a single hard drive.
Even if the application only got a few requests per day, you would still need to shard this database so all the data is available. Vertical scaling, i.e. getting a bigger hard drive, only works for so long; at some point it becomes more cost-efficient to use multiple machines with commodity hard drives.
If you have a large number of users querying your database at the same time your CPU may become the bottleneck that causes your database to slow down.
In the worst case scenario user requests would begin to timeout because the requests are stuck waiting while the CPU tries to process them. If you've seen a website go down from suddenly getting a bunch of traffic from somewhere like Hacker News or Reddit, this is a likely cause.
Sharding in this case would spread the load over multiple servers so each can handle a portion of the requests.
Serving data from RAM is much faster than having to seek the data from disk/hard drive. Pretty much all databases automatically cache the most frequently used data. Problems begin to occur when the DB simply doesn't have enough RAM to serve this data quickly. The DB then falls back to retrieving data from disk which is extremely slow.
Under high traffic conditions the server won't be able to keep up and much like with high CPU usage the result is that user requests will begin to timeout. Sharding solves this problem in much the same way, more machines each holding a chunk of the overall data means that there is more RAM available to efficiently serve data.
Data moves through fiber optic cables that have a fixed capacity. If the amount of data being sent back and forth exceeds that amount, requests will begin to fail.
Think about network bandwidth like a bucket of water, if you pour in more than it can hold it overflows. In the case of your application, all that overflow results in failed requests.
There are a number of options available for scaling a database, many of which you should try to use before attempting to shard your database.
A simple way to optimize database performance is to index a table on a column. This allows the database to find matching rows for queries without having to do a full scan of every row in the table.
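A small SQLite sketch of the idea (the table and index names are made up). Asking the planner with `EXPLAIN QUERY PLAN` shows the query being answered via the index rather than a full table scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index in place, the planner searches the index
# instead of scanning every row of users.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("alice@example.com",),
).fetchall()
```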
Caching involves storing data in memory to serve it faster and more efficiently. As you've read above, your database already has some caching built in. To further boost performance and help keep traffic load off the database, most applications have another dedicated caching layer using something like Redis.
These dedicated servers are used only for caching and sit in front of the database. They store the most frequently accessed data and if possible will handle the response. In the case they don't have the requested data stored, the request will be forwarded to the DB.
In many web applications reads are far more common than writes. Twitter for example can have millions of reads for a tweet that was written to the database only once. Caching is perfect for these types of read-heavy apps where updates are less frequent.
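The pattern described above is often called cache-aside. A toy version, with a plain dict standing in for a dedicated cache like Redis and another for the database:

```python
cache = {}  # stand-in for a dedicated in-memory cache such as Redis


def get_user(user_id, database):
    """Cache-aside read: serve from memory if possible, else fall back to the DB."""
    if user_id in cache:
        return cache[user_id]        # hit: the database is never touched
    user = database[user_id]         # miss: read from the database...
    cache[user_id] = user            # ...and remember it for next time
    return user


users_db = {1: "alice"}              # stand-in for the real database
get_user(1, users_db)                # miss: fills the cache
get_user(1, users_db)                # hit: served from memory
```

For a read-heavy app, the second and every subsequent read of the same item never reaches the database at all.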
Read replicas are clones of your database that are configured to only receive read requests from users. This makes them far easier to implement than sharding. All updates and writes are directed towards a single server, and those changes are eventually sent to all the read replicas as well.
One problem with using read replicas is stale data. Think about what would happen if a user makes a request for something that has been updated in the main DB, but that change hasn't yet reached the read replicas? They would get an old version of the data. This might not matter for some applications, but could cause major issues in others.
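A naive sketch of the routing logic (the server names are hypothetical, and real database drivers classify statements far more carefully than this prefix check):

```python
import random

PRIMARY = "primary-db"
READ_REPLICAS = ["replica-1", "replica-2"]


def route(query: str) -> str:
    """Send writes to the single primary; spread reads across the replicas."""
    is_write = query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    return PRIMARY if is_write else random.choice(READ_REPLICAS)
```

The stale-data problem lives in the gap between a write landing on `PRIMARY` and the change reaching each replica: a read routed to a replica during that window returns the old value.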
Vertical scaling is the simplest solution: it doesn't require any changes to your DB or application, you just move to a bigger server for as long as you can. However, if the traffic to your app is going to increase long term, this is basically just kicking the can down the road.
Hopefully this post helped you understand database sharding and scaling. It's a complicated subject and takes time for everything to click mentally, so don't worry if your head is still spinning.
If you found anything confusing let me know in the comments below so I can update the post and clear things up!