Kostas Kalafatis

Posted on May 24, 2023 • Originally published at dfordebugging.wordpress.com

Database Sharding

#architecture #database

As an application or website gains traction and attracts a significant number of users, it inevitably reaches a point where scaling becomes imperative. The ability to accommodate the surge in visitors is vital for data-driven applications and websites, and it is equally crucial to uphold the security and integrity of the stored data. Given the unpredictable nature of popularity and its duration, many businesses opt for a flexible database architecture that facilitates dynamic scaling. This approach acknowledges the challenges associated with accurately forecasting the future popularity of a website or application.

In this article, we are going to discuss one such architectural technique: database sharding. Sharding has been receiving a lot of attention in recent years, but what it is or the scenarios in which it might make sense to shard a database isn't yet clearly understood.

What is Sharding?

Sharding is a database architecture pattern related to horizontal partitioning - the technique of separating one table's rows into multiple different tables, or partitions. All partitions have the exact same schema, but completely different rows. Likewise, the data held in each partition is independent of the data held in other partitions.

When considering horizontal partitioning, it may be helpful to think about how it relates to vertical partitioning. In a vertically partitioned database, columns are divided and placed into new, separate tables. The data held within each vertical partition has unique rows and columns and is independent of the data held in the others. A table can be divided both horizontally and vertically.

Sharding involves breaking up data into two or more smaller chunks, named logical shards. The logical shards are then distributed across multiple database nodes, referred to as physical shards, which can hold multiple logical shards. The data held within all the shards collectively represent an entire logical dataset.

Database shards implement a shared-nothing architecture. This means that the shards don't share any data or computing resources. In some cases, though, it might make sense to replicate certain tables into each shard to serve as lookup or reference tables. For example, imagine a database that uses fixed conversion rates for different temperature and weight measurements. By replicating a reference table that contains the necessary conversion rate data into each shard, we ensure that all of the data required for queries is held in every shard.

Often, sharding is implemented at the application level, meaning that the application defines the algorithms that define which shard to transmit reads and writes to. However, more and more modern database management systems like Apache Cassandra have sharding capabilities built in, allowing us to implement sharding directly at the database level.

What are the options before sharding?

Like any distributed architecture, sharding comes with its own costs. Setting up shards, keeping the data on each shard up to date and making sure requests are sent to the appropriate shard is both time-consuming and complicated.

Also, sharding is not a panacea, and we should not just shard our databases for the sake of it. Before you start sharding, you should consider any of the other options.

Option 1: Do Nothing

If you don't have any clear bottleneck or limiting factor like running out of hardware, you should not even consider sharding. If it isn't broken, don't fix it.

Option 2: Vertical Scaling

Vertical scaling, or "scaling up", is the old-school way of scaling a database. It involves adding more resources to a server instance. By increasing CPU, memory, storage and network bandwidth, the performance of the database node can be improved. We either procure the largest machine we can afford or give our existing server as many resources as we can.

Vertical scaling is typically easier for developers and administrators since they do not have much to do to support a larger machine. But at some point, we may reach the capacity ceiling for the hardware we have, and we will need to spend an ever-increasing amount of time optimizing the application and the database design to squeeze out every last bit of performance. Also not only can initial hardware costs be high due to the need for high-end hardware and virtualization, but upgrades can be both expensive and limited. After all only so much can be added to one machine before it is still outgrown by the database.

Option 3: Read Replicas

Database replicas, also known as read replicas, are duplicates of a primary database created within distributed database systems to enhance performance and scalability. The process of replication ensures that consistent copies of the data are maintained across multiple servers. In this setup, the primary server is responsible for handling write operations, while the replicas are tasked with handling read operations.

Read replicas allow data to be available for reading across any number of servers, called “slaves”. One server remains the “master” and accepts any incoming write requests, along with read requests. This technique is common for relational databases, as most vendors support the replication of data to multiple read-only servers. The more read replicas installed, the more read-based queries may be scaled.

In addition to supporting a greater scale of read-only queries, the read replica strategy is used to offload long-running reports that require long-running queries to dedicated servers and away from typical user traffic. This technique is also used for high availability of the database, since the master read/write server can be replaced by a slave server as it is an exact copy.

Changes made to the primary database are propagated to the replica servers during the replication process. To keep the copies in sync with the main database, a replication method is typically employed. This can be accomplished using a variety of methods, including log-based replication, which uses the transaction log of the primary database to record changes and apply them to the replicas.

Write-Ahead Logging (WAL)

The Write Ahead Logging (WAL) technique is a popular method to preserve the atomicity and durability of data writes. This technique operates on the concept of logging data writes in some secure storage before making any permanent changes in the database. This way we can ensure the durability of our writes in case of a server crash.

This allows us to just copy the required log entry and use it again as a data write if there is any sudden loss in the database. Moreover, since the log entries will be in sequential order, we can maintain the atomicity of the data operations by referring to the logs in the event of a server crash.

Write Ahead Logging maintains individual logs for every server process stored sequentially. Every single log entry has a unique identifier to avoid any confusion. Moreover, these identifiers facilitate certain operations on the log, such as Log Segmentation, Log Cleaning with Low-Water Mark etc.

Pros and Cons of Sharding

The main advantage of sharding a database is that it can help in the facilitation of horizontal scaling, or scaling out. When we scale horizontally, we add more machines to an existing stack to spread out the load and allow more traffic and faster processing.

Another advantage of sharded databases is the decreased query response times. When we submit a query on a database that hasn't been sharded, it may have to search every row in the table you're querying. Even if you indexed the columns appropriately, in a large monolithic database, queries can become prohibitively slow. By sharding one table to multiple, though, queries have to review fewer entries and their result sets are returned quicker.

Finally, sharding can help to make an application more robust, by mitigating the impact of outages. If the application relies on an unsharded database, an outage can potentially make the entire application unavailable. On the contrary, in a sharded database, outages typically affect only a few shards. Even though this might make some functionality temporarily unavailable. the overall impact would be less than an entire application crash.

The main disadvantage of sharding is the sheer complexity of properly implementing a sharded database architecture. If the restructuring is done incorrectly, there's a significant risk of data loss and table corruption. Even when done correctly, though, sharding is likely to have a major impact on the existing workflows. Rather than accessing, managing and testing data from a single entry point, each team must manage data across multiple shard locations, which could potentially be disruptive.

Another problem that users sometimes encounter after having sharded a database is that the shards eventually become unbalanced. Let's say that we have a database with two separate shards, one for customers that have a debit card and one for customers that have a credit card. However, the application serves an inordinate amount of people who don't want a credit card. In fact, every time a user opens an account, they automatically get a debit card. The credit card shard has become what is known as a database hotspot. In this case, any benefits of sharding are cancelled out by the bottleneck of the debit cards shard. The database would probably need resharding to allow for a more even data distribution.

Finally, once a database has been sharded, it can be very difficult to return it to its unsharded form. Any backups of the database made before it was sharded will be outdated. As a result, rebuilding the original database architecture would require merging the new partitioned data with the old backups, or alternatively, creating a new single database, both of which would be costly and time-consuming endeavors.

Sharding If You Must

Whether or not sharding a database is a viable strategy, is almost always a matter of debate. Some see sharding as an inevitable outcome of databases that reach a certain size, a critical mass if you want, while others see it as a headache that should be avoided at all costs.

Before you shard your database, you need to answer a few important questions. Your plan will depend on how you answer these questions.

How do my tables interact, and what queries do I run on my database? Sharding works best when the data can easily be partitioned into independent subsets. If there are complex relationships or you have frequent cross-shard queries, sharding might not be the best solution.
How will data grow? Can I redistribute it later? If you anticipate that your application will significantly expand in the future, sharding can provide a scalable foundation from the beginning. It's generally easier to implement sharding early on than to retrofit it into an existing monolithic database.
How do we distribute the data across shards? Are there any potential hotspots if data isn't distributed evenly? Sharding can help mitigate hotspots by distributing the load across multiple shards. If a specific subset of data is expected to generate a high volume of read or write operations, you can distribute that data across multiple shards to distribute the load. This way, the hotspots are shared among multiple servers, reducing the likelihood of performance issues.

Sharding Architectures

Once you've decided that sharding is for you, the next thing you need to decide is how you'll go about doing so. Let's go over a few common sharding architectures.

Key-Based Sharding

Key-based sharding, also known as hash-based sharding, involves using a value from the dataset, a customer's ID, an IP address or some other value, and plugging it into a hash function to decide which shard the data should be written to.

The values passed in the hash function should all originate from the same column to guarantee that entries are placed in the right shards and consistently. The term "shard key" refers to this column. Shard keys are comparable to primary keys in that both are columns that serve as a distinctive identifier for each row. A shard key should, in general, be static, meaning it shouldn't contain data that could change over time. Otherwise, it would add to the workload associated with update operations, thereby slowing down performance.

Although key-based sharding is a very common sharding design, adding or removing extra servers from a database on the fly can be challenging. Many of the current entries, if not all of them, will need to be remapped to their new, right hash value and then migrated to the appropriate server when we add servers because each one will require a corresponding hash value. The old and new hashing functions won't work once we start rebalancing the data. As a result, during the migration, the server won't be able to write any new data, and the application may experience downtime.

The key benefit of this approach is that hotspots may be avoided by using it to equally disperse data. Additionally, since the data is distributed algorithmically, there is no need to keep track of where each piece of data is located, unlike with other approaches like range-based sharding or directory-based sharding.

Range-based Sharding

Range-based sharding divides data by value ranges. Let's imagine we have a database of a retailer's merchandise. We might build several shards and divide product information by price range, like this:

The key advantage of range-based sharding is how easy it is to implement. Despite the fact that each shard contains a unique set of data, they all share the same schema with the main database. The application code determines which range the data belongs to before writing it to the appropriate shard.

On the other hand, range-based sharding does not prevent data from being dispersed unevenly, which results in the database hotspots stated before. Even if each shard in the example diagram contains an equal quantity of data, it is likely that some products will get more attention than others. In result, their particular shards will get a disproportionately high amount of reads.

Directory-based Sharding

A lookup table is used in directory-based sharding to keep track of which shard contains which data. It establishes a one-to-one mapping between the data and the shard in which it is kept.

A good example is provided below. As with key-based sharding, a column from the base table is chosen as the shard key. Then, for each shard key, a unique shard ID is assigned, indicating which shard contains the data indicated by the associated shard key. In this manner, various shards are created from all of the rows in the original table.

The ID field is a shard key in this case. The data from the shard key is written to the search table, along with which shard each row should be written to. This is similar to range-based sharding, but instead of figuring out which range the shard key's data goes into, each key is tied to its own shard. Directory-based sharding is better than range-based sharding when the shard key has few possible values and it doesn't make sense for a shard to store a range of keys. It's also different from key-based sharding in that it doesn't run the shard key through a hash function. Instead, it checks the key against a lookup table to see where the data needs to be put.

Directory-based sharding is most appealing because of how flexible it is. Range-based sharding architectures only let you define ranges of values, while key-based ones only let you use a fixed hash function, which, as we've already said, can be very hard to change in the future. With directory-based sharding, on the other hand, you can assign data entries to shards using any method or algorithm you want, and it's easy to add shards on the fly.

Even though directory-based sharding is the most flexible of the ways we've talked about, having to connect to the lookup table before every query or write can slow down an app. Also, the reference table can become a single point of failure: if it gets corrupted or fails in some other way, it can make it hard to write new data or get to data that has already been written.

Conclusion

We've talked about sharding, when to use it, and how to set it up. Sharding is an effective solution for applications that require a huge volume of data to be managed and easily accessible for high levels of reading and writing. Nonetheless, it makes operations more difficult. Before you begin implementation, consider whether the advantages outweigh the expenses or if there is a simpler solution.

DEV Community