Originally published on blog.yugabyte.com
A distributed SQL database provides a service where you can query the global database without knowing where the rows are. You connect to any node, without having to know the cluster topology. You query your tables, and the database will determine the best access to your data, whether it’s close to your client or geographically distant.
The organization of data, whether co-located or partitioned, is the most important consideration for high performance, high availability, and geo-distribution. YugabyteDB has several features that help here. These features can be divided into three levels of physical organization:
- At the query level (YSQL), using the PostgreSQL syntax, the user partitions a logical table into multiple ones, based on column values. This is called table partitioning.
- At the storage level (DocDB), using the primary key and index column definition, we distribute rows into multiple tablets. We call this sharding.
- At the file level, the rows (or index entries) are stored ordered by key within each Sorted Sequence Table (SST) file.
In this blog post, we’ll discuss the relevant terms and definitions behind sharding and partitioning in YugabyteDB and show you how to use both correctly.
Partitioning is a general term, and sharding commonly refers to horizontal partitioning used to scale out the database in a shared-nothing architecture. However, the two terms describe different architectural concepts, and since YugabyteDB provides both, it's important to use the right terminology.
Here are three general definitions before we dive into the YugabyteDB architecture.
If you split your data into multiple databases, this can be called sharding. For example, let's say you have one database for your EMEA customers, one for the US, and one for Asia. The application connects to one or the other depending on the country code of the user. This is called application sharding, but this is not a distributed database. You don't need any special database for that, only some automation to manage it. You can also use multiple instances of a monolithic database. But this requires a lot of code in the application, and operations work to manage it.
If you add middleware to handle part of this additional code, you then have a sharded database. For example, CitusDB runs multiple PostgreSQL databases, and adds a master node. You have to connect to the master, or aggregator, which will redirect the query — or part of it — to the worker nodes. Oracle Sharding is similar, with Shard Director and Connection Pools to coordinate the work that is offloaded to the many databases. This technique is heavily used in data warehouses where you can split queries and process them in parallel. But OLTP transactions on relational tables have to maintain indexes and referential integrity across multiple shards.
In a distributed SQL database, you have no master or director. All nodes are equal. You can connect wherever you want, and query your entire database from there. Each node knows the topology of the cluster, the location of data and metadata, the transaction status, and the next sequence value (with a cache), because it is a single database. You get elasticity without giving up the SQL features of the most advanced relational databases.
Now that we’ve defined those techniques, let’s go deeper into all levels of data organization in YugabyteDB, starting from the storage layer.
At the lowest level, rows (and index entries) are organized within Sorted Sequence Table (SST) files on a Linux filesystem. This is done automatically, during writes and compactions, so that reads can locate a row easily. Each SST file holds rows ordered by key and version, with additional indexes and bloom filters to locate them. In terms of co-location, rows (or index entries) whose keys share the same hash value, or fall within a narrow range of values, are stored together in each SST file. The number of SST files is kept bounded by compaction. At this level, everything is automated in YugabyteDB; the user only provides the filesystem paths.
The SST files store the key-value pairs for tables and indexes. Sharding is the right term here because each tablet is a database (based on RocksDB) with its own replication for protection. This looks like the sharded databases described above, except that the shards are not SQL databases but key-value document stores. They have all the required features of a reliable datastore, with transactions and strong consistency. However, they don't carry the burden of being managed as multiple databases, because the SQL layer sits above them. Joins and secondary indexes are not processed at this level, because handling them per shard would prevent cross-shard transactions.
The sharding is automatic, based on the SQL data model options:
- The colocated property, or tablegroup, defines which SQL relations (tables, indexes, or partitions, as we will see later) are sharded into multiple dedicated tablets, and which ones are colocated: replicated, but not sharded.
- The primary key or index column definition defines the sharding method: HASH or range (ASC or DESC) sharding. This is where we choose to distribute data on a hash of a column to load balance and avoid hotspots. It's also where we keep rows together, on ascending or descending columns, for range scans (with '<', '>', or 'BETWEEN' predicates).
- Optionally, the number of tablets, set with the --ysql_num_tablets flag at the system level, or with a SPLIT AT/INTO clause at the table or index level.
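As a sketch (the table, index, and column names are hypothetical), these sharding choices are all expressed in the DDL:

```sql
-- Hash-shard on account_id to load balance, keep each account's rows
-- together ordered by time for range scans, and pre-split into 8 tablets.
CREATE TABLE payments (
    account_id  bigint,
    created_at  timestamptz,
    amount      numeric,
    PRIMARY KEY (account_id HASH, created_at DESC)
) SPLIT INTO 8 TABLETS;

-- A range-sharded index, pre-split at chosen boundary values:
CREATE INDEX payments_by_time ON payments (created_at ASC)
    SPLIT AT VALUES (('2023-01-01'), ('2024-01-01'));
```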
With sharding, we define the distribution method for the data model at metadata level. But at the data level this is done automatically: the distributed storage layer manages where rows or index entries must be written or read. This approach is sufficient for many applications:
- Co-locate the small tables together to reduce the memory footprint.
- Shard the large and growing tables into multiple tablets, one per node (or more), to be able to rebalance and scale them out when adding more nodes.
- For sharded tables, think about which columns will be accessed by a range scan, when rows are queried together using a range predicate.
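For example, colocation can be declared when creating the database, with large tables opting out to get their own tablets. This is a sketch with hypothetical names; the COLOCATION property was named colocated in older YugabyteDB releases:

```sql
-- All tables of this database share one tablet by default:
CREATE DATABASE appdb WITH COLOCATION = true;

-- In appdb: a large, growing table opts out and is sharded into its own tablets:
CREATE TABLE events (
    event_id  bigint,
    payload   jsonb,
    PRIMARY KEY (event_id HASH)
) WITH (COLOCATION = false);
```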
The previously discussed approach to sharding considers all nodes to be equal, whether they are located in the same data center or across multiple availability zones (AZs). You'll want to load balance across all compute and storage instances, and scale out as needed. The organization of data is decided per table and index. However, you may want to go further, with more control on rows based on their value. This is where you partition a table at the SQL level, on column values, with a HASH function, a RANGE, or a LIST of values. This is a PostgreSQL feature, known as declarative partitioning, which can be used with YugabyteDB because it is fully code-compatible with PostgreSQL.
The main reason for partitioning, besides partition pruning, is information lifecycle management. If you partition by month or years, purging old data is as simple as dropping a partition. This is the only scalable way to do it.
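A minimal sketch of this lifecycle pattern, with hypothetical names, using declarative partitioning by month:

```sql
CREATE TABLE measures (
    device_id    bigint,
    measured_at  timestamptz NOT NULL,
    value        numeric,
    PRIMARY KEY (device_id, measured_at)
) PARTITION BY RANGE (measured_at);

CREATE TABLE measures_2024_01 PARTITION OF measures
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Purging January 2024 is a quick metadata operation, not a million-row DELETE:
DROP TABLE measures_2024_01;
```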
Another reason to partition at this level in a distributed SQL database is when you want to constrain some data to specific locations, such as:
- For legal reasons, when you are required to store data from some countries in a specific region.
- To ensure predictable response time, with data close to the users.
- For cost reasons, to lower cross-AZ, cross-region, or cross-cloud provider network traffic.
For this last option, the partitions are mapped to tablespaces. In PostgreSQL, tablespaces store tables in a specific filesystem location. With YugabyteDB, you map tablespaces to the cluster topology with the placement attributes: cloud provider, region, and zone.
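For example (the cloud, region, and zone names are placeholders), a tablespace can pin replicas to one region, and a partition is then stored in that tablespace:

```sql
CREATE TABLESPACE eu_region WITH (replica_placement='{
  "num_replicas": 3,
  "placement_blocks": [
    {"cloud":"aws","region":"eu-west-1","zone":"eu-west-1a","min_num_replicas":1},
    {"cloud":"aws","region":"eu-west-1","zone":"eu-west-1b","min_num_replicas":1},
    {"cloud":"aws","region":"eu-west-1","zone":"eu-west-1c","min_num_replicas":1}
  ]}');

-- Assuming a users table partitioned by LIST on a country column:
CREATE TABLE users_eu PARTITION OF users
    FOR VALUES IN ('FR', 'DE', 'IT') TABLESPACE eu_region;
```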
After describing the levels of data distribution, let's explain why we need all of them. SQL, especially for OLTP, is complex. You have multiple keys: usually one surrogate key and one or many natural keys. They all need to be enforced by a unique index. And because a table can be partitioned on only one key, the others require a global index. A global index can be partitioned, but not on the same key as the table.
With YugabyteDB, the tablets described above store table rows and index entries as key-value pairs. This provides high performance (with distribution of rows) and high availability (with synchronous replication). When the sharded databases mentioned earlier add SQL at this level, it comes with severe limitations: SQL features are limited to a single shard. This means that indexes are only local, that foreign keys can be enforced only on interleaved tables, and that transactions cannot touch rows from multiple shards. We are back to a hierarchical database, where a table can be accessed by its primary key only.
In OLTP applications, there is a primary key, a surrogate key generated from a sequence or a UUID to obtain an immutable identifier. But there is also the natural key, which we need to enforce with a unique constraint. This is why we use SQL databases: multiple access paths and integrity constraints secure our data with less application code. A data warehouse can accept non-transactional global indexes, but this is not desirable for OLTP. Enforcing the unique constraint is also required to avoid data corruption, like double writes after a failure or through logical replication.
Let's take a simple example: a table of country citizens, where the natural key is the Social Security number. This is enforced with a unique constraint. The generated primary key is immutable and can then be referenced by foreign keys.
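A sketch of this example (names are hypothetical; the UUID generator comes from an extension such as pgcrypto):

```sql
CREATE TABLE citizens (
    citizen_id  uuid DEFAULT gen_random_uuid() PRIMARY KEY,  -- immutable surrogate key
    ssn         text NOT NULL,  -- natural key: Social Security number
    name        text
);

-- A global unique index enforces the natural key across all nodes,
-- sharded independently of the table:
CREATE UNIQUE INDEX citizens_ssn_uq ON citizens (ssn HASH);
```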
With a sharded database like CitusDB, or any other database based on a foreign data wrapper, you cannot do that. Because each shard is a monolithic database, you cannot enforce uniqueness across multiple databases. If you shard by hash on the surrogate key, you can enforce its uniqueness, and foreign keys will be possible from the same shard only. But there is no way to guarantee the uniqueness of the natural key. You have SQL features within each shard, but not at the global level.
What you need for this table is sharding the table on the primary key, for joins through foreign keys, and sharding the index on the natural key for queries by business value. This is where global indexes are required. Above it, the table can be partitioned by country to isolate each continent to a physical region. And there, local indexes are sufficient as the goal is partition placement rather than data distribution.
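Continuing the citizens sketch, a geo-partitioned variant partitions by country. In PostgreSQL, a unique index on a partitioned table must include the partition key, so the Social Security number is enforced by a local index in each partition, which is sufficient here since partitioning serves placement rather than distribution:

```sql
CREATE TABLE citizens_geo (
    citizen_id  uuid NOT NULL,
    country     text NOT NULL,
    ssn         text NOT NULL,
    PRIMARY KEY (country, citizen_id)
) PARTITION BY LIST (country);

CREATE TABLE citizens_fr PARTITION OF citizens_geo FOR VALUES IN ('FR');

-- A local unique index per partition is enough: an SSN belongs to one country.
CREATE UNIQUE INDEX ON citizens_fr (ssn);
```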
With a distributed SQL database like YugabyteDB, the purpose of sharding into tablets is to scale out at the row level. On top of this, you can create unique constraints, secondary indexes, foreign key constraints, and transactions. Partitions and tablespaces give more control over physical location. A distributed SQL database needs to combine both, and then be exposed as one database to the application, which can connect to any node transparently. Sharding operates on tablets for data distribution, applying a hash or range function on rows and global index entries. Partitioning operates on table partitions for data placement, applying a range or list defined on the table, with local indexes. Tablet sharding applies to both YCQL and YSQL, but partitioning is a YSQL-only feature. Documentation details can be found here:
- Partitioning for data placement, described with tablespaces
- Sharding for data distribution, described in primary key and index definition
The concept and syntax for YSQL are an extension of the PostgreSQL CREATE TABLE and CREATE INDEX statements. In YCQL, you may see the general term "partitioning" used for the automatic sharding into tablets. This is its name in Cassandra, and there is no table-level declarative partitioning in YCQL.