One limitation of PostgreSQL declarative partitioning, when compared to some other databases like Oracle, is the inability to create global indexes. This includes the unique index that is necessary to enforce a primary key. It means that there's no built-in way to enforce the uniqueness of a key across all partitions, except when the key includes the partition key. In that case, local indexes that enforce local uniqueness are sufficient to guarantee the global uniqueness of the compound key, but some applications do not like compound primary keys.
With YugabyteDB you don't need declarative partitioning to scale, because tables are automatically split into small tablets by sharding, and, at this level, all indexes are global and can enforce uniqueness. However, on top of that, you may still want to use PostgreSQL declarative partitioning for two reasons: lifecycle management (with the ability to drop old partitions) or geo-partitioning (to assign partitions to specific regions with tablespaces).
How can you guarantee global uniqueness? There are two easy solutions when the application is designed to scale, and one alternative for legacy applications, which is the subject of this blog post.
The easy solutions are:
- Generate the primary key from a UUID or a sequence, and keep it immutable. Both are designed to generate unique values, with a high probability in the case of a UUID or even a 100% guarantee in the case of a sequence. You may not need an additional index.
- Include the partition key in the primary key. This means adding the date (for lifecycle management) or the region (for geo-partitioning) to the local identifier. Applications designed for geo-distribution should do that. The other alternatives are for legacy applications.
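As an illustration of the second option (the table and column names here are hypothetical, not from this lab), a primary key that includes the partition key can be enforced by local indexes alone:

```sql
-- Sketch: the partition key (order_date) is part of the primary key,
-- so each partition's local unique index is enough to guarantee
-- global uniqueness of the compound key.
create table orders (
  id         bigint generated always as identity,
  order_date date not null,
  info       text,
  primary key (id, order_date)
) partition by range (order_date);

-- One partition per year, easy to drop for lifecycle management:
create table orders_2024 partition of orders
  for values from ('2024-01-01') to ('2025-01-01');
```

Dropping `orders_2024` later removes the old data and its local index in one cheap DDL command.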
If for any reason you want an additional guarantee of uniqueness for the part that doesn't include the partition key, there's no other choice than having a global source of truth, and it will limit the scalability. Sequences are one such source, and they are optimized in YugabyteDB by limiting their transactional behavior to the minimum necessary: numbers are cached and the incrementing operations are never rolled back. However, a global index must be fully ACID, so a transaction inserting a new primary key becomes a global transaction. The other approach, a global query to check all partitions, must also be a serializable global transaction.
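To illustrate the sequence optimization, here is a sketch (the sequence name and cache size are arbitrary assumptions): with a large cache, most `nextval()` calls are served from a locally cached range and never touch the global sequence state.

```sql
-- Numbers are handed out in cached ranges; an aborted transaction
-- simply leaves a gap in the numbering, because the increment of the
-- sequence is never rolled back.
create sequence customer_seq cache 1000;
select nextval('customer_seq');  -- usually served from the local cache
```

This is why a sequence scales much better than a global unique index: it is a global source of truth, but one that is rarely consulted.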
Here is an example of building a global unique index as a table maintained by a trigger. In YugabyteDB, tables and indexes are the same structure, because a table is stored in its primary key. So, in YugabyteDB, this solution is not only logically equivalent to a global index but also physically equivalent. In PostgreSQL this solution is more limited because there's no sharding, and a table with a primary key is two structures: a B-Tree index and a heap table.
I've set up a cluster as in this previous post, a three-region cluster (across the solar system, just for fun) with a Docker Compose.
The customers table is geo-partitioned to earth, moon and mars. Its primary key is composed of a generated UUID (`id`) and the region identifier (`planet`):
```
yugabyte=# \d+ customers
                                   Table "public.customers"
 Column | Type | Collation | Nullable |      Default      | Storage  | Stats target | Description
--------+------+-----------+----------+-------------------+----------+--------------+-------------
 id     | uuid |           | not null | gen_random_uuid() | plain    |              |
 planet | text |           | not null |                   | extended |              |
 info   | text |           |          |                   | extended |              |
Partition key: LIST (planet)
Indexes:
    "customers_pkey" PRIMARY KEY, lsm (id HASH, planet ASC)
Partitions: customers_earth FOR VALUES IN ('earth'),
            customers_mars FOR VALUES IN ('mars'),
            customers_moon FOR VALUES IN ('moon')
```
This is sufficient and optimal: an insert is a local transaction, the composite primary key is guaranteed to be unique, and we would have to be very unlucky to see duplicate UUIDs.
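For reference, DDL along these lines would produce the table shown above. This is a sketch, not the exact DDL from the previous post (which also assigns each partition to a regional tablespace); the `hash`/`asc` options in the primary key are YugabyteDB-specific syntax.

```sql
-- Geo-partitioned table: the partition key (planet) is part of the
-- primary key, so local indexes enforce the compound key's uniqueness.
create table customers (
  id     uuid not null default gen_random_uuid(),
  planet text not null,
  info   text,
  primary key (id hash, planet asc)
) partition by list (planet);

create table customers_earth partition of customers for values in ('earth');
create table customers_moon  partition of customers for values in ('moon');
create table customers_mars  partition of customers for values in ('mars');
```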
If, for any reason, there is a need to guarantee that the UUID is globally unique, I cannot directly create a unique index:
```
yugabyte=# create unique index customers_unique_id on customers(id);
ERROR:  insufficient columns in UNIQUE constraint definition
DETAIL:  UNIQUE constraint on table "customers" lacks column "planet" which is part of the partition key.
```
Here is my alternative. I create a table with the same columns as the main table's primary key, but where only the `id` is in the primary key:
```
yugabyte=# create table customers_unique_id (id uuid primary key, planet text not null);
CREATE TABLE
```
This table is not partitioned. That is not a problem in YugabyteDB because automatic sharding applies. The only thing to take care of: if you partitioned for data governance reasons (to keep sensitive data in specific regions), this global table must not contain sensitive information. That should not be a problem with a UUID and a region name.
To guarantee the uniqueness of `id`, I don't need any other column, but I've added the region discriminant, `planet`, because this table can also be used to find the region when only the `id` is known. This is an alternative to the solution in the previous post I'm taking the example from, where duplicate indexes are maintained for that purpose.
This table must be maintained automatically when rows are inserted or deleted, or when the `id` is updated (which should not happen, as it is part of the primary key, but we are talking about legacy applications, so better be safe against all unexpected cases).
Here is the trigger function:
```sql
create or replace function customers_unique_id() returns trigger as $$
declare
 rows smallint;
begin
 -- on DELETE and UPDATE, remove the old mapping row
 if tg_op in ('DELETE', 'UPDATE') then
  delete from customers_unique_id
   where id = old.id and planet = old.planet;
  get diagnostics rows = row_count;
  if rows != 1 then
   raise '% affected % rows (expected: 1)', tg_op, rows;
  end if;
 end if;
 -- on INSERT and UPDATE, add the new mapping row
 -- (a duplicate id raises a unique constraint violation here)
 if tg_op in ('INSERT', 'UPDATE') then
  insert into customers_unique_id (id, planet)
   values (new.id, new.planet);
  get diagnostics rows = row_count;
  if rows != 1 then
   raise '% affected % rows (expected: 1)', tg_op, rows;
  end if;
 end if;
 return new;
end;
$$ language plpgsql;
```

Note that DELETE and INSERT must be two independent `if` blocks, each checked with `get diagnostics`, so that an UPDATE runs both the delete of the old row and the insert of the new one.
and the trigger:
```sql
create trigger customers_unique_id
 after delete or insert or update of id, planet on customers
 for each row execute function customers_unique_id();
```
Finally I initialize this table:
```sql
begin transaction;
delete from customers_unique_id;
insert into customers_unique_id select id, planet from customers;
end;
```
I did that within a transaction, starting with a delete, in case some rows were inserted after I created the trigger. In PostgreSQL you can do all of that, including the DDL, in one transaction (but you must be in serializable isolation for this case). However, doing so locks the table in exclusive mode for the DDL commands. For online changes, I prefer to separate the DDL (short, but with an exclusive lock) from the DML (can be long, but non-blocking). YugabyteDB has an optimistic approach for DDL (no exclusive lock, but applications may get a serializable error) and runs DDL outside the main transaction, so this is the right way to do it.
I try to insert the same `id` with different regions and check the correct behavior:
```
yugabyte=# insert into customers values('f00dcafe-dead-beef-face-c0ffeec0de12', 'mars', 'one');
INSERT 0 1
yugabyte=# insert into customers values('f00dcafe-dead-beef-face-c0ffeec0de12', 'mars', 'one');
ERROR:  duplicate key value violates unique constraint "customers_mars_pkey"
yugabyte=# insert into customers values('f00dcafe-dead-beef-face-c0ffeec0de12', 'moon', 'one');
ERROR:  duplicate key value violates unique constraint "customers_unique_id_pkey"
yugabyte=# delete from customers where id::text like 'f00dcafe%';
DELETE 1
yugabyte=# insert into customers values('f00dcafe-dead-beef-face-c0ffeec0de12', 'moon', 'one');
INSERT 0 1
```
DML that doesn't violate my business logic (`id` being globally unique) succeeded; the rest failed.
In my lab, the docker-compose also starts a 'metrics' container that runs my YBWR script every 10 seconds to show the tablet activity. I have run the following, inserting 1000 rows into the `moon` partition:
```
yugabyte=# insert into customers(planet, info) select 'moon', generate_series(1,1000);
INSERT 0 1000
```
Before creating the trigger, this was running in `Time: 75.999 ms` with the following reads and writes:
```
 rocksdb_seek | rocksdb_next | rocksdb_insert | dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+----------------------------------------------------------------------------------------
          498 |              |            166 | yugabyte customers_moon 393b02758fcf487994b666589d39e31c L yb-tserver-1.base.moon.star
          441 |              |            147 | yugabyte customers_moon 57cca7ebc5494848ad00b30a3ac88c44 L yb-tserver-1.base.moon.star
          543 |              |            181 | yugabyte customers_moon 60fd4b4984f2433385aa716572a73aca L yb-tserver-1.base.moon.star
          513 |              |            171 | yugabyte customers_moon 7a7ca74a195949baaa819775980a1eb8 L yb-tserver-1.base.moon.star
          537 |              |            179 | yugabyte customers_moon d3df57a21e35437b84f17a31b6fffd88 L yb-tserver-1.base.moon.star
          468 |              |            156 | yugabyte customers_moon e5e46ea408764a62950a6cb18954e77e L yb-tserver-1.base.moon.star
(6 rows)
```
The inserts have to seek (read to check whether the row already exists, for duplicate detection in the partitioned table) and insert (write the row to the LSM-tree) into multiple tablets. The tablets are all in the same region (`moon`) and, in this small lab, they are even on the same server. This is a local transaction.
I've run the same after creating the trigger:
```
 rocksdb_seek | rocksdb_next | rocksdb_insert | dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+----------------------------------------------------------------------------------------------
          492 |              |            164 | yugabyte customers_moon 393b02758fcf487994b666589d39e31c L yb-tserver-1.base.moon.star
          558 |              |            186 | yugabyte customers_moon 57cca7ebc5494848ad00b30a3ac88c44 L yb-tserver-1.base.moon.star
          480 |              |            160 | yugabyte customers_moon 60fd4b4984f2433385aa716572a73aca L yb-tserver-1.base.moon.star
          513 |              |            171 | yugabyte customers_moon 7a7ca74a195949baaa819775980a1eb8 L yb-tserver-1.base.moon.star
          492 |              |            164 | yugabyte customers_moon d3df57a21e35437b84f17a31b6fffd88 L yb-tserver-1.base.moon.star
          465 |              |            155 | yugabyte customers_moon e5e46ea408764a62950a6cb18954e77e L yb-tserver-1.base.moon.star
          480 |              |            160 | yugabyte customers_unique_id 01a52bd802f342d9902042dbb8360ca5 L yb-tserver-2.base.mars.star
          492 |              |            164 | yugabyte customers_unique_id 13c9a37b25244d69a1782cc3591f852e L yb-tserver-1.base.moon.star
          513 |              |            171 | yugabyte customers_unique_id 6138c8e697e1483b8a1950e734ec9b85 L yb-tserver-0.base.earth.star
          492 |              |            164 | yugabyte customers_unique_id 7cee267d09f24362b4a1089b808b8db6 L yb-tserver-2.base.mars.star
          558 |              |            186 | yugabyte customers_unique_id ee3fa437841648c280b9a824d7d80ebc L yb-tserver-1.base.moon.star
          465 |              |            155 | yugabyte customers_unique_id fa9febc32ec14bfcaebdf475f5d6cb53 L yb-tserver-0.base.earth.star
(12 rows)
```
This has run in `Time: 778.955 ms`, 10 times slower. The reads and writes are 2 times higher, because of the new table to maintain, but the major difference is that multiple servers and regions are now touched by the transaction. YugabyteDB implements many optimizations for single-shard, single-server and single-region transactions. In the latter case, the transaction table itself can be remote. With this global table we cannot benefit from those single-region optimizations.
When you want to scale a geo-distributed application you should:
- choose the right database. The closest to PostgreSQL is YugabyteDB, which provides all SQL features on a horizontally scalable infrastructure.
- design your application so that the critical services run locally in one region, which means avoiding transactions that have to read and write across regions. The SQL features that help with this design are: declarative partitioning, composite primary keys, sequences, UUID generation extensions, and triggers.
That's the reason why YugabyteDB implements all those features, reusing the PostgreSQL code where possible and pushing the others down to the distributed storage. The distributed databases that do not support triggers require you to change your application: on top of the business logic, you must add the code that validates data integrity, such as uniqueness, and the regression tests for all race conditions on it.