DEV Community

Franck Pachot for YugabyteDB

Posted on • Updated on

Declarative vs. Application side Foreign Key referential integrity

I started this series to compare the support of basic SQL features like unique constraint and foreign keys. I do that to categorize the NewSQL database that can scale-out without cross-shard integrity constraints, from Distributed SQL which provides all SQL features in distributed transactions. The former is sufficient for datawarehouse, analytics, and simple transactions. The latter can scale-out all OLTP workloads found in System of Records.

When talking about foreign keys, the databases which do not support them will often answer “You don't need it” and developers think they can enforce the integrity from the application. But this works only on a single-user database. Think about this scenario, based on the traditional EMP/DEPT schema used to demonstrate SQL features since it's invention:

  1. User A starts a transaction, queries DEPT and EMP and realizes that Department 40 has no employees. She deletes it.
  2. User B starts a transaction, adds an employee to Department 40. She checked that the department exists, which is true because, thanks to the transaction isolation, the I in ACID, the uncommitted delete from User A is not visible. Hopefully, the most sensible databases never implemented dirty reads. Then she inserted the employee and committed the transaction.
  3. User A commits her transaction. In Read Committed isolation level, which is the default for the most common SQL databases, there is no conflict detected.

This scenario leaves the new employee as an orphean that belongs to an inexistant department. This corrupts data, breaking the C in ACID.

How can we avoid this?

  • The best is to declare a foreign key constraint. The database can see, internally, the concurrent changes that are invisible to the user. With such integrity constraint, User B insertion would have waited to see if the User A deletion is committed or not and would have raised a "parent key not found" error after step 3 if it was committed. Or, with optimistic transactions, would not have waited but detected the conflict later.
  • If User A's transactions were started with a higher isolation level, Serializable, and checked for the absence of child rows, a conflict would have been detected because, at this level, no anomaly is allowed (phantom reads in this case).
  • If User A locked the EMP whole table before querying for children, it would have prevented the concurrent insert. But this is not scalable, especially with databases that do not support range locks.
  • If User B locked the parent row when looking at the existence of Departement 40, it would have waited on User A's transaction. But this is not scalable, especially with databases that do not shared row locks.

There is only one solution that does not depend on the specific database capabilities and be there as long as it is a SQL database: Referential Integrity with Foreign Key. You may be suprised by this, but even the Oracle Database, the most popular traditional DB used for critical enterprise application does not support Serializable isolation level, doesn't provide range locks, and can lock rows only in exclusive mode.

I'll run my example with a modern Distributed SQL database, built on PostgreSQL and bringing all those features to a scale-out architecture: YugabyteDB.

Example on YugabyteDB Managed

I'll run this in YugabyteDB Managed free tier with the tables from the tutorial:
cloud

Those are the same tables as in the previous post. Here it is for quick copy/paste:

PGHOST=eu-west-1.88d36171-48a8-4116-93ff-587a5ea344be.aws.ybdb.io
PGPORT=5433
PGDATABASE=yugabyte
PGSSLMODE=required
PGUSER=admin
PGPASSWORD=Franck

psql

CREATE TABLE IF NOT EXISTS dept ( deptno integer NOT NULL, dname text, loc text, description text, CONSTRAINT pk_dept PRIMARY KEY (deptno asc));
CREATE TABLE IF NOT EXISTS emp ( empno integer generated by default as identity (start with 10000) NOT NULL, ename text NOT NULL, job text, mgr integer, hiredate date, sal integer, comm integer, deptno integer NOT NULL, email text, other_info jsonb, CONSTRAINT pk_emp PRIMARY KEY (empno hash), CONSTRAINT emp_email_uk UNIQUE (email), CONSTRAINT fk_deptno FOREIGN KEY (deptno) REFERENCES dept(deptno), CONSTRAINT fk_mgr FOREIGN KEY (mgr) REFERENCES emp(empno), CONSTRAINT emp_email_check CHECK ((email ~ '^[a-zA-Z0-9.!#$%&''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$'::text))); 
INSERT INTO dept (deptno, dname, loc, description) VALUES (10, 'ACCOUNTING', 'NEW YORK','preparation of financial statements, maintenance of general ledger, payment of bills, preparation of customer bills, payroll, and more.'), (20, 'RESEARCH', 'DALLAS','responsible for preparing the substance of a research report or security recommendation.'), (30, 'SALES', 'CHICAGO','division of a business that is responsible for selling products or services'), (40, 'OPERATIONS', 'BOSTON','administration of business practices to create the highest level of efficiency possible within an organization'); 
INSERT INTO emp (empno, ename, job, mgr, hiredate, sal, comm, deptno, email, other_info) VALUES (7369, 'SMITH', 'CLERK', 7902, '1980-12-17', 800, NULL, 20,'SMITH@acme.com', '{"skills":["accounting"]}'), (7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20', 1600, 300, 30,'ALLEN@acme.com', null), (7521, 'WARD', 'SALESMAN', 7698, '1981-02-22', 1250, 500, 30,'WARD@compuserve.com', null), (7566, 'JONES', 'MANAGER', 7839, '1981-04-02', 2975, NULL, 20,'JONES@gmail.com', null), (7654, 'MARTIN', 'SALESMAN', 7698, '1981-09-28', 1250, 1400, 30,'MARTIN@acme.com', null), (7698, 'BLAKE', 'MANAGER', 7839, '1981-05-01', 2850, NULL, 30,'BLAKE@hotmail.com', null), (7782, 'CLARK', 'MANAGER', 7839, '1981-06-09', 2450, NULL, 10,'CLARK@acme.com', '{"skills":["C","C++","SQL"]}'), (7788, 'SCOTT', 'ANALYST', 7566, '1982-12-09', 3000, NULL, 20,'SCOTT@acme.com', '{"cat":"tiger"}'), (7839, 'KING', 'PRESIDENT', NULL, '1981-11-17', 5000, NULL, 10,'KING@aol.com', null), (7844, 'TURNER', 'SALESMAN', 7698, '1981-09-08', 1500, 0, 30,'TURNER@acme.com', null), (7876, 'ADAMS', 'CLERK', 7788, '1983-01-12', 1100, NULL, 20,'ADAMS@acme.org', null), (7900, 'JAMES', 'CLERK', 7698, '1981-12-03', 950, NULL, 30,'JAMES@acme.org', null), (7902, 'FORD', 'ANALYST', 7566, '1981-12-03', 3000, NULL, 20,'FORD@acme.com', '{"skills":["SQL","CQL"]}'), (7934, 'MILLER', 'CLERK', 7782, '1982-01-23', 1300, NULL, 10,'MILLER@acme.com', null);

Enter fullscreen mode Exit fullscreen mode

I'm using psql with environment variables (but do not think I left my password unchanged) so that I can run a nested session while the other is opened:

yugabyte=> begin transaction;
BEGIN

yugabyte=> \! psql
              from pg_stat_activity
              where client_addr=inet_client_addr();
           \q

psql (12.7, server 11.2-YB-2.15.0.0-b0)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

yugabyte=>               select pid, state, backend_start, state
yugabyte->               from pg_stat_activity
yugabyte->               where client_addr=inet_client_addr();

  pid  |        state        |         backend_start         |        state
-------+---------------------+-------------------------------+---------------------
 28581 | idle in transaction | 2022-09-25 18:35:40.922706+00 | idle in transaction
 12616 | active              | 2022-09-25 19:11:55.387199+00 | active
(2 rows)

yugabyte=> rollback;
ROLLBACK
Enter fullscreen mode Exit fullscreen mode

This makes it easy to demo from a single screen. But of course, you can run this from multiple terminals or cloud shells.

With Foreign Key

There's no doubt about it, the right solution in SQL is to create the tables with all Foreign Keys, as I did before. Even if there are other solutions, nothing will beat the declare-once approach which requires no additional code and tests.

I'll run this where a first transaction inserts a new employee in a department deleted by another user:

begin transaction;
insert into emp(ename, deptno) values ('Franck',40);
\! psql 
begin transaction;
select count(*) from emp where deptno=40;
delete from dept where deptno=40;
commit;
\q
commit;
Enter fullscreen mode Exit fullscreen mode

This must fail because it violates the referential integrity. And in race condition, one of the transactions must be aborted:

yugabyte=> begin transaction;
BEGIN

yugabyte=> insert into emp(ename, deptno) values ('Franck',40);
INSERT 0 1

yugabyte=> \! psql
psql (12.7, server 11.2-YB-2.15.0.0-b0)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

yugabyte=> begin transaction;
BEGIN

yugabyte=> select count(*) from emp where deptno=40;
 count
-------
     0
(1 row)

yugabyte=> delete from dept where deptno=40;
DELETE 1

yugabyte=> commit;
COMMIT

yugabyte=> \q

yugabyte=> commit;
ERROR:  Operation expired: Heartbeat: Transaction 714422b4-2c41-4fb4-bea9-4707364546a7 expired or aborted by a conflict: 40001
yugabyte=>

Enter fullscreen mode Exit fullscreen mode

This is a serialization error where the application must retry, and then will get:

yugabyte=> insert into emp(ename, deptno) values ('Franck',40);
ERROR:  insert or update on table "emp" violates foreign key constraint "fk_deptno"
DETAIL:  Key (deptno)=(40) is not present in table "dept".
Enter fullscreen mode Exit fullscreen mode

The foreign key constraint prevents any corruption of data integrity. You don't see the parent not found error because I'm using the default optimistic locking here, be the conflict is detected to avoid anomalies.

Without Foreign Key but serializable

I add again the empty department and drop the foreign key:

insert into dept(deptno) values (40);
alter table emp drop constraint fk_deptno;
Enter fullscreen mode Exit fullscreen mode

Without the Foreign Key, the conflict would not have been detected with the default isolation level. But YugabyteDB supports Serializable, which is required to avoid phantom reads:

begin transaction isolation level serializable;
insert into emp(ename, deptno) values ('Franck',40);
\! psql 
begin transaction isolation level serializable;
select count(*) from emp where deptno=40;
delete from dept where deptno=40;
commit;
\q
commit;
Enter fullscreen mode Exit fullscreen mode

I have similar conflict detection as above. But this, as no referential integrity constraint was declared, requires careful code to set the serializable isolation level, read the state that must stay from the whole transaction, and of course, this must be validated by your tests. Here is the result:

yugabyte=> begin transaction isolation level serializable;
BEGIN

yugabyte=> insert into emp(ename, deptno) values ('Franck',40);
INSERT 0 1

yugabyte=> \! psql
psql (12.7, server 11.2-YB-2.15.0.0-b0)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

yugabyte=> begin transaction isolation level serializable;
BEGIN

yugabyte=> select count(*) from emp where deptno=40;
 count
-------
     0
(1 row)

yugabyte=> delete from dept where deptno=40;
DELETE 0

yugabyte=> commit;
COMMIT

yugabyte=> \q

yugabyte=> commit;
ERROR:  Operation expired: Heartbeat: Transaction 0857dad9-c72c-46d7-be4b-7b3ad61f3d0b expired or aborted by a conflict: 40001
yugabyte=>

Enter fullscreen mode Exit fullscreen mode

This, even without the declarative Foreign Key, detects the conflict. This works because I select count(*) from emp where deptno=40; in a isolation level serializable transaction. Without this, it would have accepted the insert with no parent, breaking the data integrity. And such application bug happens only at race condition: will probably not be detected during you CI tests but will happen one day in production.

With declarative referential integrity, you create the foreign key once, and have the guarantee of integrity.

Without Foreign Key but lock the parent

What if your database doesn't provide Serializable? You can use explicit locking. I can also show that on YugabyteDB.

I add again the empty department:

insert into dept(deptno) values (40);
Enter fullscreen mode Exit fullscreen mode

I run the same code as above, without foreign key and without serializable isolation, but locking the parent row in share mode before inserting a child to it:

begin transaction;
select * from dept where deptno=40 for share;
insert into emp(ename, deptno) values ('Franck',40);
\! psql 
begin transaction;
select count(*) from emp where deptno=40;
delete from dept where deptno=40;
commit;
\q
commit;
Enter fullscreen mode Exit fullscreen mode

Here is the result, again with conflict detected:

yugabyte=> begin transaction;
BEGIN

yugabyte=> select * from dept where deptno=40 for share;

 deptno | dname | loc | description
--------+-------+-----+-------------
     40 |       |     |
(1 row)

yugabyte=> insert into emp(ename, deptno) values ('Franck',40);
INSERT 0 1

yugabyte=> \! psql
psql (12.7, server 11.2-YB-2.15.0.0-b0)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

yugabyte=> begin transaction;
BEGIN

yugabyte=> select count(*) from emp where deptno=40;
 count
-------
     1
(1 row)

yugabyte=> delete from dept where deptno=40;
ERROR:  Operation failed. Try again: 8c7508fa-03fd-48cc-9907-b6cbf6479188 Conflicts with higher priority transaction: 151a2cab-7b2d-4f46-a1a2-beb0dd9b8ecb
yugabyte=> rollback;
ROLLBACK

yugabyte=> \q

yugabyte=> commit;
COMMIT
Enter fullscreen mode Exit fullscreen mode

This works. The drawback is, again, additional code. And the code is database specific. Not all databases support for share mode for row locks. Most databases support exclusive ones with for update but this would prevent concurrent uses to insert children rows for the same parent, which is not scalable.

Without Foreign Key but lock the child table

Lock can be done in the other way, locking the child when deleting the parent.

I add again the empty department:

insert into dept(deptno) values (40);
Enter fullscreen mode Exit fullscreen mode

Here is my code. This is a lock table command because you cannot lock rows here. The goal is to prevent potential insertions. You cannot select for update on rows that do not exist.

begin transaction;
insert into emp(ename, deptno) values ('Franck',40);
\! psql 
begin transaction;
lock table emp in share mode;
select count(*) from emp where deptno=40;
delete from dept where deptno=40;
commit;
\q
commit;
Enter fullscreen mode Exit fullscreen mode

This lock table, even in share mode is not scalable because it blocks any DML on the table, even not related to the parent key that I want to delete. As this is not scalable, it is not a priority to support it in a Distributed SQL database. YugabyteDB doesn't allow it in the current version (2.15):

yugabyte=> begin transaction;
BEGIN
yugabyte=> insert into emp(ename, deptno) values ('Franck',40);
INSERT 0 1
yugabyte=> \! psql
psql (12.7, server 11.2-YB-2.15.0.0-b0)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

yugabyte=> begin transaction;
BEGIN
yugabyte=> lock table emp in share mode;
ERROR:  SHARE not supported yet
LINE 1: lock table emp in share mode;
                          ^
HINT:  Please report the issue on https://github.com/YugaByte/yugabyte-db/issues
Enter fullscreen mode Exit fullscreen mode

I'll stop there. Explicit locks is never a good idea. First, because it doesn't scale, and also because it is additional code that requires full understanding of isolation levels, lock behavior, pessimistic or optimistic implementation. With SQL constraints, you don't have to understand those details. You just declare what you need, the reference and the cascade behavior, and rely on the RDBMS to optimize and validate it.

This EMP/DEPT series

In this blog series where I test the support of unique and foreign keys to verify which one qualifies for Distributed SQL database, I'll also check about the support for Serializable isolation level when Foreign Keys are not supported. So you can guess that one of the next post will be about SingleStore again, and others. The goal is to compare on OLTP specific features (unique and foreign keys) - this is not a general comparison between databases which may be best suited for other workloads.

A reminder that this series is about features and EMPT/DEPT is the best to build small samples. Feel free to project this to your modern web-scale application and to test it at scale. But small samples with understanding on how it works are sufficient to validate feature support.

Top comments (0)