Implementing “Get or Create” in YugabyteDB (or PostgreSQL)

In SQL, you can perform a "Get or Create" operation in a single statement using a WITH clause, where each step is a Common Table Expression (CTE). A tweet and an article inspired this idea, but I suggest an efficient design with robust testing of both performance and race conditions.

  • When executing two statements conditionally, start with the most frequent scenario. Don't lead with an insert that fails most of the time: writes are expensive, and databases are optimized for successful operations rather than for exceptions.
  • Avoid the ON CONFLICT construct, which relies on error handling (a constraint violation) and can have surprising performance and transactional behavior (see the contrasting sketch after this list). Common Table Expressions (CTEs) are safer, with clean input and output states and declarative dependencies.
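
For comparison, the ON CONFLICT form that the second point advises against would look roughly like the sketch below (a hypothetical upsert, not the approach used in this article). It attempts the write path on every call, even when the row already exists, which is exactly the cost profile the first point warns about:

-- For contrast only: an upsert-style "get or create" relying on the unique
-- constraint. The frequent "get" case still pays for an insert attempt.
insert into users (email)
values ('john.doe@example.com')
on conflict (email) do update set email = excluded.email
returning id, email;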

In this example, we have a USERS table with two keys: a natural key ("email") and a surrogate key ("id"). I decided to use the surrogate key as the primary key. The natural key is enforced by a unique constraint, which is backed by a secondary index.

create table users (
    id bigint generated always as identity primary key,
    email varchar(255) unique not null
);


The optimal SQL query

Here is my query, which I declare as a prepared statement to execute many times:

prepare get_or_create_user_email(text)
 as
with
-- the input set
val(email) as (
  values ($1)
),
-- the result of get (can be empty)
get as (
  select users.* from users natural join val
),
-- the result of insert (if get is empty)
ins as (
  insert into users(email) select email from val
  where not exists (select * from get)
  returning *
)
-- get or create
select * from get
union all
select * from ins
;


Common table expressions make it easy to understand:

  • val captures the input email.
  • get attempts to retrieve the user with the given email. The result is empty if not found.
  • ins creates the new user only if get is empty and returns the new row.
  • Union: combines the get and ins results to return exactly one user row (a plain call follows this list).
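
As a quick illustration, here is a plain call of the prepared statement (with a different, hypothetical email so it doesn't interfere with the measurements below; the returned id depends on your data):

execute get_or_create_user_email('jane.doe@example.com');
-- when done with it in this session:
-- deallocate get_or_create_user_email;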

Let's execute the "create" scenario with explain analyze and the dist option, which displays the number of read and write requests. (In PostgreSQL, you would look at the buffers and wal options to understand the execution cost.)


yugabyte=# explain (analyze, dist, buffers)
execute get_or_create_user_email('john.doe@example.com')
;
                                                                                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=4.20..4.25 rows=2 width=524) (actual time=8.482..8.487 rows=1 loops=1)
   CTE val
     ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
   CTE get
     ->  YB Batched Nested Loop Join  (cost=0.00..4.15 rows=1 width=524) (actual time=0.986..0.986 rows=0 loops=1)
           Join Filter: ((users.email)::text = val.email)
           ->  CTE Scan on val  (cost=0.00..0.02 rows=1 width=32) (actual time=0.004..0.004 rows=1 loops=1)
           ->  Index Scan using users_email_key on users  (cost=0.00..4.12 rows=1 width=524) (actual time=0.923..0.923 rows=0 loops=1)
                 Index Cond: ((email)::text = ANY (ARRAY[val.email, $2, $3, ..., $1024]))
                 Storage Index Read Requests: 1
                 Storage Index Read Execution Time: 0.849 ms
   CTE ins
     ->  Insert on users users_1  (cost=0.02..0.04 rows=1 width=524) (actual time=7.489..7.492 rows=1 loops=1)
           Storage Table Write Requests: 1
           Storage Index Write Requests: 1
           InitPlan 3 (returns $1026)
             ->  CTE Scan on get get_1  (cost=0.00..0.02 rows=1 width=0) (actual time=0.000..0.000 rows=0 loops=1)
           ->  Result  (cost=0.00..0.03 rows=1 width=524) (actual time=7.423..7.425 rows=1 loops=1)
                 One-Time Filter: (NOT $1026)
                 ->  CTE Scan on val val_1  (cost=0.00..0.02 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
   ->  CTE Scan on get  (cost=0.00..0.02 rows=1 width=524) (actual time=0.988..0.988 rows=0 loops=1)
   ->  CTE Scan on ins  (cost=0.00..0.02 rows=1 width=524) (actual time=7.492..7.497 rows=1 loops=1)
 Planning Time: 7.889 ms
 Execution Time: 12.009 ms
 Storage Read Requests: 1
 Storage Read Execution Time: 0.849 ms
 Storage Rows Scanned: 0
 Storage Write Requests: 2
 Catalog Read Requests: 12
 Catalog Read Execution Time: 10.036 ms
 Catalog Write Requests: 0
 Storage Flush Requests: 1
 Storage Flush Execution Time: 3.231 ms
 Storage Execution Time: 14.117 ms
 Peak Memory Usage: 568 kB


The Catalog Read Requests appear because this is the first time the query is parsed, so the catalog information for this table is not yet in the cache.

To create a new email, a Read Request is first made to check whether the email already exists in the secondary index. This is followed by two Write Requests, one to the table and one to the secondary index, which are batched into a single Flush Request to reduce latency.

The next call with the same email is faster:


yugabyte=# explain (analyze, dist, buffers)
execute get_or_create_user_email('john.doe@example.com')
;

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=4.20..4.25 rows=2 width=524) (actual time=1.420..1.433 rows=1 loops=1)
   CTE val
     ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.001..0.001 rows=1 loops=1)
   CTE get
     ->  YB Batched Nested Loop Join  (cost=0.00..4.15 rows=1 width=524) (actual time=1.417..1.422 rows=1 loops=1)
           Join Filter: ((users.email)::text = val.email)
           ->  CTE Scan on val  (cost=0.00..0.02 rows=1 width=32) (actual time=0.004..0.004 rows=1 loops=1)
           ->  Index Scan using users_email_key on users  (cost=0.00..4.12 rows=1 width=524) (actual time=1.351..1.354 rows=1 loops=1)
                 Index Cond: ((email)::text = ANY (ARRAY[val.email, $2, $3, ..., $1024]))
                 Storage Table Read Requests: 1
                 Storage Table Read Execution Time: 0.542 ms
                 Storage Table Rows Scanned: 1
                 Storage Index Read Requests: 1
                 Storage Index Read Execution Time: 0.656 ms
                 Storage Index Rows Scanned: 1
   CTE ins
     ->  Insert on users users_1  (cost=0.02..0.04 rows=1 width=524) (actual time=0.004..0.004 rows=0 loops=1)
           InitPlan 3 (returns $1026)
             ->  CTE Scan on get get_1  (cost=0.00..0.02 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
           ->  Result  (cost=0.00..0.03 rows=1 width=524) (actual time=0.002..0.002 rows=0 loops=1)
                 One-Time Filter: (NOT $1026)
                 ->  CTE Scan on val val_1  (cost=0.00..0.02 rows=1 width=32) (never executed)
   ->  CTE Scan on get  (cost=0.00..0.02 rows=1 width=524) (actual time=1.419..1.425 rows=1 loops=1)
   ->  CTE Scan on ins  (cost=0.00..0.02 rows=1 width=524) (actual time=0.005..0.005 rows=0 loops=1)
 Planning Time: 0.856 ms
 Execution Time: 1.631 ms
 Storage Read Requests: 2
 Storage Read Execution Time: 1.198 ms
 Storage Rows Scanned: 2
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 1.198 ms
 Peak Memory Usage: 536 kB


With an existing email, two Read Requests are made to read the index entry and the remaining columns from the table.

I can add the "id" column to the secondary index as an included column to avoid this additional read request to the table:

yugabyte=# \d users;
                                    Table "public.users"
 Column |          Type          | Collation | Nullable |              Default
--------+------------------------+-----------+----------+-----------------------------------
 id     | bigint                 |           | not null | nextval('users_id_seq'::regclass)
 email  | character varying(255) |           | not null |
Indexes:
    "users_pkey" PRIMARY KEY, lsm (id HASH)
    "users_email_key" UNIQUE CONSTRAINT, lsm (email HASH)

yugabyte=# create unique index users_email_unique on users ( email asc ) include ( id );
CREATE INDEX

yugabyte=# alter table users add constraint users_email_unique unique using index users_email_unique;
ALTER TABLE

yugabyte=# alter table users drop constraint users_email_key;
ALTER TABLE


Because I created the new index before dropping the existing constraint, the table was protected from duplicates at all times. Note that my query doesn't depend on the constraint the way ON CONFLICT would, which makes it safer.

The "create" scenario is the same, as an additional column in the index entry has no overhead, but the "get" scenario is reduced to one read request thanks to the Index Only Scan:

yugabyte-# execute get_or_create_user_email('john.doe@example.com')
;
                                                                  QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=4.19..4.24 rows=2 width=524) (actual time=0.954..0.964 rows=1 loops=1)
   CTE val
     ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.001..0.001 rows=1 loops=1)
   CTE get
     ->  YB Batched Nested Loop Join  (cost=0.00..4.14 rows=1 width=524) (actual time=0.952..0.955 rows=1 loops=1)
           Join Filter: ((users.email)::text = val.email)
           ->  CTE Scan on val  (cost=0.00..0.02 rows=1 width=32) (actual time=0.004..0.004 rows=1 loops=1)
           ->  Index Only Scan using users_email_unique on users  (cost=0.00..4.11 rows=1 width=524) (actual time=0.888..0.889 rows=1 loops=1)
                 Index Cond: (email = ANY (ARRAY[val.email, $2, $3, ..., $1024]))
                 Heap Fetches: 0
                 Storage Index Read Requests: 1
                 Storage Index Read Execution Time: 0.757 ms
                 Storage Index Rows Scanned: 1
   CTE ins
     ->  Insert on users users_1  (cost=0.02..0.04 rows=1 width=524) (actual time=0.003..0.003 rows=0 loops=1)
           InitPlan 3 (returns $1026)
             ->  CTE Scan on get get_1  (cost=0.00..0.02 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
           ->  Result  (cost=0.00..0.03 rows=1 width=524) (actual time=0.002..0.002 rows=0 loops=1)
                 One-Time Filter: (NOT $1026)
                 ->  CTE Scan on val val_1  (cost=0.00..0.02 rows=1 width=32) (never executed)
   ->  CTE Scan on get  (cost=0.00..0.02 rows=1 width=524) (actual time=0.954..0.958 rows=1 loops=1)
   ->  CTE Scan on ins  (cost=0.00..0.02 rows=1 width=524) (actual time=0.004..0.004 rows=0 loops=1)
 Planning Time: 0.790 ms
 Execution Time: 1.171 ms
 Storage Read Requests: 1
 Storage Read Execution Time: 0.757 ms
 Storage Rows Scanned: 1
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 0.757 ms
 Peak Memory Usage: 536 kB

Declare as a SQL function

I used a prepared statement to run the query multiple times in my session and examine the execution plan. Once its scalability is confirmed, wrapping it in a simple SQL function can be helpful:

create function get_or_create_user_email(email text)
 returns setof users
 language sql as $body$
with
-- the input set
val(email) as (
  values (get_or_create_user_email.email)
),
-- the result of get (can be empty)
get as (
  select users.* from users natural join val
),
-- the result of insert (if get is empty)
ins as (
  insert into users(email) select email from val
  where not exists (select * from get)
  returning *
)
-- get or create
select * from get
union all
select * from ins
;
$body$;


This can be called with:


yugabyte=# select * from get_or_create_user_email('john.doe@example.com');
 id |        email
----+----------------------
  1 | john.doe@example.com
(1 row)

yugabyte=# select * from get_or_create_user_email('john.doe@example.com');
 id |        email
----+----------------------
  1 | john.doe@example.com
(1 row)


In the race-condition test below, you can validate that both sessions attempted the insert by setting \timing on and observing that both waited about 10 seconds because of the trigger, or by tracing restarts with set yb_debug_log_internal_restarts=on.
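
A minimal psql sketch of that check, to run in each of the two sessions of the race-condition test:

\timing on
set yb_debug_log_internal_restarts = on;
select * from get_or_create_user_email('john.doe@example.com');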


Test race conditions

In a Multi-Version Concurrency Control database, the state you read can differ from the state you write to, because writes are applied after reads. The read time is set at the start of the statement in read committed mode (or at the start of the transaction in repeatable read or serializable mode) to avoid read locks. However, the write must apply to the current state. What happens if another transaction modifies the rows that you're reading? There are two options to handle this:

  1. Pessimistic approach: Lock what you've read so no one can modify it until you commit. It can be a row to avoid non-repeatable reads or a range of values to prevent phantom reads.
  2. Optimistic approach: Detect conflicts between the read state and the write state at the time of the conflicting write. With Optimistic Concurrency Control, this check is deferred to the end of the transaction.

The following scenario can occur in my example: Two sessions make the same call. Both sessions, reading simultaneously, discover that the user doesn't exist. They proceed to the "create" scenario and attempt to insert. At least one will fail due to a duplicate key constraint violation.
The pessimistic approach would require locking the value even though the row doesn't exist yet, which can be achieved with the serializable isolation level. However, I don't want to take that lock on every execution for this low-probability scenario. The optimistic approach detects the error and immediately retries: with a newer read time, the new execution sees the existing user and proceeds to the "get" scenario.
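
For reference, the pessimistic alternative would look roughly like this sketch (not used in this article): wrapping the call in a serializable transaction so the read performed by the get CTE conflicts with a concurrent insert of the same email, at the cost of locking on every execution.

begin transaction isolation level serializable;
select * from get_or_create_user_email('john.doe@example.com');
commit;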

With PostgreSQL, you must implement this optimistic retry yourself, even at the Read Committed isolation level, because PostgreSQL doesn't support transparent statement restarts. That generally means setting a savepoint before each statement, which is not a cheap operation at scale in PostgreSQL. With YugabyteDB, the database manages the restart transparently, so in this scenario you normally don't see the error at all (with one timing caveat shown below).
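
A minimal sketch of such retry handling for PostgreSQL, assuming the Read Committed isolation level and using a hypothetical PL/pgSQL wrapper; its EXCEPTION block is exactly the implicit savepoint mentioned above:

-- Hypothetical PostgreSQL-side wrapper (sketch): catch the duplicate-key
-- error and fall back to a plain get.
create function get_or_create_user_email_retry(p_email text)
 returns setof users
 language plpgsql as $body$
begin
  return query select * from get_or_create_user_email(p_email);
exception when unique_violation then
  -- in read committed, the row committed by the other session is now visible
  return query select * from users where email = p_email;
end;
$body$;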

To test this, rather than running a high-concurrency test with PgBench and hoping it happens to hit the race condition, I can artificially slow down the operation with a trigger.
The following adds a five-second wait before and after each inserted row:

-- trigger function that pauses for five seconds
create or replace function insert_with_sleep()
returns trigger as $$
begin
    perform pg_sleep(5);
    return NEW;
end;
$$ language plpgsql;
-- fire it before and after each insert, adding ten seconds per inserted row
create trigger insert_trigger1 before insert on users for each row execute procedure insert_with_sleep();
create trigger insert_trigger2 after  insert on users for each row execute procedure insert_with_sleep();


With these triggers enabled, I can run two calls concurrently and observe the issue in PostgreSQL:

postgres=> delete from users;
DELETE 1
\! sleep 1;

postgres=> \! psql -c "select get_or_create_user_email('john.doe@example.com')" & sleep 2

postgres=> select get_or_create_user_email('john.doe@example.com');

ERROR:  23505: duplicate key value violates unique constraint "users_email_key"

DETAIL:  Key (email)=(john.doe@example.com) already exists.
CONTEXT:  SQL function "get_or_create_user_email" statement 1
SCHEMA NAME:  public
TABLE NAME:  users
CONSTRAINT NAME:  users_email_key
LOCATION:  _bt_check_unique, nbtinsert.c:663

 get_or_create_user_email
--------------------------
 (1,john.doe@example.com)
(1 row)


Only one call returned the result. The other raised a duplicate key error detected when inserting the index entry into the unique index that enforces the constraint.

You will not encounter this problem with YugabyteDB because it implements automatic restarts. I'm running the same test, with the same triggers:

yugabyte=# delete from users;
DELETE 1
\! sleep 1;

yugabyte=# \! psql -c "select get_or_create_user_email('john.doe@example.com')" & sleep 2

yugabyte=# select get_or_create_user_email('john.doe@example.com');

 get_or_create_user_email
--------------------------
 (2,john.doe@example.com)
(1 row)

 get_or_create_user_email
--------------------------
 (2,john.doe@example.com)
(1 row)


YugabyteDB has detected a conflict and transparently restarted the operation using a more recent read time. This is possible in the Read Committed isolation level (which allows a different read time for each statement) or a higher isolation level when the statement is the first one in the transaction.
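
To check which isolation level your sessions are using, the standard parameters can be queried (the defaults depend on your configuration):

show transaction_isolation;
show default_transaction_isolation;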

Note that you may still get a duplicate key error with different timing, if the conflict occurs after the first session has committed:

yugabyte=# delete from users;
DELETE 1
\! sleep 1;

yugabyte=# \! psql -c "select get_or_create_user_email('john.doe@example.com')" & sleep 7

yugabyte=# select get_or_create_user_email('john.doe@example.com');

 get_or_create_user_email
--------------------------
 (3,john.doe@example.com)
(1 row)

ERROR:  23505: duplicate key value violates unique constraint "users_email_key"
LOCATION:  YBFlushBufferedOperations, ../../src/yb/yql/pggate/pg_perform_future.cc:36


The probability is low, but using triggers helps validate all the possibilities.
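
Once the race-condition tests are done, the test triggers and function can be dropped so they no longer slow down inserts:

drop trigger insert_trigger1 on users;
drop trigger insert_trigger2 on users;
drop function insert_with_sleep();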


Consider consolidating this data logic into a single statement using the WITH clause and Common Table Expressions (CTEs) in a declarative manner, rather than using multiple roundtrips or a block of procedural code. Each CTE's input is the database state at the beginning of the statement or the output of another CTE in the same statement, and these dependencies determine the order of execution. Note that this works only with databases that allow data-modifying statements in WITH, like PostgreSQL and compatible databases (Oracle Database allows only SELECT statements in WITH).

Start with the most frequent scenario to avoid exception handling. Exceptions are often more expensive: the insert has to update the table and the index and generate bloat and WAL before detecting the conflict.

Remember, the database isn't a black box: you can review the execution plan to understand the exact time complexity and simulate race conditions with triggers to slow down the operations.

Further reading: Read Committed isolation level (YugabyteDB Docs, docs.yugabyte.com).
