Franck Pachot for YugabyteDB

Posted on Jan 5, 2024 • Edited on Jan 16, 2024

Isolation Levels - part XIII: Explicit Locking with SELECT (FOR UPDATE) intention

#database #postgres #sql #yugabytedb

Historically, database isolation levels were designed with implicit locking in mind. This means that when you read from the database, it automatically locks the data to protect against anomalies during subsequent reads and writes. The isolation level determines the duration of this lock. This approach aims to minimize performance overhead while ensuring the minimal required data consistency.
At the time (SQL-92), where SQL was designed for end-users, nobody wanted them to think about concurrent access with explicit locking. No LOCK commands are defined in the SQL standard, and the updatability clause (FOR UPDATE) is for updatable cursors, allowed only in DECLARE CURSOR, and is not about rows lock.

It should be noted that any explanation of isolation level based on the SQL standard definition is inaccurate. Most databases use Multi-Version Concurrency Control (MVCC) to read data without locking it. Additionally, SELECT FOR UPDATE is used to explicitly lock rows when reading them depending on what the transaction is doing later.

The original idea behind SQL was great. You could run your queries in long transactions without having to worry about what other users were doing at the same time. You didn't even have to declare your intention beforehand. You could read the data without notifying the database that you'd be making a decision based on that data, which would affect any subsequent statements. However, in reality, things are a bit different:

SQL database transactions often involve single statements that are auto-committed. We need to prevent isolation anomalies only for specific use cases and can avoid the overhead of anomaly protection for the majority of transactions.
Transactions that span multiple user interactions are impractical due to potential database blocking. They were possible only when users had a single terminal and could not switch between applications and only if they didn't forget to commit before going to the coffee machine.
Applications can determine the intent of a transaction right from the start. For instance, if an application searches for available hotel rooms, it already knows if it will use the list to book one. This makes it straightforward to specify what information needs to be accessed later and lock it explicitly.
Application developers may not know all the specificities of isolation levels in different databases, but considering concurrent changes is part of their skills, given how modern applications are multi-threaded.

Due to those characteristics of modern SQL applications, the Serializable isolation level is not commonly used. This is because lock-free reads and transparent retries are preferred over row, range, or predicate locking and serialization errors, which require a retry logic.
However, concurrency issues are still prevented by obtaining an explicit read lock on the rows that concurrent users cannot update until the transaction is completed.

Many SQL databases use the "FOR UPDATE" updatability clause to allow applications to define their locking behavior. These databases also provide options for handling conflicts, for instance, pessimistic locking to wait or optimistic locking to get an error and retry. The terms "optimistic" and "pessimistic" locking may be unclear or informative. In addition, there is another scenario in queuing or outbox use cases where we simply want to exclude the rows in conflict from the result. To better understand the situation when a conflict is detected, we must consider the behavior we desire: whether to wait, fail, or skip.

Wait on conflict: SELECT FOR UPDATE

Most SQL databases wait when concurrent write is detected when reading with SELECT ... FROM ... FOR UPDATE. Some can add WAIT ... to specify the number of seconds to wait before failing. Others can set a session parameter like lock_timeout for PostgreSQL (or YugabyteDB) or innodb_lock_wait_timeout for MySQL.

Skip on conflict: SELECT FOR UPDATE SKIP LOCKED

Many SQL databases skip the locked rows with SELECT ... FROM ... FOR UPDATE SKIP LOCKED. This statement should be used for specific cases and carefully tested since it can lead to varying results even when used on the same data, that can be considered as an inconsistent result.

Fail on conflict: SELECT FOR UPDATE NOWAIT

Sometimes, you may prefer getting an error immediately rather than waiting. Many SQL databases allow you to achieve this using SELECT ... FROM ... FOR UPDATE NOWAIT.

The FOR UPDATE locks are exclusive. They wait or fail on concurrent writes but also on concurrent SELECT FOR UPDATE. Here are two scenarios to explain the difference.

Scenario 1: When dealing with a relationship between orders and order items, it is common to lock the row in the order table when inserting new rows in the order item table. This is necessary, for example, to maintain the total amount of the order or simply to number the items with no duplicates and gaps. To avoid phantom reads, it is practical to lock the parent row. An exclusive lock is appropriate in this scenario since it is unlikely that two different users will attempt to update the same order simultaneously.

Scenario 2: When managing the relationship between products and order items, it is essential not to use an exclusive lock on the parent row. Doing so would limit scalability, as only one customer could order a product at a time. Instead, it is preferable to use a share lock on the parent to prevent any attempts to delete it while still allowing for concurrent share locks. Not all databases allow such shared row lock.

Share row locks

Oracle Database cannot lock a row in shared mode. You must carefully choose between locking the whole table in share mode or the row in exclusive mode. You can also use custom locks with dbms_lock, or rely on the referential integrity constraint that can lock a range in an indexed foreign key.

PostgreSQL has a FOR SHARE clause similar to the FOR UPDATE clause, but it applies shared locks instead of exclusive ones. The clauses can also be declined to KEY SHARE and NO KEY UPDATE when the locks concern only keys or non-key columns. This is mainly used to ensure referential integrity, as foreign keys can reference the primary key and unique keys. You can think of FOR UPDATE as the intent to delete a row or update its key and FOR NO KEY UPDATE as the intent to update the non-key columns.

YugabyteDB behaves like PostgreSQL but with a scalable implementation (the list of locking transactions is in the key-value datastore, not limited by in-block storage).

Example with YugabyteDB

YugabyteDB uses the PostgreSQL query layer and provides the same isolation level and row-locking behavior on top of a distributed key value. The database nodes store the transaction intents to persist the changes. With this architecture it is easy to understand what happens internally when using those FOR SHARE/UPDATE options.

I start a one-node YugabyteDB lab (in a Docker container created with docker run -it --rm yugabytedb/yugabyte:latest bash):

yugabyted start --tserver_flags="yb_enable_read_committed_isolation=true,enable_wait_queues=true,TEST_docdb_log_write_batches=true,tserver_enable_metrics_snapshotter=false,ysql_pg_conf_csv={yb_debug_log_docdb_requests=on}"

The tracing parameters are yb_debug_log_docdb_requests for the PostgreSQL layer (YSQL) and TEST_docdb_log_write_batches for the distributed storage layer (DocDB). I have also enabled the Read Committed isolation level and Wait Queues, which are not yet enabled by default for backward compatibility (I'm running YugabyteDB 2.19.3). However, enabling them is necessary to achieve the same behavior as PostgreSQL.

The traces go to their respective logfiles. Here is the quick script I use to tail them in the background and grep for the relavant info:

yugabyted status | awk '
/YugabyteDB UI/{ tserver_varz=gensub(":[0-9]+",":9000/varz?raw",1,$5) }
/Log Dir/      { tserver_logs=$5"/tserver" }
END {
print "curl -s \"" tserver_varz "\" | egrep --color=auto \"[^-]*(read_committed_isolation|transaction_isolation|wait_queue|docdb_log_write|ysql_pg_conf_csv)[^=]*\""
print "tail -F \""tserver_logs"/yb-tserver.INFO\" \""tserver_logs"\"/postgres*.log | egrep --color=auto \"(Applying operation|Flushing buffered operations|Buffering operation|row_mark_type:.*|wait_policy:.*|PutCF: SubDocKey.* HT{)\""
} ' | sh &

It also shows the relevant settings gathered from the table server endpoint on port 9000 /varz?raw:

I connect with yugabyted connect ysql or any PostgreSQL client like psql on port 5433 and create a small table for my tests :

create table demo (id text primary key, value1 text, value2 text);
insert into demo values ('42','MyValue1','MyValue2');
insert into demo values ('43','MyValue1','MyValue2');

To see only the read/write for the row, consider running the statement multiple times as the initial parsing may also read from the catalog.

Here is a simple select without locking, because I'm not in Serializable isolation level and I didn't add a FOR SHARE or FOR update clause:

select * from demo where id='42';

This is a READ operation sent from the PostgreSQL query layer to DocDB (YQL_CLIENT_PGSQL). The key ('42') has been converted to a ybctid with the sharding value.

There's nothing written to DocDB as there's no read lock. If you run the same in a serializable transaction you will see the read intent ([I]) in the DocDB write trace (PutCF):

begin transaction isolation level serializable;
select * from demo where id='42';
commit;

When using the Serializable isolation level, a read operation will acquire a shared lock (kStrongRead) at the row level. The key being read is identified by its hash code and primary key value, which is (DocKey(0x3066, ["42"], []), []). Additionally, there is a weaker shared lock at the tablet level, which is (DocKey([], []), []) [kWeakRead]. You can find a more detailed explanation in the documentation. In essence, weak locks do not conflict with each other but work alongside strong locks to optimize conflict detection.

In Read Committed, there's no lock acquired transparently by the reads, thanks to MVCC. Then, depending on what your transaction will do later, you can acquire locks explicitly to avoid anomalies. Here is a SELECT FOR KEY SHARE when you don't want the key to change (like by concurrent delete, or update on key columns):

select * from demo where id='42' for key share;

There is an additional READ operation to lock the row that has been read, marking the row as KEY SHARE (row_mark_type: ROW_MARK_KEYSHARE) and with a Wait-on-Conflict behavior (wait_policy: WAIT_BLOCK). If you start the cluster with wait queues disabled (enable_wait_queues=false) you will see wait_policy: WAIT_ERROR because the wait is managed with transparent retries in the query layer.

The KEY SHARE acquires only a weak shared lock on the row ((0x3066, ["42"], []), []) [kWeakRead]) to prevent deletes. This also protects from an update of the key as it is a delete+insert in DocDB, which organizes the rows by their primary key. At tablet level, it is also a weak lock (DocKey([], []), []) [kWeakRead]).

A SELECT FOR SHARE differs by acquiring a strong share lock on the row:

select * from demo where id='42' for share;

The READ operation is marked as ROW_MARK_SHARE and WAIT_BLOCK and acquires a shared read lock on the row ((0x3066, ["42"], []), []) [kStrongRead]) and the weaker version of it for the tablet (([], []), []) [kWeakRead]).

It's worth noting that the locks acquired by SELECT FOR SHARE is equivalent to ones when using a simple SELECT statement and the Serializable isolation level is set. This is why I've included this blog post it as part of my Isolation Levels series. Explicit locking is often used in database applications to provide finer-grained control at the application level, instead of relying solely on higher isolation levels, despite being totally ignored by the SQL standard.

SELECT FOR SHARE seems to be sufficient to guarantee repeatable reads, but be careful with it. In Read Committed it will allow multiple sessions to SELECT FOR SHARE the same row. If, later they want to update the row, the first one will wait (conflict between kStrongRead and kStrongWrite). When it commits, the second one can continue but it will detect that the row has changed and will transparently re-start the statement at a later read-time. This will create a Lost Update, which is acceptable in Read Committed but is different than Repeatable Read. When your intention is to update the row, it is preferable to acquire a Write lock earlier with FOR UPDATE.

Here is a FOR NO KEY UPDATE that acquires a weak exclusive lock on the row:

select * from demo where id='42' for no key update;

The READ operation is marked with ROW_MARK_NOKEYEXCLUSIVE and WAIT_BLOCK. It acquires a string read like the FOR SHARE but adds a weak write lock ((0x3066, ["42"], []), []) [kWeakWrite, kStrongRead]). If a concurrent session wants to update a column, it will acquire a kStrongWrite at the column level and will be blocked by the kWeakWrite at the row level. At tablet level, we see the weaker versions of them (([], []), []) [kWeakRead, kWeakWrite]).

YugabyteDB differs from many SQL databases with the possibility to lock at column level without locking the whole row.

To lock a row in exclusive more we can SELECT FOR UPDATE:

select * from demo where id='42' for update;

I guess you start to understand the pattern. The READ operation is marked with ROW_MARK_EXCLUSIVE and WAIT_BLOCK and this writes an exclusive lock on the row ((0x3066, ["42"], []), []) [kStrongRead, kStrongWrite]) with the weaker version on the tablet (DocKey([], []), []) [kWeakRead, kWeakWrite]).

All these were Wait-on-Conflict, also known as pessimistic locking, where you may wait to acquire a lock (until the statement timeout). It is the most common desired behavior in OLTP: you declare your intention to update a row, then you queue and be notified when the concurrent transaction completes.

Wait-on-Conflict doesn't help to scale an outbox or de-queuing from a queue table because each session will start the scan at the same point and will wait on the others. You can define some buckets to scale it out, but many SQL database also offers the possibility to ignore the locked rows and continue, with FOR UPDATE SKIP LOCKED:

select * from demo where id='42' for update skip locked;

The locks acquired as those of FOR UPDATE but the difference is visible in the READ operation wait_policy: WAIT_SKIP so that we don't wait in DocDB and just continue the scan to the next row.

There's also the possibility to get an error and handle the retry in the application (maybe later, maybe from another bucket...):

select * from demo where id='42' for update nowait;

Here, the READ operation is tagged with wait_policy: WAIT_ERROR and the SQL layer will report the error (ERROR: 55P03: could not obtain lock on row in relation "demo")

When locking multiple rows, the FOR SHARE or FOR UPDATE does a READ operation for each row that is locked:

select * from demo for update nowait;

This is different from the UPDATE statement that are buffered WRITE operations:

update demo set value1=value1;

For each row, we have a Buffering operation for a WRITE operation related to the update of the row (PGSQL_UPDATE), writing the new value for the column (column_new_values { column_id: 1 expr { value { string_value: "MyValue1" } }) and they are finally sent by Flushing buffered operations.

Another difference is that UPDATE updates columns with additional locks. The exclusive lock is at the column level ((DocKey(0x3066, ["42"], []), [ColumnId(1)]) [kStrongRead, kStrongWrite]). At row level, it acquires only the weaker version (DocKey(0x3066, ["42"], []), []) [kWeakRead, kWeakWrite]). This means that one session can update value1 while the other can update value2.

For geeks, the operation markers are documented in the protocol definition:
https://github.com/yugabyte/yugabyte-db/blob/v2.20.1.0/src/yb/common/common.proto#L361

A look into the past with Oracle Database

I previously pointed out that SQL standard definitions were implemented differently, and this precedes the SQL definition of isolation levels. To illustrate this, let's look at the User Guide of Oracle Version 2.3, the first commercial implementation of an SQL database, in 1979.

In Oracle 2.3, the user doesn't have to lock rows or declare an update intention. It is the responsibility of the database:

However, the intention to update had to be declared at the table level:

This declaration of intent looks similar to the strong read/write locks we have observed in YugabyteDB traces, and the implicit locking resembles the weak ones. Here is the blocking matrix from the Oracle 2.3 User Guide that is very similar to the strong/weak read/write ones:

That was before the SQL standard defined isolation levels, with "phenomena" and locking in mind. Oracle implemented Multi-Version Concurrency Control later by storing a before image (later called rollback segments) to avoid reads blocking writes. We can see that the concept already existed that users must declare their intention to read or write. The database should acquire relevant locks to prevent concurrent transactions from conflicting with this intention. Later, SQL databases introduced LOCK TABLE SHARE/EXCLUSIVE mode for table-level locks and SELECT FOR SHARE/UPDATE for the row-level ones.

In Summary

Knowing the conflicts that can arise when multiple users work on the same data set simultaneously is essential. This is particularly true if your database doesn't implement a true Serializable isolation level (which is the case with Oracle Database) or if you don't use it for performance reasons (reads blocking writes).

Instead of learning on traditional SQL courses that describe isolation levels with anomalies and for which no implementation matches the definitions, it's more relevant to understand the approach taken by modern databases. Today, most databases use Snapshot Isolation with MVCC, defaulting to Read Committed (RCSI) or Snapshot Isolation (SI), which is wrongly referred to as Repeatable Read because of the SQL standard.

To prevent concurrent read/write anomalies, the database application developer must take action and decide what to do in case of a conflict for the best scalability (wait, skip, or fail) by declaring the transaction intents, using SELECT FOR UPDATE. This is part of session isolation (the I in ACID) even if the SQL standard ignores it.