Read Committed is the lowest isolation level in MVCC databases and is commonly the default setting. However, it is also possibly the least understood and the least database-agnostic. As the name suggests, it reads only committed data, but it allows all types of anomalies except dirty reads.
So, does using Read Committed corrupt your database? Not if you understand it and manage race conditions yourself. MVCC databases typically allow concurrent reads and writes without locking the data for reads by default. However, in certain scenarios, it may be necessary to use explicit locking to ensure data consistency. For example, if you are concerned about lost updates, you can use the SELECT FOR SHARE or SELECT FOR UPDATE commands to lock the rows you've read. This provides protection similar to the Cursor Stability or Repeatable Read isolation levels, as it prevents UPDATE or DELETE operations on the read set, but with a reduced scope, on a statement-by-statement basis. To prevent other anomalies, such as phantom reads, you can use LOCK TABLE to prevent new insertions from altering the read state, since you cannot lock a row that doesn't exist yet. Some databases also provide an API for custom locks, like PostgreSQL advisory locks.
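For example, a minimal sketch in PostgreSQL-compatible SQL (the accounts table and the advisory lock key are hypothetical):

```sql
begin;
-- lock the row we read so concurrent transactions cannot UPDATE or
-- DELETE it before we act on the value we saw (prevents lost updates);
-- FOR SHARE would also block writers while still allowing other readers
select balance from accounts where id = 1 for update;
update accounts set balance = balance - 100 where id = 1;
commit;

-- a custom lock through the advisory lock API, released at commit
begin;
select pg_advisory_xact_lock(42);  -- key 42 is an arbitrary application choice
-- ... critical section serialized across sessions ...
commit;
```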
What is the advantage of Read Committed over Repeatable Read? With MVCC, the database can roll back and restart a conflicting statement at the statement level, avoiding the need for the application to handle serialization errors.
Every database is unique when it comes to transparent restarts and explicit locking.
Oracle doesn't offer a SELECT FOR SHARE option that blocks writers while allowing other readers to access the data. Instead, it only provides SELECT FOR UPDATE, which offers a lower level of concurrency because readers can block each other. On the other hand, PostgreSQL and YugabyteDB provide shared and exclusive row locks, which enable more efficient data access and better concurrency control.
When Oracle or YugabyteDB encounters a conflict between the read state (the MVCC snapshot) and the write state (the current state), it can roll back the statement to an implicit savepoint and restart it to ensure a consistent result based on a more recent read time, all done seamlessly and transparently.
In the same situation, SQL Server with READ_COMMITTED_SNAPSHOT, which implements MVCC for Read Committed, locks the read state instead of restarting, which means that readers can still block writers. More details can be found at https://www.dbi-services.com/blog/how-sql-server-mvcc-compares-to-oracle-and-postgresql/.
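For reference, this behavior is enabled per database in SQL Server (MyDatabase is a placeholder):

```sql
-- switch the default Read Committed to snapshot-based reads
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;
```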
When using Read Committed in PostgreSQL, inconsistencies can arise when there is a conflict during a write operation. In such cases, if a row has been modified since it was last read, PostgreSQL will re-read the row to avoid corrupting it. However, this re-read is based on a new time, which can be inconsistent with the previous reads. I think the main reason why it doesn't roll back and restart is that this requires a savepoint before each statement, and those do not scale in PostgreSQL.
To ensure result consistency, YugabyteDB and Oracle follow a different approach. Instead of re-reading the row, they roll back and restart the entire statement. This ensures that the entire dataset reflects the same state from the new read time.
YugabyteDB implements a read restart to ensure statement-level consistency without blocking writes, and SELECT FOR SHARE/UPDATE for explicit locking, providing a powerful Read Committed isolation level.
The main difference between Read Committed and Repeatable Read in MVCC databases lies in the read time. In Read Committed, the read time is the start of the statement, while in Repeatable Read and higher levels, it is the same for the whole transaction. Having a different read time for each statement doesn't protect against anomalies in complex transactions, but it allows transparent statement restarts: the database can roll back a statement (to an implicit savepoint taken before it) and restart it transparently with a different read time.
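A minimal sketch of the per-statement read time (the events table is hypothetical): in Read Committed the second count can differ, while under Repeatable Read both would return the same value.

```sql
begin isolation level read committed;
select count(*) from events;  -- read time = start of this statement
-- another session inserts a row and commits here
select count(*) from events;  -- new read time: this count sees the new row
commit;
```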
At higher levels, when the read time must be the beginning of the transaction, the entire transaction must be rolled back and restarted. The database cannot perform this action on its own, as it lacks knowledge of what else the application has done during the transaction. Therefore, to protect against anomalies at higher isolation levels, an MVCC database must raise a serialization error when a conflict is detected, so that the application can retry the transaction itself.
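A minimal sketch of the retry responsibility left to the application (the accounts table is hypothetical):

```sql
-- At Repeatable Read or Serializable, the application must be ready to
-- re-run the whole transaction when the database raises SQLSTATE 40001.
begin isolation level repeatable read;
update accounts set balance = balance - 100 where id = 1;
update accounts set balance = balance + 100 where id = 2;
commit;
-- If any statement fails with ERROR 40001 (serialization_failure),
-- roll back and re-run the whole block: the database cannot replay
-- the application logic itself.
```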
This provides a clue for optimizing Read Committed transactions: run the entire business transaction as a single statement with WITH and RETURNING clauses instead of multiple statements.
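For example, a sketch of a transfer packed into one statement (the accounts table and amounts are hypothetical): both writes share a single read time and succeed or fail atomically.

```sql
with debit as (
  -- take 100 from account 1, only if it is sufficiently funded
  update accounts set balance = balance - 100
  where id = 1 and balance >= 100
  returning id, balance
), credit as (
  -- add 100 to account 2, only if the debit found a row
  update accounts set balance = balance + 100
  where id = 2 and exists (select 1 from debit)
  returning id, balance
)
select * from debit union all select * from credit;
```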
Here are the characteristics of the Read Committed isolation level in YugabyteDB (when --yb_enable_read_committed_isolation=true):
- Read time: the start of the statement
- Possible anomalies: all (except dirty reads)
- Performance overhead: none except when using explicit locking
- Development constraint: explicit locking when repeatable reads are necessary
- Default in: PostgreSQL, Oracle, YugabyteDB
Comments
Thanks for this series, Franck.
One question on those two statements:
"when Oracle or YugabyteDB encounters such a situation, it can roll back the statement to an implicit savepoint and restart it to ensure a consistent result based on a more recent read time"
"In such cases, if a row has been modified since it was last read, PostgreSQL will re-read the row to avoid corrupting it"
To be honest, I did not understand the differences between them. What's the difference between "rollback+restart" and "re-read"? They seem like the same thing to me.
The difference is: re-reading the row at a newer time (without discarding what the statement has already read before) vs. rolling back everything to re-run the statement, so that the result is consistent.
Thanks for the explanation, Franck.
Sorry, it is still confusing to me.
Do you have any articles with something more practical so I can understand them better?
Let's take an example of "re-read" and an inconsistent result in PostgreSQL.
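The original code blocks were lost from this page; here is a minimal reconstruction that matches the walkthrough below (table name, ids, and flag values are assumptions). In the first session:

```sql
create table demo (id int primary key, flag int);
insert into demo values (1,1),(2,0),(3,1),(4,0),(5,2),(6,0);

begin;
-- mutate rows 3, 4 and 5: row 3 no longer verifies flag>0, row 4 now
-- would verify it, row 5 still verifies it but with a new value
update demo set flag = case id when 3 then 0 when 4 then 1 when 5 then 3 end
 where id in (3,4,5);
```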
Do not commit this transaction yet, and in a second session run:
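In the second session (same assumed table):

```sql
select * from demo where flag > 0 for update;
```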
It waits on the first session. Then commit the first session, and the SELECT FOR UPDATE shows rows (1) and (5), with row 5 showing its new value.
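With the assumed demo above, the output would look like:

```
 id | flag
----+------
  1 |    1
  5 |    3
(2 rows)
```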
Here is what happened in PostgreSQL:

- row 1 verifies flag>0, then is selected
- row 2 does not verify flag>0, then is discarded
- row 3 verifies flag>0, then is selected but, because an ongoing mutation is detected, is re-read as of after the session 1 commit, and the mutated value does not verify flag>0, then is discarded
- row 4 does not verify flag>0, then is discarded, even if the mutated value would verify the predicate
- row 5 verifies flag>0, then is selected but, because an ongoing mutation is detected, is re-read as of after the session 1 commit, and the mutated value still verifies flag>0, then is selected, and shows the new value
- row 6 does not verify flag>0, then is discarded

The result is inconsistent, with rows 3 and 5 seeing the concurrent update but row 4 ignoring it. It is not isolated from the concurrent updates, breaking the I in ACID.
This is PostgreSQL behavior, with "re-read", and it produces an inconsistent result because the predicate is not evaluated on a single database state. To read as of a single database state, the result should be either (1),(3),(5) if evaluated before the other session commits, or (1),(4),(5) if evaluated after it.

Due to the use of FOR UPDATE, (1),(3),(5) is not usable: being a stale state, it cannot be updated. The current state that can be locked and updated is (1),(4),(5). However, as the conflict is detected during execution, the statement must be rolled back to an implicit savepoint taken just before it, and restarted to execute the whole statement on the new state. I think PostgreSQL doesn't implement that because savepoints are not scalable (see PostgreSQL Subtransactions Considered Harmful). To avoid this anomaly, you can use a higher isolation level, get a serialization error, and implement the restart yourself.

The same code run on YugabyteDB, which is PostgreSQL-compatible but implements "rollback+restart", returns a consistent result:
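With the same assumed demo, the consistent result would be:

```
 id | flag
----+------
  1 |    1
  4 |    1
  5 |    3
(3 rows)
```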
Each row's predicate and result were evaluated on the state after the other session committed, ensuring a consistent and updatable result.
This behavior is described in a README within the PostgreSQL source code.
Thanks @franckpachot !
Now, the difference is more apparent. It was very didactic, you're amazing! 👊🏻
I think what postgres does is technically allowed under the ANSI standard definition of READ COMMITTED. What you are describing as inconsistency would not be allowed under monotonic atomic view, which postgres' READ COMMITTED mostly behaves under.

The ANSI standard definition is not clear because it supposes you always read the current state (no MVCC). Yes, PostgreSQL doesn't violate the RC definition, as it reads only committed data. With MVCC, RC supposes a per-query snapshot to be consistent, because you have no control over the execution order of the reads and writes. Different read times are accepted between queries because you have control over them.