
Franck Pachot for AWS Heroes


DROP TABLE while being read: wait or wound?

It sounds silly to DROP TABLE from a production database where users are working, but that's what you need when combining High Availability with Continuous Delivery. If you read from a table in one session and drop the same table from another session, what do you think will happen? In an MVCC database with snapshot isolation, we can expect that nothing waits and nothing fails. After all, the read session reads from a state of the database at a point in time where the table was not dropped yet.

However, dropping a table does more than marking it as deleted in the catalog. In the most popular MVCC databases, Oracle, InnoDB and PostgreSQL (and all compatible databases or managed services), dropping a table also reclaims the physical space, including the past versions. How can another session continue to read from it when its read point is in the past?

Oracle Database

A read-only SELECT in Oracle does not request any lock. The DROP TABLE requests an exclusive lock, but there's no conflict with concurrent reads: it doesn't wait, and can successfully drop the table. The SELECT can continue. However, this works only as long as the physical blocks being read have not been reused for another object allocation. At some point, the query will fail with ORA-08103: object no longer exists.
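
You can verify the absence of a table lock from another session while a query is running. This is a quick check, assuming access to the v$ views; TM is the table-level lock type:

-- TM enqueues are the table (DML) locks: a plain SELECT acquires none,
-- so the reading session does not show up here
select sid, type, id1, lmode, request
from v$lock
where type = 'TM';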

I've run a simple test by creating a demo table as:

create table demo as select rownum id from xmltable('1 to 1000000');

Then I started to read the first rows - it can be on the primary or an Active Data Guard standby:

set pause on
select * from demo;

In another session I drop the table:

drop table demo;

If there is no other activity in the database, the first session may continue to read all rows without error. However, if I try to read them again, even in the same session, I'll get ORA-00942: table or view does not exist.

If some activity reused the blocks of the dropped table, I will not be able to read all rows and will, at some point, get: ORA-08103: object no longer exists. Here is how Oracle checks for conflicts without locks: the data object id stored in each block is compared to the data object id the query expects.
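
This identifier is visible in the dictionary. For example, with the demo table created above:

-- OBJECT_ID identifies the logical object in the catalog; DATA_OBJECT_ID
-- identifies the physical segment and changes when the segment is re-created
select object_name, object_id, data_object_id
from user_objects
where object_name = 'DEMO';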

In short, in Oracle the DROP TABLE doesn't wait but may wound the concurrent SELECT. This is a kind of optimistic locking.

MySQL InnoDB

I'm not an expert in MySQL. The behavior of DDL is documented in the InnoDB and Online DDL and Metadata Locking chapters of the MySQL documentation. I've run two simple tests in Amazon Aurora with MySQL compatibility.

I have created a large table by running:

create database franck;
use franck;
create table demo (id int primary key);
insert into demo values(1);
insert into demo select id+(select max(id) from demo) from demo;
insert into demo select id+(select max(id) from demo) from demo;
insert into demo select id+(select max(id) from demo) from demo;
...

In another session, I'm running a very long query:

use franck;
start transaction;
select count(*) from demo a , demo b , demo c , demo d ;

In the first session I'm dropping the table:

drop table demo;

The behavior depends on where I run the SELECT:

  • if I run the SELECT on the read-write instance, then the DROP TABLE, which runs on the same instance, waits indefinitely. This also blocks any new session that wants to read this table.

[Screenshot: synch/cond/sql/MDL_context, a wait on metadata lock while the concurrent read is running (or waiting on storage)]

  • if I run the SELECT on the read replica, the DROP statement succeeds but the SELECT is immediately canceled:
mysql> select @@innodb_read_only;
+--------------------+
| @@innodb_read_only |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.03 sec)

mysql> start transaction;
Query OK, 0 rows affected (0.03 sec)

mysql> select count(*) from demo a , demo b , demo c , demo d ;
ERROR 2013 (HY000): Lost connection to MySQL server during query
No connection. Trying to reconnect...
Connection id:    175
Current database: franck

ERROR 1146 (42S02): Table 'franck.demo' doesn't exist
mysql>

Before failing, the SELECT waited a very short time on metadata locks, which are replicated to the read replica:
[Screenshot: metadata lock waits on the read replica]
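
To inspect these metadata locks yourself, performance_schema exposes them. A quick sketch, assuming the mdl instrument is enabled (it is by default in MySQL 8.0):

-- the reader holds a GRANTED SHARED_READ lock on the table while the
-- DROP TABLE waits with a PENDING EXCLUSIVE request
select object_schema, object_name, lock_type, lock_status, owner_thread_id
from performance_schema.metadata_locks
where object_type = 'TABLE';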

In short, the behavior of an online DROP TABLE is wait or wound depending on whether the conflicting transaction runs on the single read-write instance or on a read replica.

PostgreSQL

PostgreSQL runs DDL in transactions, with pessimistic locking. The DROP TABLE statement requests an ACCESS EXCLUSIVE lock. The read-only SELECT requests an ACCESS SHARE lock. They conflict, which means that the DROP TABLE will wait if there's another transaction reading the table.
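
While the DROP TABLE is waiting, a third session can observe the conflict in pg_locks. A quick sketch, assuming the demo table used below:

-- the reader holds AccessShareLock with granted = t, and the DROP TABLE
-- waits on AccessExclusiveLock with granted = f
select pid, mode, granted
from pg_locks
where relation = 'demo'::regclass;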

You may think that this is acceptable and that it allows online migrations. If you drop a table (probably with the goal of re-creating it, or replacing it with a view) you don't want to wound the ongoing DML and prefer to wait for your release to be deployed at the right time. However, this overlooks two important cases.

The first and most important one is that lock requests are queued one after the other. When you wait on an exclusive lock, even if you allow the concurrent transaction to continue, you are blocking all new transactions on this table. Even if DDL is transactional, it is in practice an outage on a busy system. Exclusive locks do not scale.
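
A common mitigation is to bound this wait on the DDL session with lock_timeout, so that a blocked migration fails fast and can be retried later instead of queueing all new transactions behind it:

-- give up the ACCESS EXCLUSIVE request after 5 seconds
set lock_timeout = '5s';
drop table demo;
-- ERROR:  canceling statement due to lock timeout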

The second consideration is that it is common to scale out the read transactions, if they accept eventual consistency, with read replicas (a physical standby opened in read-only mode). What we have seen above, waiting for concurrent transactions to complete, works only for the transactions on the primary. There is no distributed lock in PostgreSQL. You can drop a table while a long-running transaction is reading it on a replica. The DROP TABLE itself doesn't wait because it doesn't see the read replica transactions, but its propagation to the replicas will wait.
Because PostgreSQL physical replication is one monolithic stream of WAL, a DROP TABLE that cannot be applied on the replica, while an ongoing transaction reads the table, blocks the whole replication. Then the replication lag increases, and the eventual consistency becomes problematic. Because an increasing lag is not acceptable, the read transaction that blocks the apply of the DROP TABLE is killed (after max_standby_streaming_delay) and ends with:

FATAL:  terminating connection due to conflict with recovery
DETAIL:  User was holding a relation lock for too long.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
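You can watch the lag grow on the standby while the conflicting read blocks the WAL apply. A quick check on a vanilla PostgreSQL replica (managed services may expose lag differently):

-- time since the last transaction was replayed on this standby
select now() - pg_last_xact_replay_timestamp() as replication_lag;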

So, finally, PostgreSQL waits for some transactions but wounds others, with the side effect that the former blocks all new reads and writes until the DDL is completed, and the latter increases the lag of the replica.

I've tested it on AWS in RDS PostgreSQL and Aurora with PostgreSQL compatibility by creating the following table:

create table demo as select generate_series(1,1000000);

In one session (on the read-write instance, or on a read replica) I start:

select pg_is_in_recovery();
begin transaction;
select count(*) from demo;

On another session I start:

drop table demo;
  • If the read was executed on the primary:
    the drop waits forever, blocking others, until statement_timeout kicks in if it was set to a value other than its default.

  • If the read was executed on a replica:
    the drop works but the replication is blocked until max_standby_streaming_delay, which is set by default to 30 seconds.
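
Both timeouts are regular PostgreSQL parameters that you can check (managed services may set different values):

show statement_timeout;            -- 0 by default: the DDL waits forever
show max_standby_streaming_delay;  -- 30s by default on the standby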

YugabyteDB

Those were databases available as AWS RDS managed services. They are all monolithic databases with additional read replicas. When the need to scale out goes beyond eventually consistent read replicas, YugabyteDB distributes all read-write connections, SQL processing, and data to multiple instances, in different Availability Zones or Regions. It is PostgreSQL-compatible but raises a different message when a DROP TABLE, or any DDL, conflicts with a SELECT, or any DML: Catalog Version Mismatch: A DDL occurred while processing this query. It looks like it differs from PostgreSQL behavior until you get the full picture in a scale-out deployment.

Being PostgreSQL-compatible, YugabyteDB adopts the PostgreSQL semantics as much as possible. Being distributed, with many instances doing reads and writes on the same table in different nodes, the monolithic behavior of taking an exclusive lock on one primary instance only doesn't apply. It makes sense to request no exclusive lock and adopt the wound behavior, like PostgreSQL does with read replicas. However, YugabyteDB being ACID with strong consistency, a lag is not acceptable, and the DROP TABLE must immediately wound the concurrent read instead, similar to what we have seen with Aurora MySQL. This is optimistic locking applied to schema changes: the DDL succeeds, bringing the catalog to a new version, and the transactions that conflict with this new version fail with the following error and can immediately retry:

ERROR:  Not found: The object '00004160000030008000000000000000.demo' does not exist: OBJECT_NOT_FOUND
CONTEXT:  Catalog Version Mismatch: A DDL occurred while processing this query. Try again.

It is not excluded that a wait behavior will be implemented in the future, as mentioned in Stage 2 of Issue #7831: DDL operations should wait for conflicting transactions to finish. However, in a distributed system with High Availability, exclusive locks are usually not desired and applications should be ready to retry.
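
As a minimal sketch of such retry logic, here is the pattern in PL/pgSQL, using the demo table from the example below; in practice the retry loop usually lives in the application driver, around the whole transaction:

do $$
begin
  for attempt in 1..3 loop
    begin
      perform count(*) from demo;
      exit;  -- success: stop retrying
    exception when internal_error then
      -- SQLSTATE XX000: Catalog Version Mismatch, safe to retry
      raise notice 'retrying after DDL conflict (attempt %)', attempt;
    end;
  end loop;
end $$;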

Note that currently (YugabyteDB 2.17.3) the SQLSTATE to catch for retry is XX000 - internal error. Note also that this happens not only for an explicit DROP TABLE but with any DDL that implicitly re-creates the table. The following demonstrates those two points:

-- create a simple table (with no PK as the goal is to create it later)
yugabyte=# create table demo as select 
           generate_series(1,1000) as id;
SELECT 1000

-- show verbose message with SQLSTATE and code location
yugabyte=# \set VERBOSITY verbose

-- read transaction
yugabyte=# begin transaction;
BEGIN
yugabyte=*# select count(*) from demo;
 count
-------
  1000
(1 row)

-- DDL from another session
yugabyte=*# \! psql -c "alter table demo add primary key (id)"
ALTER TABLE

-- continue the transaction: error and retry
yugabyte=*# select count(*) from demo;

ERROR:  XX000: The object '000033f1000030008000000000000000.demo_temp_old' does not exist: OBJECT_NOT_FOUND
CONTEXT:  Catalog Version Mismatch: A DDL occurred while processing this query. Try again.
LOCATION:  YBPrepareCacheRefreshIfNeeded, postgres.c:3914

-- retry
yugabyte=!# rollback;
ROLLBACK
yugabyte=# select count(*) from demo;
 count
-------
  1000
(1 row)

Changing the primary key in YugabyteDB re-creates the table in the background, because tables are stored in their primary key LSM-Tree to get fast access for the main access pattern (this is different from PostgreSQL and Oracle heap tables). The new table is created and backfilled, and then switched (renamed) with the old one. You can see in the message above that the read failure happened on demo_temp_old, the name of the old table, on which the transaction started to read.

The verbose error message shows the error code to catch (XX000), and details where this is raised in the code (a YugabyteDB hook in the PostgreSQL fork: YBPrepareCacheRefreshIfNeeded). XX000 is ERRCODE_INTERNAL_ERROR and maybe a retriable error similar to ERRCODE_T_R_SERIALIZATION_FAILURE could be better (like it was done in #8597) 🤔

Summary

Allowing online DDL, changing the schema while the application is running, is an important feature in modern databases, to combine High Availability and Continuous Delivery. Because the structure is shared by all sessions, conflicts may occur. Acquiring an exclusive lock on the structures that change is acceptable only when the transactions (the DDL one and the application ones) are guaranteed to be short. In general, this pessimistic locking approach cannot scale to multi-node clusters. PostgreSQL uses it only for the primary read-write instance. An optimistic approach is taken by the most popular MVCC databases, detecting the conflict when it happens. The application that must be available during online DDL should implement retry logic on those errors.
