DEV Community

Jim Hatcher
Jim Hatcher

Posted on

Bulk Backfills in CockroachDB

CockroachDB is an OLTP, real-time database, and its architecture and design is all built-around handling these OLTP-style transactions (i.e., high concurrency; many small, discrete reads/writes). For use cases where you need to do batch-style scans of data, there are various strategies, including:

  • using CDC (Change Data Capture) to push mutations to a downstream system like Snowflake where those batch operations can be easily handled
  • use Follower Reads to loosen consistency constraints and make scan-style reads faster

But, what about bulk write operations?

It's important to understand that CockroachDB (CRDB, for short) supports full ACID-compliance (atomicity, consistency, isolation, durability). And, in the "I" (Isolation) category, we only support the strictest level of Serializable. Enforcing these data integrity constraints is typically simpler on short-duration writes to single records than it is against large blocks of rows.

There are some ways around these constraints in CockroachDB which can be leveraged. One common theme is that when CRDB is creating a new table, it can skip lots of constraint checks because there is no chance of contention! For instance:

  • for importing data into CRDB from external sources, IMPORT (which creates a new table) is preferred over IMPORT INTO (which writes into an existing table)
  • for loading data into a table from an existing table, "CREATE TABLE x AS SELECT" (CTAS) is preferred over "INSERT INTO x SELECT FROM tbl"

Let's look at a simple example. Suppose you have a table that has 30 million records in it, and you want to add a new field to it called "f1" with a value filled in on all the existing rows.

Here are three potential solutions to that.

Let's create a database to house our tables and create three tables with 30,000,000 records in each:

CREATE DATABASE schema_backfill;
USE schema_backfill;

CREATE TABLE backfill_default ( id PRIMARY KEY ) AS SELECT g.i FROM generate_series(1, 30000000) g(i);
CREATE TABLE backfill_ctas    ( id PRIMARY KEY ) AS SELECT g.i FROM generate_series(1, 30000000) g(i);
CREATE TABLE backfill_update  ( id PRIMARY KEY ) AS SELECT g.i FROM generate_series(1, 30000000) g(i);
Enter fullscreen mode Exit fullscreen mode

In my 3-node, 16vCPU-per-node, single-region cluster, these table creation statements finish in about ~90 seconds.

Approach 1 - use a DEFAULT

For our first approach, let's use a DEFAULT constraint on our new column and let CRDB fill in the values for us. Notice that in order to force CockroachDB to fill in the values, we need to include a NOT NULL constraint. We can drop the DEFAULT constraint afterwards if we don't want it to be there. And we can drop the NOT NULL constraint, too, if desired.

ALTER TABLE backfill_default ADD COLUMN f1 INTEGER NOT NULL DEFAULT 1;
ALTER TABLE backfill_default ALTER COLUMN f1 DROP DEFAULT;
ALTER TABLE backfill_default ALTER COLUMN f1 DROP NOT NULL;
Enter fullscreen mode Exit fullscreen mode

This takes ~10 minutes to run on my cluster.

Approach 2 - use CTAS

For the second approach, we can use the CREATE TABLE AS SELECT (CTAS) feature in Cockroach to create a new table with the new column we want:

CREATE TABLE backfill_ctas_new ( id PRIMARY KEY, f1 ) AS SELECT id, 1 FROM backfill_ctas;
ALTER TABLE backfill_ctas RENAME TO backfill_ctas_old;
ALTER TABLE backfill_ctas_new RENAME TO backfill_ctas;

Enter fullscreen mode Exit fullscreen mode

This takes ~2 minutes to run on my cluster.

Approach 3 - use an UPDATE

For the third approach, we add the column and run an UPDATE statement to populate the column.

ALTER TABLE backfill_update ADD COLUMN f1 INTEGER;
UPDATE backfill_update SET f1 = 1 WHERE f1 IS NULL;
Enter fullscreen mode Exit fullscreen mode

While this seems like a perfectly reasonable, intuitive approach, this is a very slow thing to do in CockroachDB. If I let this run, it will go for hours. And, if I stop it by cancelling the statement, then all of that work has to be rolled back.

There are definitely ways to speed up this kind of thing in CockroachDB (like batching), but I won't go into that here.

Conclusion

We can see that in this particular test, CTAS is the clear winner, followed by DEFAULT, and UPDATES as a distant third.

My intention is to provide some intuition about how to think about bulk backfills (and bulk updates in general) in CockroachDB and how to plan them to be as efficient as possible.

Top comments (0)