Dave Cridland

Posted on Jul 17, 2020 • Edited on Jul 22, 2020

Efficient INSERT MULTIPLE with Postgres

#sql #database #postgres

Foreword

I'd like to apologise for the lack of cover art. In my defence, I did search for a suitable image to illustrate "multiple insertion", and I feel it important to warn anyone else never to do the same.

Look Familiar?

Do you have code that looks like this?

new_id = db.query("""
  INSERT INTO foo(blah, wibble)
  VALUES($1, $2)
  RETURNING id""",
  "blah", "wibble")
db.query("""
  INSERT INTO other(foo_id, floogle)
  VALUES($1, $2)""",
  new_id, 42)

Maybe it's more complicated, maybe it has SELECT statements and all sorts.

But it works. So why am I going to suggest changing it?

Round trips

The moment we introduce a network connection, we end up with two key parameters for performance. There's "How long something takes to execute", and "How far is it away". The former we have some control over; we can write things more efficiently, add database indexing, boost instance sizes and whatever else needs doing.

The latter, though, we have little control over. If the round-trip time to the database is 5ms, then that code above will take a minimum of 10ms, no matter how fast the database and app service is.

There's other posts that will take you through the delights of query optimisation - I hope, and if not, then nag me and I'll write one.

What we're going to do is make the round-trips go away - and maybe make things a bit faster.

INSERT MULTIPLE

Sadly, there's no such thing as INSERT MULTIPLE. You can put multiple rows into a single table (INSERT ... SELECT ...), but not a row into multiple tables.

You could use a trigger here - that will absolutely work, but I find triggers that are beyond really trivial cases to be astonishingly difficult to test and debug.

Instead, we need a technique called a "Modifying Common Table Expression", or "Modifying CTE".

Common Table Expressions

Normally, any sign of WITH in a query has my hackles up. It's generally the sign of an overcomplex query on a schema not designed to service it.

At their simplest, they look like a rephrasing of a sub-select JOIN:

WITH thing AS (
  SELECT id, blah FROM foo
  WHERE wibble LIKE '%ibble'
)
SELECT floogle FROM other
JOIN thing ON other.foo_id=foo.id

But the magic is that the WITH causes the query to be a fully-named table - a Common Table Expression - so you could use it multiple times, or even within a second Common Table Expression.

WITH thing AS (
  SELECT id, blah FROM foo
  WHERE wibble LIKE '%ibble'
),
other_thing AS (
  SELECT id, foo_id, floogle, blah FROM other
  JOIN thing ON other.foo_id=foo.id
)
SELECT id, foo_id FROM other_thing
JOIN table_i_havent_mentioned_before t ON t.id=other_thing.id

The query optimizer will be clever here, and knows it needsthing before other_thing can be ready. But in practise, it'll also treat the entire query as a single SELECT, so this isn't that exciting.

But what if we don't use a SELECT in the CTE, but an INSERT, UPDATE, or DELETE?

Modifying CTE

WITH step_one AS (
  INSERT INTO foo(blah, wibble)
  VALUES($1, $2)
  RETURNING id
)
INSERT INTO other(foo_id, floogle)
SELECT id, $3 FROM step_one

This does the same as the (pseudo) Python at the beginning.

Postgres will run step_one first, as the outer statement depends on it, and then run the outer statement with the result - just as the Python did before it. Only this time, it's all in a single round-trip.

Let's try something more complicated - how about four statements?

MOAR Modifying CTE

new_id = db.query("""
  INSERT INTO foo(blah, wibble)
  VALUES($1, $2)
  RETURNING id""",
  "blah", "wibble")
other_new_id = db.query("""
  INSERT INTO bar(blook)
  VALUES($1)
  RETURNING id""",
  "blah", "wibble")
db.query("""
  INSERT INTO other(foo_id, floogle, bar_id)
  VALUES($1, $2, $3)""",
  new_id, 42, other_new_id)

Here, we have some Python that executes two statements, one after the other, and inserts the new ids from both in a second table.

In Python, we have to run one after the other - you can't run statements concurrently in a single transaction - so 3 round-trips, 3 statement executions. (Worse actually, since the transaction cost an additional 2 round-trips, for a total of 5.)

But as a modifying CTE statement:

WITH step_one AS (
  INSERT INTO foo(blah, wibble)
  VALUES($1, $2)
  RETURNING id
),
WITH step_two AS (
  INSERT INTO bar(blook)
  VALUES($3)
  RETURNING id
),
INSERT INTO other(foo_id, floogle, bar_id)
SELECT s1.id, $4, s2.id FROM step_one s1, step_two s2

This uses just one round-trip - but also, step_one and step_two are executed concurrently, because they don't depend on each other... so only two statement execution times.

Also, it's a single statement, so you no longer need a transaction for isolation.

I admit, though, that the parameters are getting hard to track. Hey, I know what would solve this:

EVEN MOAR CTE!

WITH my_data(blah, wibble, blook, floogle) AS (
  VALUES ($1, $2,$3, $4::BIGINT)
),
WITH step_one AS (
  INSERT INTO foo(blah, wibble)
  SELECT m.blah, m.wibble FROM my_data m
  RETURNING id
),
WITH step_two AS (
  INSERT INTO bar(blook)
  SELECT m.blook FROM my_data m
  RETURNING id
),
INSERT INTO other(foo_id, floogle, bar_id)
SELECT s1.id, m.floogle, s2.id
FROM step_one s1, step_two s2, my_data m

Look! Variables! (Sorta!)

Note that I've used a SQL cast to ensure my floogle value is the correct BIGINT type, though normally the libraries get these right for you - but for timestamps and things, it might get hard to figure out.

Conclusion

Modifying Common Table Expressions are powerful tools for making a range of common patterns involve a lot less back-and-forth between the application service and your database.

Not only that, but despite the arcane syntax, they can often be faster, and with a little effort, easier to read. They're also easier to debug than triggers.

Have fun!

DEV Community