How Long Does It Take To Create An Index?

#postgres #sql

A recent tweet asked the following question.

Does anybody know how long it should take to create an index of two integer columns with approximately 110 Million records in the DB? #postgres #postgresql #NotAnDBA

Let's put together an experiment. Create a table with 2 columns populated with 110,000,000 million rows. Each column has a random integer value.

Use the command line psql client that’s included with PostgreSQL to create the index.

To see how long the CREATE INDEX statement will take toggle \timing from psql.

CREATE TABLE t (
    a INTEGER,
    b INTEGER
);

-- Populate 110,000,000 million records
INSERT INTO t(a,b)
SELECT
  (RANDOM() * 1000)::INT,
  (RANDOM() * 1000)::INT
FROM GENERATE_SERIES(1, 110000000) seq;

-- Confirm 110,000,000 records
SELECT COUNT(*) FROM t;
   count
-----------
 110000000

Enable timing to collect the duration of running the CREATE INDEX statement.

-- turn timing on
\timing
Timing is on.

-- Create index "t_a_idx" on table "t" on the "a" column
CREATE INDEX t_a_idx ON t (a);

Time: 52348.022 ms (00:52.348)

Answer

In this test, it took about 52 seconds to create the index.

Test Environment Details

Tested on a M1 MacBook Air with 16GB RAM.
The table has no sequences, constraints, or other indexes
Vacuum was running but was cancelled when the CREATE INDEX started according to the postgresql.log.
No other queries were running on the table.

This post is intended to be a demonstration of how to conduct this sort of experiment on your own and some of the factors that contribute to the time needed for index creation.

The best way to answer questions like this is to create a test on your own server instance or using a separate disconnected instance made from a snapshot.

In a production system with live queries, use the CONCURRENTLY keyword when creating indexes. While this lengthens the time to create the index, it avoids blocking concurrent write operations.

Collect timing info

This experiment could be considered a best case scenario for my machine. The data was loaded into a minimal set of pages, with inserts and no updates or deletes. There were no other queries or background processes running.

A best case scenario gives us a starting point, but the best way to answer this question will be to try this out on your system. Good luck!