Speed up your PostgreSQL bulk inserts with COPY

José Thomaz

Have you ever had to migrate data from one database to another? Or from one table to another? Or to run a huge number of INSERTs all at once? If so, this article is for you: today we will discuss the COPY command, an excellent alternative for performing intensive writes to the database that is very useful for data migrations, imports, and bulk inserts.

 

“I have to insert 1 million rows in X seconds”

Having to migrate or insert data within a few seconds is a very common requirement. When people hear that, the first thing they will say to you is:

  • “Use a distributed processing tool to insert into the database”
  • “Add more CPU and RAM to the database”
  • “Use cache”
  • "Use a NoSQL database instead"
  • "Break down in microservices and put an Apache Kafka broker in front of the database”

Is this really necessary? Maybe not. In many cases, native tools like the COPY command are way faster than the options above, and also cheaper in terms of infrastructure. COPY is a native PostgreSQL command, and its syntax looks like this:

COPY table_name (column1, column2, ...) FROM '/path/to/data.csv';

OR

COPY table_name (column1, column2, ...) FROM STDIN;

The COPY command in PostgreSQL is a powerful tool for performing bulk inserts and data migrations. It allows you to quickly and efficiently insert large amounts of data into a table.

By using the COPY command, you can avoid the need for distributed processing tools, adding more CPU and RAM to the database, or using a NoSQL database. It provides a simpler and more cost-effective solution for bulk inserts and data migrations.

So, if you have a task that requires inserting a large number of rows in a short amount of time, consider using the COPY command in PostgreSQL. It can significantly speed up your data migration and import processes.
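COPY also accepts an options list. For illustration (the users table and file paths below are placeholders, not from the article): you can load a CSV that includes a header row, and psql's \copy meta-command runs the same kind of load but reads the file from the client machine instead of the server's file system:

COPY users (id, name, created_at) FROM '/path/to/users.csv' WITH (FORMAT CSV, HEADER);

-- psql only: client-side file, streamed to the server via STDIN
\copy users (id, name, created_at) FROM 'users.csv' WITH (FORMAT CSV, HEADER)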

Note: PostgreSQL 16 has improved COPY’s performance, reportedly by over 300% in some workloads.

 

How does COPY work internally?

Internally, the COPY command in PostgreSQL works by streaming rows in bulk instead of executing one statement per row. When you import from a server-side file, PostgreSQL reads the file directly from the file system and writes the rows straight into the table; with COPY ... FROM STDIN, the client streams the rows over the connection. In both cases the load bypasses most of the normal per-statement SQL processing steps.

This direct file-to-table transfer allows for much faster and more efficient data inserts compared to traditional INSERT statements. It also reduces the overhead on the database server, making it a preferred method for bulk inserts and data migrations.

But let’s go deeper to understand all the details.

COPY vs INSERT (multi-line)

Let’s see a quick comparison between COPY and INSERT commands, to understand why COPY performs better.

|             | COPY                         | INSERT (multi-line)                             |
| ----------- | ---------------------------- | ----------------------------------------------- |
| Logging     | One log for the entire load  | One log for each line/entry                     |
| Network     | No latency, data is streamed | Latency between inserts                         |
| Parsing     | Only one parsing operation   | Parsing overhead                                |
| Transaction | Single transaction           | Each insert statement is a separate transaction |
| Query plan  | Simpler query execution plan | Lots of different query execution plans         |

Summing up, COPY is faster because it reduces the overhead of logging, network round trips, parsing, and transaction management compared to multi-line INSERT statements, and it runs with a single, simpler execution plan, resulting in faster and more efficient bulk inserts and data migrations. One tradeoff is that COPY from a file requires direct access to the server's file system (COPY FROM STDIN avoids this), so it may not suit every scenario. Another tradeoff is durability: COPY generates fewer log entries and runs everything in a single transaction, which makes it riskier, since a failure part-way through rolls back the entire load.
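As a side note on the "Transaction" row above: if you have to stay with INSERT, wrapping the batches in a single explicit transaction removes the per-statement commit cost, although COPY still avoids the per-statement parsing and planning. Below is a minimal sketch, not part of the original benchmark, assuming the same test_table (data, time_added) and a pg client:

import { Client } from 'pg';

// Sketch: batched INSERTs wrapped in one explicit transaction, so the whole
// load commits (or rolls back) as a single unit instead of once per statement.
async function insertInOneTransaction(client: Client, rows: string[][]) {
    const chunkSize = 1000;
    await client.query('BEGIN');
    try {
        for (let i = 0; i < rows.length; i += chunkSize) {
            const chunk = rows.slice(i, i + chunkSize);
            const placeholders = chunk
                .map((_, j) => `($${j * 2 + 1}, $${j * 2 + 2})`)
                .join(',');
            await client.query(
                `INSERT INTO test_table (data, time_added) VALUES ${placeholders}`,
                chunk.flat()
            );
        }
        await client.query('COMMIT');
    } catch (err) {
        await client.query('ROLLBACK');
        throw err;
    }
}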

 

Measuring performance

I wrote some code with Bun (yes, Bun) to compare the performance of INSERT and COPY when loading heavy amounts of data into Postgres. Let’s see:


import { Client } from 'pg';
import fs from 'fs';
import { from as copyFrom } from 'pg-copy-streams';
import { pipeline } from 'node:stream/promises';

const client = new Client({
    host: 'localhost',
    port: 6472,
    user: 'copy-insert-db',
    password: 'copy-insert-db',
    database: 'copy-insert-db'
});

// Generate the test rows in memory: [data, time_added] pairs
const numRecords = 2000000;
const data = Array.from({ length: numRecords }, (_, i) => [`data_${i}`, new Date().toISOString()]);

async function measureInsert() {
    // Multi-line INSERT: send the rows in batches of 1,000 per statement
    const chunkSize = 1000;
    const startTime = Date.now();

    for (let i = 0; i < data.length; i += chunkSize) {
        const chunk = data.slice(i, i + chunkSize);
        // Build the ($1, $2), ($3, $4), ... placeholder list for the whole chunk
        const placeholders = chunk.map((_, index) => `($${index * 2 + 1}, $${index * 2 + 2})`).join(',');
        const values = chunk.flat();

        const insertQuery = `
            INSERT INTO test_table (data, time_added) VALUES ${placeholders}
        `;

        await client.query(insertQuery, values);
    }

    const duration = (Date.now() - startTime) / 1000;
    console.log(`Multi-line INSERT took: ${duration} seconds`);
}

async function measureCopy() {
    return new Promise<void>((resolve, reject) => {
        // Dump the rows to a CSV file first, then stream that file into
        // COPY ... FROM STDIN once the write has finished
        const writableStream = fs.createWriteStream("data.csv");

        writableStream.on("finish", async () => {
            try {
                const startTime = Date.now();
                const stream = fs.createReadStream("data.csv");
                const copyPsqlStream = client.query(copyFrom('COPY test_table (data, time_added) FROM STDIN WITH (FORMAT CSV)'));
                await pipeline(stream, copyPsqlStream);
                const duration = (Date.now() - startTime) / 1000;
                console.log(`COPY took: ${duration} seconds`);
                resolve();
            } catch (err) {
                reject(err);
            }
        });

        data.forEach(d => writableStream.write(d.join(',') + '\n'));
        writableStream.end();
    });
}

async function main() {
    await client.connect();

    await measureInsert();
    await measureCopy();

    await client.end();
}

main();

This code inserts 2 million rows into the database. Notice that, for the multi-line INSERT, the rows are sent in batches of 1,000 per statement. If we run this code, we get:

Multi-line INSERT took: 37.187 seconds
COPY took: 10.088 seconds

Ramping up (5 million records)

Changing the code to load 5 million records gives us the following result:

Multi-line INSERT took: 78.957 seconds
COPY took: 20.534 seconds

Ramping up (10 million records)

Now, with 10 million records, we get:

Multi-line INSERT took: 134.412 seconds
COPY took: 36.965 seconds

We can see how absurdly fast the COPY command is, and why we should consider using it instead of INSERT in some cases. But we can make COPY even faster by parallelizing it with threads!

 

Improving COPY performance with multithreading
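The parallel version of the code isn't shown in the post, so here is a minimal sketch of one possible approach, assuming the same test_table and connection settings as the benchmark above: the rows are split into N chunks, and each chunk is streamed through its own connection and its own COPY ... FROM STDIN, all running concurrently. Note that this trades away the single-transaction property of one big COPY, since each connection commits its own portion of the load.

import { Client } from 'pg';
import { from as copyFrom } from 'pg-copy-streams';
import { pipeline } from 'node:stream/promises';
import { Readable } from 'node:stream';

// Hypothetical helper: split the in-memory rows into `parallelism` chunks and
// run one COPY stream per chunk, each on its own connection, concurrently.
async function parallelCopy(data: string[][], parallelism = 4) {
    const chunkSize = Math.ceil(data.length / parallelism);

    await Promise.all(
        Array.from({ length: parallelism }, async (_, i) => {
            const client = new Client({
                host: 'localhost',
                port: 6472,
                user: 'copy-insert-db',
                password: 'copy-insert-db',
                database: 'copy-insert-db'
            });
            await client.connect();

            const chunk = data.slice(i * chunkSize, (i + 1) * chunkSize);
            // Feed the chunk to COPY as an in-memory CSV stream instead of a file on disk
            const source = Readable.from(chunk.map(row => row.join(',') + '\n'));
            const copyStream = client.query(copyFrom('COPY test_table (data, time_added) FROM STDIN WITH (FORMAT CSV)'));
            await pipeline(source, copyStream);

            await client.end();
        })
    );
}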

For our initial tests with 2 million records, we have:

COPY took: 5.915 seconds

Ramping up (5 million records)

With 5 million records to insert, our parallelized COPY performs:

COPY took: 15.41 seconds

Ramping up (10 million records)

With 10 million records to insert, our parallelized COPY performs:

COPY took: 21.19 seconds

 

Final results

I conducted a performance comparison using a simple plot generated with a notebook running Matplotlib. The results are as follows:

Commands performance comparison

The graph clearly illustrates the superior efficiency of the COPY command in comparison to the INSERT command for this specific task. The difference in speed is quite remarkable.

 


Top comments (5)

William Stam • Edited

We evaluated 3 database servers (python) postgresql, oracle and for funzies MySQL.

Inserting from 10k records up to 5m in batches of 1000 at a time

  • Postgres executemany was the slowest (by far) when just doing executemany
  • Postgres copy was the fastest (we didn't think anything could come close to PG here.)
  • Oracle executemany was the 2nd fastest and it was actually really close to PG copy

The difference between PG 14 and 16 was negligible (16 was faster but really low margins)
Oracle was on 19c, 23c isnt ready yet...

On different hardware the pattern was the same just the times were different.

MySQL actually held its own surprisingly well

the benchmark was to read the dataset into a polars dataframe so "input" wasnt an issue. drop table, create table, insert x records 50 times over for each dataset.

the actual "times" here arent important. this was a comparison of the servers

oracle, mysql, postgresql and postgresql copy were all on the exact same server (clean install vm default configs)
(also running it against docker db servers on my local dev machine showed the same pattern. just obviously much much higher values)
postgres novus is on a production grade server

(benchmark charts)

the pattern stays the same here. i found it strange how postgres could be "that bad" and "that good" at the same time. the normal executemany was by far (really far) behind, yet copy was insanely fast

our takeaway from this was that postgresql looks like the most amazing tech ever when comparing pg executemany to copy. you feel that if the bar is at pg executemany (and think the others servers are at the same or similar) and then you get introduced to copy and its like whoa unicorns and rainbows. but in reality its just that executemany is just "that bad". (heavy emphasis on the quotation marks)

José Thomaz

amazing, thanks for sharing! postgres is awesome

William Stam

it is. and the amount of cool "connection" methods (from pretty much anything). but we have a oracle DBA in the company. postgres didnt win enough for us to use it over oracle. (pg was my choice)

Stephen Flavin • Edited

I recently did some testing on this with jdbc and managed 3M rows per second on an M1 mac with copy in a single thread however you lose the ability to use on conflict ....
It's possible to improve the performance of inserts I found by replacing values (..., ...), (..., ...) with select unnest(?), unnest(?) with each ? representing an array of values for column1, column2, this method was 3x faster than values in my tests at 850k rows per second.
(Lib used for copy with jdbc github.com/PgBulkInsert/PgBulkInsert)

MuraliKrishna Ch

How can I get recently modified functions and tables list in postgresql
As like SQL
Select * from sys.procedures order by modify_date desc
Select * from sys.tables order by modify_date desc