Have you ever had to migrate data from one database to another? Or even from one table to another? Or to run a huge number of INSERTs all at once? If so, this article is for you: today we will discuss the COPY command, an excellent alternative for write-intensive workloads that is very useful for data migrations, imports, and bulk inserts.
“I have to insert 1 million rows in X seconds”
Having to migrate or insert a lot of data within a few seconds is a very common requirement. When people hear that, the first suggestions you will get are:
- “Use a distributed processing tool to insert into the database”
- “Add more CPU and RAM to the database”
- “Use a cache”
- “Use a NoSQL database instead”
- “Break it down into microservices and put an Apache Kafka broker in front of the database”
Is any of this really necessary? Maybe not. In many cases, a bulk-loading tool like the COPY command is far faster than the options above, and also cheaper in terms of infrastructure. COPY is a native PostgreSQL command, and its syntax looks like this:
COPY table_name (column1, column2, ...) FROM '/path/to/data.csv';
OR
COPY table_name (column1, column2, ...) FROM STDIN;
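For example, running the file-based form from code with node-postgres might look roughly like the sketch below. The connection string, table, columns, and path are placeholders, and keep in mind that this server-side form reads the file on the database server itself, so it requires superuser privileges or the pg_read_server_files role.

// Sketch: running the server-side form of COPY with the `pg` client.
// The table, columns, and CSV path are placeholders; the file must be
// readable by the PostgreSQL server process.
import { Client } from 'pg';

const client = new Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

await client.query(
  `COPY table_name (column1, column2) FROM '/path/to/data.csv' WITH (FORMAT CSV, HEADER)`
);
console.log('COPY from server-side file finished');

await client.end();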
The COPY command in PostgreSQL is a powerful tool for bulk inserts and data migrations: it lets you insert large amounts of data into a table quickly and efficiently.
By using COPY, you can often avoid distributed processing tools, bigger database instances, or switching to a NoSQL database. It provides a simpler and more cost-effective solution for bulk inserts and data migrations.
So, if you have a task that requires inserting a large number of rows in a short amount of time, consider using COPY. It can significantly speed up your data migration and import processes.
Note: PostgreSQL 16 improved COPY performance by up to 300% in some cases.
How does COPY work internally?
Internally, the COPY command in PostgreSQL works by leveraging the server's file system. When you use COPY to import data from a file, PostgreSQL reads the file directly from the file system and writes the data directly to the table, bypassing many of the normal SQL processing steps.
This direct file-to-table transfer allows for much faster and more efficient data loads compared to traditional INSERT statements. It also reduces the overhead on the database server, making it a preferred method for bulk inserts and data migrations.
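When the data lives on the client instead of the server, COPY ... FROM STDIN streams the same rows over the connection. Here is a minimal sketch using pg-copy-streams, the same library the benchmark below relies on; the connection string, table, and the two hard-coded rows are purely illustrative:

// Minimal sketch: stream rows from the client into Postgres via COPY FROM STDIN.
import { Client } from 'pg';
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import { from as copyFrom } from 'pg-copy-streams';

const client = new Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

// Two illustrative CSV rows; in practice this would be a file or generator.
const rows = ['data_0,2024-01-01T00:00:00Z\n', 'data_1,2024-01-01T00:00:01Z\n'];
const copyStream = client.query(copyFrom('COPY test_table (data, time_added) FROM STDIN WITH (FORMAT CSV)'));
await pipeline(Readable.from(rows), copyStream);

await client.end();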
But let’s go deeper to understand all the details.
COPY vs INSERT (multi-line)
Let’s look at a quick comparison between the COPY and INSERT commands to understand why COPY performs better.
| | COPY | INSERT (multi-line) |
|---|---|---|
| Logging | One log for the entire load | One log for each line/entry |
| Network | No latency, data is streamed | Latency between inserts |
| Parsing | Only one parsing operation | Parsing overhead for each statement |
| Transaction | Single transaction | Each INSERT statement is a separate transaction |
| Query plan | A single, simpler query execution plan | Lots of different query execution plans |
Summing up, COPY is faster because it reduces the overhead of logging, network latency, parsing, and transaction management compared to multi-line INSERT statements, and it uses a simpler query execution plan, resulting in faster and more efficient bulk inserts and data migrations. There are tradeoffs, though. COPY FROM a file requires direct access to the server's file system, so it may not suit every scenario (COPY FROM STDIN, which streams data from the client, avoids this and is what the benchmark below uses). Another tradeoff is durability: COPY generates fewer logs and executes everything in a single transaction, which makes it riskier if something fails mid-load.
Measuring performance
I wrote some code with Bun (yes, Bun) to compare the performance of INSERT and COPY when loading large amounts of data into Postgres. Let’s take a look:
import { Client } from 'pg';
import fs from 'fs';
import { from as copyFrom } from 'pg-copy-streams';
import { pipeline } from 'node:stream/promises';

const client = new Client({
  host: 'localhost',
  port: 6472,
  user: 'copy-insert-db',
  password: 'copy-insert-db',
  database: 'copy-insert-db'
});

// Generate the dataset in memory: [data, timestamp] pairs.
const numRecords = 2000000;
const data = Array.from({ length: numRecords }, (_, i) => [`data_${i}`, new Date().toISOString()]);

async function measureInsert() {
  // Multi-line INSERT: send the rows in batches of 1000 using parameterized queries.
  const chunkSize = 1000;
  const startTime = Date.now();

  for (let i = 0; i < data.length; i += chunkSize) {
    const chunk = data.slice(i, i + chunkSize);
    // Build the ($1, $2), ($3, $4), ... placeholder list for this chunk.
    const placeholders = chunk.map((_, index) => `($${index * 2 + 1}, $${index * 2 + 2})`).join(',');
    const values = chunk.flat();
    const insertQuery = `
      INSERT INTO test_table (data, time_added) VALUES ${placeholders}
    `;
    await client.query(insertQuery, values);
  }

  const duration = (Date.now() - startTime) / 1000;
  console.log(`Multi-line INSERT took: ${duration} seconds`);
}

async function measureCopy() {
  return new Promise<void>((resolve, reject) => {
    // First dump the dataset to a CSV file, then stream that file
    // into Postgres through COPY ... FROM STDIN.
    const writableStream = fs.createWriteStream("data.csv");

    writableStream.on("finish", async () => {
      try {
        const startTime = Date.now();
        const stream = fs.createReadStream("data.csv");
        const copyPsqlStream = client.query(copyFrom('COPY test_table (data, time_added) FROM STDIN WITH (FORMAT CSV)'));
        // pipeline() pipes the CSV file into the COPY stream and resolves when the load finishes.
        await pipeline(stream, copyPsqlStream);
        const duration = (Date.now() - startTime) / 1000;
        console.log(`COPY took: ${duration} seconds`);
        resolve();
      } catch (err) {
        reject(err);
      }
    });

    data.forEach(d => writableStream.write(d.join(',') + '\n'));
    writableStream.end();
  });
}

async function main() {
  await client.connect();
  await measureInsert();
  await measureCopy();
  await client.end();
}

main();
This code inserts 2 million rows into the database. Notice that, for the multi-line INSERT, the rows are sent in batches of 1,000 per statement, since a parameterized query cannot carry all 2 million rows at once. If we run this code, we get:
Multi-line INSERT took: 37.187 seconds
COPY took: 10.088 seconds
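As a side note, the 1,000-row batch size is a choice rather than a PostgreSQL rule; the hard cap comes from the wire protocol, which allows at most 65,535 bind parameters per statement. A tiny sketch of deriving a safe upper bound (the constant and helper are mine, not part of the benchmark):

// Hypothetical helper: largest batch size that stays within PostgreSQL's
// 65,535 bind-parameter limit for a parameterized multi-row INSERT.
const MAX_BIND_PARAMS = 65535;

function maxChunkSize(columnsPerRow: number): number {
  return Math.floor(MAX_BIND_PARAMS / columnsPerRow);
}

console.log(maxChunkSize(2)); // 32767 rows at most for the two-column test_table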
Ramping up (5 million records)
Changing the code to load 5 million records gives us the following result:
Multi-line INSERT took: 78.957 seconds
COPY took: 20.534 seconds
Ramping up (10 million records)
Now, with 10 million records, we get:
Multi-line INSERT took: 134.412 seconds
COPY took: 36.965 seconds
We can see how absurdly fast the COPY command is, and why we should consider using it instead of INSERT in some cases. But we can make COPY even faster by parallelizing it with threads!
Improving COPY performance with multithreading
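The article does not include the parallelized code, so here is a minimal sketch of one way it could be done, assuming Bun's support for node:worker_threads and the same table and connection settings as the benchmark above: split the rows across a few workers and let each one open its own connection and run its own COPY ... FROM STDIN. The file names and worker count are illustrative.

// parallel-copy.ts — hypothetical sketch, not the article's exact code.
import { Worker } from 'node:worker_threads';

const NUM_WORKERS = 4;
const numRecords = 2000000;

function runWorker(offset: number, count: number): Promise<void> {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./copy-worker.ts', { workerData: { offset, count } });
    worker.on('error', reject);
    worker.on('exit', code => (code === 0 ? resolve() : reject(new Error(`worker exited with code ${code}`))));
  });
}

const startTime = Date.now();
const perWorker = Math.ceil(numRecords / NUM_WORKERS);
await Promise.all(
  Array.from({ length: NUM_WORKERS }, (_, i) =>
    runWorker(i * perWorker, Math.min(perWorker, numRecords - i * perWorker))
  )
);
console.log(`COPY took: ${(Date.now() - startTime) / 1000} seconds`);

// copy-worker.ts — each worker generates its slice of rows and streams it to Postgres.
import { workerData } from 'node:worker_threads';
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import { Client } from 'pg';
import { from as copyFrom } from 'pg-copy-streams';

const { offset, count } = workerData as { offset: number; count: number };

const client = new Client({
  host: 'localhost',
  port: 6472,
  user: 'copy-insert-db',
  password: 'copy-insert-db',
  database: 'copy-insert-db'
});
await client.connect();

// Generate this worker's CSV rows lazily instead of materializing them all.
function* rows() {
  for (let i = offset; i < offset + count; i++) {
    yield `data_${i},${new Date().toISOString()}\n`;
  }
}

const copyStream = client.query(copyFrom('COPY test_table (data, time_added) FROM STDIN WITH (FORMAT CSV)'));
await pipeline(Readable.from(rows()), copyStream);
await client.end();

Each COPY runs in its own session and transaction, so the chunks load independently; the speedup comes from keeping several COPY streams busy at once.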
For our initial tests with 2 million records, we have:
COPY took: 5.915 seconds
Ramping up (5 million records)
With 5 million records to insert, our parallelized COPY performs:
COPY took: 15.41 seconds
Ramping up (10 million records)
With 10 million records to insert, our parallelized COPY performs:
COPY took: 21.19 seconds
Final results
I plotted the performance comparison with Matplotlib in a notebook. Summarizing the numbers measured above:

| Records | Multi-line INSERT | COPY | Parallel COPY |
|---|---|---|---|
| 2 million | 37.187 s | 10.088 s | 5.915 s |
| 5 million | 78.957 s | 20.534 s | 15.41 s |
| 10 million | 134.412 s | 36.965 s | 21.19 s |

The graph clearly illustrates the superior efficiency of the COPY command compared to INSERT for this specific task. The difference in speed is quite remarkable.
Top comments (5)
We evaluated 3 database servers (from Python): PostgreSQL, Oracle, and, for funzies, MySQL.
Inserting from 10k records up to 5M, in batches of 1000 at a time.
The difference between PG 14 and 16 was negligible (16 was faster, but by really low margins).
Oracle was on 19c; 23c isn't ready yet...
On different hardware the pattern was the same, just the times were different.
MySQL actually held its own surprisingly well.
The benchmark was to read the dataset into a Polars dataframe, so "input" wasn't an issue. Drop table, create table, insert X records, 50 times over for each dataset.
The actual "times" here aren't important; this was a comparison of the servers.
Oracle, MySQL, PostgreSQL, and PostgreSQL COPY were all on the exact same server (clean-install VM, default configs).
(Also, running it against Docker DB servers on my local dev machine showed the same pattern, just obviously much, much higher values.)
Postgres "novus" is on a production-grade server.
The pattern stays the same here. I found it strange how Postgres could be "that bad" and "that good" at the same time. The normal executemany was by far (really far) behind, yet COPY was insanely fast.
Our takeaway from this was that PostgreSQL looks like the most amazing tech ever when comparing pg executemany to COPY. If the bar is set at pg executemany (and you assume the other servers are at the same or similar level), then you get introduced to COPY and it's like: whoa, unicorns and rainbows. But in reality it's just that executemany is "that bad" (heavy emphasis on the quotation marks).
Amazing, thanks for sharing! Postgres is awesome.
It is. And the amount of cool "connection" methods (from pretty much anything). But we have an Oracle DBA in the company; Postgres didn't win enough for us to use it over Oracle. (PG was my choice.)
I recently did some testing on this with JDBC and managed 3M rows per second on an M1 Mac with COPY in a single thread; however, you lose the ability to use `on conflict ...`. It's also possible to improve the performance of inserts, I found, by replacing `values (..., ...), (..., ...)` with `select unnest(?), unnest(?)`, with each `?` representing an array of values for `column1, column2`. This method was 3x faster than `values` in my tests, at 850k rows per second. (Lib used for COPY with JDBC: github.com/PgBulkInsert/PgBulkInsert)
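To make the unnest trick concrete in the TypeScript/node-postgres setup used in this article (the commenter's code is JDBC, so this is my own rough sketch, assuming test_table's time_added column is a timestamp), each bind parameter carries a whole column as an array:

// Hypothetical sketch of an unnest-based bulk insert into the article's test_table.
// node-postgres serializes the JS arrays as Postgres arrays, and the explicit
// casts tell the server how to interpret them.
import { Client } from 'pg';

const client = new Client({ connectionString: process.env.DATABASE_URL });

async function insertWithUnnest(rows: Array<[string, string]>) {
  const dataValues = rows.map(r => r[0]);
  const timeValues = rows.map(r => r[1]);
  await client.query(
    `INSERT INTO test_table (data, time_added)
     SELECT * FROM unnest($1::text[], $2::timestamptz[])`,
    [dataValues, timeValues]
  );
}

Because the whole batch travels as just two parameters, this approach also sidesteps the 65,535-bind-parameter limit mentioned earlier.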
How can I get the list of recently modified functions and tables in PostgreSQL, like in SQL Server?
Select * from sys.procedures order by modify_date desc
Select * from sys.tables order by modify_date desc