This post will be fairly quick, and covers how you can significantly reduce the amount of time a Postgres
COPY command takes to complete.
Not too long ago, I managed to finish building out the first revision of my UK transport data platform.
During the time I was building it, I came across numerous challenges but perhaps one of the most critical - How do I improve the performance of data import?
Allow me to provide some context. I provide transport data for buses, trains and London underground via a GraphQL API. I source my data from numerous official providers, including National Rail, Network Rail, and the UK government open data sources. They provide me with huge datasets such as bus stop locations, transit schedules, etc. Every day, I retrieve this data and import it into a Postgres database.
One of my these datasets has over 38 million rows, and so ensuring that I can import that as quickly as possible is important for maintaining the general uptime of the API.
Here's a few quick tips which can significantly improve performance when importing datasets using Postgres
- Delete indexes before importing
Indexes provide a way for Postgres to quickly locate data in tables. You can run a database without them, but they help improve read performance significantly on large datasets. The downside of them is that they drastically reduce the performance of inserts. Removing them before you begin an import, and recreating them after the import is finished, reduces the need for the database to update indexes at the same time as it's handling the data import.
- Set tables to unlogged before importing
When a table is set to
LOGGED, it will write changes made to that table in the WAL (Write-Ahead Log). This helps ensure data consistency, and reduces the chance that a table will be corrupted. Unfortunately, much like indexes, this also reduces performance significantly when writing to the table. Setting a table to
UNLOGGED before you begin the import, and then setting it to
LOGGED after the import completes can improve performance significantly.
Let's cover some numbers.
Importing the data with indexes in-place, and tables set to logged: Approximately 27 minutes to import data to 6 tables
Importing the data without indexes, and tables set to unlogged: Approximately 10 minutes to import data to 6 tables
The creation of several indexes on the largest table after data import has completed takes around 3 minutes to complete. If I opted not to recreate the indexes, the entire import process would be completed in 7 minutes.