DEV Community

Franck Pachot for Yugabyte


when you have millions of insert statements in a file...

Let's say you have a script of one million insert statements with literal values. Of course, this is a bad idea. SQL is a language, not a data exchange format. You should have a file to import with COPY, with all values in CSV format for example. Or, if you are a big fan of INSERT statements, at least put thousands of row values in each. Or prepare the statement with parameters and execute it once per set of values.
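As a sketch of the multi-row variant, here is how such a file could be rewritten into batched INSERTs. This assumes every line has the exact shape generated later in this post, with numeric values only, and uses a batch size of 2 for readability (you would use thousands in practice):

```shell
# A sketch, not production code: rewrite single-row INSERTs into
# multi-row VALUES batches. Assumes every line has the exact shape
# "insert into demo (key,value) values (k,v);" with numeric values.

# Small sample file (same shape as the one generated below):
for i in 1 2 3 4 5; do
  echo "insert into demo (key,value) values ($i,$((i*10)));"
done > sample.sql

# Batch every 2 rows into one statement (use 1000+ for a real load):
awk -v batch=2 '
  {
    match($0, /\([0-9]+,[0-9]+\)/)   # extract the "(k,v)" tuple
    tuple = substr($0, RSTART, RLENGTH)
    if (NR % batch == 1) {
      if (NR > 1) print ";"          # close the previous statement
      printf "insert into demo (key,value) values %s", tuple
    } else {
      printf ",%s", tuple
    }
  }
  END { if (NR > 0) print ";" }
' sample.sql > batched.sql

cat batched.sql
# → 3 statements instead of 5, e.g.:
#   insert into demo (key,value) values (1,10),(2,20);
```

One caveat: very large VALUES lists have their own parsing cost, so batches of a few thousand rows are a reasonable middle ground.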

However, this is the kind of thing we encounter in real life. I still remember, 20 years ago, looking at a Mobile Telecom Billing application where the vendor specification required 6 large servers for a system that had at most one thousand post-paid customers to bill. Call Data Records came in a very efficient proprietary binary format from the network's Mobile Switching Centre. They had a C program to decode it into... plain-text INSERT statements to be "ingested" with sqlplus. It is hard to imagine a worse design, and that was for the most critical part of the system, as it required near real-time actions. Of course performance was poor. And you can imagine that there was no correct error handling. And no security, with all call records in plain text. But they sold it to people who do not look at how it works, easily satisfied by nice PowerPoints. Today, I'm convinced that, whatever the reputation of the vendor, you should not put your data on software that is not open source. Look at the code, look at the design, and you will get a good idea of the quality of the software.

Back to our technical workaround, I'm generating one million insert statements:

for i in {1..1000000}
do
 echo "insert into demo (key,value) values ($i,$RANDOM);" 
done > inserts.sql

PostgreSQL

Loading them as-is with psql is not very efficient because it has to send the commands one by one, parse each of them, and commit each row:

time psql <<SQL
 drop table if exists demo;
 create table demo (key int, value int);
 \set QUIET on
 \i inserts.sql
 \set QUIET off
 select count(*),sum(value) from demo;
 \q
SQL

Here is the result, in 21 minutes, and my PostgreSQL database is local. It would be even worse with network roundtrips:

CREATE TABLE

  count  |     sum
---------+-------------
 1000000 | 16382993084
(1 row)


real    21m22.156s
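A quick back-of-envelope check, assuming the timing above: that is more than a millisecond per row, dominated by per-statement and per-transaction overhead rather than by storage:

```shell
# 21m22s for one million single-row transactions, in milliseconds per row:
awk 'BEGIN { printf "%.2f ms/row\n", (21*60+22)*1000/1000000 }'
# → 1.28 ms/row
```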

I can get it much faster by sending them in a single transaction. By default, psql is in autocommit mode, which means that each statement is a transaction. You get out of this by starting a transaction yourself:

time psql <<SQL
 begin transaction;
 drop table if exists demo;
 create table demo (key int, value int);
 \set QUIET on
 \i inserts.sql
 \set QUIET off
 select count(*),sum(value) from demo;
 commit;
 \q
SQL

There, even if the rows are sent one by one, all are part of the same transaction, and it takes only one minute:

BEGIN
DROP TABLE
CREATE TABLE

  count  |     sum
---------+-------------
 1000000 | 16382993084
(1 row)

COMMIT

real    1m13.517s
user    0m5.841s

I can even avoid the many roundtrips by enclosing this in a procedural block. The BEGIN keyword has another meaning here: it starts a PL/pgSQL block, not a transaction. There's no need to control the transaction, as the whole command is sent as one auto-commit statement:

cat > inserts.cmd <<SQL
do 'begin
$(cat inserts.sql)
 commit;
 end;';
SQL

time psql <<SQL
 drop table if exists demo;
 create table demo (key int, value int);
 \i inserts.cmd
 select count(*),sum(value) from demo;
 \q
SQL

I save the roundtrips, still run it as one transaction, and get it down to 40 seconds:

DROP TABLE
CREATE TABLE
DO

  count  |     sum
---------+-------------
 1000000 | 16382993084
(1 row)


real    0m40.664s

YugabyteDB

Ok, now my goal is to ingest into YugabyteDB. There will be some network calls because my database is on another node (you can't expect a distributed database to be local, as it has multiple nodes).
I tried the same, but it took far too long:

time psql postgres://yugabyte:yugabyte@yb1.pachot.net:5433 <<SQL
 drop table if exists demo;
 create table demo (key int, value int);
 \i inserts.cmd
 select count(*),sum(value) from demo;
 \q
SQL 

I mentioned the time spent parsing the statements, and this is worse in a distributed database because the metadata (aka dictionary, aka catalog, aka system information) is shared across the nodes. During this, DropCachedPlan shows up in perf (reminder: perf shows only the CPU resources, not the RPC waits).

Well, I cancelled it. Row-by-row inserts are bad in a monolithic database, and even worse in a distributed one. Let's see the solutions, or workarounds, without modifying the inserts.sql file.

YugabyteDB - Temporary Table

As the insert into the monolithic PostgreSQL doesn't take too long, I tried the following: inserting into a temporary table (local to the node I'm connected to) and then CREATE TABLE ... AS SELECT or INSERT ... SELECT from it into my distributed table.

time psql postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte <<SQL
 drop table if exists demo_temp;
 drop table if exists demo;
 create temporary table demo_temp (key int, value int);
 alter table demo_temp rename to demo;
 \set QUIET on
 \i inserts.sql
 \set QUIET off
 alter table demo rename to demo_temp;
 create table demo as select * from demo_temp;
 select count(*),sum(value) from demo;
 \q
SQL

This is again too long, because the problem is not in the distributed storage. When we have a batch of inserts, the rows are optimally sent to each node and processed asynchronously. Here the problem is the many small SQL statements to parse.

YugabyteDB - COPY

In YugabyteDB, as in PostgreSQL, the right tool for fast data ingest is COPY. But I'm too lazy to format those inserts into a CSV. As it is quite fast to load into a local PostgreSQL database, as I did above, I'll generate the COPY commands with pg_dump:

pg_dump --table demo --data-only > inserts.dmp

Without the --inserts option, the generated file uses COPY ... FROM STDIN followed by tab-separated lines, terminated by a \. line.

No statements to parse, no dictionary to read: this is efficient.

time psql postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte <<SQL
 drop table if exists demo;
 create table demo (key int, value int);
 \i inserts.dmp
 \c
 select count(*),sum(value) from demo;
 \q
SQL

Here is the result:

CREATE TABLE
SET
SET
SET
SET
SET
 set_config
-----------------

(1 row)

SET
SET
SET
SET
COPY 1000000
You are now connected to database "yugabyte" as user "yugabyte".

  count  |     sum
---------+-------------
 1000000 | 16382993084
(1 row)

real    0m59.139s

This is about 1 minute, which is not bad compared to the local PostgreSQL run, given that I'm inserting into a Replication Factor 3 cluster here. But in a distributed database, performance comes with scalability.

Let's try again:

psql postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte <<SQL
 drop table if exists demo;
 create table demo (key int, value int);
SQL

Now with 4 clients in parallel:

for i in {0..3} ; do 
(
 awk '
 # out of data section
 /^\\[.]/{data=0}
 # when in data section, take only 1 row every `mod`
 data!=1||NR%mod==num{print}
 # data section begins after COPY
 /^COPY /{data=1}
 ' mod=4 num=$i inserts.dmp > inserts.$i
 psql -c "\\i inserts.$i" postgres://yugabyte:yugabyte@yb1.pachot.net:5433/yugabyte
) & done | grep ^COPY | ts
time wait

This is a small cluster, but with parallelism I can load the million rows in a shorter time:

Sep 22 20:34:30 COPY 250000
Sep 22 20:34:30 COPY 250000
Sep 22 20:34:30 COPY 250000
Sep 22 20:34:30 COPY 250000

real    0m39.396s
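To see what the awk filter does, here it is on a tiny hand-made dump (hypothetical content, but the same shape as the pg_dump output): each worker keeps the COPY header, the terminating \., and every mod-th data row.

```shell
# A tiny COPY dump with 4 data rows (tab-separated, as pg_dump writes them):
printf 'COPY public.demo (key, value) FROM stdin;\n1\t11\n2\t22\n3\t33\n4\t44\n\\.\n' > mini.dmp

# Split it into 2 partitions with the same filter as above:
for num in 0 1; do
  awk '
  # out of data section
  /^\\[.]/{data=0}
  # when in data section, take only 1 row every `mod`
  data!=1||NR%mod==num{print}
  # data section begins after COPY
  /^COPY /{data=1}
  ' mod=2 num=$num mini.dmp > part.$num
done

cat part.0   # COPY header, rows 1 and 3, and the \. terminator
cat part.1   # COPY header, rows 2 and 4, and the \. terminator
```

Each partition is a valid COPY script on its own, which is why a plain `psql -c "\i inserts.$i"` can load it independently.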

This is, on a remote highly-available distributed database, the same time as on a local monolithic database. YugabyteDB has many optimizations to scale fully consistent data ingestion. The right way is to send data in bulk, without parsing statements, as COPY does. The same PostgreSQL best practices apply in YugabyteDB, and they are even more critical as soon as cross-node calls are involved.
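And if you control the generator, you can skip the INSERT detour entirely and emit tab-separated values for COPY directly. A sketch, generating 10,000 rows here for brevity and assuming the same demo table as above:

```shell
# Generate tab-separated values directly, ready for COPY FROM STDIN:
for i in $(seq 1 10000); do
  printf '%s\t%s\n' "$i" "$RANDOM"
done > inserts.tsv
```

Then load it in one shot with `psql -c "\copy demo from inserts.tsv"`, optionally splitting the file for parallel sessions as shown above.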
