DEV Community

Cover image for Supercharge Your Rails App: Inserting 1 Million Records in 15s
Pimp My Ruby
Pimp My Ruby

Posted on

Supercharge Your Rails App: Inserting 1 Million Records in 15s

In a world where database performance and efficiency are critical factors in application development, the swift import of a large number of records has become a necessity.

In a previous article, we extensively examined various methods for effectively creating 100,000 records with Ruby on Rails. As a quick recap, we achieved a record insertion time of less than 3 seconds using the insert_all method.

However, I received feedback mentioning an even more efficient method than insert_all. This method is called ActiveRecord-Copy. Based on the feedback I received, this method is expected to revolutionize my perspective on fast insertion.

Today, we are going to shift our focus to a larger scale and find the fastest method for inserting 1 million records into a database!

DataSet

For today's benchmark, we will start with the PostgreSQL database used in my previous article:

# db/schema.rb
create_table "accounts", force: :cascade do |t|
  t.string "first_name"
  t.string "last_name"
  t.string "phone"
  t.string "email"
  t.string "role"
end
Enter fullscreen mode Exit fullscreen mode

As we aim to load 1 million records, using FactoryBot to load data for each benchmark execution would be too time-consuming. Therefore, I preloaded one million records into a file named accounts.csv to load the data at the beginning of the benchmark as follows:

accounts = []
CSV.foreach(Rails.root.join('accounts.csv'), headers: true) do |row|
  next if row['role'] == 'role'
  accounts << row.to_h
end
Enter fullscreen mode Exit fullscreen mode

Now, we can feed these users to our various methods to kickstart the insertion process!


Introduction to ActiveRecord-Copy

ActiveRecord-Copy is a gem that enables the use of the PostgreSQL COPY command. Typically, the COPY command is used to import data from a .sql file into a local database using PostgreSQL directly.

ActiveRecord-Copy facilitates interfacing this command directly within Ruby. The use of COPY minimizes the overhead associated with creating and executing multiple SQL statements for each record, resulting in remarkable speed and efficiency.

To install it in your project:

bundle add activerecord-copy
Enter fullscreen mode Exit fullscreen mode

Here is how I used ActiveRecord-Copy:

columns = users.first.keys + %w[created_at updated_at]
time = Time.now.getutc
Account.copy_from_client(columns) do |copy|
  accounts.each do |account|
    copy << (account.values + [time, time])
  end
end
Enter fullscreen mode Exit fullscreen mode

When using ActiveRecord-Copy, you must specify the columns you will provide. You also need to provide the created_at and updated_at timestamps, which are not automatically set during the COPY process.


Benchmark

In this benchmark, we will compare insert_all, activerecord-import, and our new contender, activerecord-copy.

You can find the benchmark here.

Now, let's compare our three methods for creating 1 million records:

Performance Benchmark

For the pure performance benchmark, we will run the benchmark script ten times, and here are the results:

Method Average Min Max Sum
ActiveRecord-Import 73.42 68.37 79.30 734.26
InsertAll 48.71 46.09 52.14 487.11
ActiveRecord-Copy 14.81 11.42 17.46 148.17

Here is the distribution of execution times over the ten trials:

Image description

We have an intriguing ranking!

🥇 ActiveRecord Copy (~5 times faster than ActiveRecord-Import)

🥈 InsertAll (~1.5 times faster than ActiveRecord-Import)

🥉 ActiveRecord-Import

ActiveRecord-Copy demonstrates significantly better performance than InsertAll and ActiveRecord-Import.


Error Handling Comparison

Beyond raw performance, error handling is a crucial aspect when inserting a large number of records. Let's compare how the three methods behave in terms of error handling.

For these tests, we will add two validations:

  • A database validation
# db/migrate/20231101115631_add_unique_on_accounts_email.rb
class AddUniqueOnAccountsEmail < ActiveRecord::Migration[7.0]
  def change
    add_index :accounts, :email, unique: true
  end
end
Enter fullscreen mode Exit fullscreen mode
  • A validation in our ActiveRecord model
# app/models/account.rb
class Account < ApplicationRecord
  [ ... ]
  validates :first_name, presence: true, length: { maximum: 50 }
  [ ... ]
end
Enter fullscreen mode Exit fullscreen mode

We will then create a script to introduce errors for each of the methods:

accounts = FactoryBot.attributes_for_list(:account, 100)

# We add an account with a first_name that is too long
accounts << FactoryBot.attributes_for(:account, first_name: 'a' * 51)

# We add a duplicate email in the array
accounts << accounts.first
Enter fullscreen mode Exit fullscreen mode

Here are the behaviors observed when trying to insert records with the different methods:

Method Database Error Validation Error
InsertAll No error, but the record is not created in the database Unhandled. The record is created with the excessively long first name
ActiveRecord-Import Raises a PG::UniqueViolation error. Can be bypassed with on_duplicate_key_ignore No error, but the record is not created in the database
ActiveRecord-Copy Raises a PG::UniqueViolation error. Cannot be bypassed. Unhandled. The record is created with the excessively long first name

ActiveRecord-Import is the only method that effectively handles model validations. This difference can partially explain why its processing time is significantly longer than the other methods.

I am also surprised that InsertAll does not create a record when the error originates from the database. Given that the method does not raise any errors, the expected behavior would have been database insertion.


Interpretation

After analyzing the benchmark results, we can draw the following conclusions regarding the performance and error handling of the three insertion methods:

  1. InsertAll: The InsertAll method is highly efficient, taking around 48.71 seconds to insert 1 million records. It handles database errors but ignores model validation errors.

  2. ActiveRecord-Import: This method offers reasonable performance, taking around 73.42 seconds to insert 1 million records. Importantly, this method actively handles database errors (PG::UniqueViolation) and model validation errors. This approach results in slower insertion but allows for comprehensive validation handling.

  3. ActiveRecord-Copy: ActiveRecord-Copy stands out in terms of performance, inserting 1 million records in just 14.81 seconds. However, concerning error handling, it behaves similarly to InsertAll, as it handles database errors (triggering a PG::UniqueViolation in our tests) but does not address model validation errors.

Recommendations

Based on the benchmark results and your specific project priorities, here are my recommendations:

  • Best Performance - ActiveRecord-Copy: If you seek the best performance for rapid insertion of large amounts of data, ActiveRecord-Copy is the top choice. It is approximately five times faster than ActiveRecord-Import. The only drawback is that it is exclusively available on PostgreSQL.
  • Model Validation - ActiveRecord-Import: If handling model validation errors is essential for your project, ActiveRecord-Import is a viable option. It is important to note that this method offers extensive customization in its usage. I encourage you to consult the documentation, as it allows for advanced features like recursive model creation and skipping model validations for faster insertion.
  • In the worst case - InsertAll: If you are not using PostgreSQL and do not require model validations, InsertAll is a solid choice. It provides good performance while handling database errors, even though it ignores model validation errors.

Conclusion

In the demanding world of application development, database performance plays a crucial role. Rapid data insertion is a common challenge, and our benchmark has highlighted significant differences between the three methods examined.

ActiveRecord-Copy stands out as the most efficient and memory-friendly method. However, ActiveRecord-Import is the only method that actively handles model validation errors.

Ultimately, the choice of the method depends on your priorities and project requirements. Each of these methods has its advantages and disadvantages, and the decision should be made based on your specific needs in terms of performance, memory usage, and error handling.

Personally, I am thrilled to have discovered ActiveRecord-Copy and look forward to applying it to my projects!

Learn More

Top comments (3)

Collapse
 
waghanza profile image
Marwan Rabbâa

Great tips, thanks ❤️

Just to notice that 1 billions records will not take 30 seconds to insert (insert_all). It was 3 minutes on my machine

Collapse
 
pimp_my_ruby profile image
Pimp My Ruby

Hi Marwan !

What is your computer ? I may need to specify in my article that I'm running all my benchmark on my M1 Pro so maybe if you have better specs than mine ofc the benchmark will be faster.

I will add this detail to the next articles, thanks for feedback!

Collapse
 
waghanza profile image
Marwan Rabbâa

I don't exactly remember what is my hardware, but it's nearly an M2.

Btw, it is more about saying that we can't just multiply those figures (100K = 3s, 1M = 30s ...).

PS : Sorry it was 1 million, not 1 billion