Ever had issues merging CSV files in Ruby?
⚰️ Typical problems
• you load everything into memory (a problem if the CSVs are large)
• you parse line by line (slow)
💡 Solution
Use SQLite instead! It can handle large CSVs, and joining tables is what it's built for.
Update: I'm aware that the code below isn't optimal. It works, but it could be better: at the moment it still reads the entire CSV because of how it uses CSV.open(...). The correct way would be to stream the file and load the data into SQLite row by row. I'm working on a fix and will update here in 1-2 days.
How does it work?
Preparing the columns
We first have to extract each CSV column name and transform it into an identifier that SQLite can accept: spaces and other special characters are replaced with underscores.
Not acceptable: Column 1
Acceptable: Column_1
# Sanitizes the column name so we can create a SQLite column with it
def sanitize_name(name)
  name.strip.gsub(/[^\w\ufeff]/, '_').downcase
end
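For example (the helper is repeated here so the snippet runs standalone):

```ruby
# Same helper as above, repeated so this snippet runs standalone
def sanitize_name(name)
  name.strip.gsub(/[^\w\ufeff]/, '_').downcase
end

sanitize_name('Column 1')     # => "column_1"
sanitize_name(' First Name ') # => "first_name"
```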
Load the csv into SQLite
We then use the transformed column names, which are compatible with SQLite, to create a table with the corresponding columns.
Once that's done, we insert each row into the corresponding SQLite table.
def load_csv_to_sqlite(file, table_name, db)
  CSV.open(file, headers: true) do |csv|
    # Reading the first row forces the headers to be parsed; keep the row
    # so we can insert it too (csv.each would otherwise skip it)
    first_row = csv.first
    original_headers = first_row.headers
    sanitized_headers = original_headers.map { |header| sanitize_name(header) }

    db.execute <<-SQL
      CREATE TABLE #{table_name} (
        #{sanitized_headers.map { |header| "#{header} TEXT" }.join(', ')}
      );
    SQL

    insert_sql = "INSERT INTO #{table_name} VALUES (#{(['?'] * sanitized_headers.size).join(', ')})"
    db.execute insert_sql, first_row.fields
    csv.each { |row| db.execute insert_sql, row.fields }

    # Map each sanitized header back to its original name so the output
    # CSV can restore the original column names
    sanitized_headers.zip(original_headers).to_h
  end
end
Merging columns
To merge the two files, ensure that both CSVs use the same name for the key column. If they don't, open one of them in a text editor and rename the column manually.
The following code then uses an SQLite query to join both tables on the selected key column.
For this to work, we assume that the key values in both files match exactly. If they don't, you will have to edit the code to transform them first.
def merge_csv_files(file1, file2, output_file, key_column)
  db = SQLite3::Database.new 'temp.sqlite'
  header_mapping1 = load_csv_to_sqlite(file1, 'table1', db)
  header_mapping2 = load_csv_to_sqlite(file2, 'table2', db)
  sanitized_key_column = sanitize_name(key_column)

  # execute2 returns the column names as the first row, followed by the data rows
  columns, *rows = db.execute2 <<-SQL
    SELECT * FROM table1
    INNER JOIN table2
    ON table1.#{sanitized_key_column} = table2.#{sanitized_key_column};
  SQL

  original_headers = columns.map do |sanitized_header|
    header_mapping1[sanitized_header] || header_mapping2[sanitized_header] || sanitized_header
  end

  CSV.open(output_file, 'w') do |csv|
    csv << original_headers
    rows.each { |row| csv << row }
  end

  # db.execute "DROP TABLE table1"
  # db.execute "DROP TABLE table2"
end
merge_csv_files('external.csv', 'notion.csv', 'output.csv', 'Name')
What can be improved?
Transforming mapped columns before comparing them
It would be great to add support for easily mapping columns with a specified transformer method. An example of how this could work:
merge_csv_files('external.csv', 'notion.csv', 'output.csv', ['Name'], {'Name' => method(:normalize_domain)})
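A sketch of what the transformer step could look like (normalize_domain and transform_fields are hypothetical helpers, not part of the code above): each transformer runs on the values of its mapped column before they are inserted, so both tables end up storing the same normalized key.

```ruby
require 'csv'

# Hypothetical transformer: normalizes a domain-like value so that
# ' Example.COM ' and 'example.com' join on the same key
def normalize_domain(value)
  value.to_s.strip.downcase
end

# Sketch: apply transformers ({'Column name' => method}) to a CSV row's
# fields, leaving unmapped columns untouched
def transform_fields(row, transformers)
  row.headers.zip(row.fields).map do |header, value|
    transformer = transformers[header]
    transformer ? transformer.call(value) : value
  end
end

row = CSV::Row.new(['Name', 'Plan'], [' Example.COM ', 'Pro'])
transform_fields(row, { 'Name' => method(:normalize_domain) })
# => ["example.com", "Pro"]
```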
Full code
require 'csv'
require 'sqlite3'
require 'pry' # add binding.pry in the code to inspect how it works

# Sanitizes the column name so we can create a SQLite column with it
def sanitize_name(name)
  name.strip.gsub(/[^\w\ufeff]/, '_').downcase
end

def load_csv_to_sqlite(file, table_name, db)
  CSV.open(file, headers: true) do |csv|
    # Reading the first row forces the headers to be parsed; keep the row
    # so we can insert it too (csv.each would otherwise skip it)
    first_row = csv.first
    original_headers = first_row.headers
    sanitized_headers = original_headers.map { |header| sanitize_name(header) }

    db.execute <<-SQL
      CREATE TABLE #{table_name} (
        #{sanitized_headers.map { |header| "#{header} TEXT" }.join(', ')}
      );
    SQL

    insert_sql = "INSERT INTO #{table_name} VALUES (#{(['?'] * sanitized_headers.size).join(', ')})"
    db.execute insert_sql, first_row.fields
    csv.each { |row| db.execute insert_sql, row.fields }

    # Map each sanitized header back to its original name so the output
    # CSV can restore the original column names
    sanitized_headers.zip(original_headers).to_h
  end
end

def merge_csv_files(file1, file2, output_file, key_column)
  db = SQLite3::Database.new 'temp.sqlite'
  header_mapping1 = load_csv_to_sqlite(file1, 'table1', db)
  header_mapping2 = load_csv_to_sqlite(file2, 'table2', db)
  sanitized_key_column = sanitize_name(key_column)

  # execute2 returns the column names as the first row, followed by the data rows
  columns, *rows = db.execute2 <<-SQL
    SELECT * FROM table1
    INNER JOIN table2
    ON table1.#{sanitized_key_column} = table2.#{sanitized_key_column};
  SQL

  original_headers = columns.map do |sanitized_header|
    header_mapping1[sanitized_header] || header_mapping2[sanitized_header] || sanitized_header
  end

  CSV.open(output_file, 'w') do |csv|
    csv << original_headers
    rows.each { |row| csv << row }
  end

  # db.execute "DROP TABLE table1"
  # db.execute "DROP TABLE table2"
end
merge_csv_files('external.csv', 'notion.csv', 'output.csv', 'Name')
Liked it?
Please support me by following more of my content on Dev and Twitter!
Let me know in the comments how this could be improved, and whether you'd like me to explore more about using SQLite to parse and manipulate CSVs in Ruby. I think there is a lot of potential in what could be achieved with this approach.
Originally tweeted on X here.