DEV Community

Gabriel Chuan
Gabriel Chuan

Posted on • Edited on

Merge CSV files in Ruby efficiently

Ever had issues merging CSV files in Ruby?

⚰️ Typical problems

• you load into memory (problem if csvs are large)
• parse line-by-line (slow)

💡 Solution

Use sqlite instead! Theoretically, it can handle large csvs, and is very fast.


Update: I'm aware that the code below isn't optimal. It works, but it could be better. At the moment, it loads the entire CSV file still because it's using CSV.open(...). The correct way would be to stream it and load the data into SQLite. Working on a fix, and will update here in 1-2 days.


How does it work?

Preparing the columns

We will have to first extract each csv columns, and transform it into a string that SQLite can accept. Spaces, etc are removed.

Not acceptable: Column 1
Acceptable: Column_1

# Sanitizes the column name so we can create a sqlite column with it
def sanitize_name(name)
  name.strip.gsub(/[^\w\ufeff]/, '_').downcase
end
Enter fullscreen mode Exit fullscreen mode

Load the csv into SQLite

We then use the transformed column names, which are compatible with SQLite to then create the tables with the corresponding columns.

Once that's done, we insert each row into the corresponding SQLite table.

def load_csv_to_sqlite(file, table_name, db)
  CSV.open(file, headers: true) do |csv|
    original_headers = csv.first.headers
    sanitized_headers = original_headers.map { |header| sanitize_name(header) }

    db.execute <<-SQL
      CREATE TABLE #{table_name} (
        #{sanitized_headers.map { |header| "#{header} TEXT" }.join(', ')}
      );
    SQL

    csv.each do |row|
      db.execute "INSERT INTO #{table_name} VALUES (#{(['?'] * sanitized_headers.size).join(', ')})", row.fields
    end

    original_headers.zip(sanitized_headers).to_h
  end
end
Enter fullscreen mode Exit fullscreen mode

Merging columns

To merge 2 separate columns, ensure that both CSVs have the same column names. Open it in a text editor and manually edit it.

The following code then uses a SQLite query to join both CSVs together on that selected column to merge on.

For this to work, we assume that columns on both files have the exact same value. If they don't, you will have to edit the code to transform them.

def merge_csv_files(file1, file2, output_file, key_column)
  db = SQLite3::Database.new 'temp.sqlite'

  header_mapping1 = load_csv_to_sqlite(file1, 'table1', db)
  header_mapping2 = load_csv_to_sqlite(file2, 'table2', db)

  sanitized_key_column = sanitize_name(key_column)

  merged = db.execute2 <<-SQL
    SELECT * FROM table1
    INNER JOIN table2
    ON table1.#{sanitized_key_column} = table2.#{sanitized_key_column};
  SQL

  original_headers = merged.first.map do |sanitized_header|
    header_mapping1[sanitized_header] || header_mapping2[sanitized_header] || sanitized_header
  end

  CSV.open(output_file, 'w') do |csv|
    csv << original_headers
    merged.each { |row| csv << row }
  end

  # db.execute "DROP TABLE table1"
  # db.execute "DROP TABLE table2"
end

merge_csv_files('external.csv', 'notion.csv', 'output.csv', 'Name')
Enter fullscreen mode Exit fullscreen mode

What can be improved?

Transforming mapped columns before comparing them

It would be great to add support for easily mapping columns with a specified transformer method. An example of how this could work:

merge_csv_files('external.csv', 'notion.csv', 'output.csv', ['Name'], {'Name' => method(:normalize_domain)})
Enter fullscreen mode Exit fullscreen mode

Full code

require 'csv'
require 'sqlite3'
require 'pry' # add binding.pry in code to see inspect how it works

# Sanitizes the column name so we can create a sqlite column with it
def sanitize_name(name)
  name.strip.gsub(/[^\w\ufeff]/, '_').downcase
end

def load_csv_to_sqlite(file, table_name, db)
  CSV.open(file, headers: true) do |csv|
    original_headers = csv.first.headers
    sanitized_headers = original_headers.map { |header| sanitize_name(header) }

    db.execute <<-SQL
      CREATE TABLE #{table_name} (
        #{sanitized_headers.map { |header| "#{header} TEXT" }.join(', ')}
      );
    SQL

    csv.each do |row|
      db.execute "INSERT INTO #{table_name} VALUES (#{(['?'] * sanitized_headers.size).join(', ')})", row.fields
    end

    original_headers.zip(sanitized_headers).to_h
  end
end

def merge_csv_files(file1, file2, output_file, key_column)
  db = SQLite3::Database.new 'temp.sqlite'

  header_mapping1 = load_csv_to_sqlite(file1, 'table1', db)
  header_mapping2 = load_csv_to_sqlite(file2, 'table2', db)

  sanitized_key_column = sanitize_name(key_column)

  merged = db.execute2 <<-SQL
    SELECT * FROM table1
    INNER JOIN table2
    ON table1.#{sanitized_key_column} = table2.#{sanitized_key_column};
  SQL

  original_headers = merged.first.map do |sanitized_header|
    header_mapping1[sanitized_header] || header_mapping2[sanitized_header] || sanitized_header
  end

  CSV.open(output_file, 'w') do |csv|
    csv << original_headers
    merged.each { |row| csv << row }
  end

  # db.execute "DROP TABLE table1"
  # db.execute "DROP TABLE table2"
end

merge_csv_files('external.csv', 'notion.csv', 'output.csv', 'Name')
Enter fullscreen mode Exit fullscreen mode

Liked it?

Please support me by following more of my content on Dev and Twitter!

Let me know in the comments how this could be improved, and whether you'd like me to explore more about using SQLite to parse and manipulate CSVs in Ruby. I think there is a lot of potential on what could be achieved with this approach.

Originally tweeted on X here.

Top comments (0)