I finished setting up the first schema with everything needed to CRUD (create / read / update / delete) a record in the `destinations` table. The views and forms are functional, but they definitely need some love, particularly the index view. But before I dive into any styling, it would be beneficial to get some data into the index table. I could have used the app's newly created HTML forms and generated the data one record at a time, but it's far more efficient (and beneficial down the road) to utilize the `seeds.exs` file. Using this method, the database can be populated quickly and consistently.
Phoenix ships with a seed file, located at `priv/repo/seeds.exs`.
Functions in the seed file don't have to be complex; it's simply an automated way to insert data into the table. A lot of the time the data being inserted is fake, used primarily as a kind of placeholder. But in my case I already had a spreadsheet with all the data, and figured that since I was going through this process to inject data, it might as well be real data. Taking the time to write functionality to ingest data from a spreadsheet would also allow me to batch upload data through the browser at some point, which could be handy. Lastly, this kind of spreadsheet ingestion was handled in my last work project, so I had some experience with it, as well as some existing functions to guide me through the process.
Because of the potential dual use of the ingestion code I would be writing, I created a new `CSVUtil.ex` file and housed it under the `Mapping` context. I did a bit of digging to see what libraries were available for processing a CSV (that's the format I exported the spreadsheet to for ingestion), and NimbleCSV was the clear front-runner. I wasn't surprised, as that's what we used on my work project.
Knowing that I would use the same library as I already had experience with, I took an iterative approach to writing the functionality. Basically, what happened over the course of a day was:
- Starting with the work project functions as a rough scaffold, I implemented my own version of CSV ingestion.
- My initial version was a little verbose and clunky, but functional.
- I started refactoring my work and seeing how and where it was similar to the work project code.
- Ultimately I ended up with a good bit of similarity to the work project code, but with a much better understanding of what was happening from start to finish.
I will be discussing the end result of that process shortly. But first things first: the NimbleCSV library needed to be added to the dependencies in `mix.exs`.
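In `mix.exs`, that's one new entry in the `deps` list (the version constraint below is my assumption; check hex.pm for the current release):

```elixir
defp deps do
  [
    # ... existing Phoenix deps ...
    {:nimble_csv, "~> 1.1"}  # assumed version constraint
  ]
end
```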
Then I fetched the new dependency with:
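That's the standard Mix command:

```shell
mix deps.get
```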
Once that finished, I rebuilt the docker image:
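Assuming a standard docker-compose setup (my exact service names aren't shown here), the rebuild is:

```shell
docker-compose build
```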
And, finally, spun it up again:
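Again assuming the standard compose workflow:

```shell
docker-compose up
```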
With the library added to the project, I added an alias for the implementation of NimbleCSV I planned on using in `CSVUtil.ex`. This allows me to just call `CSV.<function>` instead of `NimbleCSV.RFC4180.<function>` every time.
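That alias is the standard one from the NimbleCSV documentation:

```elixir
alias NimbleCSV.RFC4180, as: CSV
```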
Pulling from my previous experience, I knew I needed to either hardcode the CSV headers or grab them from the file. Initially I hardcoded them, but for future versatility I went with this function that extracts the headers and maps them to an index:
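The original snippet isn't reproduced here; this is my reconstruction, consistent with the breakdown that follows (`column_names/1` is the name the post references later):

```elixir
def column_names(file) do
  file
  |> File.stream!()
  |> CSV.parse_stream(skip_headers: false)
  |> Enum.take(1)
  |> List.flatten()
  |> Enum.with_index()
  |> Map.new(fn {name, index} -> {index, name} end)
end
```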
The breakdown of that function:
`File.stream!()` takes in the CSV file and streams it line by line, which is perfect for creating a record from each row of the CSV. This function alone only returns a struct with data about the file; it's best utilized when another function is piped into it.
`CSV.parse_stream()` takes the output of `File.stream!()`, which broke that file down line by line, and parses it into a stream of rows. The default action of `parse_stream` is to exclude the headers, so I need to explicitly state that I want to include the headers here. The rows aren't actually read until they are looped over.
This returns only the first row of the stream and drops the rest. At this point the results start to take shape into something usable:
This takes that list of strings and creates a list of tuples, the string as the first element and its index as the other, like this:
This converts the list of tuples into a map, using the index value as the key and the string as the value:
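With hypothetical headers (not my actual spreadsheet's columns), those last three steps shape up like this:

```elixir
# after taking and flattening the first row: a list of header strings
["name", "latitude", "longitude"]

# after Enum.with_index/1: a list of {string, index} tuples
[{"name", 0}, {"latitude", 1}, {"longitude", 2}]

# after Map.new/2: a map of index => header string
%{0 => "name", 1 => "latitude", 2 => "longitude"}
```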
The need for the indexing and conversion to key/value pairs will reveal itself in the next step.
The next function, `csv_row_to_table_record/1`, calls the first, sets the result as a variable, and then works through the rest of the CSV file to insert each row into the `destinations` table.
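Reconstructed from the breakdown that follows, the function plausibly looks like this (`create_or_skip/1` is my placeholder name for the custom insert function discussed below):

```elixir
def csv_row_to_table_record(file) do
  column_names = column_names(file)

  file
  |> File.stream!()
  |> CSV.parse_stream()
  |> Enum.map(fn row ->
    row
    |> Enum.with_index()
    |> Map.new(fn {value, index} -> {column_names[index], value} end)
    |> create_or_skip()
  end)
end
```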
I'll skip the first two pipes, since they were covered above, and start with:
That takes each row and adds an index to each field in that row. A small sample of the output looks like this:
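With made-up values, something like:

```elixir
[{"Lisbon", 0}, {"38.7223", 1}, {"-9.1393", 2}]
```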
Then there is:
This takes each of those rows, creates a new map from it, and uses the indexing to match the header value from `column_names` to the row value, the result of which is:
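Again with made-up values:

```elixir
%{"name" => "Lisbon", "latitude" => "38.7223", "longitude" => "-9.1393"}
```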
And, finally, there is a custom function to use that map to insert the data into the table:
The goal with this function was to be able to update that CSV, execute the seeding file, and only have the new records added to the database. So the standard create function, `Mapping.create_destination/1`, wouldn't work here; it would add the entirety of the CSV to the table each time, instead of only the new records. This is what the custom function looks like:
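A sketch of that function (the name `create_or_skip/1`, the aliases, and the exact column names are my assumptions):

```elixir
def create_or_skip(row) do
  # look for an existing record with the same coordinates
  case Repo.get_by(Destination,
         latitude: row["latitude"],
         longitude: row["longitude"]
       ) do
    # no match: insert the row via the context's create function
    nil -> Mapping.create_destination(row)
    # match: return a tuple instead of creating a duplicate
    destination -> {:ok, destination}
  end
end
```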
`row` is the map created in the pipe above the call to this function. The `case` is searching the `destinations` table for the lat/long coordinates contained in that map. If that search returns `nil`, the map and `Mapping.create_destination` are utilized to create and insert a new record. Because of the header/value pairing that happened in the creation of the map, I can reference the value of the fields with this syntax:
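That is, map access with the header string as the key, e.g. (field name assumed):

```elixir
row["latitude"]
```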
If those coordinates do exist, it will return a tuple instead of creating a new record.
This logic might be easier to parse when seeing everything at one time, here is the file in its entirety:
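Since the original file isn't reproduced here, this is my reconstruction of `CSVUtil.ex` assembled from the pieces above (the app/module names and schema fields are assumptions):

```elixir
defmodule TravelApp.Mapping.CSVUtil do
  @moduledoc """
  Ingests a CSV export of the destinations spreadsheet and seeds
  the destinations table with any rows not already present.
  """

  alias NimbleCSV.RFC4180, as: CSV

  alias TravelApp.Mapping
  alias TravelApp.Mapping.Destination
  alias TravelApp.Repo

  # Streams the CSV, pairs each row's values with the header names,
  # and inserts any records that don't already exist.
  def csv_row_to_table_record(file) do
    column_names = column_names(file)

    file
    |> File.stream!()
    |> CSV.parse_stream()
    |> Enum.map(fn row ->
      row
      |> Enum.with_index()
      |> Map.new(fn {value, index} -> {column_names[index], value} end)
      |> create_or_skip()
    end)
  end

  # Builds a map of column index => header string from the first row.
  def column_names(file) do
    file
    |> File.stream!()
    |> CSV.parse_stream(skip_headers: false)
    |> Enum.take(1)
    |> List.flatten()
    |> Enum.with_index()
    |> Map.new(fn {name, index} -> {index, name} end)
  end

  # Inserts the row only if no record with the same coordinates exists.
  def create_or_skip(row) do
    case Repo.get_by(Destination,
           latitude: row["latitude"],
           longitude: row["longitude"]
         ) do
      nil -> Mapping.create_destination(row)
      destination -> {:ok, destination}
    end
  end
end
```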
The spreadsheet I've been using to hold my data up to this point is a Google Sheet. I exported it as a CSV, and created a new folder at:
To store this file and any potential future files of this nature.
With all that in place, the code in `seeds.exs` is actually only one line:
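Presumably just a call into the util module with the CSV's path (both the module name and the path here are placeholders):

```elixir
TravelApp.Mapping.CSVUtil.csv_row_to_table_record("priv/repo/data/destinations.csv")
```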
And, as the notes in the seed file mention, that can be executed by running:
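That's the command documented at the top of the generated seed file:

```shell
mix run priv/repo/seeds.exs
```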
This was definitely a more complex data seeding than is typically performed as the first seed of a new project. It would have taken a good amount of time to manually create dummy records, and if I ever needed to drop the database and rebuild it, I would have had to manually insert records again. Instead, I took a bit more time and created reusable logic that can seed the database with actual records in the blink of an eye. I also now have the foundation to do more work with CSVs, which seems like a feasible option for mass uploading in the future.
As always, here is my repo at the end of this block of work.
This post is part of an ongoing series. I encourage any critique, feedback, or suggestions in the comments.