Zachary Becker

Querying CSV Files Using Postgres

While reading PostgreSQL: Up and Running, I noticed a few features that make it easy to import data from a CSV file into Postgres. However, the example in the book required a complex CREATE TABLE query that was specific to the CSV file being loaded.

I figured it would be cool if we could load an arbitrary CSV file and automatically create a table based on its first line, which often contains the column names for the data in the file.

Sample Dataset

To test this out, I used a free data set (Electric Vehicle Population Data) from the State of Washington which contains data about registered electric vehicles in the state.

You can easily grab this data set with the following shell command:

wget \
  https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD \
  -O Electric_Vehicle_Population_Data.csv

Load Column Data

This CSV file contains the expected header row

VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract

In order to generate a CREATE TABLE command to create a table based on these columns, we first need to load the column names into PostgreSQL.

However, since we don't care about persisting this information past the task of creating the table of electric vehicle data, we can load it into a temporary table.

From a psql prompt, I created this temporary table and loaded it with the following commands

CREATE TEMP TABLE "csv_columns" ( "name" TEXT );
\copy "csv_columns" FROM PROGRAM 'head -n 1 Electric_Vehicle_Population_Data.csv | sed ''s/,/\n/g'''

This command reads the first line of our CSV file and replaces the commas with newlines. The \copy command takes each line of that output and inserts it as a row in our csv_columns table.
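To see what that pipeline produces on its own, here is a quick stand-in using echo in place of the real file (the actual header is much longer). Note that `\n` in a sed replacement requires GNU sed; on BSD/macOS sed you would need a different approach, such as `tr ',' '\n'`.

```shell
# Stand-in for reading the header line from the file with head -n 1.
echo 'VIN (1-10),County,City' | sed 's/,/\n/g'
# VIN (1-10)
# County
# City
```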

Loading CSV Data

With psql, we can use a SELECT statement to construct a query string from our columns table, then execute each row of the result as a SQL statement with the \gexec meta-command.

We employ the built-in functions array_to_string, which joins array elements with a delimiter (like strings.Join in many languages), and array_agg, which aggregates values from multiple rows into a single array.
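To see these two functions in isolation, here is a small self-contained query you can run from any psql session (the values are made up for illustration; ORDER BY is added to make the aggregation order deterministic):

```sql
-- Aggregate three rows into an array, then join the array with commas.
SELECT array_to_string(array_agg("v" ORDER BY "v"), ',')
FROM (VALUES ('a'), ('b'), ('c')) AS "t"("v");
-- Result: a,b,c
```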

We can create the table based on the loaded column information using the following command

SELECT
    'CREATE TABLE "public"."electric_vehicles" (' ||
        array_to_string(array_agg('"' || "name" || '" TEXT'), ',') ||
    ');'
FROM "csv_columns" \gexec

This effectively executes the SQL query

CREATE TABLE "public"."electric_vehicles" (
    "VIN (1-10)" TEXT,
    "County" TEXT,
    "City" TEXT,
    "State" TEXT,
    "Postal Code" TEXT,
    "Model Year" TEXT,
    "Make" TEXT,
    "Model" TEXT,
    "Electric Vehicle Type" TEXT,
    "Clean Alternative Fuel Vehicle (CAFV) Eligibility" TEXT,
    "Electric Range" TEXT,
    "Base MSRP" TEXT,
    "Legislative District" TEXT,
    "DOL Vehicle ID" TEXT,
    "Vehicle Location" TEXT,
    "Electric Utility" TEXT,
    "2020 Census Tract" TEXT
);

Once we have our table, we can load the CSV data with the \copy command. The HEADER option skips the first line, so the column names aren't loaded as a data row.

\copy "public"."electric_vehicles" FROM 'Electric_Vehicle_Population_Data.csv' CSV HEADER

Querying CSV Data in PostgreSQL

We can now query our CSV data as a SQL table. For example, this query returns the top 5 makes of electric vehicles registered in Washington State.

SELECT
    "Make",
    count(*) AS "Count"
FROM "public"."electric_vehicles"
GROUP BY "Make" 
ORDER BY "Count" DESC
LIMIT 5;

Which outputs

   Make    | Count
-----------+-------
 TESLA     | 65552
 NISSAN    | 13317
 CHEVROLET | 11816
 FORD      |  7307
 BMW       |  6209
(5 rows)

Conclusions

I found this a really interesting way to get to know PostgreSQL better, and perhaps a handy way to explore CSV data in the future.

Code

Here is a full psql script to load and query the Electric Vehicle Population Data CSV file. I named it create_ev_dataset.sql, and execute it with psql -f create_ev_dataset.sql. It assumes the CSV is in the directory that you ran psql from.

-- create a temporary table to load the column names into
-- this table only exists for the duration of the connection
CREATE TEMP TABLE "csv_columns" ( "name" TEXT );

-- split the first line of the csv file to get the column names on individual lines, and copy them into our temporary table
\copy "csv_columns" FROM PROGRAM 'head -n 1 Electric_Vehicle_Population_Data.csv | sed ''s/,/\n/g'''

-- use our temporary table of column names to construct a table to store csv data in
SELECT
    'CREATE TABLE "public"."electric_vehicles" (' ||
        array_to_string(array_agg('"' || "name" || '" TEXT'), ',') ||
    ');'
FROM "csv_columns" \gexec

-- copy csv data into the table, skipping the header row
\copy "public"."electric_vehicles" FROM 'Electric_Vehicle_Population_Data.csv' CSV HEADER

-- query csv data using postgres
SELECT
    "Make",
    count(*) AS "Count"
FROM "public"."electric_vehicles"
GROUP BY "Make" 
ORDER BY "Count" DESC
LIMIT 5;
