While reading PostgreSQL: Up and Running, I noticed a few features that make it easy to import data from a CSV file into Postgres. However, the example in the book required a complex CREATE TABLE statement that was specific to the CSV file being loaded.
I figured it would be cool if we could load an arbitrary CSV file and automatically create a table based on the first line of the file, which often contains the column names for the data it holds.
Sample Dataset
To test this out, I used a free data set (Electric Vehicle Population Data) from the State of Washington, which contains data about registered electric vehicles in the state.
You can easily grab this data set with the following shell command:
wget \
https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD \
-O Electric_Vehicle_Population_Data.csv
Loading Column Data
This CSV file contains the expected header row:
VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
In order to generate a CREATE TABLE command for these columns, we first need to load the column names into PostgreSQL. However, since we don't care about persisting this information past the task of creating the table of electric vehicle data, we can load it into a temporary table.
From a psql prompt, I created this temporary table and loaded it with the following commands:
CREATE TEMP TABLE "csv_columns" ( "name" TEXT );
\copy "csv_columns" FROM PROGRAM 'head -n 1 Electric_Vehicle_Population_Data.csv | sed ''s/,/\n/g'''
This command reads the first line of our CSV file and replaces the commas with newlines. The \copy command then takes each line of that output and puts it into our csv_columns table.
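To see what that pipeline produces on its own, you can run it against a tiny throwaway CSV. The file and column names below are made up for illustration, and the \n in the sed replacement assumes GNU sed:

```shell
# make a miniature CSV with a header row (hypothetical example data)
printf 'id,name,city\n1,Ada,Seattle\n' > /tmp/mini.csv

# keep only the header line, then turn each comma into a newline,
# producing one column name per line -- the same output that
# \copy ... FROM PROGRAM feeds into the csv_columns table
head -n 1 /tmp/mini.csv | sed 's/,/\n/g'
```

This prints id, name, and city on separate lines, which \copy would then load as three rows.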
Loading CSV Data
With psql, we can construct a query with a SELECT statement based on our columns table, and execute the statement it produces with the \gexec command, which runs each row of the query result as a SQL statement.
We employ the built-in functions array_to_string, which is like strings.Join in Go (and similar join functions in many other languages), and array_agg, which aggregates values from multiple rows into a single array.
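As a rough mental model for what array_agg and array_to_string are doing, here is the same string-building done in plain shell on a hypothetical list of column names (these names are made up, not from the real dataset):

```shell
# stand-in for the rows of our csv_columns table (hypothetical names)
printf 'id\nname\ncity\n' > /tmp/cols.txt

# quote each name and append a TEXT type (the per-row expression),
# then join all the lines with commas (the array_agg +
# array_to_string step)
cols=$(sed 's/.*/"&" TEXT/' /tmp/cols.txt | paste -sd, -)
echo "CREATE TABLE \"public\".\"t\" ($cols);"
```

The real query below does this entirely inside Postgres, so the column names never have to leave the database.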
We can create the table based on the loaded column information using the following command:
SELECT
'CREATE TABLE "public"."electric_vehicles" (' ||
array_to_string(array_agg('"' || "name" || '" TEXT'), ',') ||
');'
FROM "csv_columns" \gexec
This effectively executes the SQL query
CREATE TABLE "public"."electric_vehicles" (
"VIN (1-10)" TEXT,
"County" TEXT,
"City" TEXT,
"State" TEXT,
"Postal Code" TEXT,
"Model Year" TEXT,
"Make" TEXT,
"Model" TEXT,
"Electric Vehicle Type" TEXT,
"Clean Alternative Fuel Vehicle (CAFV) Eligibility" TEXT,
"Electric Range" TEXT,
"Base MSRP" TEXT,
"Legislative District" TEXT,
"DOL Vehicle ID" TEXT,
"Vehicle Location" TEXT,
"Electric Utility" TEXT,
"2020 Census Tract" TEXT
);
Once we have our table, we can load the CSV data with the copy command. Since the first line of the file is the header rather than data, we tell it to skip that line with the HEADER option.
\copy "public"."electric_vehicles" FROM 'Electric_Vehicle_Population_Data.csv' CSV HEADER
Querying CSV Data in PostgreSQL
We can now query our CSV data as a SQL table. For example, this gets the top 5 makes of electric vehicles registered in Washington State:
SELECT
"Make",
count(*) AS "Count"
FROM "public"."electric_vehicles"
GROUP BY "Make"
ORDER BY "Count" DESC
LIMIT 5;
Which outputs:
Make | Count
-----------+-------
TESLA | 65552
NISSAN | 13317
CHEVROLET | 11816
FORD | 7307
BMW | 6209
(5 rows)
Conclusions
I found this a really interesting way to learn PostgreSQL better, and perhaps a way I can grok CSV data in the future.
Code
Here is a full psql script to load and query the Electric Vehicle Population Data CSV file. I named it create_ev_dataset.sql and execute it with psql -f create_ev_dataset.sql. It assumes the CSV is in the directory that you run psql from.
-- create a temporary table to load the column names into
-- this table only exists for the duration of the connection
CREATE TEMP TABLE "csv_columns" ( "name" TEXT );
-- split the first line of the csv file to get the column names on individual lines, and copy them into our temporary table
\copy "csv_columns" FROM PROGRAM 'head -n 1 Electric_Vehicle_Population_Data.csv | sed ''s/,/\n/g'''
-- use our temporary table of column names to construct a table to store csv data in
SELECT
'CREATE TABLE "public"."electric_vehicles" (' ||
array_to_string(array_agg('"' || "name" || '" TEXT'), ',') ||
');'
FROM "csv_columns" \gexec
-- copy csv data into the table, skipping the header row
\copy "public"."electric_vehicles" FROM 'Electric_Vehicle_Population_Data.csv' CSV HEADER
-- query csv data using postgres
SELECT
"Make",
count(*) AS "Count"
FROM "public"."electric_vehicles"
GROUP BY "Make"
ORDER BY "Count" DESC
LIMIT 5;