The Data Scientist and several professionals in the same space work with data from several sources. It does not take long for one to face the challenge of importing data from a CSV file into an SQL database. I am a great fan of the terminal so I want to show you how to do it from the terminal.
Prerequisites
I am going to use a Python Package for this. I suggest that you work in a virtual environment. The Real Python's article on virtual environments is a good resource if need be. I assume you are using Python3.
- Install the needed packages
pip install PyMySQL psycopg2 csvkit
PyMySQL: This is a pure Python MySQL driver which helps us to connect to a MySQL database using Python.
psycopg2: This the most popular Postgresql database adapter for Python.
csvkit: It is a suite of command-line tools for converting to and working with CSV. It provides the csvsql module we are going to use to import the data from our CSV file.
- Download the sample CSV file,customer_info.csv.
For the sake of this tutorial, I have uploaded a CSV file which contains some customer information you can use to test this tool I am introducing. Click this link to download.
- Create a new MySQL or Postgresql database. I named mine test. The data will be imported into a table I have named customer_info.
Importing Your Data
The Quick Approach.
This approach automatically generates the Schema from the CSV data, creates the table and inserts the data into the table.
db_user: This is the database user with access to the test database you created.
password: This is the password of the user.
For MySQL
csvsql --db 'mysql+pymysql://db_user:password@localhost/test' --tables customer_info --insert customer_info.csv
For Postgresql
csvsql --db 'postgresql:///test' --tables customer_info --insert customer_info.csv
The Lengthier (and Safer) Approach.
- Generate the query for creating the customer_info table from your terminal.
For MySQL
csvsql -i mysql customer_info.csv
Expected Result:
CREATE TABLE customer_info (
`contractId` DECIMAL(38, 0) NOT NULL,
`AccountNumber` DECIMAL(38, 0) NOT NULL,
`CreatedAt` TIMESTAMP NULL,
`Amount` DECIMAL(38, 0) NOT NULL,
`Count` DECIMAL(38, 0) NOT NULL,
`Duration` DECIMAL(38, 0) NOT NULL
);
The result above is the query we are going to use to create the customer_info table in the test database. You can edit the generated schema to suit your need. I will leave it as it is in this tutorial.
For Postgresql
csvsql -i postgresql
Expected Result:
CREATE TABLE customer_info (
"contractId" DECIMAL NOT NULL,
"AccountNumber" DECIMAL NOT NULL,
"CreatedAt" TIMESTAMP WITHOUT TIME ZONE,
"Amount" DECIMAL NOT NULL,
"Count" DECIMAL NOT NULL,
"Duration" DECIMAL NOT NULL
);
NOTE THE DIFFERENCE IN HOW EACH DATABASE EXPECTS THE QUERY TO BE. MySQL is wrapped in single quotes but Postgresql is wrapped in double-quotes.
Import your data into the table.
In MySQL
csvsql --db 'mysql+pymysql://db_user:password@host/test' --no-create --insert customer_info.csv
In Postgresql
csvsql --db 'postgresql:///test' --no-create --insert customer_info.csv
SUMMARY
We just learnt how to import data from a CSV file into tables in our MySQL and Postgresql databases with the csvkit package.You can take the direct approach which automatically creates your table and inserts the data.The longer and in my opinion safer approach is to generate the query that creates the table, inspect and edit the generated schema and import your data into the table.
CSVKIT is a powerful tool every Data Scientist should have in his or her toolbox.
Top comments (3)
If using PostgreSQL, there is a super easy way to solve these kinds of tasks - the COPY statement: postgresql.org/docs/current/sql-co...
Syntax looks like:
Thanks for sharing. We are here to learn from one another. I skimmed through the link you shared and my guess is that the table should be created for you to use the COPY command. csvsql doesn't just help to copy from the CSV file into the database table. It helps with schema generation from the CSV file, table creation and inserting data into the created table.
Yeah, unfortunately the schema must exist first when using
COPY
.