DEV Community

My first ETL in Python

Jesse M. Holmes on June 19, 2019

I’ve only used Python for exploring and manipulating data with Pandas. I’ve taken on a project where we are doing some data transformations in Node...
Collapse
 
claudiodavi profile image
Claudio Davi

Ok, so you don't have that many options for courses with just four hours available. I'd recommend you to look into some algorithms and data structures, how and why to avoid in memory workloads and how to use streaming data. You can find a lot of resources online and be ready to read a lot of tutorials.

Personally, I'd go with standard python for the task. You can use the csv module to load and write your CSVs and is a bit faster than pandas, you can do streaming insert which I think is great.

What I would do:

  • Create namedtuples with the format of your data that is going to be fed to your final DB. This will give you a standard object formatting.
  • Connecting to MySQL database I recommend using pymysql, specially the SSDictCursor for reading queries. This will give you streaming data one row at a time.
  • For other connections I believe you should search for streaming readers, try to always store in transitory files and upload as you go, do not keep all you data in memory, that can lead to several memory issues.
  • If your transformation requires group by or any other analytics methods I'd go with pandas or even dask if performance is a issue, however most of the time pandas will do the job.

libraries:

  • PyMySql
  • csv
  • pandas
  • requests (for REST api)
  • pymongo

Tips:

use as much logging as you can. This will save your day.

Book:

Python Cookbook

Collapse
 
elrdevelopment profile image
Erick • Edited

There is a really nice up and coming project called bonobo project (bonobo-project.org/). I have been watching this one really closely.

I love Pandas for ETL, but I really like the graphing and visualization that bonobos provides.

No course to reference but this was a good read for me.

towardsdatascience.com/streaming-t...

towardsdatascience.com/building-an...

Collapse
 
thefern profile image
Fernando B 🚀

To answer your first question: I would recommend Introduction to CS and programming using Python on Edx is free, and also Python courses on Teamtreehouse paid membership but free for 30days.

There are quite a few good channels on YouTube personally I love Socratica but look around and see which one you like the most.

Books I don't think I ever picked one up for python so can't help you there.

Collapse
 
gerreth profile image
Gerret Halberstadt

Though I guess effectively working with it will take some more time than 4 hours, if you are facing problems with datasets that do not fit in memory, pyspark might be a way to go. If you are really interested in this direction, udacity has "Data Engineering" nanodegree. I can not recommend the degree without reservations, nevertheless, with additional studies about surrounding topics I learned quite a bit.

Collapse
 
jess profile image
Jess Lee

@sheyd got any advice here?