My first ETL in Python

#help #python

I’ve only used Python for exploring and manipulating data with Pandas. I’ve taken on a project where we are doing some data transformations in Node, and now the data and complexity have grown to a point where it makes more sense to schedule a Python job to take over.

I need two kinds of help.

First, I’d like to find a community recommended course, paid or free, to level up my basic Python skills.

Second, the list of Python libraries is just awesome, and I don’t know which ones will be most useful. I’m collecting data from three different types of sources, SQL, Mongo, and some JSON from Stripe. After transformation, the data is going into Mongo via GridFS. Also, some of the datasets are very large, so working with everything in memory at once is going to be a challenge. Previously I’ve dumped data into CSVs and read those into Pandas, but there are so many things in this landscape that I’m unaware of, I’m certain there’s a better way.

Thanks in advance!

Top comments (5)

Claudio Davi • Jun 19 '19

Ok, so you don't have that many options for courses with just four hours available. I'd recommend you to look into some algorithms and data structures, how and why to avoid in memory workloads and how to use streaming data. You can find a lot of resources online and be ready to read a lot of tutorials.

Personally, I'd go with standard python for the task. You can use the csv module to load and write your CSVs and is a bit faster than pandas, you can do streaming insert which I think is great.

What I would do:

Create namedtuples with the format of your data that is going to be fed to your final DB. This will give you a standard object formatting.
Connecting to MySQL database I recommend using pymysql, specially the SSDictCursor for reading queries. This will give you streaming data one row at a time.
For other connections I believe you should search for streaming readers, try to always store in transitory files and upload as you go, do not keep all you data in memory, that can lead to several memory issues.
If your transformation requires group by or any other analytics methods I'd go with pandas or even dask if performance is a issue, however most of the time pandas will do the job.

libraries:

PyMySql
csv
pandas
requests (for REST api)
pymongo

Tips:

use as much logging as you can. This will save your day.

Book:

Python Cookbook

Erick • Jun 21 '19 • Edited

There is a really nice up and coming project called bonobo project (bonobo-project.org/). I have been watching this one really closely.

I love Pandas for ETL, but I really like the graphing and visualization that bonobos provides.

No course to reference but this was a good read for me.

towardsdatascience.com/streaming-t...

towardsdatascience.com/building-an...

Fernando B 🚀 • Jun 19 '19

To answer your first question: I would recommend Introduction to CS and programming using Python on Edx is free, and also Python courses on Teamtreehouse paid membership but free for 30days.

There are quite a few good channels on YouTube personally I love Socratica but look around and see which one you like the most.

Books I don't think I ever picked one up for python so can't help you there.

Gerret Halberstadt • Jun 20 '19

Though I guess effectively working with it will take some more time than 4 hours, if you are facing problems with datasets that do not fit in memory, pyspark might be a way to go. If you are really interested in this direction, udacity has "Data Engineering" nanodegree. I can not recommend the degree without reservations, nevertheless, with additional studies about surrounding topics I learned quite a bit.