I’ve only used Python for exploring and manipulating data with Pandas. I’ve taken on a project where we are doing some data transformations in Node, and now the data and complexity have grown to a point where it makes more sense to schedule a Python job to take over.
I need two kinds of help.
First, I’d like to find a community-recommended course, paid or free, to level up my basic Python skills.
Second, the list of Python libraries is just awesome, and I don’t know which ones will be most useful. I’m collecting data from three different types of sources, SQL, Mongo, and some JSON from Stripe. After transformation, the data is going into Mongo via GridFS. Also, some of the datasets are very large, so working with everything in memory at once is going to be a challenge. Previously I’ve dumped data into CSVs and read those into Pandas, but there are so many things in this landscape that I’m unaware of, I’m certain there’s a better way.
Thanks in advance!
Top comments (5)
Ok, so you don't have that many options for courses with just four hours available. I'd recommend looking into algorithms and data structures, how and why to avoid in-memory workloads, and how to work with streaming data. There are a lot of resources online, so be ready to read a lot of tutorials.
Personally, I'd go with standard Python for the task. You can use the csv module to load and write your CSVs; it is a bit faster than pandas, and you can do streaming inserts, which I think is great.
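To make the streaming idea concrete, here is a minimal sketch using only the standard library. It reads a CSV row by row and hands rows to a sink in fixed-size batches, so only one batch is in memory at a time. The names `iter_batches`, `stream_csv`, and `insert_batch` are made up for illustration; `insert_batch` is a stand-in for whatever bulk-insert call your database driver offers.

```python
import csv
import io
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TextIO


def iter_batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size rows without loading the whole input."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def stream_csv(
    source: TextIO,
    insert_batch: Callable[[List[dict]], None],
    batch_size: int = 1000,
) -> int:
    """Read CSV rows lazily and pass them to insert_batch one batch at a time.

    insert_batch is a hypothetical callback; swap in your own bulk insert.
    Returns the number of rows processed.
    """
    reader = csv.DictReader(source)  # lazy: parses one row per iteration
    total = 0
    for batch in iter_batches(reader, batch_size):
        insert_batch(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    # Tiny in-memory sample; with a real file, use open(path, newline="").
    sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")
    stored: List[dict] = []
    count = stream_csv(sample, stored.extend, batch_size=2)
    print(count, stored)
```

Note that `csv.DictReader` returns every value as a string, so any type conversion has to happen in your transformation step.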
What I would do:
libraries:
Tips:
Use as much logging as you can. This will save your day.
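As a rough illustration of what that can look like in a transformation step, here is a small sketch with the standard `logging` module. The `transform` function and the `amount` field are invented for the example; the point is that skipped records and per-run counts end up in the log instead of disappearing silently.

```python
import logging
from typing import List

logger = logging.getLogger("etl")


def configure_logging(level: int = logging.INFO) -> None:
    """Send timestamped log lines to stderr; call once at program start."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def transform(records: List[dict]) -> List[dict]:
    """Convert the (hypothetical) 'amount' field to int, logging bad records."""
    good = []
    for index, record in enumerate(records):
        try:
            good.append({**record, "amount": int(record["amount"])})
        except (KeyError, ValueError):
            # Log and skip rather than crash the whole job on one bad row.
            logger.warning("Skipping record %d: %r", index, record)
    logger.info("Transformed %d of %d records", len(good), len(records))
    return good


if __name__ == "__main__":
    configure_logging()
    transform([{"amount": "5"}, {"amount": "oops"}, {}])
```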
Book:
Python Cookbook
There is a really nice up-and-coming project called Bonobo (bonobo-project.org/). I have been watching this one really closely.
I love Pandas for ETL, but I really like the graphing and visualization that Bonobo provides.
No course to reference but this was a good read for me.
towardsdatascience.com/streaming-t...
towardsdatascience.com/building-an...
To answer your first question: I would recommend Introduction to CS and Programming Using Python on edX, which is free, and also the Python courses on Teamtreehouse, which needs a paid membership but is free for 30 days.
There are quite a few good channels on YouTube. Personally I love Socratica, but look around and see which one you like the most.
Books: I don't think I ever picked one up for Python, so I can't help you there.
Working effectively with it will probably take more than 4 hours, but if you are facing problems with datasets that do not fit in memory, PySpark might be the way to go. If you are really interested in this direction, Udacity has a "Data Engineering" nanodegree. I cannot recommend the degree without reservations; nevertheless, with additional study of the surrounding topics, I learned quite a bit.
@sheyd got any advice here?