DEV Community

Discussion on: My first ETL in Python

Claudio Davi

OK, so with just four hours available you don't have many options for courses. I'd recommend looking into algorithms and data structures, how and why to avoid in-memory workloads, and how to use streaming data. You can find a lot of resources online; be ready to read a lot of tutorials.

Personally, I'd go with standard Python for the task. You can use the csv module to load and write your CSVs; it can be a bit faster than pandas for simple row-by-row work, and it lets you do streaming inserts, which I think is great.
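
As a rough sketch of what streaming with the csv module can look like: the function below copies a CSV one row at a time and applies a transformation on the way. The name `stream_transform` and the `transform` callback are made up for illustration, and it assumes the input file has a header row and that `transform` returns a dict with the same keys.

```python
import csv


def stream_transform(src_path, dst_path, transform):
    """Copy a CSV row by row, applying `transform` to each row dict.

    Hypothetical helper: assumes the source has a header row and that
    `transform` returns a dict with the same keys as its input.
    """
    count = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:  # only the current row is held in memory
            writer.writerow(transform(row))
            count += 1
    return count
```

Calling something like `stream_transform("in.csv", "out.csv", lambda r: {**r, "name": r["name"].upper()})` would rewrite the file without ever loading it whole, so memory use stays flat no matter how big the CSV is.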

What I would do:

  • Create namedtuples that match the format of the data you are going to feed into your final DB. This gives you a standard object format.
  • To connect to a MySQL database, I recommend pymysql, especially the SSDictCursor for read queries. It streams the result to you one row at a time.
  • For other connections, I believe you should search for streaming readers. Try to always store data in transitory files and upload as you go; do not keep all your data in memory, as that can lead to serious memory issues.
  • If your transformation requires a group by or any other analytics methods, I'd go with pandas, or even dask if performance is an issue; most of the time, though, pandas will do the job.
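
Here is a minimal sketch of how the first three points could fit together. Everything named here is an assumption for illustration: the `Customer` fields, the helper names, and the `part_NNNN.csv` file naming are all invented. `open_streaming_connection` expects the usual `pymysql.connect` keyword arguments (host, user, password, database) and imports pymysql lazily, so the rest of the code runs without a database.

```python
import csv
import os
from collections import namedtuple
from itertools import islice

# Hypothetical target format; the field names are illustrative only.
Customer = namedtuple("Customer", ["id", "name", "country"])


def open_streaming_connection(**connect_kwargs):
    """Open a PyMySQL connection whose cursors are unbuffered SSDictCursors."""
    import pymysql  # lazy import: only needed when talking to MySQL
    import pymysql.cursors
    return pymysql.connect(cursorclass=pymysql.cursors.SSDictCursor,
                           **connect_kwargs)


def stream_query(connection, sql):
    """Yield one dict per row instead of fetching the whole result set."""
    with connection.cursor() as cursor:
        cursor.execute(sql)
        for row in cursor:
            yield row


def to_records(rows):
    """Turn raw row dicts into namedtuples with a fixed, cleaned-up shape."""
    for row in rows:
        yield Customer(id=int(row["id"]),
                       name=row["name"].strip(),
                       country=row["country"].upper())


def write_batches(records, out_dir, batch_size=1000):
    """Write records to transitory CSV files, `batch_size` rows per file."""
    paths = []
    records = iter(records)
    while True:
        batch = list(islice(records, batch_size))  # at most one batch in memory
        if not batch:
            break
        path = os.path.join(out_dir, f"part_{len(paths):04d}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(Customer._fields)
            writer.writerows(batch)
        paths.append(path)
    return paths
```

Chained together, `write_batches(to_records(stream_query(conn, sql)), out_dir)` pulls rows lazily from the database, so only one batch sits in memory at a time and each file can be uploaded as soon as it is written.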

Libraries:

  • PyMySQL
  • csv
  • pandas
  • requests (for REST APIs)
  • pymongo

Tips:

Use as much logging as you can. It will save your day.
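
For example, a load step could log every failure and a final summary using the standard logging module. This is just a sketch: the logger name "etl", the `load_rows` helper, and the `insert` callback are placeholders I made up, not part of any library.

```python
import logging

logger = logging.getLogger("etl")  # hypothetical logger name


def configure_logging(level=logging.INFO):
    """Send timestamped log lines to stderr."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def load_rows(rows, insert):
    """Insert rows one at a time, logging each failure and a final summary."""
    loaded = failed = 0
    for row in rows:
        try:
            insert(row)
            loaded += 1
        except Exception:
            # logger.exception records the traceback along with the message
            logger.exception("Failed to load row %r", row)
            failed += 1
    logger.info("Load finished: %d loaded, %d failed", loaded, failed)
    return loaded, failed
```

Call `configure_logging()` once at startup. When a job dies at 3 a.m., the log tells you which row broke it and how far the load got.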

Book:

Python Cookbook