I just completed the Udacity Data Engineering program and wanted to share my insights with other people thinking of taking it. You will use S3, Redshift, Spark, EMR, EC2, Airflow, Cassandra, and Postgres throughout the course, with a focus near the end on Airflow, Spark, S3, and Redshift.
- Data Modeling in Apache Cassandra & Postgres
- Cloud Data Warehouses (AWS, S3, Redshift)
- Building a Data Lake with an ETL pipeline in Spark (EMR, Spark)
- Using Airflow to schedule and orchestrate data pipelines.
- An open-ended capstone project similar to projects 3 and 4, or their template project using your own data.
I wanted to get better at data governance, data quality, and data pipelines so I could introduce these concepts in my current role, and to become more familiar with Airflow and AWS. Overall, I wanted to take more ownership of the data engineering I do now.
Our current stack includes data loaders like Stitch and Fivetran, but there have been more and more instances where they don't meet our needs for integrating all our data sources. This is where I saw a need to bring Airflow into the picture, to load data into our warehouse and also into other destinations used by our business users.
Coming into this program I felt confident my SQL and Python skills were sufficient for this nanodegree. I don't write Python as much as I'd like to (maybe once a week), and there were cases where I struggled with the content when it used Python in unfamiliar scenarios, especially in the Airflow portion. It's manageable if you search for examples and read the Airflow documentation.
There were a lot of issues with the course after module 2, to the point that a large chunk of my peers were complaining. They eventually extended module 3 with new videos and resources that make completing it much easier, but those should have been there from the start. Unfortunately, I really struggled to finish this portion before they introduced the new content, so I was stuck googling and reading through the AWS documentation to figure it out.
This was a common theme after the 2nd module. A lot of the information and resources were outdated; I felt like I was spending more time troubleshooting AWS IAM roles and security permissions than learning data engineering. These issues were only remedied once enough people complained. It felt like a half-finished program. That said, the hardship made me much better at AWS and Airflow, since I had to work through limited information to pass each module.
Overall, though, I really enjoyed the program for the exposure to these tools and concepts. The highlights for me were learning Spark and Airflow. I am now comfortable enough to stand up my own ecosystem: sending data to a warehouse from different sources and running tests on the data to ensure its quality. That was my overall goal in taking this nanodegree.
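To make that concrete, here is a minimal sketch of the kind of data-quality check you end up writing in these pipelines. It's plain Python so it stands alone, but in Airflow you would wrap it in a `PythonOperator` task downstream of the load step. The table and column names are hypothetical illustrations, not from the course projects.

```python
# Sketch of a post-load data-quality check: fail the pipeline if the
# loaded table is empty or has nulls in required columns. In Airflow,
# raising an exception here marks the task (and the run) as failed.

def check_table(rows, required_columns):
    """Validate a loaded table represented as a list of dicts."""
    if not rows:
        raise ValueError("Quality check failed: no rows loaded")
    for col in required_columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        if nulls:
            raise ValueError(f"Quality check failed: {nulls} null(s) in '{col}'")
    return len(rows)  # row count, useful for logging


if __name__ == "__main__":
    # Hypothetical sample of rows loaded into an 'orders' table.
    rows = [
        {"order_id": 1, "amount": 9.99},
        {"order_id": 2, "amount": 4.50},
    ]
    print(check_table(rows, ["order_id", "amount"]))  # → 2
```

In a real DAG you would pull `rows` from the warehouse (e.g. via a Redshift hook) rather than pass them in directly; the point is that the check raises on bad data so the scheduler surfaces the failure.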
Do I recommend it? It depends on the work you do or want to do.
With tools like dbt, dbt Cloud, Stitch, and Fivetran, you can avoid a lot of headaches and stand up an analytics ecosystem relatively quickly. If that is your goal, I don't think this nanodegree is for you. Buying over building makes more sense 9 times out of 10, especially if you are a startup trying to gain insights quickly.
But if you're building a backend for an application, or working for purposes outside of an analytics warehouse, then I would consider taking this course, with the caveats above.
I think a better first step is to pick one of these technologies and focus on it (take a Udemy course on Airflow first, for example).
- Learned AWS, Airflow, Spark
- Good background information on data modeling, traditional data schemas.
- Highlighted data quality and data governance, and how to introduce tests within your data pipeline.
- Great community of help
- Unclear directions and modules become a huge time sink
- Can burn through free AWS credits quickly. You spend roughly $200 USD a month on the course itself, and then maybe an additional $100-150 USD a month on AWS products trying to figure out the projects, due to the outdated documentation and resources provided. (Overall I had to spend an additional $320 over 3 months.)
- Spending more time troubleshooting problems with the course, or filling its gaps, than learning about data engineering.
Overall Rating 3/5