How do you become a data engineer?
Unlike some other technical roles that come with a dedicated degree and a generally defined path, data engineering is a little less straightforward. Many of us might never have even heard of data engineers when we were taking our college courses. Yet companies like Facebook, Amazon, PayPal, and Walmart all have data engineering roles open right now, and there are also plenty of startups looking for data engineers.
But how do you go from college student to data engineer? What degrees do data engineers have? How does one become a data engineer? What skills do data engineers have? What do data engineers do on a day-to-day basis?
These are just some of the questions I have gotten over the past year. I wanted to write an article to help answer many of them.
I have worked with data engineers who have degrees in a wide range of fields, from English to physics.
Although many job descriptions ask for a math or engineering degree, employers will often overlook your degree if you have the right experience.
Of course, that raises the question of how you get the experience.
There are a few ways to go about this. First, you can get an internship as a data engineer. This is arguably when the bar is lowest, since employers are willing to consider someone with zero work experience.
Another route is to work your way into the position laterally. Often, even if you don't have a computer science or math background, you can still get into data engineering by getting an analyst or project manager position first. From there you can start pushing for more and more work in the data engineering space.
I have seen this work several times for different individuals who started in very different roles. But you often need to be willing to do both your own work and some extra data engineering work.
You can also try to get a position that sits very close to data engineering, like BI analyst.
At a high level, data engineers help take data from point A to point B and remodel it into a format where analysts and data scientists can easily use it.
From a skills perspective, this means that data engineers specialize in ETLs (extract, transform, load), automation (usually with Python or other programming languages), data modeling/data warehousing, SQL and NoSQL data manipulation, and data visualization, to name a few.
The skills that tend to be new to many people are ETLs and data warehousing. Both are usually covered in master's or certificate programs rather than in a bachelor's degree, although I suspect this will change or already has.
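If ETL is a new term for you, the three steps can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the sample data, the `orders` table, and the function names are all made up for this example, and a real pipeline would read from files, APIs, or source databases and load into an actual warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export, invented for illustration only.
RAW_CSV = """order_id,customer,amount_usd
1,alice,19.99
2,bob,
3,alice,5.01
"""


def extract(raw_text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast fields to real types."""
    cleaned = []
    for row in rows:
        if not row["amount_usd"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"], float(row["amount_usd"]))
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a table analysts can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# An in-memory SQLite database stands in for the warehouse here.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

totals = conn.execute(
    "SELECT customer, ROUND(SUM(amount_usd), 2) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The final query is the kind of question an analyst would ask once the data has landed. The row with the missing amount never reaches the table because the transform step filters it out.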
Data engineers use a variety of tools, from programming languages to drag-and-drop tools and from cloud data warehouses to data visualization programs.
There are more tools available for data engineers to work with than a single person could probably master in a lifetime.
A better approach is to look at the various types of tools that exist. For example:
- Workflow orchestration: Airflow and Luigi
- Cloud data warehousing and analytics: Azure Synapse
- Streaming: AWS Kinesis
Data engineering jobs exist at companies all across the world and in various industries. You can find them in banking, healthcare, big tech, startups, and everything in between.
But don't take my word for it. Here are a few jobs that you could apply to today:
- Facebook data engineer
- Strive Health data engineer
- Amazon data engineer
- Costco data engineer
- Amplify Consulting data engineer
To answer this question, I just send them a link to my data engineering study guide.
I get asked a lot how data engineers differ from data scientists, and there are entire articles that provide a skill-by-skill comparison.
However, for this answer, I am going to focus on the goals of data scientists and data engineers. From there it can be easier to see how the different tools and skills line up for both of these data specialties.
The goals of a data engineer are much more big-picture and development focused. Data engineers build automated systems and model data structures to allow data to be efficiently processed.
This means the goal of a data engineer is to create and develop tables and data pipelines to support analytical dashboards and other data customers (like data scientists, analysts, and other engineers). It's similar to most engineering work. A lot of design and development, along with assumptions and limitations that have to be worked through, goes into creating a robust final system.
This system might be a data warehouse with its ETL jobs, or it might be a streaming pipeline. Either way, it is built to be used by hundreds if not thousands of users who need access to reliable data to help answer their questions.
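To make "create and develop tables" a bit more concrete, here is a small sketch of a modeled table layout with one fact table and one dimension table, plus the kind of query a dashboard might run against it. The table names, columns, and rows are hypothetical, and SQLite in memory stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical model: one dimension table describing customers and one
# fact table recording orders that reference those customers.
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        region      TEXT
    );
    CREATE TABLE fact_orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        order_date  TEXT,
        amount_usd  REAL
    );
    """
)

conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "alice", "west"), (2, "bob", "east")],
)
conn.executemany(
    "INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
    [
        (10, 1, "2024-01-01", 20.0),
        (11, 2, "2024-01-01", 10.0),
        (12, 1, "2024-01-02", 5.0),
    ],
)


def revenue_by_region(connection):
    """Join the fact table to its dimension and aggregate, as a dashboard might."""
    return connection.execute(
        """
        SELECT d.region, SUM(f.amount_usd)
        FROM fact_orders AS f
        JOIN dim_customer AS d ON d.customer_id = f.customer_id
        GROUP BY d.region
        ORDER BY d.region
        """
    ).fetchall()


print(revenue_by_region(conn))
```

Keeping descriptive attributes such as region in a separate dimension table is one common design choice. It lets many different questions reuse the same fact table without duplicating customer details on every order row.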
In comparison, data scientists tend to be question focused, in the sense that they are looking for ways to reduce costs and increase profits or to improve customer experience or business efficiencies. This means they need to ask and then answer questions (ask a question, hypothesize, and then conclude).
They need to ask questions like: What impacts patient readmission? Would a customer spend more if shown ad A vs. ad B? Is there a faster route to deliver packages? Skipping over the rest of the process, the goal from here is to find an answer to whatever question is posed. That answer might be a final conclusion, or it might be more questions. Throughout the process, data scientists analyze, gather support, and develop a conclusion to the question.
There are a lot of great courses out there that you can use to learn more about data engineering. I'll break them down into two distinct types: general data engineering training and courses on specific skills.
For example, if you're interested in learning about what data engineers do and what skills you need on a daily basis, then check out these courses.
This excellent course by Coursera covers the entire toolkit of skills required to learn data engineering.
This 100%-online course offers a flexible schedule and brings you an opportunity to practice key job skills, such as working with data processing systems and machine learning models.
This is an intermediate-level course that requires basic proficiency with SQL.
This course includes various demos, labs, and presentations that'll enable you to learn data-driven decision-making through the collection, transformation, and publishing of data.
You can also check out Big Data in the AWS (Amazon Web Services) Cloud for a different perspective on big data and some of the skills that data engineers use.
With this 100%-online, fully flexible course, you'll learn the basics of data modeling and work with SQL to develop an in-depth understanding of data manipulation and the design of a data warehouse.
This course will give you the opportunity to work with large data sets and create dashboards using visual analytics.
With this comprehensive specialization, you'll learn about data visualizations, Pentaho, and data warehouses.
On the other side, there are specific courses you can look into to pick up more specialized skills. There are so many tools available that, once you have learned the basics of being a data engineer, it's not a bad idea to learn about other tools and methods like streaming, Spark, and more.
In this course you will learn how to use Structured Streaming and DataFrames in Spark 3, as well as how to use Amazon's Elastic MapReduce (EMR) service to work with your Hadoop cluster.
However, my favorite part of this course is that it teaches you how to frame big-data analysis problems as Spark problems.
There is more than one path to become a data engineer. You can come from a variety of backgrounds and disciplines and still succeed.
What's more important is that you have the technical skills and soft skills that will make you a strong data engineer.
If your goal is to become a data engineer, take some time to assess your skill set and see where you can expand upon it.
Then start your journey!
Thanks for reading!