I was in a local technology Slack and saw the question:
Tell me, what is data engineering and how is it different than data science? What [Python] tools do you use?
It's a good question, and one worth answering carefully. Here's what I think:
Data engineers are concerned with how data:
- Lands in a system
- Moves through a system
- Interacts with business processes and application logic
- Is stored
- Is governed
Definitions differ, but I think that's a good starting place.
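To make those concerns a little more concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the `RAW_CSV` sample data, the `orders` table, and the `land`, `transform`, `store`, and `run_pipeline` function names are hypothetical, and a real system would likely use the heavier tools mentioned later in this post.

```python
import csv
import io
import sqlite3

# Hypothetical raw input. In a real system this might arrive as a file or a message.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,carol,42.50
"""


def land(raw_text):
    """Land: parse raw text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(records):
    """Move: apply a simple rule (skip rows with no amount) and cast types."""
    cleaned = []
    for row in records:
        if not row["amount"]:
            continue  # illustrative governance rule: reject incomplete records
        cleaned.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return cleaned


def store(rows, conn):
    """Store: persist the cleaned rows in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(raw_text, conn):
    """Run land -> transform -> store, then return (row count, total amount)."""
    store(transform(land(raw_text)), conn)
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()


conn = sqlite3.connect(":memory:")
count, total = run_pipeline(RAW_CSV, conn)
print(count, round(total, 2))
```

Each function maps loosely to one bullet above: `land` covers how data arrives, `transform` covers how it moves and meets a business rule, and `store` covers where it ends up. Keeping the steps separate is a design choice that makes each one easy to test or swap out.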
There is a huge and diverse ecosystem of tools out there. I would highlight the following for their strong Python support:
- Kafka for pub/sub messaging
- Spark for data processing (including the Spark Streaming API for stream processing)
- Airflow for workflow management
- Pandas for data "wrangling"
You can find many more tools on the awesome list.
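As a small taste of what "wrangling" with Pandas can look like, here is a hedged sketch. The `raw` data and the `wrangle` function are made up for illustration; the Pandas calls themselves (`str.strip`, `pd.to_numeric`, `dropna`, `groupby`) are standard parts of the library.

```python
import pandas as pd

# Hypothetical messy input: inconsistent casing, stray whitespace, bad values.
raw = pd.DataFrame(
    {
        "customer": [" Alice", "bob ", "alice", None],
        "amount": ["19.99", "5", "n/a", "12.50"],
    }
)


def wrangle(df):
    """Normalize names, coerce amounts to numbers, drop bad rows, then total per customer."""
    out = df.copy()
    out["customer"] = out["customer"].str.strip().str.lower()
    # errors="coerce" turns unparseable values such as "n/a" into NaN.
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    out = out.dropna(subset=["customer", "amount"])
    return out.groupby("customer", as_index=False)["amount"].sum()


totals = wrangle(raw)
print(totals)
```

The function works on a copy so the raw data stays untouched, which makes it easier to rerun or debug a cleaning step.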
Lastly, how do data scientists differ from data engineers? I would argue that both roles are concerned with connecting data to business processes. They differ the same way their names differ:
- A data engineer is concerned with designing and building systems that make data available and actionable in a cost-effective manner.
- A data scientist performs experiments, the results of which are (sometimes) actionable insights or automation.
I hope that explanation helps. If you see something I got wrong or didn't explain well, please chime in 👇