What Even Is Data Engineering?

jess unrein ・2 min read

I started my first job as a Data Engineer back in September. I wasn't exactly sure what the job entailed, but I was ready to shift my focus and learn new things. I'd spent the previous four years on back end application development, and while I learned a lot, I was ready to shift away from primarily consuming and constructing APIs.

When I was interviewing for the position, I read up on DAGs, ETL pipelines, and common database patterns like the star schema. However, in my interview I was asked things like "how do you optimize this SQL query?" and "assuming you don't have an automated load balancer in place, how do you decide which cluster to create a new database in?" They were interesting questions, so I decided to take the job. But it can be difficult to explain, even to other engineers, what the scope of my job is.

My team is responsible for database infrastructure. We make sure that database clusters aren't overloaded. We're concerned with internal data security. We manage database users and permissions, and field requests from different teams. We're the information gathering arm for the Business Intelligence team; if they need a dump of data from a specific API every 24 hours, we write the ETL pipeline to do it, and manage the data warehouse where all that information lives. My job is kind of a grab bag of things that need to get done but might not have an obvious owner.
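To make the "dump of data from an API every 24 hours" case a bit more concrete, here is a minimal sketch of what such a pipeline could look like. Everything in it is an assumption for illustration: the `extract` function returns hard-coded records in place of a real API call, the `orders` table and its columns are made up, and SQLite stands in for an actual data warehouse. It is not the pipeline my team runs.

```python
import sqlite3
from datetime import date


def extract():
    """Stand-in for a call to the source API (hypothetical payload)."""
    return [
        {"order_id": 1, "amount": "19.99", "country": "us"},
        {"order_id": 2, "amount": "5.00", "country": "de"},
    ]


def transform(records, load_date):
    """Normalize types and casing so rows match the warehouse table."""
    return [
        (r["order_id"], float(r["amount"]), r["country"].upper(), load_date)
        for r in records
    ]


def load(conn, rows):
    """Write one batch to the (hypothetical) orders table, creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER, amount REAL, country TEXT, load_date TEXT, "
        "PRIMARY KEY (order_id, load_date))"
    )
    # INSERT OR REPLACE keeps a rerun for the same day from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn, load_date=None):
    """Run extract, transform, and load once; return the number of rows loaded."""
    load_date = load_date or date.today().isoformat()
    rows = transform(extract(), load_date)
    load(conn, rows)
    return len(rows)


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # SQLite stands in for the warehouse
    print(run_pipeline(warehouse), "rows loaded")
```

The every-24-hours part is left out on purpose. In practice a scheduler (cron, or a workflow tool built around DAGs) would call something like `run_pipeline` once a day.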

I know I probably haven't done a great job of answering the question "What even is data engineering?" because I'm not exactly sure I know myself! I recently met up with Ali at a local Python meetup and we bonded over the fact that the discipline of data engineering seems utterly made up.

In this series, however, I'm going to introduce different tools that I use in my work through the lens of data engineering. If you have specific questions, or want to know about specific tools commonly used by data engineers, let me know in the comments. Data engineering is a fairly new field in software development, and the boundaries are still being drawn. Let's suss it out together!

jess unrein

@thejessleigh

Pronouns: they/them | Pythonista, cat lover, avid reader, and gamer in Chicago. Tip jar: https://ko-fi.com/thejessleigh

Discussion

 

This is an anonymous comment sent in by a member who does not want their name disclosed.

I'm not sure how valid these questions are or if I'm articulating them properly.

  • What is a data warehouse and when do you need it? This is what I've been assuming: a gigantor database that extracts data from other sources, transforms it into the format you want, and consolidates it all in one place so you can query all you want. i.e. if you don't need data from a db to be combined with other data, you wouldn't need to include it in your data warehouse. Is that in the ballpark?

  • Do apps pull data from the data warehouse? I would assume not because they'd pull directly from the app database ^ if my above notion is correct.

  • Do data warehouses typically get updated in real time?

 

Hi! I've been working in DWH for some years. In my experience, here's what I can say about questions 1 & 2: a data warehouse is not only for querying data. It lets you get information (the next level up from data) and exploit it with other tools like Power BI, MicroStrategy, etc. I don't know of an app that reads from a data warehouse.

Data warehouses usually get updated in batches. Streaming updates are not common; that is more of an app database behaviour.

:)

 

Totally valid questions! I will definitely start working on an explainer post about data warehouses - what they are, what commercial products are available, and common use cases! Thank you for reaching out!

 

"how do you optimize this SQL query" -- Best statement ever hahaha, I can relate here u.u

I've been working in a Datalake team (data whatever position) for a while and now switching back to frontend :p

IMHO it is a mix of DevOps and Big Data processing...

 

Data Engineering on AWS is a challenging role. You need to know which service to use for which job, otherwise you will end up with very high costs. If you are interested in how to build a Data Lake on AWS, check out this link: youtu.be/lRWkGVBb13o

 

Chicago Python community represent!

This would make a great panel discussion at a ChiPy Data event. We'll see what we can set up in early 2019.

 

Ooh, yes. I'm willing to bet AC would be willing to host 😄