My 2022 Review: About Data-Centric

Although I have worked with data in the past, this year I worked closely with data engineers and finally understood what it means to be data-centric.

We all know that applications are built on data, so backend engineers and architects inevitably deal with all kinds of data stores and data structures. Getting the most out of each data store is the goal of every backend engineer, and our focus is on maximizing the throughput of data storage by choosing the most efficient data structures.

There are many technical details involved, including but not limited to the following.

Indexing: improve database query performance.
Sharding: reduce the volume of data each node has to handle.
Caching: serve data from a faster medium.
Denormalization: pre-aggregate the results of complex queries.
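
To make the first item concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the index name idx_orders_user_id are made up for illustration and do not come from any real system. It asks SQLite for its query plan before and after an index is created, which is roughly how a backend engineer would check that a query stops scanning the whole table.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan steps SQLite reports for a query, as plain strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable description.
    return [row[-1] for row in rows]


# Hypothetical table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE user_id = ?"

# Without an index, SQLite has to scan the table to filter on user_id.
plan_before = query_plan(conn, sql, (42,))

# With an index on user_id, the same query can search the index instead.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, so the sketch only prints it. The point is that the second plan mentions the index while the first one does not.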

But for data engineers, the focus is completely different.

How can we quickly find the answer to a question when there is a lot of data at hand?

"Quickly" here means something very different from what a backend engineer cares about. Data engineers care about speed in terms of productivity, so they are willing to write complex SQL queries, say over 100 lines long, to find the answer, even if the query takes a long time to run.

This is the opposite of what backend engineers care about, which is response time. From a backend engineer's point of view, such a complex and time-consuming query is unthinkable.

Data engineers have different data requirements in order to achieve their goals.

ETL and ELT: make data more structured.
Data Lakehouse: continuously clean up and consolidate data.
Governance: manage data catalogs and lineage.

The focus of data engineers is how to organize data from many sources into a structure that is suitable for use, put it in an easy-to-access place, and keep the complexity of maintenance low.

Therefore, let me summarize this year's experience in one sentence.

Only those who focus on data and care about data are data engineers.

This is what data-centric means, and I spent a lot of time this year learning and changing my mindset in order to understand it.

Data-centric Monitoring

In fact, the different concerns of backend engineers and data engineers are also reflected in the metrics of their monitoring systems.

Application quality metrics usually include the following key points.

response latency
throughput
error rate
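
As a rough sketch of how these three numbers can be derived from raw request records, assuming a hypothetical Request record and a fixed observation window (both are my own illustration, not part of any particular monitoring tool):

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float
    status: int  # HTTP-style status code


def summarize(requests, window_seconds):
    """Compute p95 latency, throughput and error rate for one time window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: the smallest value that is greater than
    # or equal to 95% of the observations.
    rank = max(1, math.ceil(len(durations) * 95 / 100))
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "p95_latency_ms": durations[rank - 1],
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }


# Example: 20 requests observed over a 10-second window, two of them failed.
sample = [
    Request(duration_ms=float(i), status=500 if i > 18 else 200)
    for i in range(1, 21)
]
print(summarize(sample, window_seconds=10))
```

Treating only 5xx statuses as errors is an assumption. Real systems often define errors differently, and usually compute these values inside the metrics backend rather than in application code.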

However, for data applications, the quality metrics are as follows.

Freshness: time lag between source and landing.
Completeness: whether data is lost at each stage of transformation.
Correctness: similar to the above, whether each stage of transformation is correct.
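
These three checks can be sketched in a few lines of Python. The function names, the use of record keys to compare stages, and the idea of re-applying the transformation to verify correctness are all assumptions made for illustration. A real pipeline would likely run such checks inside its own data quality tooling.

```python
from datetime import datetime, timedelta, timezone


def freshness(source_event_time, landed_time):
    """Time lag between when a record was produced and when it landed."""
    return landed_time - source_event_time


def completeness(source_keys, landed_keys):
    """Fraction of source records that are still present after a stage."""
    source = set(source_keys)
    if not source:
        return 1.0  # nothing to lose
    return len(source & set(landed_keys)) / len(source)


def correctness(source_rows, landed_rows, transform):
    """Fraction of landed rows that equal transform(source row).

    Both arguments map a record key to its value. Only keys present on
    both sides are compared; missing keys are a completeness problem.
    """
    shared = [key for key in landed_rows if key in source_rows]
    if not shared:
        return 1.0
    matching = sum(
        1 for key in shared if landed_rows[key] == transform(source_rows[key])
    )
    return matching / len(shared)


# Example with made-up data: the stage is supposed to double each value.
produced = datetime(2022, 12, 1, 0, 0, tzinfo=timezone.utc)
landed = produced + timedelta(minutes=15)

print(freshness(produced, landed))                  # lag as a timedelta
print(completeness([1, 2, 3, 4], [1, 2, 4]))        # one of four keys lost
print(correctness({1: 10, 2: 20}, {1: 20, 2: 41}, lambda v: v * 2))
```

Unlike latency or error rate, none of these can be observed from a single request. Each one compares two stages of the pipeline, which is why the telemetry looks so different.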

Although both are applications, their telemetry is very different. Of course, the underlying infrastructure is also different.

Data Team Organizations

Let's break down the composition of a data team.

From my point of view, a complete data team would have at least two roles with completely different skill sets and totally different responsibilities: data engineer and data analyst.

In my opinion, there is a simple way to differentiate these two roles.

Data Analyst: the person who uses data to answer questions.
Data Engineer: the person who generates the "data" above.

Of course, there are two other roles that can be created depending on how the questions are answered.

Data Scientist: the person who answers questions through their own analysis.
Machine Learning Engineer: the person who answers questions through AI.

But for a healthy data team, I believe there needs to be at least a distinction between the roles of data engineer and data analyst. Just as the focus of a backend engineer is different from that of a data engineer, the focus of a data engineer is different from that of a data analyst.