Preeti Hemant
Data architecture models

Data has come a long way, starting from the 1640s when the term “data” had its first use, to the 21st century, where AI has become integral to everyday life.

As you can imagine, several software and hardware developments have co-evolved with data, bringing us to the here and now. One of the early challenges in data was ingesting it — how the data was to be used and the needs it served weren't nearly as interesting. The use cases were extremely narrow, mostly defaulting to basic business reporting. Today, however, the focus has shifted from ingesting data to making it accessible in a way that supports a plethora of applications — accuracy, timeliness, reliability, and trust at massive scale are paramount.

The challenges today in data - a consequence of its scale and speed - are in the areas of data discoverability, governance and reliability. The market is flooded with tools for every data problem conceivable. But is there a guiding philosophy on how to bring this multitude of tools together, or how to connect the different roles in an org with these tools?

What we need are data architectures that can provide directional guidance, allow for weighing trade-offs, are domain-agnostic, and at the same time don't put us at risk of building something that quickly becomes obsolete.

Data architecture is a relatively new term. In fact, one of the first references to a data architecture model is the mention of Data Mesh in this article from April 2020.

So, is data mesh the only model or one of many? A search will show "Data Fabric" and "Data Mesh" as two popular candidates for data architecture models.

If you are looking for a short-form introduction to the two models: think of Data Fabric as a convergence of modern data tools, stitched together to collect disparate data and move it within a system in a multi-hop manner. The objectives are data discoverability, accessibility and management — for varied consumers and use cases. Data Mesh, then, is the next step in the evolution of data architectures, bringing aspects of product management and decentralization to data.

Contrasting one with the other: Data Fabric allows for ingestion of data from any source, for any use case — without gating it for quality during ingestion; trust in data and data integrity are addressed through layers that logically come after ingestion. Data Mesh, on the other hand, places strong emphasis on data quality and on data being treated as a product, even before it can become part of the data ecosystem.

Is one better than the other? Which one renders itself better to implementation? Here’s the long form of the two models.

Data Fabric

[Image: A Data Fabric architecture]

How should data move in a system? What characteristics should data retain and shed as it moves? In the Data Fabric architecture, data follows a set of steps that determine its flow. The first step takes data through an integration phase, where data is ingested and then cleaned, transformed and loaded into storage. Next comes the data quality phase, where quality assessment is performed on the stored data. This data is then made available for different use cases through a combination of a data lake and a data warehouse; typical use cases are BI, analytics and machine learning. Data governance policies are defined for the ingested data, and a data catalog is used for discoverability.
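
To make the flow concrete, here is a minimal Python sketch of the multi-hop movement just described: integrate, assess quality, then catalogue for consumers. Every name in it (Dataset, integrate, assess_quality, catalogue) is hypothetical and purely illustrative, not part of any vendor's API.

```python
# A minimal, hypothetical sketch of the Data Fabric flow described above:
# integration (ingest, clean, transform, load), then quality assessment on
# the stored data, then cataloguing for discoverability.

from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    rows: list
    metadata: dict = field(default_factory=dict)

def integrate(source_rows: list) -> Dataset:
    """Integration phase: ingest, clean, transform, and load."""
    cleaned = [r for r in source_rows if r.get("id") is not None]  # clean
    for r in cleaned:
        r["amount"] = float(r.get("amount", 0))                    # transform
    return Dataset(name="orders", rows=cleaned)                    # load (in-memory stand-in)

def assess_quality(ds: Dataset) -> Dataset:
    """Data quality phase: assessment runs after the data is stored."""
    total = len(ds.rows)
    complete = sum(1 for r in ds.rows if r.get("customer"))
    ds.metadata["completeness"] = complete / total if total else 0.0
    return ds

def catalogue(ds: Dataset, catalog: dict) -> None:
    """Register the dataset so BI, analytics and ML consumers can discover it."""
    catalog[ds.name] = ds.metadata

catalog = {}
ds = assess_quality(integrate([
    {"id": 1, "amount": "9.99", "customer": "a"},
    {"id": None, "amount": "1.00"},
]))
catalogue(ds, catalog)
print(catalog)  # {'orders': {'completeness': 1.0}}
```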

The above functions are mostly centralized — a team of data specialists designs and implements the different stages in the fabric, and also sets up policies and access controls.

Simply put, Data Fabric is how most data ecosystems move, store and access data today.

The beauty of data fabric as an architecture model is the flexibility it offers — not all components are a must, and there are multiple vendors with off-the-shelf solutions that can collect and process data from any source, for any use case.

Data Mesh

[Image: A Data Mesh architecture]

Riding on the shift of software systems towards distributed domain design, data mesh is built on the principles of distributed architecture. There are three major components in a data mesh: decentralized domain ownership of data and the resulting data products, self-serve data infrastructure, and federated governance.

Data Mesh has been designed to derive value from data at scale, in complex environments — complex not in data volume or velocity but in the number of use cases and the diversity of data sources. Since the complexity is not only technical, this architecture is modelled as a socio-technical construct.

Domain-owned data is probably the most critical shift in going from a Data Fabric to a Mesh. The idea is quite simple — who better to own and provide data for use than the teams generating that very data? In this paradigm, business domains decide what data is useful and should be exposed for different use cases within the org. Does that mean these teams also build the methods and tools to serve this data? No — this requires skills the domains are not expected to have, and is instead delegated to the data infrastructure team, which builds a self-serve data platform.

Domains serve their data as a product — a product that meets well-defined standards ensuring interoperability with data from other domains. This data product lives as a node on the mesh. This is how the concept of ETLs is done away with in the Data Mesh paradigm.
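
As a rough illustration of what "data as a product" could look like, here is a hypothetical Python sketch: a contract a domain publishes for its data product, registered as a node through a stand-in for the self-serve platform. The contract fields and the register() call are assumptions for illustration, not an established standard.

```python
# A hypothetical sketch of a domain-owned data product as a node on the mesh.
# The contract fields and the platform's register() call are illustrative,
# not taken from any real self-serve platform.

from dataclasses import dataclass

@dataclass(frozen=True)
class DataProductContract:
    domain: str                 # owning business domain
    name: str                   # product name, unique within the domain
    schema: dict                # column -> type, so other domains can interoperate
    freshness_slo_hours: int    # how stale the served data is allowed to be
    pii_fields: tuple           # fields subject to global governance rules

class MeshPlatform:
    """Stand-in for the self-serve platform built by the infrastructure team."""
    def __init__(self):
        self.nodes = {}

    def register(self, contract: DataProductContract) -> None:
        # The platform enforces interoperability standards at registration time.
        if not contract.schema:
            raise ValueError("a data product must publish its schema")
        self.nodes[f"{contract.domain}.{contract.name}"] = contract

platform = MeshPlatform()
platform.register(DataProductContract(
    domain="orders",
    name="daily_orders",
    schema={"order_id": "string", "amount": "double", "customer_id": "string"},
    freshness_slo_hours=24,
    pii_fields=("customer_id",),
))
```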

Decentralized domain data ownership is the highlight of this architecture. Ownership of the design and deployment of the infrastructure that serves data, however, is centralized — with the data platform team. Naturally, there arises a need for a body that balances these aspects and delineates the decisions that stay localized within each domain from those that are considered global. This group is the federated governance group, carved out of both the data platform team and the individual domains.
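
Continuing the hypothetical sketch above, federated governance could be expressed as a small set of global rules checked uniformly across every node on the mesh, while everything else remains a local, per-domain decision:

```python
# Continuing the sketch: hypothetical global policies defined by the
# federated governance group. Local decisions (cleaning, transformations)
# stay inside each domain; global rules are checked across the whole mesh.

def check_global_policies(platform: MeshPlatform) -> list:
    violations = []
    for node_id, contract in platform.nodes.items():
        # Global rule 1: no product may promise staler data than 48 hours.
        if contract.freshness_slo_hours > 48:
            violations.append(f"{node_id}: freshness SLO exceeds the global 48h limit")
        # Global rule 2: every declared PII field must appear in the published schema.
        for f in contract.pii_fields:
            if f not in contract.schema:
                violations.append(f"{node_id}: PII field {f!r} missing from schema")
    return violations

print(check_global_policies(platform))  # [] for the product registered above
```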

Similarities between Data Fabric and Data Mesh
Both architecture models attempt to solve the problem of getting value from data at scale, while making data secure, accessible, and easy to use and interpret.

How do they differ?
In a Data Fabric, a dataset gains value by being onboarded, catalogued and made available through a standardized set of governance rules.
In a Data Mesh, a dataset gains value because of its usability, as determined by its consumers (data scientists and data analysts).

In a Data fabric, there is standardization in how data is cleaned, labeled and checked for quality.
In a Data Mesh, the decision on how data is to be made consumption-ready, i.e., the pre-processing steps, lies with the domains that own the data.

In a Data Fabric, the onus of understanding the data, and the interoperability of datasets generated by different services, is a joint responsibility of the data engineering team and the consumers of data - the analysts and the scientists.
In a Data Mesh, it is the responsibility of the teams serving their data to understand how the data could be used to generate value, and to design it in a way that meets the needs of its consumers.

Finally, which one should you pick?
Data Fabric addresses and recommends solutions to the fundamental questions of ingesting and using data. Data Mesh, as a model, can become the solution when the fabric hits a wall on issues around data ownership and data quality. An important prerequisite for a Data Mesh architecture to succeed is domain-oriented software architecture and teams in the organization.

All things considered, it is a good idea for a data org to get started with the data fabric paradigm and adopt principles from data mesh as their data, their needs, and the complexity of their data systems evolve!

References:
Data Mesh Principles and Logical Architecture
HelloFresh Journey to the Data Mesh
Data Fabric as Modern Data Architecture
