The term Data Contracts is the latest buzzword in the data world and has been heavily explored in recent publications. Although the subject has not yet reached Brazil with the force it deserves, the pain it proposes to solve is already very present in the market. But what are we really talking about, and what is this pain?
Seeking to be Data Driven
In recent years, organizations in the Brazilian market have made a great effort to become increasingly Data Driven. In the beginning, the aim was to gather the data available across the company and mobilize it to bring intelligence to day-to-day decision-making.
After overcoming this challenge, the market realized that, although it managed to use the data, the data was unstructured, disorganized, and siloed across different systems. At this stage, the priority became building an environment that made it possible to work with very large volumes of data. It was also important that this architecture allowed resources to scale quickly and efficiently when necessary, and that it brought operational efficiency to the company's processes, making pipelines more automated and performant. It was critical, at this stage, to make data from different sources and in different formats able to "talk to each other", giving the user a consolidated view of the phenomenon underlying their decisions.
At the time of writing, I can see that the market has already reached a certain maturity in data architecture. Companies' Data Lake and Data Mesh structures are already running in production, with data pipelines that deliver data in a consolidated manner to the business user. HOWEVER, there is an important consequence here:
As organizations move towards being more data driven, they are, in fact, anchoring their business operations on data. This makes the architecture critically important: if there is any kind of failure, or data somehow becomes unavailable, the operation can grind to a halt (!!!).
Unfortunately, data unavailability is a very common situation in the daily life of data teams. It is very common for these teams to receive a distressed call from a business user saying that data is not available, has not been updated, is blank, or something similar. This is amplified when the organization is large and complex, with several teams producing and consuming data: the process's failure points multiply. According to Barr Moses, CEO of Monte Carlo, among the top data challenges are:
- "Data pipelines constantly break and create quality and usability issues."
- "There is a communication chasm between service implementers, data engineers and data consumers" (Check out the original article here).
The high availability of data, therefore, presents itself as a first-order problem to be solved. There are other very important issues in this environment, such as security, access management, and quality… but high availability takes on enormous relevance in this context. So what can be done?
What are Data Contracts
Data contracts emerged as a proposed solution to the problem described above. The term gained prominence in the market with Chad Sanderson in August 2022, in his text The Rise of Data Contracts. In that article, he lays out the problem and proposes the concept of data contracts as "API-like agreements between Software Engineers who own services and Data Consumers that understand how the business works in order to generate well-modeled, high-quality, trusted, real-time data".
Maggie Hays comments that a data contract needs to define (see the sketch after this list):
- "what data needs to move from a (producer’s) source to a (consumer’s) destination"
- "the shape of that data, its schema, and semantics"
- "expectations around availability and data quality"
- "details about contract violation(s) and enforcement"
- "how (and for how long) the consumer will use the data" (Check out the original article here)
The point that most draws my attention is that the idea of data contracts does not lie only in clearly establishing criteria for data delivery and consumption: it is also concerned with ensuring, in an automated way, that the contract is not breached, since a breach can break entire data pipelines and generate losses for the business operation.
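As an illustration of that automated enforcement, a toy validation step could sit at the boundary between producer and consumer and fail fast instead of letting bad data flow downstream. This is a sketch only, reusing the hypothetical DataContract above:

```python
class ContractViolation(Exception):
    """Raised when data breaks the agreed contract."""


def enforce(contract: DataContract, record: dict) -> dict:
    """Validate one record against the contract before it moves downstream."""
    for column, expected_type in contract.schema.items():
        if column not in record:
            raise ContractViolation(f"missing column '{column}'")
        if not isinstance(record[column], expected_type):
            raise ContractViolation(
                f"column '{column}': expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    return record  # only contract-compliant records reach the consumer


# A renamed or retyped field is caught here, at the boundary,
# instead of silently breaking a dashboard hours later.
enforce(orders_contract, {"order_id": "A-1", "amount": 99.9, "created_at": "2023-01-01"})
```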
My 2 cents on the matter
In addition to what has already been exhaustively argued, I would like to add one more feature necessary for a good implementation of data contracts: visualization of the data's entire path and its points of failure.
It's easy to argue that the feature I'm describing is nothing more than data lineage, a feature already available in some data catalog solutions on the market. However, lineage (although critical to a good data governance strategy) is just one part. The view I believe is useful for data contracts would also show each process's points of failure and whether those checkpoints are OK or a failure was detected: a consolidated view of all the contracts that permeate the organization; a holistic view of the process.
In addition to functioning as a monitor of the entire data flow in the company, this view would be extremely important to the culture and data literacy strategy, making the entire process explicit to stakeholders.
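As a thought experiment, here is a hypothetical sketch of that consolidated view: each contract becomes a checkpoint along the data's path, and a single report shows where a failure was detected. All names and statuses below are invented for illustration:

```python
# (pipeline step, contract being checked, last check status)
# A real implementation would read these statuses from the enforcement
# layer and the lineage graph, not from a hard-coded list.
checkpoints = [
    ("orders_service -> raw.orders",  "orders_contract",    "OK"),
    ("raw.orders -> staging.orders",  "staging_contract",   "OK"),
    ("staging.orders -> fct_orders",  "analytics_contract", "FAILED"),
]


def holistic_view(checkpoints: list) -> None:
    """Print the whole data path with each contract checkpoint's status."""
    for step, contract, status in checkpoints:
        flag = "[ OK ]" if status == "OK" else "[FAIL]"
        print(f"{flag} {step:32} ({contract})")


holistic_view(checkpoints)
# [ OK ] orders_service -> raw.orders    (orders_contract)
# [ OK ] raw.orders -> staging.orders    (staging_contract)
# [FAIL] staging.orders -> fct_orders    (analytics_contract)
```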
Perspectives for the future
At this moment, there are some published ideas for implementing data contracts. Some use Kafka's schema registry and CDC for enforcement in an event-driven architecture; others propose using dbt in a batch fashion... but there is still no clear definition or consolidated method for implementing the concept.
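To give a flavor of the schema-registry idea, the core of that enforcement is a backward-compatibility check: a producer's new schema is rejected before deployment if it would break existing consumers. Below is a deliberately simplified, dependency-free sketch of such a check; a real event-driven setup would delegate this to Kafka's schema registry rather than hand-rolling it:

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """True if consumers reading with old_schema still work on new data.

    Simplified rule: no column a consumer relies on may be removed or
    change type. Real registries apply richer rules (default values,
    optional fields, transitive compatibility, ...).
    """
    return all(
        new_schema.get(column) is expected_type
        for column, expected_type in old_schema.items()
    )


old = {"order_id": str, "amount": float}
new_ok = {"order_id": str, "amount": float, "coupon": str}  # additive change: fine
new_bad = {"order_id": int, "amount": float}                # retyped column: breaks consumers

assert is_backward_compatible(old, new_ok)
assert not is_backward_compatible(old, new_bad)
```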
The expectation regarding this feature, however, is huge. The promise of accelerating the production and strategic mobilization of data, along with a significant improvement in quality and availability, has brought the concept into the market spotlight. The forecast is that the first products in this area will begin to emerge in the coming months.