
Divij Vaidya

Originally published at Medium

Why monitoring your big data analytics pipeline is important (and how to get there)

Your Big Data analytics pipeline provides insights to your business; now gain insights into the pipeline itself.

In this article, we first lay out the requirements for monitoring your big data analytics pipeline, and then go into the key aspects you need to consider to build a system that provides holistic observability.
Note: This article assumes some basic familiarity with the big data landscape and associated technologies such as Hadoop, YARN, Spark, Presto, etc.

Background

In the technology-fueled, dynamic environment of the 21st century, a successful business needs to learn from past results, adapt to current requirements, and forecast future trends. These valuable insights into the past, present, and future are retrieved by analyzing vast amounts of data collected over time.

Beyond that, the COVID pandemic has taught us that a successful business should be nimble and innovative. Nimbleness comes from the confidence to make the right decision at the right time. Innovation comes from increasing the likelihood of getting lucky by consistently running experiments. Acquiring both traits requires analyzing large-scale data quickly and economically. Today, a growing ecosystem of big data technologies such as Spark, Hive, Kafka, and Presto fulfills these requirements.

The internet is filled with articles about data processing and analytics frameworks, but there is a shortage of commentary about the auxiliary systems that augment the power of big data technologies. These systems are force multipliers for a better user experience in the big data ecosystem: they provide infrastructure cost attribution, data set discovery, data quality guarantees, auditing of user access, and insights (i.e., observability) into the analytical applications that run on the big data stack (henceforth called big data applications).

This article covers some key ideas for gaining observability into big data applications. We will define the scope of requirements, present some known challenges, and discuss high-level solutions.

Introduction to Data Application Observability

The big data stack can be broadly divided into the following layers, from the top of the hierarchy to the bottom: query processing, execution framework, resource management, and storage. These independent sub-systems work together to execute an analytical query (a big data application).

Data application observability is the ability to extract insights into the characteristics of big data applications. These insights lead to lower infrastructure cost through improved resource utilization; increased user productivity through easier debugging and troubleshooting; improved availability through reduced mean time to recovery; and better security through auditing and accountability.
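To make one of these insights concrete, consider infrastructure cost attribution. The resource-management layer already exposes per-application usage metrics; for example, YARN's ResourceManager REST API (`/ws/v1/cluster/apps`) returns a JSON listing of applications with fields such as `user`, `memorySeconds`, and `vcoreSeconds`. The sketch below, assuming that response shape (verify field names against your Hadoop version), aggregates usage per user as a first step toward chargeback reports:

```python
# Sketch: per-user resource aggregation from a YARN ResourceManager
# application listing. The JSON shape and field names follow YARN's
# documented /ws/v1/cluster/apps response, but are assumptions here --
# check them against your cluster's Hadoop version.
from collections import defaultdict

def resource_usage_by_user(apps_response):
    """Sum memory-seconds and vcore-seconds per user for cost attribution."""
    usage = defaultdict(lambda: {"memorySeconds": 0, "vcoreSeconds": 0})
    for app in apps_response.get("apps", {}).get("app", []):
        totals = usage[app["user"]]
        totals["memorySeconds"] += app.get("memorySeconds", 0)
        totals["vcoreSeconds"] += app.get("vcoreSeconds", 0)
    return dict(usage)

# Trimmed-down example response with three finished applications:
sample = {
    "apps": {
        "app": [
            {"user": "alice", "memorySeconds": 1200, "vcoreSeconds": 40},
            {"user": "bob", "memorySeconds": 300, "vcoreSeconds": 10},
            {"user": "alice", "memorySeconds": 800, "vcoreSeconds": 20},
        ]
    }
}
print(resource_usage_by_user(sample))
```

In practice you would fetch the listing over HTTP from the ResourceManager and multiply the aggregated memory-seconds and vcore-seconds by your per-unit infrastructure rates to produce a cost breakdown per user, team, or queue.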

Continue reading on Medium »

Top comments (2)

Alex Antra

Hi Divij

Posting barebones articles linking back to the original on another website is against our terms.

dev.to/terms

Can we ask that you post the original article here in the future?

Divij Vaidya

Understood. I have expanded the content a bit more, but reproducing the full article here would require editing work I don't want to invest in right now. In the future, I will ensure that the complete article is posted.

Thank you for your understanding.