This article is not a tutorial but a philosophical reflection on a question that many professionals involved in creating or using monitoring systems ask: “Why are we doing what we are doing, and how are we doing it?”
What I say in this blog post may sound strange or incomplete, but this is how the problem presented itself to me. I do not know the ultimate answer to “what is monitoring and how to do it properly.” I doubt any human being knows an answer to this question that crosses all the t's and dots all the i's. This post is a somewhat silly attempt to clarify a few points. If some of my thoughts are useful to you as well, I will be glad that I spent the time bringing all my arguments on this matter together. Of course, as I see it, the side effect of this article is to raise more questions and to dive deeper into this fascinating issue. And now, without any further ado.
When speaking about IT monitoring and observability, many IT professionals make the same mistake. They give the impression that the idea of monitoring and observability is something that:
- Was created recently as a part of the IT revolution.
- Exists separately from other forms of monitoring (industrial and others).
- Is unique in its approaches and owes nothing to the experience gained by the engineers of the past.
None of those statements is true. Monitoring has been a part of human activity for centuries. Whenever there is some process, someone usually observes it, making sure that it goes through. When IT monitoring and observability first emerged as a topic, they were treated like any other form of monitoring and observability, and they shouldn't be treated any differently today, because there is nothing new in the world regarding how human beings connect with their surroundings. We can create better tools, but we have yet to change either the mechanics of the eye or how the human brain processes input data about its surroundings. With this idea in mind, let us try to answer a straightforward question: why are we monitoring?
There are many answers to that question in the IT crowd, but let's think, for a second:
- Why, historically, were people observing and watching the fire?
- Why were captains watching the weather, tides, and directions of the winds?
- Why were train engineers watching the boiler?
- Why are pilots watching airplane controls?
I bring up these few examples of human activity to make a point. The purpose of all those actions is to keep driving some process: the fire, the movement of the ship, the safety of the engine, the control of the airplane. So, every time we think of “monitoring” or “observability,” we need to think, “This is all for the control of some process.” It is not for personal curiosity. It is not to satisfy some external requirement without questioning “why.” It is not only to establish some fact without bringing that fact into a proper context. The foremost task of any “monitoring” and “observability” is “Control.” Everything else is secondary. Even if we are involved in the monitoring and instrumentation of a scientific experiment, the primary task is to keep the process under control to the maximum extent possible, and only then to gain scientific data. So, after answering the first and probably most important question, we must ask ourselves: how do we do that?
And at this point, there will be no shortage of answers. We will hear about how we get and compute the data, about thresholds and aggregations, statistical computation, and visualization. But let us step back and think: “What does it mean to control? How can we be certain that the process is controllable? How shall we organize our observability so that each element in this effort matters?”
And again, rather than starting with a menu of possible solutions, I am taking a step back to dig to the root of the idea of “observability.” What are we observing while seeking control over something? Every time we build a fire, we watch that the fire neither dies nor gets out of control. The whole purpose of seamanship is to deliver people and cargo across the water, and while doing that, you take care of the unfavorable conditions that could prevent the delivery. Whenever you control some process or mechanism, you watch what you are managing and do everything to complete the task by removing the obstacles. So, the method of control is detecting and preventing the barriers that stand in the way of a process and may block it from fulfilling its purpose.

To keep something under control, then, we have to detect not the problems themselves but the traits that lead to them. For example, if the train boiler has disintegrated into smithereens, we can safely say that the current state of the boiler is beyond any control. But what are those traits? What are the indicators that an issue is about to occur? In a dynamic environment, which is characteristic of any process we seek to control, those traits are found through the collection and observation of patterns by various methods.
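To make the difference between detecting a problem and detecting a trait more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `assess` function, the window and horizon sizes, and the boiler pressure readings are hypothetical and not taken from any real system. It only shows one simple way to flag a sustained upward trend before a hard limit is crossed.

```python
from statistics import mean


def slope(samples):
    """Least-squares slope of equally spaced samples (units per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def assess(samples, limit, window=5, horizon=10):
    """Classify the latest state of a telemetry series.

    "problem": the limit is already crossed (too late for control).
    "trait":   the recent trend would cross the limit within `horizon` samples.
    "normal":  neither of the above.
    """
    recent = samples[-window:]
    current = recent[-1]
    if current >= limit:
        return "problem"
    rate = slope(recent)
    if rate > 0 and current + rate * horizon >= limit:
        return "trait"
    return "normal"


if __name__ == "__main__":
    # Hypothetical boiler pressure readings, one per minute.
    pressure = [10.0, 10.1, 10.0, 10.4, 10.9, 11.5, 12.2]
    print(assess(pressure, limit=15.0))
```

A plain threshold check would stay silent on this series until the limit is actually reached, while the trend check reports the trait while there is still time to act. A linear extrapolation is only one of many possible pattern detectors and is used here because it is short.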
So, if we look at the root of the problem, monitoring and observability are a process of constant search for and detection of patterns in the data. For the user who is trying to keep some process under control, the observability platform helps catch the “fingerprints” in the data that may lead to concerns, as well as the patterns that lead to the restoration of normality. While observing repeatable patterns is the primary tool for controlling a process, more than just patterns is needed. But why? What's wrong with just the patterns? Detecting potential issues by searching for known (or barely known) traits in telemetry removes a significant burden from the observer. Still, neither the observer nor we know everything. So, as the second line of observation, we may establish the relationships between known patterns and other events in the system. Because “all good” and “all bad” are related, one observation usually leads to another. Without those relationships, there may be things you never learn how to detect. What is the summary, and what can we do correctly or incorrectly? Let me recap my thoughts and come to a conclusion:
- Monitoring and observability are for control. If you do not have an outcome of monitoring that implies control over your process, you are not monitoring. You are just busy.
- The purpose of control is not to let an undesirable outcome progress beyond the point where you can no longer control and contain the situation. If your monitoring only informs you that the wrong thing has already happened, or worse, if you learn about it from your users or the morning newspapers, you have neither monitoring nor observability.
- The way of detecting the problems is by observing telemetry patterns in motion. The way of detecting normality is through observation of telemetry patterns.
- New patterns are found by observing the relationship between known patterns and the behavioral patterns of other telemetry items.

And now, let me give some counter-examples of how we can get monitoring wrong:
- We are considering monitoring and observability as an instrument for secondary-level issues, such as inventory control, capacity management, and planning.
- We are “detecting problems” instead of the conditions that may lead to the problem.
- We are overly obsessed with golden signals and thresholds for those golden signals. Furthermore, we try to automate a “momentary analysis” based on thresholds while performing dynamic analysis without any automation. This means we measure single points and create alerts based on individual entries, while staring at individual telemetry charts and relying on mental pattern recognition.
- We are overly obsessed with issues and ignore the detection of normality.
- We are not looking at relationships between the behavior of different telemetry items.
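The last point, looking at relationships between the behavior of different telemetry items, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `pearson` and `best_lag` helpers, the metric names, and the sample values are all hypothetical, and a lagged Pearson correlation is just one simple way to check whether one telemetry item tends to move ahead of another.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no pattern to relate
    return cov / sqrt(vx * vy)


def best_lag(leader, follower, max_lag=3):
    """Return (lag, correlation) for the lag at which `follower`
    most closely tracks the earlier behavior of `leader`."""
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        r = pearson(leader[:-lag], follower[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


if __name__ == "__main__":
    # Hypothetical telemetry: queue depth and request latency per minute.
    queue_depth = [1, 1, 2, 5, 9, 4, 2, 1, 1, 1]
    latency_ms = [11, 11, 11, 11, 12, 15, 19, 14, 12, 11]
    lag, r = best_lag(queue_depth, latency_ms)
    print(f"latency follows queue depth by {lag} samples (r={r:.2f})")
```

If a relationship like this holds up over time, the leading item becomes a candidate trait to watch for the following one, which is how a known pattern can help uncover a new one.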
This is all I can say about this very intricate engineering problem. While I am not claiming that I am 100% correct, I do claim that these reflections are the result of years of observing monitoring and observability as an IT and industrial discipline, and of observing how engineers perceive it.