My name is Zhenya Kosmak, and I wrote this article describing my experience as a Technical Product Manager. You can connect with me on LinkedIn if you want to discuss your project or anything related to this article. I'll be glad to share some advice 🙃
If your development team has spent several thousand hours on your product and it's already in production, stability becomes a significant concern. All services on the servers should work stably, and if a critical problem appears somewhere, the development team should spot it and start fixing it. In this article, we'll talk about our experience setting up an alerting system for Calibra. In this case, we managed not only to ensure the technical stability of the product but also to optimize costs and improve our client's processes.
This article is part of the series "Alerting system: why it's necessary both for developers and product owners." It presents our arguments on why you should track vital metrics with alerting tools and how this helps a product owner.
To make the problems we solved easier to understand, we first need to cover the essentials of the product. Calibra is a BPMS, i.e., a system that covers most of our client's workflows. The client's company managed numerous advertising campaigns on behalf of its own clients and earned from each lead it brought. Calibra managed the accounts from which the ads were launched; contained all advertising settings; automatically changed the ads; collected statistics on advertising effectiveness; and much more.
Let's start with an example of how a client company can lose money without monitoring. First, here's how we set things up:
- We collected metrics on each server of the system using various tools. Each metric was sent to centralized storage.
- We called any unexpected situation an "event-to-alert." Each event has a start timestamp (when the issue appeared) and an end timestamp (when it was resolved).
- We used centralized settings for sending notifications for such events. When something went wrong, we sent a message to a channel on Slack, and we sent another message when the situation was resolved.
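Conceptually, an "event-to-alert" is a small record with a start time and, once resolved, an end time, plus a helper that posts to Slack. Here's a minimal Python sketch of that idea; the webhook URL, message wording, and function names are illustrative assumptions, not the actual Calibra implementation:

```python
import json
import urllib.request
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical Slack incoming-webhook URL; in reality it lives in config.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"


@dataclass
class EventToAlert:
    """An unexpected situation with a start and, once resolved, an end."""
    name: str
    started_at: datetime
    resolved_at: Optional[datetime] = None

    @property
    def is_open(self) -> bool:
        return self.resolved_at is None


def notify_slack(text: str) -> None:
    """Send a plain-text message to the alerts channel via an incoming webhook."""
    payload = json.dumps({"text": text}).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


def open_event(name: str) -> EventToAlert:
    """Record the start of an issue and announce it in Slack."""
    event = EventToAlert(name=name, started_at=datetime.now(timezone.utc))
    notify_slack(f":rotating_light: {name} started at {event.started_at:%H:%M} UTC")
    return event


def resolve_event(event: EventToAlert) -> None:
    """Record the end of an issue and announce the resolution in Slack."""
    event.resolved_at = datetime.now(timezone.utc)
    notify_slack(f":white_check_mark: {event.name} resolved at {event.resolved_at:%H:%M} UTC")
```

Keeping both timestamps on the event lets you report not just that something broke, but how long it stayed broken, which is what matters when you later analyze incidents.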
As a result, the client stayed informed about the product's problems. And by this, we mean not just technical issues but also problems of the client's business itself.
One of the product's subsystems was especially important financially for our client. This subsystem sent lead data to the advertising platforms. With this data, the platforms optimized audience targeting, so advertising effectiveness increased significantly.
The integration between marketing platforms and our product was a complex one for a bunch of reasons:
- Google Ads and Meta Ads differ in how they gather information, and each has various bugs on its side. We spent dozens of hours talking with them to figure those issues out.
- Each lead event had to be reliably saved on our side before being sent to the platform. Even if a subsystem broke, a server crashed, or anything else went wrong, we must not lose a single lead. So we developed a separate microservice that received all the lead events.
- Every lead event was sent to the main DB to be shown on the statistics pages of our product. Marketing managers used this metric on each ad to track its efficiency. If any part of the lead event information was corrupt (for example, the event type was malformed), we didn't send it to the platform and logged it for further investigation.
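The persist-first, validate-then-forward flow described above can be sketched in a few lines of Python. The field names, event types, and storage/platform callbacks here are hypothetical, chosen for illustration rather than taken from Calibra's actual schema:

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger("lead-intake")

# Illustrative event types and required fields; the real schema would differ.
VALID_EVENT_TYPES = {"lead", "purchase", "registration"}
REQUIRED_FIELDS = {"event_type", "campaign_id", "occurred_at"}


def validate_lead_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event is well-formed."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - event.keys()]
    event_type = event.get("event_type")
    if event_type is not None and event_type not in VALID_EVENT_TYPES:
        problems.append(f"malformed event type: {event_type!r}")
    return problems


def handle_lead_event(event, save_to_db, send_to_platform) -> bool:
    """Persist first, then forward only well-formed events to the ad platform."""
    save_to_db(event)  # never lose a lead, even a malformed one
    problems = validate_lead_event(event)
    if problems:
        # Malformed events are kept and logged for later investigation.
        logger.warning("lead event not forwarded: %s", "; ".join(problems))
        return False
    send_to_platform(event)
    return True
```

The key design choice is the ordering: the event hits durable storage before any validation or forwarding, so a platform outage or a bad payload can never cost you the lead itself.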
As a result, we had multiple points at which our system might break. On the other hand, we had already seen several failures of this process, and in most cases, not for reasons within Calibra itself. Our client used a few third-party products to filter bot traffic; settings for gathering lead information might be configured incorrectly in the ad settings; and so on. Each part of this system could degrade advertising performance. This slightly simplified illustration shows the scale:
So, having the complete stream of correctly sent lead events, we decided to alert whenever we received fewer lead events than expected. Thus, if too few leads came in, we could raise an alert and fix the cause as quickly as possible, optimizing the client's advertising spend.
As a result, it looked like this:
- If a group of advertising campaigns receives no leads within one hour, that's a problem, and an alert arrives.
- We created a checklist of typical problems on our side, covering 95%+ of cases, and went through it after every Slack notification. If the problem wasn't on our side, the client had to solve it.
- After checking, we mentioned the client's team members in the Slack thread, informing them of the problem on their side.
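The hourly "no leads in a campaign group" check from the first step can be sketched like this. The group identifiers and the shape of the event stream are assumptions for illustration, not Calibra's actual data model:

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Tuple

# One hour without a single lead in a campaign group triggers an alert.
WINDOW = timedelta(hours=1)


def groups_without_leads(
    lead_events: Iterable[Tuple[str, datetime]],  # (group_id, occurred_at)
    all_groups: Iterable[str],
    now: datetime,
) -> List[str]:
    """Return campaign groups that received no leads in the last hour."""
    # Track the most recent lead per group.
    last_seen: Dict[str, datetime] = {}
    for group_id, occurred_at in lead_events:
        if group_id not in last_seen or occurred_at > last_seen[group_id]:
            last_seen[group_id] = occurred_at

    cutoff = now - WINDOW
    # A group alerts if it has never had a lead or its last lead is too old.
    return sorted(
        g for g in all_groups
        if last_seen.get(g) is None or last_seen[g] < cutoff
    )
```

In production this kind of check would typically run on a schedule against the metrics store, and the returned group list would feed the Slack notification described above.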
As a result, we solved some of the client's problems by informing them in time, exactly when they had to act. At the same time, we became more likely to notice product problems that we could fix on our own.
We developed a relatively mature monitoring system for an early-stage product. It might look like overkill, but stories like this one prove the opposite. Problems like the ones described above are waiting for you all the time, and you need a customizable solution to be ready for them. Only then can you be sure that CEO-level product metrics are on track.
So yeah, this is the single, definitive reason why a product owner needs an alerting system: some metrics matter a lot, and you need to keep the product within their expected values. Simply because it saves money.
If you need something similar for your product, we'll be glad to discuss it with you (link to our website).
Stay tuned for the new articles!