Originally posted here
Monitoring a production environment is an essential part of any product, regardless of its size. But monitoring large production environments efficiently is what separates good quality of service from bad.
Some background: our product is delivered to several clients, so we run a multi-tenant environment in which every tenant needs proper monitoring. Each tenant has its own production subscription and independent infrastructure.
At the start, all our monitoring was built on native Azure resources: all data went into Application Insights (Log Analytics), and the dashboards were built in the same environment. As a result, when we needed to investigate the root cause of an issue in our environment, we spent approximately 1 hour finding the exact component that caused it, which was really painful for our clients. Incidents were like looking for a needle in a haystack.
Here is an example of a dashboard in Azure:
If you are using Azure, I’m sure you know how bad its mobile compatibility is, so I won’t even include a screenshot of that here.
So the things we were missing here are:
- Multi-chart graphs
- Dynamic variables, to easily choose the resources we want to show
- A usable date picker (the Azure one could compete for a worst-design award)
- A responsive view
- Data from sources other than Azure metrics
- Simple access management
- Change history
We started looking for a better approach to observability and, logically, tried the top open-source tool for this: Grafana.
Grafana is the open-source analytics & monitoring solution for every database. https://grafana.com/grafana
I won’t cover how to spin up your Grafana instance. But I’ll show you how we improved our observability and how that improved our issue detection.
Some of the pros of using Grafana that we found are:
- Different data sources
- Responsive dashboards
- Multi-chart graphs
- Great choice of plugins
- Awesome date picker that just works
- Dynamic variables
- Clear access management
- Change history — versioning
- Rich community dashboards library
All these features brought a new level of observability to our reliability team.
Here I’ll show you an example of how we are monitoring the load on our database depending on the number of requests and nodes from our API.
We only need to choose a subscription in the first variable; dynamic queries then populate all the other variables with matching values.
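To make the chained-variable idea concrete, here is a minimal Python sketch of how one selection can cascade into the others. It is only a simulation, not Grafana code: the inventory, the query names (`subscriptions`, `resource_groups`, `resources`) and the tenant names are all hypothetical, and picking the first option as the default is an assumption about how the dashboard is configured.

```python
import re

# Hypothetical inventory standing in for what the data source would return.
INVENTORY = {
    "sub-tenant-a": {"rg-a-prod": ["api-a", "sqldb-a"]},
    "sub-tenant-b": {"rg-b-prod": ["api-b", "sqldb-b"]},
}

# Variables in dependency order; each query may reference earlier variables.
VARIABLES = [
    ("subscription", "subscriptions()"),
    ("resource_group", "resource_groups($subscription)"),
    ("resource", "resources($subscription, $resource_group)"),
]


def interpolate(query, selected):
    """Replace every $name in the query with the currently selected value."""
    return re.sub(r"\$(\w+)", lambda m: selected[m.group(1)], query)


def run_query(query):
    """Evaluate an interpolated query against the hypothetical inventory."""
    name, raw_args = re.fullmatch(r"(\w+)\((.*)\)", query).groups()
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    if name == "subscriptions":
        return sorted(INVENTORY)
    if name == "resource_groups":
        return sorted(INVENTORY[args[0]])
    if name == "resources":
        return sorted(INVENTORY[args[0]][args[1]])
    raise ValueError(f"unknown query: {name}")


def resolve(subscription):
    """Pick a subscription and let the dependent variables follow."""
    selected = {"subscription": subscription}
    for name, query in VARIABLES[1:]:
        options = run_query(interpolate(query, selected))
        selected[name] = options[0]  # assumption: default to the first option
    return selected


print(resolve("sub-tenant-a"))
print(resolve("sub-tenant-b"))
```

Switching the subscription is the only manual step; every dependent query is re-evaluated with the new value, which is the behaviour described above.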
This is only a brief example of the possibilities Grafana opens up for observability of Azure resources. The ability to connect different data sources lets you build a dashboard on which you can monitor both your background jobs hosted in Azure and a data layer hosted in Elasticsearch.
To connect your Azure resources to Grafana, use the Azure Monitor plugin: https://grafana.com/grafana/plugins/grafana-azure-monitor-datasource/
With this plugin, you can create dynamic variables, which make it easier to choose the resource you want data for.
A full list of supported queries is available on the plugin page. After you have created your variables, configure your charts to use the template variables for the data source:
This lets you build dashboards that help you detect issues more efficiently.
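As a rough illustration, the Python sketch below assembles a fragment of a dashboard's JSON with chained variables and one panel that uses them. Treat it as a set of assumptions rather than our exact setup: the variable queries (`Subscriptions()`, `ResourceGroups(...)`, `Namespaces(...)`, `ResourceNames(...)`) follow the plugin's older text-based template functions, which newer Grafana versions replace with a form-based editor. The data source name, the panel target field names and the `cpu_percent` metric are hypothetical and should be checked against a panel exported from your own Grafana.

```python
import json

# Assumption: the name your Azure Monitor data source has in Grafana.
DATASOURCE = "Azure Monitor"


def query_variable(name, query):
    """Build one query-type template variable for the dashboard JSON."""
    return {
        "name": name,
        "type": "query",
        "datasource": DATASOURCE,
        "query": query,
        "refresh": 1,  # re-run the query when the dashboard loads
    }


def metric_panel(title, metric_name):
    """Build a time-series panel whose target is driven by the variables.

    Assumption: the target field names mirror what older plugin versions
    export; compare with a panel exported from your own Grafana.
    """
    return {
        "title": title,
        "type": "timeseries",
        "datasource": DATASOURCE,
        "targets": [{
            "queryType": "Azure Monitor",
            "subscription": "$subscription",
            "azureMonitor": {
                "resourceGroup": "$resourceGroup",
                "metricDefinition": "$namespace",
                "resourceName": "$resource",
                "metricName": metric_name,  # hypothetical metric
                "aggregation": "Average",
                "timeGrain": "auto",
            },
        }],
    }


dashboard = {
    "title": "Tenant overview (example)",
    "templating": {"list": [
        query_variable("subscription", "Subscriptions()"),
        query_variable("resourceGroup", "ResourceGroups($subscription)"),
        query_variable("namespace", "Namespaces($subscription, $resourceGroup)"),
        query_variable(
            "resource",
            "ResourceNames($subscription, $resourceGroup, $namespace)",
        ),
    ]},
    "panels": [metric_panel("Database load", "cpu_percent")],
}

print(json.dumps(dashboard, indent=2))
```

Because the panel target only contains `$variable` references, the same dashboard serves every tenant: changing the subscription re-points every chart without editing a single query.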
Here are some of our dashboards:
Bringing all these possibilities together, we built a General dashboard that shows the important metrics of every system and whether there are any issues. This decreased our time to detection from ~1h to ~10min, which helped us prevent a lot of incidents in production and sped up incident resolution.
Our postmortems became more scientific: we could see many more metrics, and thanks to the date picker our timelines became precise instead of approximate.
Since we introduced Grafana into our product, we have considered it one of the best choices we made to improve our monitoring.
And what have you done to improve your observability?
Thanks for reading!
If you have an interesting experience with Grafana or you are interested in another topic, please add comments and upvote 👍. I'm interested in the dialog.