This article is more a set of tips for that question than a full system design. If you are looking for the complete system design, try this amazing book.
Clarifying questions to establish the scope:
What metrics do we want to collect?
What are the supported alert channels?
How long should we keep the data?
Can we reduce the resolution of the metrics data for long-term storage?
Do we need to collect logs?
What is the scale of infrastructure we are monitoring with this system?
Is it a SaaS or just for internal purposes?
A metrics monitoring and alerting system consists of 5 components:
1) Collect data from sources
2) Transfer data (the difference from 1: collection happens at the source, transfer moves the collected metrics to the storage/processing layer)
3) Store data
4) Alerting
5) Visualization
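The data flowing through these stages is a stream of time-stamped samples. A minimal sketch of one such data point (the field names here are illustrative, not taken from any particular system):

```python
from dataclasses import dataclass, field
import time

@dataclass
class MetricPoint:
    """One sample of a time series: a metric name, a set of labels,
    a value, and the time it was observed."""
    name: str        # e.g. "cpu.load"
    labels: dict     # e.g. {"host": "web-01", "region": "eu"}
    value: float
    timestamp: float = field(default_factory=time.time)

point = MetricPoint(name="cpu.load", labels={"host": "web-01"}, value=0.75)
```

Every later stage (transfer, storage, alert evaluation, graphing) operates on points of roughly this shape.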
Data access pattern
We will constantly collect data and write it to the database, so the write load is heavy and steady. Visualization and alerting services send queries to the database; the access pattern varies with the graphs and alerts being evaluated, so the read load will be spiky.
Data storage system
SQL - queries for time-series aggregation become complicated, and performance degrades under heavy writes.
NoSQL - Cassandra and Bigtable are optimized for heavy writes and could be used, but we can find a better fit.
Time-Series databases - InfluxDB and Prometheus are the two most popular options. They are designed to store large volumes of time-series data and to perform real-time analysis on it quickly. They also efficiently aggregate and analyze large amounts of time-series data by labels; InfluxDB, for example, builds indexes on labels to enable fast lookup of time series by label.
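To illustrate why a label index makes these lookups fast, here is a toy inverted index in Python: each (label, value) pair maps to the set of series carrying it, so a filter is a set intersection instead of a full scan. This is only a sketch of the idea; real TSDB index structures are far more sophisticated.

```python
from collections import defaultdict

class LabelIndex:
    """Toy inverted index: maps (label, value) pairs to series IDs."""
    def __init__(self):
        self._index = defaultdict(set)   # ("host", "web-01") -> {series IDs}
        self._series = {}                # series ID -> full label dict

    def add_series(self, series_id, labels):
        self._series[series_id] = labels
        for pair in labels.items():
            self._index[pair].add(series_id)

    def lookup(self, **labels):
        """Return IDs of series matching all given label pairs."""
        sets = [self._index[pair] for pair in labels.items()]
        return set.intersection(*sets) if sets else set()

idx = LabelIndex()
idx.add_series("s1", {"host": "web-01", "region": "eu"})
idx.add_series("s2", {"host": "web-02", "region": "eu"})
print(sorted(idx.lookup(region="eu")))   # ['s1', 's2']
print(sorted(idx.lookup(host="web-01"))) # ['s1']
```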
Pull or push model to collect metrics
Pull model. With this approach we need to know the list of all services we want to monitor. Maintaining that list is hard, especially at large scale where services are added and removed frequently. Of course, we can use a service-discovery system like ZooKeeper, where services register their availability and our collector is notified about any changes.
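A pull collector built on that idea can be sketched as follows; the in-memory registry here is a stand-in for a real service-discovery system such as ZooKeeper, and the lambdas stand in for HTTP scrapes of each service's metrics endpoint.

```python
class ServiceRegistry:
    """In-memory stand-in for service discovery (e.g. ZooKeeper)."""
    def __init__(self):
        self._services = {}            # service name -> metrics callback

    def register(self, name, metrics_fn):
        self._services[name] = metrics_fn

    def deregister(self, name):
        self._services.pop(name, None)

    def endpoints(self):
        return dict(self._services)

def pull_metrics(registry):
    """One collection cycle: ask every registered service for its metrics."""
    return {name: fn() for name, fn in registry.endpoints().items()}

registry = ServiceRegistry()
registry.register("web-01", lambda: {"cpu": 0.42})
registry.register("db-01", lambda: {"cpu": 0.91})
print(pull_metrics(registry))
# {'web-01': {'cpu': 0.42}, 'db-01': {'cpu': 0.91}}
```

Because the collector reads the registry on every cycle, adding or removing a service requires no change to the collector itself.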
Push model. With this model we need to install an agent on every monitored server. The agent is a long-running process that collects local metrics and sends them to our collector.
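Such an agent can be sketched like this; the `send_fn` callable stands in for the network call to the collector, and the random value stands in for a real local reading (from /proc, cgroups, etc.). Batching, shown here, is a common way to reduce network overhead.

```python
import random
import time

class Agent:
    """Sketch of a push agent: collects local metrics and sends them
    to a collector endpoint (here a plain callable) in batches."""
    def __init__(self, host, send_fn, batch_size=3):
        self.host = host
        self.send_fn = send_fn
        self.batch_size = batch_size
        self.buffer = []

    def collect(self):
        # A real agent would read /proc, cgroup stats, etc.
        self.buffer.append({"host": self.host,
                            "cpu": random.random(),
                            "ts": time.time()})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_fn(self.buffer)
            self.buffer = []

received = []
agent = Agent("web-01", send_fn=received.extend)
for _ in range(3):
    agent.collect()
print(len(received))   # 3
```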
Push or pull?
It depends entirely on the details of the system we want to support. A large-scale system probably needs both approaches, because it may use serverless components where installing an agent is impossible.
High level design with Kafka
Why is Kafka needed?
- to avoid data loss when the database is unavailable
- to decouple data collectors from data processors
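Both points come down to putting a durable buffer between collectors and the database. The sketch below uses an in-memory deque to show the behavior; real Kafka adds durability, partitioning, and consumer groups on top of this idea.

```python
from collections import deque

class MetricsBuffer:
    """Stand-in for a Kafka topic: collectors produce into it, a writer
    consumes from it. If the database is down, data accumulates here
    instead of being lost, and collectors never talk to the DB directly."""
    def __init__(self):
        self.queue = deque()

    def produce(self, point):
        self.queue.append(point)

    def consume(self, db_available, write_fn):
        """Drain the buffer into the DB only when it is reachable."""
        if not db_available:
            return 0
        written = 0
        while self.queue:
            write_fn(self.queue.popleft())
            written += 1
        return written

buf = MetricsBuffer()
db = []
for v in (1, 2, 3):
    buf.produce(v)
print(buf.consume(db_available=False, write_fn=db.append))  # 0 - DB down, data kept
print(buf.consume(db_available=True, write_fn=db.append))   # 3 - drained to DB
```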