Ashok Nagaraj

Beyond Prometheus: monitoring at scale

Prometheus has its limits

Prometheus is a powerful time series database, but it has some drawbacks:

  • Limited scalability: a single Prometheus server stores all of its data on one node, so it scales vertically rather than horizontally. There is no built-in clustering or sharding of storage.
  • Limited data retention: Prometheus keeps its data on the local disk of that one server, and the default retention period is 15 days. Longer retention is bounded by the disk attached to the server, and local storage is not replicated.
  • Limited query capabilities: PromQL is built for selecting and aggregating time series rather than for general-purpose analytics. Complex queries can be hard to express, especially when they need to join series.
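
To get a feel for how retention is tied to one server's disk, the Prometheus storage documentation suggests a rough sizing rule: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes per sample). The sketch below simply applies that rule. The function name and the ingestion rate of 100,000 samples per second are assumptions chosen for illustration.

```python
def estimate_disk_bytes(retention_seconds: int,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough local-disk estimate: retention * ingestion rate * bytes per sample."""
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 15 days of retention at 100,000 samples per second.
retention = 15 * 24 * 3600
needed = estimate_disk_bytes(retention, 100_000)
print(f"{needed / 1e9:.1f} GB")  # prints "259.2 GB"
```

Doubling either the retention period or the ingestion rate doubles the disk needed on that single server, which is the limit the rest of this post tries to get around.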

Prometheus architecture
Image credit: promlabs


Monarch

In a research paper published in 2020, titled "Monarch: Google's Planet-Scale In-Memory Time Series Database", Google described the techniques behind the scalability and performance of its in-house metrics system, including in-memory storage, a columnar data layout, and distributed processing. Monarch is the successor to Borgmon, the internal Google monitoring system that inspired Prometheus.

Design choices
  • In-memory storage: Monarch stores all of its data in memory. This allows for very fast query processing, as there is no need to read data from disk. However, it also means that Monarch can only store a limited amount of data, as memory is a finite resource.
  • Columnar data layout: Monarch uses a columnar data layout to store time series data. This layout is more efficient than row-based storage for time series data, as it allows for more efficient compression and query processing.
  • Distributed processing: Monarch uses a distributed processing engine to process queries. The distributed processing engine breaks down queries into smaller tasks, which are then executed in parallel on the worker nodes. This allows Monarch to achieve high performance for even the most complex queries.
  • Reliable and scalable architecture: Monarch has a regionalized architecture. Largely autonomous zones store and serve data close to where it is produced, while global query and configuration planes tie the zones into one system. This allows Monarch to scale horizontally and provide high availability.
  • Rich data model: Monarch uses a schematized data model with typed values, including distributions (histograms) in addition to simple numeric values.
  • Flexible query language: Monarch supports an expressive query language, so users can filter, join, and aggregate time series in a variety of ways.
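
To make the columnar layout idea more concrete, here is a minimal sketch. It is not Monarch's actual storage format: the class and function names are invented for illustration, and delta encoding is just one simple example of the kind of compression that a column of regularly spaced timestamps makes possible.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnarSeries:
    """One time series stored as two parallel columns instead of a list of rows."""
    timestamps: list = field(default_factory=list)  # milliseconds
    values: list = field(default_factory=list)

    def append(self, ts_ms: int, value: float) -> None:
        self.timestamps.append(ts_ms)
        self.values.append(value)


def delta_encode(timestamps):
    """Keep the first timestamp, then only the gap to the previous one."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]


def delta_decode(deltas):
    """Rebuild the original timestamps by summing the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


series = ColumnarSeries()
for i in range(4):
    series.append(1_000 + 15 * i, float(i))

encoded = delta_encode(series.timestamps)
print(encoded)  # prints [1000, 15, 15, 15]
```

Because all timestamps sit next to each other, the repeated gap of 15 is easy to compress, and a query that only needs the values never has to touch the timestamp column's neighbours in a row.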

Architecture
Monarch architecture

Trade-offs
  • Limited capacity: because Monarch keeps its data in memory, capacity is bounded by the memory of the cluster. An organization operating at Google's scale can absorb this, but it could be a problem for smaller deployments.
  • Cost: Monarch can be more expensive than other time series databases, as it requires more memory and compute resources.
  • Complexity: Monarch is a complex system, which can make it difficult to set up and manage.

More details are in this blog post

Monarch is not open source; it is a proprietary system developed by Google. However, open source projects such as Thanos and Cortex pursue similar goals (horizontal scalability, high availability, and long-term storage) on top of Prometheus.


Thanos

Thanos is a set of components that extends Prometheus into a highly available metrics system with a global query view and long-term storage. It is designed to be highly scalable and reliable.

Features:
a. Multi-tenancy: Thanos can be used to store metrics for multiple tenants. This makes it a good choice for organizations with a large number of Prometheus users.
b. Aggregation: Thanos can aggregate metrics from multiple Prometheus servers. This allows you to view metrics from different sources in a single view.
c. Downsampling: Thanos can downsample metrics to reduce the amount of storage space required. This makes it a good choice for organizations with large amounts of Prometheus data.
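
Downsampling can be sketched in a few lines. The example below groups raw samples into fixed windows and keeps a handful of aggregates per window. It only illustrates the idea: the 5-minute window, the function name, and the choice of aggregates are assumptions, not Thanos's actual block format.

```python
from collections import defaultdict


def downsample(samples, window_ms=300_000):
    """Collapse (timestamp_ms, value) samples into per-window aggregates."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        buckets[ts - ts % window_ms].append(value)
    return {
        start: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


raw = [(0, 1.0), (60_000, 3.0), (299_999, 2.0), (300_000, 10.0)]
for start, agg in downsample(raw).items():
    print(start, agg)
```

Keeping count, sum, min, and max (rather than a single average) means that queries such as averages or maxima over long ranges can still be answered from the smaller data set.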

Architecture
Thanos architecture

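The Thanos architecture is more complex than that of Prometheus. It can be described in three tiers:

  • Querier: The querier is responsible for querying metrics. It fans a query out to the Store Gateways (for historical data in remote storage) and to the Thanos Sidecars running next to each Prometheus server (for recent data still held locally), then merges the results.
  • Store Gateway: The store gateway is responsible for serving data to the querier. It reads blocks from the remote storage and exposes them through the same API that the querier uses for live Prometheus data.
  • Remote Storage: The remote storage is responsible for storing data. It is an object store, such as Amazon S3 or Google Cloud Storage. The Thanos Sidecar uploads Prometheus blocks to it, and the Thanos Compactor compacts and downsamples those blocks.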
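
The querier's job of combining results from several sources can be sketched as follows. This is a deliberately naive illustration, not Thanos's real deduplication algorithm: it assumes each source returns (labels, samples) pairs, that highly available Prometheus pairs differ only in a label named `replica`, and that the first value seen for a timestamp wins.

```python
def merge_replicas(sources, replica_label="replica"):
    """Merge series from several sources, treating replicas as one series."""
    merged = {}
    for source in sources:
        for labels, samples in source:
            # Series identity ignores the replica label.
            identity = frozenset(
                (k, v) for k, v in labels.items() if k != replica_label
            )
            points = merged.setdefault(identity, {})
            for ts, value in samples:
                points.setdefault(ts, value)  # first value for a timestamp wins
    return {identity: sorted(points.items()) for identity, points in merged.items()}


prom_a = [({"__name__": "up", "job": "api", "replica": "a"}, [(1, 1.0), (2, 1.0)])]
prom_b = [({"__name__": "up", "job": "api", "replica": "b"}, [(2, 1.0), (3, 0.0)])]

result = merge_replicas([prom_a, prom_b])
for identity, points in result.items():
    print(dict(identity), points)
```

The two replicas overlap at timestamp 2 and each fills a gap the other has, so the merged series is more complete than either source on its own.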

Cortex

Cortex is a horizontally scalable, multi-tenant, Prometheus-compatible time series database that provides long-term storage for Prometheus metrics.

Features:
a. Multi-tenancy: Cortex can be used to store metrics for multiple tenants. This makes it a good choice for organizations with a large number of Prometheus users.
b. Aggregation: Cortex can aggregate metrics from multiple Prometheus servers. This allows you to view metrics from different sources in a single view.
c. Compaction: Cortex runs a compactor that merges and deduplicates stored blocks to reduce storage and speed up queries. Unlike Thanos, Cortex does not appear to offer built-in downsampling at the time of writing, so check the current documentation if you depend on it.
d. Alerting: Cortex can be used to create alerts based on Prometheus metrics. This allows you to be notified when there are problems with your systems.
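
Multi-tenancy in Cortex is driven by a tenant ID carried in the `X-Scope-OrgID` HTTP header on every read and write. The sketch below shows the isolation idea with an in-memory store; the `TenantStore` class and its methods are hypothetical and are not part of Cortex.

```python
from collections import defaultdict

TENANT_HEADER = "X-Scope-OrgID"


def tenant_from_headers(headers):
    """Return the tenant ID from request headers (header names are case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == TENANT_HEADER.lower() and value:
            return value
    raise ValueError(f"missing {TENANT_HEADER} header")


class TenantStore:
    """Hypothetical in-memory store that keeps each tenant's series separate."""

    def __init__(self):
        self._data = defaultdict(lambda: defaultdict(list))

    def write(self, headers, series, ts, value):
        self._data[tenant_from_headers(headers)][series].append((ts, value))

    def read(self, headers, series):
        tenant = tenant_from_headers(headers)
        return list(self._data.get(tenant, {}).get(series, []))


store = TenantStore()
store.write({"X-Scope-OrgID": "team-a"}, "up", 1, 1.0)
print(store.read({"X-Scope-OrgID": "team-a"}, "up"))
print(store.read({"X-Scope-OrgID": "team-b"}, "up"))
```

The same series name written by one tenant is invisible to another, which is what lets many Prometheus users share one backend.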

Architecture
Cortex architecture

The Cortex architecture is similar to Thanos. It can be described in three tiers:

  • Ingester: The ingester is responsible for the write path. It does not scrape exporters itself. Prometheus servers push samples to Cortex via remote write, a distributor forwards them to the ingesters, and the ingesters keep recent data in memory before flushing it to the remote storage.
  • Query Engine: The query engine is responsible for querying metrics. It can query recent data from the ingesters and older data from the remote storage.
  • Remote Storage: The remote storage is responsible for storing data. It is an object store, such as Amazon S3 or Google Cloud Storage. Cortex has no sidecar component; its compactor merges the blocks held in the object store.
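
Cortex spreads incoming series across its ingesters with a hash ring and writes each series to several ingesters for redundancy. The sketch below shows the general consistent-hashing technique only. The class name, the token count, the SHA-256 hash, and the replication factor of 3 are assumptions for illustration and do not mirror Cortex's implementation.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring mapping a series key to a set of ingesters."""

    def __init__(self, ingesters, tokens_per_ingester=8):
        # Each ingester owns several tokens (points) on the ring.
        self._tokens = sorted(
            (self._hash(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._keys = [token for token, _ in self._tokens]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

    def owners(self, series_key, replication_factor=3):
        """Walk clockwise from the key's position, collecting distinct ingesters."""
        start = bisect.bisect_right(self._keys, self._hash(series_key))
        found = []
        for offset in range(len(self._tokens)):
            _, name = self._tokens[(start + offset) % len(self._tokens)]
            if name not in found:
                found.append(name)
                if len(found) == replication_factor:
                    break
        return found


ring = HashRing(["ingester-1", "ingester-2", "ingester-3", "ingester-4"])
print(ring.owners('http_requests_total{job="api"}'))
```

Because a series key always hashes to the same position, writes and reads for that series agree on which ingesters to contact, and adding an ingester only moves the series that fall next to its new tokens.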

Conclusions

Comparison
Image credit: Bard

The final choice of time series database depends on the specific needs of the application. If you need a high-performance, scalable time series database with a simple data model and query language, Thanos is a good option. If you need an easy-to-use time series database with a rich data model and query language, Cortex is a good option. If you need a simple time series database that is easy to integrate with existing systems, Prometheus is still the best option.
