
Hiren Dhaduk
How Twitter Handles Observability

One reason Twitter can process 500 million tweets a day is a development culture that emphasizes monitoring and measurement. The company's Observability team runs a dedicated software stack that tracks service health and performance metrics and notifies teams of issues. The Observability stack collects 170 million unique metrics (time series) every minute and processes 200 million requests daily; charts and dashboards are populated through straightforward query tools.

The Twitter Observability Engineering team offers full-stack libraries and a variety of services to internal engineering teams: monitoring service health, alerting teams on issues, supporting root-cause investigation with distributed-systems call traces, and aiding diagnosis through a searchable index of aggregated application and system logs. These four pillars form the foundation of the team's charter:

  • Monitoring
  • Alerting/visualization
  • Infrastructure tracing for distributed systems
  • Analysis and aggregation of logs

Twitter's time series metric ingestion service currently manages 25,000 queries per minute, 4.5 petabytes of time series data, and more than 2.8 billion write requests per minute.

Metric Ingestion

To push metrics to the time series database and HDFS, most Twitter services use a Python collection agent. The Observability team also supports a slightly modified version of a popular StatsD server that forwards metrics to Carbon or to Twitter's primary time series metric ingestion service. Finally, an HTTP API is offered for cases where ease of ingestion matters more than performance.
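Twitter has not published its agent's internals, but the StatsD wire protocol it builds on is simple and well documented. As a rough sketch of what StatsD-style emission looks like, here is a minimal fire-and-forget UDP client (the host, port, and metric names are illustrative, not Twitter's):

```python
import socket

def format_metric(name, value, metric_type):
    """Render one metric in the StatsD wire format: "<name>:<value>|<type>"."""
    return f"{name}:{value}|{metric_type}"

class StatsdClient:
    """Fire-and-forget metric emission over UDP, StatsD style."""

    def __init__(self, host="localhost", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, count=1):
        # "c" is the StatsD counter type
        self.sock.sendto(format_metric(name, count, "c").encode(), self.addr)

    def timing(self, name, ms):
        # "ms" is the StatsD timer type
        self.sock.sendto(format_metric(name, ms, "ms").encode(), self.addr)
```

A StatsD server aggregates these datagrams in memory and flushes rollups to a backend such as Carbon at a fixed interval, which is why UDP's lack of delivery guarantees is an acceptable trade-off for metrics.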

Time series database

Twitter has its own time series database, known as Cuckoo, to which all of Twitter's metrics are sent and stored.

Backed by Twitter's distributed key-value storage system, Cuckoo is actually two services: Cuckoo-Read and Cuckoo-Write.

Cuckoo-Write ingests metrics and offers an API through which they can be written. In addition to storing metrics in Manhattan, Twitter's distributed key-value store, Cuckoo-Write ensures they are routed to the proper services for indexing. Data is kept at minute granularity for two weeks and at hourly granularity permanently.
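That retention policy implies a rollup step: minute-resolution samples must be aggregated into hourly points before the raw minutes expire. The storage layout is not public, so the following is only a toy sketch of such downsampling (the function names and the choice of a mean aggregation are assumptions):

```python
from collections import defaultdict

HOUR = 3600
TWO_WEEKS = 14 * 24 * HOUR

def rollup_to_hourly(points):
    """Aggregate (unix_ts, value) minute samples into hourly means."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % HOUR].append(value)  # floor to the hour boundary
    return {hour: sum(vs) / len(vs) for hour, vs in buckets.items()}

def expire_minute_data(points, now):
    """Keep only minute-granularity samples newer than two weeks."""
    return [(ts, v) for ts, v in points if now - ts <= TWO_WEEKS]
```

The hourly rollups are kept forever, while `expire_minute_data` models the two-week window after which only the coarser series survives.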

Cuckoo-Read, the time series database query engine, handles traffic from Twitter's alerting and dashboarding systems as well as user-initiated ad-hoc queries. Queries are specified in the Cuckoo query language, or CQL.

CQL query engine

Given the size of Twitter's data set, providing low latency and high availability for every query is very difficult. Twitter's engineers rely on the monitoring system to meet their service SLA across the more than 36 million real-time queries run each day.

Cuckoo-Read natively supports CQL queries. The query engine has three parts: the parser, the rewriter, and the executor. The parser converts query strings into internal abstract syntax trees (ASTs); the rewriter walks the AST, performing some calculations up front and replacing more complex nodes with simpler ones to improve performance; and the executor gathers data from downstream services and computes the output.
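CQL's grammar and internals are not public, so the following is a hypothetical miniature of that three-stage pipeline: a parser that produces a tiny AST, a rewriter that swaps one complex node for simpler equivalents, and an executor that fetches data downstream and computes the result. Everything here (the `avg(cpu)` syntax, the tuple AST, the `fetch` callback) is invented for illustration:

```python
def parse(query):
    """Parser: turn an "avg(cpu)"-style string into a nested-tuple AST."""
    fn, _, rest = query.partition("(")
    return ("call", fn, ("metric", rest.rstrip(")")))

def rewrite(ast):
    """Rewriter: replace a complex node with simpler ones, e.g.
    avg(x) -> sum(x) / count(x), so both sides reuse one data fetch path."""
    if ast[0] == "call" and ast[1] == "avg":
        return ("div", ("call", "sum", ast[2]), ("call", "count", ast[2]))
    return ast

def execute(ast, fetch):
    """Executor: pull series data via `fetch` and compute the output."""
    kind = ast[0]
    if kind == "metric":
        return fetch(ast[1])          # downstream storage lookup
    if kind == "call":
        values = execute(ast[2], fetch)
        return {"sum": sum, "count": len}[ast[1]](values)
    if kind == "div":
        return execute(ast[1], fetch) / execute(ast[2], fetch)
```

A query then flows through all three stages in order, e.g. `execute(rewrite(parse("avg(cpu)")), fetch)` with `fetch` supplied by the storage layer.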

