Time-Series Databases are powerful and interesting beasts. We are selecting a new one for our OpsStack Total Operations Platform, for use in all parts of the system, from gathering field metrics, to driving our AI & Expert Systems, to handling our internal app logs, function timing, and other data gathering that makes things work (taking after Etsy in this regard).
We’ve looked at all the various players, including Prometheus, OpenTSDB, Graphite, and others. In the end, we chose InfluxDB and its related tools, for a variety of reasons I wanted to lay out here.
First, we need to tag on multiple dimensions, which is the new standard and makes the older Graphite-style tags-in-the-metric-name approach obsolete. We all need a metric name surrounded by many highly variable tags, such as hostname, region name, HTTP request path, or log fingerprint, which are indexed for rapid lookup. InfluxDB and most new TSDBs support this convenient multi-tag concept.
InfluxDB also allows multiple data fields per point, making it easy to gather multi-field data like CPU / RAM use or SQL query types per second. The tag vs. field and indexing / aggregation models are all clear and convincing.
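To make the tag vs. field split concrete, here is a minimal sketch of how one point might be rendered in InfluxDB's text line protocol: a measurement name, indexed tags, one or more fields, and an optional nanosecond timestamp. The helper names (to_line, format_field_value) and the sample measurement, tags, and fields are made up for illustration and are not part of any InfluxDB client library.

```python
def _escape(text, chars):
    """Backslash-escape every character from `chars` found in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field_value(value):
    """Render a Python value as a line-protocol field value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Everything else is sent as a double-quoted string.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=... field=... [timestamp]."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags sorted by key (hypothetical helper's choice)
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field_value(value)
        for key, value in fields.items()
    )
    line = head + " " + body
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

Calling to_line("cpu", {"host": "web01", "region": "us-east"}, {"user": 12.5, "procs": 3}) yields a single line with two indexed tags and two fields, which is the shape that makes multi-field gathering cheap.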
Second, we get data from everywhere, and as InfluxDB is a latecomer, it supports all sorts of data feeds, from its own Telegraf to statsd to collectd to various HTTP endpoints to UDP (perfect and necessary for logging from app code). This lets us integrate with other systems and over time migrate to the most appropriate, plus use the ever-increasing Telegraf ecosystem where we can.
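As a rough idea of what UDP logging from app code can look like, the sketch below sends one line-protocol string as a single datagram and never waits for a reply. The function name send_line_udp is hypothetical, and the host and port defaults are assumptions: they only work if a UDP listener has been enabled on the InfluxDB side at that address.

```python
import socket


def send_line_udp(line, host="127.0.0.1", port=8089):
    """Fire-and-forget: send one line-protocol string as a UDP datagram.

    host and port are assumed values; point them at wherever a UDP
    listener is actually configured. Returns the number of bytes sent.
    """
    payload = line.encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # No connection and no acknowledgement: if nobody is listening,
        # the datagram is simply lost and the app code carries on.
        return sock.sendto(payload, (host, port))
```

The trade-off is the usual one for UDP: the application never blocks on the metrics system, at the price of silently dropped points when the network or the listener is unavailable.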
The InfluxDB mostly-automatic aggregation / reduction system is similar to what others do and very helpful in data crunching, something we are very familiar with from our large-scale monitoring systems.
Third, using a query language as close to SQL as possible is genius, as it just makes it easier to use while avoiding endless mistakes and challenges from just being different. I detest JSON and various random query languages, or even worse, in-code custom logic functions that resemble a bad ORM. Just say no and use SQL as much as you can, and no one gets hurt.
Using SQL is a huge plus
Related to using SQL is that most commands are very similar to MySQL, which we know and love, e.g. ‘show databases’ or ‘use dbname’. This just makes life easier, increasing efficiency while reducing mistakes and confusion.
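For a feel of how SQL-like the query language is, here is a small sketch that builds a request URL for the HTTP /query endpoint around an InfluxQL statement. The base URL, the database name "metrics", the measurement "cpu", and the field "usage_user" are illustrative assumptions, and build_query_url is a hypothetical helper, not a client library call.

```python
from urllib.parse import urlencode

# An InfluxQL statement: apart from the time functions it reads like SQL.
EXAMPLE_QUERY = (
    'SELECT mean("usage_user") FROM "cpu" '
    "WHERE \"host\" = 'web01' AND time > now() - 1h "
    "GROUP BY time(5m)"
)


def build_query_url(base_url, database, influxql):
    """Return a GET URL for the /query endpoint with db and q parameters."""
    params = urlencode({"db": database, "q": influxql})
    return f"{base_url.rstrip('/')}/query?{params}"
```

Something like build_query_url("http://localhost:8086", "metrics", EXAMPLE_QUERY) gives a URL that any HTTP client can fetch, which is a long way from hand-assembling a JSON query document.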
Fourth, we looked closely at the increasingly popular Prometheus, but as a monitoring system, its pull-only agent model is a deal breaker for a SaaS system. Our old systems worked this way, and we just cannot continue to ask customers to open ports for us; the push gateway is not really a solution. In addition, and partly due to the pull model, Prometheus does not allow sending timestamps with the data. That makes it useless for batch gathering and sending, which we need for high-resolution gathering, and even more so in poor-connectivity environments on a global scale.
No push model is a deal-breaker for the Prometheus model
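The batch-and-push pattern we rely on can be sketched as follows: each point is stamped with the client's clock when it is gathered, buffered locally, and pushed as one newline-separated batch, with the batch kept for retry if the link is down. PointBuffer and its parameters are hypothetical names for illustration, and the send function is whatever transport you plug in (HTTP, UDP, or something else).

```python
import time


class PointBuffer:
    """Buffer line-protocol points with client-side timestamps.

    `send` is any callable that takes one newline-joined batch string and
    raises OSError on a network failure (an assumption of this sketch).
    """

    def __init__(self, send, max_points=500, clock=time.time_ns):
        self._send = send
        self._max_points = max_points
        self._clock = clock  # nanoseconds since the epoch by default
        self._lines = []

    def add(self, line_without_timestamp):
        """Stamp the point at gather time, then flush if the batch is full."""
        self._lines.append(f"{line_without_timestamp} {self._clock()}")
        if len(self._lines) >= self._max_points:
            self.flush()

    def flush(self):
        """Push everything buffered; keep it all if the send fails."""
        if not self._lines:
            return True
        try:
            self._send("\n".join(self._lines))
        except OSError:
            # Link is down: hold the points. Their timestamps were set when
            # they were gathered, so a late delivery still lands correctly.
            return False
        self._lines.clear()
        return True
```

Because the timestamp travels with each point, a batch delivered minutes late still records when the data was gathered, which is exactly what a scrape-time timestamp cannot give us.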
Fifth, being written in Go makes InfluxDB absurdly easy to install and configure, and it has nice packages for every platform, including native macOS, Windows, various Linux distributions, etc. Easy, easy, and works as advertised. That is very different from systems that depend on Hadoop, or systems that use Go, Python, Java, and Ruby all misguidedly mixed together, for example.
Finally, the docs are really quite good, as expected for a commercial provider.
Of course, InfluxDB is pretty new and has had some challenges / changes in its clustering and storage models, so we’ll see how those work out at scale, but for now we are willing to put up with this for what is, to us, the product best suited to our needs. And we still have to work out whether we’ll really commit to Influx’s Kapacitor for alerting, but we’ll see.
Learn more about our Total Ops Platform at OpsStack.io