Time-Series Databases are powerful and interesting beasts. We are selecting a new one for our OpsStack Total Operations Platform, for use in all parts of the system, from gathering field metrics, to driving our AI & Expert Systems, to handling our internal app logs, function timing, and other data gathering that makes things work (taking after Etsy in this regard).
We’ve looked at all the various players, including Prometheus, OpenTSDB, Graphite, and others. In the end, we chose InfluxDB and its related tools, for a variety of reasons I wanted to lay out here.
First, we need to tag on multiple dimensions, which is the new standard and thus makes obsolete the older Graphite-like tags-in-metric-name concepts. We all need a metric name and many, highly-variable tags around that, which are indexed for rapid lookup, like hostname, region name, http request path, log fingerprint, etc. InfluxDB and most new TSDBs support this convenient multi-tag concept.
InfluxDB also allows multiple data fields per point, making it easy to gather multi-field data like CPU / RAM use or SQL query types per second. The tag vs. field and indexing / aggregation models are all clear and convincing.
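To make the tag vs. field split concrete, here is a minimal sketch of how a single point with several tags and several fields could be rendered in InfluxDB's text line protocol. The helper names (`to_line`, `format_field`) and the sample measurement, tags, and values are made up for illustration, and the escaping is simplified (it does not handle backslashes in tag values, for instance).

```python
def _escape(text, chars):
    """Backslash-escape each character in `chars` (simplified escaping)."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render one field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=val,... field=val,... [timestamp]."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags are indexed; sorting by key keeps output stable
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field(val) for key, val in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical example: two tags (indexed) and two fields (the actual data)
print(to_line("cpu", {"host": "web01", "region": "us-east"},
              {"usage_user": 42.5, "procs": 12}))
```

The tags end up in the series key on the left, where they are indexed for lookup, while the fields on the right carry the measured values.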
Second, we get data from everywhere, and as InfluxDB is a latecomer, it supports all sorts of data feeds, from its own Telegraf to statsd to collectd to various HTTP endpoints to UDP (perfect and necessary for logging from app code). This lets us integrate with other systems and over time migrate to the most appropriate, plus use the ever-increasing Telegraf ecosystem where we can.
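As a rough sketch of why UDP is so convenient from app code: sending a metric is a single fire-and-forget datagram, with no connection to manage and no blocking on the database. The function name `send_metric_udp` is hypothetical, and the host and port are assumptions; they must match whatever UDP listener is actually enabled in your InfluxDB configuration.

```python
import socket


def send_metric_udp(line, host="127.0.0.1", port=8089):
    """Send one line-protocol string as a single UDP datagram.

    host and port are assumptions for illustration; use the address of the
    UDP listener configured on your own server.
    Returns the number of bytes handed to the socket.
    """
    payload = line.encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # No handshake and no reply: if nothing is listening, the data is lost.
        sock.sendto(payload, (host, port))
    return len(payload)


# Hypothetical usage from inside application code:
# send_metric_udp("app_timing,func=checkout duration_ms=12.5")
```

The trade-off is that delivery is not guaranteed, which is usually acceptable for function timings and app logs but not for data you cannot afford to lose.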
The InfluxDB mostly-automatic aggregation / reduction system is similar to what others do and very helpful in data crunching, something we are very familiar with from our large-scale monitoring systems.
Third, using a query language as close to SQL as possible is genius, as it just makes it easier to use while avoiding endless mistakes and challenges from just being different. I detest JSON and various random query languages, or even worse, in-code custom logic functions that resemble a bad ORM. Just say no and use SQL as much as you can, and no one gets hurt.
Using SQL is a huge plus
Related to using SQL is that most commands are very similar to MySQL, which we know and love, e.g. ‘show databases’ or ‘use dbname’. This just makes life easier, increasing efficiency while reducing mistakes and confusion.
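To give a feel for how SQL-like this is, the sketch below builds a request URL for the InfluxDB 1.x HTTP `/query` endpoint carrying an InfluxQL statement. The function name `build_query_url`, the server address, the database name `opsdata`, and the measurement and tag names are all assumptions for illustration; nothing is actually sent.

```python
from urllib.parse import urlencode


def build_query_url(base_url, database, influxql):
    """Return a /query URL with the database and InfluxQL statement encoded."""
    params = urlencode({"db": database, "q": influxql})
    return f"{base_url}/query?{params}"


# Hypothetical query: 5-minute average CPU use per host over the last hour
QUERY = (
    'SELECT mean("usage_user") FROM "cpu" '
    "WHERE \"region\" = 'us-east' AND time > now() - 1h "
    'GROUP BY time(5m), "host"'
)

url = build_query_url("http://localhost:8086", "opsdata", QUERY)
print(url)
```

Anyone who has written a SELECT with WHERE and GROUP BY can read that statement without opening the docs, which is exactly the point.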
Fourth, we looked closely at the increasingly-popular Prometheus, but as a monitoring system, the pull-only agent model is a deal breaker for a SaaS system. Our old systems worked this way, and we just cannot continue to ask customers to open ports for us; the push gateway is not really a solution. In addition, and partly due to the pull model, Prometheus does not allow sending timestamps with the data, which makes it useless for batch gathering and sending, which we need for high-resolution gathering, not to mention in poor-connectivity environments on a global scale.
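The batch-and-send pattern we need looks roughly like the sketch below: each point is stamped with its own collection time on the client, buffered, and later flushed as one newline-separated payload. The class name `MetricBatch` and its methods are hypothetical, and the sketch leaves out the transport; in InfluxDB 1.x the resulting payload could be posted to the HTTP `/write` endpoint.

```python
import time


class MetricBatch:
    """Buffer points with client-side timestamps and flush them together."""

    def __init__(self, clock=time.time_ns):
        self._clock = clock  # injectable so the collection time can be controlled
        self._lines = []

    def add(self, series, field_text):
        """Record one point, stamped in nanoseconds at the moment of collection."""
        self._lines.append(f"{series} {field_text} {self._clock()}")

    def flush(self):
        """Return all buffered lines as one payload and clear the buffer."""
        payload = "\n".join(self._lines)
        self._lines = []
        return payload


# Hypothetical usage: gather at high resolution, send whenever connectivity allows
batch = MetricBatch()
batch.add("cpu,host=web01", "usage_user=41.0")
batch.add("cpu,host=web01", "usage_user=43.5")
payload = batch.flush()
```

Because every line carries its own timestamp, the points keep their original collection times no matter how long the batch waits on a poor connection.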
No push model is a deal-breaker for the Prometheus model
Fifth, being written in Go makes InfluxDB absurdly easy to install and configure, and it has nice packages for every platform, including native macOS, Windows, various Linux distributions, etc. Easy, easy, and works as advertised. Very different from the systems dependent on Hadoop, or systems that use Go, Python, Java, and Ruby all misguidedly mixed together, for example.
Finally, the docs are really quite good, as expected for a commercial provider.
Of course, InfluxDB is pretty new and has had some challenges / changes in its clustering and storage models, so we’ll see how those work out at scale, but for now we are willing to put up with this for what is, to us, the product best suited to our needs. And we still have to work out whether we’ll really commit to Influx’s Kapacitor for alerting, but we’ll see.
Learn more about our Total Ops Platform at OpsStack.io
Top comments (5)
Might as well just use something like MemSQL and get a real distributed relational database with columnstore functionality. Works great with time-series or any other data warehouse needs with full SQL support, including joins (and fast in-memory rowstore tables for OLTP).
Yes, InfluxDB certainly needs an HA / distributed solution, but using non-TSDB like this means losing tagging, protocol support (UDP rules!), alerting options, data aggregation, dimensional / time analyses, and probably more like high cardinality, multi-tag indexing on large data sets - especially with some newer TSDBs really diving into stats functions for clustering, anomalies, etc. MemSQL looks interesting, a bit complex, MySQL wire-compat is nice. I'm pretty new to TSDBs, too, so probably missing stuff.
Influx does have an HA solution, but it's closed source and has associated license costs. We use it in our setup with a load balancer to front the HTTP endpoint and it works pretty well. Search for Influx Enterprise for details, all the docs are public.
Yes, they do have a new commercial HA system, though it seems quite young and I think it was created after another attempt, so there is still a ways to go. Plus, of course, I would prefer to see an open-source edition, even if it had limited shards, size, etc., so lots of people could run it for real in real systems.
If you are interested in a fully open-source time-series database with clustering support built into its core, you might want to take a look at SiriDB, a time-series database we built from the ground up to be scalable on the fly, fast, and robust.
The full project is available on GitHub