In Part 1: Docker Container monitoring and management Challenges – we discussed why container monitoring is challenging, especially in the context of orchestration tools. In Part 2 we described Top 10 Container Metrics to Monitor. Next, let’s have a look at examples and available container monitoring tools for better operational insights into container deployments.
The first step to get visibility into your container infrastructure is probably using the built-in tools like docker command line and kubectl for Kubernetes. There’s a whole set of commands that are used to find the relevant information about containers. Please note that the usage of kubectl and docker command is typically available to only a few people who have direct access to the orchestration tool. Nevertheless, all cloud engineers require command line skills and, in some situations, command line tools are indeed the only tools available.
Before we start looking at Docker log collection tools, check out these two useful Docker Cheatsheets.
Check out Docker Commands Cheat Sheet
Check out Docker Swarm Cheat Sheet
There are several monitoring dashboards available for orchestration tools like Rancher, Portainer, Docker Enterprise. They typically provide a simple real-time metrics and a real-time logs view. The navigation to containers takes a few clicks. However, overviews with aggregated metrics or cluster-wide log search are typically not integrated. As such, the basic monitoring functionality in most orchestration tools is simply too basic. Better tools are needed. Let’s look at some more attractive and capable monitoring and log management solutions.
Several open source tools are available for DIY-style container monitoring and logging. Typically logs and metrics are stored in different data stores. Elastic Stack is the tool of choice for logs while Prometheus is popular for metrics.
Depending on your metrics and logs data store choices you may need to use a different set of data collectors and dashboard tools. Telegraf and Prometheus are the most flexible open source data collectors we’ve evaluated. Prometheus exporters need a scraper (Prometheus Server or alternative 3rd party scraper) or a remote storage interface for Prometheus Server to store metrics in alternative data stores. Grafana is the most flexible monitoring dashboard tool with support for most data sources like Prometheus, InfluxDB, Elasticsearch, etc.
Kibana and metric beats for data collection are tightly bound to Elasticsearch and are thus not usable with any other data store.
The following matrix shows which data collectors typically play with which storage engine and monitoring dashboard tool. Note there are several other variations possible.
|Data Collector for Containers||Storage / Time Series DB||User Interface|
|Telegraf / syslog + docker syslog driver||InfluxDB||Grafana, Chronograf|
|Logagent||Elasticsearch & Sematext Cloud||Kibana & Sematext|
|Various 3rd party and commercial integrations||Promdash, Grafana|
|Telegraf Elasticsearch output||Elasticsearch||Kibana|
|Sematext Agent||InfluxDB & Sematext Cloud||Chronograf & Sematext Cloud|
Compatibility of monitoring tools and time series storage engines
The Elastic Stack might seem like an excellent candidate to unify metrics and logs in one data store. As providers of Elasticsearch consulting, training, and Elasticsearch support we would love nothing more than everyone using Elasticsearch for not just logs, but also metrics. However, the truth is that Elasticsearch is not the most efficient as time series databases for metrics. Trust us, we’ve run numerous benchmarks, applied all kinds of performance tuning tricks from our rather big back of Elasticsearch tricks, but it turns out there are better, more efficient, faster data stores for metrics than Elasticsearch. The setup and maintenance of logging and monitoring infrastructure become complicated when it reaches a larger scale.
After the initial setup of storage engines for metrics and logs the time-consuming work starts: the setup of log shippers and monitoring agents, dashboards and alert rules. Dealing with log collection for containers can be tricky, so you’ll want to consult top 10 Docker logging gotchas and Docker log driver alternatives.
After the setup of data collectors, we need to visualize metrics and logs. The most popular dashboard tools are Grafana and Kibana. Grafana does an excellent job as a dashboard tool for showing data from a number of data sources including Elasticsearch, InfluxDB, and Prometheus. In general, though, Grafana is really more tailored for metrics even though using Grafana for logs with Elasticsearch is possible, too. Grafana is still very limited for ad-hoc log searches but has integrated alerting on logs.
Kibana, on the other hand, supports only Elasticsearch as a data source. Some dashboard views are “impossible” to implement because different monitoring and logging tools have limited options to correlate data from different data stores. Once dashboards are built and ready to share with the team, the next hot topic for Kibana users is security, authentication and role-based access control (RBAC). Grafana supports user authentication and simple roles, while Kibana (or in general the Elastic Stack) requires X-Pack as commercial extensions to support various security features like user authentication and RBAC. Depending on the requirements of your organization one of the X-Pack Alternatives might be helpful.
When planning the setup of open source monitoring people often underestimate the amount of data generated by monitoring agents and log shippers. More specifically, most organizations underestimate the resources needed for processing, storage, and retrieval of metrics and logs as their volume grows and, even more importantly, organizations often underestimate the human effort and time that will have to be invested into ongoing maintenance of the monitoring infrastructure and open-source tools. When that happens not only does the cost of infrastructure for monitoring and logging jump beyond anyone’s predictions, but so does the time and thus money required for maintenance. A common way to deal with this is to limit data retention. This requires fewer resources, less expertise needed to scale the infrastructure and tools and thus less maintenance, but this, of course, limits visibility and insights one can derive from long-term data.
Infrastructure costs are only one reason why we often see limited storage for metrics, traces, and logs. For example, InfluxDB has no clustering or sharding in the open source edition, and Prometheus supports only short retention time to avoid performance problems.
Another approach used for dealing with that is the reduction of granularity of metrics from 10-second accuracy to a minute or even more, sampling, and such. As a consequence, DevOps teams have less accurate information with less time to analyze problems, and limited view visibility for permanent or recurring issues, conducting historical trend analysis, or capacity planning.
In this post, we discussed only monitoring and logging. We completely ignoring distributed transaction tracing as the third pillar of observability for a moment. Keep in mind that as soon we start collecting transaction traces across microservices, the amount of data will explode and thus further increase the total cost of ownership of an on-premise monitoring setup. Note that data collection tools mentioned in this post handle only metrics and logs, not traces (for more on transaction tracing and, more specifically OpenTracing-compatible tracers see our Jaeger vs. Zipkin). Similarly, the dashboard tools we covered here don’t come with data collection and visualizations for transaction traces. This means that for distributed transaction tracing we need the third set of tools if we want to put together and run our own monitoring setup – welcome to the DevOps jungle!
There are a number of open source container observability tools for logging, monitoring, and tracing. If you and your team have time and if observability really needs to be your team’s core competency, you’ll need to invest time into finding the most promising tools, learning how to actually use them while evaluating them, and finally install, configure, and maintaining them. It would be wise to compare multiple solutions and check how well various tools play together. We suggest the following evaluation criteria:
- Coverage of collected metrics. Some tools collect only a few metrics, some gather a ton of metrics (which you may not really need), while other tools let you configure which metrics to collect. Missing relevant metrics can be frustrating when one is working under pressure to solve a production issue, just like having too many or wrong metrics will make it harder to locate signals that truly matter. Tools that require configuration for collection or visualization of each metric are time-consuming to set up and maintain. Don’t choose such tools. Instead, look for tools that give you good defaults and freedom to customize which metrics to collect.
- Coverage of log formats. A typical application stack consists of multiple components like databases, web servers, message queues, etc. Make sure that you can structure logs from your applications. This is key if you want to use your logs not only for troubleshooting, but also for deriving insights from logs. Defining log parser patterns with regular expressions or grok is time-consuming. It is very helpful having a library of existing patterns. This is a time saver, especially in the container world when you use official docker images.
- Collection of events. Any indication of why a service was restarted or crashed will help you classify problems quickly and get to the root cause faster. Any container monitoring tool should thus be collecting Docker events and Kubernetes status events if you run Kubernetes.
- Correlation of metrics, logs, and traces. Whether you initially spot a problem through metrics, logs, or traces, having access to all these observability data makes troubleshooting so much faster. A single UI displaying data from various sources is thus key for an interactive drill down, fast troubleshooting, faster MTTR and, frankly, makes devops’ job more enjoyable. See example.
- Machine Learning capabilities and anomaly detection for alerting on logs and metrics. Threshold-based alerts work well only for known and constant workloads. In dynamic environments, threshold-based alerts create too much noise. Make sure the solution you select has this core capability and that it doesn’t take ages to learn the baseline or require too much tweaking, training, and such.
- Detect and correlate metrics** with the same behavior.** When metrics behave in similar patterns, we typically find one of the metrics is the symptom of the root cause of a performance bottleneck. A good example we have seen in practice is high CPU usage paired with container swap activity and disk IO – in such a case CPU usage and even more disk IO could be reduced by switching off swapping for containers. For system metrics above the correlation is often known – but when you track your application-specific metrics you might find new correlation and bottlenecks in your microservices to optimize.
- Single sign-on. Correlating data stored in silos is impossible. Moreover, using multiple services often requires multiple accounts and forces you to learn not one, but multiple services, their UIs, etc. Each time you need to use both of them there is the painful overhead of needing to adjust things like time ranges before you can look at data in them in separate windows. This costs time and money and makes it harder to share data with the team.
- Role-based access control. Lack of RBAC ** ** is going to be a show-stopper for any tool seeking adoption at the corporate level. Tools that work fine for small teams and SMBs, but lack multi-user support with roles and permissions almost never meet requirements of large enterprises.
Concluding above, DevOps engineers need well-integrated monitoring, logging and tracing solution with advanced functionality like correlating between metrics, traces and logs. The saved costs of engineering and infrastructure required to run in-house monitoring can quickly pay off. Adjustable data retention times per monitored service help to optimize costs and satisfy operational needs. The better user experience for your DevOps team helps a lot, especially faster troubleshooting minimizes the revenue loss once a critical bug or performance issue hits your commercial services. In part 4 we will describe container monitoring with Sematext. While developing Sematext Cloud, we had had the above ideas in mind with the goal to provide a better container monitoring solution.