In the previous article in this series — The Everything Guide to Data Collection in DevSecOps — we discussed the importance of data collection. In this article, we’ll explore the role of monitoring in observability, especially as it relates to security, performance, and reliability.
Monitoring is essential for detecting issues and outliers as they happen in production and allows DevSecOps teams to identify and address issues before they cause serious damage. Monitoring performance degradations or suspicious activity can result in alerts and automatic responses to isolate potential problems or attacks.
In this article, we’ll look at monitoring in detail, provide several use cases and best practices, and discuss how monitoring can specifically improve security, performance, and reliability through observability.
What is the role of monitoring in observability?
In an observable system, we collect data from logs, metrics, and distributed traces. And while for very small systems you can manually browse and search the logs, visualize the metrics as charts, and trace through diagrams showing how traffic flows through the system in order to identify problems — at scale, this is not enough. You need monitoring, an automated process that keeps an eye on this data and alerts you appropriately. (For a more detailed treatment on the difference between monitoring and observability, you can check out this resource.)
In an enterprise, you need automated ways to filter, aggregate, enrich, and analyze all this data. Enterprises also need automated ways to take action when something unusual is detected. The automated response can notify the responsible team or even take remediating action directly.
In other fields, such as medicine, monitoring patients’ vital signs is a key activity that saves lives. Monitoring software systems is very similar, and we even use the same terminology when we perform health checks and discuss the health of different components.
Enough theory, let’s look at some concrete examples of monitoring.
Use cases of monitoring for observability
Here are some typical use cases that take advantage of monitoring:
Web applications are a major part of many large-scale distributed systems and key to the success of digital-first businesses. Monitoring Kubernetes, containerized applications, or simply the web server logs for an excessive number of error codes (such as 5xx) can help a team address performance and reliability issues before they become significant problems.
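As a rough illustration of the web server case, the sketch below computes the share of 5xx responses in a batch of access-log lines and compares it to a threshold. It assumes logs in Common Log Format; the function names and the 5% threshold are hypothetical choices for this example, not values from any particular tool.

```python
import re

# Assumption: Common Log Format, where the status code follows the quoted request,
# e.g. ... "GET / HTTP/1.1" 502 123
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def server_error_rate(log_lines):
    """Return the fraction of parsed requests whose status code is 5xx."""
    total = 0
    errors = 0
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    return errors / total if total else 0.0


def should_alert(log_lines, threshold=0.05):
    """Hypothetical rule: alert when more than 5% of requests fail with 5xx."""
    return server_error_rate(log_lines) > threshold
```

In practice this check would run over a sliding time window rather than a fixed batch of lines.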
At the infrastructure level, it is important to monitor the CPU, memory, and storage of your servers. Like most enterprises, you’re likely using autoscaling so your system can allocate more capacity. Platform logs capture when there are changes to resources, such as when they are provisioned, deprovisioned, or reconfigured. Even with autoscaling in place, monitoring these resource metrics and logs can help you ensure you’re working within quotas and limits, and it can help your organization when it comes to resource planning and budgeting.
Datastores are at the heart of most large-scale systems. If your data is lost, corrupted, or unavailable, then you have a serious situation on your hands. To keep track of your data, you need to monitor database connections, query duration metrics, disk space, backups, and error rates. You should also understand your datastores and set alerts when you observe values that are outside the expected range, such as slow queries, a high error rate, or low disk space. You can also set up logging for your databases to capture connections, queries, and changes to fields or tables. Monitoring your database logs can help you find places to improve performance and reliability, and it can also improve security by revealing malicious (or unintentional) operations.
Note that monitoring is much more involved than setting a simple condition (such as “more than five INSERT queries to the orders database within two minutes”) and firing an alert when that condition is met. Seasonality may be at play, with usage patterns that cause spikes at certain times of the day, week, or year. Effective monitoring that detects unexpected behavior takes context into account and can recognize trends based on past data.
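One simple way to take seasonality into account is to compare the current value against past values from the same time slot (for example, the same hour of the week) instead of a fixed threshold. The sketch below is a minimal version of that idea using a standard-deviation test. The function name and the three-sigma cutoff are assumptions for illustration; production tools typically use more sophisticated models.

```python
from statistics import mean, stdev


def is_anomalous(current, history_same_slot, max_sigma=3.0):
    """Flag `current` if it is far from past values seen in the same time slot.

    `history_same_slot` holds earlier observations for the same slot,
    such as the INSERT count on previous Mondays between 09:00 and 10:00.
    """
    if len(history_same_slot) < 2:
        return False  # not enough history to judge
    baseline = mean(history_same_slot)
    spread = stdev(history_same_slot)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > max_sigma


# The same value can be normal at peak time and suspicious at a quiet hour.
peak_history = [100, 110, 90, 105, 95]
quiet_history = [5, 6, 4, 5, 5]
print(is_anomalous(104, peak_history))   # False
print(is_anomalous(104, quiet_history))  # True
```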
This type of monitoring can be immensely effective, especially when implemented with a tool that combines observability, monitoring, and security at scale. In this case study from Sumo Logic and Infor, for example, Infor was able to save 5,000 hours of time spent on incidents.
How does monitoring contribute specifically to improving performance and reliability?
Monitoring improves the performance and reliability of a system by detecting problems early to avoid degradation. Performance problems often become availability and reliability problems. This is especially true in the presence of timeouts. For example, suppose an application times out after 60 seconds. Due to a recent performance issue, many requests suddenly take more than 60 seconds to process. All these requests will now fail, and the application is now unreliable.
A common best practice for addressing this is to monitor the four golden signals of any component in the critical path of high-priority services and applications: latency, traffic, errors, and saturation.
How long does it take to process a request? Note that the latency of successful requests may be different from that of failed requests. Any significant increase in latency may indicate degrading system performance. On the other hand, any significant decrease might be a sign that some processing is being skipped. Either way, monitoring will bring attention to the possible issue.
Monitoring traffic gives you an understanding of the overall load on each component. Traffic can be measured in different ways for different components. For example:
REST API: the number of requests
A backend service: the depth of a queue
A data crunching component: the total bytes of data processed.
An increase in traffic may be due to organic business growth, which is a good thing. However, it may also point to a problem in an upstream system that is generating unusually high traffic.
An increase in the error rate of any component directly impacts the reliability and utility of the system. In addition, if failed operations are automatically retried, this can lead to an increase in traffic, which may subsequently lead to performance problems.
Out of the resources available, how much is the service or application using? This is what saturation monitoring tells you. For example, if a disk is full, then a service that writes logs to that disk will fail on every subsequent request. At a higher level, if a Kubernetes cluster doesn’t have available space on its nodes, then new pods will be pending and not scheduled, which can lead to latency issues.
As you notice, the four golden signals are related to one another. Problems often manifest across multiple signals.
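To make the four signals concrete, here is a minimal sketch that derives all of them from one window of request records. The `Request` type, the field names, and the idea of expressing saturation as traffic divided by a known capacity are assumptions made for this example; real systems usually read these values from a metrics backend.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    count = len(requests)
    # Latency of successful requests is tracked separately from failed ones.
    ok_durations = [r.duration_ms for r in requests if r.status < 500]
    failed = sum(1 for r in requests if r.status >= 500)
    traffic = count / window_seconds
    return {
        "latency_ms_median_ok": median(ok_durations) if ok_durations else None,
        "traffic_rps": traffic,
        "error_rate": failed / count if count else 0.0,
        # Assumption: capacity is known and expressed in requests per second.
        "saturation": traffic / capacity_rps,
    }
```

Computing the signals together from the same window makes it easier to see when a problem shows up in more than one of them.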
How does monitoring contribute specifically to improving security?
While any system health problem can directly or indirectly impact security, there are some direct threats that monitoring can help detect and mitigate.
Any anomaly, such as excessive CPU usage or a high volume of requests, may be an attacker trying to cause segmentation faults, perform illegal cryptomining, or launch a DDoS attack on the system.
An unusual number of packets hitting unusual ports might be a port scan.
A high number of 401 errors (authentication errors) with valid usernames and invalid passwords might be a dictionary attack.
A high number of 403 errors (forbidden access) may be a privilege escalation by an attacker using a compromised account.
Invalid payloads to public APIs resulting in an increase in 400 errors might be an attacker trying to maliciously crash your public-facing web applications.
A download of large amounts of data or any sensitive data outside of business hours might be an exfiltration attack by a compromised employee or rogue insider.
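As one example of turning these patterns into a check, the sketch below looks for the dictionary-attack signature described above: many 401 responses for usernames that actually exist. The event shape (dictionaries with `username` and `status` keys) and the limit of five failures are hypothetical, chosen only for illustration.

```python
from collections import Counter


def suspected_dictionary_targets(events, known_users, max_failures=5):
    """Return known usernames with more than `max_failures` 401 responses.

    `events` is assumed to cover a single time window of authentication logs.
    """
    failures = Counter(
        event["username"]
        for event in events
        if event["status"] == 401 and event["username"] in known_users
    )
    return sorted(user for user, count in failures.items() if count > max_failures)
```

A real detector would also consider the source addresses and compare the count against each account's normal failure rate.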
Best Practices for Monitoring to Improve Performance and Security
A system is made of multiple components, but it is more than the sum of its parts. At a basic level, you should monitor every component of your system (at least on the critical paths) for the four golden signals. What does this mean in practice?
Observing the key metrics
Establishing the metric ranges for normal operation
Setting alerts when components deviate from the acceptable range
You should also pay close attention to external dependencies. For example, if you run in the cloud or integrate with a third-party service provider, then you should monitor the public endpoints that you depend on and set alerts to detect problems. If a third party is down or its performance is degraded, this can cause a cascading failure in your system.
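A dependency check can be as simple as timing a probe and classifying the result. In the sketch below, `probe` is a hypothetical callable supplied by the caller (for example, a function that requests the third party's health endpoint), and the one-second latency limit is an arbitrary assumption.

```python
import time


def check_dependency(name, probe, max_latency_s=1.0):
    """Run `probe` and classify the dependency as ok, degraded, or down."""
    start = time.monotonic()
    try:
        probe()  # hypothetical: any callable that raises on failure
    except Exception as exc:
        return {"dependency": name, "state": "down", "detail": str(exc)}
    elapsed = time.monotonic() - start
    state = "degraded" if elapsed > max_latency_s else "ok"
    return {"dependency": name, "state": state, "latency_s": elapsed}
```

Distinguishing "degraded" from "down" matters because a slow dependency can trigger timeouts and cascading failures even though it is technically still responding.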
It’s not possible to have 100% reliable components. However, monitoring can help you create a reliable system from non-reliable components by detecting problems with components — both internal and external — and either replacing them or gracefully degrading service. For example, if you are running your system in a multi-zone configuration and there is a problem in one zone, then monitoring can detect this and trigger re-routing (manually or automatically) of all traffic to other zones.
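The multi-zone example can be sketched as a small function that turns zone health (as reported by monitoring) into traffic weights. The zone names and the equal-share policy are assumptions for illustration; a real load balancer would apply its own configuration.

```python
def route_weights(zone_health):
    """Split traffic evenly across healthy zones and send none to unhealthy ones."""
    healthy = [zone for zone, ok in zone_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy zones available")
    share = 1.0 / len(healthy)
    return {zone: (share if ok else 0.0) for zone, ok in zone_health.items()}


# Hypothetical zones: zone-b is reported unhealthy, so it receives no traffic.
print(route_weights({"zone-a": True, "zone-b": False, "zone-c": True}))
```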
For security, the four signals may be auxiliary indicators of a compromise too. This is especially the case, for example, if you see a spike in your endpoint device or cloud workload CPUs, or an increase in the number of failed login attempts. However, security monitoring must be very deliberate since you are dealing with malicious adversaries. You must define the attack surface of each component and of the entire system, and ensure the information you are collecting is sufficient to detect issues. For example, to detect data exfiltration, you can monitor the IP addresses and amount of data sent outside your internal network by different applications and services. If you don’t have that data, you will be blind to that type of attack.
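The exfiltration example could look roughly like the sketch below, which totals the bytes each application sends to addresses outside the internal network. The flow record shape, the `10.0.0.0/8` internal range, and the byte limit are all assumptions for this example.

```python
import ipaddress
from collections import defaultdict

# Assumption: everything inside 10.0.0.0/8 counts as the internal network.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def is_internal(ip):
    address = ipaddress.ip_address(ip)
    return any(address in network for network in INTERNAL_NETWORKS)


def external_bytes_by_app(flows):
    """Sum bytes sent to external destinations, grouped by application.

    Each flow is assumed to be a tuple of (app_name, destination_ip, bytes_sent).
    """
    totals = defaultdict(int)
    for app, dest_ip, sent_bytes in flows:
        if not is_internal(dest_ip):
            totals[app] += sent_bytes
    return dict(totals)


def exfiltration_suspects(flows, limit_bytes):
    """Return applications whose outbound volume exceeds a hypothetical limit."""
    totals = external_bytes_by_app(flows)
    return sorted(app for app, total in totals.items() if total > limit_bytes)
```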
Implementing a Monitoring Strategy
Once you set up your data collection, you can follow the steps below to roll out a robust and effective monitoring strategy.
1. Identify critical assets.
You have already performed a comprehensive inventory of all your assets as part of data collection. Now, your task is to identify the critical assets that must be monitored closely to prevent and mitigate disasters. It is easy to say, “just monitor everything,” but there are costs to consider with monitoring. Monitoring and raising alerts for your staging and development environments or experimental services can put a lot of unnecessary stress on your engineers. Frequent 3 AM alerts for insignificant issues will cause alert fatigue, crippling your team’s drive to address an issue when it really matters.
2. Assign an owner for every critical asset.
Once you identify the critical assets, you need a clear owner for each one. The owner can be a person or a team. In the case of a person, be sure to also identify a fallback. It’s also important to maintain asset ownership as people join and leave the organization or move to other roles and teams.
3. Define alerts for critical assets.
Ultimately, your monitoring strategy will live or die based on how you define alerts for assets that are unhealthy or potentially compromised. You need to understand what’s normal for each asset.
If you’re monitoring metrics, then defining “normal” means associating an attribute (such as CPU utilization) with a range of values (such as “50%-80%”). The normal band can change dynamically with the business and can vary at different times and different locations. In some cases, you may just have ceilings or floors. By defining normal ranges, you create alerts to notify an asset owner when their asset is operating outside of the normal range.
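A normal band with an optional floor and ceiling can be modeled in a few lines. The `NormalRange` class below is a hypothetical sketch, using the 50%-80% CPU utilization band from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NormalRange:
    floor: Optional[float] = None    # None means "no lower bound"
    ceiling: Optional[float] = None  # None means "no upper bound"

    def violation(self, value):
        """Return "below", "above", or None when the value is in range."""
        if self.floor is not None and value < self.floor:
            return "below"
        if self.ceiling is not None and value > self.ceiling:
            return "above"
        return None


cpu_utilization = NormalRange(floor=50, ceiling=80)
print(cpu_utilization.violation(65))  # None
print(cpu_utilization.violation(92))  # above
```

Because the band can change with the business, the range would normally be loaded from configuration or recomputed from recent data rather than hard-coded.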
If you’re monitoring logs, then alerts are usually defined based on the result of certain log queries (such as “number of 404 errors logged across all API services in the last five minutes”) satisfying or failing a condition (such as “is fewer than 10”). Log management and analytics tools can help.
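The log-based alert described above can be sketched as a query plus a condition. Here the "query" counts 404 entries in the last five minutes and the alert fires when the count is no longer fewer than 10. The entry shape (timestamp, status code) and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta


def count_recent(entries, status, now, window=timedelta(minutes=5)):
    """Count log entries with the given status inside the trailing window."""
    cutoff = now - window
    return sum(1 for ts, code in entries if code == status and cutoff <= ts <= now)


def alert_fires(entries, now, status=404, max_allowed=10):
    """Fire when the condition "is fewer than max_allowed" fails."""
    return count_recent(entries, status, now) >= max_allowed
```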
4. Define runbooks for every alert.
When a critical alert fires, what do you do? What you don’t want to do is try to figure out your strategy on the spot, while customers are tweeting about your company’s unreliable products and management is panicking.
A runbook is a recipe of easy-to-follow steps that you prepare and test ahead of time to help you collect additional information (for example, which dashboards to look at and what command-line scripts to run to diagnose the root cause) and take mitigating actions (for example, deploying the previous version of the application). Your runbook should help you quickly narrow the problem down to a specific issue and identify the best person to handle it.
5. Set up an on-call process.
You have owners, alerts, and runbooks. Often, the alerts are not specific enough to map directly to the owners. The best practice is to assign on-call engineers to different areas of the business. This on-call engineer will receive the alert, follow the runbook, look at the dashboard, and try to understand the root cause. If they can’t understand or fix the problem, they will escalate it to the owner. Keep in mind that this process can be complicated; often, a problem occurs due to a chain of failures that require multiple stakeholders to collaborate to solve the issue.
6. Move towards self-healing.
Runbooks are great, but maintaining complex runbooks and training on-call engineers to follow them takes a lot of effort. And in the end, your remediation process still depends on a slow and error-prone human. If your runbook is not up to date, following it can worsen the crisis.
Theoretically, a runbook can be executed programmatically. If the runbook says, “when alert X fires, process Y should restart”, then a script or program can receive a notification of alert X and restart process Y. The same program can monitor process Y post-restart, ensure everything is fine, and eventually generate a report of the incident — all without waking up the on-call engineer. If the self-healing action fails, then the on-call engineer can be contacted.
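The "alert X restarts process Y" flow might be automated along the lines of the sketch below. All four callables (`restart`, `is_healthy`, `page_on_call`) and the alert fields are hypothetical hooks that you would wire to your own process manager and paging system.

```python
def self_heal(alert, restart, is_healthy, page_on_call, checks=3):
    """Restart the affected process, verify it, and escalate only on failure."""
    process = alert["process"]
    restart(process)
    # A real implementation would wait between checks; omitted to keep this short.
    healed = all(is_healthy(process) for _ in range(checks))
    report = {"alert": alert["name"], "process": process, "healed": healed}
    if not healed:
        page_on_call(report)  # self-healing failed, so a human takes over
    return report
```

Passing the actions in as functions keeps the remediation logic easy to test without touching production systems.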
7. Establish a post-mortem process.
Self-healing is awesome. However, an ounce of prevention is worth a pound of cure, so it’s best to prevent problems in the first place. Every incident is an opportunity to learn and possibly prevent a whole class of problems. For example, if multiple incidents happen because buggy code makes its way to production, then a lesson from incident post-mortems could be to improve testing in staging. If the response of the on-call engineer to an alert was too slow or the runbook was out of date, then this may suggest that the team should invest in some self-healing practices.
Monitoring is a crucial part of observability in general and observability for security in particular. It’s impractical at a large scale for humans to “just look every now and then” at various dashboards and graphs to detect problems. You need an entire set of incident response practices that include identifying owners, setting up alerts, writing runbooks, automating runbooks, and setting up on-call processes and post-mortem processes.
Have a really great day!