When it comes to cloud security, there seems to be a constant refrain: automate, automate, automate. At first blush, it sounds logical, right? In theory, automation should help eliminate human error, identify misconfigurations, check access and authorization, scan containers and Kubernetes clusters, accelerate the release process, and so on.
And who wouldn't want to automatically remediate problems, threats, and vulnerabilities, right? Well, actually, maybe you or your DevOps team wouldn't.
Before we dissect the issues involved, let's define what we mean by automatic remediation (also called auto-remediation).
What is auto-remediation?
Generally speaking, auto-remediation is the use of tools that detect and remediate cybersecurity issues (misconfigurations, threats, vulnerabilities, and so on) without human intervention. A great example is a security orchestration, automation, and response (SOAR) tool, such as Splunk Phantom, that gives security teams the ability to create "if this, then that" rules for their environment. For example, if the tool finds malware on an endpoint, it might automatically isolate the endpoint to ensure the malware doesn't infect other endpoints in the network. An alert would be sent to security staff to notify them of the problem and/or any action taken (which might include patching software or fixing misconfigurations, for example).
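The "if this, then that" pattern can be sketched as a tiny rule engine. This is a minimal, hypothetical illustration of the idea, not Splunk Phantom's actual API; the event fields, rule, and actions are all assumptions:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a SOAR-style "if this, then that" rule:
# malware found on an endpoint -> isolate the endpoint, alert staff.

@dataclass
class Event:
    kind: str       # e.g. "malware_detected"
    endpoint: str   # host the event was observed on

@dataclass
class Playbook:
    isolated: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def handle(self, event: Event) -> None:
        # Rule: if malware is found on an endpoint, isolate that
        # endpoint so it can't infect others, then notify the team.
        if event.kind == "malware_detected":
            self.isolated.append(event.endpoint)
            self.alerts.append(f"Isolated {event.endpoint}: malware detected")

playbook = Playbook()
playbook.handle(Event("malware_detected", "laptop-42"))
print(playbook.isolated)    # ['laptop-42']
print(playbook.alerts[0])
```

Real SOAR tools let you chain many such rules and actions (patching, ticket creation, notifications), but the core shape is the same: a condition on an event triggers a predefined response without waiting for a human.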
One of the major benefits of auto-remediation is that it's quicker than human intervention. An excessive time lag between detection and remediation gives an attacker or piece of malware more opportunity to do damage. And since some malware can spread quicker than celebrity gossip on Twitter, waiting even a few hours to remediate can be disastrous.
In addition to its efficiency, auto-remediation also helps reduce the load on already overburdened teams. Tracking down problems and fixing them is a tedious, complicated job. Automation can allow humans to spend more time on other, less onerous activities.
What's the problem?
With so many benefits, it's hard to imagine why anyone would object to deploying an auto-remediation tool in their environment. But just like with self-driving cars, too much automation can present its own hazards.
Here are some examples:
- Limited context. Even though AI has come a long way, it still has a long way to go when it comes to making judgment calls. One reason is that machine-learning models are limited by insufficient data. There's no way to feed a list of every possible threat or warning sign into a machine-learning model; those lists just don't exist. Over time, the data sets will grow (especially with new federated technologies that allow data sharing without compromising privacy), but even then, there will always be new monsters under the bed. What's more, every new piece of hardware or software, or change in configuration or cloud provider, reshapes the equation. In sum, there is an infinite number of items on the "crap that can go wrong" list. And since humans are the ones making these changes, we're often better equipped to be on alert for problems when we alter the environment.
- Unforeseen consequences. Sometimes, when software is allowed to make changes without human intervention or approval, things can go awry. For example, if an auto-remediation tool decides to isolate a whole server (as opposed to killing a specific process), it could result in outages and service-level agreement (SLA) violations. If a human is making the decision, they might be able to switch over to a redundant server, kill only the offending process, or take another, less drastic action. In most cases, humans will be less prone to overcorrection.
- Disproportionate reliance on the auto-remediation tool. To explain this one, I'll use the self-driving car analogy again. Weather—excessive fog, rain, or snow—can obscure these vehicles' sensors and cause accidents. Similarly, if the auto-remediation tool hangs or fails (and the error is not caught in time), it could render your network vulnerable.
- Robot uprising. Just kidding.
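One way to blunt the overcorrection risk above is to rank remediation actions from least to most drastic and pause for human sign-off before the disruptive ones. A hypothetical sketch, where the action names and approval flag are purely illustrative:

```python
# Hypothetical sketch: prefer the least drastic remediation, and
# require human approval before actions that risk an outage.

DRASTIC = {"isolate_server"}  # actions likely to cause SLA violations

def choose_action(threat_scope: str) -> str:
    # Prefer killing a single process over taking a whole server offline.
    return "kill_process" if threat_scope == "single_process" else "isolate_server"

def remediate(threat_scope: str, human_approved: bool = False) -> str:
    action = choose_action(threat_scope)
    if action in DRASTIC and not human_approved:
        return "pending_approval"  # pause instead of overcorrecting
    return action

print(remediate("single_process"))    # kill_process
print(remediate("whole_host"))        # pending_approval
print(remediate("whole_host", True))  # isolate_server
```

The trade-off is the same one discussed throughout this post: the approval gate slows response, but it keeps the most consequential decisions with a human.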
The bottom line is that if you do choose to use auto-remediation, extensive testing and tool validation are critical. DevOps teams will need to closely monitor the environment for unanticipated changes or events. If you're the one choosing the tool (but not one of the DevOps people who have to test and monitor it), be aware that the DevOps team may stop inviting you out for after-work cocktails in retaliation for foisting this burden upon them.
Dynamic remediation: a more prudent alternative?
A middle ground between manual and automated remediation is dynamic remediation. In this scenario, instead of relying completely on an auto-remediation tool, DevOps uses templates as guardrails to apply corrective actions. This allows your team to reap some of the benefits of automation, while mitigating some of the risk. Lightspin, for example, uses infrastructure as code (IaC) Terraform files to generate these templates for users to download and deploy. Users can customize the templates to fit their specific environments.
Some of the major advantages of dynamic remediation include:
- Customizability. As I mentioned earlier, most environments are constantly growing, evolving, and changing. Dynamic remediation allows you to tweak and adjust the actions the tool takes, as necessary, in real time.
- Integration of human intelligence. As noted above, fully automatic remediation is limited by its machine-learning data. Keeping humans in the equation can lower the chance of overcorrection.
- Appropriate time lags. Earlier, I explained how an excessive time lag between detection and remediation can create opportunities for a problem to escalate (lateral movement, for example). But that doesn't mean that all time lags are a bad thing. A quick pause to examine wider context and appropriate alternatives may make the difference between a "good" decision and a decision that will result in an SLA violation. Context is everything.
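The guardrail idea behind dynamic remediation can be sketched as a parameterized fix where the team may customize only whitelisted fields before applying it. This is a simplified illustration in Python; the template text and field names are assumptions, not Lightspin's actual Terraform output:

```python
from string import Template

# Hypothetical dynamic-remediation guardrail: a vendor ships a
# parameterized fix, and users customize only whitelisted fields.

REMEDIATION_TEMPLATE = Template(
    'resource "aws_s3_bucket_public_access_block" "$name" {\n'
    '  bucket            = "$bucket"\n'
    '  block_public_acls = true\n'
    '}\n'
)

ALLOWED_FIELDS = {"name", "bucket"}  # the guardrail: everything else is fixed

def render(fields: dict) -> str:
    unknown = set(fields) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"fields outside guardrails: {unknown}")
    return REMEDIATION_TEMPLATE.substitute(fields)

print(render({"name": "logs_fix", "bucket": "my-log-bucket"}))
```

The corrective action stays standardized (the public-access block is always applied), while the environment-specific details remain in the team's hands, and anything outside the guardrails is rejected rather than silently applied.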
Both dynamic and automated remediation will undoubtedly improve over time as datasets expand and as our technology grows increasingly sophisticated. In the meantime, just as with driverless cars, we should proceed with caution—continuously re-evaluating the risk/reward ratio we are comfortable with at any given moment.
I'd love to hear your thoughts on this issue, especially if you work in security or DevOps. What level of risk is acceptable? At what level of risk do we lose the benefits of automation? Let me know in the comments. And don't forget to connect with Outshift on Slack!