- Teams of 1: Use minimal images, add only the software you need, add a cronjob to auto-update using the package manager, or in the case of containers: pin your Dockerfile to the latest tag and redeploy as often as possible.
- Teams of 10: Automate your build process to create a golden image or container base image with best security practices.
- Teams of 100: Automate and monitor as much as possible. Try to keep your developers excited about patching, and start getting strict about not letting anything but your approved images go into production. Security team responsible for updates and patching strategy.
- Teams of 1000: Dedicated team for building, updating, and pentesting base images. Demand full E2E automation. Monitor in realtime and define RRDs with consequences.
Lately I've spent some time thinking about Vulnerability Management, hereafter 'vulnmgmt'. A major blueteam responsibility, it refers to keeping packages, kernels, and OSes up-to-date with patches. Generally, if you deploy a VM with the latest version of all software, there will be a public CVE for it within a few weeks, which leaves you vulnerable. While this sounds like a huge problem, I don't believe vulnmgmt should be anywhere near the top of the priority list for a team just starting to improve their security posture - there can be as many 10.0 CVEs on a box as you like if it is airgapped, sat in a closet somewhere collecting dust. Like all of security, this is a game of risk considerations - I would prefer to spend time and energy on ensuring strong network segmentation and good appsec than on vulnmgmt. Inevitably though, it does become a priority - especially because it's an easy business sell to auditors, customers, etc.
This is a huge industry with numerous different products and solutions being offered by most major vendors within the Cyber space, which of course means there's a lot of bullsh*t. I'm a big believer in build-not-buy as a general approach, although managers and senior engineers seem keen to tell me this will change as I get older/higher up. In short, I think Cyber is stuck in the 2000s-era of product development, trying to come up with catch-all solutions which offer a silver bullet rather than keeping products and feature sets in line with the unix philosophy of 'do one thing and do it well', and promoting interoperability. We should try to kill the idea that spending €100,000/yr on a product means we have good security.
For a brief primer on vulnmgmt in an engineering-led organisation, we have several types of compute resources we want to secure: likely bare-metal or virtual instances, and containers. For each of those we have two states, pre- and post-deploy. Some of these resources may have very short lifetimes, e.g. EC2 instances in an autoscaling group, while some might be long-running, e.g. a database instance for some back-office legacy app. N.B. Most cloud-native organisations will have a reasonable amount of serverless code as well, which I won't touch on here.
Bare-metal and virtual instances will be deployed from an image, either from a generic OS distribution or with a 'Golden Image' AMI/snapshot (take a generic, use something like Packer or Puppet to run some setup steps, pickle it into a reusable artifact). In this state, the possible vulnerability sources are:
- From the generic base image, more likely if it is out-of-date
- From any packages or modifications made to the base during initialization.
Containers are conceptually similar at this stage, except the base image isn't a single artifact but multiple layers comprising the container image that we're extending. Many applications tend to extend 'minimal' images (see alpine, ubuntu-minimal, debian-buster etc) which focus on small image size, but it is entirely possible that by the time we reach application images we have 10+ layers, each of which is a new opportunity to have packages pinned to a specific, vulnerable version.
At this stage we should be focusing on a few things:
- We do not use public / non-hardened base images.
  - They're unlikely to be set up with defaults which are applicable for our use-case.
  - It is cheap to maintain a clone of a public image, and doing so ensures we start in a clean, healthy state. The further along in the process we apply controls, the more likely they are to fail. Catch problems early.
- We should be publishing our own base images as frequently as possible, which should be pre-updated and upgraded, running the latest OS version and package upgrades.
- These images should be pre-enrolled into whatever monitoring/SIEM programs we're running, reducing workload for the end-user of them.
- We should use static scanners during this process, and prevent the publishing of images which contain fixable vulnerabilities. Here is an awesome description of OVO's approach.
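The last point, blocking publication when a fixable vulnerability is present, can be sketched as a small gate in the build pipeline. This is a minimal sketch, assuming the scanner is Trivy and that its report was written with `trivy image --format json`; the field names (`Results`, `Vulnerabilities`, `FixedVersion` and so on) follow Trivy's JSON report, while the function names and the sample report are made up for illustration. Trivy can do a similar job natively with `--ignore-unfixed --exit-code 1`; parsing the report yourself is only worth it if you want custom reporting.

```python
import json
from typing import Any


def fixable_vulnerabilities(report: dict[str, Any]) -> list[dict[str, str]]:
    """Return every finding in a Trivy-style JSON report that has a fix available."""
    fixable = []
    # "Results" holds one entry per scanned target. "Vulnerabilities" can be
    # missing or null when a target is clean, hence the `or []` fallbacks.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("FixedVersion"):
                fixable.append(
                    {
                        "target": result.get("Target", "unknown"),
                        "id": vuln.get("VulnerabilityID", "unknown"),
                        "package": vuln.get("PkgName", "unknown"),
                        "installed": vuln.get("InstalledVersion", "unknown"),
                        "fixed": vuln["FixedVersion"],
                    }
                )
    return fixable


def may_publish(report: dict[str, Any]) -> bool:
    """Publishing is allowed only when no fixable vulnerability remains."""
    return not fixable_vulnerabilities(report)


# Hand-written sample in the shape of a Trivy report. The IDs, package names
# and versions are placeholders, not real findings.
SAMPLE_REPORT = json.loads("""
{
  "Results": [
    {
      "Target": "example-base:latest",
      "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0001", "PkgName": "libexample",
         "InstalledVersion": "1.0.0", "FixedVersion": "1.0.1"},
        {"VulnerabilityID": "CVE-0000-0002", "PkgName": "libother",
         "InstalledVersion": "2.3.0", "FixedVersion": ""}
      ]
    }
  ]
}
""")

for finding in fixable_vulnerabilities(SAMPLE_REPORT):
    print(f"{finding['id']}: upgrade {finding['package']} to {finding['fixed']}")
print("publish" if may_publish(SAMPLE_REPORT) else "blocked")
```

Findings without a fix are deliberately ignored here: blocking a build on something nobody can patch yet tends to train people to bypass the gate.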
Luckily there's a multitude of tool options we have at our disposal:
- Ansible, Puppet, Chef - build-as-code providing strong repeatability and consistency.
- HashiCorp Packer, Vagrant, AWS CodeBuild - create Golden Images or deploy during CI and publish snapshots.
- cloud-init - the gold standard of consistent Unix initialization.
- Vuls - agentless scanner for Unix systems, checking package versions against NVD. People will get tired of me talking about this project but it's such a great concept.
- osquery/osquery - query your VM like it's a SQL db
- wazuh/wazuh - I haven't used this personally, I've heard good things.
- jsitech/JShielder, CISOfy/Lynis, lateralblast/lunar - automated Linux hardening/compliance checkers.
My perfect scenario IMO looks something like this: we have a hardened base image which is rebuilt on a daily/weekly basis using Packer. When that gets published to staging we use Lambda to spin up an instance with it, and perform whatever scans against it we want, either using Vuls or Lynis. If those tools pass then we can continue the build, publishing the image to production. If not, report the results and remediate issues. We should also validate that the instance connected successfully to our SIEM, and maybe we could attempt a portscan or try to drop a shell to verify it's catching low hanging fruit.
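The promote-or-report decision in that flow can be sketched independently of any cloud SDK. This is a rough outline, not a working Lambda: `launch`, `terminate` and each check are hypothetical callables you would back with your own tooling (starting the staging instance, running Vuls or Lynis, querying the SIEM for the new host, attempting a port scan).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


# A check receives the ID of the instance under test and reports pass/fail.
Check = Callable[[str], CheckResult]


def validate_image(
    image_id: str,
    launch: Callable[[str], str],
    terminate: Callable[[str], None],
    checks: list[Check],
) -> tuple[bool, list[CheckResult]]:
    """Boot a throwaway instance from the image, run every check, then clean up."""
    instance_id = launch(image_id)
    try:
        # Run all checks instead of stopping at the first failure, so the
        # report sent for remediation is complete.
        results = [check(instance_id) for check in checks]
    finally:
        # Always tear the instance down, even if a check raised.
        terminate(instance_id)
    # An empty check list must never count as a pass.
    promote = bool(results) and all(r.passed for r in results)
    return promote, results


# Stub wiring for illustration only; every name below is hypothetical.
def fake_launch(image_id: str) -> str:
    return f"instance-of-{image_id}"


def fake_terminate(instance_id: str) -> None:
    pass


def scanner_clean(instance_id: str) -> CheckResult:
    return CheckResult("vulnerability-scan", True)


def siem_enrolled(instance_id: str) -> CheckResult:
    return CheckResult("siem-enrolment", False, "no heartbeat received")


ok, results = validate_image(
    "staging-image", fake_launch, fake_terminate, [scanner_clean, siem_enrolled]
)
for r in results:
    print(f"{r.name}: {'pass' if r.passed else 'FAIL'} {r.detail}".rstrip())
print("promote to production" if ok else "report and remediate")
```

Keeping the checks as plain callables means the same gate can run a scanner, a SIEM enrolment check and a detection test without the orchestration code knowing about any of them.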
This is where things get more complex because our assets are now in the wild, becoming more outdated and unpatched by the day. The longer we are in this state the further we deviate from our nice, safe, clean starting point - so a lot of effort should go into reducing the expected lifetime of any single asset before redeploy. I would preach more for ensuring repeatable infrastructure than for perfect monitoring and patching of assets but unfortunately, that's just not a reality for a lot of contexts. Some guy will always keep a pet server that you can't upgrade because 'it might break something and it's mission-critical'.
For VMs we will be relying on some solution to continuously monitor drift from initial state, and report back so that we can keep track of it. Previously I've used Vuls for this purpose, but if you have something like the AWS SSM agent installed on the instance then it's possible to run whichever tools best fit. This can be a minefield as you'll have to either 1) enable inbound access to the machine, increasing risk or 2) upload results from the box to some shared location. I'd prefer #2, as it's less complex from a networking ACL standpoint - but there could be complications with that too.
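One simple form of drift monitoring is to diff the packages installed now against the inventory captured when the image was built, then push the result off the box (option 2). The sketch below assumes the inventory is text with one `name<TAB>version` line per package, which is roughly what `dpkg-query -W` prints on Debian-family systems; the function names and sample versions are illustrative, and the upload step is left out.

```python
import json


def parse_package_list(text: str) -> dict[str, str]:
    """Parse 'name<TAB>version' lines into a {name: version} mapping."""
    packages = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, version = line.partition("\t")
        packages[name.strip()] = version.strip()
    return packages


def package_drift(baseline: dict[str, str], current: dict[str, str]) -> dict[str, dict]:
    """Compare the current package set against the build-time baseline."""
    added = {n: v for n, v in current.items() if n not in baseline}
    removed = {n: v for n, v in baseline.items() if n not in current}
    changed = {
        n: {"from": baseline[n], "to": v}
        for n, v in current.items()
        if n in baseline and baseline[n] != v
    }
    return {"added": added, "removed": removed, "changed": changed}


def drift_report(host: str, baseline: dict[str, str], current: dict[str, str]) -> str:
    """Serialise the drift as JSON, ready to upload to a shared location."""
    return json.dumps({"host": host, **package_drift(baseline, current)}, sort_keys=True)


# Illustrative inventories; package names and versions are placeholders.
baseline = parse_package_list("openssl\t1.1.1\ncurl\t7.68\n")
current = parse_package_list("openssl\t1.1.2\nnmap\t7.80\n")
print(drift_report("pet-server", baseline, current))
```

A newly added package such as a port scanner is often a more interesting signal than a version bump, which is why additions, removals and changes are reported separately.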
Containers at runtime are slightly harder; if you've got a reasonably hardened base image and runtime environment then it is likely that inbound shell access is forbidden, and if it's not then you're unlikely to have access to the tools you need to perform runtime heuristic tests. If the containers are running within something like Istio it is easier to extract logs and metrics, so it would be a good idea to integrate these into whatever alerting engine we're using. There are numerous k8s and docker scanners which check configuration against CIS benchmarks to determine the health of a pod/container.
As well as hardening the environment, we can scan our images for vulnerabilities at any point. The results of a static container image analysis tool will differ between build time and some later point while the container is running, so we can periodically re-run the checks we ran at build time to establish whether that container has become vulnerable to any new CVEs since we launched it. If so, we probably want to rotate in a version which has been patched. After trying docker scan (backed by Snyk) and some inbuilt Docker Trusted Registry (DTR) scanners (ECR+GCR) I strongly prefer quay/clair and aquasec/trivy. They're only part of the solution though, telling you what vulnerabilities exist at the surface - which is great for measuring overall progress but not for determining where you should focus. Container images are composed of a series of layers, each executing some command or installing a set of files etc. A vulnerability can either be added or removed by a layer: essentially, we could have 0 vulnerabilities in our base and add them later in the image, or we could have 100s in the base and they could all be fixed later in the image.
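To make the layer point concrete: if you can get the set of vulnerabilities present after each layer (an assumption, for example by scanning each intermediate image in the build), a simple set difference shows which layer introduced or fixed each one. The layer names and IDs below are placeholders.

```python
def attribute_layers(snapshots: list[tuple[str, set[str]]]) -> list[dict]:
    """For each layer, list the vulnerabilities it introduced and the ones it fixed.

    `snapshots` is ordered base-first; each entry is (layer name, IDs of the
    vulnerabilities present in the image once that layer has been applied).
    """
    previous: set[str] = set()
    attribution = []
    for layer, present in snapshots:
        attribution.append(
            {
                "layer": layer,
                "introduced": sorted(present - previous),
                "fixed": sorted(previous - present),
            }
        )
        previous = present
    return attribution


# Placeholder data: the base ships two issues, an upgrade layer fixes one,
# and the application layer adds a new one.
snapshots = [
    ("base", {"VULN-1", "VULN-2"}),
    ("apt-get upgrade", {"VULN-2"}),
    ("install app deps", {"VULN-2", "VULN-3"}),
]

for entry in attribute_layers(snapshots):
    print(entry["layer"], "introduced", entry["introduced"], "fixed", entry["fixed"])
```

With this view, `VULN-2` is clearly a base image problem while `VULN-3` belongs to the application team, which is the distinction a flat scan result hides.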
When it comes to operationalising the fixing of images, there seem to be two approaches:
Go to teams, show them their results, and tell them to fix it.
- This could either be by nudging teams with alerts, or by creating dashboards and giving POs ownership of their metrics.
- By putting responsibility directly on teams for the images they run, you get close to the problem.
- It's likely there will be some duplicated effort, and it requires strong communication about the problem and education on how to fix the problems.
Determine which base images are used, and fix those.
- To do this, we need a way to link a leaf image (one that runs) to its parents. We can do that by inspecting the manifest and keeping a trie of which images have subsets of ours. That's quite expensive, and I haven't seen any open-source solution.
- Once you have that information though, you can focus efforts. Depending on the team who owns a base image you can delegate the work, and make a large amount of impact very quickly (as many images likely inherit from one or two vulnerable base images).
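The trie idea for the second approach can be sketched as follows. It is a minimal illustration, assuming each image is described by the ordered list of layer digests from its manifest and that an image built `FROM` a parent starts with exactly the parent's layers. The class and method names are invented for this example, and fetching manifests from a registry is out of scope.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict[str, "Node"] = field(default_factory=dict)
    # Images whose full layer list ends at this node.
    images: list[str] = field(default_factory=list)


class LayerTrie:
    """Index images by their ordered layer digests to recover parent links."""

    def __init__(self) -> None:
        self.root = Node()
        self.layers_by_image: dict[str, list[str]] = {}

    def add(self, image: str, layers: list[str]) -> None:
        node = self.root
        for digest in layers:
            node = node.children.setdefault(digest, Node())
        node.images.append(image)
        self.layers_by_image[image] = layers

    def parents(self, layers: list[str]) -> list[str]:
        """Images whose layers are a strict prefix of `layers`, furthest ancestor first."""
        found: list[str] = []
        node = self.root
        # Stop one layer short so an image is never reported as its own parent.
        for digest in layers[:-1]:
            node = node.children.get(digest)
            if node is None:
                break
            found.extend(node.images)
        return found

    def descendant_counts(self) -> dict[str, int]:
        """How many indexed images inherit from each base image."""
        counts: Counter = Counter()
        for layers in self.layers_by_image.values():
            counts.update(self.parents(layers))
        return dict(counts)


# Placeholder digests; real ones would be sha256 values from the manifest.
trie = LayerTrie()
trie.add("base", ["l1"])
trie.add("python-base", ["l1", "l2"])
trie.add("billing-api", ["l1", "l2", "l3"])
trie.add("batch-job", ["l1", "l4"])

print(trie.parents(trie.layers_by_image["billing-api"]))
print(trie.descendant_counts())
```

Sorting `descendant_counts()` by value gives a rough priority list: fixing the base image with the most descendants removes the same vulnerabilities from every image built on top of it at the next rebuild.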
Whether it's an engineering or a wider business effort to fix container vulnerabilities, it should be visible. When I started looking at this problem I thought that engineers would understand the risks associated with vulnerabilities in production; how attack chains work and the theories of defense-in-depth. That's probably not the situation, and education is the most time-consuming part of all of this.
Rather than submitting to extortionate subscription fees and vendor lock-in, we can achieve great security posture for VMs and containers using open-source tools and a little engineering. As a result we'll be a lot more confident in our claims and have developed a deeper understanding of our environments, allowing us to deploy tools which are genuinely extensible and well-suited for our use-cases. These are hard problems involving a combination of technical and meat work, and they require planning and careful execution on both parts.