Cybersecurity and its awareness have never been more crucial than they are today. Considering the increasing number of attacks, it has become clear that protecting digital assets plays a significant role in software development and operations. What concrete steps can be taken to enhance the security of our services even further?
Starting at a Lower Level
While antivirus software and a well-executed read-only backup strategy are essential for identifying and reducing the impact of threats, it's important to establish a strong foundation of security from the outset. Rather than solely focusing on mitigating consequences after the fact, reducing the attack surface should be a primary goal.
This can be done by limiting access to the underlying system, for example by running as an arbitrary non-root user and dropping unneeded privileges. In Kubernetes, this typically means using non-root base images in combination with securityContext definitions.
But in some cases, it's better or even required to deploy directly on virtual machines. So how can a similar strategy be applied there?
Hardening Nginx: Step by Step
Let's examine a real-world example using the Nginx service file provided by Ubuntu 20.04:
The Defaults
david@proxy:~$ systemctl cat nginx.service
# /lib/systemd/system/nginx.service
[Unit]
Description=A high performance web server and a reverse proxy server
Documentation=man:nginx(8)
After=network.target
[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed
[Install]
WantedBy=multi-user.target
By default, the service runs as the root user. Therefore, processes spawned by /usr/sbin/nginx have all privileges of the root user and group, which could allow an attacker who exploits a vulnerability in Nginx to control every part of the system. While Nginx is also able to switch its worker processes to an arbitrary user by itself, the main process that's started by the service still has root privileges. In many cases, this is not required and can be avoided by using Systemd's built-in capabilities.
Breaking it Down
The systemd-analyze CLI tool can help to get an overview of potential security issues of Systemd services:
systemd-analyze security # provides a high-level overview including a
# numeric "exposure" value of Systemd services
systemd-analyze security <service_name> # shows detailed security-related information
# about a single service
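On a host with many services, it can help to sort the high-level overview by its exposure value to see where hardening pays off first. The following is a minimal sketch, not part of the original setup: rank_exposure is a hypothetical helper name, and it assumes the overview prints one unit per line with the unit name in the first column and the numeric exposure value in the second.

```shell
#!/bin/sh
# rank_exposure (hypothetical helper): reads the overview printed by
# `systemd-analyze security` on stdin and prints "<score> <unit>" lines,
# highest exposure first.
# Assumption: column 1 is the unit name, column 2 the numeric exposure value.
rank_exposure() {
    # Keep only lines whose second field is a number (this skips the header),
    # then sort numerically in descending order.
    awk '$2 ~ /^[0-9]+(\.[0-9]+)?$/ { print $2, $1 }' |
        LC_ALL=C sort -rn
}

# Usage on a Systemd host:
#   systemd-analyze security --no-pager | rank_exposure | head -n 5
```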
The output for the nginx service looks like this:
david@proxy:~$ systemd-analyze security nginx.service --no-pager
Nginx Service Security Summary
NAME DESCRIPTION EXPOSURE
✗ PrivateNetwork= Service has access to the host's network 0.5
✗ User=/DynamicUser= Service runs as root user 0.4
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP) Service may change UID/GID identities/capabilities 0.3
✗ CapabilityBoundingSet=~CAP_SYS_ADMIN Service has administrator privileges 0.3
✗ CapabilityBoundingSet=~CAP_SYS_PTRACE Service has ptrace() debugging abilities 0.3
✗ RestrictAddressFamilies=~AF_(INET|INET6) Service may allocate Internet sockets 0.3
✗ RestrictNamespaces=~CLONE_NEWUSER Service may create user namespaces 0.3
✗ RestrictAddressFamilies=~… Service may allocate exotic sockets 0.3
✗ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP) Service may change file ownership/access mode/capabilities unres… 0.2
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER) Service may override UNIX file/IPC permission checks 0.2
✗ CapabilityBoundingSet=~CAP_NET_ADMIN Service has network configuration privileges 0.2
✗ CapabilityBoundingSet=~CAP_RAWIO Service has raw I/O access 0.2
✗ CapabilityBoundingSet=~CAP_SYS_MODULE Service may load kernel modules 0.2
✗ CapabilityBoundingSet=~CAP_SYS_TIME Service processes may change the system clock 0.2
✗ DeviceAllow= Service has no device ACL 0.2
✗ IPAddressDeny= Service does not define an IP address whitelist 0.2
✓ KeyringMode= Service doesn't share key material with other services
✗ NoNewPrivileges= Service processes may acquire new privileges 0.2
✓ NotifyAccess= Service child processes cannot alter service state
✗ PrivateDevices= Service potentially has access to hardware devices 0.2
✗ PrivateMounts= Service may install system mounts 0.2
✗ PrivateTmp= Service has access to other software's temporary files 0.2
✗ PrivateUsers= Service has access to other users 0.2
✗ ProtectClock= Service may write to the hardware clock or system clock 0.2
✗ ProtectControlGroups= Service may modify the control group file system 0.2
✗ ProtectHome= Service has full access to home directories 0.2
✗ ProtectKernelLogs= Service may read from or write to the kernel log ring buffer 0.2
✗ ProtectKernelModules= Service may load or read kernel modules 0.2
✗ ProtectKernelTunables= Service may alter kernel tunables 0.2
✗ ProtectSystem= Service has full access to the OS file hierarchy 0.2
✗ RestrictAddressFamilies=~AF_PACKET Service may allocate packet sockets 0.2
✗ RestrictSUIDSGID= Service may create SUID/SGID files 0.2
✗ SystemCallArchitectures= Service may execute system calls with all ABIs 0.2
✗ SystemCallFilter=~@clock Service does not filter system calls 0.2
✗ SystemCallFilter=~@debug Service does not filter system calls 0.2
✗ SystemCallFilter=~@module Service does not filter system calls 0.2
✗ SystemCallFilter=~@mount Service does not filter system calls 0.2
✗ SystemCallFilter=~@raw-io Service does not filter system calls 0.2
✗ SystemCallFilter=~@reboot Service does not filter system calls 0.2
✗ SystemCallFilter=~@swap Service does not filter system calls 0.2
✗ SystemCallFilter=~@privileged Service does not filter system calls 0.2
✗ SystemCallFilter=~@resources Service does not filter system calls 0.2
✓ AmbientCapabilities= Service process does not receive ambient capabilities
✗ CapabilityBoundingSet=~CAP_AUDIT_* Service has audit subsystem access 0.1
✗ CapabilityBoundingSet=~CAP_KILL Service may send UNIX signals to arbitrary processes 0.1
✗ CapabilityBoundingSet=~CAP_MKNOD Service may create device nodes 0.1
✗ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges 0.1
✗ CapabilityBoundingSet=~CAP_SYSLOG Service has access to kernel logging 0.1
✗ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE) Service has privileges to change resource use parameters 0.1
✗ RestrictNamespaces=~CLONE_NEWCGROUP Service may create cgroup namespaces 0.1
✗ RestrictNamespaces=~CLONE_NEWIPC Service may create IPC namespaces 0.1
✗ RestrictNamespaces=~CLONE_NEWNET Service may create network namespaces 0.1
✗ RestrictNamespaces=~CLONE_NEWNS Service may create file system namespaces 0.1
✗ RestrictNamespaces=~CLONE_NEWPID Service may create process namespaces 0.1
✗ RestrictRealtime= Service may acquire realtime scheduling 0.1
✗ SystemCallFilter=~@cpu-emulation Service does not filter system calls 0.1
✗ SystemCallFilter=~@obsolete Service does not filter system calls 0.1
✗ RestrictAddressFamilies=~AF_NETLINK Service may allocate netlink sockets 0.1
✗ RootDirectory=/RootImage= Service runs within the host's root directory 0.1
SupplementaryGroups= Service runs as root, option does not matter
✗ CapabilityBoundingSet=~CAP_MAC_* Service may adjust SMACK MAC 0.1
✗ CapabilityBoundingSet=~CAP_SYS_BOOT Service may issue reboot() 0.1
✓ Delegate= Service does not maintain its own delegated control group subtree
✗ LockPersonality= Service may change ABI personality 0.1
✗ MemoryDenyWriteExecute= Service may create writable executable memory mappings 0.1
RemoveIPC= Service runs as root, option does not apply
✗ RestrictNamespaces=~CLONE_NEWUTS Service may create hostname namespaces 0.1
✗ UMask= Files created by service are world-readable by default 0.1
✗ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE Service may mark files immutable 0.1
✗ CapabilityBoundingSet=~CAP_IPC_LOCK Service may lock memory into RAM 0.1
✗ CapabilityBoundingSet=~CAP_SYS_CHROOT Service may issue chroot() 0.1
✗ ProtectHostname= Service may change system host/domainname 0.1
✗ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND Service may establish wake locks 0.1
✗ CapabilityBoundingSet=~CAP_LEASE Service may create file leases 0.1
✗ CapabilityBoundingSet=~CAP_SYS_PACCT Service may use acct() 0.1
✗ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG Service may issue vhangup() 0.1
✗ CapabilityBoundingSet=~CAP_WAKE_ALARM Service may program timers that wake up the system 0.1
✗ RestrictAddressFamilies=~AF_UNIX Service may allocate local sockets 0.1
→ Overall exposure level for nginx.service: 9.6 UNSAFE 😨
Many of those capabilities are not required to run a web server, so it's best to limit the service's privileges. Because interfacing with the Linux kernel directly can be very complex and is subject to change, Systemd offers a way to define common restrictions directly in the service files. Given the multitude of configuration parameters for Systemd services, this example concentrates on the values that significantly affect security, using a standard Kubernetes securityContext as a foundation.
The Principle of Least Privilege
Adopting the principle of least privilege is crucial. Restricting access and privileges to the bare essentials shrinks the attack surface significantly. When using Kubernetes resources, you'd usually use a securityContext definition to limit the capabilities of a Pod:
...
securityContext:
  runAsNonRoot: true
  runAsUser: 1001
  runAsGroup: 2001
  allowPrivilegeEscalation: false
  privileged: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - all
...
In the above example, the process runs without root privileges on a read-only filesystem and all capabilities are dropped. A similar setup can be achieved using a Systemd service:
- runAsNonRoot: true → no equivalent; if possible, DynamicUser can be used
- runAsUser: 1001 → User=<username>
- runAsGroup: 2001 → Group=<groupname>
- allowPrivilegeEscalation: false → NoNewPrivileges=true
- privileged: false → no equivalent; PrivateDevices=<...>, Protect<...>=<...> etc. can be used
- readOnlyRootFilesystem: true → ProtectSystem=strict / TemporaryFileSystem=/:ro (this also hides all files, needs Systemd >= 238)
- capabilities.drop: ["all"] → CapabilityBoundingSet=<...>
There are many more ways to control the capabilities and permissions of Systemd services, which are documented here. After applying some of these parameters to the Nginx service, the unit file looks as follows:
david@proxy:~$ systemctl cat nginx
# /etc/systemd/system/nginx.service
# Rootless Nginx service based on https://github.com/stephan13360/systemd-services/blob/master/nginx/nginx.service
[Unit]
# This is from the default nginx.service
Description=nginx (hardened rootless)
Documentation=https://nginx.org/en/docs/
Documentation=https://github.com/stephan13360/systemd-services/blob/master/nginx/README.md
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target
[Service]
# forking is not necessary as `daemon` is turned off in the nginx config
Type=exec
User=nginx
Group=nginx
## can be used e.g. for accessing directory containing SSL certs
#SupplementaryGroups=acme
# define runtime directory /run/nginx as rootless services can't access /run
RuntimeDirectory=nginx
# write logs to /var/log/nginx
LogsDirectory=nginx
# write cache to /var/cache/nginx
CacheDirectory=nginx
# configuration is in /etc/nginx
ConfigurationDirectory=nginx
ExecStart=/usr/sbin/nginx -c /etc/nginx/nginx.conf
# PID is not necessary here as the service is not forking
ExecReload=/usr/sbin/nginx -s reload
Restart=on-failure
RestartSec=10s
# Hardening
# hide the entire filesystem tree from the service and also make it read only, requires systemd >=238
TemporaryFileSystem=/:ro
# Remount (bind) necessary paths, based on https://gitlab.com/apparmor/apparmor/blob/master/profiles/apparmor.d/abstractions/base,
# https://github.com/jelly/apparmor-profiles/blob/master/usr.bin.nginx,
# https://www.freedesktop.org/software/systemd/man/systemd.exec.html#RootDirectory=
#
# This gives access to (probably) necessary system files, allows journald logging
BindReadOnlyPaths=/lib/ /lib64/ /usr/lib/ /usr/lib64/ /etc/ld.so.cache /etc/ld.so.conf /etc/ld.so.conf.d/ /etc/bindresvport.blacklist /usr/share/zoneinfo/ /usr/share/locale/ /etc/localtime /usr/share/common-licenses/ /etc/ssl/certs/ /etc/resolv.conf
BindReadOnlyPaths=/dev/log /run/systemd/journal/socket /run/systemd/journal/stdout /run/systemd/notify
# Additional access to service-specific directories
BindReadOnlyPaths=/usr/sbin/nginx
BindReadOnlyPaths=/run/ /usr/share/nginx/
PrivateTmp=true
PrivateDevices=true
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=true
# Network access
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
# Miscellaneous
SystemCallArchitectures=native
# also implicit because settings like MemoryDenyWriteExecute are set
NoNewPrivileges=true
MemoryDenyWriteExecute=true
ProtectKernelLogs=true
LockPersonality=true
ProtectHostname=true
RemoveIPC=true
RestrictSUIDSGID=true
ProtectClock=true
# Capabilities to bind low ports (80, 443)
AmbientCapabilities=CAP_NET_BIND_SERVICE
[Install]
WantedBy=multi-user.target
Now, not only is the service running as non-root, but the process and its sub-processes also only have access to a very limited part of the system. All filesystem access is dropped by default, and only necessary system directories are either made available or substituted by temporary paths. Besides that, persistence is only possible where necessary, which further limits the attack surface. Running systemd-analyze again on the new service shows that the changes take effect:
david@proxy:~$ systemd-analyze security nginx.service --no-pager
Nginx Service Security Summary
NAME DESCRIPTION EXPOSURE
✗ PrivateNetwork= Service has access to the host's network 0.5
✓ User=/DynamicUser= Service runs under a static non-root user identity
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP) Service may change UID/GID identities/capabilities 0.3
✗ CapabilityBoundingSet=~CAP_SYS_ADMIN Service has administrator privileges 0.3
✗ CapabilityBoundingSet=~CAP_SYS_PTRACE Service has ptrace() debugging abilities 0.3
✗ RestrictAddressFamilies=~AF_(INET|INET6) Service may allocate Internet sockets 0.3
✗ RestrictNamespaces=~CLONE_NEWUSER Service may create user namespaces 0.3
✓ RestrictAddressFamilies=~… Service cannot allocate exotic sockets
✗ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP) Service may change file ownership/access mode/capabilities unres… 0.2
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER) Service may override UNIX file/IPC permission checks 0.2
✗ CapabilityBoundingSet=~CAP_NET_ADMIN Service has network configuration privileges 0.2
✓ CapabilityBoundingSet=~CAP_RAWIO Service has no raw I/O access
✓ CapabilityBoundingSet=~CAP_SYS_MODULE Service cannot load kernel modules
✓ CapabilityBoundingSet=~CAP_SYS_TIME Service processes cannot change the system clock
✗ DeviceAllow= Service has a device ACL with some special devices 0.1
✗ IPAddressDeny= Service does not define an IP address whitelist 0.2
✓ KeyringMode= Service doesn't share key material with other services
✓ NoNewPrivileges= Service processes cannot acquire new privileges
✓ NotifyAccess= Service child processes cannot alter service state
✓ PrivateDevices= Service has no access to hardware devices
✓ PrivateMounts= Service cannot install system mounts
✓ PrivateTmp= Service has no access to other software's temporary files
✗ PrivateUsers= Service has access to other users 0.2
✗ ProtectClock= Service may write to the hardware clock or system clock 0.2
✓ ProtectControlGroups= Service cannot modify the control group file system
✗ ProtectHome= Service has full access to home directories 0.2
✓ ProtectKernelLogs= Service cannot read from or write to the kernel log ring buffer
✓ ProtectKernelModules= Service cannot load or read kernel modules
✓ ProtectKernelTunables= Service cannot alter kernel tunables (/proc/sys, …)
✗ ProtectSystem= Service has full access to the OS file hierarchy 0.2
✓ RestrictAddressFamilies=~AF_PACKET Service cannot allocate packet sockets
✓ RestrictSUIDSGID= SUID/SGID file creation by service is restricted
✓ SystemCallArchitectures= Service may execute system calls only with native ABI
✗ SystemCallFilter=~@clock Service does not filter system calls 0.2
✗ SystemCallFilter=~@debug Service does not filter system calls 0.2
✗ SystemCallFilter=~@module Service does not filter system calls 0.2
✗ SystemCallFilter=~@mount Service does not filter system calls 0.2
✗ SystemCallFilter=~@raw-io Service does not filter system calls 0.2
✗ SystemCallFilter=~@reboot Service does not filter system calls 0.2
✗ SystemCallFilter=~@swap Service does not filter system calls 0.2
✗ SystemCallFilter=~@privileged Service does not filter system calls 0.2
✗ SystemCallFilter=~@resources Service does not filter system calls 0.2
✗ AmbientCapabilities= Service process receives ambient capabilities 0.1
✗ CapabilityBoundingSet=~CAP_AUDIT_* Service has audit subsystem access 0.1
✗ CapabilityBoundingSet=~CAP_KILL Service may send UNIX signals to arbitrary processes 0.1
✓ CapabilityBoundingSet=~CAP_MKNOD Service cannot create device nodes
✗ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges 0.1
✓ CapabilityBoundingSet=~CAP_SYSLOG Service has no access to kernel logging
✗ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE) Service has privileges to change resource use parameters 0.1
✗ RestrictNamespaces=~CLONE_NEWCGROUP Service may create cgroup namespaces 0.1
✗ RestrictNamespaces=~CLONE_NEWIPC Service may create IPC namespaces 0.1
✗ RestrictNamespaces=~CLONE_NEWNET Service may create network namespaces 0.1
✗ RestrictNamespaces=~CLONE_NEWNS Service may create file system namespaces 0.1
✗ RestrictNamespaces=~CLONE_NEWPID Service may create process namespaces 0.1
✗ RestrictRealtime= Service may acquire realtime scheduling 0.1
✗ SystemCallFilter=~@cpu-emulation Service does not filter system calls 0.1
✗ SystemCallFilter=~@obsolete Service does not filter system calls 0.1
✓ RestrictAddressFamilies=~AF_NETLINK Service cannot allocate netlink sockets
✗ RootDirectory=/RootImage= Service runs within the host's root directory 0.1
✓ SupplementaryGroups= Service has no supplementary groups
✗ CapabilityBoundingSet=~CAP_MAC_* Service may adjust SMACK MAC 0.1
✗ CapabilityBoundingSet=~CAP_SYS_BOOT Service may issue reboot() 0.1
✓ Delegate= Service does not maintain its own delegated control group subtree
✓ LockPersonality= Service cannot change ABI personality
✓ MemoryDenyWriteExecute= Service cannot create writable executable memory mappings
✓ RemoveIPC= Service user cannot leave SysV IPC objects around
✗ RestrictNamespaces=~CLONE_NEWUTS Service may create hostname namespaces 0.1
✗ UMask= Files created by service are world-readable by default 0.1
✗ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE Service may mark files immutable 0.1
✗ CapabilityBoundingSet=~CAP_IPC_LOCK Service may lock memory into RAM 0.1
✗ CapabilityBoundingSet=~CAP_SYS_CHROOT Service may issue chroot() 0.1
✓ ProtectHostname= Service cannot change system host/domainname
✗ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND Service may establish wake locks 0.1
✗ CapabilityBoundingSet=~CAP_LEASE Service may create file leases 0.1
✗ CapabilityBoundingSet=~CAP_SYS_PACCT Service may use acct() 0.1
✗ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG Service may issue vhangup() 0.1
✓ CapabilityBoundingSet=~CAP_WAKE_ALARM Service cannot program timers that wake up the system
✗ RestrictAddressFamilies=~AF_UNIX Service may allocate local sockets 0.1
→ Overall exposure level for nginx.service: 6.1 MEDIUM 😐
The score shows there's still room for improvement, but in the end, a lot of potential attack vectors have been mitigated in comparison to the officially provided Unit file.
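Several of the remaining findings (capability bounding set, namespaces, realtime scheduling, umask, system call filtering) can be addressed with a drop-in. The helper below is only a sketch: the function name, the drop-in file name 10-hardening.conf and the chosen values are assumptions rather than part of the setup above. Each directive should be tested against the actual Nginx configuration, since an overly strict capability set or system call filter can keep the service from starting.

```shell
#!/bin/sh
# write_nginx_hardening_dropin (hypothetical helper): writes a candidate
# hardening drop-in into the given directory. Without an argument it targets
# the drop-in directory of nginx.service, which requires root.
write_nginx_hardening_dropin() {
    dropin_dir="${1:-/etc/systemd/system/nginx.service.d}"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-hardening.conf" <<'EOF'
[Service]
# Keep only the capability needed to bind ports 80 and 443
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Deny the creation of any new namespaces
RestrictNamespaces=true
# Deny realtime scheduling
RestrictRealtime=true
# Files created by the service are not world-readable
UMask=0027
# Allow only the system call set typical for system services (Systemd >= 239)
SystemCallFilter=@system-service
# Return EPERM instead of killing the process on a filtered call
SystemCallErrorNumber=EPERM
EOF
}
```

Called as root without an argument, the function writes to /etc/systemd/system/nginx.service.d. Running systemctl daemon-reload and systemctl restart nginx afterwards applies the change, and another run of systemd-analyze security nginx.service shows whether the score moved.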
Where to Continue
In summary, Systemd offers a straightforward method for constraining a process's capabilities, primarily leveraging Linux namespaces. This approach can significantly enhance security, but it does have its constraints. That is where Mandatory Access Control steps in, with tools such as AppArmor and SELinux providing fine-grained control over system access. These tools enable a more nuanced approach to restricting system access, albeit with a more intricate configuration process. It's worth noting that numerous Linux distributions provide predefined profiles for a wide range of services, simplifying the implementation of these controls.
Ultimately, achieving a balance between security and practical implementation boils down to leveraging Systemd's capabilities alongside predefined Mandatory Access Control profiles. This approach strikes an effective compromise, ensuring both enhanced security and efficient deployment timelines.