David Gries


Building Secure Foundations: A Practical Guide to Minimizing Linux Services' Attack Surface

Cybersecurity awareness has never been more crucial than it is today. Given the growing number of attacks, protecting digital assets has become a significant part of software development and operations. What concrete steps can be taken to further enhance the security of our services?

Starting at a Lower Level

While antivirus software and a well-executed read-only backup strategy are essential for identifying threats and reducing their impact, it's important to establish a strong security foundation from the outset. Rather than focusing solely on mitigating consequences after the fact, reducing the attack surface should be a primary goal.

This can be done by limiting access to the underlying system, for example by running as an unprivileged user and dropping unneeded privileges. In Kubernetes, this typically means using non-root base images in combination with securityContext definitions.

But in some cases, it's better or even required to deploy directly on virtual machines. So how can a similar strategy be applied there?

🔒 Hardening Nginx: Step by Step

Let's examine a real-world example using the Nginx service file provided by Ubuntu 20.04:

The Defaults

david@proxy:~$ systemctl cat nginx.service

# /lib/systemd/system/nginx.service
[Unit]
Description=A high performance web server and a reverse proxy server
Documentation=man:nginx(8)
After=network.target

[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed

[Install]
WantedBy=multi-user.target

By default, the service runs as the root user. Processes spawned by /usr/sbin/nginx therefore have all privileges of the root user and group, so an exploit for Nginx could allow malicious software to take control of every part of the system. Nginx can run its worker processes as an unprivileged user by itself, but the main process started by the service still has root privileges. In many cases this is not required and can be avoided with Systemd's built-in capabilities.
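A quick way to confirm this for any unit is to look for a User= line in its [Service] section. The helper below is a minimal sketch of that check: unit_user is a hypothetical name, it only reads unit-file text from standard input, and it deliberately ignores DynamicUser= and other identity-related settings.

```shell
#!/bin/sh
# unit_user: hypothetical helper, not part of systemd.
# Reads unit-file text on stdin and prints the value of the last User=
# line found in a [Service] section, or "root" when none is set (system
# services run as root unless told otherwise).
unit_user() {
    awk -F= '
        /^\[/ { in_service = ($0 == "[Service]") }
        in_service && $1 == "User" { user = $2 }
        END { print (user == "" ? "root" : user) }
    '
}

# Usage on a systemd host (left commented so the sketch has no side effects):
#   systemctl cat nginx.service | unit_user
```

Because systemctl cat prints the main unit file followed by its drop-ins, taking the last User= line roughly mirrors how later assignments override earlier ones. Running systemctl show -p User nginx.service queries the same property from the service manager directly.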

Breaking it Down

The systemd-analyze CLI tool helps to get an overview of potential issues in Systemd services:

systemd-analyze security                # provides a high-level overview including a
                                        # numeric "exposure" value of Systemd services

systemd-analyze security <service_name> # shows detailed security-related information
                                        # about a single service

The output for the nginx service looks like this:

david@proxy:~$ systemd-analyze security nginx.service --no-pager

Nginx Service Security Summary
  NAME                                                        DESCRIPTION                                                       EXPOSURE
✗ PrivateNetwork=                                             Service has access to the host's network                               0.5
✗ User=/DynamicUser=                                          Service runs as root user                                              0.4
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP)                Service may change UID/GID identities/capabilities                     0.3
✗ CapabilityBoundingSet=~CAP_SYS_ADMIN                        Service has administrator privileges                                   0.3
✗ CapabilityBoundingSet=~CAP_SYS_PTRACE                       Service has ptrace() debugging abilities                               0.3
✗ RestrictAddressFamilies=~AF_(INET|INET6)                    Service may allocate Internet sockets                                  0.3
✗ RestrictNamespaces=~CLONE_NEWUSER                           Service may create user namespaces                                     0.3
✗ RestrictAddressFamilies=~…                                  Service may allocate exotic sockets                                    0.3
✗ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP)           Service may change file ownership/access mode/capabilities unres…      0.2
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER)         Service may override UNIX file/IPC permission checks                   0.2
✗ CapabilityBoundingSet=~CAP_NET_ADMIN                        Service has network configuration privileges                           0.2
✗ CapabilityBoundingSet=~CAP_RAWIO                            Service has raw I/O access                                             0.2
✗ CapabilityBoundingSet=~CAP_SYS_MODULE                       Service may load kernel modules                                        0.2
✗ CapabilityBoundingSet=~CAP_SYS_TIME                         Service processes may change the system clock                          0.2
✗ DeviceAllow=                                                Service has no device ACL                                              0.2
✗ IPAddressDeny=                                              Service does not define an IP address whitelist                        0.2
✓ KeyringMode=                                                Service doesn't share key material with other services
✗ NoNewPrivileges=                                            Service processes may acquire new privileges                           0.2
✓ NotifyAccess=                                               Service child processes cannot alter service state
✗ PrivateDevices=                                             Service potentially has access to hardware devices                     0.2
✗ PrivateMounts=                                              Service may install system mounts                                      0.2
✗ PrivateTmp=                                                 Service has access to other software's temporary files                 0.2
✗ PrivateUsers=                                               Service has access to other users                                      0.2
✗ ProtectClock=                                               Service may write to the hardware clock or system clock                0.2
✗ ProtectControlGroups=                                       Service may modify the control group file system                       0.2
✗ ProtectHome=                                                Service has full access to home directories                            0.2
✗ ProtectKernelLogs=                                          Service may read from or write to the kernel log ring buffer           0.2
✗ ProtectKernelModules=                                       Service may load or read kernel modules                                0.2
✗ ProtectKernelTunables=                                      Service may alter kernel tunables                                      0.2
✗ ProtectSystem=                                              Service has full access to the OS file hierarchy                       0.2
✗ RestrictAddressFamilies=~AF_PACKET                          Service may allocate packet sockets                                    0.2
✗ RestrictSUIDSGID=                                           Service may create SUID/SGID files                                     0.2
✗ SystemCallArchitectures=                                    Service may execute system calls with all ABIs                         0.2
✗ SystemCallFilter=~@clock                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@debug                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@module                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@mount                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@raw-io                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@reboot                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@swap                                     Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@privileged                               Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@resources                                Service does not filter system calls                                   0.2
✓ AmbientCapabilities=                                        Service process does not receive ambient capabilities
✗ CapabilityBoundingSet=~CAP_AUDIT_*                          Service has audit subsystem access                                     0.1
✗ CapabilityBoundingSet=~CAP_KILL                             Service may send UNIX signals to arbitrary processes                   0.1
✗ CapabilityBoundingSet=~CAP_MKNOD                            Service may create device nodes                                        0.1
✗ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges                             0.1
✗ CapabilityBoundingSet=~CAP_SYSLOG                           Service has access to kernel logging                                   0.1
✗ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE)              Service has privileges to change resource use parameters               0.1
✗ RestrictNamespaces=~CLONE_NEWCGROUP                         Service may create cgroup namespaces                                   0.1
✗ RestrictNamespaces=~CLONE_NEWIPC                            Service may create IPC namespaces                                      0.1
✗ RestrictNamespaces=~CLONE_NEWNET                            Service may create network namespaces                                  0.1
✗ RestrictNamespaces=~CLONE_NEWNS                             Service may create file system namespaces                              0.1
✗ RestrictNamespaces=~CLONE_NEWPID                            Service may create process namespaces                                  0.1
✗ RestrictRealtime=                                           Service may acquire realtime scheduling                                0.1
✗ SystemCallFilter=~@cpu-emulation                            Service does not filter system calls                                   0.1
✗ SystemCallFilter=~@obsolete                                 Service does not filter system calls                                   0.1
✗ RestrictAddressFamilies=~AF_NETLINK                         Service may allocate netlink sockets                                   0.1
✗ RootDirectory=/RootImage=                                   Service runs within the host's root directory                          0.1
  SupplementaryGroups=                                        Service runs as root, option does not matter
✗ CapabilityBoundingSet=~CAP_MAC_*                            Service may adjust SMACK MAC                                           0.1
✗ CapabilityBoundingSet=~CAP_SYS_BOOT                         Service may issue reboot()                                             0.1
✓ Delegate=                                                   Service does not maintain its own delegated control group subtree
✗ LockPersonality=                                            Service may change ABI personality                                     0.1
✗ MemoryDenyWriteExecute=                                     Service may create writable executable memory mappings                 0.1
  RemoveIPC=                                                  Service runs as root, option does not apply
✗ RestrictNamespaces=~CLONE_NEWUTS                            Service may create hostname namespaces                                 0.1
✗ UMask=                                                      Files created by service are world-readable by default                 0.1
✗ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE                  Service may mark files immutable                                       0.1
✗ CapabilityBoundingSet=~CAP_IPC_LOCK                         Service may lock memory into RAM                                       0.1
✗ CapabilityBoundingSet=~CAP_SYS_CHROOT                       Service may issue chroot()                                             0.1
✗ ProtectHostname=                                            Service may change system host/domainname                              0.1
✗ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND                    Service may establish wake locks                                       0.1
✗ CapabilityBoundingSet=~CAP_LEASE                            Service may create file leases                                         0.1
✗ CapabilityBoundingSet=~CAP_SYS_PACCT                        Service may use acct()                                                 0.1
✗ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG                   Service may issue vhangup()                                            0.1
✗ CapabilityBoundingSet=~CAP_WAKE_ALARM                       Service may program timers that wake up the system                     0.1
✗ RestrictAddressFamilies=~AF_UNIX                            Service may allocate local sockets                                     0.1

→ Overall exposure level for nginx.service: 9.6 UNSAFE 😨
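With this many rows, it can help to filter the report down to the highest-rated findings first. The following is a small sketch of such a filter, assuming the report layout shown above (a ✗ marker in the first column and the exposure value in the last); failed_checks is a hypothetical helper name, not a systemd command.

```shell
#!/bin/sh
# failed_checks: hypothetical helper, not part of systemd.
# Reads `systemd-analyze security <unit>` output on stdin and prints
# "<exposure> <setting>" for every failing check (lines starting with ✗)
# whose exposure is at least the given threshold, highest first.
failed_checks() {
    awk -v min="$1" '
        $1 == "✗" && $NF + 0 >= min + 0 { print $NF, $2 }
    ' | sort -rn
}

# Usage on a systemd host:
#   systemd-analyze security nginx.service --no-pager | failed_checks 0.3
```

The threshold is passed as the first argument, so the same helper can be rerun with a lower value once the most exposed settings have been dealt with.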

A lot of those capabilities are not required to run a web server, so it's best to limit the service's privileges. Because interfacing with the Linux kernel directly is complex and subject to change, Systemd offers a way to define common restrictions directly in the service files. Given the multitude of configuration parameters for Systemd services, this example concentrates on the values that significantly affect security. It uses a standard Kubernetes securityContext as a reference point.

The Principle of Least Privilege

Adopting the principle of least privilege is crucial. By restricting access and privileges to the bare essentials, the attack surface diminishes significantly. When using Kubernetes resources, you'd usually use a securityContext definition to limit capabilities of a Pod:

...
    securityContext:
      runAsNonRoot: true
      runAsUser: 1001
      runAsGroup: 2001
      allowPrivilegeEscalation: false
      privileged: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - all
...

In the above example, the process runs without root privileges on a read-only filesystem and all capabilities are dropped. A similar setup can be achieved using a Systemd service:

  • runAsNonRoot: true ➜ no direct equivalent; DynamicUser can be used where possible
  • runAsUser: 1001 ➜ User=<username>
  • runAsGroup: 2001 ➜ Group=<groupname>
  • allowPrivilegeEscalation: false ➜ NoNewPrivileges=true
  • privileged: false ➜ no direct equivalent; PrivateDevices=<...>, Protect<...>=<...> etc. can be used
  • readOnlyRootFilesystem: true ➜ ProtectSystem=strict or TemporaryFileSystem=/:ro (the latter also hides all files and needs Systemd >= 238)
  • capabilities.drop: ["all"] ➜ CapabilityBoundingSet=<...>
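The mapping above could be written down as a drop-in file roughly like this. It is a sketch, not a tested Nginx configuration: the file name 10-least-privilege.conf and the DROPIN_DIR variable are made up for illustration, the nginx user and group are assumed to exist, and the file is written to a scratch directory rather than to /etc/systemd/system/nginx.service.d/.

```shell
#!/bin/sh
# Writes a hypothetical drop-in to a scratch directory. On a real host the
# file would live in /etc/systemd/system/nginx.service.d/ (root required).
dropin_dir="${DROPIN_DIR:-$(mktemp -d)}"

cat > "$dropin_dir/10-least-privilege.conf" <<'EOF'
[Service]
# runAsUser / runAsGroup (the nginx user and group must already exist)
User=nginx
Group=nginx
# allowPrivilegeEscalation: false
NoNewPrivileges=true
# readOnlyRootFilesystem: true
ProtectSystem=strict
# capabilities.drop: all, keeping only what binding ports 80/443 needs
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
# privileged: false (approximation)
PrivateDevices=true
ProtectKernelModules=true
ProtectKernelTunables=true
EOF

echo "wrote $dropin_dir/10-least-privilege.conf"
```

After copying such a file into the real drop-in directory, systemctl daemon-reload followed by a service restart applies it. Since ProtectSystem=strict mounts almost the entire file hierarchy read-only, a real service would also need writable locations, for example through LogsDirectory= or ReadWritePaths=.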

There are many more ways to control the capabilities and permissions of Systemd services; they are documented in the systemd.exec man page. After applying some of these parameters to the Nginx service, the unit file looks as follows:

david@proxy:~$ systemctl cat nginx

# /etc/systemd/system/nginx.service
# Rootless Nginx service based on https://github.com/stephan13360/systemd-services/blob/master/nginx/nginx.service
[Unit]
# This is from the default nginx.service
Description=nginx (hardened rootless)
Documentation=https://nginx.org/en/docs/
Documentation=https://github.com/stephan13360/systemd-services/blob/master/nginx/README.md
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target

[Service]
# forking is not necessary as `daemon` is turned off in the nginx config
Type=exec
User=nginx
Group=nginx
## can be used e.g. for accessing directory containing SSL certs
#SupplementaryGroups=acme
# define runtime directory /run/nginx as rootless services can't access /run
RuntimeDirectory=nginx
# write logs to /var/log/nginx
LogsDirectory=nginx
# write cache to /var/cache/nginx
CacheDirectory=nginx
# configuration is in /etc/nginx
ConfigurationDirectory=nginx

ExecStart=/usr/sbin/nginx -c /etc/nginx/nginx.conf
# PID is not necessary here as the service is not forking
ExecReload=/usr/sbin/nginx -s reload

Restart=on-failure
RestartSec=10s

# Hardening
# hide the entire filesystem tree from the service and also make it read only, requires systemd >=238
TemporaryFileSystem=/:ro
# Remount (bind) necessary paths, based on https://gitlab.com/apparmor/apparmor/blob/master/profiles/apparmor.d/abstractions/base,
# https://github.com/jelly/apparmor-profiles/blob/master/usr.bin.nginx,
# https://www.freedesktop.org/software/systemd/man/systemd.exec.html#RootDirectory=
#
# This gives access to (probably) necessary system files, allows journald logging
BindReadOnlyPaths=/lib/ /lib64/ /usr/lib/ /usr/lib64/ /etc/ld.so.cache /etc/ld.so.conf /etc/ld.so.conf.d/ /etc/bindresvport.blacklist /usr/share/zoneinfo/ /usr/share/locale/ /etc/localtime /usr/share/common-licenses/ /etc/ssl/certs/ /etc/resolv.conf
BindReadOnlyPaths=/dev/log /run/systemd/journal/socket /run/systemd/journal/stdout /run/systemd/notify
# Additional access to service-specific directories
BindReadOnlyPaths=/usr/sbin/nginx
BindReadOnlyPaths=/run/ /usr/share/nginx/

PrivateTmp=true
PrivateDevices=true
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=true

# Network access
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6

# Miscellaneous
SystemCallArchitectures=native
# also implicit because settings like MemoryDenyWriteExecute are set
NoNewPrivileges=true
MemoryDenyWriteExecute=true
ProtectKernelLogs=true
LockPersonality=true
ProtectHostname=true
RemoveIPC=true
RestrictSUIDSGID=true
ProtectClock=true

# Capabilities to bind low ports (80, 443)
AmbientCapabilities=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target
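The unit file relies on a matching Nginx configuration: the process has to stay in the foreground, and everything it writes has to end up in one of the directories systemd provides. The fragment below sketches what that might look like. The concrete file paths are assumptions derived from the RuntimeDirectory= and LogsDirectory= settings above, CONF_FILE is a made-up variable, and the result is not a complete nginx.conf.

```shell
#!/bin/sh
# Writes a hypothetical nginx.conf fragment to a scratch file for review.
# It is not a complete configuration (no events/http blocks).
conf_file="${CONF_FILE:-$(mktemp)}"

cat > "$conf_file" <<'EOF'
# Stay in the foreground so Type=exec can track the process directly.
daemon off;
# Must live below RuntimeDirectory=nginx, i.e. /run/nginx.
pid /run/nginx/nginx.pid;
# Must live below LogsDirectory=nginx, i.e. /var/log/nginx.
error_log /var/log/nginx/error.log;
# A "user" directive is unnecessary here: systemd already starts nginx
# as User=nginx, and nginx ignores the directive when not run as root.
EOF

echo "wrote $conf_file"
```

Any cache or temporary paths Nginx uses would presumably need to point below /var/cache/nginx as well, since the rest of the filesystem is hidden or read-only for the service.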

Now, not only does the service run as non-root, but the process and its sub-processes also only have access to a very limited part of the system. All filesystem access is dropped by default, and only the necessary system directories are either made available or substituted by temporary paths. Besides that, persistence is only possible where necessary, which further limits the attack surface. Running systemd-analyze again on the new service shows the effect:

david@proxy:~$ systemd-analyze security nginx.service --no-pager

Nginx Service Security Summary
  NAME                                                        DESCRIPTION                                                       EXPOSURE
✗ PrivateNetwork=                                             Service has access to the host's network                               0.5
✓ User=/DynamicUser=                                          Service runs under a static non-root user identity
✗ CapabilityBoundingSet=~CAP_SET(UID|GID|PCAP)                Service may change UID/GID identities/capabilities                     0.3
✗ CapabilityBoundingSet=~CAP_SYS_ADMIN                        Service has administrator privileges                                   0.3
✗ CapabilityBoundingSet=~CAP_SYS_PTRACE                       Service has ptrace() debugging abilities                               0.3
✗ RestrictAddressFamilies=~AF_(INET|INET6)                    Service may allocate Internet sockets                                  0.3
✗ RestrictNamespaces=~CLONE_NEWUSER                           Service may create user namespaces                                     0.3
✓ RestrictAddressFamilies=~…                                  Service cannot allocate exotic sockets
✗ CapabilityBoundingSet=~CAP_(CHOWN|FSETID|SETFCAP)           Service may change file ownership/access mode/capabilities unres…      0.2
✗ CapabilityBoundingSet=~CAP_(DAC_*|FOWNER|IPC_OWNER)         Service may override UNIX file/IPC permission checks                   0.2
✗ CapabilityBoundingSet=~CAP_NET_ADMIN                        Service has network configuration privileges                           0.2
✓ CapabilityBoundingSet=~CAP_RAWIO                            Service has no raw I/O access
✓ CapabilityBoundingSet=~CAP_SYS_MODULE                       Service cannot load kernel modules
✓ CapabilityBoundingSet=~CAP_SYS_TIME                         Service processes cannot change the system clock
✗ DeviceAllow=                                                Service has a device ACL with some special devices                     0.1
✗ IPAddressDeny=                                              Service does not define an IP address whitelist                        0.2
✓ KeyringMode=                                                Service doesn't share key material with other services
✓ NoNewPrivileges=                                            Service processes cannot acquire new privileges
✓ NotifyAccess=                                               Service child processes cannot alter service state
✓ PrivateDevices=                                             Service has no access to hardware devices
✓ PrivateMounts=                                              Service cannot install system mounts
✓ PrivateTmp=                                                 Service has no access to other software's temporary files
✗ PrivateUsers=                                               Service has access to other users                                      0.2
✗ ProtectClock=                                               Service may write to the hardware clock or system clock                0.2
✓ ProtectControlGroups=                                       Service cannot modify the control group file system
✗ ProtectHome=                                                Service has full access to home directories                            0.2
✓ ProtectKernelLogs=                                          Service cannot read from or write to the kernel log ring buffer
✓ ProtectKernelModules=                                       Service cannot load or read kernel modules
✓ ProtectKernelTunables=                                      Service cannot alter kernel tunables (/proc/sys, …)
✗ ProtectSystem=                                              Service has full access to the OS file hierarchy                       0.2
✓ RestrictAddressFamilies=~AF_PACKET                          Service cannot allocate packet sockets
✓ RestrictSUIDSGID=                                           SUID/SGID file creation by service is restricted
✓ SystemCallArchitectures=                                    Service may execute system calls only with native ABI
✗ SystemCallFilter=~@clock                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@debug                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@module                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@mount                                    Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@raw-io                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@reboot                                   Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@swap                                     Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@privileged                               Service does not filter system calls                                   0.2
✗ SystemCallFilter=~@resources                                Service does not filter system calls                                   0.2
✗ AmbientCapabilities=                                        Service process receives ambient capabilities                          0.1
✗ CapabilityBoundingSet=~CAP_AUDIT_*                          Service has audit subsystem access                                     0.1
✗ CapabilityBoundingSet=~CAP_KILL                             Service may send UNIX signals to arbitrary processes                   0.1
✓ CapabilityBoundingSet=~CAP_MKNOD                            Service cannot create device nodes
✗ CapabilityBoundingSet=~CAP_NET_(BIND_SERVICE|BROADCAST|RAW) Service has elevated networking privileges                             0.1
✓ CapabilityBoundingSet=~CAP_SYSLOG                           Service has no access to kernel logging
✗ CapabilityBoundingSet=~CAP_SYS_(NICE|RESOURCE)              Service has privileges to change resource use parameters               0.1
✗ RestrictNamespaces=~CLONE_NEWCGROUP                         Service may create cgroup namespaces                                   0.1
✗ RestrictNamespaces=~CLONE_NEWIPC                            Service may create IPC namespaces                                      0.1
✗ RestrictNamespaces=~CLONE_NEWNET                            Service may create network namespaces                                  0.1
✗ RestrictNamespaces=~CLONE_NEWNS                             Service may create file system namespaces                              0.1
✗ RestrictNamespaces=~CLONE_NEWPID                            Service may create process namespaces                                  0.1
✗ RestrictRealtime=                                           Service may acquire realtime scheduling                                0.1
✗ SystemCallFilter=~@cpu-emulation                            Service does not filter system calls                                   0.1
✗ SystemCallFilter=~@obsolete                                 Service does not filter system calls                                   0.1
✓ RestrictAddressFamilies=~AF_NETLINK                         Service cannot allocate netlink sockets
✗ RootDirectory=/RootImage=                                   Service runs within the host's root directory                          0.1
✓ SupplementaryGroups=                                        Service has no supplementary groups
✗ CapabilityBoundingSet=~CAP_MAC_*                            Service may adjust SMACK MAC                                           0.1
✗ CapabilityBoundingSet=~CAP_SYS_BOOT                         Service may issue reboot()                                             0.1
✓ Delegate=                                                   Service does not maintain its own delegated control group subtree
✓ LockPersonality=                                            Service cannot change ABI personality
✓ MemoryDenyWriteExecute=                                     Service cannot create writable executable memory mappings
✓ RemoveIPC=                                                  Service user cannot leave SysV IPC objects around
✗ RestrictNamespaces=~CLONE_NEWUTS                            Service may create hostname namespaces                                 0.1
✗ UMask=                                                      Files created by service are world-readable by default                 0.1
✗ CapabilityBoundingSet=~CAP_LINUX_IMMUTABLE                  Service may mark files immutable                                       0.1
✗ CapabilityBoundingSet=~CAP_IPC_LOCK                         Service may lock memory into RAM                                       0.1
✗ CapabilityBoundingSet=~CAP_SYS_CHROOT                       Service may issue chroot()                                             0.1
✓ ProtectHostname=                                            Service cannot change system host/domainname
✗ CapabilityBoundingSet=~CAP_BLOCK_SUSPEND                    Service may establish wake locks                                       0.1
✗ CapabilityBoundingSet=~CAP_LEASE                            Service may create file leases                                         0.1
✗ CapabilityBoundingSet=~CAP_SYS_PACCT                        Service may use acct()                                                 0.1
✗ CapabilityBoundingSet=~CAP_SYS_TTY_CONFIG                   Service may issue vhangup()                                            0.1
✓ CapabilityBoundingSet=~CAP_WAKE_ALARM                       Service cannot program timers that wake up the system
✗ RestrictAddressFamilies=~AF_UNIX                            Service may allocate local sockets                                     0.1

→ Overall exposure level for nginx.service: 6.1 MEDIUM 😐
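The report rates the unit's settings. To see which capabilities the running process actually holds, the effective capability mask in /proc/<pid>/status can be decoded. The sketch below assumes the capability numbers from linux/capability.h; has_cap is a hypothetical helper name.

```shell
#!/bin/sh
# has_cap: hypothetical helper. Succeeds when capability number $2 is set
# in the hexadecimal mask $1 (as found in the CapEff line of
# /proc/<pid>/status).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Capability numbers as defined in linux/capability.h.
CAP_NET_BIND_SERVICE=10
CAP_SYS_ADMIN=21

# Usage on a systemd host:
#   pid=$(systemctl show -p MainPID --value nginx.service)
#   mask=$(awk '/^CapEff:/ { print $2 }' "/proc/$pid/status")
#   has_cap "$mask" "$CAP_NET_BIND_SERVICE" && echo "may bind low ports"
#   has_cap "$mask" "$CAP_SYS_ADMIN" || echo "no CAP_SYS_ADMIN"
```

For a process that only holds CAP_NET_BIND_SERVICE, the mask has just bit 10 set, which is 0000000000000400 in the notation used by /proc.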

The score shows there's still room for improvement, but many potential attack vectors have been mitigated compared to the officially provided unit file.
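Several of the remaining findings map to settings the hardened unit does not set yet. The drop-in below is an untested sketch of how they might be addressed: the file name 20-further-hardening.conf and the EXTRA_DIR variable are made up, and each line should be checked against the actual Nginx setup, since an overly strict system call filter or capability set can keep the service from starting.

```shell
#!/bin/sh
# Writes a hypothetical follow-up drop-in to a scratch directory. On a real
# host it would go to /etc/systemd/system/nginx.service.d/ (root required).
extra_dir="${EXTRA_DIR:-$(mktemp -d)}"

cat > "$extra_dir/20-further-hardening.conf" <<'EOF'
[Service]
# Addresses the CapabilityBoundingSet=~CAP_* findings: keep only the
# capability needed to bind ports 80 and 443.
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Addresses the RestrictNamespaces=~CLONE_NEW* findings.
RestrictNamespaces=true
# Addresses the RestrictRealtime= finding.
RestrictRealtime=true
# Addresses the SystemCallFilter=~@... findings with an allow-list of
# system calls commonly needed by services (needs systemd >= 239).
SystemCallFilter=@system-service
# Addresses the UMask= finding: new files are not world-readable.
UMask=0027
EOF

echo "wrote $extra_dir/20-further-hardening.conf"
```

Re-running systemd-analyze security nginx.service after each change shows which findings a setting resolves, which makes it easier to add restrictions one at a time.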

🚀 Where to Continue

In summary, Systemd offers a straightforward method for constraining a process's capabilities, primarily leveraging Linux namespaces. This approach can significantly enhance security, but it does have its limits. That is where Mandatory Access Control steps in, with tools such as AppArmor and SELinux providing fine-grained control over system access. These tools enable a more nuanced approach to restricting system access, albeit with a more intricate configuration process. It's worth noting that numerous Linux distributions provide predefined profiles for a wide range of services, simplifying the implementation of these controls.

Ultimately, balancing security against implementation effort comes down to combining Systemd's capabilities with predefined Mandatory Access Control profiles. This approach is an effective compromise: it improves security without a large configuration effort.
