After a few failed attempts at adding GPU support to k3s, this article describes how to boot a worker node with NVIDIA GPU support.
k3s, for those who are new to it, is a lightweight Kubernetes distribution.
There are a few reasons why adding GPU support is not straightforward. The main one is that k3s uses containerd as its container runtime, while most tutorials, and also the official NVIDIA k8s device plugin, assume Docker. While you can easily switch k3s to Docker, we didn't want to change the runtime itself.
Kubernetes itself has a guide for adding GPU support which outlines the basic steps.
The following recipe has been tested on GCP n1-standard-1 instances with an NVIDIA Tesla T4 GPU attached.
It assumes a running master node. Each worker with an attached GPU needs a few additional steps, which are outlined below.
Create device plugin DaemonSet
The device plugin is responsible for advertising the nvidia.com/gpu resource on a node (via the kubelet).
This DaemonSet needs to be created only once per cluster. Every node carrying the label cloud.google.com/gke-accelerator then automatically gets a pod from it assigned.
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.14/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
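To confirm the plugin is working, you can list each node's allocatable GPU count. This is a sketch that assumes kubectl access to the cluster and falls back to a message otherwise:

```shell
# List allocatable nvidia.com/gpu per node once the device plugin pod is running.
if command -v kubectl >/dev/null 2>&1; then
  STATUS=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: {.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' 2>/dev/null \
    || echo "no cluster access")
else
  STATUS="kubectl not installed - run this where the cluster is reachable"
fi
echo "$STATUS"
```

Nodes without the plugin pod (or without a GPU) simply show an empty count.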
The following steps are necessary for each node that needs GPU support. Placing them in a startup script is a good option.
Install drivers
# required kernel module
modprobe ipmi_devintf
# add necessary repositories
add-apt-repository -y ppa:graphics-drivers
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
tee /etc/apt/sources.list.d/nvidia-container-runtime.list
apt-get update
# install graphics driver
apt-get install -y nvidia-driver-440 nvidia-container-runtime nvidia-modprobe
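A quick sanity check after the install (hedged: the driver series, 440 here, may differ on your distribution):

```shell
# Verify the driver installed correctly; nvidia-smi talks to the kernel driver.
if command -v nvidia-smi >/dev/null 2>&1; then
  MSG="driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null)"
else
  MSG="nvidia-smi not found - driver install may have failed"
fi
echo "$MSG"
```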
Ensure the NVIDIA driver is loaded and the device files are ready
From: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-verifications
/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi

/sbin/modprobe nvidia-uvm

if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi
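To illustrate the major-number lookup in the second half of the script, here is the same grep/awk pipeline run against a sample /proc/devices excerpt (the numbers below are made up for the example; the real major is assigned dynamically by the kernel):

```shell
# Sample excerpt of /proc/devices; nvidia-uvm's major number varies per boot.
SAMPLE='Character devices:
  1 mem
195 nvidia-frontend
511 nvidia-uvm'

# Same pipeline as the script above: pick the major number for nvidia-uvm.
D=$(printf '%s\n' "$SAMPLE" | grep nvidia-uvm | awk '{print $1}')
echo "nvidia-uvm major: $D"
```

This is why the script reads the number at runtime instead of hardcoding it, unlike the fixed major 195 used for /dev/nvidia0 and /dev/nvidiactl.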
Install k3s
curl -sfL https://get.k3s.io | \
INSTALL_K3S_SKIP_START=true \
K3S_URL=https://IP_OF_MASTER_ADDRESS:6443 \
K3S_TOKEN=CONTENT_OF_/var/lib/rancher/k3s/server/node-token_ON_MASTER \
sh -s - \
--node-label "cloud.google.com/gke-accelerator=$(curl -fs "http://metadata.google.internal/computeMetadata/v1/instance/attributes/gpu-platform" -H "Metadata-Flavor: Google")"
INSTALL_K3S_SKIP_START prevents k3s from starting, since we first need to change the containerd config (see below).
node-label should be set to exactly that key; the value only matters if you want to schedule pods based on the available GPU. The example here reads a metadata attribute from the GCE instance. Feel free to change it to something else, or simply to true.
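The label-value lookup can be sketched standalone, with a fallback for machines that are not running on GCE (the timeout values and the "true" default are assumptions, not part of the original recipe):

```shell
# Query the GCE metadata server for the gpu-platform attribute;
# outside GCE the request fails fast and we fall back to "true".
GPU_PLATFORM=$(curl -fs --connect-timeout 2 --max-time 3 \
  -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/gpu-platform" \
  || echo "true")
echo "cloud.google.com/gke-accelerator=$GPU_PLATFORM"
```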
Configure containerd
Containerd needs to be changed to use a different container runtime. This can be achieved by adjusting the config.toml
or rather creating a config.toml.tmpl
file.
mkdir -p /var/lib/rancher/k3s/agent/etc/containerd/
# why "EOF":
# https://serverfault.com/questions/399428/how-do-you-escape-characters-in-heredoc
# ($ signs would need to be escaped -> use "EOF" instead of EOF)
cat <<"EOF" > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
[plugins.opt]
path = "{{ .NodeConfig.Containerd.Opt }}"
[plugins.cri]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
{{- if .IsRunningInUserNS }}
disable_cgroup = true
disable_apparmor = true
restrict_oom_score_adj = true
{{end}}
{{- if .NodeConfig.AgentConfig.PauseImage }}
sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"
{{end}}
{{- if not .NodeConfig.NoFlannel }}
[plugins.cri.cni]
bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}
[plugins.cri.containerd.runtimes.runc]
# ---- changed from 'io.containerd.runc.v2' for GPU support
runtime_type = "io.containerd.runtime.v1.linux"
# ---- added for GPU support
[plugins.linux]
runtime = "nvidia-container-runtime"
{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors."{{$k}}"]
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{end}}
{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins.cri.registry.configs."{{$k}}".auth]
{{ if $v.Auth.Username }}username = "{{ $v.Auth.Username }}"{{end}}
{{ if $v.Auth.Password }}password = "{{ $v.Auth.Password }}"{{end}}
{{ if $v.Auth.Auth }}auth = "{{ $v.Auth.Auth }}"{{end}}
{{ if $v.Auth.IdentityToken }}identitytoken = "{{ $v.Auth.IdentityToken }}"{{end}}
{{end}}
{{ if $v.TLS }}
[plugins.cri.registry.configs."{{$k}}".tls]
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{end}}
{{end}}
{{end}}
EOF
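Once the agent has (re)started, containerd renders the template to config.toml; a quick grep confirms the nvidia runtime made it in (the path is the k3s default from the heredoc above):

```shell
# Check the rendered containerd config for the nvidia runtime entry.
CONF=/var/lib/rancher/k3s/agent/etc/containerd/config.toml
if [ -f "$CONF" ]; then
  RUNTIME_LINE=$(grep -n "nvidia-container-runtime" "$CONF" || echo "nvidia runtime not found in $CONF")
else
  RUNTIME_LINE="config not rendered yet - has k3s-agent started?"
fi
echo "$RUNTIME_LINE"
```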
Start k3s agent
Start the k3s agent: systemctl start k3s-agent (the install script creates the k3s-agent unit only when K3S_URL is set; on a server the unit is called k3s).
That's it! :)
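As an optional smoke test, you can schedule a pod that requests a GPU and runs nvidia-smi. This is a sketch assuming kubectl access; the pod name and CUDA image tag are examples:

```shell
# Apply a minimal pod that requests one GPU and prints nvidia-smi output.
if command -v kubectl >/dev/null 2>&1; then
  cat <<'YAML' | kubectl apply -f - || true
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
YAML
  RESULT="pod submitted - check it with: kubectl logs gpu-smoke-test"
else
  RESULT="kubectl not installed - run this where the cluster is reachable"
fi
echo "$RESULT"
```

If everything is wired up, the pod lands on the GPU node and its logs show the Tesla T4.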