Ricardo Castro

Kubernetes gone bust. Now what?

Originally published on mccricardo.com.

We've been operating a few Kubernetes clusters. Someone trips, falls onto a keyboard, and deletes several services. We need to (quickly!) get those back online.

We have several options to get things back to how they were:

  • we have everything in version control - pipelines or GitOps reconcilers will take care of it;
  • restore an etcd backup - all Kubernetes objects are stored in etcd, so periodically backing up the etcd cluster data can be a lifesaver in disaster scenarios;
  • use specific Kubernetes backup tools - for example Velero.

A tool like Velero is great since it backs up Kubernetes objects as well as instructing your cloud provider to take backups of PersistentVolumes. That said, it has a ramp-up time and we need something now. Backing up our etcd cluster is always a safe bet and there are ways of doing that.
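For reference, once Velero is installed and configured, taking and restoring a backup looks roughly like this (the backup name and namespace below are made up for illustration):

# Back up the objects (and request volume snapshots) in the demo namespace.
velero backup create demo-backup --include-namespaces demo

# Later, restore everything from that backup.
velero restore create --from-backup demo-backup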

For a while now I've been a fan of Earliest Testable/Usable/Lovable as an "opposition" to MVP.

[Image: Henrik Kniberg's skateboard → scooter → bicycle → motorcycle → car illustration of Earliest Testable/Usable/Lovable vs. MVP]

With this in mind, what we want is a fast way to have a safety net (skate) in case something goes wrong. Fortunately, etcd comes equipped with built-in snapshot capabilities.

Backup etcd

We need to identify a few things from the etcd deployment in order to make a backup.
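On a kubeadm-style cluster etcd runs as a static pod on the control-plane node, so its flags can be read from the pod spec (the pod name below assumes this example's node, backup-control-plane):

# Inspect the etcd static pod through the API server...
kubectl -n kube-system get pod etcd-backup-control-plane -o yaml

# ...or read the static pod manifest directly on the control-plane node.
cat /etc/kubernetes/manifests/etcd.yaml

The relevant part of the spec looks like this: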

spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.23.0.3:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://172.23.0.3:2380
    - --initial-cluster=backup-control-plane=https://172.23.0.3:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://172.23.0.3:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://172.23.0.3:2380
    - --name=backup-control-plane
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

Armed with the advertise-client-urls, cert-file, key-file and trusted-ca-file values, we can take a snapshot:

ETCDCTL_API=3 etcdctl --endpoints https://172.23.0.3:2379 \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
  --cert="/etc/kubernetes/pki/etcd/server.crt" \
  --key="/etc/kubernetes/pki/etcd/server.key" \
  snapshot save snapshotdb

{"level":"info","ts":1610913776.2521563,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"snapshotdb.part"}
{"level":"info","ts":"2021-01-17T20:02:56.256Z","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1610913776.2563014,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"https://172.23.0.3:2379"}
{"level":"info","ts":"2021-01-17T20:02:56.273Z","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1610913776.2887816,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"https://172.23.0.3:2379","size":"3.6 MB","took":0.036583317}
{"level":"info","ts":1610913776.2891474,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"snapshotdb"}
Snapshot saved at snapshotdb

To be safe, we can verify that the backup is OK:

 ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshotdb
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 9b193bf0 |     1996 |       2009 |     2.7 MB |
+----------+----------+------------+------------+
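Since the whole point is a safety net, it's worth taking these snapshots periodically. A minimal sketch, assuming etcdctl and the certificates above are available on the control-plane node (the script path and schedule are just examples):

#!/bin/sh
# /usr/local/bin/etcd-snapshot.sh - save a timestamped etcd snapshot.
ETCDCTL_API=3 etcdctl --endpoints https://172.23.0.3:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "/var/backups/etcd/snapshot-$(date +%Y%m%d-%H%M).db"

# Example crontab entry (root on the control-plane node): snapshot every hour.
# 0 * * * * /usr/local/bin/etcd-snapshot.sh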

Restore etcd

kube-apiserver uses etcd to store and retrieve cluster state and, as such, we need to stop it first. How to do that depends on how you have kube-apiserver configured.
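If, as in this example, kube-apiserver runs as a kubeadm static pod, one way (a sketch; adjust the paths to your setup) is to move its manifest out of the kubelet's static pod directory, which makes the kubelet stop the pod:

# Stop kube-apiserver by moving its static pod manifest aside.
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml

# Confirm the container is gone before touching etcd data
# (use docker ps instead if your runtime is Docker).
crictl ps | grep kube-apiserver

With kube-apiserver stopped, we restore etcd into a new data directory: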

ETCDCTL_API=3 etcdctl snapshot restore snapshotdb --data-dir="/var/lib/etcd-restore"
{"level":"info","ts":1610913810.5761065,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}
{"level":"info","ts":1610913810.599168,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":7655}
{"level":"info","ts":1610913810.60404,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1610913810.6153672,"caller":"snapshot/v3_snapshot.go:309","msg":"restored snapshot","path":"snapshotdb","wal-dir":"/var/lib/etcd-restore/member/wal","data-dir":"/var/lib/etcd-restore","snap-dir":"/var/lib/etcd-restore/member/snap"}

We need to tell etcd to use this data directory and, once it's up and running, bring kube-apiserver back online. In the etcd manifest that means pointing the etcd-data hostPath at the restored directory:

volumes:
  - hostPath:
      path: /var/lib/etcd-restore
      type: DirectoryOrCreate
    name: etcd-data
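On a kubeadm cluster this is an edit to /etc/kubernetes/manifests/etcd.yaml; the kubelet then restarts etcd with the restored data. A rough sketch of the final steps, assuming the paths used above:

# Wait for etcd to come back with the restored data directory
# (use docker ps instead if your runtime is Docker).
crictl ps | grep etcd

# Bring kube-apiserver back by restoring its manifest.
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml

# Verify the deleted services are back.
kubectl get svc --all-namespaces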

Although this looks a bit clunky, it's an easy way (skate again) to put a safety net in place in case of disaster while buying time to work on a more capable solution (scooter -> bicycle -> motorcycle -> car). It might even come to the point where, for example, the bicycle is good enough.
