Date: July 8th, 2018
Kubernetes Version: v1.11.0
Reading logs is an essential part of the toolkit needed to debug a Kubernetes cluster.
Kubernetes logging
There are generally only two places to find Kubernetes logs: systemd and Docker. You will have to identify what is managing the service you're interested in and then know how to extract logs from that manager. For example, if the kubelet is managed by systemd, you need to know how to read systemd logs.
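If you're not sure which manager is running a component, a quick check is to ask both. This is a minimal sketch assuming a Docker-based node; swap in the component you care about:

# is it a systemd unit?
systemctl status kubelet

# or is it running as a container, e.g. a static pod?
docker ps --format '{{.Names}}' | grep kube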
systemd logs
Logs for services running under systemd can be viewed with journalctl. Here is an example of reading the logs of the kubelet, a service generally run through systemd:
journalctl --unit kubelet
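journalctl has a few flags that make this easier to work with. These are standard journalctl options, not anything Kubernetes-specific; a couple I find useful:

# follow new log lines as they arrive, starting from recent history
journalctl --unit kubelet --follow --since "10 minutes ago"

# only show entries from the current boot
journalctl --unit kubelet --boot --no-pager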
If you are using a different service manager, please consult the documentation for how to extract logs for your particular service manager.
Docker logs
Control plane components can be managed by the kubelet using static pods. You can get their logs through kubectl logs -n kube-system <podname>, or, if you are on the node where the static pod is running, you can access the Docker logs directly:
docker logs <container_id>
This is useful for cases when the kube-apiserver is not working and kubectl commands therefore fail.
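To find the container ID in the first place, filter docker ps on the node. A small sketch, assuming a Docker-based kubeadm node where the component name appears in the container name:

# find the kube-apiserver container on this node
docker ps | grep kube-apiserver

# tail its logs; use docker ps -a instead if the container keeps crashing and exiting
docker logs --tail 50 <container_id>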
Learning to read logs
Get your bearings
Get your bearings by reading logs from a fully functional Kubernetes cluster. You'll get a glimpse into what each component is doing and familiarize yourself with log lines that can be ignored, which will help you notice the lines that might contain useful debugging information.
Don't be afraid to dig into the source code and see where each log line is coming from. Kubernetes code search is really good for this. Copy and paste parts of the log line until you get some hits and read the surrounding code. You could even use these log lines as the starting point for the strategies I outline in my Learning the Kubernetes codebase post to gain a very deep understanding of the component.
Once you're comfortable reading through the logs of a working Kubernetes cluster, it's time to break things. Turn off etcd and watch the logs. Turn it back on and see how each component responds. Do the same with other components and watch what happens. You will start to understand which components talk to which, and that will help you trace symptoms back to failures faster.
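For example, on a kubeadm cluster where etcd runs as a static pod, one way to turn etcd off is to move its manifest out of the static pod directory while watching the kubelet logs. This sketch assumes the kubeadm default manifest path and file name:

# the kubelet stops etcd once its manifest disappears
mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml

# watch how the kubelet and kube-apiserver react
journalctl --unit kubelet --follow

# put the manifest back and watch everything recover
mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml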
The three most common components you will look at are the kubelet, the kube-apiserver and etcd. I'd suggest focusing on those, but do look at logs from all the components.
Examples
These examples come from a kubeadm cluster. The kubelet is managed by systemd and all control plane components are managed by the kubelet as static pods.
kubelet logs
Jul 08 16:32:51 ip-10-0-25-163 kubelet[20747]: E0708 16:32:51.415697 20747 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "ip-10-0-25-163.us-west-2.compute.internal": Get https://10.0.25.163:6443/api/v1/nodes/ip-10-0-25-163.us-west-2.compute.internal?resourceVersion=0&timeout=10s: context deadline exceeded
Jul 08 16:33:01 ip-10-0-25-163 kubelet[20747]: W0708 16:33:01.416933 20747 reflector.go:341] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: watch of *v1.Service ended with: very short watch: k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Unexpected watch close - watch lasted less than a second and no items received
Jul 08 16:33:01 ip-10-0-25-163 kubelet[20747]: W0708 16:33:01.417001 20747 reflector.go:341] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: watch of *v1.Pod ended with: very short watch: k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Unexpected watch close - watch lasted less than a second and no items received
Jul 08 16:33:01 ip-10-0-25-163 kubelet[20747]: W0708 16:33:01.417031 20747 reflector.go:341] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: watch of *v1.Node ended with: very short watch: k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Unexpected watch close - watch lasted less than a second and no items received
Jul 08 16:33:01 ip-10-0-25-163 kubelet[20747]: E0708 16:33:01.417105 20747 mirror_client.go:88] Failed deleting a mirror pod "etcd-ip-10-0-25-163.us-west-2.compute.internal_kube-system": Delete https://10.0.25.163:6443/api/v1/namespaces/kube-system/pods/etcd-ip-10-0-25-163.us-west-2.compute.internal: read tcp 10.0.25.163:36190->10.0.25.163:6443: use of closed network connection; some request body already written
The errors and warnings here are:
- deleting a mirror pod failed
- watches ended unexpectedly
- updating the node status failed
Together, these errors indicate that either etcd or the kube-apiserver is not working. If you know how the kubelet and the kube-apiserver interact, you could even tell directly that this is a problem with etcd rather than with the kube-apiserver.
But in the interest of working through the example, let's look at the kube-apiserver logs.
kube-apiserver logs
The kube-apiserver is crashing and restarting over and over. The logs look like this just before it restarts:
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0708 16:38:26.448081 1 server.go:703] external host was not specified, using 10.0.25.163
I0708 16:38:26.448218 1 server.go:145] Version: v1.11.0
I0708 16:38:26.775022 1 plugins.go:158] Loaded 7 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I0708 16:38:26.775040 1 plugins.go:161] Loaded 5 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I0708 16:38:26.775722 1 plugins.go:158] Loaded 7 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I0708 16:38:26.775737 1 plugins.go:161] Loaded 5 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
F0708 16:38:36.778177 1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry [https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt true false 1000 0xc42029ce80 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:2379: connect: connection refused)
The last line, the fatal error, is the most important: it tells us the kube-apiserver cannot reach etcd at 127.0.0.1:2379 because the connection was refused.
If we restart etcd, everything comes back up and all components are happy again.
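To confirm etcd is healthy again, you can query it directly from the node. This is a sketch that assumes a v3-capable etcdctl is installed on the node and reuses the kubeadm certificate paths shown in the storage backend config above:

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
  --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
  endpoint health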
etcd logs
etcd generally logs nothing during normal operation, unlike the kubelet, which likes to constantly tell you what it's doing even when everything is fine.
Here are some example etcd logs:
2018-07-08 16:42:20.351537 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
2018-07-08 16:42:21.039713 I | raft: 8e9e05c52164694d is starting a new election at term 4
2018-07-08 16:42:21.039745 I | raft: 8e9e05c52164694d became candidate at term 5
2018-07-08 16:42:21.039777 I | raft: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 5
2018-07-08 16:42:21.039792 I | raft: 8e9e05c52164694d became leader at term 5
2018-07-08 16:42:21.039803 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 5
2018-07-08 16:42:21.040764 I | etcdserver: published {Name:ip-10-0-25-163.us-west-2.compute.internal ClientURLs:[https://127.0.0.1:2379]} to cluster cdf818194e3a8c32
2018-07-08 16:42:21.041029 I | embed: ready to serve client requests
2018-07-08 16:42:21.041449 I | embed: serving client requests on 127.0.0.1:2379
WARNING: 2018/07/08 16:42:21 Failed to dial 127.0.0.1:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
The final warning is a bug in a particular version of etcd, but it has no impact on functionality; there is a GitHub issue tracking that bug.
Everything else
Take a look at all the other components while you're there and you'll start to build both an understanding of each component and an ability to recognize when something is wrong.
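To see what else is running on a control plane node, you can list the containers the kubelet has started. A sketch assuming a Docker-based kubeadm node, where the kubelet includes the pod and namespace names in the container names:

# control plane containers started by the kubelet include kube-system in their names
docker ps --format '{{.Names}}' | grep kube-system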
Top comments (3)
I use this to read logs
github.com/johanhaleby/kubetail
That's cool for app logs, but this post is about the system-level components of Kubernetes, some of which do not expose their logs through the kube-apiserver.
Sometimes you don't have a working cluster, so kubetail won't work. This post addresses what to do in that situation.
Where did you find those system component logs, and do you have any idea how to forward them to Splunk?