DEV Community

Arseny Zinchenko

Posted on • Originally published at rtfm.co.ua on

Kubernetes: Pods and WorkerNodes — control the placement of the Pods on the Nodes


Kubernetes allows very flexible control over how its Pods will be located on servers, i.e. WorkerNodes.

This can be useful if you need to run a Pod on a specific node configuration, for example, a WorkerNode must have a GPU, or an SSD instead of an HDD. Another example is placing individual Pods next to each other to reduce their communication latency, or to reduce cross-AvailabilityZone traffic (see AWS: Grafana Loki, InterZone traffic in AWS, and Kubernetes nodeAffinity).

And, of course, this is important for building a High Availability and Fault Tolerance architecture, when you need to spread Pods across individual Nodes or Availability Zones.

We have four main approaches to control how Kubernetes Pods are placed on WorkerNodes:

  • configure a Node in such a way that it will accept only the Pods that meet the criteria specified on the Node:
    • Taints and Tolerations: on the Node we set a taint, for which Pods must have the corresponding toleration to run on that Node
  • configure the Pod itself in such a way that it will select only the Nodes that meet the criteria specified in the Pod:
    • nodeName: only the Node with the specified name is selected
    • nodeSelector: selects Nodes with the corresponding labels and values
    • nodeAffinity and nodeAntiAffinity: the rules by which the Kubernetes Scheduler chooses a Node to launch the Pod depending on the parameters of that Node
  • configure the Pod so that it will select a Node based on which other Pods are already running there:
    • podAffinity and podAntiAffinity: the rules by which the Kubernetes Scheduler chooses a Node to launch the Pod depending on the other Pods on that Node
  • and a separate topic, Pod Topology Spread Constraints: the rules for placing Pods by failure domains - regions, Availability Zones, or Nodes

kubectl explain

Just a tip: you can always read the relevant documentation for any parameter or resource using kubectl explain:

$ kubectl explain pod
KIND: Pod
VERSION: v1
DESCRIPTION:
Pod is a collection of containers that can run on a host. This resource is
created by clients and scheduled onto hosts.
…

Or:

$ kubectl explain Pod.spec.nodeName
KIND: Pod
VERSION: v1
FIELD: nodeName <string>
DESCRIPTION:
NodeName is a request to schedule this pod onto a specific node. If it is
non-empty, the scheduler simply schedules this pod onto that node, assuming
that it fits resource requirements.

Node Taints and Pod Tolerations

So, the first option is to set restrictions on the Node on what Pods can be run on it using Taints and Tolerations.

Here, a taint on a Node "repels" Pods that do not have a corresponding toleration, while a toleration allows a Pod to be scheduled on a Node that has the matching taint.

For example, we can create a Node on which only Pods with some critical services such as controllers will be launched.

To do so, set a taint with the effect: NoSchedule - that is, prohibit scheduling of new Pods on this Node:

$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoSchedule
node/ip-10-0-3-133.ec2.internal tainted

Next, create a Pod with a toleration with the key "critical-addons":

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  tolerations:
    - key: "critical-addons"
      operator: "Exists"
      effect: "NoSchedule"

Deploy, and check Pods on that Node:

$ kubectl get pod --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-3-133.ec2.internal
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default my-pod 1/1 Running 0 2m11s 10.0.3.39 ip-10-0-3-133.ec2.internal <none> <none>
dev-monitoring-ns atlas-victoriametrics-loki-logs-zxd9m 2/2 Running 0 10m 10.0.3.8 ip-10-0-3-133.ec2.internal <none> <none>
…

But where does the Loki Pod come from? It was already running on this Node before the taint was set, and the NoSchedule effect does not evict existing Pods.

To prevent this, add a taint with the NoExecute effect - then the scheduler will perform Pod eviction and move already running Pods from this Node to other Nodes:

$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoExecute

Check taints now:

$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.spec.taints'
[
  {
    "effect": "NoExecute",
    "key": "critical-addons",
    "value": "true"
  },
  {
    "effect": "NoSchedule",
    "key": "critical-addons",
    "value": "true"
  }
]

Add a second toleration to our Pod, otherwise it will be evicted from this Node as well:

...
  tolerations:
    - key: "critical-addons"
      operator: "Exists"
      effect: "NoSchedule"
    - key: "critical-addons"
      operator: "Exists"
      effect: "NoExecute"
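As a side note, a NoExecute toleration can also be time-limited with the tolerationSeconds field, so a Pod is allowed to keep running for a while after the taint appears. A sketch (the 3600 value here is an arbitrary example):

```yaml
...
  tolerations:
    - key: "critical-addons"
      operator: "Exists"
      effect: "NoExecute"
      # evict the Pod 3600 seconds after the taint is added,
      # instead of immediately (an arbitrary example value)
      tolerationSeconds: 3600
```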

Deploy and check Pods on this Node again:

$ kubectl get pod --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-3-133.ec2.internal
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default my-pod 1/1 Running 0 3s 10.0.3.246 ip-10-0-3-133.ec2.internal <none> <none>
kube-system aws-node-jrsjz 1/1 Running 0 16m 10.0.3.133 ip-10-0-3-133.ec2.internal <none> <none>
kube-system csi-secrets-store-secrets-store-csi-driver-cctbj 3/3 Running 0 16m 10.0.3.144 ip-10-0-3-133.ec2.internal <none> <none>
kube-system ebs-csi-node-46fts 3/3 Running 0 16m 10.0.3.187 ip-10-0-3-133.ec2.internal <none> <none>
kube-system kube-proxy-6ztqs 1/1 Running 0 16m 10.0.3.133 ip-10-0-3-133.ec2.internal <none> <none>

Now, on this Node we have only our Pod and the Pods from DaemonSets, which by default run on all Nodes and have the corresponding tolerations; see How Daemon Pods are scheduled.
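You can see why by inspecting such a Pod's tolerations: system DaemonSets like aws-node typically carry a blanket toleration - with no key specified and the Exists operator, it matches every taint. A sketch:

```yaml
...
  tolerations:
    # no key with operator Exists tolerates every taint:
    - operator: "Exists"
```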

In addition to the Exists operator, which only checks for the presence of the specified taint key, it is possible to check the taint's value as well.

To do so, use the Equal operator and add the required value:

...
  tolerations:
    - key: "critical-addons"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
    - key: "critical-addons"
      operator: "Equal"
      value: "true"
      effect: "NoExecute"

To delete a taint, add a minus sign at the end:

$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoSchedule-
node/ip-10-0-3-133.ec2.internal untainted

$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoExecute-
node/ip-10-0-3-133.ec2.internal untainted

Choosing a Node by a Pod: nodeName, nodeSelector, and nodeAffinity

Another approach is when we configure a Pod in such a way that “it” chooses which Node to run on.

For this we have nodeName, nodeSelector, nodeAffinity and nodeAntiAffinity. See Assign Pods to Nodes.

nodeName

The most straightforward way: the scheduler is bypassed entirely, so nodeName takes precedence over all other placement methods:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeName: ip-10-0-3-133.ec2.internal

nodeSelector

With the nodeSelector, we can select Nodes that have the corresponding labels.

Add a label to the Node:

$ kubectl label nodes ip-10-0-3-133.ec2.internal service=monitoring
node/ip-10-0-3-133.ec2.internal labeled

Check it:

$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.metadata.labels'
{
  …
  "kubernetes.io/hostname": "ip-10-0-3-133.ec2.internal",
  "kubernetes.io/os": "linux",
  "node.kubernetes.io/instance-type": "t3.medium",
  "service": "monitoring",
  …

In the Pod’s manifest set the nodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeSelector:
    service: monitoring

If several labels are set in the Pod's nodeSelector, the Node must have all of them for the Pod to run on it.
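For example, a sketch with two labels in the nodeSelector (the disktype label here is hypothetical) - the Pod will be scheduled only on a Node that carries both:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeSelector:
    # both labels must be present on the Node:
    service: monitoring
    disktype: ssd
```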

nodeAffinity and nodeAntiAffinity

nodeAffinity operates in the same way as the nodeSelector, but has more flexible capabilities (anti-affinity to Nodes is expressed with the NotIn and DoesNotExist operators).

For example, you can set hard or soft limits: with a soft limit, the scheduler will try to launch a Pod on a matching Node, and if it cannot, will launch it on another one. With a hard limit, if the scheduler cannot launch the Pod on a matching Node, the Pod will remain in the Pending status.

A hard limit is set in the .spec.affinity.nodeAffinity field with requiredDuringSchedulingIgnoredDuringExecution, and a soft limit with preferredDuringSchedulingIgnoredDuringExecution.

For example, we can require a Pod to run in the AvailabilityZone us-east-1a or us-east-1b using the topology.kubernetes.io/zone Node label:

$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.metadata.labels'
{
  …
  "topology.kubernetes.io/region": "us-east-1",
  "topology.kubernetes.io/zone": "us-east-1b"
}

Set a hard-limit:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b

Or a soft limit. For example, with a non-existent label:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: non-exist-node-label
            operator: In
            values:
            - non-exist-value

In this case, the Pod will still be scheduled - on whichever Node the scheduler finds most suitable.

You can also combine conditions:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: non-exist-node-label
            operator: In
            values:
            - non-exist-value

When several conditions are set in the requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms, they are ORed: the Pod can be scheduled on a Node that matches any one of the terms.

When several conditions are used in the matchExpressions field, they are ANDed: all of them must match.

In the operator field you can use In, NotIn, Exists, DoesNotExist, Gt (greater than), and Lt (less than).
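To illustrate these rules together, a sketch (the instance-type values are only examples): the two nodeSelectorTerms are ORed, the expressions inside the first term are ANDed, and NotIn works as an anti-affinity to Nodes:

```yaml
...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # term 1: a Node in us-east-1a that is NOT a t3.medium
        # (both expressions must match)
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
          - key: node.kubernetes.io/instance-type
            operator: NotIn
            values:
            - t3.medium
        # OR term 2: any Node in us-east-1b
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1b
```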

soft-limit and the weight

In the preferredDuringSchedulingIgnoredDuringExecution, you can set a weight for a condition with a value from 1 to 100.

In this case, if several Nodes match, the scheduler sums the weights of the matching conditions for each Node and prefers the Node with the highest total:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
      - weight: 100
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1b

This Pod will be launched on a Node in the us-east-1b zone:

$ kubectl get pod my-pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-pod 1/1 Running 0 3s 10.0.3.245 ip-10-0-3-133.ec2.internal <none> <none>

And the zone of this Node:

$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1b

podAffinity and podAntiAffinity

Similar to selecting a Node using hard and soft limits, you can control a Pod's placement depending on the labels of the Pods already running on a Node. See Inter-pod affinity and anti-affinity.

For example, Grafana Loki has three Pods — Read, Write, and Backend.

We want to run the Read and Backend Pods in the same AvailabilityZone to avoid cross-AZ traffic, but keep them off the Nodes where Write Pods are running.

Loki Pods have labels corresponding to a component — app.kubernetes.io/component=read, app.kubernetes.io/component=backend, and app.kubernetes.io/component=write.

So, for the Read Pod, we can set a podAffinity to Pods with the label app.kubernetes.io/component=backend, and a podAntiAffinity to Pods with the label app.kubernetes.io/component=write:

...
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                - backend
            topologyKey: "topology.kubernetes.io/zone"
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                - write
            topologyKey: "kubernetes.io/hostname"
...

Here, in the podAffinity.topologyKey, we set topology.kubernetes.io/zone as the placement domain - that is, the topology.kubernetes.io/zone label of the Read Pods' Node must match that of the Backend Pods' Node.

And in the podAntiAffinity.topologyKey we set kubernetes.io/hostname - that is, do not place the Pod on WorkerNodes that already run Pods with the label app.kubernetes.io/component=write.

Let’s deploy and check where there is a Write Pod:

$ kubectl -n dev-monitoring-ns get pod loki-write-0 -o json | jq '.spec.nodeName'
"ip-10-0-3-53.ec2.internal"

And the AvailabilityZone of this Node:

$ kubectl -n dev-monitoring-ns get node ip-10-0-3-53.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1b

Check where the Backend Pod is placed:

$ kubectl -n dev-monitoring-ns get pod loki-backend-0 -o json | jq '.spec.nodeName'
"ip-10-0-2-220.ec2.internal"

And its zone:

$ kubectl -n dev-monitoring-ns get node ip-10-0-2-220.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1a

And now, a Read Pod:

$ kubectl -n dev-monitoring-ns get pod loki-read-698567cdb-wxgj5 -o json | jq '.spec.nodeName'
"ip-10-0-2-173.ec2.internal"

The Node is different from the Write or Backend Nodes, but:

$ kubectl -n dev-monitoring-ns get node ip-10-0-2-173.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1a

The same AvailabilityZone as the Backend Pod.

Pod Topology Spread Constraints

We can configure Kubernetes Scheduler in such a way that it distributes Pods by “domains”, that is, by nodes, regions, or Availability Zones. See Pod Topology Spread Constraints.

For this, we set the necessary config in the spec.topologySpreadConstraints field, which describes how Pods must be distributed across the topology.

For example, we have 5 WorkerNodes in two AvailabilityZones.

We want to run 5 Pods and for fault tolerance we want each Pod to be on a separate Node.

Then our config for a Deployment can look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: nginx:latest
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app

Here:

  • maxSkew: the maximum allowed difference in the number of Pods between any two domains (topologyKey); it is strictly enforced only when whenUnsatisfiable=DoNotSchedule - with ScheduleAnyway, the scheduler treats the skew as a soft preference and creates the Pod regardless
  • whenUnsatisfiable: can be DoNotSchedule - leave the Pod in Pending if the constraint cannot be satisfied - or ScheduleAnyway
  • topologyKey: the WorkerNode label that defines a domain, i.e. the label by which we group the Nodes when calculating the placement of Pods
  • labelSelector: which Pods to take into account when placing new Pods (for example, if Pods from different Deployments must be spread together, configure topologySpreadConstraints in both Deployments with a matching labelSelector)

In addition, you can set the nodeAffinityPolicy and/or nodeTaintsPolicy parameters to Honor or Ignore to control whether a Pod's nodeAffinity and the Node taints are taken into account when calculating the Pod distribution.
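A sketch of a constraint with both policies set explicitly (note these fields require a recent Kubernetes version - they were introduced via the NodeInclusionPolicyInTopologySpread feature gate; by default nodeAffinityPolicy behaves as Honor and nodeTaintsPolicy as Ignore):

```yaml
...
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          # count only Nodes that match the Pod's nodeAffinity/nodeSelector:
          nodeAffinityPolicy: Honor
          # count only Nodes whose taints the Pod tolerates:
          nodeTaintsPolicy: Honor
          labelSelector:
            matchLabels:
              app: my-app
```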

Let’s deploy and check the Nodes of these Pods:

$ kubectl get pod -o json | jq '.items[].spec.nodeName'
"ip-10-0-3-53.ec2.internal"
"ip-10-0-3-22.ec2.internal"
"ip-10-0-2-220.ec2.internal"
"ip-10-0-2-173.ec2.internal"
"ip-10-0-3-133.ec2.internal"

All are placed on separate Nodes.


