Maël Valais

Originally published at maelvls.dev

What the heck are Conditions in Kubernetes controllers?

While building a Kubernetes controller using CRDs, I stumbled across 'conditions' in the status field. What are conditions and how should I implement them in my controller?

In this post, I will explain what 'status conditions' are in Kubernetes and show how they can be used in your own controllers.



In the following, a 'component' is considered to be one sync loop. A sync loop (also called a reconciliation loop) is the work that must be done in order to synchronize the 'desired state' with the 'observed state'.
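For concreteness, here is a minimal sketch of what one such sync loop looks like with controller-runtime. The `FooReconciler` type and the `examplev1.Foo` CRD are hypothetical names used for illustration, not part of any real project:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	examplev1 "example.com/foo/api/v1" // hypothetical CRD package
)

// FooReconciler is one "component": a single sync loop for the Foo kind.
type FooReconciler struct {
	client.Client
}

// Reconcile is called every time something relevant happens. It compares the
// desired state (spec) with the observed state and tries to converge the two.
func (r *FooReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var foo examplev1.Foo
	if err := r.Get(ctx, req.NamespacedName, &foo); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ...create or update whatever foo.Spec asks for...
	return ctrl.Result{}, nil
}
```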

Kubernetes itself is made of multiple binaries (the kubelet on each node, one apiserver, one kube-controller-manager and one kube-scheduler), and each of these binaries has multiple components (i.e., sync loops):

| binary | sync loop = component | reads | creates | updates |
|---|---|---|---|---|
| kube-controller-manager | syncDeployment | Pod | ReplicaSet | Deployment |
| kube-controller-manager | syncReplicaSet | | Pod | |
| kubelet | syncPod | | | Pod |
| kube-scheduler | scheduleOne | | | Pod |
| kubelet | syncNodeStatus | | | Node |

We can see that a single object (the Pod) can be read, edited and updated by different components. When I say 'edited', I mean that the sync loop edits the status (which contains the conditions), not the rest of the object. The status is a way of communicating between components/sync loops.


Pod example

Let's read a Pod manifest:

```yaml
# kubectl -n cert-manager get pods -oyaml
kind: Pod
status:
  conditions:
    - lastTransitionTime: "2019-10-22T16:29:24Z"
      status: "True"
      type: PodScheduled
    - lastTransitionTime: "2019-10-22T16:29:24Z"
      status: "True"
      type: Initialized
    - lastTransitionTime: "2019-10-22T16:29:31Z"
      status: "True"
      type: ContainersReady
    - lastTransitionTime: "2019-10-22T16:29:31Z"
      status: "True"
      type: Ready
  containerStatuses:
    - image: quay.io/jetstack/cert-manager-controller:v0.6.1
      ready: true
      state: { running: { startedAt: "2019-10-22T16:29:31Z" } }
  phase: Running
```

A condition contains a Status and a Type. It may also contain a Reason and a Message. For example:

```yaml
- lastTransitionTime: "2019-10-22T16:29:31Z"
  status: "True"
  type: ContainersReady
```
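For reference, the Go type behind these fields in k8s.io/api/core/v1 looks roughly like this (JSON tags and doc comments omitted, slightly abridged):

```go
// PodCondition as declared in k8s.io/api/core/v1.
type PodCondition struct {
	Type               PodConditionType // e.g. PodScheduled, Ready
	Status             ConditionStatus  // "True", "False" or "Unknown"
	LastProbeTime      metav1.Time
	LastTransitionTime metav1.Time
	Reason             string // one-word, CamelCase category of the cause
	Message            string // human-readable details
}
```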

The pod-lifecycle documentation explains the difference between 'phase' and 'conditions':

  1. The top-level phase is an aggregated state that answers some user-facing questions, such as 'is my pod in a terminal state?', but has gaps since the actual state is contained in the conditions.
  2. The conditions array is a set of types (Ready, PodScheduled...) with a status (True, False or Unknown) that make up the 'computed state' of a Pod at any time. As we will see later, the state is almost always 'partial' (open-ended conditions).

Now, let's see what I mean by 'components'. The status of a Pod is not updated by a single sync loop: it is updated by multiple components, the kubelet and the kube-scheduler. Here is a list of the condition types per component:

| Possible condition types for a Pod | Component that updates this condition type |
|---|---|
| PodScheduled | scheduleOne (kube-scheduler) |
| Unschedulable | scheduleOne (kube-scheduler) |
| Initialized | syncPod (kubelet) |
| ContainersReady | syncPod (kubelet) |
| Ready | syncPod (kubelet) |

Although conditions are a good way to convey information to the user, they also serve as a way of communicating between internal components (e.g., between the kube-scheduler and the kubelet) and with external components (e.g., a custom controller that triggers as soon as a pod becomes 'Unschedulable', orders more VMs from the cloud provider, and adds them as nodes).

As you will see below, the 'conditions' array is considered to contain all the 'ground truth'. The 'phase' is just an abstraction of these conditions.
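For example, a controller that wants to react to a pod's conditions (rather than to the phase) can look them up directly. A minimal sketch; the helper name is made up:

```go
package controllers

import corev1 "k8s.io/api/core/v1"

// findPodCondition returns the pod's condition of the given type, or nil when
// the condition is absent. As discussed below, an absent condition must be
// interpreted the same way as Unknown.
func findPodCondition(pod *corev1.Pod, condType corev1.PodConditionType) *corev1.PodCondition {
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == condType {
			return &pod.Status.Conditions[i]
		}
	}
	return nil
}
```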

What other projects do

  • cluster-api and its providers (e.g., cluster-api-gcp-provider) do not use Conditions at all:

(Feng Min, Nov 2018) Brian Grant (@bgrant0607) and Eric Tune (@erictune) have indicated that the API pattern of having "Conditions" lists in object statuses is soon to be deprecated. These have generally been used as a timeline of state transitions for the object's reconciliation, and difficult to consume for clients that just want a meaningful representation of the object's current state. There are no existing examples of the new pattern to follow instead, just the guidance that we should use top-level fields in the status to represent meaningful information. We can revisit the specifics when new patterns start to emerge in core.

Instead of using a list of conditions, cluster-api projects do something like this:

```yaml
status:
  ready: false
  phase: "Failed"
  failureReason: "GcpMachineCrashed"
  failureMessage: "VM crashed: ..."
```
  • cert-manager uses Conditions, but the only type is 'Ready'. The rest of the information is given in Ready's Reason field. That might be unfortunate for users or other components that want to consume cert-manager's API, similarly to what happens with the PodScheduled condition and its reason, PodReasonUnschedulable. For example, the Issuer's Sync:
```yaml
kind: Issuer
status:
  conditions:
    - type: "Ready"
      status: "False"
      reason: "ErrorConfig"
      message: "Resource validation failed: ..."
```
  • OpenShift uses conditions a lot but does not use the standard 'Ready' type. For example, the cluster-version-operator (CVO) has a kind: ClusterOperator with condition types 'Available', 'Progressing', 'Degraded' and 'Upgradeable'.

  • Hyperconverged Cluster Operator (HCO) uses conditions, but in a slightly different way than cert-manager or OpenShift do: each condition is set by a different component.

It's important to point out that the Condition types on the HCO don't represent the condition the HCO is in, rather the condition of the component CRs.

  • Gardener also uses conditions, and like OpenShift, it doesn't rely on the standard 'Ready' type. For example, the kind: ControllerInstallation has condition types 'Valid' and 'Installed'.

We see a lot of variance across projects. In late 2017, there was some uncertainty as to whether Conditions would disappear. In the remainder of this post, I will try to give answers as to why you should use conditions and how to do so.

Conditions vs. State machine

(api-conventions.md, July 2017) Conditions are observations and not, themselves, state machines.

So, what is the difference between a state in a state machine and an observation? A state machine has a known, fixed set of states. In comparison, conditions offer an 'open-world' perspective with the Unknown value. For example, the status of a Pod is partly constructed by the kube-scheduler and partly by the kubelet.

But when an object gets created, no conditions are present. What does that mean?

(Tim Hockin, May 2015) The absence of a condition should be interpreted the same as Unknown.

In the end, this set of conditions is just a way of communicating changes of state between components, and this state can always be reconstructed by observing the system:

(principles.md, 2015) Object status must be 100% reconstructable by observation. Any history kept (E.g., through conditions or other fields in 'status') must be just an optimization and not required for correct operation.

Do not define comprehensive state machines for objects with behaviors associated with state transitions and/or "assumed" states that cannot be ascertained by observation.
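In your own controller, this 'absence means Unknown' rule translates into a defaulting lookup. A minimal sketch, assuming your CRD uses the standard metav1.Condition type and the helpers from k8s.io/apimachinery:

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// conditionStatus treats a missing condition exactly like an Unknown one.
func conditionStatus(conds []metav1.Condition, condType string) metav1.ConditionStatus {
	if cond := meta.FindStatusCondition(conds, condType); cond != nil {
		return cond.Status
	}
	return metav1.ConditionUnknown
}
```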

One last comment that explains why the Kubernetes docs don't have any state-machine diagram:

(Tim Hockin, Apr 2016) This [state-machine-alike diagram representing the Pod states] is not strictly correct though, because a pod can cease to be "ready" but not die. We've intentionally NOT drawn this as a state-machine because conditions are really orthogonal concepts. We report a singular status as part of kubectl, so maybe we should document THAT, but it's more complicated than this diagram captures.

Conditions vs. Events

Events are meant to save history (for non-machine consumption, i.e., for humans). The set of conditions describes the 'current' state:

(Brian Grant, Aug 2017) Status, represented as conditions or otherwise, should be complementary to events pertaining to the resource. Events provide additional information for debugging and tracing, akin to log messages, but aren't intended to be used for event-driven automation.

This was reiterated in 2019:

(Kenneth Owens, Oct 2019) Events are used to express point in time occurrences, and Conditions are used to express the current state of an ongoing process. For instance, Pod creation is an event, Deployment progressing is a condition.
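In a controller, the two mechanisms end up side by side. A hedged sketch, assuming a controller-runtime setup with a record.EventRecorder and a hypothetical Foo CRD whose status has a Conditions []metav1.Condition field:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/record"

	examplev1 "example.com/foo/api/v1" // hypothetical CRD package
)

func reportProgress(rec record.EventRecorder, foo *examplev1.Foo) {
	// Point-in-time occurrence, akin to a log line: for humans, not automation.
	rec.Event(foo, corev1.EventTypeNormal, "Created", "created underlying Deployment")

	// Ongoing process: a condition describing the current state of the object.
	meta.SetStatusCondition(&foo.Status.Conditions, metav1.Condition{
		Type:    "Progressing",
		Status:  metav1.ConditionTrue,
		Reason:  "RollingOut",
		Message: "waiting for 3 out of 5 replicas to become ready",
	})
}
```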

Orthogonality vs. Extensibility

In 2015, the Kubernetes team seemed to be attached to the idea of 'orthogonality' between conditions.

(David Oppenheimer, Apr 2015) To avoid confusion, Conditions should be orthogonal. The ones you suggested -- Instantiated, Initialized, and Terminated -- don't seem to be orthogonal (in fact I would expect a pod to transition from FFF to TFF to TTF to TTT). What you've described sounds more like a state machine than a set of orthogonal conditions, and I think Conditions should only be used for the latter.

In 2017, things started to get confusing: each condition represents one non-orthogonal "state" of a resource?! Conditions are orthogonal to each other, but the states that they represent may be non-orthogonal...

(Brian Grant, Aug 2017) Conditions were created as a way to wean contributors off of the idea of state machines with mutually exclusive states (called "phase").

Each condition should represent a non-orthogonal "state" of a resource -- is the resource in that state or not (or unknown).

Arbitrarily typed/structured status properties specific to a particular resource should just be status fields.

I did not find a definition of 'orthogonal condition' anywhere; my guess is that two conditions are orthogonal when they describe two uncorrelated parts of the system.

But in 2019, the narrative seems to have changed: Brian Grant talks about 'non-orthogonal & extensible conditions'.

(Brian Grant, Mar 2019) Rather than rigid fine-grained state enumerations that couldn't be evolved, we initially adopted simple basic states that could report open-ended reasons for being in each state (http://issues.k8s.io/1146), and later non-orthogonal, extensible conditions (http://issues.k8s.io/7856).

Orthogonality was hard to implement anyway:

(Steven E. Harris, Aug 2019) The burden is on the status type author to come up with dimensions that are orthogonal; that required careful thought when writing my own condition types.

Here is what I think: you should definitely use conditions, but don't bother with orthogonality. Just use conditions that represent important changes for the object, beginning with the 'Ready' condition type. 'Ready' is the strongest condition of all and indicates that the object is 100% operational.
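Here is what that 'Ready' advice could look like at the end of the Reconcile function sketched earlier, assuming controller-runtime and the standard metav1.Condition helpers (the reasons and the syncErr variable are made up for illustration):

```go
// Inside Reconcile, after the sync work; syncErr is the error from that work.
cond := metav1.Condition{
	Type:               "Ready",
	Status:             metav1.ConditionTrue,
	Reason:             "ReconcileSuccess", // one-word, CamelCase
	Message:            "all underlying resources are up to date",
	ObservedGeneration: foo.Generation, // ties the condition to the spec it observed
}
if syncErr != nil {
	cond.Status = metav1.ConditionFalse
	cond.Reason = "ReconcileError"
	cond.Message = syncErr.Error()
}
meta.SetStatusCondition(&foo.Status.Conditions, cond)
if err := r.Status().Update(ctx, &foo); err != nil {
	return ctrl.Result{}, err
}
```

With that in place, users can run `kubectl wait --for=condition=ready` against the resource.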

Are Conditions still used?

There was an extensive discussion, started in 2015, about removing or keeping the "phase" field. In 2017, the same thread turned into "should we keep using conditions?"; Brian Grant found conditions cumbersome (an array is harder to deal with than top-level fields) and confusing because of the open-ended statuses (True, False, Unknown). In late 2017, Brian Grant did a survey and found out that conditions are still what controller authors should use. In 2019, Daniel Smith reminded us that conditions are not going away:

(Daniel Smith, May 2019) Conditions are not going to be removed.

The ground truth is set as conditions by the components that are nearby, e.g. kubelet sets "DiskPressure = True".

The set of conditions is summarized into phases (or secondary conditions) for consumption by general controllers. E.g., if there is disk pressure, that can be aggregated into conditionSummary.schedulable: false. The process doing this needs to know all possible ground truth statements; the process of summarizing them isolates the rest of the cluster from needing to know this.

Conditions vs. Reasons

Do I really need to bother with this .status.conditions array? Can I just use .status.phase with a simple enum, or just a reason as part of an existing 'Ready' condition?

At first, Kubernetes would rely on many Reasons:

(api-conventions.md, July 2017) In condition types, and everywhere else they appear in the API, Reason is intended to be a one-word, CamelCase representation of the category of cause of the current status, and Message is intended to be a human-readable phrase or sentence, which may contain specific details of the individual occurrence.

Reason is intended to be used in concise output, such as one-line kubectl get output, and in summarizing occurrences of causes, whereas Message is intended to be presented to users in detailed status explanations, such as kubectl describe output.

Later, they moved to non-orthogonal (read: possibly-correlated) conditions:

(Brian Grant, Mar 2019) Rather than rigid fine-grained state enumerations that couldn't be evolved, we initially adopted simple basic states that could report open-ended reasons for being in each state (2014), and later non-orthogonal, extensible conditions (2017).

Since the Pod conditions do not have a 'Reason' field (in the previous Pod example), let's take a look at the Node conditions instead:

```yaml
kind: Node
status:
  conditions:
    - type: DiskPressure
      status: "False"
      reason: KubeletHasNoDiskPressure
      message: kubelet has no disk pressure
      lastHeartbeatTime: "2019-11-17T14:18:26Z"
      lastTransitionTime: "2019-10-22T16:27:53Z"
    - type: Ready
      status: "True"
      reason: KubeletReady
      message: kubelet is posting ready status. AppArmor enabled
      lastHeartbeatTime: "2019-11-17T14:18:26Z"
      lastTransitionTime: "2019-10-22T16:27:53Z"
```

Here, the 'Ready' condition type has an additional 'Reason' field that gives the user more information about the transition. The Reason field is sometimes used by other components: for example, PodReasonUnschedulable is a reason of the PodScheduled condition type. This reason is set by the kube-scheduler and is 'consumed' by the kubelet (or the reverse, I can't really tell).

So, should a given state be a Reason on an existing condition, or a Condition of its own? If you feel that this state might be interesting to the rest of the system, then use a proper Condition.

Regarding the 'phase' top-level field, I would recommend offering it to your users when the conditions alone cannot help them answer questions such as 'has my pod terminated?'.
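If you do offer a phase, compute it from the conditions rather than maintaining it as independent state, since the conditions are the ground truth. A possible sketch, reusing the metav1.Condition helpers (the condition type names are made up):

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// computePhase derives a coarse, user-facing phase from the conditions. The
// phase is only an abstraction: it must be reconstructable at any time.
func computePhase(conds []metav1.Condition) string {
	switch {
	case meta.IsStatusConditionTrue(conds, "Errored"):
		return "Failed"
	case meta.IsStatusConditionTrue(conds, "Ready"):
		return "Running"
	default:
		return "Pending"
	}
}
```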

How many conditions?

If you think that the 'Errored' state can be useful for other components or the user (e.g., kubectl wait --for=condition=errored), then add it. If you think it is important for other components or for the user to know that something is in progress, go ahead and add an 'InProgress' condition.
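Since each condition type becomes part of your API (see the compatibility note below), it helps to declare them as named constants in one place. The names here are only illustrative:

```go
// Condition types for a hypothetical Foo CRD. Renaming or removing one of
// these is a breaking API change.
const (
	// ConditionReady reports that the Foo is 100% operational.
	ConditionReady = "Ready"
	// ConditionErrored enables e.g. `kubectl wait --for=condition=errored`.
	ConditionErrored = "Errored"
	// ConditionInProgress reports that reconciliation work is still ongoing.
	ConditionInProgress = "InProgress"
)
```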

Regarding the naming of condition types, here is some advice:

(api-conventions.md, July 2017) Condition types should indicate state in the "abnormal-true" polarity. For example, if the condition indicates when a policy is invalid, the "is valid" case is probably the norm, so the condition should be called "Invalid".

Also, remember that these types are part of your API and you should keep in mind that they require maintaining backwards- and forwards-compatibility. Adding a new condition is not free: you must maintain it over time.

(api-conventions.md, July 2017) The meaning of a Condition can not be changed arbitrarily - it becomes part of the API, and has the same backwards- and forwards-compatibility concerns of any other part of the API.

Conditions are also a clean way of letting third-party components (such as the cluster-api controller) add their own 'named' conditions to an existing object, e.g., to a Pod (see pod-readiness-gate):

```yaml
kind: Pod
spec:
  readinessGates:
    - conditionType: "www.example.com/feature-1"
status:
  conditions:
    - type: Ready # this is a builtin PodCondition
      status: "False"
    - type: "www.example.com/feature-1" # an extra PodCondition
      status: "False"
```

Update 20 Dec 2021: Xiang Yuxuan kindly nudged me about the fact that I was saying that the 2017 thread was about removing conditions, when it was initially about removing the 'phase' field.

Top comments (7)

XiangYuxuan

Really nice blog!!!

There was an extensive discussion about removing or keeping these conditions. They were described by Brian Grant in Aug 2017 as cumbersome (an array is harder to deal with than top-level fields)

Typo: the discussion may have been about whether to remove 'phase'.

Kubernetes API Conventions now indicate that it's better not to use phase in the future.

Jeremy Lewi

Really nice blog.

It looks like the Kubernetes API Conventions now include some pretty useful description of how conditions should be used.

krishnan-recrooz

A condition is an observation made by an observer, whereas a state is an observation by the observed itself. So the sync loops can state the condition of the Pod, but the Pod can itself declare its state. Events are orthogonal, and both the observer and the observed can raise events. The soft difference here would be that the observer's events are coarse, i.e., the observer only gets to raise events at the point of observation, whereas the observed can raise events anytime.

nsoubelet

Hello @Maël. First of all, really nice article. I do not remember the last time I read something with this kind of good research and concrete/trustworthy references.

This discussion about Conditions, Status and Events led me to a cul-de-sac when it comes to an event-driven design, since Watchers in Kubernetes are related to Events, but, as Brian Grant says, they aren't intended to be used for event-driven automation.
Let's imagine that we have a controller and a CRD. That controller does the reconciliation but, in case of problems, there is another component in the cluster that finishes the lifecycle, for example rolling back changes in case a certain "back-off" threshold is reached. Let's call it a "distributed controller".

In that scenario, what should the other component listen to? That is, theoretically speaking, events won't provide the actual state, only a series of "logs"; but to get conditions, the other component would have to poll the API regularly, and that does not scale.

So, in a nutshell, the question is: what do you think is the best solution when having an event-driven approach in which a controller is a bunch of independent components reacting to state changes?
(don't get me wrong, I am referring to state using the idea of Conditions explained in the article, so in this context a state could be the latest condition)

Thanks! :)

Maël Valais

What should the other component listen to?

Each component (or "control loop") should reconcile every time something "happened". The control loop doesn't know what happened; it has to figure it out by itself (control loops are "level triggered" as opposed to being "edge triggered").

Now, how should one control loop "communicate" with the other control loop?

First, it is OK to have two control loops reconciling the same object as long as they are doing it in an orthogonal way (the control loops are "decoupled"). The pod object is a good example of an object being reconciled by orthogonal controllers: kube-scheduler and kubelets both reconcile pods. Neither controller needs information from the other control loop.

The opposite ("coupled" control loops) also exists: cert-manager has 4 different control loops reconciling the Certificate object. The coupling is due to a condition that gets added by one control loop (the "trigger controller") and removed by another (the "readiness controller"). In my opinion, this is a mistake we made in the cert-manager codebase. These controllers are not orthogonal, and understanding one controller means that you also need to understand how the other controllers behave.

In your example, the second controller will be triggered every time something happens, and it will need to figure out by itself that the object has entered the back-off threshold. I imagine that the first controller stores some form of state to keep track of the back-off, something like this:

```yaml
status:
  conditions:
    - type: BackoffThresholdReached
      status: "True"
```

The second controller would wait for this condition to perform operations. This bit of state can't be guessed by the second controller, so the first controller is definitely passing state to the second controller.

I think it is OK to do that as long as the second controller doesn't change that same condition.

Collapse
 
krishnanrecrooz profile image
krishnan-recrooz

Wouldn't that be the use case for leveraging the "phase"?

Maël Valais

Hi! Thank you so much for your comment! Your question is very interesting, I need a little time to think in order to (try to) answer. 😅