Michael Crenshaw

How to debug an Argo Workflow

Argo Workflows is a tool for running a series (or a graph) of containers on Kubernetes, tying them together into a workflow.

It's a relatively young project, so things like error messages and documentation still need some work.

In the meantime, here are some steps you can take when your Workflow doesn't behave as expected.

Inspect the Workflow with Argo CLI

For simple issues, the Argo CLI will include a short description in the MESSAGE column of its argo get output.

$ argo get workflow-template-dag-diamond 
Name:                workflow-template-dag-diamond
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Conditions:          
 Completed           True
Created:             Mon Mar 01 08:51:26 -0500 (7 minutes ago)
Started:             Mon Mar 01 08:51:26 -0500 (7 minutes ago)
Finished:            Mon Mar 01 08:51:36 -0500 (7 minutes ago)
Duration:            10 seconds
Progress:            1/1
ResourcesDuration:   5s*(1 cpu),5s*(100Mi memory)

STEP                              TEMPLATE                                               PODNAME                                   DURATION  MESSAGE
 ✔ workflow-template-dag-diamond  diamond                                                                                                      
 └─✔ A                            workflow-template-whalesay-template/whalesay-template  workflow-template-dag-diamond-2997968480  6s

If there's a message but it's not detailed enough to pin down the problem, copy the PODNAME of the failed step and skip ahead to the section on using kubectl to describe the Pod.

If there's no useful message, try describing the Workflow.

Use kubectl to describe the Workflow

Sometimes there's a problem with the whole Workflow that doesn't fit nicely in argo get's MESSAGE column.

kubectl describe workflow will print a lot more details about the Workflow, including a list of Events. The Events often contain details about what went wrong.

$ kubectl describe workflow workflow-template-dag-diamond
Name:         workflow-template-dag-diamond

...

Events:
  Type    Reason                 Age    From                 Message
  ----    ------                 ----   ----                 -------
  Normal  WorkflowRunning        8m42s  workflow-controller  Workflow Running
  Normal  WorkflowNodeSucceeded  8m32s  workflow-controller  Succeeded node workflow-template-dag-diamond.A
  Normal  WorkflowNodeSucceeded  8m32s  workflow-controller  Succeeded node workflow-template-dag-diamond
  Normal  WorkflowSucceeded      8m32s  workflow-controller  Workflow completed


Use kubectl to describe the Pod

If a Pod fails, there are a number of places that may hold clues. One is the Events associated with the Pod.

Pod names may be unpredictable (they often have random suffixes like -93750129), so use argo get to get the name of the suspect Pod.

Then use kubectl describe po to see the Pod details, including the Events.

$ kubectl describe po workflow-template-dag-diamond-2997968480
Name:         workflow-template-dag-diamond-2997968480

...

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  36s                default-scheduler  Successfully assigned default/workflow-template-dag-diamond-2997968480 to docker-desktop
  Normal   Pulled     35s                kubelet            Container image "argoproj/argoexec:v2.12.9" already present on machine
  Normal   Created    35s                kubelet            Created container wait
  Normal   Started    35s                kubelet            Started container wait
  Warning  Failed     31s                kubelet            Failed to pull image "docker/whalesay": rpc error: code = Unknown desc = Error response from daemon: Head https://registry-1.docker.io/v2/docker/whalesay/manifests/latest: x509: certificate is valid for auth.docker.io, not registry-1.docker.io
  Normal   Pulling    17s (x2 over 35s)  kubelet            Pulling image "docker/whalesay"
  Warning  Failed     16s (x2 over 31s)  kubelet            Error: ErrImagePull
  Warning  Failed     16s                kubelet            Failed to pull image "docker/whalesay": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: x509: certificate is valid for auth.docker.io, not registry-1.docker.io
  Normal   BackOff    5s (x2 over 30s)   kubelet            Back-off pulling image "docker/whalesay"
  Warning  Failed     5s (x2 over 30s)   kubelet            Error: ImagePullBackOff

This Workflow failed because a proxy issue is preventing pulls from Docker Hub.

Sometimes the problem doesn't show up in the Events, because the failure is inside one of the step's containers.

Use kubectl to read the Pod logs

Pods run as part of Argo Workflows have two or three containers: wait, main, and sometimes init.

The wait sidecar is injected by Argo to keep an eye on the main container (your code) and communicate with the Argo Workflow controller (another Pod) about the step's progress.

The main container is the one you set up when you defined the Workflow in yaml. (Look for the image, command, args, and source items to see part of this Pod's configuration.)
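For reference, here's roughly what that section looks like in the Workflow yaml for the whalesay example used above (the args line is illustrative; your template may not have one):

  templates:
    - name: whalesay-template
      inputs:
        parameters:
          - name: message
      container:
        image: docker/whalesay
        command: [cowsay]
        args: ["{{inputs.parameters.message}}"]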

The init container, if present, is also injected by Argo. It does things like pulling artifacts into the Pod.

To read the logs, use kubectl logs. For example:

$ kubectl logs workflow-template-dag-diamond-2997968480 init
error: container init is not valid for pod workflow-template-dag-diamond-2997968480
$ kubectl logs workflow-template-dag-diamond-2997968480 wait
time="2021-03-01T14:09:18.339Z" level=info msg="Starting Workflow Executor" version=v2.12.9
time="2021-03-01T14:09:18.346Z" level=info msg="Creating a docker executor"
time="2021-03-01T14:09:18.346Z" level=info msg="Executor (version: v2.12.9, build_date: 2021-02-16T22:51:48Z) initialized (pod: default/workflow-template-dag-diamond-2997968480) with template:\n{\"name\":\"whalesay-template\",\"arguments\":{},\"inputs\":{\"parameters\":[{\"name\":\"message\",\"value\":\"A\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"docker/whalesay\",\"command\":[\"cowsay\"],\"resources\":{}}}"
time="2021-03-01T14:09:18.346Z" level=info msg="Waiting on main container"
time="2021-03-01T14:14:17.998Z" level=info msg="Alloc=4699 TotalAlloc=14633 Sys=70080 NumGC=6 Goroutines=7"

The logs from init and wait may be a bit difficult to read, because they come from Argo. The logs for main will be from your configured image, so they'll probably be more familiar.
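To see just your own step's output, read the main container's logs the same way (add -f to follow a step while it's still running). Using the Pod from the earlier example:

$ kubectl logs workflow-template-dag-diamond-2997968480 main
$ kubectl logs -f workflow-template-dag-diamond-2997968480 main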

Use kubectl to read the Workflow controller logs

Argo comes with a Pod called the "Workflow controller" that ushers each Workflow through the process of running all its steps.

If all the other debugging techniques fail, the Workflow controller logs may hold helpful information.

First, find the Pod name. If you used the default Argo installation command, the Pod will be in the argo namespace.

$ kubectl get po -n argo
NAME                                  READY   STATUS    RESTARTS   AGE
argo-server-6bb488c6c8-ff88g          1/1     Running   0          40m
workflow-controller-57db6b46f-7qfr9   1/1     Running   0          40m
$ kubectl logs workflow-controller-57db6b46f-7qfr9 -n argo

... lots of stuff here ...
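
The controller logs cover every Workflow it manages, so it usually helps to filter for your Workflow's name, with something like this (the controller Pod name will differ on your cluster):

$ kubectl logs workflow-controller-57db6b46f-7qfr9 -n argo | grep workflow-template-dag-diamond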

Ask for help

If none of these solves your problem, ask a question on StackOverflow, start a discussion on GitHub, or ask in the Argo Slack.

These are just a few of my go-to tools. If I'm missing anything, please comment!

Top comments (1)

fuzznaut • Edited

Good article! I have something to add which helps me a lot in debugging workflows.

Oftentimes I need to get a shell inside the main container of the stage I'm trying to debug. To do this, I temporarily replace (or add) the command field of that workflow container with a long-running command like tail -f /dev/null or sleep infinity and submit the workflow, as in the sketch below.
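
For example, here's a sketch of that tweak applied to the whalesay template from the article (the original command is kept as a comment so it's easy to restore):

    - name: whalesay-template
      container:
        image: docker/whalesay
        # command: [cowsay]          # original command, temporarily disabled
        command: [sleep, infinity]   # keeps the container alive so you can exec into it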

Now the workflow will get stuck at the stage running that command, and I'm free to exec into the main container of that pod with something like kubectl -n argo exec -itc main <pod_name> -- bash (replace bash with any other command your container can run).

After this, I can manually run the code the stage was intended to run with an interactive debugger, or explore how the container environment is set up. This is useful when your code depends on artifacts produced in past stages, or for debugging directly in the environment that's giving you problems.