
Anusha Ragunathan for Intuit Developers

Posted with Jason Johl and Kevin Mathew Downey • Originally published at Medium

Examining Approaches and Patterns for Debuggability: Ephemeral Containers and Argo Workflows

This blog is co-authored by Anusha Ragunathan, Kevin Downey, and Jason Johl and is part of a series by the Intuit platform engineering team examining approaches and patterns for debuggability.

Intuit operates at vast scale: more than 325 Kubernetes clusters encompassing 28,000 namespaces, which serve around 2,500 production services and a multitude of pre-production services. This infrastructure supports a development community of 8,000 engineers across 1,000 development teams. Debugging distributed systems at this scale is a non-trivial task.


In this blog, we will explore the journey behind building a “Debuggability Paved Road” at Intuit. A Paved Road is a platform engineering concept: a common set of standardized tools, practices, and processes that streamline development workflows. Our goal was to create a standardized debugging experience across our development ecosystem to improve developer efficiency and reduce mean time to resolution (MTTR).

We’ll delve into the background of our infrastructure, the challenges we faced, and the two solutions we implemented:

  1. Interactive debugging using ephemeral containers
  2. One-click debugging with Argo Workflows

Finally, we’ll share our key learnings and takeaways for others looking to tackle similar challenges.

The problem: bugs


With a development community as extensive as ours, one thing remains constant: bugs. These can range from esoteric 5xx errors to transient timeouts, and debugging distributed systems is notoriously challenging. Traditional methods like metrics, logs, and distributed traces often fall short when deeper debugging and observability are required. Whether it’s running a memory profiler, capturing network packets, or deploying custom debugging tools, friction in the debugging process increases MTTR and can negatively impact business-critical applications and end customers.

We identified three primary challenges impeding efficient debugging at Intuit:

  • Debug Tooling Setup: Most developers have a set of favorite debugging tools, but security best practices harden app container images, preventing direct access to shells, package managers, or any sort of debug tooling.
  • Abstraction: Platform abstractions simplify deployment by hiding Kubernetes details, but that same hiding becomes an obstacle when developers need to debug their applications.
  • Fragmented Expertise: Debugging expertise was siloed, with experts in Java, networking, or Linux not cross-pollinating their knowledge across teams, causing inconsistent and prolonged issue resolution timelines.

To address these challenges, we aimed to create a Debuggability Paved Road, extending the concept of a developer paved road — which usually focuses on best practices and guidelines for development and deployment. First we tackled the problem of creating an automated workflow using a built-in feature of Kubernetes — ephemeral containers.

Interactive Debugging Using Ephemeral Containers

Ephemeral containers reached beta in Kubernetes 1.23 and became generally available in version 1.25. Unlike regular containers, an ephemeral container can be added to a running pod for debugging without restarting it. It shares the pod’s network, IPC (inter-process communication), and UTS (UNIX time-sharing) namespaces and, when it targets a specific container, that container’s process namespace, which makes it well suited to diagnostics.

Ephemeral containers are available by default in Kubernetes 1.23 and later.
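
To make this concrete, here is a minimal sketch of the kind of patch body a client could send to a pod's `ephemeralcontainers` subresource. It is an illustration, not our production code: the container name `debugger`, the image `registry.example.com/debug-toolbox:latest`, and the helper function are assumptions. The field names (`ephemeralContainers`, `targetContainerName`, `stdin`, `tty`) come from the Kubernetes pod spec.

```python
import json


def build_ephemeral_container_patch(target_container: str,
                                    debug_image: str,
                                    name: str = "debugger") -> dict:
    """Build a patch body that adds one ephemeral container to a running pod.

    The function name, default container name, and image are hypothetical.
    """
    return {
        "spec": {
            "ephemeralContainers": [
                {
                    "name": name,
                    "image": debug_image,
                    # Ask the runtime to place the debug container in the
                    # process namespace of the application container.
                    "targetContainerName": target_container,
                    "command": ["/bin/sh"],
                    # Keep stdin open and allocate a TTY for an interactive shell.
                    "stdin": True,
                    "tty": True,
                }
            ]
        }
    }


patch = build_ephemeral_container_patch(
    target_container="app",
    debug_image="registry.example.com/debug-toolbox:latest",  # assumed image
)
print(json.dumps(patch, indent=2))
```

Setting `targetContainerName` is what lets tools in the debug container see the application's processes; without it, the debug container still shares the pod's network namespace but not the application's process tree.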

We used ephemeral containers to develop a workflow that allows service personas to easily launch a debug shell by selecting their workspace, region, and pod name. Upon clicking “Connect,” an ephemeral container is launched in the pod, targeting the main application container, thereby enabling deep inspection.

One-click debugging experience to spin up a debug shell for any service at Intuit

Here’s an example workflow and how it works in the backend:

  • User Request: The service persona requests a debug shell specifying the namespace, region, and pod name through the UI.
  • Multi-Cluster Orchestrator: Our orchestrator receives this request, authenticates it, and uses the Kubernetes API (via client-go) to launch an ephemeral container in the targeted pod.
  • Bidirectional Streaming: A secure HTTP connection is upgraded to a WebSocket, providing the user with a debug shell. All interactions are streamed back and forth between the debug shell and the ephemeral container.

Debug shell request and response flow
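
The routing part of this flow can be sketched roughly as follows. Our orchestrator uses client-go; this Python version only illustrates the idea, with authentication and streaming left out. The `CLUSTERS` lookup table, the example hostnames, and every function name are hypothetical. The two URL paths are the standard Kubernetes pod subresources for adding ephemeral containers and attaching to a container.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class DebugShellRequest:
    """What the UI sends: namespace, region, and pod name, plus the caller."""
    namespace: str
    region: str
    pod: str
    user: str


# Hypothetical registry mapping (namespace, region) to a cluster API server.
CLUSTERS = {
    ("payments-prod", "us-west-2"): "https://k8s-usw2-01.example.internal",
}


def resolve_cluster(req: DebugShellRequest) -> str:
    """Find the API server of the cluster that hosts the requested pod."""
    try:
        return CLUSTERS[(req.namespace, req.region)]
    except KeyError:
        raise LookupError(
            f"no cluster registered for {req.namespace} in {req.region}"
        ) from None


def ephemeral_containers_url(api_server: str, req: DebugShellRequest) -> str:
    """URL of the pod subresource used to add an ephemeral container."""
    return (f"{api_server}/api/v1/namespaces/{req.namespace}"
            f"/pods/{req.pod}/ephemeralcontainers")


def attach_url(api_server: str, req: DebugShellRequest, container: str) -> str:
    """URL used to attach an interactive stream to the debug container."""
    query = urlencode({"container": container, "stdin": "true",
                       "stdout": "true", "tty": "true"})
    return (f"{api_server}/api/v1/namespaces/{req.namespace}"
            f"/pods/{req.pod}/attach?{query}")


req = DebugShellRequest("payments-prod", "us-west-2", "payments-7d9f", "alice")
server = resolve_cluster(req)
print(ephemeral_containers_url(server, req))
print(attach_url(server, req, "debugger"))
```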

We also ensured our debug container image included a comprehensive toolbox, from general Linux debugging tools to language-specific packages. Security measures, such as session timeouts, RBAC controls, OPA policies, and thorough auditing, further fortify our solution.
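
As one illustration of the RBAC piece, a namespaced Role limited to what a debug shell needs might look like the sketch below. The role name and helper are hypothetical, and the exact verbs required depend on the client making the calls, so treat the list as an assumption to verify against your own cluster rather than a description of our policy.

```python
import json


def build_debug_role(namespace: str, name: str = "debug-shell") -> dict:
    """Build a Role manifest granting only debug-shell permissions.

    The role name and the verb lists are illustrative assumptions.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Look up the target pod.
            {"apiGroups": [""], "resources": ["pods"],
             "verbs": ["get", "list"]},
            # Add an ephemeral container to a running pod.
            {"apiGroups": [""], "resources": ["pods/ephemeralcontainers"],
             "verbs": ["patch", "update"]},
            # Open an interactive stream to the debug container.
            {"apiGroups": [""], "resources": ["pods/attach"],
             "verbs": ["create", "get"]},
        ],
    }


role = build_debug_role("payments-prod")
print(json.dumps(role, indent=2))
```

Note that the role grants nothing on `pods/exec`, so access to the application container itself stays closed; all interaction goes through the debug container.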

One surprise was that the debug shell is used not just during incidents but also during service onboarding. Testing database connectivity from a service, checking secrets configuration, and verifying a service’s connectivity to itself and to downstream services are a few of the popular uses we have seen in our developer community.

Once we had our debug shell workflow defined, we tackled the next problem: a workflow engine for language-specific debugging.

One-Click Debugging with Argo Workflows

We knew there was a universal need for developers to debug running code, but often the specific debugging techniques and tools varied quite a bit depending on the language and framework. Intuit primarily uses Java Spring for web services, though we often see services in Python, Golang, and others. We knew we needed to customize the debuggability experience per language to provide a good user experience. For purposes of this blog, we will take examples from our Java debugging workflows, but we have built debugging workflows for Python and Golang as well. Let us know in the comments if you’re interested in learning more about those workflows!

For our Java developers, we built a simplified UI for taking thread or heap dumps: the developer selects the target environment and pod and triggers a workflow. The workflow captures the required information, sanitizes it to mask any sensitive data, and uploads it to S3 for download. This approach automates complex debugging actions and provides auditable, downloadable artifacts for further analysis.
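
The sanitization step could look something like the following sketch for text payloads such as thread dumps. The patterns and the `sanitize` function are assumptions for illustration, not the rules we actually run, and binary heap dumps would need a different approach.

```python
import re

# Assumed pattern: key=value or key: value pairs whose key looks like a credential.
_CREDENTIAL_PAIR = re.compile(
    r"\b(password|passwd|secret|token|api[_-]?key)\b(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)
# Assumed pattern: HTTP bearer tokens.
_BEARER_TOKEN = re.compile(r"\b(bearer\s+)[A-Za-z0-9._~+/-]+=*", re.IGNORECASE)


def sanitize(text: str) -> str:
    """Mask values that look like secrets while keeping keys readable."""
    text = _BEARER_TOKEN.sub(r"\1****", text)
    text = _CREDENTIAL_PAIR.sub(r"\1\2****", text)
    return text


print(sanitize("spring.datasource.password=hunter2"))
print(sanitize("Authorization: Bearer abc.def-123"))
```

Keeping the key and masking only the value preserves enough context for the developer to see which setting was present without exposing it.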

A debug workflow involves:

  • Triggering a Workflow: From the UI, the developer selects an action (e.g., take a thread dump) and the relevant pod.
  • Multi-Cluster Orchestrator: The orchestrator determines the target cluster and namespace for the relevant pod and prepares to launch an Argo workflow in a privileged namespace of the target cluster.
  • Executing the Workflow: The workflow is launched in the debug namespace and captures the required data by hitting the application’s Java Spring Boot actuator endpoints.
  • Sanitizing and Uploading: The payload is sanitized and uploaded to S3. The developer gets a pre-signed URL to download the artifact.
  • Analyzing the Data: The developer can analyze the downloaded thread or heap dump using their preferred tools, such as yCrash.

Debugging workflow architecture
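
The steps above map naturally onto an Argo Workflow with three sequential steps. The sketch below builds such a manifest as a Python dictionary. It is an assumption-laden outline rather than our actual workflow: the image name, the port, and the `sanitize-dump` and `upload-dump` commands are hypothetical, and the shared volume that would carry `/work` between steps is omitted for brevity. The `/actuator/threaddump` path is the standard Spring Boot Actuator endpoint.

```python
import json

TOOL_IMAGE = "registry.example.com/debug-workflow-tools:latest"  # assumed image


def _container_template(name: str, script: str) -> dict:
    """One workflow template that runs a shell command in the tools image."""
    return {
        "name": name,
        "container": {
            "image": TOOL_IMAGE,
            "command": ["sh", "-c"],
            "args": [script],
        },
    }


def build_thread_dump_workflow(namespace: str, pod_ip: str, port: int = 8080) -> dict:
    """Build a Workflow manifest: capture, then sanitize, then upload."""
    step_names = ["capture", "sanitize", "upload"]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "thread-dump-", "namespace": namespace},
        "spec": {
            "entrypoint": "main",
            "arguments": {
                "parameters": [
                    {"name": "target", "value": f"http://{pod_ip}:{port}"}
                ]
            },
            "templates": [
                # Each inner list is one step group; groups run in order.
                {"name": "main",
                 "steps": [[{"name": n, "template": n}] for n in step_names]},
                _container_template(
                    "capture",
                    "curl -sf {{workflow.parameters.target}}/actuator/threaddump"
                    " -o /work/dump.json",
                ),
                # Hypothetical helper commands shipped in the tools image.
                _container_template("sanitize", "sanitize-dump /work/dump.json"),
                _container_template("upload", "upload-dump /work/dump.json"),
            ],
        },
    }


workflow = build_thread_dump_workflow("debug", "10.0.12.34")
print(json.dumps(workflow, indent=2))
```

Running the workflow in a dedicated namespace, as described above, keeps the capture and upload credentials out of the application's own namespace.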

By combining ephemeral containers for debug shells with the ability to spin up these workflows on demand with Argo Workflows, we were able to realize our goal of a consistent Debuggability Paved Road for all Intuit services. From our development portal, any Intuit engineer at any level of expertise can debug their service without needing to set up complex local debug tooling.

Debuggability UI that an Intuit engineer sees for debugging a Java service. They can take a thread or heap dump of the application with one simple click.

Finally, we also wanted to ensure the stability of the underlying infrastructure during debugging. By incorporating managed endpoints and network policies, we ensure that debugging actions do not compromise the integrity and performance of the application. For example, this helps guard against crashes when a heap dump is taken from an application that is leaking memory or is otherwise overloaded. Intuit engineers can debug without fear of negatively impacting the underlying infrastructure.

Takeaways and Next Steps

The journey of building a Debuggability Paved Road at Intuit has been enlightening and full of learnings. By leveraging ephemeral containers and Argo Workflows, we’ve managed to enhance our debugging capabilities, reduce MTTR, and boost developer productivity. We came away with a few larger takeaways:

  • Enhanced Developer Velocity and Reduced MTTR: Streamlining debugging actions with tools like ephemeral containers and automated workflows accelerates issue resolution.
  • Automated and Secure Sensitive Data Access: Implement efficient audit mechanisms, RBAC controls, and session management to protect sensitive debugging data.
  • Facilitate Seamless Collaboration: Provide consistent, auditable workflows that enable all team members to contribute to debugging efforts effectively.
  • Democratize Tooling: Ensure all developers have access to the tools needed to resolve issues, breaking down silos of expertise.

By establishing a Debuggability Paved Road, we’ve been able to uphold our commitment to improving developer experiences, maintaining security, and ensuring the reliability of our services. We hope our experience and solutions inspire other organizations facing similar challenges in debugging distributed systems at scale to create more efficient and secure debugging processes.

Stay tuned here for more updates on our Intuit development platform journey!


Intuit’s Debuggability Paved Road: A Look at Ephemeral Containers and Argo Workflows was originally published in Intuit Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.
