Michael Guarino

Posted on Aug 25, 2022 • Originally published at plural.sh on Aug 23, 2022

Why You Shouldn't Overlook Day 2 Kubernetes

#kubernetes

Day 2 Kubernetes can be challenging. Here's why you shouldn't overlook its implications. Photo by Alexandr Bormotin / Unsplash.

Deciding to implement Kubernetes (Day 0) and then getting your first deployment up and running (Day 1) is hard enough. But then there’s everything that comes after, commonly known as Day 2 Kubernetes. Many organizations overlook this stage, which is fraught with challenges and problems.

Once the initial excitement wears off, Day 2 is the make-or-break moment when your team needs to figure out how to manage and maintain Kubernetes for the long term. Otherwise, as you add features to your app and grow the complexity of your deployment, costs can and will pile up in the form of expensive outages, integration headaches, and lost developer velocity.

I have spent the past year talking to dozens of best-in-class DevOps teams about how to overcome some common operational challenges engineering teams face when wrangling Kubernetes.

Here is what I learned:

Why solving Day 2 Kubernetes is crucial

Day 2 Kubernetes can often feel like a puzzle for most engineering teams. Photo by Markus Winkler / Unsplash

Day 2 Kubernetes covers DevOps processes—like monitoring, testing, runbooks, and alerting—that maintain the performance and reliability of your clusters. Often, these operations aren’t given careful thought in the initial push to deploy Kubernetes as quickly as possible. After all, there’s an extensive amount of terminology and concepts to learn in order to break into Kubernetes and just figure out the basics, like how to convert a Docker Compose file into a production K8s service.

However, while figuring out your initial deployment, it’s important to also think ahead to Day 2 and beyond. As with any open-source technology, choosing to self-host Kubernetes rather than use a managed solution can provide huge cost savings and flexibility, but it comes with risks.

If your Kubernetes clusters are not well managed, monitored, or understood, your engineers can end up spending a significant amount of time root-causing and fixing failures. Security breaches or governance issues could lead to PR or compliance disasters. You could run up cloud costs as a result of misconfigurations. And overall, morale can take a hit as engineers spend more time writing Helm charts than they spend working on product features.

What problems do organizations face with Day 2 Kubernetes?

While it varies by organization, you can break down Kubernetes Day 2 problems into the five areas below. Photo by Rob Wicks / Unsplash

The problems that engineering organizations encounter when managing K8s tend to break down into these five areas:

Learning curve & knowledge transfer

Whether you’re using Kubernetes for just your data stack or converting your entire monolithic system into distributed microservices, you want to avoid a situation where just one or two engineers are responsible for maintaining your solution. However, there’s a steep learning curve and an overwhelming amount of material out there about K8s.

Furthermore, not only do you have to master the core Kubernetes API, you also have to master the toolchains used to manage K8s. With so many options out there for different tools (Helm or Kustomize? Terraform or Ansible?), your solution will often end up being very specialized, making it painful to onboard new engineers and easy to lose knowledge that is held by only a few engineers in the org.

Visibility

In most cases, especially if you use AWS, you won’t have a built-in dashboard for Kubernetes. To see all of your resources, you’ll need to use the command-line interface (kubectl). Some people are very comfortable with this, but most aren’t and benefit from a visual interface.
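As a rough illustration of the kind of summary a visual interface computes for you, here is a minimal Python sketch that tallies pods by namespace and phase. It assumes the input is the JSON pod list printed by `kubectl get pods --all-namespaces -o json`. The `summarize_pods` helper and the sample data are hypothetical and only meant to show the idea.

```python
import json
from collections import Counter


def summarize_pods(pod_list_json: str) -> dict:
    """Count pods per (namespace, phase) from a kubectl JSON pod list."""
    pod_list = json.loads(pod_list_json)
    counts = Counter()
    for pod in pod_list.get("items", []):
        namespace = pod.get("metadata", {}).get("namespace", "default")
        phase = pod.get("status", {}).get("phase", "Unknown")
        counts[(namespace, phase)] += 1
    return dict(counts)


# Hypothetical sample, shaped like `kubectl get pods --all-namespaces -o json`.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "web-1", "namespace": "prod"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "web-2", "namespace": "prod"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "web-3", "namespace": "prod"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "airflow-worker-0", "namespace": "data"},
         "status": {"phase": "Failed"}},
    ]
})

if __name__ == "__main__":
    # Print one line per namespace/phase pair, sorted for stable output.
    for (namespace, phase), count in sorted(summarize_pods(SAMPLE).items()):
        print(f"{namespace:10} {phase:10} {count}")
```

In practice you would pipe real kubectl output into a function like this, but even a small summary shows why most teams prefer a dashboard over reading raw resource listings.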

Third-party app integrations

Often, the problems you’ll face with Day 2 Kubernetes aren’t technically Kubernetes problems. Rather, it’s the operational idiosyncrasies of how other applications interact with K8s that will give you headaches. For example, if you want to deploy Airflow on Kubernetes, you might not know how to scale the database underneath it or how to scale the workers, which metrics to visualize, or what CPU/memory tradeoffs to make.

This operational knowledge is unique to each application and has to be learned from scratch every time there’s a new open-source tool you want to use on Kubernetes. Any misconfigurations could result in a higher cloud bill than you really need to spend.
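To make the cost of a misconfiguration concrete, here is a minimal Python sketch that estimates what over-requested CPU costs per month. Everything in it is an assumption for illustration: the `Workload` fields, the sample Airflow numbers, and the price per core-hour are hypothetical, and real pricing depends on your cloud provider and instance types.

```python
from dataclasses import dataclass
from typing import List

HOURS_PER_MONTH = 730  # approximate average number of hours in a month


@dataclass
class Workload:
    name: str
    replicas: int
    cpu_request_cores: float  # CPU requested per replica
    cpu_peak_cores: float     # observed peak CPU usage per replica


def idle_cpu_cores(workload: Workload) -> float:
    """Cores that are requested but never used, summed across replicas."""
    idle_per_replica = max(0.0, workload.cpu_request_cores - workload.cpu_peak_cores)
    return idle_per_replica * workload.replicas


def monthly_idle_cost(workloads: List[Workload], price_per_core_hour: float) -> float:
    """Estimated monthly spend on CPU that is reserved but sits idle."""
    total_idle = sum(idle_cpu_cores(w) for w in workloads)
    return total_idle * price_per_core_hour * HOURS_PER_MONTH


# Hypothetical numbers: four Airflow workers that each request 2 cores
# but peak at half a core.
EXAMPLE = [Workload("airflow-worker", replicas=4,
                    cpu_request_cores=2.0, cpu_peak_cores=0.5)]

if __name__ == "__main__":
    cost = monthly_idle_cost(EXAMPLE, price_per_core_hour=0.04)
    print(f"Estimated idle CPU cost per month: ${cost:.2f}")
```

The same arithmetic applies to memory. The point is that a request set once during the initial deployment and never revisited turns into a recurring line on the cloud bill.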

Monitoring, alerting, and disaster recovery

While you can get some logging built-in with K8s, in Day 2 it’s essential to set up your logs to connect to a central system (or set of tools) that you use for observability and alerting. Logging a dynamic, distributed system like Kubernetes is complicated. You’ll want to monitor multiple layers (e.g. Node and Cluster levels), each with its own lifecycle and different kinds of logs.

Along with logging, an alerting and disaster recovery strategy is a must for Day 2 Kubernetes. Again, teams can run into problems here because of the distributed nature of the system. It may not always be clear who the owner is for each service, so the person on-call might have no idea what to do or even who to contact in the case of an outage.

Security and governance

Kubernetes can be beneficial from a security perspective. If you have a consolidated networking layer using K8s, you don’t have to worry about exposing more data than you need to, and you can run an extra-secure layer on top of potentially less-secure third-party apps.

However, the way you store secrets and check for vulnerabilities will need to be adapted to work for Kubernetes, which can be especially challenging if you’re new to managing a distributed system. Furthermore, you’ll need to set up new access controls that follow your company’s best practices around governance and compliance.

What a solution to Kubernetes Day 2 looks like

A Kubernetes Day 2 solution has to cover, at a minimum, the six components below. Photo by Antonio Janeski / Unsplash

In my experience, a solution to Day 2 Kubernetes needs to have the following components at a minimum:

Dashboarding: A visual interface for managing your resources, for people who don’t want to use the command line.
Integration testing suite: When you push a new version of a package to production, you want some way to automatically deploy it to test clusters and run health checks to make sure that everything is working perfectly.
Access controls: It should be easy to set up access controls for your cluster from a central location, and audit trails should be baked in.
Observability and alerting: If anything goes wrong, you need to be able to root-cause the issue quickly and alert the right people.
Runbooks for disaster recovery: When there’s an issue, you need runbooks so that anyone on-call can quickly implement a fix. Which leads to the final point…
Automation: Too often, teams end up reinventing the wheel when managing Kubernetes. When you want to deploy anything on K8s, you should be able to quickly find all the dashboards you need, all the hooks for scaling, and interactive runbooks that make the process repeatable.
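The integration testing item above can be sketched as a simple promotion gate: run a set of health checks against a test cluster and only promote the release if none of them fail. This is a minimal Python sketch under stated assumptions. The check names and the `run_health_checks` and `can_promote` helpers are hypothetical, and the lambdas stand in for real probes against your test cluster.

```python
from typing import Callable, Dict, List

HealthCheck = Callable[[], bool]


def run_health_checks(checks: Dict[str, HealthCheck]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that crashes counts as a failed check.
            ok = False
        if not ok:
            failures.append(name)
    return failures


def can_promote(checks: Dict[str, HealthCheck]) -> bool:
    """A release is promotable only when every health check passes."""
    return not run_health_checks(checks)


# Hypothetical checks; real ones would probe the test cluster.
EXAMPLE_CHECKS: Dict[str, HealthCheck] = {
    "pods_ready": lambda: True,
    "database_reachable": lambda: True,
    "scheduler_heartbeat": lambda: False,
}

if __name__ == "__main__":
    failed = run_health_checks(EXAMPLE_CHECKS)
    if failed:
        print("Blocking promotion, failed checks:", ", ".join(failed))
    else:
        print("All checks passed, safe to promote.")
```

Running all checks instead of stopping at the first failure is a deliberate choice here: the full list of failures is usually more useful to whoever has to debug the release.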

Many companies try to string these components together from different fragmented DevOps solutions. However, to have a really effective solution, you need the whole suite to work together. When an alert fires, it should hook up to a runbook and point you to the fix. All your operations should be automated, and the knowledge around these operations should be accessible and available to everyone on the team, not just a few engineers.
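One way to picture an alert that hooks up to a runbook is a lookup from alert name to an owner and a list of steps, with a fallback so the person on-call always knows who to contact. The following Python sketch is only an illustration: the alert names, team names, and runbook steps are hypothetical, and a real setup would load them from your alerting tool instead of hard-coding them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Runbook:
    owner: str        # team to contact for this alert
    steps: List[str]  # ordered steps for whoever is on-call


# Used when an alert has no dedicated runbook yet.
FALLBACK = Runbook(
    owner="platform-oncall",
    steps=["No runbook found. Escalate to the platform team."],
)

# Hypothetical registry mapping alert names to runbooks.
RUNBOOKS: Dict[str, Runbook] = {
    "PostgresDiskAlmostFull": Runbook(
        owner="data-platform",
        steps=[
            "Check current volume usage.",
            "Resize the persistent volume claim.",
            "Confirm the database is accepting writes again.",
        ],
    ),
}


def route_alert(alert_name: str) -> Runbook:
    """Return the runbook for an alert, or the fallback if none exists."""
    return RUNBOOKS.get(alert_name, FALLBACK)


if __name__ == "__main__":
    for alert in ("PostgresDiskAlmostFull", "UnknownAlert"):
        runbook = route_alert(alert)
        print(f"{alert}: contact {runbook.owner}")
        for number, step in enumerate(runbook.steps, start=1):
            print(f"  {number}. {step}")
```

Keeping the owner next to the steps means every alert answers both questions at once: what to do, and who to call if that does not work.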

To learn more about how Plural works and how we are helping engineering teams across the world deploy open-source applications in a cloud production environment, check out our GitHub to get started today.

Join us on our Discord channel for questions, discussions, and to meet the rest of the community.
