Don't get me wrong; we are strong supporters of Kubernetes. It is a critical piece of our architecture and provides massive value when wielded correctly. But Kubernetes was originally designed as a container orchestration platform for stateless workloads, not stateful applications.
Over the past few years, the Kubernetes community has done a great job evolving the project to support stateful workloads by creating StatefulSets, which is Kubernetes' answer to storage-centric workloads.
StatefulSets run the gamut from databases, queues, and object stores to janky old web applications that need to modify a local filesystem for whatever reason. They provide developers with a set of pretty powerful guarantees (a minimal manifest illustrating them follows the list):
- Consistent network identity for each pod: Each pod gets a stable DNS name that you can configure directly in your application. This works great for database connection strings or for configuring complicated Kafka clients, and we also use it at times to set up Erlang's mesh networking.
- Persistent volume automation: Whenever a pod is restarted, even if it is rescheduled onto a different node, its persistent volume is reattached to whichever node it lands on. This is somewhat limited by the capabilities of the CSI (Container Storage Interface) driver you're using; for instance, on AWS it only works within the same availability zone, since EBS volumes are AZ-bound.
- Sequential rolling updates: StatefulSet updates are designed to be rolling and consistent. Pods are always updated in the same order, which helps preserve systems with delicate coordination protocols.
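To make those guarantees concrete, here is a minimal sketch of a StatefulSet that leans on all three: a headless Service for stable per-pod DNS, a volumeClaimTemplate for per-pod persistent volumes, and the default ordered rolling update strategy. The names, image, and sizes are illustrative rather than taken from any real deployment.

```shell
# Minimal sketch (illustrative names and sizes): a headless Service plus a
# StatefulSet with per-pod volume claims and ordered rolling updates.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None            # headless: gives each pod a stable DNS name, e.g. db-0.db
  selector:
    app: db
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db            # ties pod network identity to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:14
          env:
            - name: POSTGRES_PASSWORD
              value: example                      # placeholder only
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:      # one PVC per pod, reattached wherever the pod lands
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
EOF
```

Each replica ends up with its own claim (data-db-0, data-db-1, ...) and a stable DNS name of the form db-0.db.<namespace>.svc.cluster.local.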
These guarantees cover a ton of the operations needed to run a stateful workload. In particular, they almost completely handle the availability portion. Given that EBS uptime and redundancy guarantees are extremely strong, the StatefulSet's rescheduling automation almost trivially gives you a highly available service. However, some caveats do apply (e.g., that you have room in your cluster and don't botch the AZ setup).
Kubernetes has a ton of promise in this area, and in theory, could certainly evolve into a platform to easily run stateful workloads alongside the stateless ones most developers use it for.
What's Missing From the Kubernetes StatefulSet?
So why do we think StatefulSets are broken? Well, if you run through the operational needs of a stateful workload in your head, there's one key component that you might notice is missing:
What do you do when you need to resize the underlying disk?
A dataset backing a typical database store grows at a pretty constant, positive rate. Unless you support horizontal scaling and partitioning, you'll need to add disk headroom as that dataset grows. This is where Kubernetes falls flat on its face.
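The obvious first move, editing the storage request in the StatefulSet's volumeClaimTemplates in place, does not work, because that field is immutable after creation. A quick sketch, reusing the illustrative db StatefulSet from above:

```shell
# Naive attempt: grow the volumeClaimTemplates storage request in place.
kubectl patch statefulset db --type='json' -p '[
  {"op": "replace",
   "path": "/spec/volumeClaimTemplates/0/spec/resources/requests/storage",
   "value": "100Gi"}
]'
# The API server rejects this with an error along the lines of:
#   Forbidden: updates to statefulset spec for fields other than
#   'replicas', 'template', 'updateStrategy', ... are forbidden
```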
Currently, the StatefulSet controller has no built-in support for volume resizing, despite the fact that almost all CSI implementations expose volume expansion the controller could hook into. There is a workaround, but it's almost ludicrously roundabout (a shell sketch of the full sequence follows the steps):
- Delete the StatefulSet while orphaning pods to avoid downtime with: kubectl delete sts --cascade=orphan
- Manually edit the persistent volume claim for each pod with the new storage size
- Manually edit the StatefulSet's volumeClaimTemplates with the new storage size and add a dummy pod-template annotation to force a rolling update
- Recreate the StatefulSet with that new spec, which allows the controller to reclaim the orphaned pods and begin the rolling update, which in turn triggers the CSI driver to apply the volume resize
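Put together, the sequence looks roughly like this. It's a sketch under a few assumptions: a StatefulSet named db with three replicas and a claim named data (as in the earlier example), a StorageClass with allowVolumeExpansion enabled, and your updated manifest sitting in statefulset.yaml.

```shell
# 1. Delete the StatefulSet but leave its pods (and their PVCs) running.
kubectl delete statefulset db --cascade=orphan

# 2. Grow each pod's PersistentVolumeClaim; the CSI driver expands the volume.
for i in 0 1 2; do
  kubectl patch pvc data-db-$i --type='merge' \
    -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
done

# 3. Recreate the StatefulSet from a manifest with the larger
#    volumeClaimTemplates request and a dummy pod-template annotation,
#    so the controller re-adopts the orphaned pods and rolls them.
kubectl apply -f statefulset.yaml

# 4. Watch the rolling update complete.
kubectl rollout status statefulset/db
```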
We actually automated this entire process as part of the Plural operator. We knew we'd need to build storage resize automation to make stateful applications running on Plural operable by non-Kubernetes experts. It's a nontrivial amount of logic in practice, and if someone were asked to perform it manually in a high-pressure scenario, the chances of failure would be incredibly high.
Okay, so there's a pretty noteworthy flaw in Kubernetes StatefulSets, but there is a workaround, even if it's somewhat janky.
That shouldnât be too bad, right?
But it gets worse!
The situation gets downright painful when you realize the impact of this limitation on the many Kubernetes operators that have been built to manage stateful workloads.
A pretty good example is the Prometheus operator, which is a great project for both provisioning Prometheus databases and allowing a CRD-based workflow for configuring metrics, scrapers, and alerts.
The problem arises because the operator's built-in controller has no logic to manage a StatefulSet resize, but it does have logic to recreate its underlying StatefulSet if it sees a deletion event. This means you effectively have no way to use the above workaround: the moment you do a cascade-orphan delete, the operator recreates the StatefulSet from the old spec and prevents a proper resize. The only solution is to delete the entire custom resource or find a tweak that fools the operator into not reconciling the object (sometimes scaling the operator to zero will do this).
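For what it's worth, that last escape hatch usually amounts to scaling the operator deployment itself down while you run the resize, then scaling it back up. A hedged sketch, since the deployment name and namespace depend on how you installed the Prometheus operator:

```shell
# Pause reconciliation by scaling the operator itself to zero
# (deployment name and namespace vary by installation).
kubectl -n monitoring scale deployment prometheus-operator --replicas=0

# ...run the orphan-delete / PVC-resize / recreate sequence from above...

# Resume reconciliation once the StatefulSet is back with the new spec.
kubectl -n monitoring scale deployment prometheus-operator --replicas=1
```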
Regardless, as a result of this flaw, there is effectively no way to resize a Prometheus instance with the operator without either significant downtime or data loss. Considering how robust the automation in StatefulSets is in all other cases, it's pretty shocking that this is still a potential failure mode.
Our Head of Community, Abhi, actually hit this same interplay between operators and StatefulSet volume resizes while implementing resize support in the open-source Vitess operator.
"Considering the natural complexity of a Vitess deployment, you can infer that disk resizing is proportionally complicated. Vitess is a database sharding system that sits on top of MySQL, meaning that volume resizing had to be both partitioning-aware and shard-aware. We had to manually write our own shard-safe rolling restarts, create a cascade condition that worked with the parent-child structure of Vitess custom resources, and address every conceivable failure condition to prevent downtime. Shoutout to notable Kubernetes contributor enisoc for designing this feature."
Other widely used database operators, like Zalando's Postgres operator, effectively reimplement in their own codebases the same procedure we built into the Plural operator. That's a ton of wasted developer cycles on a problem that should only need to be solved once.
The Potential of Kubernetes
In general, we are extremely bullish on the potential for Kubernetes to make the operations of virtually any workload almost trivial, and a huge part of our mission at Plural is to make that a possibility.
That said, we also need to be clear-eyed about gaps that still remain in the Kubernetes ecosystem, so we can either work around them or close them upstream. I think it's pretty clear this is a significant gap, and if prioritized, it could be fixed pretty easily in a future release of Kubernetes.
If you thought this was interesting, check out what we're doing with Kubernetes here. Thanks for reading!