tarantool

Posted on Jul 29, 2022

How we wrote Tarantool Kubernetes Operator

Author: Konstantin Nosorev

Kubernetes is a fast-growing open-source project that allows managing Linux containers as a single system. With Kubernetes, we can easily start complex systems using YAML configurations. Systems are managed via declarative resources. The hierarchical structure of resources allows creating large systems with a minimum of configuration files. That's why more and more people move their infrastructure to Kubernetes, including both stateless and stateful applications. So why deny yourself the convenience of using Tarantool inside Kubernetes?

Hi! My name is Kostya, and today I'll tell you about the problems we encountered while developing the Enterprise version of Tarantool Kubernetes Operator for Kubernetes and OpenShift. Welcome, everyone who is interested!

Tarantool is an efficient platform for in-memory computing and building high-load applications. It combines a database and an application server. As a database, it has a number of distinctive characteristics: efficient use of hardware, a flexible data schema, support for both in-memory and disk storage, and extensibility with the Lua language. As an application server, the platform lets you keep code very close to your data, which minimizes response time and maximizes throughput.

The Tarantool ecosystem is constantly growing. Today it already has a lot of connectors for popular programming languages (Golang, Python, Java, etc.), extension modules for building applications with blocks (vshard, queue, etc.), and frameworks that speed up the development process (Cartridge and Luatest).

For now, I'd like to talk about applications developed with the Tarantool Cartridge framework. This framework is designed for developing complex distributed systems. With Tarantool Cartridge, you can focus on writing business logic instead of wasting time on solving problems concerning the infrastructure.

Main capabilities of Tarantool Cartridge:

• Automated orchestration of a Tarantool cluster
• Extending the application functionality with new roles
• Application template for development and deployment
• Built-in automated sharding
• Integration with the Luatest test framework
• Managing a cluster with WebUI and API
• Packaging and deployment tools

Each cluster application built with Cartridge is based on roles — Lua modules that describe application business logic. For example, it could be the modules that deal with storing data, provide the HTTP API or cache data from Oracle. A role is assigned to a replica set — a set of instances unified by replication. The role is then enabled on each replica set individually. Different replica sets can have a different set of roles.

For more information about Cartridge, see the following articles:

• Scaling clusters without any hassle
• Distributed storage in 30 minutes

Cartridge stores the cluster configuration on each cluster node. The configuration describes the topology of the cluster. You can also add custom sections to it for your roles to use. Such configuration can be changed at runtime to control a role's behavior.

Working with the framework is fine when you don't have a lot of instances. But once you run more than 100 instances, configuring and updating the cluster becomes difficult. Kubernetes solves a large part of these problems. But what if we want to use all the advantages of Kubernetes to simplify deploying and supporting Tarantool Cartridge? The answer is Tarantool Kubernetes Operator.

A little bit about Kubernetes operators

A Kubernetes operator is a program for managing applications inside Kubernetes. Operators take part in the main reconciliation cycle, which brings the current cluster state closer to the one described in the resources. Simply put, an operator is a manager that automatically handles common situations. It is designed to help people who are unfamiliar with the specifics of an application deploy and operate it in a Kubernetes cluster.

How does an operator work?

The operator watches the resources it is assigned to observe and reacts to their changes. Most often, operators use custom resource definitions (CRDs), which describe custom resource types.

Let's consider Tarantool Kubernetes Operator as an example. During installation with Helm, the operator creates two CRDs, Cluster and Role.

Cluster description example:

apiVersion: tarantool.io/v1alpha1
kind: Cluster
metadata:
  name: tarantool-cluster
spec:
  roles:
    - name: router
    - name: storage
...

Role description example:

apiVersion: tarantool.io/v1alpha1
kind: Role
metadata:
  name: router
spec:
  replicasets: 1
  vshard:
    clusterRoles: 
    - failover-coordinator
    - app.roles.router
    replicasetTemplate:
        replicas: 2
        podTemplate:
          spec:
            containers:
              - name: cartridge
                image: "tarantool/tarantool-operator-examples-kv:0.0.4"
...

At runtime, the operator creates a StatefulSet for each replica set, since this resource is needed to work with volumes and persistent volume claims (PVCs, the requests from which persistent volumes for pods are provisioned). The resulting hierarchy of Kubernetes resources looks like this:

• Cluster — the main resource including general cluster settings such as Cluster-wide config and Failover settings.
• Role — in this context, it is a Kubernetes resource; it includes a template description for replica sets, information about the assigned Cartridge roles, as well as the number of replica sets with such settings and Tarantool instances in each replica set.

Kubernetes resources hierarchy

The operator is based on Operator SDK (https://sdk.operatorframework.io/) and includes two main controllers: Cluster and Role.

Each controller implements the Reconciler interface and subscribes to changes of specific resources. This is how it looks in code:

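// SetupWithManager registers the Role controller and declares which
// resource changes trigger its Reconcile method.
func (r *RoleReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        // Reconcile when a Role resource itself changes.
        For(&Role{}).
        // Also reconcile the owning Role when one of its StatefulSets changes.
        Watches(&source.Kind{Type: &appsV1.StatefulSet{}}, &handler.EnqueueRequestForOwner{
            IsController: true,
            OwnerType:    &Role{},
        }).
        // Likewise for pods whose controlling owner is a Role.
        Watches(&source.Kind{Type: &coreV1.Pod{}}, &handler.EnqueueRequestForOwner{
            IsController: true,
            OwnerType:    &Role{},
        }).
        Complete(r)
}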

When resources to which the controller is subscribed change, the Reconcile method is called. The controller compares the resource configuration and the current state of the cluster, then fixes the difference.
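To make the "compare and fix the difference" idea concrete, here is a minimal sketch in plain Go. It is not the operator's code: the diff function and the maps of replica set names to replica counts are hypothetical stand-ins for the resource spec and the observed cluster state.

```go
package main

import (
	"fmt"
	"sort"
)

// diff compares the desired replica counts (taken from a resource spec) with
// the actual ones (observed in the cluster) and returns the names to create,
// resize, and delete. All names and types here are hypothetical.
func diff(desired, actual map[string]int) (create, resize, remove []string) {
	for name, want := range desired {
		have, ok := actual[name]
		switch {
		case !ok:
			create = append(create, name) // described in the spec, missing in the cluster
		case have != want:
			resize = append(resize, name) // exists, but with the wrong replica count
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok {
			remove = append(remove, name) // exists in the cluster, no longer described
		}
	}
	// Map iteration order is random, so sort for stable results.
	sort.Strings(create)
	sort.Strings(resize)
	sort.Strings(remove)
	return create, resize, remove
}

func main() {
	desired := map[string]int{"router-0": 2, "storage-0": 3, "storage-1": 3}
	actual := map[string]int{"router-0": 2, "storage-0": 2, "storage-2": 3}

	create, resize, remove := diff(desired, actual)
	fmt.Println("create:", create, "resize:", resize, "delete:", remove)
}
```

Because the function looks only at the two states and not at the event that triggered it, calling it again after the fixes are applied yields empty lists. That property is what makes a Reconcile method safe to call repeatedly.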

Let's take a look at the Cluster controller example:

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    clusterContext := reconcilers.MakeReconciler(ctx, r.Reconciler)
    return clusterContext.RunCluster(ctx,
        reconcilers.GetObjectFromRequest(req),
        reconcilers.CheckDeletion,
        reconcilers.CheckFinalizer,
        reconcilers.SetupRolesOwnershipStep,
        reconcilers.SyncClusterWideServiceStep,
        reconcilers.WaitForRolesPhase(RoleReady),
        reconcilers.GetLeader,
        reconcilers.CreateTopologyClient,
        reconcilers.Bootstrap,
        reconcilers.SetupFailover,
        reconcilers.ApplyCartridgeConfig)
}

When you write a controller, keep in mind that the order in which events are processed is not guaranteed. So you can't expect that when a Role resource changes, Reconcile will be called on the Role controller first and then on the Cluster controller, or vice versa.

Now that you know how the operator works, let's look at the main features of Tarantool Kubernetes Operator Enterprise. Currently, the operator can:

• Deploy a Cartridge cluster
• Change Failover configuration
• Perform a Rolling update
• Scale a cluster both ways: by the number of replica sets and by the number of replicas in each replica set
• Manage application settings
• Change a persistent volume without data loss or downtime, working around a Kubernetes restriction (the PVC section of a StatefulSet cannot be changed without recreating the resource).

Now let's move on to the difficulties we faced when we were writing the operator.

Divide and conquer

Development of the Enterprise version of the operator started with reevaluating its Community version where three CRDs were used to describe a cluster:

• Role
• ReplicasetTemplate (inherits Statefulset fields)
• Cluster

The first step was reducing the set of CRDs to two:

• Role
• Cluster

Our mistake was creating only one controller responsible for working with a cluster. This led to serious problems when we wanted to extend the operator's functionality. The code of the Reconcile method began to grow very quickly: each stage added at least 5 to 10 lines of code.

An example of a method for the cluster controller before its refactoring:

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := ctrlLog.FromContext(ctx)
    log.Info("Reconcile cluster")

    cluster, err := r.GetCluster(ctx, req.NamespacedName.Namespace, req.NamespacedName.Name)
    if err != nil {
        if !apiErrors.IsNotFound(err) {
            log.Error(err, "Unable to retrieve cluster")

            return reconcile.Result(
                ctx,
                reconcile.WithError(
                    errors.Wrap(err, "unable to retrieve cluster for reconcile"),
                    10*time.Second,
                ),
            )
        }

        return reconcile.Result(ctx)
    }
    ...
    return reconcile.Result(
        ctx,
        reconcile.WithClusterPhaseUpdate(r.Status(), cluster, ClusterReady),
    )
}

We solved this problem by splitting the logic between two controllers, Cluster and Role.

Now the Cluster controller deals only with the general cluster settings: failover and application configuration.

The Role controller is responsible for creating StatefulSets. On top of them, it creates replica sets and applies settings for specific instances.

But we didn't stop there. The Reconcile methods have similar steps in both controllers: getting the current object, deleting the object, creating an object for working with Tarantool topology, etc. In the end, we came to a rather elegant solution: now the Reconcile method is built from separate steps, and the code looks much clearer.

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    clusterContext := reconcilers.MakeReconciler(ctx, r.Reconciler)
    return clusterContext.RunCluster(ctx,
        reconcilers.GetObjectFromRequest(req),
        reconcilers.CheckDeletion,
        reconcilers.CheckFinalizer,
        reconcilers.SetupRolesOwnershipStep,
        reconcilers.SyncClusterWideServiceStep,
        reconcilers.WaitForRolesPhase(RoleReady),
        reconcilers.GetLeader,
        reconcilers.CreateTopologyClient,
        reconcilers.Bootstrap,
        reconcilers.SetupFailover,
        reconcilers.ApplyCartridgeConfig)
}
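The article does not show how the steps are executed, so the following is only a sketch of one way such a pipeline could work. The state, step, runSteps, named, and waitForRoles names are all hypothetical, and a step that asks to stop plays the part that a step like WaitForRolesPhase might play in the real operator.

```go
package main

import "fmt"

// state is shared between steps. A real operator would keep the fetched
// resource, API clients, and similar data here. (Hypothetical type.)
type state struct {
	rolesReady bool
	log        []string
}

// step does one piece of work. It returns stop=true to end the current pass
// early (for example, to wait for another controller), or an error.
type step func(s *state) (stop bool, err error)

// runSteps executes steps in order until one asks to stop or fails.
func runSteps(s *state, steps ...step) error {
	for _, st := range steps {
		stop, err := st(s)
		if err != nil {
			return err
		}
		if stop {
			return nil
		}
	}
	return nil
}

// named builds a trivial step that only records that it ran.
func named(name string) step {
	return func(s *state) (bool, error) {
		s.log = append(s.log, name)
		return false, nil
	}
}

// waitForRoles stops the pipeline while the roles are not ready yet.
func waitForRoles(s *state) (bool, error) {
	s.log = append(s.log, "waitForRoles")
	return !s.rolesReady, nil
}

func main() {
	for _, ready := range []bool{false, true} {
		s := &state{rolesReady: ready}
		if err := runSteps(s, named("getObject"), waitForRoles, named("bootstrap")); err != nil {
			fmt.Println("reconcile failed:", err)
			continue
		}
		// "bootstrap" is recorded only when the roles are ready.
		fmt.Println("roles ready:", ready, "steps run:", s.log)
	}
}
```

Since every step has the same signature, shared steps can be reused by several controllers, and a version of the operator can include only the steps it needs.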

Besides, it comes with a pleasant bonus: a common code base for both the Enterprise and Community versions of the operator has become possible. Such a modular structure lets us build the operator from modules that are plugged into the versions where they are needed.

P.S. Yes, soon we will be reworking the community version of the operator. Then we'll also talk about how the operator works in more detail.

Rolling update

As I mentioned before, replica sets are deployed to Kubernetes through the standard StatefulSet resource, which already has two update strategies:

• OnDelete — pods inside Statefulset won't be updated automatically.
• RollingUpdate — pods are updated individually.

The RollingUpdate strategy isn't suitable for applications whose pods aren't equal, which is the case with Tarantool. Within a replica set, an instance can play one of two roles:

• Master — an instance where data can be read from and written to.
• Replica — an instance with read-only access (ReadOnly mode).

With RollingUpdate, Kubernetes doesn't know which pod currently hosts the master. Therefore, it can start the update with the master, which makes the replica set temporarily unavailable for writes. We solved this problem by writing our own update strategies:

• OnDelete: mirrors the StatefulSet strategy of the same name.
• ClusterPartitionUpdate: the strategy used for instances with no data. It is similar to the usual update strategy since functionally there is no master (no data).
• SwitchMasterUpdate: the strategy used for instances with data. It works within one replica set using the following algorithm:

1. Update all replicas
2. Switch the master to one of the updated instances
3. Update the previous master
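The SwitchMasterUpdate order can be sketched as a small planning function. This is an illustration only: updatePlan, the pod names, and the choice of the first updated replica as the new master are assumptions, not the operator's actual logic.

```go
package main

import "fmt"

// updatePlan returns the actions for one replica set: update every replica,
// switch the master to an already updated replica, then update the old
// master. It assumes master is one of pods. (Hypothetical helper.)
func updatePlan(pods []string, master string) []string {
	var plan []string
	firstReplica := ""
	for _, p := range pods {
		if p == master {
			continue // the master goes last
		}
		if firstReplica == "" {
			firstReplica = p
		}
		plan = append(plan, "update "+p)
	}
	if firstReplica != "" {
		// Assumption: any updated replica may become the new master.
		plan = append(plan, "switch master to "+firstReplica)
	}
	plan = append(plan, "update "+master)
	return plan
}

func main() {
	pods := []string{"storage-0-0", "storage-0-1", "storage-0-2"}
	for _, action := range updatePlan(pods, "storage-0-0") {
		fmt.Println(action)
	}
}
```

With this order, writes are interrupted only for the duration of the master switch rather than for the whole restart of the master pod.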

You might wonder why a database contains instances with no data. Remember that Tarantool is a database and an application server in one package. Sharding requires a separate instance (or instances) working as a router. A router basically tells each request where to go for the necessary data.

Most often, there is no need to unite routers into replica sets, so ClusterPartitionUpdate works across all replica sets rather than inside one specific StatefulSet.

Such strategies are easy to implement in code:

• Check the update condition
• Perform some operations, if necessary
• Delete the pod
• Wait until the Statefulset/Deployment controller creates new pods with a new image
• Repeat these steps until all necessary pods are updated
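The steps above can be sketched as a loop over pods. The cluster interface, rollOut, and the in-memory fakeCluster below are hypothetical and exist only so that the example runs without a Kubernetes cluster. In the fake, deleting a pod immediately "recreates" it with the target image, which stands in for the StatefulSet controller.

```go
package main

import "fmt"

// cluster is a hypothetical abstraction over what the update loop needs.
type cluster interface {
	Pods() []string
	Image(pod string) string
	DeletePod(pod string)       // the StatefulSet controller recreates the pod
	WaitReady(pod string) error // blocks until the recreated pod is ready
}

// rollOut updates pods one by one and returns how many pods were updated.
// before is an optional hook for extra operations, such as a master switch.
func rollOut(c cluster, image string, before func(pod string) error) (int, error) {
	updated := 0
	for _, pod := range c.Pods() {
		if c.Image(pod) == image { // check the update condition
			continue
		}
		if before != nil { // perform some operations, if necessary
			if err := before(pod); err != nil {
				return updated, err
			}
		}
		c.DeletePod(pod) // delete the pod
		if err := c.WaitReady(pod); err != nil { // wait for the new pod
			return updated, err
		}
		updated++
	}
	return updated, nil
}

// fakeCluster is an in-memory stand-in used only for this example.
type fakeCluster struct {
	order  []string
	images map[string]string
	target string // image that recreated pods receive
}

func (f *fakeCluster) Pods() []string          { return f.order }
func (f *fakeCluster) Image(pod string) string { return f.images[pod] }
func (f *fakeCluster) DeletePod(pod string)    { f.images[pod] = f.target }
func (f *fakeCluster) WaitReady(string) error  { return nil }

func main() {
	f := &fakeCluster{
		order:  []string{"router-0-0", "router-0-1"},
		images: map[string]string{"router-0-0": "app:0.0.3", "router-0-1": "app:0.0.4"},
		target: "app:0.0.4",
	}
	n, err := rollOut(f, "app:0.0.4", nil)
	fmt.Println("updated pods:", n, "error:", err)
}
```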

Solving network issues

During development, situations sometimes arise where the operator requires access to all pods inside the Kubernetes network. This isn't a problem when the operator works normally inside Kubernetes. But what if you want to debug your code outside Kubernetes?

One possible solution is to set up a VPN into the Kubernetes network. That's what we did when we started developing Tarantool Kubernetes Operator, since we used the GraphQL API to configure Cartridge clusters. But this solution puts extra load on the developer's machine.

Another solution doesn't work for every application, but it worked wonderfully for Tarantool: get rid of network requests and switch to pod exec inside the application's container. The current version of the operator uses this approach to configure Tarantool. Tarantool's ecosystem includes a console utility, tarantoolctl, which allows connecting to a running instance through a control socket and configuring the cluster with Lua code.

This approach helped us solve one more problem. In Cartridge, you can enable authorization, which used to be a problem when we used an HTTP connection. But when we connect through a socket, we already have maximum access rights, so the authorization problem goes away.

Naming the Statefulset when PVC changes

Sometimes when working with a StatefulSet you might want to change the size of a persistent volume claim. But in Kubernetes the StatefulSet's PVC section is immutable. Since we are working with a database, the amount of data grows, and at some point we'll have to increase the disk volume.

So we added a feature that allows changing the role's PVC. Here, a problem arises with pod names: in Kubernetes, two pods with the same name cannot work simultaneously. Initially, pod names were built by the following rule: <role_name>-<statefulset_ordinal>-<pod_ordinal>. PVC update uses the following algorithm:

• Create a new Statefulset with a new PVC
• Create a new replica set for it with the required weight
• Set the weight of the old replica set to 0
• Wait until the old replica set has no data
• If the topology leader is located in the old replica set, move it
• Delete the old replica set and all its instances

As you might notice, the old naming rules for StatefulSets and pods didn't suit us: during the migration the old and the new StatefulSet exist at the same time, so their pod names would collide. We decided to change the naming rule to <role_name>-<statefulset_ordinal>-<hash_of_replicaSetTemplate>-<pod_ordinal>. ReplicasetTemplate uses the standard PVC fields, which contain private fields that can change at runtime. So we decided to take a 32-bit hash of the JSON representation of the ReplicasetTemplate object. This solution is not very elegant, but it let us get rid of the dynamic fields. An example of a new name: router-0-7dfd9f68f-0.
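As an illustration of the naming idea, the sketch below hashes the JSON form of a template and builds a pod name from it. The article does not say which hash function or text encoding the operator uses, so FNV-1a with hexadecimal output is an assumption here (the hash in the example name above is clearly encoded differently). The replicasetTemplate struct and its fields are simplified, hypothetical stand-ins.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// replicasetTemplate is a simplified, hypothetical stand-in for the real
// template. Only exported fields with JSON tags end up in the hash input.
type replicasetTemplate struct {
	Replicas    int    `json:"replicas"`
	Image       string `json:"image"`
	StorageSize string `json:"storageSize"`
}

// templateHash returns a 32-bit hash of the template's JSON representation.
// FNV-1a and the hex encoding are assumptions made for this sketch.
func templateHash(t replicasetTemplate) (string, error) {
	raw, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(raw) // Write on a hash.Hash never returns an error
	return fmt.Sprintf("%08x", h.Sum32()), nil
}

// podName follows the rule
// <role_name>-<statefulset_ordinal>-<hash_of_replicaSetTemplate>-<pod_ordinal>.
func podName(role string, stsOrdinal int, hash string, podOrdinal int) string {
	return fmt.Sprintf("%s-%d-%s-%d", role, stsOrdinal, hash, podOrdinal)
}

func main() {
	oldTpl := replicasetTemplate{Replicas: 2, Image: "app:0.0.4", StorageSize: "1Gi"}
	newTpl := oldTpl
	newTpl.StorageSize = "2Gi" // the PVC change that triggers the migration

	for _, tpl := range []replicasetTemplate{oldTpl, newTpl} {
		hash, err := templateHash(tpl)
		if err != nil {
			fmt.Println("hashing failed:", err)
			return
		}
		fmt.Println(podName("storage", 0, hash, 0))
	}
}
```

Because the two templates hash to different values, the pods of the old and the new StatefulSet get different names and can run side by side while the data is being moved.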

Testing the Operator

Like any software, the operator needs to be tested. We use two types of tests: unit and E2E. Unit tests usually rely on generated mock code (for example, via golang/mock). We didn't like this option, so we decided to use Testify's mock module, which uses the reflection API to mock the interfaces we need, namely the interfaces used to configure Tarantool.

If you are interested, here's an article that compares those libraries, testify/mock and golang/mock: GoMock vs. Testify: Mocking frameworks for Go

To create a fake Kubernetes cluster, we used a library by the Kubernetes developers: "sigs.k8s.io/controller-runtime/pkg/client/fake".

Currently, unit tests follow this pattern:

• Create a fake topology and a Kubernetes cluster client
• Call the Reconcile method
• Check that the right topology methods were called, and the resources were changed correctly.

The test setup looks like this:

BeforeEach(func() {
   cartridge = helpers.NewCartridge(namespace, clusterName).
      WithRouterRole(2, 1).
      WithStorageRole(2, 3).
      Finalized()

   fakeTopologyService = new(mocks.FakeCartridgeTopology)

   fakeTopologyService.
      On("BootstrapVshard", mock.Anything).
      Return(nil)
   fakeTopologyService.
      On("GetFailoverParams", mock.Anything).
      Return(&topology.FailoverParams{Mode: "disabled"}, nil)
   fakeTopologyService.
      On("GetConfig", mock.Anything).
      Return(map[string]interface{}{}, nil)
})

The test itself follows this pattern:

cartridge.WithAllRolesReady().WithAllPodsReady()

fakeClient := cartridge.BuildFakeClient()

resourcesManager := resources.NewManager(fakeClient, scheme.Scheme)
clusterReconciler := &ClusterReconciler{...}
_, err := clusterReconciler.Reconcile(...)
Expect(err).NotTo(HaveOccurred(), "an error during reconcile")

err = fakeClient.Get(ctx, types.NamespacedName{Namespace: namespace, Name: clusterName}, cartridge.Cluster)
Expect(err).NotTo(HaveOccurred(), "cluster gone")

Expect(cartridge.Cluster.Status.Bootstrapped).To(BeTrue(), "cluster not bootstrapped")

As for E2E tests, we implemented them with the E2E framework. It allowed us to fully check the operator's Helm chart and test it against different Kubernetes versions with KinD. Due to the specifics of testing in Kubernetes, we have to wait until the various pods are created, so the total duration of the tests grows very quickly. The E2E framework helped us solve this problem since it supports running test cases in parallel. This let us cut the test run time from 30 to 8 minutes.