Jesper Axelsen for IT Minds

Posted on Mar 5, 2021

Deploying a Ceph cluster with Kubernetes and Rook

#kubernetes #ceph #rook #storage

This blog post is the first in a series concerning Ceph.

Introduction

As the world generates ever more data, the need for scalable storage solutions naturally rises. Today I am going to introduce you to one of them: Ceph.

Ceph is an open-source, software-defined storage platform. It implements object storage on a distributed computer cluster and provides interfaces for three storage types: block, object, and file. Ceph aims to provide a free, highly scalable, distributed storage platform that has no single point of failure and keeps your data intact.

This post will go through the Ceph architecture, show how to set up your own Ceph storage cluster, and discuss the architectural decisions you will inevitably have to make. We will be deploying Ceph on a Kubernetes cluster using the cloud-native storage orchestrator Rook.

Architecture

First, a small introduction to Ceph's architecture.

A Ceph storage cluster can be accessed in a number of ways.

First, Ceph provides the librados library, which lets you connect directly to your storage cluster from C, C++, Java, Python, Ruby, or PHP. Ceph also offers object storage through a REST gateway that is compatible with the S3 and Swift APIs.

With Kubernetes, the more common ways to use your storage cluster are to create persistent volume claims (PVCs) from .yaml files or to create a POSIX-compliant distributed filesystem.

Underneath all of this lies the Reliable Autonomic Distributed Object Store (RADOS). RADOS is in charge of managing the underlying daemons that are deployed with Ceph.

A Ceph storage cluster has these types of daemons:

The object storage daemons (OSDs) handle read/write operations on the disks. They are also in charge of checking that the disks are healthy and reporting back to the monitor daemons.
The monitor daemons keep a copy of the cluster map and monitor the state of the cluster. Running several of them is what keeps the cluster available if any single monitor fails. You should always run an odd number of monitor daemons to keep quorum, and it is recommended to dedicate nodes to the monitor daemons, separate from the storage nodes.
The manager daemons create and manage a map of clients and handle reweighting and rebalancing operations.
The metadata server manages additional metadata about the file system, specifically permissions, hierarchy, names, timestamps, and owners.

Deploying the cluster

Having acquired a rudimentary understanding of Ceph, we are now ready to build our storage cluster. A basic guide on how to set up a Kubernetes cluster on Ubuntu can be found here. We will be deploying Ceph on a 3-node cluster where each node will have 2 available drives for Ceph to mount. To confirm that the cluster is up and running, run:

# kubectl get nodes
NAME          STATUS   ROLES                  AGE    VERSION
k8s-master    Ready    control-plane,master   110m   v1.20.4
k8s-node-01   Ready    <none>                 105m   v1.20.4
k8s-node-02   Ready    <none>                 105m   v1.20.4
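If you would rather not read the table by eye, the STATUS column can be counted with a small helper. This is only a minimal sketch: the function name count_not_ready is made up for illustration, and it assumes the default kubectl get nodes table layout shown above, with STATUS as the second column.

```shell
# Hypothetical helper (not part of kubectl or Rook): reads a
# `kubectl get nodes` table on stdin and prints how many nodes
# have a STATUS other than exactly "Ready".
count_not_ready() {
  # NR > 1 skips the header row; $2 is the STATUS column.
  # "n + 0" forces a numeric 0 when no line matched.
  awk 'NR > 1 && $2 != "Ready" { n++ } END { print n + 0 }'
}
```

It would be used as kubectl get nodes | count_not_ready and should print 0 when every node is ready. Note that the comparison is strict, so a cordoned node reported as Ready,SchedulingDisabled is counted as not ready.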

Rook

As previously stated, we will be using Rook as our storage orchestrator. Clone the newest version with:

git clone https://github.com/rook/rook.git

After cloning the repo, navigate to the right folder with:

cd rook/cluster/examples/kubernetes/ceph

First, we have to create the necessary custom resource definitions (CRDs) and the RoleBindings. Run the command:

kubectl create -f crds.yaml -f common.yaml

I will not go through these two files as they are not relevant to the cluster configuration.

Now it is time to deploy the Rook operator, which automates most of the Ceph deployment. In this example, we will enable the Rook operator to automatically discover any empty disks, prepare them as OSDs, and thereby join them to the cluster. The Rook operator is defined in operator.yaml. A multitude of things can be configured in the operator file. Most noteworthy is that resources can be limited, so that certain parts of your cluster do not consume too many resources and slow down other parts of the cluster. We will go with the standard configuration and only change the following value from false to true:

- name: ROOK_ENABLE_DISCOVERY_DAEMON
  value: "false"

This will enable the operator to automatically discover the disks currently in the cluster and any disks that might be added later, without any input from us as admins.
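The edit is easy to make by hand, but it can also be scripted. The following is a minimal sketch, assuming the setting appears in operator.yaml in the name/value form shown above; the function name enable_discovery is hypothetical, and the layout of the file may differ between Rook versions.

```shell
# Hypothetical helper: prints the given file to stdout with the value
# of ROOK_ENABLE_DISCOVERY_DAEMON switched from "false" to "true".
# Only the `value:` line directly below the matching `name:` line is
# changed; every other line is printed untouched.
enable_discovery() {
  awk '
    prev_is_flag && /value:/ { sub(/"false"/, "\"true\"") }
    { prev_is_flag = ($0 ~ /name: ROOK_ENABLE_DISCOVERY_DAEMON/); print }
  ' "$1"
}
```

Because the function writes to stdout instead of editing in place, you can review the result first, for example with enable_discovery operator.yaml | diff operator.yaml - before saving it.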

Now deploy the Rook operator:

# kubectl create -f operator.yaml
configmap/rook-ceph-operator-config created
deployment.apps/rook-ceph-operator created

You should now be able to see the operator pod and the OSD discovery pods running in the rook-ceph namespace in Kubernetes:

# kubectl get pods -n rook-ceph
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-678f988875-r6nc4   1/1     Running   0          83s
rook-discover-4w92b                   1/1     Running   0          41s
rook-discover-gw22p                   1/1     Running   0          41s
rook-discover-kskfx                   1/1     Running   0          41s

With the operator now running, we are ready to deploy our storage cluster. The storage cluster will be created with the cluster.yaml file.

Cluster configuration

Before deploying a storage cluster, we need to configure the cluster's behavior. A storage solution needs to ensure that data is not lost in case of disk failure and that the system is able to recover quickly if anything were to happen.

Changing the configurations in cluster.yaml should be done with caution as you can introduce severe overhead into your cluster and even create a cluster without any data security, safety, or reliability. We will be going through the configurations I find relevant for someone deploying their first cluster.

mon: 
  count: 3

A standard cluster will have 3 monitor daemons. The optimal number of monitor daemons has been discussed for a long time. The general consensus is that a single monitor leaves your cluster in an unhealthy state as soon as the node it runs on goes down. This is obviously not a great choice if you would like to ensure any kind of data durability. Another choice is to create 5 monitor daemons. This is often regarded as a good idea when a cluster expands to hundreds or thousands of nodes. However, since each monitor keeps an updated version of the CRUSH map, doing this on a small cluster can hurt the cluster's speed. The community largely agrees that for most clusters, this should be 3. This introduces another problem, however. If we lose more than one monitor node at the same time, we will lose quorum and thereby leave the cluster in an unhealthy state.
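The arithmetic behind these numbers is simple: monitors need a strict majority to form quorum. The sketch below only illustrates that rule; the function names are made up for this post and nothing here talks to a real cluster.

```shell
# Smallest number of monitors that forms a strict majority of $1 monitors.
mon_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many monitors can fail while the remaining ones still hold a majority.
mon_failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the trade-off for a few monitor counts.
for n in 1 2 3 4 5; do
  printf '%s monitors: quorum needs %s, survives %s failure(s)\n' \
    "$n" "$(mon_quorum "$n")" "$(mon_failures_tolerated "$n")"
done
```

Going from 3 to 4 monitors raises the quorum size without tolerating any additional failure, which is why an odd count is the usual choice.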

waitTimeoutForHealthyOSDInMinutes: 10

We have to configure how long we will wait for OSDs that are still in the cluster but are non-responsive. This is set in minutes. If you go too low, you risk that a temporarily unresponsive OSD starts a recovery process that slows down your cluster unnecessarily. However, if you wait too long to check the OSDs, you run the risk of permanently losing data if any of the other OSDs that hold the replicated data fail.

There are more things to configure in the cluster.yaml file. If you would like to use the Ceph dashboard or perhaps monitor your cluster with a monitoring tool like Prometheus, you can also enable these. For now, we will leave the rest of the settings as they are and deploy the cluster:

kubectl create -f cluster.yaml

To see the magic unfold, you can use the command:

watch kubectl get pods -n rook-ceph

In a couple of minutes, Kubernetes should have deployed all the necessary daemons to have your cluster up and running. You should be able to see the monitor daemons:

rook-ceph-mon-a-5588866567-vjg99                        1/1     Running     0          4m51s
rook-ceph-mon-b-9bc647c5b-fmbjf                         1/1     Running     0          4m27s
rook-ceph-mon-c-7cd784c4b7-qwwwb                        1/1     Running     0          4m1s

You should also be able to see the OSDs in the cluster. There should be six of them, since we have two disks on each of our three nodes:

rook-ceph-osd-0-7b884cfccb-qpqbd                        1/1     Running     0          4m49s
rook-ceph-osd-1-5d4c587cdb-bzstp                        1/1     Running     0          4m48s
rook-ceph-osd-2-857b8786bd-q8wqk                        1/1     Running     0          4m41s
rook-ceph-osd-3-443df7d8er-q9we3                        1/1     Running     0          4m41s
rook-ceph-osd-4-5d47f54f7d-tq6rd                        1/1     Running     0          4m41s
rook-ceph-osd-5-32jkjdkwk2-33jkk                        1/1     Running     0          4m41s
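Counting the pods by hand gets tedious on larger clusters. The helper below is a minimal sketch for checking the numbers: the function name count_running is hypothetical, and it assumes the pod naming seen in the listings above (rook-ceph-mon-<letter>-... and rook-ceph-osd-<number>-...) together with the default kubectl get pods columns.

```shell
# Hypothetical helper: reads `kubectl get pods -n rook-ceph` output on
# stdin and prints how many pods of the given daemon kind are Running.
# $1 is "mon" or "osd"; the patterns are an assumption based on the pod
# names shown in this post.
count_running() {
  case "$1" in
    mon) pattern='^rook-ceph-mon-[a-z]+-' ;;
    osd) pattern='^rook-ceph-osd-[0-9]+-' ;;
    *) echo "usage: count_running mon|osd" >&2; return 1 ;;
  esac
  # $1 is the pod name and $3 the STATUS column of the kubectl table.
  awk -v pat="$pattern" '$1 ~ pat && $3 == "Running" { n++ } END { print n + 0 }'
}
```

For the cluster in this post, kubectl get pods -n rook-ceph | count_running osd should print 6 (3 nodes with 2 disks each) and kubectl get pods -n rook-ceph | count_running mon should print 3. Requiring a number right after rook-ceph-osd- keeps other pods whose names merely start with the same prefix out of the count.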

That was it! You have now created your very own Ceph storage cluster, in which you will be able to create a distributed filesystem and Kubernetes will be able to create PVCs.

This blog post will be continued next week with more on how Ceph ensures data durability and how to start using your Ceph cluster with Kubernetes.
To be continued...