Jesper Axelsen for IT Minds

Posted on Mar 12, 2021

Ceph data durability, redundancy, and how to use Ceph

#kubernetes #ceph

This blog post is the second in a series concerning Ceph.

Creating data redundancy

One of the main concerns when dealing with large sets of data is data durability. We do not want a cluster in which a single disk failure causes data loss. What Ceph aims for instead is fast recovery from any type of failure occurring within a specific failure domain.

Ceph is able to ensure data durability by using either replication or erasure coding.

Replication

For those of you who are familiar with RAID, you can think of Ceph's replication as RAID 1 but with subtle differences.

The data is replicated onto a number of different OSDs, nodes, or racks depending on your cluster configuration. The original data and the replicas are split into many small chunks and evenly distributed across your cluster using the CRUSH algorithm. If you have chosen to have three replicas on a 6-node cluster, the chunks of these three replicas will be spread across all six nodes, rather than three nodes each holding one full replica.
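
To make the "spread across all six nodes" point concrete, here is a toy sketch. It does not implement CRUSH; it uses rendezvous hashing as a simple stand-in, and the node names and object names are hypothetical. It only illustrates that each object gets three distinct nodes while the cluster as a whole uses all six.

```python
import hashlib
from collections import Counter

NODES = [f"node-{i}" for i in range(1, 7)]  # hypothetical 6-node cluster
REPLICAS = 3


def place(object_name: str, nodes=NODES, replicas=REPLICAS) -> list[str]:
    """Pick `replicas` distinct nodes for one object.

    Rendezvous hashing, used here only as a stand-in for CRUSH.
    """
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{object_name}/{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The nodes with the highest scores for this object hold its copies.
    return sorted(nodes, key=score, reverse=True)[:replicas]


# Place 1,000 small objects and count how many copies each node holds.
usage = Counter()
for i in range(1000):
    usage.update(place(f"object-{i}"))

for node in NODES:
    print(node, usage[node])
```

Every node ends up holding a share of the copies, even though each individual object only lives on three of them.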

It is important to choose the right level of data replication. If you are running a single-node cluster, replication on the node level would be impossible and your cluster would lose data in the event of a single OSD failure. In this case, you would choose to replicate data across the OSDs you have available on the node.

On a multi-node cluster, your replication factor decides how many OSDs or nodes you can afford to lose in case of disk or node failure, without data loss. Of course, replicating data reduces the amount of usable space in your cluster. If you choose a replication factor of 3 on the node level, only 1/3 of your total raw storage will be available for you to use.
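
The arithmetic above can be sketched in a few lines. The function names and the example cluster size (6 nodes with 10 TB each) are assumptions for illustration, and the failure count ignores availability settings such as the pool's minimum replica count.

```python
def usable_capacity(raw_tb: float, replication_factor: int) -> float:
    """Usable space when every object is stored `replication_factor` times."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_tb / replication_factor


def tolerated_failures(replication_factor: int) -> int:
    """Copies that can be lost while at least one copy survives."""
    return replication_factor - 1


raw_tb = 6 * 10  # hypothetical cluster: 6 nodes with 10 TB each
print(usable_capacity(raw_tb, 3))   # 60 / 3 = 20.0 TB usable
print(tolerated_failures(3))        # 2 failure domains can be lost
```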

Replication in Ceph is fast and only limited by the read/write operations of the OSDs. However, some people are not content with "only" being able to use a small amount of their total space. Therefore, Ceph also introduced erasure coding.

Erasure Coding

Erasure coding encodes your original data in such a way that, when you need to retrieve it again, you only need a subset of the stored fragments to recreate the original information. It splits objects into k data fragments and then computes m parity fragments. I will provide an example.

Let us say that the value of our data is 52. We could split it into:
x = 5
y = 2

The encoding process will then compute a number of parity fragments. In this example, these will be equations:
x + y = 7
x - y = 3
2x + y = 12

Here, we have k = 2 and m = 3, where k is the number of data fragments and m is the number of parity fragments. In total we store 5 elements (the two data fragments and the three parity fragments). If a disk or node fails and the data needs to be recovered, any two of these five are enough to solve for x and y, so we can lose up to three of them. This is what ensures data durability when using erasure coding.
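
The recovery step in the example above can be checked with a small sketch. It treats each stored fragment as a linear equation a*x + b*y = c and solves every possible pair with Cramer's rule. This is only the toy scheme from this post; real erasure codes work over finite fields rather than ordinary integers, and the names below are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Each fragment is one equation a*x + b*y = c, stored as (a, b, c).
FRAGMENTS = {
    "x": (1, 0, 5),
    "y": (0, 1, 2),
    "x + y": (1, 1, 7),
    "x - y": (1, -1, 3),
    "2x + y": (2, 1, 12),
}


def recover(frag1, frag2):
    """Solve two fragments for (x, y) using Cramer's rule."""
    a1, b1, c1 = frag1
    a2, b2, c2 = frag2
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("fragments are not independent")
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


# Try every pair of surviving fragments (i.e. every way of losing three).
for (name1, f1), (name2, f2) in combinations(FRAGMENTS.items(), 2):
    x, y = recover(f1, f2)
    print(f"{name1!r} and {name2!r} -> x = {x}, y = {y}")
```

All ten pairs give back x = 5 and y = 2, so the original value 52 survives the loss of any three fragments.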

Now, why does this matter? It matters because these parity fragments take up significantly less space when compared to replicating the data. Here is a table that shows how much overhead there is on different erasure coding schemes. The overhead is calculated with m / k.

Erasure coding scheme (k+m)	Minimum number of nodes	Storage overhead
4+2	6	50%
6+2	8	33%
8+2	10	25%
6+3	9	50%

As we can see in the table, you can use the (8+2) scheme to make sure you can lose two of your nodes without losing any data, and this with only a 25% storage overhead.
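
The table can be reproduced with a short sketch using the m / k formula from above. The 3x replication line is added for comparison, using the same definition of overhead (extra space stored on top of the original data); the function names are assumptions for illustration.

```python
def ec_overhead(k: int, m: int) -> float:
    """Extra space for parity relative to the data itself (m / k)."""
    return m / k


def replication_overhead(size: int) -> float:
    """Extra copies relative to the original data (size - 1)."""
    return float(size - 1)


for k, m in [(4, 2), (6, 2), (8, 2), (6, 3)]:
    print(
        f"{k}+{m}: minimum nodes {k + m}, "
        f"overhead {ec_overhead(k, m):.0%}, "
        f"tolerates {m} lost fragments"
    )

print(f"3x replication: overhead {replication_overhead(3):.0%}")
```

Under this definition, three-way replication has a 200% overhead, against 25% for the (8+2) scheme with the same tolerance of two lost nodes.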

If you look at this from a storage space optimization standpoint, this is a much better use of the storage. However, it is not without certain downsides. The parity fragments take time for the cluster to calculate and read/write operations are therefore slower than with replication. Therefore, erasure coding is usually recommended on clusters that deal with large amounts of cold data.

Using Ceph

A natural part of deployments on Kubernetes is to create persistent volume claims (PVCs). PVCs can claim a volume and use that as storage for data in the pod. In order to create a PVC you first need to define a StorageClass in Kubernetes.



apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
spec:
  failureDomain: host
  replicated:
    size: 3
    requireSafeReplicaSize: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph # namespace:cluster
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/fstype: ext4
allowVolumeExpansion: true
reclaimPolicy: Delete

In this file, you can see that we first create a replicated pool that keeps 3 replicas in total and uses host as the failure domain. The StorageClass then references that pool and defines whether volume expansion is allowed after a volume is created and what the reclaim policy should be. The reclaim policy determines whether the underlying volume and its data are deleted or retained when the PVC that claims it is deleted. In this case, I have chosen Delete.



# kubectl get storageclass -n rook-ceph
NAME              PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
rook-ceph-block   rook-ceph.rbd.csi.ceph.com      Delete          Immediate           true                   10m

Now that the StorageClass has been created, we can create a PVC:



---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block

This creates a PVC that is now bound to a volume in our Kubernetes cluster:



# kubectl get pvc -n rook-ceph
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
pvc       Bound    pvc-56c45f01-562f-4222-8199-43abb856ca94   1Gi        RWO            rook-ceph-block   37s
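
If you need several claims that differ only in name or size, the PVC manifest above can also be generated rather than written by hand. The following is a minimal sketch, not part of Rook or Kubernetes tooling: `build_pvc` is a hypothetical helper, and it prints JSON, which kubectl accepts in place of YAML.

```python
import json


def build_pvc(name: str, size: str, storage_class: str = "rook-ceph-block") -> dict:
    """Build a PersistentVolumeClaim manifest mirroring the one shown above."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
            "storageClassName": storage_class,
        },
    }


# Same claim as in the manifest above: 1Gi from the rook-ceph-block class.
print(json.dumps(build_pvc("pvc", "1Gi"), indent=2))
```

The printed manifest can be saved to a file or piped into `kubectl apply -f -`.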

We will now deploy a pod that uses this PVC:



---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc
        readOnly: false

After deploying this pod, you can see it in the pod list:



# kubectl get pods -n rook-ceph
NAME              READY   STATUS    RESTARTS   AGE
demo-pod          1/1     Running   0          118s

That is how you deploy pods that use persistent volume claims backed by your Ceph cluster!

Top comments (1)

Julien Laurenceau • Nov 27

Thank you very much for sharing.

I will also add that, beyond durability, the erasure coding scheme also impacts performance!
Imagine you have 100 disks and an erasure set size of 100: a single write would make all 100 disks spin! That is a very bad idea, because in general you have a lot of concurrent access.
The reason is that write IOPS are diluted across the erasure set size, as can be understood by using, for example, this calculator:
docs.clyso.com/tools/erasure-codin...

I would also appreciate seeing some fault tolerance tests on Ceph. In fact, I came to Ceph because I had to quit Minio after issues with fault tolerance, as described here: dev.to/julienlau/minio-a-critical-...
