Ambar Mehrotra

Elasticsearch Backup and Restore with AWS S3 in Kubernetes

In my day job, I get to work with things like Docker, Kubernetes, Terraform, and various cloud components across cloud providers. We run multiple Elasticsearch clusters inside our Kubernetes cluster (EKS). These Elasticsearch clusters were installed as Helm charts using Helm, the well-known package manager for Kubernetes. Recently, I had to set up a disaster-recovery strategy for these Elasticsearch clusters so that they could be restored to a previous stable state in case of a failure.

The process involved taking regular snapshots of the Elasticsearch cluster and backing them up in an S3 bucket. These backups can later be used to restore the cluster state at a given point in time in case of a disaster. Although the process was not that complicated and was more or less documented, I still had to google some configuration options to get it to work properly, so I thought I would write down the exact steps in a small blog post.

NOTE: If you are using Elasticsearch version 7.5 or above, it has a pretty great module called Snapshot Lifecycle Management, and I suggest you check that out.

The main idea behind the setup goes like the following:

  • Configure the S3 repository plugin for the Elasticsearch cluster
  • Call the ES snapshot API at regular intervals to take incremental snapshots
  • Use the restore API to restore the indexes or cluster state from these backups

The steps for achieving the above-mentioned goals can be divided into 3 main parts:

Enable the S3 repository plugin

Enabling plugins in Elasticsearch requires a restart of the ES cluster. Therefore, the official documentation suggests creating a custom Docker image with the S3 plugin baked into the image itself. According to the docs:

There are a couple of reasons we recommend this.

  • Tying the availability of Elasticsearch to the download service to install plugins is not a great idea or something that we recommend. Especially in Kubernetes where it is normal and expected for a container to be moved to another host at random times.
  • Mutating the state of a running Docker image (by installing plugins) goes against best practices of containers and immutable infrastructure.

So, to build a Docker image with the S3 repository plugin enabled, you can use the following Dockerfile:

ARG elasticsearch_version
FROM docker.elastic.co/elasticsearch/elasticsearch:${elasticsearch_version}

RUN bin/elasticsearch-plugin install --batch repository-s3

Installing plugins in ES requires extra permissions; the --batch flag tells ES to grant any permissions the plugin requires without prompting for confirmation.

Configure Elasticsearch to use S3 bucket for storing snapshots

There are many parameters you can adjust when registering an S3 bucket for storing Elasticsearch snapshots; for the complete set of options, take a look at the official documentation. For a basic setup, you can register the S3 bucket by making a curl call to the repository endpoint of ES:

PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my_bucket_name",
    "another_setting": "setting_value"
  }
}
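If you prefer to script this registration step instead of issuing the call by hand, here is a minimal sketch using only Python's standard library. The cluster URL, repository name, and bucket name are placeholder values for illustration:

```python
import json
import urllib.request

def register_s3_repo(es_url, repo_name, bucket):
    """Build the PUT request that registers an S3 snapshot repository."""
    body = json.dumps({"type": "s3", "settings": {"bucket": bucket}}).encode()
    return urllib.request.Request(
        f"{es_url}/_snapshot/{repo_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Actually sending the request requires a reachable cluster, e.g.:
# urllib.request.urlopen(
#     register_s3_repo("http://elasticsearch-master:9200",
#                      "my_s3_repository", "my_bucket_name"))
```

This is handy when the registration needs to run as part of a bootstrap Job rather than from someone's laptop.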

Configure permissions that allow the Elasticsearch pods to access the S3 bucket

Thanks to amazing projects like kube2iam, which let you grant the required IAM access to individual Kubernetes objects, this job has become quite easy. The Helm chart for Elasticsearch accepts podAnnotations as an input. These annotations are applied to the Elasticsearch pods and can leverage the full functionality of kube2iam for accessing the S3 bucket.

podAnnotations:  
  iam.amazonaws.com/role: "my-iam-role"

The corresponding IAM role can be easily generated using AWS clients like boto3, the AWS provider in Terraform, or any other AWS client at your disposal.
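For reference, the repository-s3 plugin documentation lists the S3 permissions the snapshot repository needs. A policy along these lines, attached to the role that kube2iam assumes, should be sufficient (my_bucket_name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Resource": ["arn:aws:s3:::my_bucket_name"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["arn:aws:s3:::my_bucket_name/*"]
    }
  ]
}
```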

Informing the Elasticsearch Helm chart about ES version

This setting was not mentioned in the plugins documentation in a straightforward manner, and I had to search around a bit to figure it out. If you are using a custom image rather than the default Elasticsearch image, you also need to set the esMajorVersion flag. For example, I had to set esMajorVersion: 6 as I was running version 6.3.1 of Elasticsearch.
You can have a look at the Elasticsearch statefulset for the exact usage of this flag.

That's it, now we are ready to take Elasticsearch snapshots or restore from them.

Taking Snapshots

This part is pretty straightforward. Elasticsearch provides a snapshot API that can be used to back up the entire cluster state or specific indices.

For a snapshot of the entire cluster, you can use the following curl call:

PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

You can also specify the exact indices that you want to back up:

PUT /_snapshot/my_backup/snapshot_2?wait_for_completion=true
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "kimchy",
    "taken_because": "backup before upgrading"
  }
}

Once a snapshot is created, information about it can be obtained using the following command:

GET /_snapshot/my_backup/snapshot_1

Also, to automate the process of taking regular backups, you can use a Kubernetes CronJob to periodically make these API calls against the Elasticsearch snapshot endpoint.
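A hedged sketch of such a CronJob follows; the schedule, image, service name, and repository name are assumptions you would adapt to your cluster, and the date-stamped snapshot name keeps one snapshot per day:

```yaml
apiVersion: batch/v1beta1   # use batch/v1 on Kubernetes 1.21+
kind: CronJob
metadata:
  name: elasticsearch-snapshot
spec:
  schedule: "0 2 * * *"     # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: curlimages/curl:7.85.0
              command: ["/bin/sh", "-c"]
              args:
                - >
                  curl -s -XPUT
                  "http://elasticsearch-master:9200/_snapshot/my_s3_repository/snapshot-$(date +%F)"
```

Since snapshots are incremental, taking them daily like this stays cheap even for large clusters.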

Restoring from a snapshot

The restore API is pretty simple as well. By default, all indices in the snapshot are restored, but the cluster state is not. You can make the following curl call to restore from a snapshot:

POST /_snapshot/my_backup/snapshot_1/_restore

You can also provide index-level information while restoring from a snapshot:

POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false,              
  "rename_pattern": "index_(.+)",
  "rename_replacement": "restored_index_$1",
  "include_aliases": false
}

The restore operation can be performed on a functioning cluster. However, an existing index can only be restored if it is closed and has the same number of shards as the index in the snapshot.
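As a minimal illustration of that close-then-restore sequence, the two calls can be scripted with Python's standard library. The cluster URL, repository, snapshot, and index names below are placeholders:

```python
import json
import urllib.request

ES_URL = "http://elasticsearch-master:9200"  # placeholder cluster address

def close_index(index):
    """Build the POST request that closes an index prior to restoring it."""
    return urllib.request.Request(f"{ES_URL}/{index}/_close", data=b"", method="POST")

def restore_snapshot(repo, snapshot, indices):
    """Build the POST request that restores selected indices from a snapshot."""
    body = json.dumps({
        "indices": ",".join(indices),
        "ignore_unavailable": True,
        "include_global_state": False,
    }).encode()
    return urllib.request.Request(
        f"{ES_URL}/_snapshot/{repo}/{snapshot}/_restore",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

# Against a live cluster you would urlopen() each request in order:
# for req in [close_index("index_1"),
#             restore_snapshot("my_backup", "snapshot_1", ["index_1"])]:
#     urllib.request.urlopen(req)
```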

That's All Folks!

Happy Coding! Cheers :)
