DEV Community

PaulOpu
PaulOpu

Posted on

ElasticSearch: Switch Index like a Pro

Introduction

One of the most used databases to search for large documents is Elastic Search. There are specific mechanisms in place to speed up the search significantly. Nevertheless, the configurations must be set wisely, otherwise, your search takes much longer than expected.

Background (Elastic Search Concepts)

Elastic Search Cluster with Index and Shards

Node

The data is distributed across multiple computational instances, called nodes to parallelize the search and make it quicker. Thereby, each instance only has a fraction of the entire database.

Primary Shards

The data is organized in indices, which can be compared with tables in SQL databases. For example, if you are a sports website, you have indices for players, teams, or tournaments. The data of each index is split into primary shards so that one shard can stay on one node.
You already realize that the number of primary shards should not be higher than the number of nodes, otherwise we have 2 separate shards on one node and that is less efficient.

Replica Shards

Replica shards add redundancy to the index

Consider we have a player index with 5 shards in a cluster with 5 nodes, where shard 1 is on node 1, and so on. Assume that node 1 is busy searching in another index. If you now want to search for players, shard 1 (is on the busy node 1) might respond with a delay.
To overcome this, we introduce replica shards: each shard can have 0-n replicas, which contain exactly the same data as the corresponding shard. They should lie on a different node. In our example considering the replica 1 for shard 1 is on node 2, we can also address node 2 and skip the busy node 1.

Problem

We experienced that the CPU utilization of our nodes increased drastically during peak times and elastic search was not able to handle all search requests. The time to resolve a search request is split into dividing the request on each node, the actual search in the shards, and collecting the results to aggregate them.
The actual search cannot be the problem, as each primary shard just contains a few hundred MBs and the desired size should be at least a few GBs. Therefore, we assumed that the splitting and aggregation might be a bigger overhead for the system. As a result, we needed to reduce the number of nodes and shards to see if our hypothesis was correct.

Unfortunately, it is not possible to update the number of shards of an existing index.

Solution

Current Setup

  • 6 Nodes
  • 6 indices with < 1Gb data
    • 5 primary shards
    • 1 replica per shard

Requirements

Changing the number of shards is not just flipping a switch. Therefore, we need to make sure that the productive system is still responsive.

Next to zero downtime, the data amount increases in the future, and the cluster should be able to scale up again

Implementation

There are several ways to deal with this situation. In the end, we decided to use the concept of aliases in Elastic Search.

Alias

An alias is a pointer to an existing index. For example, the alias member points toward the player index. Now any request to the member alias is redirected to the player index. Creating and deleting an index is very easy compared to an index. You cannot create an alias with the same name as an index. Nevertheless, creating an alias and deleting an index can happen at the same time. We use this behavior.

Process

How to change an index in 3 steps

  1. Create a new index with the name and the desired replica and primary shard size.

    PUT /new_index
    {
      "settings": {
        "index": {
          "number_of_shards": 3,  
          "number_of_replicas": 2 
        }
      }
    }
    
  2. Copy the data from to . Use the reindex API call from Elastic Search.

    POST _reindex
    {
      "source": {
        "index": "old_index"
      },
      "dest": {
        "index": "new_index"
      }
    }
    
  3. Redirection and Cleaning up

    a. Delete and create a alias with the name

    ```json
    POST /_aliases
      "actions": [
        {
          "add": {
            "index": "new_index",
            "alias": "old_index"
          }
        },
        {
          "remove_index": {
            "index": "old_index"
          }
        }
      ]
    }
    ```
    

    b. If you already have an alias, redirect it to

    ```json
    POST /_aliases
      "actions": [
        {
          "add": {
            "index": "new_index",
            "alias": "old_index"
          }
        },
            {
          "remove": {
            "index": "other_index",
            "alias": "old_index"
          }
        },
        {
          "remove_index": {
            "index": "old_index"
          }
        }
      ]
    }
    ```
    

Flags

To make the approach testable you can add flags to your script making each part of the process optional. Thereby, you can create a new temporary index and switch from that one. That doesn’t affect your production environment, but you can test each step of the process. Here is an example:

  1. Add a new temporary index and reindex from an existing one, but don’t switch (1. and 2. step of the process)
  2. Execute all steps on the temporary index. You can optionally run a load test on that index to see if there was no downtime while switching the indices.
  3. See if everything worked as expected and if your service experienced any downtime
  4. Delete the new index, as it was just for testing (just step 3)

Switch the index again

You might ask yourself: if I want to change the number of shards again how can I do that? No problem, just use step 3.b, which takes into account that you already created an alias.

Conclusion

I showed you how to change the settings of an index without interrupting your service. You create a new index, copy all data, and create an alias that points to your new index while deleting the old one.

  • You have to downtime, as switching the index and deleting the old index happens at the same moment.
  • Testing is no problem, as you can enable and disable each step and create temporary indices.
  • The process can be applied multiple times, as it also works with existing aliases.
  • Use the cat endpoint to monitor the process in real-time

Don’t be afraid to scale down

We reduced the number of shards to 1 with 2 replicas and 3 nodes (before 5 primary shards with 1 replica and 5 nodes). The CPU utilization was reduced by 75% and the costs by 20%.

I hope I could help you to get the most out of your Elastic Search Cluster. If you have any questions let me know and I’m happy to answer them. I would appreciate any feedback that you have, as I like to learn every day. Thanks a lot!

Reference

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.