Sandeep Kanabar for AWS Community Builders

Advantages of using time-based indices in OpenSearch

This post lists a few advantages of using time-based indices in an OpenSearch cluster.

  1. Increasing / Decreasing the number of shards becomes easy
  2. Helps to plan cluster capacity and growth size
  3. Easily determine optimum number of shards
  4. Avoids having to reindex entire data
  5. Efficient Deletion and application of ISM
  6. Easy to include / exclude indices based on alias
  7. Snapshot and Restore becomes a breeze with day-wise indices
  8. Apply best_compression to day-wise indices
  9. Force-merge past indices

1. Increasing / Decreasing the number of shards becomes easy

Say an index template for day-wise indices is configured with 1 shard in the index settings. If indexing throughput is too low or the shard grows too large (> 50 GB), the index template can easily be modified to increase number_of_shards to 3 or 5, and the change takes effect from the next day's index. Similarly, if a day-wise index pattern is configured with more shards than required (oversharded), reducing the count is just as easy.
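
As a minimal sketch of the template change described above, the snippet below builds the request body for a composable index template (sent with PUT _index_template/<name>). The pattern my_index-* and the helper name daily_index_template are assumptions made for illustration; nothing here talks to a cluster.

```python
import json


def daily_index_template(pattern: str, shards: int, replicas: int = 1) -> dict:
    """Build a body for PUT _index_template/<name> (hypothetical helper)."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        },
    }


# Moving from 1 to 3 shards only changes the template. Existing indices keep
# their shard count; indices created afterwards use the new value.
body = daily_index_template("my_index-*", shards=3)
print(json.dumps(body, indent=2))
```

Only the template is touched, so existing day-wise indices are left alone and the next day's index is created with the new shard count.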

2. Helps to plan cluster capacity and growth size

Let's say 100 events per second flow into an OpenSearch cluster and each event averages about 1 KB in size. Thus, per day, there would be:
86400 seconds * 100 events/second = 8,640,000 events.

Since each event averages about 1 KB, the total size of 8,640,000 events is 8,640,000 * 1 KB = 8,640,000 KB, and 8,640,000 KB / (1024 * 1024) = ~8.24 GB.

Rounding up to leave some headroom, the day-wise index size would be ~9 GB per day without any replicas. With 1 replica, the size per day would be ~18 GB and the size for 30 days would be ~540 GB. This helps with capacity planning and estimating the cluster growth rate.
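
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it deliberately ignores indexing overhead and compression, which change the on-disk size in practice.

```python
SECONDS_PER_DAY = 86_400


def daily_size_gb(events_per_second: float, avg_event_kb: float) -> float:
    """Raw data volume per day in GB (1 GB = 1024 * 1024 KB)."""
    events_per_day = events_per_second * SECONDS_PER_DAY
    return events_per_day * avg_event_kb / (1024 * 1024)


def storage_gb(daily_gb: float, replicas: int, retention_days: int) -> float:
    """Total size of primaries plus replicas over the retention window."""
    return daily_gb * (1 + replicas) * retention_days


raw = daily_size_gb(events_per_second=100, avg_event_kb=1)
print(round(raw, 2))  # raw daily volume in GB

planned_daily_gb = 9  # rounded up from the raw figure, as in the text
print(storage_gb(planned_daily_gb, replicas=1, retention_days=30))
```

Changing the event rate, event size, replica count or retention period gives a quick first estimate for other workloads.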

3. Easily determine optimum number of shards

With a data set of about 9 GB per day, we could start by setting "number_of_shards": 1 in the index template for a day-wise index, since each primary shard would be about 9 GB, which is quite reasonable for a single shard. Shards for time-based indices can grow to the range of 40-50 GB.
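
One rough way to turn this into a rule of thumb is to divide the expected daily primary size by a target shard size and round up. The sketch below assumes a 40 GB target, taken from the lower end of the range mentioned above; the function name is hypothetical and the result is a starting point, not a guarantee.

```python
import math


def primary_shards(daily_gb: float, target_shard_gb: float = 40.0) -> int:
    """Smallest shard count that keeps each primary at or below the target size."""
    return max(1, math.ceil(daily_gb / target_shard_gb))


print(primary_shards(9))    # the ~9 GB/day example from the text
print(primary_shards(120))  # a hypothetical heavier day
```

For the ~9 GB example this gives a single primary shard, which matches the starting point suggested in the text.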

4. Avoids having to reindex entire data

If the data influx increases, we could simply set "number_of_shards": 3 in the index template, and the change would take effect for tomorrow's day-wise index. The number of shards can thus be changed without reindexing any data.

5. Efficient Deletion and application of ISM

Let's say we need to retain data for up to 90 days. Any day-wise index older than 90 days can then be deleted as a whole. This is far more efficient than deleting individual documents from an index.
Also, applying Index State Management (ISM) becomes simpler with time-based indices.
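
As an illustration of the 90-day retention described above, the sketch below builds an ISM policy body that moves an index to a delete state once it is older than the retention period. The state names, the pattern my_index-* and the helper name are assumptions, and the field layout follows the ISM policy format as I understand it, so it should be checked against the ISM documentation for the OpenSearch version in use.

```python
import json


def retention_policy(pattern: str, retention_days: int) -> dict:
    """Build an ISM policy body that deletes indices past the retention age."""
    return {
        "policy": {
            "description": f"Delete day-wise indices after {retention_days} days",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": f"{retention_days}d"},
                        }
                    ],
                },
                # Terminal state: the whole index is dropped in one operation.
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to new indices matching the pattern.
            "ism_template": {"index_patterns": [pattern], "priority": 100},
        }
    }


policy = retention_policy("my_index-*", retention_days=90)
print(json.dumps(policy, indent=2))
```

Because each index covers exactly one day, the policy never has to look inside an index to decide what to remove.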

6. Easy to include / exclude indices based on alias

Let's assume the cluster needs to retain 90 days of data but only needs to search the last 60 days. Alias to the rescue. In this case, define an alias in the index template so that it is attached to every newly created day-wise index. As soon as an index becomes older than 60 days, the alias is removed from that index. This ensures that, at any given point in time, the alias points to at most 60 day-wise indices.
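
The alias housekeeping described above can be sketched as a function that decides which indices have aged out and builds the body for POST _aliases. The alias name, the my_index- prefix and the yyyy.MM.dd suffix are assumptions for illustration; the function only builds the request body and does not call the cluster.

```python
from datetime import date, datetime


def alias_removals(index_names, alias, today, keep_days=60, prefix="my_index-"):
    """Build a POST _aliases body removing the alias from indices that aged out."""
    actions = []
    for name in index_names:
        # The day of the index is parsed from its name, e.g. my_index-2021.11.04
        day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        if (today - day).days >= keep_days:
            actions.append({"remove": {"index": name, "alias": alias}})
    return {"actions": actions}


names = ["my_index-2021.09.05", "my_index-2021.09.06", "my_index-2021.11.04"]
print(alias_removals(names, alias="my_index-search", today=date(2021, 11, 4)))
```

With keep_days=60 the alias keeps covering today's index and the 59 days before it, so searches through the alias touch at most 60 day-wise indices.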

7. Snapshot and Restore becomes a breeze with day-wise indices

Say you have an index named my_index-2021.11.04 created on Nov 04, 2021. On Nov 05, 2021, at say 00:45 hours, when data is no longer being written to my_index-2021.11.04, a snapshot named snap-my_index-2021.11.04 could be triggered for that index. This snapshot would contain just my_index-2021.11.04. If the index is deleted and needs to be restored, it can easily be restored from the snapshot snap-my_index-2021.11.04.
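
A minimal sketch of the per-day snapshot and restore requests is shown below. The repository name my_repo is hypothetical and must already be registered on the cluster; the helpers only return the HTTP method, path and body that would be sent to the snapshot API.

```python
def snapshot_request(repo: str, index: str):
    """Request that snapshots a single day-wise index into snap-<index>."""
    path = f"/_snapshot/{repo}/snap-{index}"
    body = {"indices": index, "include_global_state": False}
    return "PUT", path, body


def restore_request(repo: str, index: str):
    """Request that restores the same index from its own snapshot."""
    path = f"/_snapshot/{repo}/snap-{index}/_restore"
    return "POST", path, {"indices": index}


print(snapshot_request("my_repo", "my_index-2021.11.04"))
print(restore_request("my_repo", "my_index-2021.11.04"))
```

Since each snapshot holds exactly one index, restoring a single day does not involve any other day's data.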

8. Apply best_compression to day-wise indices

The index template can be modified to set "codec": "best_compression" in index settings i.e.

    "settings": {
      "codec": "best_compression"
    }

Depending on the use case, this could reduce disk usage by 10% to 30% or even more. Actual savings vary with the data.

"codec": "best_compression" CANNOT be applied dynamically to existing open indices. The index needs to be closed first, then the setting applied, and then the index reopened.

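The close, update, reopen sequence for an existing index can be sketched as an ordered list of requests. The helper name is hypothetical and the snippet only builds the requests; while the index is closed it cannot serve reads or writes, so this is best done on past indices.

```python
def best_compression_requests(index: str):
    """Ordered (method, path, body) requests to switch an existing index's codec."""
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", {"index": {"codec": "best_compression"}}),
        ("POST", f"/{index}/_open", None),
    ]


for request in best_compression_requests("my_index-2021.11.04"):
    print(request)
```

For new day-wise indices none of this is needed, since the codec set in the index template applies from creation.
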
9. Force-merge past indices

Since data is written only to the current day's index, all past indices are effectively read-only as long as past data is never updated. Such indices can therefore be force-merged with "max_num_segments": 1. Fewer segments generally make searches on those indices faster.
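
As a small sketch, the helper below builds a force-merge request for every index except the one currently being written to. The helper name and index names are assumptions for illustration, and no request is actually sent.

```python
def forcemerge_requests(index_names, current_index):
    """Return (method, path) pairs for all indices except the active one."""
    return [
        ("POST", f"/{name}/_forcemerge?max_num_segments=1")
        for name in index_names
        if name != current_index
    ]


indices = ["my_index-2021.11.03", "my_index-2021.11.04"]
print(forcemerge_requests(indices, current_index="my_index-2021.11.04"))
```

Skipping the current day's index matters because force-merging an index that is still receiving writes is generally discouraged.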