Sandeep Kanabar for AWS Community Builders

Posted on Nov 6, 2021

Advantages of using time-based indices in OpenSearch

#aws #productivity #performance #opensearch

This post lists a few advantages of using time-based indices in an OpenSearch cluster.

Increasing / Decreasing the number of shards becomes easy
Helps to plan cluster capacity and growth size
Easily determine optimum number of shards
Avoids having to reindex entire data
Efficient Deletion and application of ISM
Easy to include / exclude indices based on alias
Snapshot and Restore becomes a breeze with day-wise indices
Apply best_compression to day-wise indices
Force-merge past indices

1. Increasing / Decreasing the number of shards becomes easy

Say an index template for day-wise indices is configured with 1 shard in the index settings. If the indexing rate is slow or the shard size becomes too large (> 50 GB), the index template can easily be modified to increase number_of_shards to 3 or 5, and the change takes effect from the next day's index. Similarly, if a day-wise index pattern is configured with more shards than required (oversharded), reducing the count is just as easy.
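
As a rough sketch, the template change could be expressed as the request body for the composable index template API (PUT _index_template/<name>). The my_index-* pattern, the replica count and the helper function are assumptions made for illustration; the code only builds the body and does not contact a cluster.

```python
import json


def build_daily_template(index_pattern: str, number_of_shards: int,
                         number_of_replicas: int = 1) -> dict:
    """Build a body for PUT _index_template/<name> (hypothetical helper)."""
    return {
        "index_patterns": [index_pattern],
        "template": {
            "settings": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        },
    }


# Only indices created after the template update pick up the new shard count.
body = build_daily_template("my_index-*", number_of_shards=3)
print(json.dumps(body, indent=2))
```

Existing day-wise indices keep the shard count they were created with, which is why the change shows up with the next day's index.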

2. Helps to plan cluster capacity and growth size

Let's say 100 events per second flow into an OpenSearch cluster and each event averages about 1 KB in size. Thus, per day, there would be:
86400 seconds * 100 events/second = 8,640,000 events.

Since each event averages about 1 KB, the total size of 8,640,000 events is 8,640,000 * 1 KB = 8,640,000 KB, and 8,640,000 KB / (1024 * 1024) = ~8.24 GB.

Rounding up, a day-wise index would therefore be about 9 GB per day without any replicas. With 1 replica, the size per day would be ~18 GB and the size for 30 days would be ~540 GB. This helps with capacity planning and estimating the cluster growth rate.
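
The arithmetic above can be reproduced with a short script. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and it treats the raw event size as the on-disk size, ignoring indexing overhead and compression.

```python
SECONDS_PER_DAY = 86_400


def daily_index_size_gb(events_per_second: float, avg_event_kb: float) -> float:
    """Primary data written per day, in GB (1 GB = 1024 * 1024 KB)."""
    events_per_day = events_per_second * SECONDS_PER_DAY
    return events_per_day * avg_event_kb / (1024 * 1024)


def cluster_storage_gb(daily_gb: float, replicas: int, retention_days: int) -> float:
    """Total storage for primaries plus replicas over the retention period."""
    return daily_gb * (1 + replicas) * retention_days


primary = daily_index_size_gb(events_per_second=100, avg_event_kb=1)
print(round(primary, 2))                             # 8.24
print(round(cluster_storage_gb(primary, 1, 30), 1))  # 494.4
print(cluster_storage_gb(9, 1, 30))                  # 540, using the rounded-up 9 GB
```

The unrounded figure comes to roughly 494 GB for 30 days; the ~540 GB estimate results from rounding the daily size up to 9 GB, which leaves some headroom.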

3. Easily determine optimum number of shards

With a data set of about 9 GB per day, we could start by setting "number_of_shards": 1 in the index template for a day-wise index, since the single primary shard would be about 9 GB, which is quite reasonable. Shards for time-based indices can be in the range of 40-50 GB.
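
One way to turn this reasoning into a starting point is to divide the expected daily index size by a target shard size. The helper below is a hypothetical sketch; the 40 GB default is simply the lower end of the range mentioned above, not a fixed rule.

```python
import math


def primary_shards_for(daily_index_gb: float, target_shard_gb: float = 40.0) -> int:
    """Smallest primary shard count that keeps each shard at or under the target size."""
    return max(1, math.ceil(daily_index_gb / target_shard_gb))


print(primary_shards_for(9))    # 1
print(primary_shards_for(120))  # 3
```

The result is only an initial estimate; indexing throughput and query patterns may justify a different count.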

4. Avoids having to reindex entire data

If the data influx increases, we can set "number_of_shards": 3 in the index template, and this takes effect for tomorrow's day-wise index. The number of shards changes for new indices without any need to reindex existing data.

5. Efficient Deletion and application of ISM

Let's say we need to retain data for up to 90 days. Any day-wise index older than 90 days can then be deleted as a whole. This is far more efficient than purging individual records from indices.
Applying Index State Management (ISM) policies also becomes simpler with time-based indices.
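
A 90-day retention rule could look roughly like the ISM policy body below, which would be sent with PUT _plugins/_ism/policies/<policy_id>. This is a sketch under assumptions: the my_index-* pattern, the state names and the priority value are illustrative, and the field names should be checked against the ISM documentation for the OpenSearch version in use.

```python
import json


def build_retention_policy(index_pattern: str, retention_days: int) -> dict:
    """Build an ISM policy body that deletes indices after retention_days (hypothetical helper)."""
    return {
        "policy": {
            "description": f"Delete indices older than {retention_days} days",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    # Move to the delete state once the index is old enough.
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": f"{retention_days}d"},
                        }
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attaches the policy to newly created indices matching the pattern.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


print(json.dumps(build_retention_policy("my_index-*", 90), indent=2))
```

With an ism_template block in the policy, each new day-wise index that matches the pattern should get the policy attached on creation, so it does not have to be applied by hand every day.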

6. Easy to include / exclude indices based on alias

Let's assume the cluster needs to retain 90 days of data but only needs to search the last 60 days. Aliases to the rescue. In this case, define an alias in the index template so that it is attached to every newly created day-wise index. As soon as an index becomes older than 60 days, the alias is removed from that index. This ensures that at any given point in time, the alias points to a maximum of 60 day-wise indices.
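
The alias clean-up step can be sketched as a function that builds the body for POST _aliases. The my_index- prefix, the yyyy.MM.dd date suffix and the alias name my_index-search are assumptions for illustration; the code only builds the request body.

```python
from datetime import date, datetime, timedelta


def alias_removals(index_names, alias, today, search_window_days=60,
                   prefix="my_index-"):
    """Build a POST _aliases body removing the alias from indices outside the window."""
    cutoff = today - timedelta(days=search_window_days)
    actions = []
    for name in index_names:
        # Parse the date suffix, e.g. "2021.09.01" in "my_index-2021.09.01".
        index_day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        if index_day < cutoff:
            actions.append({"remove": {"index": name, "alias": alias}})
    return {"actions": actions}


body = alias_removals(
    ["my_index-2021.09.01", "my_index-2021.11.04"],
    alias="my_index-search",
    today=date(2021, 11, 6),
)
print(body)
```

Running something like this once a day keeps searches through the alias limited to the most recent 60 day-wise indices, while older indices remain in the cluster until retention removes them.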

7. Snapshot and Restore becomes a breeze with day-wise indices

Say you have an index named my_index-2021.11.04 created on Nov 04, 2021. On Nov 05, 2021, at say 00:45 hours, when data is no longer being written to my_index-2021.11.04, a snapshot named snap-my_index-2021.11.04 could be triggered for that index. This snapshot would contain just my_index-2021.11.04. If the index is deleted and later needs to be restored, it can easily be restored from the snapshot snap-my_index-2021.11.04.
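
The naming scheme above lends itself to a small helper that derives the snapshot and restore requests for the previous day's index. The repository name my_repo is hypothetical and would need to be registered beforehand; the functions only build the method, path and body.

```python
from datetime import date, timedelta


def snapshot_request(today: date, repository: str, prefix: str = "my_index-"):
    """Build the request that snapshots yesterday's day-wise index."""
    previous_day = today - timedelta(days=1)
    index_name = f"{prefix}{previous_day:%Y.%m.%d}"
    path = f"/_snapshot/{repository}/snap-{index_name}"
    # Snapshot only this index and leave out the cluster state.
    body = {"indices": index_name, "include_global_state": False}
    return "PUT", path, body


def restore_request(index_name: str, repository: str):
    """Build the request that restores one index from its own snapshot."""
    path = f"/_snapshot/{repository}/snap-{index_name}/_restore"
    return "POST", path, {"indices": index_name}


print(snapshot_request(date(2021, 11, 5), "my_repo"))
print(restore_request("my_index-2021.11.04", "my_repo"))
```

Because each snapshot holds exactly one day's index, a restore touches only the data that is actually needed.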

8. Apply best_compression to day-wise indices

The index template can be modified to set "codec": "best_compression" in the index settings, i.e.

    "settings": {
      "codec": "best_compression"
    }

Depending on the use case, this could save roughly 10% to 30% of disk space, or even more. Your mileage may vary.

"codec": "best_compression" CANNOT be applied to an existing index while it is open. The index needs to be closed first, then the setting applied, and then the index reopened.

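The close, update and reopen sequence for an existing index could be sketched as the ordered list of requests below. The helper name is made up for illustration, and the code only describes the calls rather than sending them.

```python
def codec_change_steps(index_name: str):
    """Ordered (method, path, body) requests to switch an existing index to best_compression."""
    return [
        ("POST", f"/{index_name}/_close", None),
        ("PUT", f"/{index_name}/_settings", {"index": {"codec": "best_compression"}}),
        ("POST", f"/{index_name}/_open", None),
    ]


for method, path, body in codec_change_steps("my_index-2021.11.04"):
    print(method, path, body)
```

Segments that already exist keep their old compression until they are rewritten by a merge, so the disk savings on an older index may only appear after a force-merge.
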
9. Force-merge past indices

Since data is written only to the current day's index, all past indices are effectively read-only as long as past data is never updated. Such indices can be force-merged with max_num_segments=1, which reduces each shard to a single segment and can noticeably speed up searches.
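
A minimal sketch of selecting the indices to force-merge: every day-wise index except the one for the current day gets a POST to its _forcemerge endpoint. The my_index- prefix and date format are assumptions; the function only builds the request paths.

```python
from datetime import date


def forcemerge_paths(index_names, today: date, prefix: str = "my_index-"):
    """Build POST paths that force-merge every index except today's."""
    current_index = f"{prefix}{today:%Y.%m.%d}"
    return [
        f"/{name}/_forcemerge?max_num_segments=1"
        for name in index_names
        if name != current_index  # today's index is still being written to
    ]


paths = forcemerge_paths(
    ["my_index-2021.11.04", "my_index-2021.11.05"], today=date(2021, 11, 5)
)
print(paths)
```

Force-merging is resource-intensive, so it is usually better run during quiet hours and only on indices that no longer receive writes.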

Top comments (2)

Andrea Florio • Sep 13 '22

I fully agree with everything described in here but I have a challenge to daily assign my policy to the newly created index.

With elasticsearch i can leverage logstash output plugin for elasticsearch and configure the policy I want in there, but how do I achieve the same thing with opensearch?

My indexes are created without policy and we must daily apply the policy to the new index.

Thank you

Sandeep Kanabar AWS Community Builders • Nov 18 '22

Hey @anubisg1 sorry I missed your reply amidst a plethora of notifs. Did you figure it out? You'll need to create indices with policy and then it will auto apply to all matching indices. Is there a reason you cannot create a policy?