Best Practice of Using ElasticSearch

#elasticsearch #database #tutorial #architecture

Last time, we introduced some tips for boosting ElasticSearch performance.

https://medium.com/better-programming/boosting-elasticsearch-cluster-performance-3-proven-tips-9b718a9114bc

In addition, that article explains the underlying details of ElasticSearch. This time, we are going to talk a little more about best practices for using ElasticSearch.

These practices are general recommendations and can be applied to any use case. Let's go.

Bulk Requests: The Bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed. Each subrequest is executed independently, so the failure of one subrequest won’t affect the success of the others. If any of the requests fail, the top-level errors flag in the response is set to true and the error details are reported under the relevant request.
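As a rough illustration of the paragraph above, here is a minimal sketch using only the Python standard library. It shows how a bulk body is laid out and how failed subrequests can be picked out of a response. The index name "logs", the helper names build_bulk_body and failed_items, and the sample response are all assumptions made for this example. It also assumes a recent version where the action line needs no type, and it leaves out actually sending the body to the _bulk endpoint.

```python
import json

def build_bulk_body(index, docs):
    """Serialize docs into the newline-delimited JSON the Bulk API expects."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # source line
    return "\n".join(lines) + "\n"  # the body must end with a newline

def failed_items(response):
    """Return (position, error) pairs for the subrequests that failed."""
    if not response.get("errors"):
        return []
    failures = []
    for pos, item in enumerate(response["items"]):
        result = next(iter(item.values()))  # keyed by action type, e.g. "index"
        if "error" in result:
            failures.append((pos, result["error"]))
    return failures

body = build_bulk_body("logs", [{"msg": "a"}, {"msg": "b"}])

# Hand-written sample in the shape of a bulk response (illustrative only).
sample_response = {
    "errors": True,
    "items": [
        {"index": {"status": 201}},
        {"index": {"status": 400, "error": {"type": "mapper_parsing_exception"}}},
    ],
}
failures = failed_items(sample_response)
```

Checking the top-level flag first avoids walking the items list when every subrequest succeeded.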
Multithreaded Clients to Index Data: A single thread sending bulk requests is unlikely to be able to max out the indexing capacity of an ElasticSearch cluster. In order to use all resources of the cluster, you should send data from multiple threads or processes. In addition to making better use of the resources of the cluster, this should help reduce the cost of each fsync. Both the index data and the transaction log are periodically flushed to disk; when several threads write concurrently, each flush covers more data, so the same volume is indexed with fewer disk I/O operations.
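One possible way to fan bulk requests out over several threads is sketched below with the standard library's ThreadPoolExecutor. The send_bulk function is a hypothetical stand-in for a real bulk call and only counts documents. The batch size and worker count are arbitrary assumptions that you would tune against your own cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def send_bulk(batch):
    """Hypothetical stand-in for one bulk request; returns the number of docs sent."""
    return len(batch)

def index_in_parallel(docs, batch_size=500, workers=4):
    """Send batches from several threads and return the total number of docs sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_bulk, chunked(docs, batch_size)))

total = index_in_parallel([{"n": i} for i in range(2000)])
```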
index.refresh_interval: By default, ElasticSearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. This is the optimal configuration if you have no or very little search traffic (e.g. less than one search request every 5 minutes) and want to optimize for indexing speed. This behavior aims to automatically optimize bulk indexing in the default case when no searches are performed. To opt out of this behavior, set the refresh interval explicitly. On the other hand, if your index receives regular search requests, this default behavior means that ElasticSearch will refresh your index every second. If you can afford to increase the amount of time between when a document gets indexed and when it becomes visible, increasing index.refresh_interval to a larger value, e.g. 30s, might help improve indexing speed.
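For reference, the request body for changing this setting is small. The sketch below only builds the JSON that would be sent to the index settings endpoint (PUT /<index>/_settings). The helper name refresh_settings and the idea of switching back to 1s after a heavy load are assumptions for illustration.

```python
import json

def refresh_settings(interval):
    """Build the body for PUT /<index>/_settings with the given refresh interval."""
    return {"index": {"refresh_interval": interval}}

# Relax refreshes while a heavy indexing job runs, then go back to one second.
during_load = json.dumps(refresh_settings("30s"))
after_load = json.dumps(refresh_settings("1s"))
```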
Auto-Generated IDs: When indexing a document that has an explicit id, ElasticSearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, ElasticSearch can skip this check, which makes indexing faster.
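In a bulk request, the difference comes down to whether the action line carries an _id. A small sketch follows; the helper name action_line and the index name are hypothetical.

```python
import json

def action_line(index, doc_id=None):
    """Bulk action line; leaving out _id lets ElasticSearch generate one."""
    meta = {"_index": index}
    if doc_id is not None:
        meta["_id"] = doc_id
    return json.dumps({"index": meta})

auto_id = action_line("logs")                   # id generated by ElasticSearch
explicit_id = action_line("logs", "order-42")   # triggers the existence check
```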
index.translog.sync_interval: This parameter determines how often the translog is fsynced to disk and committed, regardless of write operations. Defaults to 5s. Values less than 100ms are not allowed.
index.translog.flush_threshold_size: The translog stores all operations that are not yet safely persisted in Lucene (i.e., are not part of a Lucene commit point). Although these operations are available for reads, they will need to be replayed if the shard was stopped and had to be recovered. This setting controls the maximum total size of these operations, to prevent recoveries from taking too long. Once the maximum size has been reached a flush will happen, generating a new Lucene commit point. Defaults to 512mb.
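The two translog settings above could be supplied together when an index is created. The sketch below builds such a body and enforces the 100ms lower bound mentioned earlier. The helper name translog_settings is an assumption, and whether each setting can also be changed on a live index depends on your version, so check before relying on that.

```python
def translog_settings(sync_interval_ms=5000, flush_threshold="512mb"):
    """Build a create-index body carrying the two translog settings."""
    if sync_interval_ms < 100:
        # ElasticSearch rejects sync intervals shorter than 100ms.
        raise ValueError("index.translog.sync_interval must be at least 100ms")
    return {
        "settings": {
            "index": {
                "translog": {
                    "sync_interval": f"{sync_interval_ms}ms",
                    "flush_threshold_size": flush_threshold,
                }
            }
        }
    }

defaults = translog_settings()
tuned = translog_settings(sync_interval_ms=10000, flush_threshold="1gb")
```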
Large Documents: Large documents put more stress on network, memory usage and disk. Indexing large documents can use an amount of memory that is a multiple of the original size of the document. Proximity search (phrase queries, for instance) and highlighting also become more expensive, since their cost directly depends on the size of the original document.
Set Index Mapping Explicitly: ElasticSearch can create mappings dynamically, but the result might not be suitable for all scenarios. For example, since ElasticSearch 5.x a dynamically mapped string field is indexed both as a "text" field and as a "keyword" sub-field. Indexing it twice is unnecessary in a lot of scenarios.
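A minimal sketch of an explicit mapping is shown below, written as the body you might send when creating an index. The field names are made up for illustration. The layout assumes a typeless mapping (7.x and later). Setting dynamic to "strict" is one optional way to reject fields that are not declared.

```python
import json

# Each string field gets exactly one type instead of the text + keyword pair.
explicit_mapping = {
    "mappings": {
        "dynamic": "strict",  # reject documents with undeclared fields
        "properties": {
            "status": {"type": "keyword"},   # exact-match filtering only
            "message": {"type": "text"},     # full-text search only
            "created_at": {"type": "date"},
        },
    }
}

create_index_body = json.dumps(explicit_mapping)
```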
Index Mapping - Nested Types: Querying nested fields is slower than querying fields in the parent document. Retrieval of matching nested fields adds an additional slowdown. Once you update any field of a document containing nested fields, independent of whether you updated a nested field or not, all the underlying Lucene documents (parent and all its nested children) need to be marked as deleted and rewritten. In addition to slowing down your updates, such an operation also creates garbage to be cleaned up by segment merging later on.
Index Mapping - Disable the _all Field: The _all field concatenates the values of all other fields into one string. It requires more CPU and disk space than other fields. Most use cases don't require the _all field. If you do need to search several fields at once, you can concatenate just those fields using the copy_to parameter. The _all field is disabled by default in ElasticSearch versions 6.0 and later. To disable the _all field in earlier versions, set enabled to false.
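As a sketch of the copy_to alternative, the mapping below funnels two fields into one combined field that can then be searched on its own. The field names first_name, last_name and full_name are only examples, and the layout again assumes a typeless mapping.

```python
copy_to_mapping = {
    "mappings": {
        "properties": {
            "first_name": {"type": "text", "copy_to": "full_name"},
            "last_name": {"type": "text", "copy_to": "full_name"},
            "full_name": {"type": "text"},  # receives both values at index time
        }
    }
}

def sources_of(mapping, target):
    """List the fields whose values are copied into `target`."""
    props = mapping["mappings"]["properties"]
    return sorted(name for name, spec in props.items() if spec.get("copy_to") == target)

feeding_full_name = sources_of(copy_to_mapping, "full_name")
```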
Leverage Index Templates: Index templates define settings like number of shards, replicas and mappings that you can automatically apply when creating new indices. ElasticSearch applies templates to new indices based on an index pattern that matches the index name.
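A template body might look like the sketch below. It follows the composable template layout of newer versions (settings and mappings nested under "template"); older versions use a flatter legacy layout, so treat the exact shape as an assumption to verify against your version. The pattern, shard counts and field are illustrative. The template_applies helper only approximates wildcard matching locally with fnmatch to show the idea.

```python
from fnmatch import fnmatch

logs_template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {"number_of_shards": 2, "number_of_replicas": 1},
        "mappings": {"properties": {"message": {"type": "text"}}},
    },
}

def template_applies(index_name, body):
    """Approximate the index pattern match locally (illustration only)."""
    return any(fnmatch(index_name, pattern) for pattern in body["index_patterns"])

matches_logs = template_applies("logs-2024.01.01", logs_template)
matches_metrics = template_applies("metrics-2024.01.01", logs_template)
```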
Use Replicas for Scalability & Resilience: ElasticSearch is built to be always available and to scale with your needs. It does this by being distributed in nature. You can add nodes to a cluster to increase capacity, and ElasticSearch automatically distributes your data and query load across all of the available nodes. For ElasticSearch to be highly available, its indices need to have fault tolerance in place. This can be achieved using replica shards. A replica shard is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.
Shard Sizing: A shard is a Lucene index under the covers, which uses file handles, memory, and CPU cycles. Before version 7.0, a new index defaulted to 5 primary shards with one replica each; from 7.0 onward the default is 1 primary shard with one replica. The goal of choosing a number of shards is to distribute an index evenly across all data nodes in the cluster. However, these shards shouldn't be too large or too numerous. A good rule of thumb is to try to keep shard size between 10–50 GB. Large shards can make it difficult for ElasticSearch to recover from failure, but because each shard uses some amount of CPU and memory, having too many small shards can cause performance issues and out-of-memory errors.
Keep the shard count of an index a multiple of the data node count, with shards of similar size distributed across the nodes. Set the primary shard count, via a template, by targeting at most 50 GB per primary shard (log analytics) or 30 GB (search use cases). Shard assignment across data nodes follows two important rules:
- The primary and replica copies of the same shard are never assigned to the same data node.
- A shard is placed on a node based on how many shards that node already holds, so as to equalize the number of shards per index across all nodes in the cluster.
Also note that some nodes may still end up with bigger shards and others with smaller ones. It is recommended that the total shard count of an index (primary + replica) be a multiple of the data node count. For example, in a 4-node cluster the total shards (primary + replica) for your index should be 4, 8, 12, and so on. This ensures that the data is evenly distributed across the nodes.
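The sizing rules above can be turned into a small calculation. The sketch below is one reading of them, not an official formula. It starts from the fewest primaries that keep each shard under the size cap, then adds primaries until the total (primary + replica) is a multiple of the data node count. The function name and sample numbers are assumptions.

```python
import math

def primary_shard_count(index_size_gb, max_shard_gb, data_nodes, replicas=1):
    """Smallest primary count that respects the size cap and spreads evenly."""
    primaries = max(1, math.ceil(index_size_gb / max_shard_gb))
    # Grow until primaries * (1 + replicas) is a multiple of the node count.
    while (primaries * (1 + replicas)) % data_nodes != 0:
        primaries += 1
    return primaries

# 300 GB of logs, 50 GB cap, 4 data nodes, 1 replica.
log_primaries = primary_shard_count(300, 50, 4)
# 90 GB search index, 30 GB cap, 4 data nodes, 1 replica.
search_primaries = primary_shard_count(90, 30, 4)
```

With these inputs the log index gets 6 primaries (12 shards in total) and the search index gets 4 primaries (8 shards in total), both multiples of 4.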
Index State Management: ISM lets you define custom management policies to automate routine tasks and apply them to indices and index patterns. You no longer need to set up and manage external processes to run your index operations. A policy contains a default state and a list of states for the index to transition between. Within each state, you can define a list of actions to perform and conditions that trigger these transitions. A typical use case is to periodically delete old indices after a certain period of time.
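To make the policy structure concrete, here is a sketch of a "delete after 30 days" policy. The shape follows the ISM plugin's policy format as found in Open Distro and OpenSearch, which is an assumption about your deployment. The state names and the 30-day threshold are made up. The undefined_states helper is a hypothetical sanity check, not part of ISM.

```python
delete_old_policy = {
    "policy": {
        "description": "Delete indices older than 30 days (illustrative)",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "30d"}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
    }
}

def undefined_states(body):
    """Return state names that are referenced but never defined."""
    policy = body["policy"]
    defined = {state["name"] for state in policy["states"]}
    referenced = {policy["default_state"]}
    for state in policy["states"]:
        referenced.update(t["state_name"] for t in state["transitions"])
    return sorted(referenced - defined)

dangling = undefined_states(delete_old_policy)
```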
Organize Data in Indices by Date: For most logging or monitoring use cases, we can create daily, weekly, or monthly indices and then build the list of indices that covers a specified date range. ElasticSearch then only needs to query a smaller dataset instead of the whole dataset. In addition, it is easy to shrink or delete old indices once their data has expired.
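Building the index list for a date range is simple. The sketch below assumes daily indices named with a prefix plus a yyyy.MM.dd suffix. That naming scheme is a common convention but still an assumption here, as is the helper name daily_indices.

```python
from datetime import date, timedelta

def daily_indices(prefix, start, end):
    """Names of the daily indices covering start..end, both inclusive."""
    days = (end - start).days
    return [
        f"{prefix}-{(start + timedelta(days=offset)).strftime('%Y.%m.%d')}"
        for offset in range(days + 1)
    ]

names = daily_indices("logs", date(2024, 1, 30), date(2024, 2, 1))
# A search can then target just these indices, e.g. ",".join(names).
```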
Use Curator to Rotate Data: Curator offers numerous filters to help you identify indices and snapshots that meet certain criteria, such as indices created more than 60 days ago or snapshots that failed to complete.
