Viktor Ardelean

Unleashing OpenSearch: Best Practices for 1 Billion Documents on AWS

1. Introduction

Setting up an OpenSearch cluster in AWS to handle large data volumes requires careful configuration to ensure optimal performance, scalability, and availability.

In this blog post, we will explore important considerations and recommendations for configuring an OpenSearch cluster specifically designed to manage 1 billion documents, each sized at 2KB.

By following these recommendations, such as properly allocating resources and implementing appropriate indexing strategies, we can ensure efficient ingestion and management of the 2TB data set.

This will enable us to harness the full potential of OpenSearch and effectively process and analyze the extensive data at hand.

2. Optimizing Performance and Scalability

2.1 Dedicated Master Nodes

It is recommended to use Multi-AZ with Standby and deploy three dedicated master nodes for optimal stability and fault tolerance.

Avoid choosing an even number of dedicated master nodes to ensure the necessary quorum for electing a new master in case of failures. Three dedicated master nodes offer two backup nodes and the required quorum, providing a reliable configuration.

The following table provides recommended minimum dedicated master instance types for efficient management of OpenSearch clusters, considering factors like instance count and shard count.

| Instance count | Master node RAM size | Maximum supported shard count | Recommended minimum dedicated master instance type |
|---|---|---|---|
| 1–10 | 8 GiB | 10K | m5.large.search or m6g.large.search |
| 11–30 | 16 GiB | 30K | c5.2xlarge.search or c6g.2xlarge.search |
| 31–75 | 32 GiB | 40K | r5.xlarge.search or r6g.xlarge.search |
| 76–125 | 64 GiB | 75K | r5.2xlarge.search or r6g.2xlarge.search |
| 126–200 | 128 GiB | 75K | r5.4xlarge.search or r6g.4xlarge.search |

Based on our specific use case, we can opt for a configuration consisting of three m6g.large.search master nodes.
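
As a rough sketch of how this could be provisioned with boto3 (the domain name, region, engine version, and data-node settings below are placeholder assumptions, and the MultiAZWithStandbyEnabled flag requires a reasonably recent boto3 version):

import boto3

domain_client = boto3.client("opensearch", region_name="eu-west-1")

domain_client.create_domain(
    DomainName="documents",
    EngineVersion="OpenSearch_2.11",
    ClusterConfig={
        # Three dedicated master nodes: quorum survives the loss of one node.
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "m6g.large.search",
        "DedicatedMasterCount": 3,
        # Multi-AZ with Standby spreads the domain across three Availability Zones.
        "MultiAZWithStandbyEnabled": True,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
        # Data node type and count are discussed in sections 2.2 and 3.3.
        "InstanceType": "m6g.large.search",
        "InstanceCount": 3,
    },
)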

2.2 Scaling Data Nodes

When dealing with substantial data volumes, such as the massive 2TB dataset we have, it becomes crucial to appropriately scale the number of data nodes in our OpenSearch cluster.

Scaling the data nodes allows us to distribute the data across multiple nodes, ensuring efficient storage, retrieval, and processing of the extensive dataset.

To efficiently manage a 2TB dataset in OpenSearch, we can start with three data nodes and conduct thorough performance testing.

Monitoring metrics such as indexing throughput and query response times during testing will help determine whether additional nodes are necessary.

By gradually adding nodes based on performance analysis, we can effectively distribute the workload and ensure optimal dataset management.
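
If testing shows the initial nodes are saturated, the data tier can be scaled out with a configuration update. A minimal sketch, using the same placeholder domain name as in section 2.1:

import boto3

domain_client = boto3.client("opensearch", region_name="eu-west-1")

# Grow the data tier; OpenSearch Service rebalances shards onto the new nodes.
# With Multi-AZ with Standby, the data node count generally needs to remain a
# multiple of the number of Availability Zones.
domain_client.update_domain_config(
    DomainName="documents",
    ClusterConfig={"InstanceCount": 6},
)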

2.3 Shard Count

To optimize search performance in OpenSearch, careful consideration of the shard count in our index is crucial. Distributing an index over more shards lets work be parallelized across data nodes, which is particularly valuable for large datasets, but each shard also carries overhead, so shard size matters. For search-focused workloads, shard sizes between 10 and 30 GB are recommended, while 30 to 50 GB works well for write-heavy scenarios.

For instance, let's consider a use case where both reads and writes occur equally. In such cases, aiming for a shard size of around 30 GB is ideal.

We can approximate the number of primary shards required using the formula (source_data) * (1 + indexing_overhead) / desired_shard_size.

For our example, with 2TB (2,000 GB) of source data and a 10% indexing overhead, the approximate number of primary shards would be 2,000 * 1.1 / 30 ≈ 73.33.

To ensure an even distribution of shards across our three data nodes, a suitable shard count would be 72. This allocation helps balance the workload and maximizes the utilization of resources within our OpenSearch cluster.
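
As a small illustration of this calculation (plain arithmetic, no OpenSearch dependencies), rounding down to a multiple of the data node count for an even distribution:

import math

def primary_shard_count(source_gb: float, desired_shard_gb: float,
                        data_nodes: int, indexing_overhead: float = 0.10) -> int:
    """Approximate primary shard count, rounded to a multiple of the data node count."""
    raw = source_gb * (1 + indexing_overhead) / desired_shard_gb  # 2000 * 1.1 / 30 ≈ 73.3
    # Round down to the nearest multiple of the node count so shards spread evenly.
    return max(data_nodes, math.floor(raw / data_nodes) * data_nodes)

print(primary_shard_count(2000, 30, 3))  # -> 72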

As our data volume grows, adjusting the shard count to maintain optimal performance becomes essential. Adapting the shard count based on the evolving data volume ensures efficient distribution and processing of data across the cluster. We can achieve optimal search performance in OpenSearch by continuously monitoring and optimizing the shard count.

2.4 Adding Replicas

In OpenSearch, replica shards are exact copies of primary shards within an index distributed across data nodes in a cluster. The presence of replica shards ensures data redundancy and increases read capacity.

When a document is indexed, the request is routed to its primary shard and then replicated to that shard's replicas, so every copy receives the write. A search query, by contrast, only needs to hit one copy (primary or replica) of each shard, which is why replicas add some write overhead but increase read capacity.

Using at least one replica is strongly recommended, as it enhances redundancy and improves read capacity. Additional replicas increase these benefits, offering even greater data redundancy and improving the ability to handle read operations efficiently.
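
To make this concrete, the shard and replica counts are fixed in the index settings when the index is created. A minimal sketch using the opensearch-py client; the endpoint, credentials, and index name are placeholders, and a production setup would typically use SigV4 request signing instead of basic auth:

from opensearchpy import OpenSearch

search_client = OpenSearch(
    hosts=[{"host": "search-documents-xxxxxxxx.eu-west-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "change-me"),  # placeholder; prefer SigV4 signing in practice
    use_ssl=True,
)

# 72 primary shards (section 2.3) and 2 replicas: three copies of every shard.
search_client.indices.create(
    index="documents-v1",
    body={
        "settings": {
            "index": {
                "number_of_shards": 72,
                "number_of_replicas": 2,
            }
        }
    },
)

Note that the replica count can be changed later through an index settings update, whereas the primary shard count is fixed at index creation; changing it requires a reindex or the split API.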

3. Storage Considerations

3.1 Estimating Storage Requirements

To account for replicas and other storage considerations in OpenSearch, and assuming two replicas per shard, we must allocate approximately 6 TB of raw index storage (2 TB for the primary shards plus 2 TB for each of the two replica copies).

However, it's important to consider additional factors that affect storage requirements. OpenSearch Service reserves 20% of storage space for segment merges, logs, and other internal operations, with a maximum of 20 GiB per instance. This means that the total reserved space can vary depending on the number of instances in your domain.

To calculate the minimum storage requirement for our example with 2 replicas and 2TB of source data, we can use the simplified formula provided:

Source data * (1 + number of replicas) * 1.45 = minimum storage requirement

Substituting the values:

2TB * (1 + 2) * 1.45 = 8.7TB

Therefore, the minimum storage requirement for replicas and other factors would be approximately 8.7TB.
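
Expressed as a small helper (plain arithmetic; the three-node split at the end is only illustrative):

def minimum_storage_tb(source_tb: float, replicas: int, overhead: float = 1.45) -> float:
    """Minimum storage = source data * (1 + replicas) * 1.45 (indexing + reserved space)."""
    return source_tb * (1 + replicas) * overhead

total_tb = minimum_storage_tb(2, replicas=2)  # 2 * 3 * 1.45 = 8.7 TB for the cluster
per_node_tb = total_tb / 3                    # ~2.9 TB of EBS per node with three data nodes
print(round(total_tb, 1), round(per_node_tb, 2))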

3.2 Handling Full Reindex

Handling a full reindex is an important aspect of managing data in OpenSearch. In scenarios where we need to reindex our data, it becomes necessary to keep two indices concurrently: one containing the current data and another for the newly indexed data.

This approach allows us to perform the reindexing process seamlessly without any downtime. However, it's essential to consider that this scenario requires allocating twice the storage estimated in the previous section.

To simultaneously accommodate both indices, we need to ensure sufficient storage capacity to store the existing and newly reindexed data.

Considering the increased storage requirement, we must allocate a minimum of 2 * 8.7TB = 17.4TB.

Additionally, it's worth noting that Amazon OpenSearch Service allows EBS storage to be scaled dynamically during a full reindex. This means we can add storage capacity specifically for the reindexing process and scale it back down once the reindex is complete. This dynamic allocation prevents us from continuously paying for unused storage and optimizes cost efficiency.
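
As a sketch of the reindex step itself, reusing the same placeholder endpoint and index naming as in section 2.4:

from opensearchpy import OpenSearch

search_client = OpenSearch(
    hosts=[{"host": "search-documents-xxxxxxxx.eu-west-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "change-me"),  # placeholder credentials
    use_ssl=True,
)

# Kick off the reindex asynchronously; for 2TB of data this runs far longer than
# an HTTP timeout, so poll the returned task instead of blocking the request.
task = search_client.reindex(
    body={
        "source": {"index": "documents-v1"},
        "dest": {"index": "documents-v2"},
    },
    params={"wait_for_completion": "false"},
)
print(task)  # contains a task ID that can be checked with the tasks API

# After validating documents-v2, delete documents-v1 and scale the EBS volumes
# back down via update_domain_config (EBSOptions) to stop paying for the headroom.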

3.3 Instance Type Selection

When selecting hardware for our OpenSearch cluster, it's crucial to consider storage requirements, shard count, and workload characteristics. The number of shards per data node should align with the node's JVM heap memory, typically aiming for 25 shards or fewer per GiB of heap.

To ensure efficient processing, a common initial scale point is 1.5 vCPUs per shard. For instance, with 72 primary shards, we would need approximately 72 * 1.5 ≈ 108 vCPUs.

To accommodate this, scaling our data nodes to 5 would be suitable. Counting the two replicas, the index has roughly 216 shard copies in total, or about 43 shards per node across five nodes. In this case, selecting a robust instance type like the m6g.12xlarge.search, with 48 vCPUs and 192 GiB of RAM, would be advisable.

However, additional instances may be required if performance falls short of expectations, tests fail, or the CPUUtilization or JVMMemoryPressure metrics stay high. As instances are added, OpenSearch automatically rebalances shards across the cluster, helping to balance the workload and optimize performance.
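
A quick back-of-the-envelope check of that sizing (the 32 GiB JVM heap figure is an assumption for illustration, not a value from the service):

def sizing_check(primary_shards: int, replicas: int, data_nodes: int,
                 vcpus_per_node: int, heap_gib_per_node: float) -> None:
    """Rough checks: ~1.5 vCPUs per primary shard and <= 25 shards per GiB of heap."""
    total_shards = primary_shards * (1 + replicas)
    print(f"vCPUs needed vs available: {primary_shards * 1.5:.0f} vs {data_nodes * vcpus_per_node}")
    print(f"shards per node:           {total_shards / data_nodes:.1f} "
          f"(limit ~{25 * heap_gib_per_node:.0f} with {heap_gib_per_node} GiB heap)")

# 72 primaries with 2 replicas on 5 x m6g.12xlarge.search data nodes (48 vCPUs each).
sizing_check(72, replicas=2, data_nodes=5, vcpus_per_node=48, heap_gib_per_node=32)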

4. Monitoring

4.1 Monitor Resource Utilization

It is essential to implement continuous monitoring of CPU, memory, and storage usage in our OpenSearch cluster to ensure optimal performance. By monitoring these metrics, we can identify any resource bottlenecks or imbalances and take appropriate actions to address them.

Regularly reviewing resource utilization allows us to make informed decisions regarding resource allocation, ensuring that our cluster operates efficiently and effectively.

4.2 Configure CloudWatch Alarms

Setting up CloudWatch alarms is a proactive approach to staying informed about the health and performance of our OpenSearch cluster.

By defining thresholds for key metrics such as CPU utilization, storage usage, and search latency, we can receive timely alerts when any of these metrics breach the specified limits. These alarms enable us to quickly identify and address potential issues before impacting overall cluster performance.

Regularly reviewing the cluster and instance metrics provides valuable insights into the behavior and patterns of our cluster, allowing us to fine-tune and optimize its configuration for optimal performance.
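
For example, a sketch of one such alarm with boto3; the domain name, account ID, and SNS topic ARN are placeholders, and the same pattern applies to metrics like FreeStorageSpace, JVMMemoryPressure, and ClusterStatus.red:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Alarm on sustained high CPU across the data nodes.
cloudwatch.put_metric_alarm(
    AlarmName="opensearch-cpu-high",
    Namespace="AWS/ES",  # OpenSearch Service metrics are published under AWS/ES
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "DomainName", "Value": "documents"},
        {"Name": "ClientId", "Value": "123456789012"},
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:opensearch-alerts"],
)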

5. Deployment Best Practices

5.1 Multi-AZ Deployment

We should deploy our data nodes across multiple Availability Zones (AZs) to achieve high availability and fault tolerance.

By evenly distributing our data nodes across AZs, we ensure redundancy and resilience in our OpenSearch cluster.

This approach safeguards against AZ failures, reducing the risk of downtime and ensuring the continuous operation of our cluster.

5.2 Subnet Distribution

It is advisable to place our data nodes in multiple subnets distributed across different AZs to ensure high availability and fault tolerance.

Distributing our data nodes across subnets enhances the cluster's resilience to network-related issues within a specific AZ. This practice improves fault isolation capabilities and minimizes the impact of potential subnet-level disruptions.
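
In practice this means passing one subnet per Availability Zone when the domain is created. A sketch of the VPC-related portion of the create_domain call from section 2.1; the subnet and security group IDs are placeholders:

import boto3

domain_client = boto3.client("opensearch", region_name="eu-west-1")

# One private subnet per Availability Zone. Combined with zone awareness
# (section 2.1), OpenSearch Service spreads data nodes and shard copies
# across all three AZs.
domain_client.create_domain(
    DomainName="documents",
    VPCOptions={
        "SubnetIds": [
            "subnet-aaaa1111",  # private subnet in AZ a
            "subnet-bbbb2222",  # private subnet in AZ b
            "subnet-cccc3333",  # private subnet in AZ c
        ],
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
    # ... plus the ClusterConfig shown in section 2.1 ...
)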

6. Enable Log Publishing

To effectively troubleshoot performance, stability, and user-activity issues in our OpenSearch cluster, we should enable log publishing. OpenSearch error logs, search slow logs, indexing slow logs, and audit logs can then be directed to CloudWatch Logs for analysis.
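
A minimal sketch with boto3; the log group ARNs are placeholders, the log groups must already exist with a resource policy that allows OpenSearch Service to write to them, and audit logs additionally require fine-grained access control on the domain:

import boto3

domain_client = boto3.client("opensearch", region_name="eu-west-1")

log_arn = "arn:aws:logs:eu-west-1:123456789012:log-group:/aws/opensearch/documents"

domain_client.update_domain_config(
    DomainName="documents",
    LogPublishingOptions={
        "ES_APPLICATION_LOGS": {"CloudWatchLogsLogGroupArn": f"{log_arn}-error", "Enabled": True},
        "SEARCH_SLOW_LOGS": {"CloudWatchLogsLogGroupArn": f"{log_arn}-search-slow", "Enabled": True},
        "INDEX_SLOW_LOGS": {"CloudWatchLogsLogGroupArn": f"{log_arn}-index-slow", "Enabled": True},
        "AUDIT_LOGS": {"CloudWatchLogsLogGroupArn": f"{log_arn}-audit", "Enabled": True},
    },
)

Note that the slow logs only produce output once thresholds are configured on the relevant indices (for example, index.search.slowlog.threshold.query.warn).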

7. Restrict Wildcard Usage

We must implement access controls and permissions in our OpenSearch cluster to ensure data security and prevent accidental data loss.

Specifically, we must enforce restrictions on destructive wildcard operations by requiring explicit action names.

This measure mitigates the risk of unintentional data loss by ensuring that users must explicitly specify the action name when performing destructive operations.

By making the following API call, we enable the setting that ensures a specific action name is required for destructive operations:

PUT /_cluster/settings
{
  "persistent": {
    "action.destructive_requires_name": true
  }
}

8. Conclusion

Careful planning and configuration are essential when setting up our OpenSearch cluster in AWS to handle 1 billion documents. By following these best practices, we can optimize our cluster's performance, scalability, and availability, ensuring efficient management of large data volumes. Implementing dedicated master nodes, scaling data nodes, configuring shards and replicas, estimating storage requirements, and monitoring resource utilization are key factors for a robust and reliable OpenSearch deployment.

Customizing the calculations and configurations mentioned in this blog post is important according to our specific requirements and workload characteristics. Regular monitoring, log analysis, and optimization will enable us to maintain a high-performing OpenSearch cluster capable of effectively handling extensive data volumes.
