
Amazon OpenSearch Service - Data node's Maximum Memory Utilization (Percent) stays around 90%


I previously used t3.small.search and t3.medium.search, and in both cases, the data was lost (both index and all documents) approximately 3 days after the instances were created. The data was small — only one index with about 15 documents, totaling less than 1 GB.

After that, I tried the following configuration:

  • 1 Availability Zone
  • 1 Data Node
  • 0 Master Nodes
  • 0 Coordinating Nodes
  • No indexes or documents created

With this setup, I tested the following instance types:

m7g.medium.search, m7g.large.search, m7g.xlarge.search, and m7g.2xlarge.search

As the instance type increased in capacity, I observed three things:

  1. The Maximum CPU Utilization (Percent) of the data node decreased, which aligns with my expectations.
  2. Young GC Time continued to increase steadily. Although the growth rate slowed down with larger instance types, it still kept increasing.
  3. Most surprisingly, the Maximum Memory Utilization (Percent) remained around 90% regardless of the instance type.

This is concerning to me. Fortunately, our team is still in the development and testing phase. However, I’m worried that we might face the same issue in a production environment.

I would like to clarify the following:

  1. Under what conditions can data loss occur? Is it related to CPU or memory usage?
  2. Does continuously increasing Young GC Time impact data persistence? How can I prevent this from happening?
  3. Does having Maximum Memory Utilization constantly over 90% pose a risk to data persistence? How can I prevent this?
  4. Why does Maximum Memory Utilization stay around 90% even when no indexes or documents have been created?

Any insights would be greatly appreciated. Thank you!

asked 10 months ago, 383 views
1 Answer

The memory utilization you're seeing on Amazon OpenSearch Service is expected and not, by itself, a cause for concern. To take your questions in order:

  1. Data loss conditions: High memory or CPU usage does not by itself cause data loss in OpenSearch. The loss you experienced on t3.small.search and t3.medium.search is more likely tied to the setup. These are burstable instance types that aren't recommended for production workloads, and their limited resources can be exhausted quickly, potentially leading to cluster instability. If those domains also ran on a single data node, there was no room for replica shards, so no second copy of the index existed if that node failed.

  2. Young GC Time: This metric tracks the time the JVM has spent on young-generation garbage collection as a running total, so a steadily rising value is a normal part of operation. Excessive GC can hurt performance, but it does not directly cause data loss. The slower growth you saw on larger instances is expected.

  3. Memory utilization at 90%: High memory utilization is by design. OpenSearch Service assigns half of an instance's RAM to the JVM heap (up to 32 GiB), and the rest is shared by the operating system, the file system cache, and service processes, so the metric stays high on any instance size, even without data. The AWS documentation describes high values for this metric as normal and points to JVMMemoryPressure as the better indicator of memory trouble. On its own, 90% utilization does not put data persistence at risk on an appropriately sized instance.

  4. High memory utilization without data: OpenSearch reserves memory up front for the JVM heap (which also holds caches such as the field data cache) and for system processes. That reservation happens whether or not any index or document exists, which is why you see high utilization on an empty domain.
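To make points 3 and 4 concrete, here is a minimal sketch of the heap-sizing rule (half of RAM, capped at 32 GiB). The RAM figures are the standard m7g sizes and are assumed to apply to the `.search` variants as well; the function names are made up for illustration. It only estimates the heap share and says nothing about how the rest of the memory is used.

```python
# RAM in GiB for the instance sizes tested in the question.
# Assumption: the .search variants match the standard m7g memory sizes.
INSTANCE_RAM_GIB = {
    "m7g.medium.search": 4,
    "m7g.large.search": 8,
    "m7g.xlarge.search": 16,
    "m7g.2xlarge.search": 32,
}

MAX_HEAP_GIB = 32  # upper bound on the JVM heap


def estimated_heap_gib(ram_gib: float) -> float:
    """Estimate the JVM heap: half of instance RAM, capped at 32 GiB."""
    return min(ram_gib / 2, MAX_HEAP_GIB)


def heap_share_percent(ram_gib: float) -> float:
    """Share of instance RAM taken by the estimated heap, in percent."""
    return 100 * estimated_heap_gib(ram_gib) / ram_gib


for name, ram in INSTANCE_RAM_GIB.items():
    heap = estimated_heap_gib(ram)
    print(f"{name}: {ram} GiB RAM, ~{heap:g} GiB heap ({heap_share_percent(ram):.0f}%)")
```

Because the heap grows with the instance, moving to a larger type does not lower the percentage of memory in use; it only raises the absolute amount available.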

For your use case, I would recommend:

  • Consider using r-type instances (like r6g or r7g) which are memory-optimized and better suited for OpenSearch workloads
  • Ensure you have at least 3 data nodes for production environments to provide redundancy
  • Enable dedicated master nodes for clusters that will handle significant workloads
  • Monitor performance metrics beyond just memory utilization, such as cluster health, indexing rate, and search latency
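For the monitoring recommendation, a rough sketch of pulling one metric from CloudWatch with boto3 might look like the following. The `AWS/ES` namespace and the `DomainName`/`ClientId` dimensions are what OpenSearch Service publishes under; the helper names, the three-hour window, and the five-minute period are arbitrary choices for illustration, and running `fetch_max` requires AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(domain_name, account_id, metric_name, hours=3):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},  # AWS account ID
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # seconds per datapoint
        "Statistics": ["Maximum"],
    }


def fetch_max(domain_name, account_id, metric_name):
    """Return the highest datapoint in the window, or None if there is no data."""
    import boto3  # imported here so the query builder works without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    query = build_metric_query(domain_name, account_id, metric_name)
    response = cloudwatch.get_metric_statistics(**query)
    return max((p["Maximum"] for p in response["Datapoints"]), default=None)
```

A call such as `fetch_max("my-domain", "123456789012", "JVMMemoryPressure")` (placeholder domain name and account ID) would return the peak JVM memory pressure over the window, which is a more useful signal than overall memory utilization.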

The m7g instances you've been testing are general-purpose instances powered by AWS Graviton processors. They can run OpenSearch, but memory-optimized r-type instances may suit a production workload better. Keep in mind that the instance family mainly affects performance; data persistence depends on redundancy, meaning multiple data nodes and replica shards.
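On the data persistence point, the reason a single data node offers no redundancy can be shown with a small sketch. It relies on one allocation rule: OpenSearch does not place two copies of the same shard on the same node. The function name is hypothetical, and the model ignores everything else that affects allocation (disk watermarks, zone awareness, and so on).

```python
def unassigned_replica_copies(data_nodes: int, primaries: int, replicas: int) -> int:
    """Count replica copies that cannot be placed.

    Each shard needs one node for its primary, so at most
    (data_nodes - 1) replica copies of that shard can be assigned.
    """
    placeable_per_shard = min(replicas, max(data_nodes - 1, 0))
    return primaries * (replicas - placeable_per_shard)


# Single data node, as in the question: every replica stays unassigned.
print(unassigned_replica_copies(data_nodes=1, primaries=5, replicas=1))

# Three data nodes: one replica per shard can be placed.
print(unassigned_replica_copies(data_nodes=3, primaries=5, replicas=1))
```

With one node, the only copy of each shard lives on that node, so losing the node means losing the data regardless of the instance type.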

answered 10 months ago
