Why is there a memory imbalance between shards in my ElastiCache for Valkey or ElastiCache for Redis OSS self-managed cluster?


My Amazon ElastiCache for Valkey or Amazon ElastiCache for Redis OSS self-managed cluster with cluster mode enabled has uneven memory usage across shards.

Short description

By default, Valkey and Redis OSS clusters with cluster mode enabled try to evenly distribute the cache key space across shards in a cluster. For more information, see How to work with cluster mode on Amazon ElastiCache for Redis.

The following issues can cause a memory imbalance where some shards store more data than others:

  • Uneven key distribution
  • Keys that are too large
  • "Hot" keys or shards
  • Uneven hash tag usage
  • Increased client output buffers

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Uneven key distribution

If you don't evenly distribute the hash slots across the shards, then some shards might handle more keys than others. To resolve this issue, rebalance the slots across the cluster.

Note: The Slot rebalance option tries to evenly distribute the 16384 hash slots between the available shards. This option doesn't rebalance based on memory usage or the data volume in each shard.
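Before you rebalance, you can inspect how the hash slots and keys are currently distributed. The following sketch uses valkey-cli against a placeholder configuration endpoint:

    # Check slot coverage and the number of keys on each primary node
    # (the configuration endpoint is a placeholder)
    valkey-cli --tls --cluster check my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com:6379

    # Or list the slot ranges that each shard currently serves
    valkey-cli --tls -h my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com -p 6379 cluster slots

If the key counts per primary differ widely even though the slot ranges are balanced, then the imbalance comes from the data itself rather than from slot assignment.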

Some keys are too large

If some keys are much larger than others, then the shards that host these keys might see higher memory usage. To resolve this issue, you can break up large keys into smaller key-value pairs. Or, remove unnecessary large keys to free up space.

To scan the dataset for big keys, use the valkey-cli --bigkeys or valkey-cli --memkeys command. For more information, see Scanning for big keys on the Valkey website.
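For example, the following sketch samples the keyspace on one node. The endpoint and key name are placeholders, and you must repeat the scan on each shard's primary because the scan covers only the node that you connect to:

    # Sample the keyspace for the largest key of each data type
    valkey-cli --tls -h shard-1-primary.xxxxxx.cache.amazonaws.com -p 6379 --bigkeys

    # Sample the keyspace by memory usage instead of element count
    valkey-cli --tls -h shard-1-primary.xxxxxx.cache.amazonaws.com -p 6379 --memkeys

    # Check how many bytes a single suspect key uses (the key name is a hypothetical example)
    valkey-cli --tls -h shard-1-primary.xxxxxx.cache.amazonaws.com -p 6379 memory usage catalog:eu:all-products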

It's a best practice to use efficient key naming strategies, data structures, and compression techniques to optimize the memory usage of the keys.

Hot keys or shards

When you access some keys much more frequently than others, the load distributes unevenly and strains the memory utilization of the host that serves those keys. The frequently accessed keys are known as hot keys, and the shards that host them become hot shards.

To find hot keys, run the valkey-cli --hotkeys command and review the key access patterns. In some cases, a single hot cache key can create a hot spot that overwhelms the cache node. The hot spot can affect the CPU, memory, and network resources of the node.

Note: The --hotkeys scan works only when the maxmemory-policy parameter is set to an LFU policy, such as allkeys-lfu or volatile-lfu.
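On ElastiCache, you change maxmemory-policy through the cluster's parameter group rather than the CONFIG SET command. The following sketch shows one way to do that and then scan a node; the parameter group name and endpoint are placeholders:

    # Set an LFU eviction policy in the custom parameter group that the cluster uses
    aws elasticache modify-cache-parameter-group \
        --cache-parameter-group-name my-valkey-params \
        --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=allkeys-lfu"

    # Then sample the keyspace on the affected node for frequently accessed keys
    valkey-cli --tls -h shard-1-primary.xxxxxx.cache.amazonaws.com -p 6379 --hotkeys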

To resolve this issue, take the following actions:

  • Vertically scale your cluster and provide more resources.
  • Spread read traffic to read replicas, as shown in the sketch after this list. For more information, see READONLY on the Valkey website.
  • Modify the client application to reduce the volume of writes to the keys.
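The following sketch shows how a read from a replica works at the command line. The replica endpoint and key name are hypothetical examples, and READONLY applies only to the connection where you run it:

    # Connect to a replica node
    valkey-cli --tls -h shard-1-replica.xxxxxx.cache.amazonaws.com -p 6379

    # Inside the interactive session, enable reads on this connection,
    # then serve the read from the replica
    READONLY
    GET hot:item:1234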

Uneven hash tag usage

In a cluster mode enabled environment, you must use hash tags to implement multi-key operations in a Valkey cluster. However, if you overuse the same hash tag, then many keys map to the same hash slot, the shard that serves that slot stores more keys than the others, and you get a memory imbalance between shards.

To resolve this issue, review your key space and hash tag usage, and spread the data across more hash slots. For more information, see the Hash tags section in Cluster specification on the Valkey website.
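For example, you can use the CLUSTER KEYSLOT command to confirm where a key lands. In the following sketch, the endpoint and key names are hypothetical. The first two commands return the same slot because the keys share the {user:1000} hash tag, and the last two typically return different slots because the full key names are hashed:

    ENDPOINT=my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com   # placeholder

    # Keys that share a hash tag map to the same hash slot
    valkey-cli --tls -h "$ENDPOINT" -p 6379 cluster keyslot "{user:1000}:profile"
    valkey-cli --tls -h "$ENDPOINT" -p 6379 cluster keyslot "{user:1000}:sessions"

    # Keys without a shared hash tag are hashed on the full key name and spread across slots
    valkey-cli --tls -h "$ENDPOINT" -p 6379 cluster keyslot "user:1000:profile"
    valkey-cli --tls -h "$ENDPOINT" -p 6379 cluster keyslot "user:2000:profile"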

Increased client output buffers

When a client's commands produce replies faster than the client can read them, the output buffer for that client grows and uses more memory. For more information, see Output buffer limits on the Valkey website.

To identify the cause of buffer issues, connect to the affected node and run the CLIENT LIST command to identify clients that use buffer space. For more information, see CLIENT LIST on the Valkey website.

To determine the cause of high client output buffer usage, review the following key fields in the output (an example follows this list):

  • obl: Output buffer length
  • omem: Output buffer memory usage
  • tot-mem: Total memory used by the client
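For example, the following sketch connects to the affected node (the endpoint is a placeholder) and prints only the buffer-related fields for each client so that large output buffers stand out:

    # List each client's address and buffer-related fields
    valkey-cli --tls -h shard-1-primary.xxxxxx.cache.amazonaws.com -p 6379 client list \
      | awk '{ line=""; for (i=1; i<=NF; i++) if ($i ~ /^(addr|obl|omem|tot-mem)=/) line = line $i " "; print line }'

The tot-mem field is available only on newer engine versions.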

You can also review the DatabaseMemoryUsageCountedForEvictPercentage and DatabaseMemoryUsagePercentage metrics in Amazon CloudWatch. If there's a significant difference between the two metrics, then client output buffers are a likely cause of the additional memory usage.

Note: The DatabaseMemoryUsagePercentage metric also includes connection overhead and client output buffer memory usage.
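For example, the following AWS CLI sketch pulls both metrics for one node so that you can compare them side by side. The node IDs and time range are placeholders:

    # Compare the two memory metrics for one cache node over one hour
    for METRIC in DatabaseMemoryUsagePercentage DatabaseMemoryUsageCountedForEvictPercentage; do
      aws cloudwatch get-metric-statistics \
        --namespace AWS/ElastiCache \
        --metric-name "$METRIC" \
        --dimensions Name=CacheClusterId,Value=my-cluster-0001-001 Name=CacheNodeId,Value=0001 \
        --start-time 2025-01-01T00:00:00Z \
        --end-time 2025-01-01T01:00:00Z \
        --period 300 \
        --statistics Average
    done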

Best practices

To reduce memory imbalance issues, use the following best practices.

Configure TTL settings

Set appropriate Time to Live (TTL) values for keys. The Valkey node then automatically removes keys when their TTL expires, which optimizes memory usage. For more information, see TTL on the Valkey website.
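For example, the following sketch writes a key with a one hour TTL, adds a TTL to an existing key, and checks the remaining TTL. The endpoint, key names, and values are hypothetical:

    ENDPOINT=my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com   # placeholder

    # Write a key that expires after 3600 seconds
    valkey-cli --tls -c -h "$ENDPOINT" -p 6379 set session:1234 "cached-value" EX 3600

    # Add a TTL to an existing key
    valkey-cli --tls -c -h "$ENDPOINT" -p 6379 expire user:1000:profile 3600

    # Check the remaining TTL in seconds (-1 means that no TTL is set)
    valkey-cli --tls -c -h "$ENDPOINT" -p 6379 ttl session:1234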

Review your memory metrics

It's a best practice to regularly review the following key memory metrics across your shards to identify imbalances early and take proactive measures:

  • DatabaseMemoryUsagePercentage: Track overall memory utilization on the node.
  • DatabaseMemoryUsageCountedForEvictPercentage: To detect high buffer and overhead usage, compare against DatabaseMemoryUsagePercentage.
  • BytesUsedForCache: Monitor actual memory that cached data uses.
  • CurrItems: Track the number of items that are stored in each shard.
  • SwapUsage: Track the amount of swap that's used on a host.
    Note: It's common for ElastiCache to have some SwapUsage. Normal usage doesn't cause latency issues. If SwapUsage crosses 300 MB, then check for memory pressure. For more information, see How much reserved memory do you need?
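To catch an imbalance early, you can also alarm on these metrics. The following AWS CLI sketch creates a CloudWatch alarm that notifies an Amazon SNS topic when one node's memory usage stays above 80 percent; the alarm name, node IDs, threshold, and topic ARN are placeholders:

    aws cloudwatch put-metric-alarm \
      --alarm-name my-cluster-0001-001-high-memory \
      --namespace AWS/ElastiCache \
      --metric-name DatabaseMemoryUsagePercentage \
      --dimensions Name=CacheClusterId,Value=my-cluster-0001-001 Name=CacheNodeId,Value=0001 \
      --statistic Average \
      --period 300 \
      --evaluation-periods 3 \
      --threshold 80 \
      --comparison-operator GreaterThanThreshold \
      --alarm-actions arn:aws:sns:us-east-1:111122223333:my-alerts-topic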

Upgrade your nodes

To handle large keys and frequent access patterns, temporarily scale up your ElastiCache cluster to provide additional CPU and memory resources.
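For example, the following AWS CLI sketch scales a replication group to a larger node type. The replication group ID and node type are placeholders; choose a node type that fits your workload:

    # Scale the cluster to a larger node type (online vertical scaling)
    aws elasticache modify-replication-group \
      --replication-group-id my-cluster \
      --cache-node-type cache.r7g.xlarge \
      --apply-immediately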

Related information

How do I check memory usage in my ElastiCache for Redis self-designed cluster and implement best practices to control high memory usage?

Key distribution model on the Valkey website

Amazon ElastiCache update - Online resizing for Redis clusters

How do I resolve the increase in swap activity in my ElastiCache instances?
