How to configure the Redis client (Lettuce) to handle primary/replica failover?


We have an ElastiCache cluster (5 shards, a primary and a replica in every shard) with auto-failover enabled. Yesterday a failover occurred and one replica was promoted to primary. However, our Redis client (Lettuce) was not able to discover the new topology, and many PUT and DEL operations (possibly all of them for that one shard) failed with:

Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 2 second(s)

How should the client be configured to handle this situation? Is enabling the following topology refresh options enough?

ClusterTopologyRefreshOptions.builder()
    .enablePeriodicRefresh(Duration.ofSeconds(30))
    .enableAllAdaptiveRefreshTriggers()
    .build();

To connect, we use a name configured as a DNS CNAME record in Route 53 that points to the cluster URL (the configuration endpoint from the Redis configuration).

luktol
asked a year ago, 316 views
1 Answer

Hello,

A successful failover flags the malfunctioning node as failing and promotes the replica to primary. Clients are expected to refresh the cluster topology at periodic intervals (enablePeriodicRefresh) and in response to specific events (enableAllAdaptiveRefreshTriggers). Your configuration looks good in both respects.

The ElastiCache cluster endpoint returns all nodes in the cluster (10 nodes in your case), and any healthy node can be used to refresh the client topology. I am not an authority on Lettuce, but based on your description, it looks like the client kept trying to contact the failing node instead of refreshing the topology and routing requests to the healthy nodes.

You may want to set lower timeouts in your socket options and make sure that dynamicRefreshSources is true.
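
As a rough illustration of how these settings fit together, the sketch below combines the topology refresh options from the question with socket and command timeouts. The timeout values, the class name FailoverAwareClient, and the host passed in by the caller are assumptions for illustration, not recommended values; tune them for your workload and check the method names against the Lettuce version you run.

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import java.time.Duration;

// Hypothetical helper class; the name and all durations are illustrative assumptions.
public class FailoverAwareClient {

    static ClusterClientOptions buildOptions() {
        // Refresh the topology periodically and on adaptive triggers
        // (for example MOVED/ASK redirects or repeated reconnect attempts).
        ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .enableAllAdaptiveRefreshTriggers()
                // Use the discovered cluster nodes, not only the seed URI, as refresh sources.
                .dynamicRefreshSources(true)
                .build();

        // Fail fast when a node is unreachable instead of waiting on a dead socket.
        SocketOptions socket = SocketOptions.builder()
                .connectTimeout(Duration.ofSeconds(1)) // assumed value
                .keepAlive(true)
                .build();

        return ClusterClientOptions.builder()
                .topologyRefreshOptions(refresh)
                .socketOptions(socket)
                // Per-command timeout; 2 seconds mirrors the timeout in the question.
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(2)))
                .build();
    }

    static RedisClusterClient create(String host, int port) {
        // No connection is opened here; that happens on client.connect().
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create(host, port));
        client.setOptions(buildOptions());
        return client;
    }
}
```

A caller would pass the ElastiCache configuration endpoint as the host and then call connect() on the returned client. Whether this alone avoids the timeouts seen during failover still depends on the Lettuce version in use.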

Additionally, make sure that you are running an up-to-date Lettuce version. I have found bug reports describing problems with cluster topology refresh in fairly recent versions.

This may not be directly related to your question, but I would advise against using custom DNS names in front of the ElastiCache DNS endpoint. Improper caching may keep the records for longer than ideal and cause trouble during scaling or even failover. Although it is not common, node IP addresses may change in exceptional conditions during failovers. The same applies to client-side DNS caching, at the operating system or JVM level.
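
For the JVM-level caching mentioned above, one possible approach is to cap the JVM's DNS cache through the standard networkaddress.cache.ttl security property. This is a minimal sketch: the class name DnsCacheConfig and the TTL values are assumptions for illustration, and the properties need to be set early in startup, before the JVM performs its first lookup.

```java
import java.security.Security;

// Hypothetical helper; the class name and TTL values are illustrative assumptions.
public class DnsCacheConfig {

    /** Caps JVM-wide DNS caching. Call this before the first hostname lookup. */
    public static void limitDnsCaching(int positiveTtlSeconds, int negativeTtlSeconds) {
        // How long successful lookups are cached, in seconds.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // How long failed lookups are cached, in seconds.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        limitDnsCaching(5, 0); // assumed values: 5 s for hits, no caching of failures
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The same properties can also be set in the JVM's java.security file, which avoids any dependence on initialization order in application code.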

The following blog post provides additional best practices for Redis clients, and provides examples for Lettuce: https://aws.amazon.com/blogs/database/best-practices-redis-clients-and-amazon-elasticache-for-redis/

I hope my response has been helpful to you.

AWS
SUPPORT ENGINEER
Tulio_M
answered a year ago
