Hello,
A successful failover flags the malfunctioning node as failing and promotes its replica to primary. Clients are expected to refresh the cluster topology at periodic intervals (enablePeriodicRefresh) and in response to specific events (enableAllAdaptiveRefreshTriggers). Your configuration looks good on both counts.
The ElastiCache cluster endpoint returns all nodes in the cluster (10 nodes in your case), and any healthy node can be used to refresh the client topology. I am not an authority on Lettuce, but based on your description, it looks like the client kept trying to contact the failing node instead of refreshing the topology and routing requests to the healthy nodes.
You may want to set lower timeouts in your socket options and make sure that dynamicRefreshSources is set to true.
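As an illustration only, a Lettuce cluster client combining these settings might look like the sketch below. The endpoint, refresh period, and timeout values are placeholders I chose for the example, not recommendations; tune them for your workload.

```java
import java.time.Duration;

import io.lettuce.core.SocketOptions;
import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class LettuceClusterConfig {
    public static void main(String[] args) {
        // Refresh the topology periodically and on adaptive triggers
        // (MOVED/ASK redirects, reconnect attempts, etc.).
        // dynamicRefreshSources(true) asks all discovered nodes for the
        // current topology, not just the initial seed nodes.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .enableAllAdaptiveRefreshTriggers()
                .dynamicRefreshSources(true)
                .build();

        // Lower connect timeout so a failing node is abandoned quickly
        // instead of blocking for the default timeout.
        SocketOptions socketOptions = SocketOptions.builder()
                .connectTimeout(Duration.ofMillis(500))
                .keepAlive(true)
                .build();

        ClusterClientOptions clientOptions = ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .socketOptions(socketOptions)
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(1)))
                .build();

        // Placeholder configuration endpoint; Lettuce discovers the
        // remaining cluster nodes from it.
        RedisClusterClient client = RedisClusterClient.create(
                "redis://clustercfg.my-cluster.xxxxxx.use1.cache.amazonaws.com:6379");
        client.setOptions(clientOptions);
    }
}
```

Building the options does not open any connections, so the client can be configured like this before the first connect() call.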
Additionally, make sure that you are running an up-to-date Lettuce version. There are bug reports about cluster topology refresh in versions that are not that old.
Maybe not directly related to your question, but I would advise against putting custom DNS names in front of the ElastiCache DNS endpoint. Improper caching may keep the records around for longer than ideal and cause trouble during scaling or even failover. Although not common, node IPs may change under exceptional conditions during failovers. The same applies to client-side DNS caching, at the operating-system or JVM level.
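On the JVM side, the positive DNS cache TTL can be bounded via the standard `networkaddress.cache.ttl` security property, so the client re-resolves the endpoint soon after node IPs change. A minimal sketch (the 5-second and 1-second values are example choices, not recommendations):

```java
import java.security.Security;

public class JvmDnsTtl {
    public static void main(String[] args) {
        // Cache successful DNS lookups for at most 5 seconds. This must be
        // set early, before the first lookup populates the JVM's cache.
        Security.setProperty("networkaddress.cache.ttl", "5");

        // Optionally bound caching of failed lookups as well, so a transient
        // resolution failure is retried quickly.
        Security.setProperty("networkaddress.cache.negative.ttl", "1");

        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
        // prints "5"
    }
}
```

Note that if a SecurityManager is installed, the default behavior is to cache successful lookups forever, which is exactly the failover hazard described above.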
The following blog post provides additional best practices for Redis clients, and provides examples for Lettuce: https://aws.amazon.com/blogs/database/best-practices-redis-clients-and-amazon-elasticache-for-redis/
I hope my response has been helpful to you.