How to configure the Redis client (Lettuce) to handle primary/replica failover?

We have an ElastiCache cluster (5 shards, a primary and a replica in every shard) with auto-failover enabled. Yesterday we hit a failover: one replica was promoted to primary, but our Redis client (Lettuce) did not discover the new topology, and many PUT and DEL operations (possibly all of them for that one shard) failed with:

Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 2 second(s)

How should we configure the client to handle such a situation? Is enabling

ClusterTopologyRefreshOptions.builder()
            .enablePeriodicRefresh(Duration.ofSeconds(30))
            .enableAllAdaptiveRefreshTriggers()

enough?

To connect, we use a name configured as a DNS CNAME record in Route 53 that points to the cluster URL (the Configuration endpoint from the Redis configuration).

luktol
asked a year ago, 347 views
1 Answer

Hello,

A successful failover flags the malfunctioning node as failing and promotes the replica to primary. Clients are supposed to refresh the cluster topology at periodic intervals (enablePeriodicRefresh) and in response to specific events (enableAllAdaptiveRefreshTriggers). Your configuration looks good in these respects.

The ElastiCache cluster endpoint returns all nodes in the cluster (10 nodes in your case), and any healthy node can be used to refresh the client topology. I am not an authority on Lettuce, but based on your description, it looks like the client kept trying to contact the failing node instead of refreshing the topology and routing requests to the healthy nodes.

You may want to set lower timeouts in your socket options and make sure that dynamicRefreshSources is true (true is the builder's default, so this matters mainly if it was switched off explicitly).
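As a rough illustration of the options mentioned so far, here is a minimal sketch of a Lettuce cluster client with topology refresh, a short connect timeout, and dynamicRefreshSources set explicitly. The class name, the endpoint host, and all timeout values are assumptions chosen for illustration, not recommendations from this thread; tune them against your own latency budget.

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

import java.time.Duration;

final class FailoverAwareClient {

    // Hypothetical placeholder: substitute the cluster's configuration endpoint.
    private static final String CONFIG_ENDPOINT = "clustercfg.example.com";

    static RedisClusterClient create(String host, int port) {
        ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                // Re-read the topology on a fixed schedule (value from the question).
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                // Also refresh on MOVED/ASK redirects, reconnect attempts, and similar events.
                .enableAllAdaptiveRefreshTriggers()
                // Query the discovered nodes for topology, not only the seed endpoint.
                .dynamicRefreshSources(true)
                .build();

        SocketOptions socket = SocketOptions.builder()
                // Assumed value: fail fast when a node no longer accepts connections.
                .connectTimeout(Duration.ofSeconds(1))
                .keepAlive(true)
                .build();

        ClusterClientOptions options = ClusterClientOptions.builder()
                .topologyRefreshOptions(refresh)
                .socketOptions(socket)
                // Assumed value: per-command timeout, matching the 2 s seen in the error.
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(2)))
                .build();

        RedisClusterClient client = RedisClusterClient.create(RedisURI.create(host, port));
        client.setOptions(options);
        return client;
    }

    public static void main(String[] args) {
        RedisClusterClient client = create(CONFIG_ENDPOINT, 6379);
        try (StatefulRedisClusterConnection<String, String> connection = client.connect()) {
            connection.sync().set("key", "value");
        } finally {
            client.shutdown();
        }
    }
}
```

The options have to be set on the client before connect() is called. Whether these settings alone would have prevented the timeouts described in the question is not something this sketch can confirm.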

Additionally, make sure that you are running an up-to-date Lettuce version. I've found bug reports about problems with cluster topology refresh in versions that are not that old.

Maybe not directly related to your question, but I would advise against using custom DNS names in front of the ElastiCache DNS endpoint. Improper caching may keep the records for longer than ideal and cause trouble during scaling or even failover. Although it is not common, node IPs may change under exceptional conditions during failovers. The same applies to client-side DNS caching at the operating system or JVM level.
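For the JVM-level part, one way to bound DNS caching is through the standard networkaddress.cache.ttl and networkaddress.cache.negative.ttl security properties. The sketch below is only an illustration: the class name and the TTL values (5 and 1 seconds) are assumptions, and the properties need to be set early, before the JVM performs its first host name lookup.

```java
import java.security.Security;

final class DnsCacheConfig {

    /**
     * Limits how long the JVM caches DNS lookups.
     * Call this at startup, before any host name has been resolved.
     *
     * @param positiveTtlSeconds cache lifetime for successful lookups
     * @param negativeTtlSeconds cache lifetime for failed lookups
     */
    static void limitDnsCaching(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        // Assumed values for illustration only.
        limitDnsCaching(5, 1);
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The same properties can also be set in the JRE's java.security file instead of in code. Operating-system resolvers and any intermediate CNAME records have their own TTLs that this does not affect.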

The following blog post provides additional best practices for Redis clients, and provides examples for Lettuce: https://aws.amazon.com/blogs/database/best-practices-redis-clients-and-amazon-elasticache-for-redis/

I hope my response has been helpful to you.

AWS
SUPPORT ENGINEER
Tulio_M
answered a year ago
