
Aurora Read Replica Scale In Connection Issue with RDS Proxy and Lambda


Environment

  • AWS Aurora PostgreSQL
  • RDS Proxy
  • Node.js Lambda functions

Current Setup

We are using AWS Aurora PostgreSQL with RDS Proxy, and our application logic is implemented in Node.js Lambda functions.

Issue Description

We're experiencing connection interruptions in our running Lambda functions when Aurora Read Replicas scale in. Here are the specifics:

  1. Our Node.js Lambda functions do not use connection pooling. Instead, they create a new connection for each query, as recommended when using RDS Proxy.

  2. When Auto Scaling initiates a scale-in event and removes a Read Replica, it causes errors in Lambda functions that are still executing queries.

Questions

  1. How does scale-in work with RDS Proxy and Aurora Read Replicas?

  2. What criteria does AWS use to determine if it's safe to remove a Read Replica?

  3. Is there a way to ensure that all ongoing queries are completed before a Read Replica is removed during scale-in?

  4. Are there any best practices or configurations we should implement to prevent these connection interruptions?

  5. How can we solve our current issue of Lambda functions failing due to Read Replica removal during execution?

Any insights or recommendations would be greatly appreciated. Thank you in advance for your help!

1 Answer
Accepted Answer

Thank you for providing detailed information about your environment and the issue you're facing. I'll address your questions and provide some insights and recommendations.

  1. Scale-in with RDS Proxy and Aurora Read Replicas: When a scale-in event occurs, Aurora Auto Scaling removes Read Replicas based on the Auto Scaling policy. RDS Proxy maintains a connection pool to the Aurora cluster, including the Read Replicas. When a Read Replica is removed, RDS Proxy stops routing new connections to it, but database connections that are already open to that instance are closed, so any query still running on it fails.

  2. Criteria for safe Read Replica removal: AWS considers several factors when determining if it's safe to remove a Read Replica:

  • The current load on the replica
  • The overall cluster performance
  • The target metrics specified in the Auto Scaling policy

However, none of these factors tracks individual queries, so there is no mechanism that waits for in-flight queries to finish before a replica is removed.
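
The target metrics mentioned above are set in an Application Auto Scaling target-tracking policy on the cluster. As a rough sketch of what such a policy looks like, the helper below builds the parameter object for the PutScalingPolicy API. The function name, the cluster identifier, and every numeric value are illustrative assumptions, not recommendations for your workload.

```javascript
// Hypothetical helper: builds parameters for Application Auto Scaling's
// PutScalingPolicy call for an Aurora cluster's reader count.
// Defaults below are placeholders chosen for illustration only.
function buildReaderScalingPolicy({
  clusterId,
  targetCpuPercent = 60,
  scaleInCooldownSec = 900,
  scaleOutCooldownSec = 300,
  disableScaleIn = false,
}) {
  if (!clusterId) {
    throw new Error('clusterId is required');
  }
  return {
    PolicyName: `${clusterId}-reader-cpu-target`,
    ServiceNamespace: 'rds',
    ResourceId: `cluster:${clusterId}`,
    ScalableDimension: 'rds:cluster:ReadReplicaCount',
    PolicyType: 'TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration: {
      // Average CPU across Aurora Replicas that the policy tries to hold.
      TargetValue: targetCpuPercent,
      PredefinedMetricSpecification: {
        PredefinedMetricType: 'RDSReaderAverageCPUUtilization',
      },
      // Seconds to wait after a scale-in before another scale-in can start.
      ScaleInCooldown: scaleInCooldownSec,
      ScaleOutCooldown: scaleOutCooldownSec,
      // When true, the policy adds replicas but never removes them.
      DisableScaleIn: disableScaleIn,
    },
  };
}

// Example: a hypothetical cluster named "my-aurora-cluster".
const policy = buildReaderScalingPolicy({ clusterId: 'my-aurora-cluster' });
console.log(JSON.stringify(policy, null, 2));
```

The resulting object could be passed to PutScalingPolicyCommand from @aws-sdk/client-application-auto-scaling, or translated to the equivalent CLI or infrastructure-as-code settings. Note that nothing in the policy refers to running queries; it only describes a metric target and cooldowns.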

  3. Ensuring query completion before Read Replica removal: Unfortunately, there isn't a built-in way to guarantee all ongoing queries are completed before a Read Replica is removed during scale-in. Aurora Auto Scaling focuses on maintaining the desired performance metrics rather than individual query states.

  4. Best practices and configurations to prevent connection interruptions:

  • Use RDS Proxy's connection pooling feature effectively. While you're already using RDS Proxy, review its settings (for example, the idle client timeout and connection borrow timeout) against your Lambda workload.
  • Implement retry logic in your Lambda functions to handle temporary connection issues.
  • Consider using a shorter connection timeout in your Lambda functions to detect and react to issues more quickly.
  • Monitor the Amazon CloudWatch AuroraReplicaLag metric to track the lag between Aurora Replicas and the writer instance.

  5. Solving the current issue of Lambda functions failing:

  • Implement robust error handling and retry mechanisms in your Lambda functions. This can help mitigate temporary connection issues during scale-in events.
  • Consider using a circuit breaker pattern to prevent cascading failures.
  • Optimize your Auto Scaling policy to balance between performance and stability. You might want to adjust the cooldown periods or scaling thresholds to reduce the frequency of scale-in events.
  • If possible, design your application to be more tolerant of temporary read failures, perhaps by caching frequently accessed data or implementing fallback mechanisms.
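
The retry advice above can be sketched as follows. This is a minimal example, not a definitive implementation: withRetry, isTransient, and backoffDelayMs are hypothetical names, and the set of error codes treated as transient is an assumption that should be checked against what your database driver actually reports when a replica is removed. Retrying is only safe for operations that can be repeated without side effects, such as read-only queries.

```javascript
// Codes treated as transient here are an assumption. The first group are
// Node.js socket error codes; the second are PostgreSQL SQLSTATE values
// (admin_shutdown, crash_shutdown, cannot_connect_now, connection_failure).
const TRANSIENT_CODES = new Set([
  'ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'EPIPE',
  '57P01', '57P02', '57P03', '08006',
]);

function isTransient(err) {
  return Boolean(err && TRANSIENT_CODES.has(err.code));
}

// Exponential backoff with full jitter: a random delay in
// [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(attempt, baseMs = 100, capMs = 2000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation` (an async function) up to `maxAttempts` times.
// Non-transient errors and the final failed attempt are rethrown unchanged.
async function withRetry(operation, { maxAttempts = 3, baseMs = 100, capMs = 2000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      const isLastAttempt = attempt === maxAttempts - 1;
      if (!isTransient(err) || isLastAttempt) {
        throw err;
      }
      await sleep(backoffDelayMs(attempt, baseMs, capMs));
    }
  }
  throw new Error('maxAttempts must be at least 1');
}
```

In a handler, the operation passed to withRetry would open a fresh connection through the proxy, run the read query, and close the connection, so that each retry gets a new connection rather than reusing one that points at the removed replica. Keep the total retry time well inside the Lambda timeout.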

Additional recommendations:

  • While creating a new connection for each query is generally a good practice with RDS Proxy, you might want to experiment with short-lived connection pooling within your Lambda functions. This could potentially reduce the impact of individual connection failures.
  • Regularly review and optimize your database queries to ensure they complete as quickly as possible, reducing the window of vulnerability during scale-in events.
  • Consider using Aurora Serverless v2 if your workload is suitable, as it scales the capacity of each instance in place, which can reduce how often Read Replicas need to be added and removed.
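
One simple reading of the short-lived pooling suggestion above is to share a single connection across all queries in one invocation and always close it before returning. The sketch below assumes a hypothetical createClient factory that resolves to an object with query and end methods (with node-postgres, it would construct a Client and call connect()); withClient is an illustrative name, not part of any library.

```javascript
// Opens one connection, runs `work` with it, and always closes it,
// even when `work` throws. `createClient` is a hypothetical async factory.
async function withClient(createClient, work) {
  const client = await createClient();
  try {
    return await work(client);
  } finally {
    // Closing in `finally` avoids leaking connections on the proxy
    // when a query fails partway through the invocation.
    await client.end();
  }
}

// Example handler shape: two queries share one connection.
// The SQL text and table names are placeholders.
async function exampleHandler(createClient) {
  return withClient(createClient, async (client) => {
    const users = await client.query('SELECT id FROM users LIMIT 10');
    const orders = await client.query('SELECT id FROM orders LIMIT 10');
    return { users, orders };
  });
}
```

Because the connection lives only as long as one invocation, a replica removal affects at most the queries in that invocation, and the next invocation starts with a fresh connection.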

By implementing these recommendations and best practices, you should be able to significantly reduce the impact of Read Replica scale-in events on your Lambda functions. Remember to thoroughly test any changes in a non-production environment before applying them to your production system.
Sources
Propagating Aurora Cluster Tags to Read Replicas | AWS re:Post
Replication with Amazon Aurora PostgreSQL - Amazon Aurora
Using Amazon RDS Proxy for Aurora - Amazon Aurora
Managing performance and scaling for Aurora DB clusters - Amazon Aurora

answered a year ago
EXPERT
reviewed a year ago
