Aurora Read Replica Scale In Connection Issue with RDS Proxy and Lambda

Environment

  • AWS Aurora PostgreSQL
  • RDS Proxy
  • Node.js Lambda functions

Current Setup

We are using AWS Aurora PostgreSQL with RDS Proxy, and our application logic is implemented in Node.js Lambda functions.

Issue Description

We're experiencing connection interruptions in our running Lambda functions when Aurora Read Replicas scale in. Here are the specifics:

  1. Our Node.js Lambda functions do not use connection pooling. Instead, they create a new connection for each query, as recommended when using RDS Proxy.

  2. When Auto Scaling initiates a scale-in event and removes a Read Replica, it causes errors in Lambda functions that are still executing queries.

Questions

  1. How does scale-in work with RDS Proxy and Aurora Read Replicas?

  2. What criteria does AWS use to determine if it's safe to remove a Read Replica?

  3. Is there a way to ensure that all ongoing queries are completed before a Read Replica is removed during scale-in?

  4. Are there any best practices or configurations we should implement to prevent these connection interruptions?

  5. How can we solve our current issue of Lambda functions failing due to Read Replica removal during execution?

Any insights or recommendations would be greatly appreciated. Thank you in advance for your help!

Asked 2 years ago · Viewed 537 times
1 Answer
Accepted Answer

Thank you for providing detailed information about your environment and the issue you're facing. I'll address your questions and provide some insights and recommendations.

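  1. Scale-in with RDS Proxy and Aurora Read Replicas: When a scale-in event occurs, Aurora Auto Scaling removes Read Replicas based on the Auto Scaling policy. RDS Proxy maintains a connection pool to the Aurora cluster, and read traffic reaches the Read Replicas through a read-only proxy endpoint. When a Read Replica is removed, RDS Proxy drops its connections to that instance and sends new requests to the remaining replicas. A query that is already running on the removed replica is interrupted, and the client receives an error.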

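  2. Criteria for safe Read Replica removal: Aurora Auto Scaling decides from the scaling policy rather than from what an individual replica is doing at that moment. The main inputs are:

  • The load on the Read Replicas, as measured by the policy's target metric (for example, average CPU utilization or average connections)
  • The target value specified in the Auto Scaling policy
  • The scale-in cooldown period and the minimum replica count

However, there isn't a specific mechanism to ensure all ongoing queries are completed before removal.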
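To make the policy inputs concrete, here is a minimal sketch of the parameters for an Application Auto Scaling target-tracking policy on an Aurora cluster. The helper name, the policy naming scheme, the cluster name `my-aurora-cluster`, and the numeric values are assumptions for illustration. The code only builds the request object and does not call AWS.

```javascript
// Builds the parameters for a target-tracking scaling policy on the replica
// count of an Aurora cluster. Helper name, policy name, and example values
// are illustrative assumptions, not values from the question.
function buildReplicaScalingPolicy({ clusterId, targetCpuPercent, scaleInCooldownSec, scaleOutCooldownSec }) {
  return {
    PolicyName: `${clusterId}-reader-cpu-target`, // hypothetical naming scheme
    ServiceNamespace: "rds",
    ResourceId: `cluster:${clusterId}`,
    ScalableDimension: "rds:cluster:ReadReplicaCount",
    PolicyType: "TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration: {
      // Scale so that average reader CPU stays near this percentage.
      TargetValue: targetCpuPercent,
      PredefinedMetricSpecification: {
        PredefinedMetricType: "RDSReaderAverageCPUUtilization",
      },
      // Seconds to wait after a scaling activity before the next one.
      ScaleInCooldown: scaleInCooldownSec,
      ScaleOutCooldown: scaleOutCooldownSec,
      // Set to true to stop this policy from removing replicas at all.
      DisableScaleIn: false,
    },
  };
}

const examplePolicy = buildReplicaScalingPolicy({
  clusterId: "my-aurora-cluster",
  targetCpuPercent: 60,
  scaleInCooldownSec: 900,
  scaleOutCooldownSec: 120,
});
console.log(JSON.stringify(examplePolicy, null, 2));
```

An object of this shape is what the Application Auto Scaling `PutScalingPolicy` operation expects. Nothing in it refers to in-flight queries, which is why the scale-in decision cannot take them into account.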

  3. Ensuring query completion before Read Replica removal: Unfortunately, there isn't a built-in way to guarantee all ongoing queries are completed before a Read Replica is removed during scale-in. Aurora Auto Scaling focuses on maintaining the desired performance metrics rather than individual query states.

  4. Best practices and configurations to prevent connection interruptions:

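  • Use RDS Proxy's connection pooling feature effectively. You're already using RDS Proxy, so review its pool settings, for example MaxConnectionsPercent, ConnectionBorrowTimeout, and the idle client timeout, against your Lambda concurrency.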
  • Implement retry logic in your Lambda functions to handle temporary connection issues.
  • Consider using a shorter connection timeout in your Lambda functions to detect and react to issues more quickly.
  • Monitor the Amazon CloudWatch AuroraReplicaLag metric to track the lag between Aurora Replicas and the writer instance.
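The retry suggestion above can be sketched as follows. This is a minimal example, not a definitive implementation: `withRetry` and `isTransient` are hypothetical helper names, and the sketch assumes the database driver reports the PostgreSQL SQLSTATE or the Node.js socket error code in `err.code`. Only retry queries that are safe to run twice, such as reads.

```javascript
// PostgreSQL SQLSTATE codes and Node.js socket error codes that usually mean
// the connection was lost rather than the query being wrong.
const TRANSIENT_CODES = new Set([
  "57P01", // admin_shutdown
  "57P02", // crash_shutdown
  "57P03", // cannot_connect_now
  "08000", // connection_exception
  "08001", // sqlclient_unable_to_establish_sqlconnection
  "08003", // connection_does_not_exist
  "08006", // connection_failure
  "ECONNRESET",
  "ECONNREFUSED",
  "EPIPE",
  "ETIMEDOUT",
]);

// Assumption: the driver exposes the code as `err.code`.
function isTransient(err) {
  return Boolean(err) && TRANSIENT_CODES.has(err.code);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation` up to `maxAttempts` times. `operation` is an async
// function that opens a connection, runs the query, and closes the
// connection, so every attempt gets a fresh connection through the proxy.
async function withRetry(operation, { maxAttempts = 3, baseDelayMs = 100, maxDelayMs = 2000 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      // Give up on non-transient errors and on the last attempt.
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      // Exponential backoff with full jitter.
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      await sleep(Math.random() * cap);
    }
  }
}
```

In a handler this would wrap the existing per-query code, for example `await withRetry(() => runReadQuery(sql))`, where `runReadQuery` stands for your own function. Keep the total retry time well inside the Lambda timeout.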

  5. Solving the current issue of Lambda functions failing:

  • Implement robust error handling and retry mechanisms in your Lambda functions. This can help mitigate temporary connection issues during scale-in events.
  • Consider using a circuit breaker pattern to prevent cascading failures.
  • Tune your Auto Scaling policy to balance performance and stability. Lengthening the scale-in cooldown period or adjusting the target value can reduce the frequency of scale-in events.
  • If possible, design your application to be more tolerant of temporary read failures, perhaps by caching frequently accessed data or implementing fallback mechanisms.
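As an illustration of the circuit breaker and fallback ideas above, here is a minimal sketch. The class and its option names are assumptions for this example and do not come from a specific library. In Lambda the breaker state lives only inside one warm execution environment, so it limits repeated failures per environment rather than across the whole fleet.

```javascript
// Minimal circuit breaker: after `failureThreshold` consecutive failures the
// circuit opens and calls go straight to the fallback until `resetTimeoutMs`
// has passed. The next call is then allowed through as a trial.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, mainly for tests
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.resetTimeoutMs ? "half-open" : "open";
  }

  // `operation` is the async database call. `fallback` receives the error and
  // returns a substitute value, for example cached data.
  async run(operation, fallback) {
    if (this.state === "open") return fallback(new Error("circuit open"));
    try {
      const result = await operation();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      // Open the circuit, or re-open it after a failed half-open trial.
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      return fallback(err);
    }
  }
}
```

Create the breaker outside the handler so it survives between invocations of the same environment, then call something like `breaker.run(() => readFromReplica(id), () => cachedValue)`, where `readFromReplica` and `cachedValue` stand for your own code.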

Additional recommendations:

  • While creating a new connection for each query is generally a good practice with RDS Proxy, you might want to experiment with short-lived connection pooling within your Lambda functions. This could potentially reduce the impact of individual connection failures.
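One way to experiment with the short-lived reuse mentioned above is to share a single connection across all queries of one invocation and close it when the invocation ends. The sketch below is a hypothetical helper, not an established API. It assumes `connect` is a function you supply that resolves to an object with `query(sql, params)` and `end()` methods, which is the shape of a node-postgres `Client`.

```javascript
// Shares one lazily opened connection across all queries issued inside
// `work`, then closes it. `connect` is a caller-supplied async factory
// (an assumption of this sketch) returning { query(sql, params), end() }.
async function withInvocationConnection(connect, work) {
  let clientPromise = null; // created on the first query, shared afterwards

  const query = async (sql, params) => {
    // Storing the promise (not the client) keeps concurrent queries from
    // opening more than one connection.
    if (clientPromise === null) clientPromise = connect();
    const client = await clientPromise;
    return client.query(sql, params);
  };

  try {
    return await work(query);
  } finally {
    if (clientPromise !== null) {
      // If connecting failed there is nothing to close.
      const client = await clientPromise.catch(() => null);
      if (client !== null) await client.end();
    }
  }
}
```

Inside a handler this could look like `await withInvocationConnection(connect, async (query) => { ... })`, with every `query(...)` call in the callback using the same connection. The trade-off is that if the replica behind that connection is removed, all remaining queries in the invocation fail together, so this is worth measuring rather than assuming.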
  • Regularly review and optimize your database queries to ensure they complete as quickly as possible, reducing the window of vulnerability during scale-in events.
  • Consider using Aurora Serverless v2 if your workload is suitable, as it provides more granular scaling without the need for discrete Read Replicas.

By implementing these recommendations and best practices, you should be able to significantly reduce the impact of Read Replica scale-in events on your Lambda functions. Remember to thoroughly test any changes in a non-production environment before applying them to your production system.
Sources
Propagating Aurora Cluster Tags to Read Replicas | AWS re:Post
Replication with Amazon Aurora PostgreSQL - Amazon Aurora
Using Amazon RDS Proxy for Aurora - Amazon Aurora
Managing performance and scaling for Aurora DB clusters - Amazon Aurora

Answered 2 years ago
Expert reviewed 2 years ago
