EKS pod sometimes connects to the RDS cluster successfully, but sometimes fails. How can I fix it?


Hi, I opened an issue at https://github.com/aws/amazon-vpc-cni-k8s/issues/2046 and am also writing this support case.

(the issue text: What happened:

My EKS cluster and RDS (MySQL) cluster are in the same VPC. I added my EKS security group (eks-cluster-sg-MYCLUSTERNAME-*) to the RDS security group's inbound rules (port 3306). The RDS connection sometimes succeeds and sometimes fails with a timeout error. I set the timeout to 300 s, so I don't think the timeout value is the cause. There is a strange pattern: when a connection succeeds, it takes either less than 1 second or more than 2 minutes (very fast or very slow). When I test locally there is no problem (the connection takes less than 1 second), but the same code running in a pod shows the behavior above.

I don't know why the connection sometimes succeeds and sometimes fails. How can I fix it? Any ideas? Thank you.

Environment:

Kubernetes version (use kubectl version): 1.22.0 CNI Version )

My problem happens when my EKS pods try to connect to the RDS cluster. I think it is an EKS network problem, because it happens only in EKS pods. A local connection test (my PC to the RDS DB) always succeeds, and our service that uses RDS has not had any issues.

Can I solve this? Thank you.

(I tried creating VPC flow logs, but the CloudWatch log group stores nothing :( )

1 Answer

How frequently does it fail? Does it fail from every pod or from one particular pod? If it is one particular pod, I would take a deeper look at the code running in that pod and its configuration, as well as the Kubernetes network policies that apply to it.
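To put numbers on how often and how slowly connections fail, a small probe run from inside a pod can time raw TCP connections to the database port. The following is only a minimal sketch using the Python standard library; the function names `probe_tcp` and `run_probes` are hypothetical, and it checks TCP reachability on port 3306 only, not MySQL authentication.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection; return (succeeded, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and applies the timeout to the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refused connections, and DNS failures
        return False, time.monotonic() - start, repr(exc)


def run_probes(host, port, count=20, timeout=5.0, pause=1.0):
    """Probe the endpoint `count` times and return the list of per-attempt results."""
    results = []
    for i in range(count):
        results.append(probe_tcp(host, port, timeout=timeout))
        if i < count - 1:
            time.sleep(pause)  # space the attempts out a little
    return results
```

Calling `run_probes` with the RDS endpoint and port 3306 from several pods on different nodes, then comparing the failure counts and elapsed times, may help show whether the problem follows a particular pod, node, or subnet.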

One thing I strongly recommend is implementing retries for your connection. If the average connection time is 1-2 seconds, set a 5-second timeout and implement retries in your code.
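A short timeout combined with retries could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: `connect_with_retry` is a hypothetical helper, `connect` stands for whatever function opens the database connection and accepts a timeout in seconds, and the exponential backoff between attempts is an added choice that the advice above does not specify.

```python
import time


def connect_with_retry(connect, attempts=5, timeout=5.0, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call connect(timeout) up to `attempts` times, backing off between failures.

    `connect` is a caller-supplied function (hypothetical here) that opens the
    connection and raises one of `retry_on` when it fails or times out.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect(timeout)
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    raise ConnectionError(f"all {attempts} connection attempts failed") from last_error
```

In practice `connect` would wrap the MySQL driver's connect call and pass the timeout through to the driver's connect-timeout option, and `retry_on` would list the driver's own connection error class, since many drivers do not raise `OSError`.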

If this is happening regularly, it could be a more serious issue that needs to be looked at by AWS Support.

AWS EXPERT
answered 2 years ago
  • I think every pod has the same problem. I couldn't resolve it, so I added a retry policy that retries until the pod (job) succeeds (30 s timeout, 5 retries).

    This happens every day, because the pods are created by CronJobs (0 0 * * *) daily. Luckily, on some days there are no connection failures.

  • This does not sound right to me. I can understand occasional failure under load but regular failure should not happen. Please file a support ticket for them to take a look and investigate what could be causing the failures.
