Application Load Balancer 504 errors with weighted target group


We have implemented a Blue/Green deployment using an Application Load Balancer with a single listener configured with two weighted target groups. When we deployed this infrastructure in production, we noticed that when we change the weights of the target groups from 0 to 100, the ALB returns a lot of 504 errors for roughly 10 seconds, while the target group reports the same number of target connection errors. We don't have this problem in the testing environment, but we can't reproduce the full production load there. The service hosted on the EC2 instances is warmed up, and the response time is normal as soon as the new targets start responding to requests, so it doesn't seem to be a problem with the load on the instances. To change the weights we use the modify-listener command found in this blog, and we do so only once all the targets in the new target group are in healthy status.

How can we debug this problem?

1 Answer

ALBs generally take a few seconds to propagate any change, but the actual time can vary.

You are seeing 504 errors when changing the weights of the target groups from 0 to 100. Here are some troubleshooting suggestions:

  1. Try moving the weights slowly/incrementally and test.
  2. Make sure both target groups have healthy targets.
  3. Check the ALB access logs for the time window in which the errors occur.

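As a rough illustration of suggestions 1 and 2, the sketch below shifts the weights in small steps and re-checks target health before each step. It assumes the two target groups are attached through the listener's default forward action and that `elbv2` is a boto3 `elbv2` client (for example `boto3.client("elbv2")`). The function names, the step size, and the pause are hypothetical choices for illustration, not values taken from this thread.

```python
import time
from typing import Any, Callable, Dict, List


def weight_steps(step: int = 10) -> List[int]:
    """Weights for the new target group, rising from `step` to 100."""
    if not 1 <= step <= 100:
        raise ValueError("step must be between 1 and 100")
    steps = list(range(step, 100, step))
    steps.append(100)  # always finish with all traffic on the new group
    return steps


def forward_action(old_tg_arn: str, new_tg_arn: str, new_weight: int) -> Dict[str, Any]:
    """Build a weighted forward action in the shape modify_listener expects."""
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": old_tg_arn, "Weight": 100 - new_weight},
                {"TargetGroupArn": new_tg_arn, "Weight": new_weight},
            ]
        },
    }


def all_healthy(elbv2: Any, tg_arn: str) -> bool:
    """True only if the group has targets and every one reports 'healthy'."""
    resp = elbv2.describe_target_health(TargetGroupArn=tg_arn)
    states = [d["TargetHealth"]["State"] for d in resp["TargetHealthDescriptions"]]
    return bool(states) and all(s == "healthy" for s in states)


def shift_traffic(
    elbv2: Any,
    listener_arn: str,
    old_tg_arn: str,
    new_tg_arn: str,
    step: int = 10,               # hypothetical step size
    pause_seconds: float = 30.0,  # hypothetical pause to let each change propagate
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Move traffic to the new target group one step at a time."""
    for weight in weight_steps(step):
        if not all_healthy(elbv2, new_tg_arn):
            raise RuntimeError("new target group is not fully healthy; stopping")
        elbv2.modify_listener(
            ListenerArn=listener_arn,
            DefaultActions=[forward_action(old_tg_arn, new_tg_arn, weight)],
        )
        if weight < 100:
            sleep(pause_seconds)
```

Stopping as soon as a target is unhealthy leaves the weights at the last applied step, which makes it easier to match any 504 errors in the access logs to a specific weight change.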
From the error, it looks like the problem is in the application servers. Let me know if this helps with troubleshooting.

AWS
Swasti
answered 2 months ago
EXPERT
reviewed a month ago
  • Thank you for your reply. We're implementing a slower change of the target group weights as a workaround to see if it helps. We are sure that both target groups have healthy targets and are able to take requests. We are also going to enable ALB access logs and compare them with the logs from the instances.

    I was able to create a production-like load on the test environment, but I couldn't reproduce the problem: when I change the weights from 0 to 100, all the traffic switches to the new target group after roughly 10 seconds, without any errors. Could the problem be related to some VPC or subnet configuration? In the test environment we recreate everything every time, but the production environment is still using some old VPC and subnets.

  • We've managed to run some tests in the production environment and noticed this: if the active color is blue, the 504 errors happen when the green target group starts to register the ASG instances, and they last until all the green instances are healthy (this takes almost 2 minutes in our case). All of this happens even though the weight of the green target group is 0 while the blue one is 100. So the errors happen before we change the weights, and changing them afterward does not cause any more errors.

  • Thank you for your email. This looks like an application issue; the ALB seems to be working as expected.

    You mention that the 504 errors last for about 2 minutes and that during that time 100% of the requests are going to the blue target group. To troubleshoot this further:

    1. Check the ALB access logs for the blue target group and see what status codes it is generating.
    2. In the same access logs, compare the creation time of each request with the time at which the ALB generated the 504 error.
    3. While you are seeing 504 errors from the blue target group, try sending requests directly to the targets in the blue target group from an instance in the VPC, and see what response you receive.
    4. Later on, we might also need to look at the CloudTrail logs to check whether a de-register target call is being made by the Auto Scaling group.
    5. Lastly, can you please share an access log entry for one of the failing requests? (Please make sure to HIDE ALL the SENSITIVE DATA from the REQUEST.)

    Please let me know if the above troubleshooting steps help, and feel free to share your results (after hiding any sensitive data) so we can dig into this further.
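The access-log checks in steps 1 and 2 above could be sketched roughly as follows. This assumes the log lines have already been downloaded and decompressed, and that they follow the documented ALB access log layout, where the leading fields run from `type` and `time` through `request_creation_time`. The function names and the output keys are hypothetical, and `shlex` is only a convenient way to split the quoted fields, so treat this as a starting point rather than a complete parser.

```python
import shlex
from datetime import datetime
from typing import Dict, Iterable, List

# Leading fields of an ALB access log entry, in their documented order.
FIELDS = [
    "type", "time", "elb", "client_port", "target_port",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol", "target_group_arn", "trace_id",
    "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time",
]

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_entry(line: str) -> Dict[str, str]:
    """Split one log line on spaces, keeping quoted fields together."""
    return dict(zip(FIELDS, shlex.split(line)))


def elapsed_seconds(entry: Dict[str, str]) -> float:
    """Seconds between the request's creation and the ALB's response time."""
    created = datetime.strptime(entry["request_creation_time"], TIMESTAMP_FORMAT)
    responded = datetime.strptime(entry["time"], TIMESTAMP_FORMAT)
    return (responded - created).total_seconds()


def gateway_timeouts(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Collect the entries for which the ALB itself returned a 504."""
    rows: List[Dict[str, object]] = []
    for line in lines:
        entry = parse_entry(line)
        if entry.get("elb_status_code") != "504":
            continue
        rows.append({
            "target": entry["target_port"],
            "target_group_arn": entry["target_group_arn"],
            "target_status_code": entry["target_status_code"],
            "target_processing_time": entry["target_processing_time"],
            "elapsed_seconds": elapsed_seconds(entry),
        })
    return rows
```

Grouping the resulting rows by `target_group_arn` and `target` shows which target group and which instances the failing requests were actually routed to. That is the key question here, since the errors appear while the green group still has weight 0.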
