Getting 503s during load testing via an AWS ALB


Issue

When we load test our APIs with ApacheBench, using 10k requests (-n 10000) and a concurrency of 1000 (-c 1000) against the health API via the ALB, we get many failures (503s), and some requests take more than 10 seconds during the load test, even though it is just a health endpoint.

Test results

When run directly against the container, without the ALB

Configuration

  • 1 server instance (2 vCPU, 4 GB RAM) on Fargate.
  • Test agent (VM in the same VPC).

When running this test directly against the container via its private IP, all 10k requests succeed with no 503s or other failures, and response times are good.

ab -n 10000 -c 1000 'http://<private-ip>/core/api/v1/health'

Results

This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking <private-ip> (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        envoy
Server Hostname:        <private-ip>
Server Port:            3000

Document Path:          /core/api/v1/health
Document Length:        16 bytes

Concurrency Level:      1000
Time taken for tests:   10.313 seconds
Complete requests:      10000
Failed requests:        0           <------------------- no failures
Non-2xx responses:      10000
Total transferred:      1660000 bytes
HTML transferred:       160000 bytes
Requests per second:    969.67 [#/sec] (mean)
Time per request:       1031.276 [ms] (mean)
Time per request:       1.031 [ms] (mean, across all concurrent requests)
Transfer rate:          157.19 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   4.6      0      22
Processing:   997 1018  22.1   1006    1096
Waiting:        0   15  18.7      4      82
Total:        997 1020  24.6   1006    1096

Percentage of the requests served within a certain time (ms)
  50%   1006
  66%   1019
  75%   1035
  80%   1044
  90%   1061
  95%   1072
  98%   1082
  99%   1086
 100%   1096 (longest request)   <-------------------- good response time

When run via the ALB

  • With just 1 instance in a single AZ, results were much worse, with a high number of failures. So below are the results with 2 instances, each in a different AZ.

Configuration

  • 2 server instances (2 vCPU, 4 GB RAM each) behind the ALB, in different availability zones, on Fargate.
  • Test agent (VM in the same VPC).

Results

ab -n 10000 -c 1000 'http://<alb>/core/api/v1/health'
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking <alb-url> (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        envoy
Server Hostname:        <alb-url>
Server Port:            80

Document Path:          /core/api/v1/health
Document Length:        7 bytes

Concurrency Level:      1000
Time taken for tests:   12.254 seconds
Complete requests:      10000
Failed requests:        919        <---------------- failures
   (Connect: 0, Receive: 0, Length: 919, Exceptions: 0)
Non-2xx responses:      919
Total transferred:      8850666 bytes
HTML transferred:       147196 bytes
Requests per second:    816.08 [#/sec] (mean)
Time per request:       1225.365 [ms] (mean)
Time per request:       1.225 [ms] (mean, across all concurrent requests)
Transfer rate:          705.36 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   5.0      0      24
Processing:     1 1121 3084.6    122   11936
Waiting:        1 1121 3084.6    122   11936
Total:          1 1123 3088.3    122   11945

Percentage of the requests served within a certain time (ms)
  50%    122
  66%    175
  75%    205
  80%    215
  90%   1709
  95%  11074
  98%  11441
  99%  11715
 100%  11945 (longest request)     <---------- bad response time: more than 11 sec just to return "healthy"

Any direction on where to look to resolve this issue would be helpful.

1 Answer

Hello.

Can you check the CloudWatch metric "RejectedConnectionCount" for the ALB?
This metric is recorded when there is a sudden spike in traffic and some connections could not be processed because of the connection limit on the ALB side.
If this metric is non-zero, it is possible that the number of concurrent connections exceeded what the ALB could handle.
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html

The number of connections that were rejected because the load balancer had reached its maximum number of connections.
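If it helps, the metric can also be queried from the CLI. A sketch, assuming the AWS CLI is configured; the LoadBalancer dimension value (taken from the ALB's ARN) and the time window are placeholders you would replace with your own:

```shell
# Sum RejectedConnectionCount per minute over the load-test window.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RejectedConnectionCount \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --start-time 2024-01-01T10:00:00Z \
  --end-time 2024-01-01T10:30:00Z \
  --period 60 \
  --statistics Sum
```

If no datapoints come back at all, the ALB never hit its connection limit during the test.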

An ALB scales automatically with traffic, but if a large burst of requests arrives before it has scaled out, it may not be able to scale in time and errors may occur.
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html

To ensure that your load balancer can scale properly, verify that each Availability Zone subnet for your load balancer has a CIDR block with at least a /27 bitmask (for example, 10.0.0.0/27) and at least eight free IP addresses per subnet. These eight IP addresses are required to allow the load balancer to scale out if needed. Your load balancer uses these IP addresses to establish connections with the targets. Without them your Application Load Balancer could experience difficulties with node replacement attempts, causing it to enter a failed state.

Note: If an Application Load Balancers subnet runs out of usable IP addresses while attempting to scale, the Application Load Balancer will run with insufficient capacity. During this time old nodes will continue to serve traffic, but the stalled scaling attempt may cause 5xx errors or timeouts when attempting to establish a connection.
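As a sanity check on the subnet sizing quoted above: a /27 holds 32 addresses, and AWS reserves 5 per subnet (the first four and the last), leaving 27 usable, which is comfortably above the eight free IPs the ALB needs to scale. The arithmetic:

```shell
# Address math for a /27 subnet: is there room for the ALB to scale?
PREFIX=27
TOTAL=$((1 << (32 - PREFIX)))   # 2^(32-27) = 32 addresses in a /27
USABLE=$((TOTAL - 5))           # AWS reserves 5 IPs per subnet (first 4 + last)
echo "total=$TOTAL usable=$USABLE"   # total=32 usable=27
```

The same check against your own subnets: compare `usable` minus the IPs already assigned to ENIs with the eight free addresses the ALB requires.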

answered a month ago
  • Hi, I checked the RejectedConnectionCount metric; it had no datapoints recorded, so I'm assuming it is 0.

    On further investigation, I found that the ApacheBench test against the container was actually getting non-200 responses (a 426 status code, although curl worked fine). I then tried another tool (k6) directly against the container, and it turns out the 503s and high latency were present there as well, so it does not look like a load balancer issue.

    To find the root cause, I created another service in isolation, without any ALB, using the same task definition, and found that the high latency and 503s occurred whenever I enabled Service Connect.

    The high-latency issue was resolved when I set the log collection mode to non-blocking, or disabled the logging option presented while adding Service Connect.

    The issue of some requests returning 503 when Service Connect is enabled is still a mystery. Is this normal when using Service Connect? (I tried 4 vCPU and 8 GB RAM as well, with no improvement in the numbers; around 1% of requests fired by the load test tool still get 503.)
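For anyone hitting the same latency symptom: the non-blocking fix described above corresponds to the `logConfiguration` inside `serviceConnectConfiguration` in the ECS service definition, using the awslogs driver's `mode` and `max-buffer-size` options. A sketch only; the log group name, region, prefix, and buffer size below are illustrative values, not taken from the original setup:

```json
{
  "serviceConnectConfiguration": {
    "enabled": true,
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/service-connect-proxy",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "envoy",
        "mode": "non-blocking",
        "max-buffer-size": "25m"
      }
    }
  }
}
```

With the default blocking mode, the Service Connect Envoy proxy can stall when the log stream cannot keep up, which matches the latency pattern seen under load.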
