Without knowing more information, including the specific maximum number of Lambda functions and the maximum number of service instances involved, it is difficult to tell exactly where you may be able to increase throughput.
I suspect something other than networking is capping your throughput at 180K emails per minute. Are your Lambda functions configured to communicate inside your VPC? You might be limited by the available IP space in the subnets. You could also be limited by the number of concurrent Lambda functions. How does your ECS task scale: on connections, CPU, memory, or something else? Are you seeing rejected connections from the Lambda functions that are then retried? Are the tasks themselves being limited by the outbound calls that send the emails, for example through multiple retries? You should instrument your application to get more information.
Here are the quotas for NLB: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-limits.html. The adjustable quotas may be having an impact, but I don't think they are the first place to look. It's possible that the Targets per Availability Zone per Network Load Balancer quota is having some effect, since adding an AZ increased the throughput (and likely the size of the task group in ECS).
Thanks for the attention. In one of the test scenarios there are 72 ECS task instances, tested with 2 AZs and with 3 AZs, and the same setup tested without the load balancer by sending directly to the ECS tasks. There are 360 Lambdas (the limit is 1000 concurrent Lambdas), and they are in the same VPC as the NLB and the ECS cluster. When sending directly to the tasks, they are able to handle more. Also, the limit of ~120K with 2 AZs and ~180K with 3 AZs does not change when increasing ECS capacity or Lambda concurrency. There is no autoscaling in the test scenarios. So my understanding is that something on the NLB side is limiting this...
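As a quick sanity check on those two figures, the arithmetic below divides each reported ceiling by its AZ count. It is only a sketch: the function name is made up for illustration, and the inputs are the approximate numbers quoted above.

```python
def per_az_rate(emails_per_minute, az_count):
    """Return the observed throughput ceiling divided evenly across AZs."""
    return emails_per_minute / az_count


# Approximate ceilings reported in the tests: ~120K with 2 AZs, ~180K with 3 AZs.
for total, azs in [(120_000, 2), (180_000, 3)]:
    per_minute = per_az_rate(total, azs)
    per_second = per_minute / 60
    print(f"{azs} AZs: ~{per_minute:,.0f} emails/min per AZ, "
          f"~{per_second:,.0f} emails/s per AZ")
```

Both cases come out to roughly 60K emails per minute (about 1,000 per second) per AZ, so the ceiling appears to scale with the number of AZs rather than with task count or Lambda concurrency. That by itself does not say which component imposes it.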
I recommend you look at the potential bottlenecks I called out above first before settling on the NLB as the source.
That being said, have you looked at the NLB metrics, specifically ActiveFlowCount, PeakPacketsPerSecond, TCP_Client_Reset_Count, TCP_ELB_Reset_Count, and TCP_Target_Reset_Count? You will likely want to use the Availability Zone dimension to get some further insight.
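One way to pull those metrics per AZ is through the CloudWatch GetMetricData API. The following is a minimal sketch using boto3, not a definitive setup: the load balancer identifier and zone names are placeholders, the choice of statistic per metric is my assumption about what is most useful, and pagination of results is ignored.

```python
from datetime import datetime, timedelta, timezone

# Metric name -> statistic to request (the statistic choice is an assumption).
NLB_METRICS = {
    "ActiveFlowCount": "Average",
    "PeakPacketsPerSecond": "Maximum",
    "TCP_Client_Reset_Count": "Sum",
    "TCP_ELB_Reset_Count": "Sum",
    "TCP_Target_Reset_Count": "Sum",
}


def build_nlb_queries(load_balancer, zones, period=60):
    """Build one GetMetricData query per (metric, AZ) pair.

    load_balancer is the LoadBalancer dimension value, i.e. the trailing
    part of the NLB ARN such as "net/<name>/<id>".
    """
    queries = []
    for zone in zones:
        for name, stat in NLB_METRICS.items():
            queries.append({
                # Ids must start with a lowercase letter and use only
                # letters, digits, and underscores.
                "Id": f"{name}_{zone}".lower().replace("-", "_"),
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/NetworkELB",
                        "MetricName": name,
                        "Dimensions": [
                            {"Name": "LoadBalancer", "Value": load_balancer},
                            {"Name": "AvailabilityZone", "Value": zone},
                        ],
                    },
                    "Period": period,
                    "Stat": stat,
                },
                "ReturnData": True,
            })
    return queries


def fetch_nlb_metrics(load_balancer, zones, minutes=60):
    """Fetch the last `minutes` of data; needs boto3 and AWS credentials."""
    import boto3  # imported here so the query builder works without it

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=build_nlb_queries(load_balancer, zones),
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
    )
    return {
        result["Id"]: list(zip(result["Timestamps"], result["Values"]))
        for result in response["MetricDataResults"]
    }
```

Calling `fetch_nlb_metrics("net/example-nlb/0123456789abcdef", ["us-east-1a", "us-east-1b"])` (placeholder values) during a load test would return one time series per metric and AZ, which makes it easier to see whether resets or flow counts climb in a particular zone as throughput flattens.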
Thanks.
It appears there was an issue with the test setup. The Lambda function had a VPC (Virtual Private Cloud) configuration, and removing the VPC setup from the Lambda function resolved the issue.
Good to know!