Unable to open more than ~50k connections between two EC2 instances in the same VPC

I have two Linux instances in my custom VPC, and I'm trying to see how many concurrent TCP connections I can get between them. One is in a public subnet with an elastic IP, and one is in a private subnet and only has a private IP.

Connections between these two servers work normally. I am using a tool that sets up one instance as a simple server, and the other as a client. The client will continually make TCP connections to the server and hold them open. After about 51k connections have been made, the client stops connecting. After a few minutes, I get "connection timed out" and the client dies.
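Roughly, the client side of the test behaves like this minimal sketch (not the actual tool; the server IP and port list are placeholders):

```python
# Minimal sketch of the client behaviour (not the actual tool): open TCP
# connections across several destination ports and hold all of them open.
import socket

SERVER_IP = "10.0.1.100"        # placeholder: private IP of the server
PORTS = range(5000, 5100)       # placeholder: the server's listening ports

held = []                       # keep references so the sockets stay open
try:
    while True:
        for port in PORTS:
            s = socket.create_connection((SERVER_IP, port), timeout=10)
            held.append(s)
            if len(held) % 1000 == 0:
                print(f"{len(held)} connections held")
except OSError as e:
    print(f"stopped after {len(held)} connections: {e}")
```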

I've raised the file descriptor limit on both servers and verified that ulimit -a shows the increased number. I have checked for ephemeral port exhaustion and see no evidence of that (the tool uses multiple ports for the server to listen on so it should easily be able to get more than 51k connections.)
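For what it's worth, I'm also checking the limit from inside the test process itself, along these lines (a sketch; the target value is arbitrary):

```python
# Sketch: verify (and if needed raise) the open-file limit from inside the
# process, since `ulimit -a` only reflects the shell it is run from.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")

TARGET = 1_000_000  # arbitrary target for this test
if soft < TARGET:
    # Raising the soft limit up to the hard limit needs no privileges;
    # raising the hard limit itself requires root/CAP_SYS_RESOURCE.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(TARGET, hard), hard))
    print("soft limit now", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```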

When I run the client externally (outside of AWS, on a Linode instance) I can get up to about 250k simultaneous connections to my private EC2 instance. Whatever is causing the issue doesn't seem to apply to traffic inbound from the internet. However, going the other way (from the public EC2 instance to the external server) I again hit the 51k limit.

I'm tearing my hair out trying to figure this out. Anybody have any ideas?

1 Answer

There are two things to check here:

First, TCP port numbers are 16-bit values, which means that (all other things aside) you can open at most about 65k TCP connections from A to B if you are using the same destination port on B (say, port 80). In practice it's fewer than that, since Linux's default ephemeral port range (32768-60999) only provides about 28k source ports. You can open a lot more than that (kernel limits permitting) if you are using different destination ports, because each connection only needs a unique (source IP, source port, destination IP, destination port) tuple.
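As a quick back-of-the-envelope check on the client side (a sketch; assumes a Linux client with the usual /proc layout):

```python
# Rough arithmetic: how many outbound connections the ephemeral port range
# allows per destination, and across N listening ports. The 100-port count
# mirrors the setup described in the question.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())

per_destination = high - low + 1    # one ephemeral port per (dst IP, dst port)
listening_ports = 100

print(f"ephemeral range {low}-{high}: {per_destination} connections per destination")
print(f"across {listening_ports} ports: {per_destination * listening_ports} theoretical maximum")
```

With the defaults that works out to roughly 2.8 million connections across 100 destination ports, so the 51k ceiling isn't coming from the port space itself.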

I'm guessing from your comment about 250k connections that you are using different source and destination port numbers - but I'm a little confused about how you are reaching a private EC2 instance from an external source.

Second: it sounds like you're using a NAT Gateway - and my guess is that your routing within the VPC between the two instances is also (somehow) going through the NAT Gateway, but it probably shouldn't be. NAT Gateways support up to 55,000 simultaneous connections to each unique destination - it's called out in our documentation - and that number is suspiciously close to the 51k you're hitting. So it's worth checking that out.
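One way to confirm: NAT Gateway publishes CloudWatch metrics, and a nonzero ErrorPortAllocation during your test window means the gateway failed to allocate source ports. A rough sketch of pulling it with boto3 (the NAT gateway ID is a placeholder):

```python
# Sketch: check whether the NAT Gateway failed to allocate ports during the
# last hour of testing. Assumes boto3 is configured with credentials and the
# NAT gateway ID below is replaced with your own.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/NATGateway",
    MetricName="ErrorPortAllocation",   # nonzero => port allocation failures
    Dimensions=[{"Name": "NatGatewayId", "Value": "nat-0123456789abcdef0"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```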

  • The test program I'm using opens connections to multiple ports (about 100 different ports) on the server. I've done tests to see the number of ephemeral ports in use and it never gets above about 510 per (src IP, dst IP, dst port) combo. Plus I never get any errors in logs about ephemeral port exhaustion, so I'm pretty sure it's not that.

    I thought the NAT gateway might be the issue as well, but I can't see any traffic being dropped by the NAT gateway in the flow logs.

    As for the inbound connections, I'm not testing those against the private server. I test those using the public instance as the "server". So I just open the ports and use the elastic IP to access the instance. I've also tried going the other way between the EC2 instances (using the public instance as the server and the private one as the client) and I have the same problem.

  • Flow Logs won't show traffic being dropped by the NAT Gateway, so that might still be happening.
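    If you want to double-check the per-destination connection counts on the instances themselves, something like this rough sketch may help (it parses ss output; the column layout is an assumption based on current iproute2 versions):

```python
# Sketch: count established TCP connections per peer (IP:port) by parsing
# `ss` output. Assumes iproute2's ss is installed; with a state filter the
# columns are Recv-Q, Send-Q, local address, peer address.
from collections import Counter
import subprocess

out = subprocess.run(
    ["ss", "-tn", "state", "established"],
    capture_output=True, text=True, check=True,
).stdout

per_destination = Counter()
for line in out.splitlines()[1:]:        # skip the header row
    fields = line.split()
    if len(fields) >= 4:
        per_destination[fields[3]] += 1  # peer address:port column

for dest, count in per_destination.most_common(10):
    print(f"{dest}: {count} connections")
```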
