Just a quick remark on performance when comparing your on-premises compute platform with Lambda: did you fully consider the sizing of your Lambda runtime environments? Lambda functions are sized in terms of memory, but there is a direct relationship between memory and compute capacity, so CPU power increases linearly as you raise the memory setting. I just wanted to point this out in case the need for parallel invocations could be substantially reduced by running more powerful Lambda runtime environments.
https://docs.aws.amazon.com/lambda/latest/operatorguide/computing-power.html
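As a minimal sketch of what that looks like in practice (assuming the AWS SDK for Java v2; the function name and memory value here are placeholders to adjust after measuring), raising the memory setting is a small configuration change:

```java
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.UpdateFunctionConfigurationRequest;
import software.amazon.awssdk.services.lambda.model.UpdateFunctionConfigurationResponse;

public class RaiseLambdaMemory {
    public static void main(String[] args) {
        try (LambdaClient lambda = LambdaClient.create()) {
            // Raising memorySize also raises the CPU share allocated to the function,
            // so each invocation can do more work in the same wall-clock time.
            UpdateFunctionConfigurationResponse response = lambda.updateFunctionConfiguration(
                    UpdateFunctionConfigurationRequest.builder()
                            .functionName("my-simulation-function") // placeholder name
                            .memorySize(3008)                       // MB; placeholder value
                            .build());
            System.out.println("New memory size: " + response.memorySize() + " MB");
        }
    }
}
```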
If the trace you added is complete, it shows the initial TCP SYN segment (SYN = synchronise = request to open a connection) being sent at time index 17.193426 seconds and retransmitted for nearly 20 seconds until a SYN+ACK segment (= acknowledgement of the connection request) arrives at index 36.225522 seconds. After that, everything else happens almost instantly.
This is happening at the TCP layer, below even TLS, and well below anything that would let the regional Lambda service know who is calling or which function's concurrency settings to consider. It sounds like a network issue, but modern network stacks have so many acceleration features that it's hard to rule out an interaction with something in the operating system of your orchestrator component.
Your diagram shows the orchestrator still residing on premises, so you have some sort of firewall or router in between translating your local 192.168.1.* addresses to one or more internet-routable, public IPs. I'd guess that a limit is being hit on that firewall/NAT device. Do you have logs available on it?
Thanks for your help. While in development I'm behind a BT Business Hub router, which is incapable of logging to that extent. Moving my orchestrator over to our actual business office, which is behind a Ubiquiti Dream Machine, appears to allow much higher throughput (getting the full 1k concurrent calls), so it must be something to do with the router. I guess that'll have to do for now!
Since you ran the Wireshark trace already: did you see that, after the TLS negotiation, an encrypted payload of roughly the expected request size was sent to Lambda and, aside from TCP acknowledgements, the response took a long time to start arriving? Or does the response appear to arrive over the network much faster than the Java program is receiving/processing it? If you can make this distinction, it should be a reliable indication of whether it's the SDK (or something else on the client) or the Lambda platform that is taking time to respond.
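If it helps to separate the two, here is a minimal sketch (assuming the AWS SDK for Java v2 synchronous client; the function name and payload are placeholders) that timestamps the invoke() call so the client-side duration can be lined up against the packet timestamps in the trace:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.InvokeRequest;
import software.amazon.awssdk.services.lambda.model.InvokeResponse;

public class TimedInvoke {
    public static void main(String[] args) {
        try (LambdaClient lambda = LambdaClient.create()) {
            InvokeRequest request = InvokeRequest.builder()
                    .functionName("my-simulation-function")             // placeholder name
                    .payload(SdkBytes.fromUtf8String("{\"runId\": 1}")) // placeholder payload
                    .build();

            long start = System.nanoTime();
            InvokeResponse response = lambda.invoke(request);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // Compare this figure with the trace: if the wire shows the response arriving
            // quickly but this number is large, the delay is on the client side.
            System.out.println("Status " + response.statusCode() + " after " + elapsedMs + " ms");
        }
    }
}
```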
Hi, thanks for the comment. I'm a bit of a noob at Wireshark, I must admit, but I've seen something that looks suspicious when looking at a flow graph for the interactions on one of the streams. I've attached it to the main question. The first few messages in the stream are marked
[TCP Retransmission]
and seem to retry every second, then back off to every 2, 4, and 8 seconds, before it finally appears to execute normally. Is this expected?
Thanks, yes, I had considered that as my main route when we want to scale to more than 100k sims. I just want to prove that it's possible to run 1000 concurrently before assuming I can do the same with beefier Lambdas.
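For reference, a minimal sketch of one way to drive that many concurrent invocations from the orchestrator, assuming the AWS SDK for Java v2 async client over Netty (the function name and payload are placeholders; the HTTP client's maxConcurrency has to be raised explicitly, since its default is far below 1,000):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.lambda.LambdaAsyncClient;
import software.amazon.awssdk.services.lambda.model.InvokeRequest;
import software.amazon.awssdk.services.lambda.model.InvokeResponse;

public class ConcurrentInvokes {
    public static void main(String[] args) {
        // Raise the HTTP client's concurrency limit so 1,000 requests can be in flight at once.
        try (LambdaAsyncClient lambda = LambdaAsyncClient.builder()
                .httpClientBuilder(NettyNioAsyncHttpClient.builder().maxConcurrency(1_000))
                .build()) {

            List<CompletableFuture<InvokeResponse>> futures = new ArrayList<>();
            for (int i = 0; i < 1_000; i++) {
                InvokeRequest request = InvokeRequest.builder()
                        .functionName("my-simulation-function")                     // placeholder name
                        .payload(SdkBytes.fromUtf8String("{\"runId\": " + i + "}")) // placeholder payload
                        .build();
                futures.add(lambda.invoke(request));
            }

            // Wait for all invocations to complete and count the successes.
            long ok = futures.stream()
                    .map(CompletableFuture::join)
                    .filter(r -> r.statusCode() == 200)
                    .count();
            System.out.println(ok + " of 1000 invocations returned 200");
        }
    }
}
```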