Calling 1000 lambda instances at once with Java AWS SDK

Hi all, I'm writing a Java Spring application which uses Monte Carlo simulations to predict outcomes from a statistical model. I have previously used on-premises "nodes" to perform the simulations but am now looking to scale with AWS Lambda.

My nodes are powerful machines capable of running 50,000 sims in ~1 second. With 2 nodes I can run 100k simulations in 1 second.

The lambda is less powerful and is capable of 100 simulations in ~1s, so if I want to perform 100,000 simulations, I should spread that across 1000 lambda instances.

See diagrams below for a better understanding:

Previous: [diagram: previous setup]

Proposed: [diagram: proposed setup]

The "Orchestrator" Spring Boot app uses the V2 Java SDK (software.amazon.awssdk), and I'm trying to kick off 1000 asynchronous calls at once. At best, I'm getting around 300 concurrent calls, and the whole process is taking much longer than expected (about 30s). Average completion time is about 2s per Lambda.

[chart: Concurrency and Duration Metrics]

I've been reading various documentation to try to tune the client so that I can make these 1000 calls in parallel. I've even provided my own thread pool with a poolSize of 1000, but still no luck.

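import java.time.Duration;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.core.client.config.ClientAsyncConfiguration;
import software.amazon.awssdk.core.client.config.SdkAdvancedAsyncClientOption;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.lambda.LambdaAsyncClient;

// awsCredentials and threadpool (an Executor with 1000 threads) are defined elsewhere.
LambdaAsyncClient client = LambdaAsyncClient.builder()
        .region(Region.BLA_BLA) // placeholder for the real region
        .httpClientBuilder(
                NettyNioAsyncHttpClient.builder()
                        .maxConcurrency(1000)               // allow up to 1000 concurrent requests
                        .maxPendingConnectionAcquires(1000) // requests allowed to queue for a connection
                        .connectionAcquisitionTimeout(Duration.ofSeconds(20)) // wait for a pooled connection
                        .connectionTimeout(Duration.ofSeconds(20))            // wait for the TCP connect
                        .useNonBlockingDnsResolver(true))
        .credentialsProvider(StaticCredentialsProvider.create(awsCredentials))
        .asyncConfiguration(
                ClientAsyncConfiguration.builder()
                        // executor on which the returned futures are completed
                        .advancedOption(SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR, threadpool)
                        .build())
        .build();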
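To check how many calls are really in flight at once, separately from any SDK tuning, a counter around the returned futures can help. The following is only a minimal sketch using the JDK alone: `FanOutSketch`, `Stats`, `fanOut` and `simulatedInvoke` are hypothetical names, and `simulatedInvoke` is a stand-in that completes after a fixed delay. In the real application the function passed to `fanOut` would wrap the async client's invoke call instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

public class FanOutSketch {

    /** Outcome of one fan-out run. */
    public static final class Stats {
        public final int completed;     // calls that finished without an error
        public final int peakInFlight;  // highest number of calls outstanding at once

        Stats(int completed, int peakInFlight) {
            this.completed = completed;
            this.peakInFlight = peakInFlight;
        }
    }

    /** Starts `calls` async invocations back to back and waits for all of them. */
    public static Stats fanOut(int calls, IntFunction<CompletableFuture<String>> invoke) {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        List<CompletableFuture<Void>> all = new ArrayList<>(calls);

        for (int i = 0; i < calls; i++) {
            // Count the call as in flight before it starts, and record the running maximum.
            peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            CompletableFuture<Void> done = invoke.apply(i).<Void>handle((result, error) -> {
                inFlight.decrementAndGet();
                if (error == null) {
                    completed.incrementAndGet();
                }
                return null;
            });
            all.add(done);
        }

        // Block until every call has either completed or failed.
        CompletableFuture.allOf(all.toArray(new CompletableFuture[0])).join();
        return new Stats(completed.get(), peak.get());
    }

    /** Hypothetical stand-in for a remote call: completes after delayMs without blocking a thread. */
    public static CompletableFuture<String> simulatedInvoke(
            ScheduledExecutorService timer, int id, long delayMs) {
        CompletableFuture<String> future = new CompletableFuture<>();
        timer.schedule(() -> { future.complete("result-" + id); }, delayMs, TimeUnit.MILLISECONDS);
        return future;
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);
        try {
            Stats stats = fanOut(1000, i -> simulatedInvoke(timer, i, 200));
            System.out.println("completed=" + stats.completed + " peakInFlight=" + stats.peakInFlight);
        } finally {
            timer.shutdown();
        }
    }
}
```

If the peak measured this way reaches 1000 on the client while the Lambda concurrency metric stays near 300, the requests are being started but something between the client and the service is holding them up.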

Could it be related to the concurrency scaling rate? It says 1000 per 10s, but does that mean in the first second I can only scale up to 100 lambda instances?

Firing up Wireshark, I can see the requests being made to Amazon, and it's clear they're taking a long time to be processed. Is the Amazon Java SDK blocking while waiting for the response for some reason? Or could it just be that my computer can't process that many concurrent connections (1000 isn't that much traffic!)?

time       src ip        dst ip      protocol  length  info
4.007815   192.168.1.69  3.8.129.52  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
4.007817   192.168.1.69  3.8.129.52  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
4.007818   192.168.1.69  3.8.129.8   TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
4.007819   192.168.1.69  3.8.129.56  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
4.00782    192.168.1.69  3.8.129.56  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
4.029959   192.168.1.69  3.8.129.56  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
4.029962   192.168.1.69  3.8.129.52  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
4.035489   192.168.1.69  3.8.129.9   TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
4.035686   192.168.1.69  3.8.129.9   TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
...
25.010523  192.168.1.69  3.8.129.8   TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
25.010526  192.168.1.69  3.8.129.54  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
25.013757  192.168.1.69  3.8.129.27  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
25.023627  192.168.1.69  3.8.129.56  TLSv1.3   650     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
25.023776  192.168.1.69  3.8.129.52  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
25.045933  192.168.1.69  3.8.129.36  TLSv1.3   650     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
25.109813  192.168.1.69  3.8.129.27  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
25.120254  192.168.1.69  3.8.129.9   TLSv1.3   650     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
25.14439   192.168.1.69  3.8.129.30  TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)

Thanks!

EDIT!!

Lots of retransmission at the start, waiting ~20s before the request is handled normally.

number  time       src ip        dest ip       protocol  length  comment
40279   17.193426  192.168.1.69  3.8.129.55    TCP       78      49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021385807 TSecr=0 SACK_PERM
67825   18.194271  192.168.1.69  3.8.129.55    TCP       78      [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021386807 TSecr=0 SACK_PERM
95346   19.194267  192.168.1.69  3.8.129.55    TCP       78      [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021387808 TSecr=0 SACK_PERM
122934  20.195233  192.168.1.69  3.8.129.55    TCP       78      [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021388809 TSecr=0 SACK_PERM
151995  21.195689  192.168.1.69  3.8.129.55    TCP       78      [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021389809 TSecr=0 SACK_PERM
181983  22.195368  192.168.1.69  3.8.129.55    TCP       78      [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021390809 TSecr=0 SACK_PERM
239798  24.19601   192.168.1.69  3.8.129.55    TCP       78      [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021392810 TSecr=0 SACK_PERM
345785  28.196351  192.168.1.69  3.8.129.55    TCP       78      [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021396810 TSecr=0 SACK_PERM
524196  36.197391  192.168.1.69  3.8.129.55    TCP       78      [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021404811 TSecr=0 SACK_PERM
524825  36.225522  3.8.129.55    192.168.1.69  TCP       74      443 → 49174 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=1452 SACK_PERM TSval=856648105 TSecr=2021404811 WS=256
524843  36.225861  192.168.1.69  3.8.129.55    TCP       66      49174 → 443 [ACK] Seq=1 Ack=1 Win=132480 Len=0 TSval=2021404839 TSecr=856648105
524846  36.226384  192.168.1.69  3.8.129.55    TLSv1.3   498     Client Hello (SNI=lambda.eu-west-2.amazonaws.com)
525459  36.251007  3.8.129.55    192.168.1.69  TCP       66      443 → 49174 [ACK] Seq=1 Ack=433 Win=28160 Len=0 TSval=856648128 TSecr=2021404839
525460  36.251008  3.8.129.55    192.168.1.69  TLSv1.3   1506    Server Hello, Change Cipher Spec, Application Data
525462  36.251009  3.8.129.55    192.168.1.69  TCP       1506    443 → 49174 [ACK] Seq=1441 Ack=433 Win=28160 Len=1440 TSval=856648128 TSecr=2021404839 [TCP segment of a reassembled PDU]
525463  36.25101   3.8.129.55    192.168.1.69  TCP       1506    443 → 49174 [ACK] Seq=2881 Ack=433 Win=28160 Len=1440 TSval=856648128 TSecr=2021404839 [TCP segment of a reassembled PDU]
525464  36.251011  3.8.129.55    192.168.1.69  TLSv1.3   1263    Application Data, Application Data, Application Data
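
The spacing of the SYN attempts for port 49174 in the trace above can be tallied with a few lines of Java. This is just a sketch over the timestamps copied from the trace; the class and method names are made up for illustration.

```java
import java.util.Arrays;

public class SynBackoff {

    // Times (seconds) of the first SYN and each retransmission, copied from the trace.
    public static final double[] SYN_TIMES = {
        17.193426, 18.194271, 19.194267, 20.195233, 21.195689,
        22.195368, 24.19601, 28.196351, 36.197391
    };

    // Time (seconds) at which the SYN+ACK finally arrives.
    public static final double SYN_ACK_TIME = 36.225522;

    /** Gaps between consecutive SYN attempts, rounded to whole seconds. */
    public static long[] gapsInSeconds(double[] times) {
        long[] gaps = new long[times.length - 1];
        for (int i = 0; i < gaps.length; i++) {
            gaps[i] = Math.round(times[i + 1] - times[i]);
        }
        return gaps;
    }

    /** Seconds between the first SYN and the SYN+ACK. */
    public static double totalWait(double[] times, double synAckTime) {
        return synAckTime - times[0];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(gapsInSeconds(SYN_TIMES)));
        // prints [1, 1, 1, 1, 1, 2, 4, 8]
        System.out.printf("%.1f s until SYN+ACK%n", totalWait(SYN_TIMES, SYN_ACK_TIME));
        // prints 19.0 s until SYN+ACK
    }
}
```

So the connection request goes unanswered for about 19 seconds, with the client retrying every second at first and then doubling the interval.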
Ed
asked 4 months ago · 373 views
3 Answers

Just as a quick remark on the topic of performance between your on-premises compute platform and Lambda, did you fully consider the sizing of your Lambda runtime environments? Specifically, while Lambda functions are sized in terms of memory, there's a direct relationship between the amount of memory and compute capacity, meaning you can linearly increase the latter by increasing the former. I just wanted to point this out, in case the need for parallel invocations might be substantially reduced by running more powerful Lambda runtime environments.

https://docs.aws.amazon.com/lambda/latest/operatorguide/computing-power.html
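
As a rough illustration of that point, here is a small sketch of the sizing arithmetic. It takes the 100,000 simulations and the 100 simulations per second per instance from the question, and assumes throughput scales linearly with the compute multiplier. The multipliers themselves are hypothetical, not measured values.

```java
public class LambdaSizing {

    /** Instances needed to run totalSims in a single wave, rounding up. */
    public static int instancesNeeded(int totalSims, int simsPerInstance) {
        if (totalSims < 0 || simsPerInstance <= 0) {
            throw new IllegalArgumentException("totalSims must be >= 0 and simsPerInstance > 0");
        }
        return (totalSims + simsPerInstance - 1) / simsPerInstance; // ceiling division
    }

    public static void main(String[] args) {
        int totalSims = 100_000;          // from the question
        int baselineSims = 100;           // sims per instance in ~1 s, from the question
        int[] multipliers = {1, 2, 4, 8}; // hypothetical compute multipliers (assumption)

        for (int m : multipliers) {
            System.out.printf("%dx compute -> %d instances%n",
                    m, instancesNeeded(totalSims, baselineSims * m));
        }
        // prints:
        // 1x compute -> 1000 instances
        // 2x compute -> 500 instances
        // 4x compute -> 250 instances
        // 8x compute -> 125 instances
    }
}
```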

EXPERT
Leo K
answered 4 months ago
  • Thanks, yes I had considered that as my main route when we want to scale at more than 100k sims. Just want to prove that it's possible to run 1000 concurrently before assuming I can do the same with beefier lambdas.

Accepted Answer

If the trace you added is complete, it shows the initial TCP SYN segment (SYN = synchronise, a request to open a connection) being sent at time index 17.193426 seconds and retransmitted for about 19 seconds until a SYN+ACK segment (the acknowledgement of the connection request) arrives at index 36.225522 seconds. After that, everything else happens nearly instantly.

This is happening at the low network layer of TCP, below even TLS, let alone anything that would allow the regional Lambda service even to know who's calling or which function's concurrency attributes to consider. It sounds like a network issue, but there are so many acceleration features in modern network stacks that it's hard to say if there might not be an interaction with something in the operating system of your orchestrator component.

Your diagram shows the orchestrator still residing on premises, so you have some sort of firewall or router in between to translate your local 192.168.1.* addresses to one or more internet-routable, public IPs. I'd guess that a limit is being hit on that firewall/NAT device. Have you logs available on it?
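
One way to test that guess with the SDK out of the picture is to time raw TCP connects opened in parallel from the same machine. The sketch below uses only the JDK; `ConnectProbe` and its method names are hypothetical, and the host, connection count and timeout are values you would choose yourself. If many connects take seconds (or fail) on one network but not on another, the bottleneck is below the application.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConnectProbe {

    /**
     * Opens `connections` TCP connections to host:port in parallel and returns each
     * connect latency in milliseconds, or -1 where the connect failed or timed out.
     */
    public static long[] connectTimesMillis(String host, int port, int connections, int timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(connections); // one thread per connect
        List<Future<Long>> pending = new ArrayList<>(connections);
        for (int i = 0; i < connections; i++) {
            Callable<Long> task = () -> timeOneConnect(host, port, timeoutMs);
            pending.add(pool.submit(task));
        }

        long[] millis = new long[connections];
        for (int i = 0; i < connections; i++) {
            try {
                millis[i] = pending.get(i).get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                millis[i] = -1;
            } catch (ExecutionException e) {
                millis[i] = -1;
            }
        }
        pool.shutdown();
        return millis;
    }

    /** Times a single TCP handshake; the socket is closed straight afterwards. */
    private static long timeOneConnect(String host, int port, int timeoutMs) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            return -1; // refused, unreachable, or timed out
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("usage: java ConnectProbe <host>");
            return;
        }
        // 1000 connections and a 40 s timeout are arbitrary choices for this sketch.
        long[] times = connectTimesMillis(args[0], 443, 1000, 40_000);
        long slow = 0;
        long failed = 0;
        for (long ms : times) {
            if (ms < 0) {
                failed++;
            } else if (ms >= 1000) {
                slow++; // at least one SYN retransmission is likely
            }
        }
        System.out.println("connects=" + times.length + " slow(>=1s)=" + slow + " failed=" + failed);
    }
}
```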

EXPERT
Leo K
answered 4 months ago
  • Thanks for your help. I am behind a BT Business Hub router while in development, which is incapable of logging to that extent. Moving my orchestrator over to our actual business office, which is behind a Ubiquiti Dream Machine, appears to allow much higher throughput (getting the full 1k concurrent calls), so it must be something to do with the router. I guess that'll have to do for now!


Since you ran the Wireshark trace already, did you see that after the TLS negotiation, an encrypted payload roughly the expected size of the request got sent to Lambda, and aside from TCP acknowledgements, the response took long to start to arrive? Or does a response appear to arrive over the network much more quickly than the Java program is receiving/processing it? If this distinction can be made, it should be a reliable indication on whether it's the SDK (or something else on the client) or the Lambda platform that is taking time to respond.

EXPERT
Leo K
answered 4 months ago
  • Hi, thanks for the comment. I'm a bit of a noob at Wireshark, I must admit, but I've seen something that looks suspicious when looking at a flow graph for interactions on one of the streams. I've attached it to the main question. The first few messages in the stream are marked [TCP Retransmission] and seem to retry every second, then back off to every 2, 4, 8 seconds, before the request finally appears to execute normally. Is this expected?
