Sudden increase in Lambda start errors and other issues

Around 5 AM EST I started seeing a large number of Lambda failures that seem to be all over the place. Some examples:

panic: failed to list aws ssm parameters: RequestError: send request failed
caused by: Post "https://ssm.us-east-1.amazonaws.com/": dial tcp 67.220.245.25:443: i/o timeout

goroutine 1 [running]:
github.com/MyOrg/go-smg/env/ssm/autoload.init.0()
	/home/runner/work/go-smg/go-smg/env/ssm/autoload/main.go:30 +0x2b4

INIT_REPORT Init Duration: 120707.48 ms Phase: invoke Status: error Error Type: Runtime.ExitError

Error: Runtime exited with error: exit status 2
Runtime.ExitError

And lots of these:

INIT_REPORT Init Duration: 10009.41 ms Phase: init Status: timeout

This is happening across my 10+ Lambda functions, most of which are Go custom runtime on arm64. A couple of them are also getting PUT timeouts to S3 that look a lot like the Parameter Store errors:

caused by: Put "https://s3.amazonaws.com/logs.smg.gg/sterling/138934064": dial tcp 54.231.194.216:443: i/o timeout
RequestId: 87d9fe03-474d-4112-b415-224cdd5b6a29 Error: Runtime exited with error: exit status 1
Runtime.ExitError

This looked like an AWS outage to me, since it came out of nowhere and I hadn't updated these Lambdas for over a day. I did then update all my Go AWS dependencies, in an attempt to throw things at the wall and see what sticks. I've also raised concurrency limits, raised memory allocation, and redeployed my functions a few times, but nothing seems to change the behavior.

There was a small GitHub service outage this morning that was described as very similar to what I'm experiencing, but they've resolved it and my issue remains.

asked 6 months ago
1 Answer
Accepted Answer

So I will answer my own question, with a "who knows what went wrong".

Starting with the init errors: my functions made requests to things like Parameter Store during the cold start (init) portion of the program, and those requests were timing out, causing the function init to fail.

Those requests were timing out because of VPC configuration issues, and adding a VPC endpoint for SSM fixed that. I also added endpoints for a bunch of other services I make requests to, plus VPC gateway endpoints for S3 and DynamoDB.

I was still seeing intermittent request failures to third parties, though, and VPC issues still seemed to be the culprit, so I reorganized my VPC to look exactly like the template VPC that the new VPC wizard gives you.

[Image: VPC Wizard]

The important things here are that there is a NAT gateway per AZ and that the NAT gateways live in the public subnets. The route tables then point the private subnets at these NAT gateways, and the public subnets at the Internet Gateway.
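
The layout described above can be written down as a small consistency check. This is only an illustrative sketch: Subnet, NATGateway, and checkLayout are hypothetical types and names (not AWS SDK types), and the subnet names and AZs are made up.

```go
package main

import "fmt"

// Subnet and NATGateway are hypothetical types for illustration only.
type Subnet struct {
	Name         string
	AZ           string
	Public       bool
	DefaultRoute string // target of the 0.0.0.0/0 route: "igw" or a NAT gateway name
}

type NATGateway struct {
	Name   string
	Subnet string // name of the subnet the NAT gateway is placed in
}

// checkLayout returns one message per subnet whose default route does not
// match the wizard-style layout: public subnets route to the internet
// gateway, private subnets route to a NAT gateway that sits in a public
// subnet in the same AZ.
func checkLayout(subnets []Subnet, nats []NATGateway) []string {
	subnetByName := make(map[string]Subnet)
	for _, s := range subnets {
		subnetByName[s.Name] = s
	}
	natByName := make(map[string]NATGateway)
	for _, n := range nats {
		natByName[n.Name] = n
	}

	var problems []string
	for _, s := range subnets {
		if s.Public {
			if s.DefaultRoute != "igw" {
				problems = append(problems, s.Name+": public subnet should route to the internet gateway")
			}
			continue
		}
		nat, ok := natByName[s.DefaultRoute]
		if !ok {
			problems = append(problems, s.Name+": private subnet should route to a NAT gateway")
			continue
		}
		host, ok := subnetByName[nat.Subnet]
		switch {
		case !ok || !host.Public:
			problems = append(problems, s.Name+": NAT gateway "+nat.Name+" is not in a public subnet")
		case host.AZ != s.AZ:
			problems = append(problems, s.Name+": NAT gateway "+nat.Name+" is in a different AZ")
		}
	}
	return problems
}

func main() {
	subnets := []Subnet{
		{"public-a", "us-east-1a", true, "igw"},
		{"public-b", "us-east-1b", true, "igw"},
		{"private-a", "us-east-1a", false, "nat-a"},
		{"private-b", "us-east-1b", false, "igw"}, // deliberately misconfigured
	}
	nats := []NATGateway{{"nat-a", "public-a"}, {"nat-b", "public-b"}}

	for _, p := range checkLayout(subnets, nats) {
		fmt.Println(p)
	}
}
```

The same-AZ check reflects the wizard template rather than a hard requirement; a NAT gateway in another AZ can still route traffic, it just is not the layout I ended up with.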

No idea why my config stopped working after 7 years, but it did, and those errors are definitely gone after correctly setting up my VPC and subnets.

answered 6 months ago
EXPERT reviewed a month ago
