ECS Tasks Randomly Lose Access To RDS/Redis but not Internet

We have an Amazon ECS cluster that randomly drops all traffic. It happens about once every two weeks, most recently yesterday. All 12 ECS tasks lose connectivity to S3, RDS, and MemoryDB at the same time. They recover at random, then all lose access together again, over and over, for the next 30-40 minutes (sometimes less, it appears). While connectivity is interrupted, health checks fail and new tasks are created; these tasks connect, fail, and restart again and again the entire time. During the event, RDS shows peer connection reset errors arriving in waves: things may work for anywhere from 30 seconds to about two minutes before the next wave of connection resets. Our application's health checks bounce over this period until they flat-line for a few minutes and then suddenly recover. The connection resets appear to come at a quicker and quicker frequency until suddenly it just... clears up (last time after about 40 minutes).

Our train of thought so far:

  • We ruled out the container, since a task restart spins up a new container, and the container also just randomly starts working again after a period of time
  • We ruled out any sort of AWS configuration change, since we're not changing anything: security groups, network ACLs, routes, and subnets are all untouched
  • We considered a sticky DNS issue, but that doesn't explain why connectivity recovers for a short period and then fails again
  • It appears RDS is receiving TCP RST packets. There is zero reason for our app to be sending these randomly, but basically any layer of AWS's software-defined networking could be sending them
  • We have an extremely simple network setup: 5 subnets, the ECS cluster is in us-west-2d, the RDS system is in us-west-2d, and the MemoryDB instance is in us-west-2a and us-west-2b; they're all locally routed within the same subnet
  • We've talked to numerous AWS specialists and have now roped in AWS's own support, who are currently escalating because they checked all of our configuration and agree that it shouldn't be happening
  • We checked various parts of our application (available thread pool, sockets, etc.). Everything is heavily underutilized leading up to and during these events. Task restarts would also clear out any such exhaustion, and it wouldn't explain why a task that has been running for 1-2 minutes fails while the same task running for 3 weeks at a nearly constant 3-4k requests/minute does not.
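
To pin down exactly when each wave starts and stops, and whether DNS answers change during an event, one option is a small probe that runs inside a task (or alongside it) and logs a timestamped line per connection attempt. Below is a minimal sketch in Python using only the standard library. The hostnames and ports in `TARGETS` are placeholders, and `probe`, `format_line`, and `run` are illustrative names, not part of our application. It only tests DNS resolution and the TCP handshake, so it would not catch resets that happen later on an established connection.

```python
import socket
import time
from datetime import datetime, timezone

# Placeholder endpoints (assumptions): substitute the real RDS, MemoryDB,
# and S3 hostnames and ports used by the tasks.
TARGETS = [
    ("my-db.example.internal", 5432),
    ("my-memorydb.example.internal", 6379),
    ("s3.us-west-2.amazonaws.com", 443),
]


def probe(host, port, timeout=3.0):
    """Resolve `host`, then attempt a single TCP connect to `port`."""
    result = {"host": host, "port": port, "ips": [],
              "ok": False, "error": None, "ms": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        # Record every resolved address so changes in DNS answers show up.
        result["ips"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"dns: {exc}"
        return result

    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["ok"] = True
    except OSError as exc:
        # Covers refused connections, resets, and timeouts.
        result["error"] = f"tcp: {type(exc).__name__}: {exc}"
    result["ms"] = round((time.monotonic() - start) * 1000, 1)
    return result


def format_line(result, now=None):
    """Render one probe result as a single UTC-timestamped log line."""
    now = now or datetime.now(timezone.utc)
    status = "OK" if result["ok"] else "FAIL"
    return (f"{now.isoformat()} {status} {result['host']}:{result['port']} "
            f"ips={','.join(result['ips'])} ms={result['ms']} "
            f"err={result['error']}")


def run(targets, interval=5.0):
    """Probe every target forever, sleeping `interval` seconds per round."""
    while True:
        for host, port in targets:
            print(format_line(probe(host, port)), flush=True)
        time.sleep(interval)
```

Calling `run(TARGETS)` starts the loop. Because the timestamps are in UTC, the output can be lined up against the RDS error log and VPC flow logs to see whether the failures hit all three destinations at the same instant and whether the resolved IPs differ between working and failing periods.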
William
asked 7 months ago, 158 views
No Answers
