I have a number of users connecting to a database, some IAM-enabled and others using password authentication. Generally, connection and authentication are fine for both types of user, and our applications using the database are absolutely fine.
We have started getting intermittent auth issues (seemingly just for IAM users) and connection timeouts.
For our IAM-enabled users, we generate auth tokens using RdsSigner in a Lambda, which I have read relies upon the STS service to get the auth token we need, so we don't control this other than passing the token along during connection.
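For reference, the token-minting step in our Lambda looks roughly like the sketch below (TypeScript with @aws-sdk/rds-signer; the hostname, username, and region are placeholders, not our real values):

```typescript
import { Signer } from "@aws-sdk/rds-signer";

// Placeholder connection details -- substitute your own endpoint, user, and region.
const signer = new Signer({
  hostname: "mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com",
  port: 5432,
  username: "iam_db_user",
  region: "eu-west-1",
});

// Produces a short-lived presigned token, which is then used as the DB password.
const token = await signer.getAuthToken();
```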
Occasionally, we seem to be getting instances where rdsauthproxy indicates a failure, and then other clients of the database experience connection terminations. I realise that the connections are effectively unrelated, but there seems to be a consistent pattern of auth failures followed by connection terminations.
I appreciate I am only sending across parts of a log, so this may be hard to follow, but I'll try anyway.
We receive something like this first:
* We are completely uploaded and fine
2024-01-03 07:33:41 UTC:x.x.x.x(40004):[x]@[y]:[4657]:LOG: could not accept SSL connection: EOF detected
2024-01-03 07:33:43 UTC::@:[396]:LOG: checkpoint starting: time
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/html;charset=utf-8
< Content-Length: 0
<
* Connection #0 to host rdsauthproxy left intact
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/html;charset=utf-8
< Content-Length: 0
<
* Connection #0 to host rdsauthproxy left intact
And then a little later (for another request):
2024-01-03 07:34:01 UTC:172.31.83.57(47075):x@y[4606]:LOG: pam_authenticate failed: Permission denied
2024-01-03 07:34:02 UTC:172.31.83.57(47075):x@y[4606]:FATAL: canceling authentication due to timeout
We then also see:
2024-01-03 07:34:26 UTC:x.x.x.x(36200):[x]@[y]:[4682]:LOG: could not accept SSL connection: EOF detected
2024-01-03 07:34:26 UTC:x.x.x.x(37276):[x]@[y]:[4673]:LOG: could not receive data from client: Connection reset by peer
And finally a whole cluster of empty replies (I guess because these requests were all made at the same instant):
2024-01-03 07:35:05 UTC:x.x.x.x(43842):[x]@[y]:[4893]:LOG: connection received: host=x.x.x.x port=43842
* Empty reply from server
* Empty reply from server
* Empty reply from server
(plus 16 more identical "Empty reply from server" lines)
2024-01-03 07:35:07 UTC:x.x.x.x(35529):[x]@[y]:[4894]:LOG: connection received: host=x.x.x.x port=35529
* Closing connection
* Closing connection
* Closing connection
(plus 16 more identical "Closing connection" lines)
* Empty reply from server
* Empty reply from server
Can anyone shed any light on what is happening?
Thank you for taking the time to reply.
Everything you have said is absolutely correct and can't be refuted.
Additional information for onlookers: each and every connection used a newly generated token, so token expiry was not the cause of the issues in our case.
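To be concrete, each connection was opened along the lines of the sketch below (TypeScript again; connectWithFreshToken is a hypothetical helper, the connection details are placeholders, and CA handling is simplified):

```typescript
import { Signer } from "@aws-sdk/rds-signer";
import { Client } from "pg";

// Hypothetical helper: mint a brand-new token and connect with it, so no
// connection ever reuses an older (possibly expired) token.
async function connectWithFreshToken(): Promise<Client> {
  const host = "mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com"; // placeholder
  const signer = new Signer({
    hostname: host,
    port: 5432,
    username: "iam_db_user",
    region: "eu-west-1",
  });
  const client = new Client({
    host,
    port: 5432,
    user: "iam_db_user",
    database: "mydb",
    password: await signer.getAuthToken(), // fresh token for this connection
    ssl: true, // IAM auth requires SSL; use the RDS CA bundle in production
  });
  await client.connect();
  return client;
}
```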
It turns out that the appearance of SSL connection timeouts in the logs, followed by further analysis showing low FreeableMemory, gave a strong indication that the database SKU should be upgraded to provide more free memory. We upgraded from db.t3.micro to db.t3.medium and the issue has gone away completely.
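For anyone who wants to check the same thing, pulling the FreeableMemory metric is enough to see the trend; a minimal sketch (TypeScript with @aws-sdk/client-cloudwatch; the instance identifier is a placeholder):

```typescript
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "eu-west-1" });

// Minimum and average freeable memory over the last 24 hours, in 5-minute buckets.
const { Datapoints } = await cw.send(
  new GetMetricStatisticsCommand({
    Namespace: "AWS/RDS",
    MetricName: "FreeableMemory",
    Dimensions: [{ Name: "DBInstanceIdentifier", Value: "my-db-instance" }], // placeholder
    StartTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
    EndTime: new Date(),
    Period: 300,
    Statistics: ["Minimum", "Average"],
  })
);

// Values are reported in bytes; a minimum that keeps approaching zero is a
// strong hint that the instance class is undersized.
console.log(Datapoints);
```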