liveness and readiness timeouts

Question

We are currently running EKS 1.24 with amazon-k8s-cni-init:v1.12.6 and amazon-k8s-cni:v1.12.6. Several of our application pods restart continuously. On closer inspection, the pods are being terminated for the following reason:

Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

We are checking the application for memory leaks in the meantime. However, we also see warning events in the namespace where the app is deployed:

Warning   Unhealthy   pod/   Readiness probe failed: Get "http://": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning   Unhealthy   pod/   Liveness probe failed: Get "http://": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The timeout is currently set to 1 second for both the readiness and liveness probes. What should the value ideally be? Are there known cases where pods were killed and restarted because a liveness probe timed out prematurely? Additionally, are there known cases where high memory utilization or OOM caused liveness probes to fail (since OOM may prevent requests from creating additional sockets)?

Answer

Hello hvb,

It is possible that your liveness/readiness probes are timing out before your application is fully in the Ready state (that is, before it has started responding to health checks). The timeout for your liveness/readiness probes should be chosen based on your application's expected performance.

Increase the `timeoutSeconds` value and see whether the probes succeed. If they do, you can conclude that the probes were timing out prematurely, and then investigate why your application cannot respond within the expected duration. If the probes still fail after increasing `timeoutSeconds`, the failures could be the result of another underlying problem, which has to be dealt with separately.

You can also try increasing the `initialDelaySeconds` parameter to give your application enough time to start up before the probes begin.
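As a rough illustration of the two settings above, here is a minimal sketch of a container spec fragment. The endpoint path, port, and every numeric value are assumptions for illustration only, not recommendations; pick values based on how your application actually behaves under load. For reference, `timeoutSeconds` defaults to 1 second in Kubernetes, which matches what you have today.

```yaml
# Fragment of a container spec (goes under spec.containers[] in a Pod template).
# The path, port, and all numbers below are hypothetical placeholders.
livenessProbe:
  httpGet:
    path: /healthz          # hypothetical health endpoint
    port: 8080              # hypothetical container port
  initialDelaySeconds: 30   # wait before the first probe so the app can start up
  timeoutSeconds: 5         # raised from the 1-second default
  periodSeconds: 10         # how often the probe runs (10 is the default)
  failureThreshold: 3       # consecutive failures before a restart (3 is the default)
readinessProbe:
  httpGet:
    path: /healthz          # hypothetical; may differ from the liveness endpoint
    port: 8080
  initialDelaySeconds: 10
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

With these example values, a container would be restarted only after three consecutive liveness failures, each allowed up to 5 seconds to respond.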

I hope this info is helpful to you. Please comment if you have further questions, and I will be happy to help!
