How do I troubleshoot the pod status in Amazon EKS?
My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on Amazon Elastic Compute Cloud (Amazon EC2) instances or on a managed node group are stuck. I want to get my pods in the "Running" or "Terminated" state.
Resolution
Important: The following steps apply only to pods launched on Amazon EC2 instances or in a managed node group. These steps don't apply to pods launched on AWS Fargate.
Find the status of your pod
To troubleshoot the pod status in Amazon EKS, complete the following steps:
- To get the status of your pod, run the following command:
$ kubectl get pod
- To get information from the Events history of your pod, run the following command:
$ kubectl describe pod YOUR_POD_NAME
- Based on the status of your pod, complete the steps in the relevant section that follows.
Your pod is in the Pending state
Note: The example commands in the following steps are in the default namespace. For other namespaces, append -n YOURNAMESPACE to the command.
Pods can be stuck in a Pending state because of insufficient resources or because you defined a hostPort. For more information, see Pod phase on the Kubernetes website.
If the worker nodes have insufficient resources, then delete unnecessary pods or add more capacity to the worker nodes. When you don't have enough resources in your cluster, use the Kubernetes Cluster Autoscaler to automatically scale your worker node group.
Insufficient CPU example:
$ kubectl describe pod frontend-cpu
Name:         frontend-cpu
...
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  22s (x14 over 13m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
Insufficient Memory example:
$ kubectl describe pod frontend-memory
Name:         frontend-memory
...
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  80s (x14 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient memory.
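The CPU and memory amounts that the scheduler tries to reserve come from the resource requests in the pod spec. The following is a minimal sketch of where those requests are defined; the pod name, image, and values are illustrative, so adjust them to what your workload needs and what your nodes can provide:
apiVersion: v1
kind: Pod
metadata:
  name: frontend-cpu          # illustrative name
spec:
  containers:
  - name: app
    image: nginx              # illustrative image
    resources:
      requests:
        cpu: "500m"           # the scheduler needs a node with this much free CPU
        memory: "256Mi"       # and this much free memory
      limits:
        cpu: "1"
        memory: "512Mi"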
If you defined a hostPort for your pod, then follow these best practices:
- Because the hostIP, hostPort, and protocol combination must be unique, specify a hostPort only when it's necessary.
- If you specify a hostPort, then schedule no more pods than there are worker nodes.
Note: When you bind a pod to a hostPort, there are a limited number of places that you can schedule a pod.
The following example shows the output of the describe command for a pod that's in the Pending state, frontend-port-77f67cff67-2bv7w. The pod is unscheduled because the requested host port isn't available on any worker node in the cluster:
$ kubectl describe pod frontend-port-77f67cff67-2bv7w
Name:           frontend-port-77f67cff67-2bv7w
...
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/frontend-port-77f67cff67
Containers:
  app:
    Image:      nginx
    Port:       80/TCP
    Host Port:  80/TCP
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  11s (x7 over 6m22s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports.
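The host port that blocks scheduling is defined in the ports section of the container in the pod spec. The following minimal sketch shows where hostPort is set; the image and port values are illustrative. Remove the hostPort if it isn't required, or make sure that no more pods request it than there are worker nodes:
apiVersion: v1
kind: Pod
metadata:
  name: frontend-port         # illustrative name
spec:
  containers:
  - name: app
    image: nginx              # illustrative image
    ports:
    - containerPort: 80
      hostPort: 80            # only one pod per node can bind host port 80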
If you can't schedule the pods because the nodes have taints that the pod doesn't allow, then the example output is similar to the following:
$ kubectl describe pod nginx
Name:         nginx
...
Status:       Pending
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  8s (x10 over 9m22s)  default-scheduler  0/3 nodes are available: 3 node(s) had taint {key1: value1}, that the pod didn't tolerate.
To check the taints on your nodes, run the following command:
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
To retain your node taints, specify a toleration for a pod in the PodSpec. For more information, see Concepts on the Kubernetes website. Or, append - to the end of the taint value to remove the node taint:
$ kubectl taint nodes NODE_NAME key1=value1:NoSchedule-
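If you keep the taint, then a toleration similar to the following sketch in the PodSpec lets the pod schedule onto the tainted nodes. The key, value, and effect match the example taint; replace them with your own taint's values:
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"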
If your pods are still in the Pending state, then complete the steps in the Additional troubleshooting section.
Your container is in the Waiting state
Your container might be in the Waiting state because of an incorrect Docker image or an incorrect repository name. Or, the container might be in the Waiting state because the image doesn't exist or you don't have permissions to pull it.
To confirm that the image and repository name are correct, log in to Docker Hub, Amazon Elastic Container Registry (Amazon ECR), or another container image repository. Compare the repository or image from the repository with the repository or image name that's specified in the pod specification. If the image doesn't exist or you lack permissions, then complete the following steps:
- Verify that the image that's specified is available in the repository and that the correct permissions are configured to allow you to pull the image.
- To confirm that you can pull the image and that there aren't general networking or repository permission issues, manually pull the image. Use Docker to pull the image from one of the Amazon EKS worker nodes:
$ docker pull yourImageURI:yourImageTag
- To verify that the image exists, check that both the image and tag are in either Docker Hub or Amazon ECR.
Note: If you use Amazon ECR, then verify that the repository policy allows image pulls for the NodeInstanceRole. Or, verify that the AmazonEC2ContainerRegistryReadOnly managed policy is attached to the worker node IAM role.
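To review these permissions, you can use the AWS Command Line Interface (AWS CLI). The repository and role names in the following commands are placeholders; replace them with your own values:
$ aws ecr get-repository-policy --repository-name YOUR_REPOSITORY_NAME
$ aws iam list-attached-role-policies --role-name YOUR_NODE_INSTANCE_ROLE_NAME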
The following example shows a pod in the Pending state with the container in the Waiting state because of an image pull error:
$ kubectl describe po web-test
Name:         web-test
...
Status:       Pending
IP:           192.168.1.143
Containers:
  web-test:
    Container ID:
    Image:          somerandomnonexistentimage
    Image ID:
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ErrImagePull
...
Events:
  Type     Reason     Age                From                                                 Message
  ----     ------     ----               ----                                                 -------
  Normal   Scheduled  66s                default-scheduler                                    Successfully assigned default/web-test to ip-192-168-6-51.us-east-2.compute.internal
  Normal   Pulling    14s (x3 over 65s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Pulling image "somerandomnonexistentimage"
  Warning  Failed     14s (x3 over 55s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Failed to pull image "somerandomnonexistentimage": rpc error: code = Unknown desc = Error response from daemon: pull access denied for somerandomnonexistentimage, repository does not exist or may require 'docker login'
  Warning  Failed     14s (x3 over 55s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Error: ErrImagePull
If your containers are still in the Waiting state, then complete the steps in the Additional troubleshooting section.
Your pod is in the CrashLoopBackOff state
If you receive the "Back-Off restarting failed container" output message, then your container might have exited soon after Kubernetes started the container.
To look for errors in the logs of the current pod, run the following command:
$ kubectl logs YOUR_POD_NAME
To look for errors in the logs of the previous pod that crashed, run the following command:
$ kubectl logs --previous YOUR_POD_NAME
For a multi-container pod, append the container name at the end. For example:
$ kubectl logs [-f] [-p] (POD | TYPE/NAME) [-c CONTAINER]
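For example, to view the logs of the previous failed run of one container in a multi-container pod, the command looks similar to the following; the pod and container names are placeholders:
$ kubectl logs YOUR_POD_NAME -c YOUR_CONTAINER_NAME --previous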
If the liveness probe doesn't return a Successful status, then verify that the liveness probe is correctly configured for the application. For more information, see Configure probes on the Kubernetes website.
The following example shows a pod in a CrashLoopBackOff state because the application exits after it starts:
$ kubectl describe pod crash-app-b9cf4587-66ftw
Name:         crash-app-b9cf4587-66ftw
...
Containers:
  alpine:
    Container ID:   containerd://a36709d9520db92d7f6d9ee02ab80125a384fee178f003ee0b0fcfec303c2e58
    Image:          alpine
    Image ID:       docker.io/library/alpine@sha256:e1c082e3d3c45cccac829840a25941e679c25d438cc8412c2fa221cf1a824e6a
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Oct 2021 12:26:21 +1100
      Finished:     Tue, 12 Oct 2021 12:26:21 +1100
    Ready:          False
    Restart Count:  4
...
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Started  97s (x4 over 2m25s)  kubelet  Started container alpine
  Normal   Pulled   97s                  kubelet  Successfully pulled image "alpine" in 1.872870869s
  Warning  BackOff  69s (x7 over 2m21s)  kubelet  Back-off restarting failed container
  Normal   Pulling  55s (x5 over 2m30s)  kubelet  Pulling image "alpine"
  Normal   Pulled   53s                  kubelet  Successfully pulled image "alpine" in 1.858871422s
The following is an example of a liveness probe that fails for the pod:
$ kubectl describe pod nginx
Name:         nginx
...
Containers:
  nginx:
    Container ID:   containerd://950740197c425fa281c205a527a11867301b8ec7a0f2a12f5f49d8687a0ee911
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:06e4235e95299b1d6d595c5ef4c41a9b12641f6683136c18394b858967cd1506
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Oct 2021 13:10:06 +1100
      Finished:     Tue, 12 Oct 2021 13:10:13 +1100
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:8080/ delay=3s timeout=1s period=2s #success=1 #failure=3
...
Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Normal   Pulled     2m25s                  kubelet  Successfully pulled image "nginx" in 1.876232575s
  Warning  Unhealthy  2m17s (x9 over 2m41s)  kubelet  Liveness probe failed: Get "http://192.168.79.220:8080/": dial tcp 192.168.79.220:8080: connect: connection refused
  Normal   Killing    2m17s (x3 over 2m37s)  kubelet  Container nginx failed liveness probe, will be restarted
  Normal   Pulling    2m17s (x4 over 2m46s)  kubelet  Pulling image "nginx"
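In the preceding output, the container listens on port 80, but the liveness probe targets port 8080, so every probe attempt is refused. A probe that points at the port and path that the application actually serves looks similar to the following sketch; the path and timing values are illustrative:
livenessProbe:
  httpGet:
    path: /                   # illustrative path; use your application's health endpoint
    port: 80                  # must match the port that the container listens on
  initialDelaySeconds: 3
  periodSeconds: 2
  timeoutSeconds: 1
  failureThreshold: 3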
If your pods are still in the CrashLoopBackOff state, then complete the steps in the Additional troubleshooting section.
Your pod is in the Terminating state
If your pods are stuck in a Terminating state, then check the health of the node where the pod runs, and check the pod's finalizers. A finalizer is a function that performs termination processing before the pod transitions to Terminated. For more information, see Finalizers on the Kubernetes website. To check the finalizers of the terminating pod, run the following command:
$ kubectl get po nginx -o yaml
apiVersion: v1
kind: Pod
metadata:
  ...
  finalizers:
  - sample/do-something
  ...
In the preceding example, the pod transitions to Terminated only after the finalizer sample/do-something is removed. Generally, a custom controller processes the finalizer and then removes it. The pod then transitions to the Terminated state.
To resolve this issue, check whether the custom controller's pod is running correctly. Resolve any issues with the controller's pod, and let the custom controller complete the finalizer process. The pod then automatically transitions to the Terminated state. Or, run the following command to edit the pod and remove the finalizer yourself:
$ kubectl edit po nginx
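Alternatively, a patch similar to the following clears the finalizers field without opening an editor. Use this only as a last resort, because it skips any cleanup that the finalizer was supposed to perform; the pod name matches the preceding example:
$ kubectl patch pod nginx -p '{"metadata":{"finalizers":null}}' --type=merge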
Additional troubleshooting
If your pod is still stuck, then complete the following steps:
- To confirm that worker nodes are in the cluster and are in Ready status, run the following command:
$ kubectl get nodes
If the nodes' status is NotReady, then see How can I change the status of my nodes from NotReady or Unknown status to Ready status? If the nodes can't join the cluster, then see How can I get my worker nodes to join my Amazon EKS cluster?
- To check the version of the Kubernetes cluster, run the following command:
$ kubectl version --short
- To check the version of the Kubernetes worker nodes, run the following command:
$ kubectl get node -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
- Confirm that the Kubernetes server version for the cluster matches the version of the worker nodes within an acceptable version skew. For more information, see Version skew policy on the Kubernetes website.
Important: The patch versions can be different between the cluster and worker node, such as v1.21.x for the cluster and v1.21.y for the worker node. If the cluster and worker node versions are incompatible, then use eksctl or AWS CloudFormation to create a new node group. Or, use a compatible Kubernetes version to create a new managed node group, such as Kubernetes: v1.21, platform: eks.1 and above. Then, delete the node group that contains the incompatible Kubernetes version.
- Confirm that the Kubernetes control plane can communicate with the worker nodes. Check firewall rules against required rules in Amazon EKS security group requirements and considerations. Then, verify that the nodes are in the Ready status.