
How do I troubleshoot Pod scheduling issues that are related to node availability in Amazon EKS?

15-minute read

I experience node availability issues or errors when I try to schedule my Amazon Elastic Kubernetes Service (Amazon EKS) worker Pods. My Pods are stuck in the Pending state.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Insufficient memory or CPU errors

You receive the following error messages when the available CPU or memory on worker nodes isn't enough for your Pod to reach the Running state:

  • "Warning FailedScheduling 16m default-scheduler 0/2 nodes are available: 1 Too many pods, 2 Insufficient cpu, 2 Insufficient memory. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod"
  • "Warning FailedScheduling 11m default-scheduler 0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) were unschedulable. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling"

To resolve insufficient memory or CPU issues, complete the following steps:

  1. Connect to your worker nodes through SSH or Session Manager (a capability of AWS Systems Manager).

  2. Run the following kubectl top node command to get the current CPU and memory usage percentage for each node:

    kubectl top node

    Note: For more information about the preceding command, see kubectl top node on the Kubernetes website.

  3. Run the following kubectl describe command to get information about the total available resources on your node:

    kubectl describe node your_node_name

    Note: Replace your_node_name with the name of your node. For more information about the preceding command, see kubectl describe on the Kubernetes website.
    The following example output shows the resources on a node that you can allocate to a Pod:

    Allocatable:
      cpu: 1930m
      ephemeral-storage: 18242267924
      hugepages-1Gi: 0
      hugepages-2Mi: 0
      memory: 3388304Ki
      pods: 17
  4. Identify the Amazon Elastic Compute Cloud (Amazon EC2) instance class that your worker nodes use. Then, review the default CPU and memory that's available on the instances.

  5. Review the resource limits and requests that you defined in your Deployment. For more information, see Deployments on the Kubernetes website. If you configure limits, then make sure that they meet your workload requirements. The following example output shows resource limits and requests:

    Limits:
      memory: 170Mi
    Requests:
      cpu: 100m
      memory: 70Mi

    Note: The correct values for Limits and Requests can help you determine whether you must scale up your worker node capacity. For more information, see Resource management for Pods and containers on the Kubernetes website.
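The Limits and Requests values in the preceding output come from the Pod template in your Deployment. The following sketch shows where they're set (the container name and values are illustrative, not from your workload):

```yaml
# Hypothetical Deployment fragment; adjust values to your workload
spec:
  template:
    spec:
      containers:
      - name: app               # illustrative name
        resources:
          requests:
            cpu: 100m           # the scheduler reserves this much CPU on a node
            memory: 70Mi        # the scheduler reserves this much memory
          limits:
            memory: 170Mi       # the container is OOM-killed if it exceeds this
```

If the sum of requests across your Pods exceeds the Allocatable values on your nodes, the scheduler can't place new Pods and they stay Pending.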

Network interface quota issues

You receive the following error message when you reach the elastic network interface quota on your node:

"Warning FailedScheduling 117s default-scheduler 0/2 nodes are available: 1 Too many pods. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod"

Note: Each Pod in a Kubernetes cluster has its own IP address. The number of IP addresses that an instance type supports helps determine the maximum number of running Pods on a worker node. If you reach the maximum number of Pods that can run on a worker node, then your Pods might get stuck in the Pending state.

To check the maximum number of Pods that can run on each node, run the following describe-instance-types command:

aws ec2 describe-instance-types --filters "Name=instance-type,Values=c5.*" --query "InstanceTypes[].{ Type: InstanceType, MaxENI: NetworkInfo.MaximumNetworkInterfaces, IPv4addr: NetworkInfo.Ipv4AddressesPerInterface}" --output table

The command's output shows the maximum number of network interfaces and the number of IPv4 addresses per interface for each instance type. The maximum number of Pods that a node supports is lower than the raw IP address total because the primary IP address of each network interface is reserved for the interface itself and isn't assigned to Pods.
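The Amazon VPC CNI derives the default maximum number of Pods per node from those two values. The following sketch applies the formula to example values for a c5.large instance (3 network interfaces, 10 IPv4 addresses per interface):

```shell
# Default max-Pods formula used by the Amazon VPC CNI:
#   max_pods = max_enis * (ipv4_per_eni - 1) + 2
# One IP per interface is the interface's own primary address, and the +2
# accounts for Pods that use host networking (such as kube-proxy).
max_enis=3
ipv4_per_eni=10
max_pods=$(( max_enis * (ipv4_per_eni - 1) + 2 ))
echo "$max_pods"   # 29, which matches the EKS default for c5.large
```

Prefix delegation and custom networking change this calculation, so treat the formula as the default case only.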

For more information, see Amazon EKS recommended maximum Pods for each Amazon EC2 instance type. Also, see Maximum IP addresses per network interface.

For information about the maximum number of Pods that an instance can support, see amazon-vpc-cni-k8s on the GitHub website.

If the available IP addresses on your current worker node instances aren't sufficient, then take the following actions:

  • Scale up the instance class to one that supports more Pods. It's a best practice to check network interface quotas before you scale up.
  • Use Karpenter or Cluster Autoscaler to scale out the node count.

Taint or toleration errors

You receive the following error message when there's an issue with taints or tolerations:

"Warning FailedScheduling 12s default-scheduler 0/2 nodes are available: 2 node(s) had untolerated taint {NodeType: MemoryOptimized}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling"

The nodes in your cluster have a taint (NodeType: MemoryOptimized) that the Pod you want to schedule doesn't tolerate. Scheduling fails because no node that the Pod tolerates is available in your cluster.

To resolve issues with taints or tolerations, complete the following steps:

  1. Run the following kubectl describe command to get the taints that you configured on your node:

    kubectl describe node your_node_name

    Note: Replace your_node_name with your node name. You can schedule a Pod on the node only when the Pod has a matching toleration.

  2. Run the following kubectl describe command to get the tolerations on a Pod:

    kubectl describe pod your_pod_name -n your_namespace

    Note: Replace your_pod_name with your Pod name and your_namespace with your namespace.
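If the Pod should run on the tainted nodes, add a matching toleration to the Pod spec. The following sketch matches the NodeType: MemoryOptimized taint from the earlier error; the NoSchedule effect is an assumption because the event message doesn't show the taint's effect:

```yaml
# Hypothetical toleration for a node tainted with, for example:
#   kubectl taint nodes <node> NodeType=MemoryOptimized:NoSchedule
spec:
  tolerations:
  - key: "NodeType"
    operator: "Equal"
    value: "MemoryOptimized"
    effect: "NoSchedule"   # must match the taint's actual effect
```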

Affinity errors

You receive the following error message when you have an issue with a node affinity:

"Warning FailedScheduling 13s default-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling"

To resolve affinity issues, complete the following steps:

  1. Update the Pod's node affinity configuration to use preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution.

    Note: With requiredDuringSchedulingIgnoredDuringExecution, the Kubernetes scheduler schedules the Pod only on a node that complies with the affinity rule. With preferredDuringSchedulingIgnoredDuringExecution, the scheduler prefers a compliant node but still schedules the Pod when it can't find one.

  2. Add nodes to your cluster that have the matching affinity so that the Kubernetes scheduler can find a node to schedule the Pod on.

  3. Remove the node selector terms from your Deployment, or update your node group to a value that meets the condition from your Deployment. Choose the appropriate option based on your requirements and the availability of nodes on your Kubernetes cluster.

    The following example output shows a node selector configuration in a Deployment:

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: instance-az
              operator: In
              values:
              - az1
              - az2
  4. Run the following kubectl describe command to get the details of the node:

    kubectl describe node your_node_name

    Note: Replace your_node_name with your node name.

    The following example shows node details:

    Name: ip-10-1-1-172.ap-south-1.compute.internal
    Roles: <none>
    Labels: beta.kubernetes.io/arch=amd64
                 beta.kubernetes.io/instance-type=t3.medium
                 beta.kubernetes.io/os=linux
                 instance-az=az1

Note: The key value that you defined in your Deployment must be on your node so that Kubernetes can place the Pod on your node.
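If the affinity doesn't have to be strict (step 1), a preferred version of the same rule lets the scheduler fall back to other nodes. The following is a sketch of the earlier rule rewritten with preferredDuringSchedulingIgnoredDuringExecution:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1              # 1-100; higher weights are preferred more strongly
      preference:
        matchExpressions:
        - key: instance-az
          operator: In
          values:
          - az1
          - az2
```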

Pod security group errors

You receive the following error message when there are issues with your Pod security groups:

"Warning FailedScheduling 3s default-scheduler 0/3 nodes are available: 1 Insufficient vpc.amazonaws.com/pod-eni, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming Pod, 2 Preemption is not helpful for scheduling"

Note: When you use security groups for Pods in Amazon EKS, the VPC Resource Controller creates a branch network interface for each Pod. The VPC Resource Controller also creates a trunk network interface in the worker node to manage the branch network interface.

Important: To use Pod security groups, you must launch your worker nodes on Nitro-based instance types that support trunk interfaces, such as the C5, M5, and R5 families. T-family instances such as T3 are Nitro-based but don't support trunk interfaces. Trunk interface support is required to accommodate the branch network interfaces for the Pods.

To resolve the Pod security group issue, complete the following steps:

  1. Run the following kubectl describe command to confirm that your worker node instance supports a trunk interface:

    kubectl describe node your_node_name

    Note: Replace your_node_name with your node name.

  2. In the node description that's in the output from the preceding command, check the events from the VPC Resource Controller.
    The following output shows an example of an incompatible instance:

    Events:
      Type     Reason                   Age                    From                             Message
      ----     ------                   ----                   ----                             -------
      Normal   RegisteredNode           22m                    node-controller                  Node ip-192-168-59-29.us-east-2.compute.internal event: Registered Node ip-192-168-59-29.us-east-2.compute.internal in Controller
      Normal   ControllerVersionNotice  21m                    vpc-resource-controller          The node is managed by VPC resource controller version v1.4.9
      Normal   NodeReady                21m                    kubelet                          Node ip-192-168-59-29.us-east-2.compute.internal status is now: NodeReady
      Warning  Unsupported              13m (x16 over 21m)     vpc-resource-controller          The instance type t3.small is not supported for trunk interface (Security Group for Pods)
        

Fargate Pod support errors

You receive the following error message when AWS Fargate doesn't support your Pod:

"Warning FailedScheduling <unknown> fargate-scheduler Pod not supported on Fargate: volumes not supported: mysql-db not supported because: PVC mysql-db not bound"

Pods that run on Fargate and use an Amazon Elastic File System (Amazon EFS) volume have the same requirements as regular Pods. Fargate doesn't schedule a Pod when the necessary PersistentVolume and PersistentVolumeClaim resources aren't in the cluster.

To resolve the Fargate issue, complete the following steps:

  1. Create and correctly configure the required PersistentVolume and PersistentVolumeClaim resources in the cluster. For more information, see Configure a Pod to use a PersistentVolume for storage on the Kubernetes website.
  2. Check for issues with your PersistentVolume and PersistentVolumeClaim resources when the volume is bound to the Pod. For example, you might have EFS file system, permissions, or other configuration issues.
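A PersistentVolumeClaim binds only when a matching PersistentVolume exists. The following sketch shows a statically provisioned Amazon EFS pair for the mysql-db claim from the earlier error; the file system ID and storage class name are placeholders that you must replace with your own values:

```yaml
# Hypothetical static EFS PersistentVolume and PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-db-pv            # illustrative name
spec:
  capacity:
    storage: 5Gi               # required field; EFS doesn't enforce it
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc     # placeholder; must match the claim below
  csi:
    driver: efs.csi.aws.com
    volumeHandle: <your-efs-file-system-id>   # placeholder
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-db               # name from the error message
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: efs-sc     # placeholder; must match the PV above
  resources:
    requests:
      storage: 5Gi
```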

Fargate support for volume type errors

You receive the following error message when your Pod contains a definition for an unsupported volume type:

"Warning FailedScheduling <unknown> fargate-scheduler Pod not supported on Fargate: volumes not supported: admin-panel is of an unsupported volume Type"

Important: You can't mount Amazon Elastic Block Store (Amazon EBS) volumes on Fargate Pods.

To resolve the unsupported volume type issue, take the following actions:

  • Make sure that the Pod definition specifies an EFS volume and not an EBS volume or other unsupported volume type.
  • Verify that you correctly configured your PersistentVolume and PersistentVolumeClaim resources and they reference an EFS file system.
  • Check for EFS file system, permissions, or configuration issues that might cause the volume not to attach to the Pod.

Security context on Fargate errors

You receive the following error message when a security context isn't valid:

"Warning FailedScheduling 109s fargate-scheduler Pod not supported on Fargate: invalid SecurityContext fields: Privileged"

To resolve the security context issue, make sure that your Fargate Pods adhere to the following requirements:

  • The Pod definition can't specify the privileged container security context. Fargate doesn't accept privileged containers.
  • The Pod definition can't mount the host's root filesystem (/) into the container. Fargate isolates containers from the host's root filesystem.
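The following sketch shows the container-level field that triggers the "invalid SecurityContext fields: Privileged" error; remove it, or set it to false, before you schedule the Pod on Fargate (the container name is illustrative):

```yaml
# Fargate rejects Pods whose containers request privileged mode
spec:
  containers:
  - name: app                # illustrative name
    securityContext:
      privileged: true       # not allowed on Fargate; remove or set to false
```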

Incorrect scheduler on Fargate errors

If you use Amazon EKS with Fargate profiles or custom schedulers, then you might not have the correct Kubernetes scheduler to manage your Pods.

To resolve scheduler issues, complete the following steps:

  1. Review your Pod events.
    The following example shows that the default scheduler can't find a node for the Pod because a taint is associated with the Fargate compute type:

    Events:
    Type     Reason            Age        From               Message
    ----     ------            ----       ----               -------
    Warning  FailedScheduling  109s       default-scheduler  0/2 nodes are available: 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.

    Note: Fargate Pods must be scheduled by the fargate-scheduler, not the default-scheduler.

  2. Run the following kubectl get mutatingwebhookconfigurations command to verify that you installed the Fargate webhook on the cluster:

    kubectl get mutatingwebhookconfigurations 0500-amazon-eks-fargate-mutation.amazonaws.com

If the output shows the webhook configuration but the fargate-scheduler still doesn't pick up the Pod, then check for other custom mutating webhook configurations that might conflict. The Kubernetes API server calls mutating webhooks in alphabetical order by name, so a custom webhook that runs after the Fargate webhook and modifies the Pod's scheduling settings can cause the issue.

Port conflict errors

You receive the following error message when there's a port conflict on the host machine where the Kubernetes scheduler is scheduling the Pod:

"Warning FailedScheduling 10s default-scheduler 0/2 nodes are available: 1 node(s) didn't have free ports for requested pod ports. 1 node(s) didn't match the requested hostname"

The error occurs because you configured multiple Pods with the same hostPort value, or because another process on the host machine is already using that port.

When you configure a Pod with hostNetwork: true, the containers running inside the Pod have direct access to the network interfaces of the host machine. As a result, the container ports are directly exposed to the external network at the corresponding host ports.

To resolve port conflict issues, complete the following steps:

  1. Run the following kubectl describe command to check the port configurations:

    kubectl describe deploy nginx-deployment

    The following example output shows where the port information is:

        spec:
          hostNetwork: true
          containers:
          - name: nginx
            image: nginx:latest
            ports:
            - containerPort: 80
              hostPort: 80
  2. Modify your host network or host port. For example, you can set hostNetwork to the default value of false to assign a virtual network interface to the Pod. You can also update the Pod configuration to use a different host port that isn't already in use on the host machine. To allow the kernel to automatically assign a random host port to your Pods, set hostPort to 0 such as in the following example:

      Containers:
       nginx:
        Image:        nginx:latest
        Port:         80/TCP
        Host Port:    0/TCP
  3. Schedule the Pod on a different node. If the port conflict is specific to the host machine where the Kubernetes scheduler places your Pod, then schedule the Pod on a different node in your cluster. You can also use affinity logic. For more information, see Place Kubernetes Pods on Amazon EKS by using node affinity, taints, and tolerations.
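Putting the preceding steps together, a corrected version of the earlier nginx spec drops hostNetwork and hostPort so that the Pod gets its own network namespace and no node-level port is reserved. The following is a sketch; expose the Pod through a Service instead of a hostPort:

```yaml
spec:
  hostNetwork: false      # the default; the Pod gets a virtual network interface
  containers:
  - name: nginx
    image: nginx:latest
    ports:
    - containerPort: 80   # no hostPort, so no port conflict on the node
```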

Conflicting node affinity errors

You receive the following error message when you have conflicting node affinity rules:

"Warning FailedScheduling 57s (x3 over 3m3s) default-scheduler 0/15 nodes are available: 1 node(s) were unschedulable, 14 node(s) had volume node affinity conflict"

This error occurs when the Pod's volume requirements aren't met for the available nodes because of conflicting node affinity rules.

To resolve the conflicting node affinity issue, complete the following steps:

  1. Identify the conflicting node affinity rules in your Pod and volume definitions. For example, your Pod might have a node affinity rule that requires the node-type=high-storage label, but the volume requires the node-type=ssd-storage label.

  2. To resolve the rule conflict, update the node affinity rules for either the Pod or the volume. For example, update the Pod's node affinity rule to match the volume's requirement, such as in the following example:

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: node-type
              operator: In
              values:
              - ssd-storage
  3. Confirm that your node labels match your affinity rules. Then, verify that the available nodes in your cluster have the correct labels to meet the new requirements. To list the nodes and their labels, run the following kubectl get nodes command:

    kubectl get nodes --show-labels

    Note: Make sure that you have available nodes with the required node-type=ssd-storage label.

  4. Modify your node groups, or add new node groups. If you use node groups and the nodes don't have the required labels, then update your existing node groups. Or, create new node groups with the appropriate labels and instance types.

No available volume zone errors

You receive the following error messages when you try to deploy a Pod in your cluster and you don't have an available volume zone:

  • "Warning FailedScheduling 60s (x5 over 60s) default-scheduler Pod has unbound PersistentVolumeClaims (repeated 2 times)"
  • "Warning FailedScheduling 2s (x16 over 59s) default-scheduler 0/2 nodes are available: 2 node(s) had no available volume zone"

The Pod's volume requirements aren't met because the available volume zones and the Pod's volume zone constraints don't match.

To resolve the volume zone issue, complete the following steps:

  1. Review your Pod and volume definitions to identify the specific volume zone constraints that cause your volume zone issue. For example, check whether you set your Pod or volume to use an Availability Zone where you aren't running worker nodes.

  2. Run the following kubectl get pv command to check what volume zones are available in your cluster:

    kubectl get pv -o jsonpath='{range .items[*]}{.spec.awsElasticBlockStore.availabilityZone}{"\n"}{end}' | sort | uniq

    Note: PersistentVolumes are cluster-scoped resources, so you don't need a namespace flag.

    Note: Make sure that the output list includes the volume zones that your Pod or volume require.

  3. Update the volume zone constraints. If the required volume zone isn't available, then update the Pod or volume configuration. The configuration must restrict the creation of volumes only to Availability Zones where worker nodes are running.

  4. Confirm that node groups span multiple Availability Zones. If your node groups aren't spanning multiple zones, then create new node groups that include the required volume zones.
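One common way to implement step 3 with the Amazon EBS CSI driver is a StorageClass that delays volume creation until a Pod is scheduled, so that the volume is created in an Availability Zone that has worker nodes. The following is a sketch; the StorageClass name is illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer        # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # bind the volume where the Pod lands
```

Reference this StorageClass from your PersistentVolumeClaims so that new volumes can't be created in zones without nodes.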

AWS Official | Updated 8 months ago