Introduction
Amazon EKS is a fully managed service that you can use to run and manage Kubernetes clusters in the AWS Cloud. Amazon EKS offloads the operational complexity of managing the Kubernetes control plane to AWS. That way, teams can focus on building and deploying applications rather than managing the underlying infrastructure.
Kubernetes networking is a crucial component of the platform that supports communication between the components that you manage, such as Pods, services, and external endpoints. However, its complexity makes troubleshooting networking issues difficult even for experienced administrators, particularly at scale. For example, if networking isn't properly configured, then events such as scaling up and scaling down might lead to intermittent network connectivity, throttling, and disk issues.
The AWS Support team solves complex problems for customers across industries. This article explores some of the potential recurring issues that can happen when you don't properly configure Kubernetes networking. The article also outlines a preemptive approach to address these issues, regardless of your workload size or cloud footprint.
Problem
The customer noticed the following error in the application Pod logs:
Error: connect ETIMEDOUT 192.0.2.26:443
The customer informed us that they had scaled down worker nodes and workload replicas over the weekend to reduce costs. Then, at the start of the week, they scaled up several nodes and Pods, which overloaded network resources. During the initial review, we identified connection failures between services as a likely cause. Some Pods were also stuck in the ContainerCreating state because of IP address assignment errors.
To address this issue, we suggested that the customer restart the Amazon Virtual Private Cloud (Amazon VPC) CNI Pods that manage Pod networking and the Istio service mesh components on all nodes. However, this was only a workaround that temporarily restored connectivity. For a permanent solution, we needed to investigate further.
Solution overview
To resolve the connectivity issues, we investigated both the control plane and data plane components of the customer's Amazon EKS cluster. To rule out Amazon EKS service issues, we first examined the control plane. Specifically, we reviewed the API server to identify potential API throttling at the control plane level. Then, we investigated the data plane and its logs to check for issues at the kubelet level. The logs indicated throttling of Amazon Elastic Compute Cloud (Amazon EC2) API calls, issues in the Amazon VPC CNI, and I/O throttling on the nodes' Amazon Elastic Block Store (Amazon EBS) volumes.
Step 1: Investigate the Amazon EKS control plane
First, we examined the control plane components, focusing on API throttling. We analyzed API Priority and Fairness (APF) metrics to check whether a high number of simultaneously starting Pods had overloaded the API server. In our review of the Amazon EKS API server, we found no evidence of throttling, which indicated that the control plane didn't cause the issue. For more information, see API Priority and Fairness in the EKS Best Practices Guides. Also, the control plane scales automatically when metrics, such as the number of worker nodes and the size of the etcd database, exceed defined limits. For more information, see Amazon EKS improves control plane scaling and update speed by up to 4x.
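As a sketch of how you can perform a similar check yourself, the standard Kubernetes APF metrics can be queried directly from the API server's metrics endpoint. A sustained nonzero rejected-requests count would suggest APF throttling:

```shell
# Query the API server metrics and filter for APF throttling indicators
# (requires permission to read the /metrics endpoint)
kubectl get --raw /metrics | grep -E \
  'apiserver_flowcontrol_(rejected_requests_total|current_inqueue_requests)'
```
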
Step 2: Investigate the Amazon EKS data plane
The EKS Logs Collector is a useful tool for troubleshooting worker node issues in Amazon EKS. This script collects relevant logs and system information from worker nodes that you can use for problem identification and resolution. Because the script is open source and hosted on GitHub, you can review exactly what information it collects. The next sections discuss the issues that we identified during the investigation of the Amazon EKS data plane.
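On a Linux worker node, the collector can be run as follows. The URL reflects the script's location in the awslabs/amazon-eks-ami repository at the time of writing; verify the current path before use:

```shell
# Download and run the EKS Logs Collector on a worker node
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/main/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh
# The script writes a bundle such as /var/log/eks_<instance-id>_<timestamp>.tar.gz,
# which you can attach to a support case
```
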
Investigation and best practices
Step 1: Check node bootstrapping and user data script logs
We identified the following error that was related to Transport Layer Security (TLS) handshake failures from the cloud-init logs:
http: TLS handshake error from 198.51.100.139:45160: no serving certificate available for the kubelet
We noticed that the hostname was hardcoded in the bootstrap script, which caused issues in the initialization process. The hardcoded hostname might have interfered with node setup and the kubelet's certificate signing request (CSR) process, potentially causing delays or failures in registering the node and adding it to the Kubernetes cluster.
Although this issue wasn't directly related to the main problem that the customer reported, we followed a comprehensive approach to help them. Our objective was not only to resolve the specific issue at hand, but also to identify and address any potential future issues that we could help proactively prevent.
Other user data script issues included the handling of iptables rules and conflicts between the Docker and containerd services. The user data script flushed the iptables rules and managed the Docker service, even though containerd was the default runtime. These actions disrupted networking components such as kube-proxy, which configures iptables rules during worker node bootstrapping. The disruption affected Pod communication, routing, and the IP assignment functionality of kube-proxy and the VPC CNI.
Customer node logs indicated that the Amazon EC2 user data script included commands that resulted in errors. The VPC CNI plugin used custom networking mode to increase the number of IPv4 addresses available for Pods in the Amazon EKS clusters. When you turn on this mode, you must update the use-max-pods and max-pods values for the kubelet. This update prevents the scheduler from placing more Pods on a node than the kubelet has IP addresses for. The limit is lower in this mode because one of the elastic network interface (ENI) attachments is used for the node itself and can't share its allocated IP addresses with Pods.
We noticed that use-max-pods was set to the default value of true and max-pods value was set to the default value of 110 in the customer's Amazon EC2 user data script. These values must be updated based on the instance type that the customer uses when the nodes are in a self-managed node group. Amazon EKS provides a script that you can download and run to determine the recommended maximum number of Pods for each instance type. For more information, see Amazon EKS recommended maximum Pods for each Amazon EC2 instance type.
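To illustrate the arithmetic behind the recommendation, the following sketch applies the VPC CNI formula, max Pods = ENIs × (IPv4 addresses per ENI − 1) + 2, with one ENI subtracted for custom networking. The instance figures used here (an m5.large with 3 ENIs and 10 IPv4 addresses per ENI) are an example; use the AWS-provided calculator script for authoritative values for your instance type:

```shell
# Max-Pods formula used for the Amazon VPC CNI, with one ENI reserved
# because custom networking moves Pods off the primary ENI.
ENIS=3            # ENIs supported by the instance type (example: m5.large)
IPS_PER_ENI=10    # IPv4 addresses per ENI (example: m5.large)
RESERVED_ENIS=1   # ENIs reserved for custom networking
MAX_PODS=$(( (ENIS - RESERVED_ENIS) * (IPS_PER_ENI - 1) + 2 ))
echo "$MAX_PODS"  # prints 20 for this example
```

Without custom networking (RESERVED_ENIS=0), the same formula yields 29, which matches the default max-pods value for an m5.large. The hardcoded default of 110 that the customer used ignores both numbers.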
Best practices for node bootstrapping
It's important to make sure that the bootstrap script includes only the necessary configurations. Unnecessary customizations and hostname overrides might cause complexity and potential issues. Simplify the bootstrap script to reduce the risk of misconfigurations and improve node initialization reliability.
For the kubelet max-pods calculation, the customer used Karpenter as a cluster autoscaler. Therefore, we recommended that they configure the karpenter-global-settings configmap with aws.reservedENIs set to 1. For more information, see Pod density in Karpenter documentation.
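As a sketch, the setting can be applied with a ConfigMap patch. The ConfigMap name and namespace below match the Karpenter versions that used karpenter-global-settings; newer releases moved these settings elsewhere, so check the documentation for your version:

```shell
# Set aws.reservedENIs so Karpenter subtracts one ENI in its Pod-density math
kubectl patch configmap karpenter-global-settings -n karpenter \
  --type merge -p '{"data":{"aws.reservedENIs":"1"}}'
```
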
Step 2: Check kubelet logs
The kubelet has a default limit of five queries per second for pulling container images from registries. However, when many Pods from the same deployment are scheduled on a node simultaneously, the Pods try to pull the same container image concurrently. The customer's kubelet logs indicated that this occurred and caused the number of pull queries per second to exceed the default limit of five. As a result, some Pods failed to start with the "ErrImagePull: pull QPS exceeded" error.
Best practices for kubelet configuration
To resolve the "QPS exceeded" error, we suggested that the customer increase the registryPullQPS setting in the kubelet configuration to a higher value, such as 50. Increasing the registry pull rate limit can prevent failures when multiple Pods concurrently request image pulls.
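A minimal sketch of that change on an EKS-optimized AMI, where the kubelet reads /etc/kubernetes/kubelet/kubelet-config.json. The file path and the registryBurst value are assumptions; adjust them for custom AMIs, and note that registryBurst must be greater than registryPullQPS:

```shell
# Raise the kubelet's image-pull rate limits, then restart the kubelet
CFG=/etc/kubernetes/kubelet/kubelet-config.json
sudo jq '.registryPullQPS = 50 | .registryBurst = 100' "$CFG" > /tmp/kubelet-config.json
sudo mv /tmp/kubelet-config.json "$CFG"
sudo systemctl restart kubelet
```
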
Step 3: Check Amazon EC2 API call throttling
We checked for potential throttling at the Amazon EC2 API level caused by the simultaneous launch of numerous Pods. To set up Pod networking, the Amazon VPC CNI calls APIs such as AssignPrivateIpAddresses and CreateNetworkInterface, and these calls can be throttled when many Pods launch at once.
AWS CloudTrail events and VPC CNI logs confirmed throttling of API calls during the launch of the Pods. The CloudTrail events showed ThrottlingException errors, while the VPC CNI logs reported throttling during the same period.
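For example, you can count recent throttled calls for one of the affected APIs with the AWS CLI. The event name and error strings below are illustrative; the VPC CNI calls several EC2 APIs, and throttled EC2 calls can surface as either error code:

```shell
# Count recent CreateNetworkInterface events that returned a throttling error
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateNetworkInterface \
  --max-results 50 \
  --query 'Events[].CloudTrailEvent' --output text \
  | grep -cE 'ThrottlingException|RequestLimitExceeded'
```
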
Best practices to mitigate Amazon EC2 API call throttling
To mitigate Amazon EC2 API call throttling, we created a support case to request an increase in the rate quotas for the throttled APIs. For more information, see How do I increase the service quota of my Amazon EC2 resources?
To learn strategies for managing API throttling and to get guidance on creating dashboards for API metrics that are available in Amazon CloudWatch, see Managing and monitoring API throttling in your workloads.
Step 4: Check Amazon VPC CNI logs
Further log analysis revealed an issue with the Amazon VPC CNI's communication with the Kubernetes API server. During initialization, the L-IPAM daemon (IPAMD) requires access to the Kubernetes API server through the ClusterIP service. However, IPAMD relies on kube-proxy to establish this connectivity through iptables rules and routing.
The issue occurred because kube-proxy was still initializing the iptables rules when IPAMD tried to connect to the ClusterIP service. Because the iptables rules weren't yet set up on the instance, connections between IPAMD and the API server timed out. We noticed the following error in the logs:
Unable to reach API Server (...) net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
We also noticed the following error in the logs:
Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused
This error indicated that IPAMD wasn't reachable by other components that tried to communicate with it over gRPC port 50051.
Best practices for the Amazon VPC CNI add-on
The cluster had the self-managed version of the Amazon VPC CNI add-on installed. To resolve the communication issues, we suggested that the customer configure the CLUSTER_ENDPOINT environment variable. This parameter specifies the cluster endpoint to use for connecting to the API server without relying on kube-proxy. Specifying this optional parameter might also improve the initialization time for the Amazon VPC CNI. The managed version of the Amazon VPC CNI add-on provides this environment variable by default.
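For the self-managed add-on, the variable can be set on the aws-node DaemonSet. The cluster name below is a placeholder that you would replace with your own:

```shell
# Look up the cluster's API server endpoint, then set it on the VPC CNI DaemonSet
ENDPOINT=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.endpoint' --output text)   # "my-cluster" is a placeholder
kubectl set env daemonset aws-node -n kube-system "CLUSTER_ENDPOINT=${ENDPOINT}"
```
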
We also suggested that the customer configure another Amazon VPC CNI parameter, IP_COOLDOWN_PERIOD. This parameter controls how long an IP address stays in cooldown after its Pod terminates before the address can be reassigned. Reducing the cooldown period releases IP addresses faster, which helps in scenarios where you frequently create and stop Kubernetes Jobs. However, it's crucial to balance promptly releasing IP addresses against maintaining cluster stability.
We also recommended configuring the WARM_ENI_TARGET parameter to allocate sufficient IP addresses on worker node elastic network interfaces. This configuration helps reduce the delay in obtaining IP addresses from the Amazon EC2 APIs when you launch new Pods.
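Both parameters are environment variables on the same DaemonSet. The values below are illustrative starting points rather than universal recommendations; as an assumption for this sketch, IP_COOLDOWN_PERIOD is shortened from its default of 30 seconds, and WARM_ENI_TARGET is raised from its default of 1:

```shell
# Shorten the IP cooldown and keep two warm ENIs' worth of addresses available
kubectl set env daemonset aws-node -n kube-system \
  IP_COOLDOWN_PERIOD=10 \
  WARM_ENI_TARGET=2
```

A higher WARM_ENI_TARGET trades some unused IP address capacity for faster Pod startup, so tune it against your subnet sizes.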
Also, we suggested that the customer upgrade the Kubernetes components and add-ons to versions that are aligned with the supported Kubernetes version. These upgrades make sure that the cluster benefits from the latest features, bug fixes, security updates, and performance improvements that AWS provides.
Step 5: Check Amazon EBS volume throttling
On further investigation, we also identified issues that were caused by Amazon EBS volume throttling during the Pod startup phase. The logs bundle's io_throttling.txt file indicated I/O throttling on the Amazon EBS root volumes that might cause delays and disruptions during Pod startup. This throttling affected the Amazon VPC CNI Pods within the Kubernetes cluster. The CNI binary that sets up the Pod network for communication runs on the node's root file system. The kubelet invokes the binary when you add a new Pod or remove an existing Pod from the node.
The IO Use Volume Percent metric represents the average percentage of I/O operations per second (IOPS) that an Amazon EBS volume uses compared to the maximum IOPS that's provisioned for the volume. The following graph indicates 99% Amazon EBS I/O usage during node boot time. The spike in the graph correlates with the Amazon VPC CNI starting and the Pods failing with the IPAMD "connection refused" error.

Best practices to mitigate Amazon EBS throttling
We suggested the best practice of distributing I/O-heavy workload data across multiple EBS volumes attached to each worker node, for example by placing the container runtime's state directory on a dedicated volume. Distributing I/O operations across volumes reduces the likelihood of hitting an individual volume's throughput or IOPS limits and minimizes the risk of throttling.
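A hedged sketch of that layout on a node with a second EBS volume attached as /dev/nvme1n1. The device name, file system, and containerd state path are assumptions that depend on your AMI and runtime, and this should run during bootstrapping, before workloads start:

```shell
# Move the container runtime state onto a dedicated EBS volume (fresh node)
sudo mkfs -t xfs /dev/nvme1n1
sudo systemctl stop containerd
sudo mkdir -p /var/lib/containerd
sudo mount /dev/nvme1n1 /var/lib/containerd
echo '/dev/nvme1n1 /var/lib/containerd xfs defaults 0 2' | sudo tee -a /etc/fstab
sudo systemctl start containerd
```
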
Lessons learned
Managing a Kubernetes cluster is challenging, but following best practices and keeping your cluster and add-ons up to date can significantly simplify the process. Adhering to the official documentation and recommended guidelines helps ensure smooth and secure operation and minimizes potential issues and vulnerabilities. Keeping add-ons and worker nodes updated to the latest supported Kubernetes version maintains stability, security, and access to the latest features and bug fixes.
API throttling is a concern in large-scale deployments, requiring proactive monitoring and management to prevent performance bottlenecks. Closely tracking API usage patterns is crucial. In some cases, you might consider a multi-account strategy to isolate workloads and mitigate the effect of excessive API requests on critical components.
Finally, collaborating with AWS Support can be helpful, especially when you encounter complex challenges or need expert guidance. Our AWS Support team's expertise can help you accelerate problem-solving, gain best practice insights, and enhance the overall efficiency and reliability of your environment.
Conclusion
This article explored a customer journey through issues with Kubernetes networking, and shared solutions that AWS Support implemented to resolve these issues. It showcased the complexity of managing large-scale Kubernetes deployments and the importance of using expert guidance and best practices. The collaborative efforts of customers and the AWS Support team resulted in the successful resolution of these issues. We identified the root causes, implemented effective mitigations, and shared valuable lessons learned.
AWS Support engineers and Technical Account Managers (TAMs) can help you with general guidance, best practices, troubleshooting, and operational support on AWS. To learn more about our plans and offerings, see AWS Support.
About the authors

Henrique Santana
Henrique Santana is a Containers Specialist with over 15 years of experience in infrastructure operations. He is skilled in automating workflows and solving problems through user-centered design and emerging technologies. He is currently focusing on containers and container orchestration. Henrique is adept at optimizing resource utilization for high availability and implementing CI/CD pipelines. He is interested in opportunities to further promote container adoption.

Ahmed Derbel
Ahmed Derbel is an Amazon EKS subject matter expert (SME) at AWS. He uses his experience in container technologies to help enterprise customers with technical issues, particularly in production Kubernetes clusters. His expertise shapes his approach to resolving customer challenges effectively.