
Troubleshooting Amazon EKS networking issues at scale in an Enterprise scenario

13 minute read
Content level: Expert

This article discusses the journey of a customer with the Enterprise Support plan through issues with Amazon Elastic Kubernetes Service (Amazon EKS) networking. The article also shares troubleshooting techniques and solutions that AWS Support implemented to resolve these issues.

Introduction

Amazon EKS is a fully managed service that you can use to run and manage Kubernetes clusters in the AWS Cloud. Amazon EKS offloads the operational complexity of managing the Kubernetes control plane to AWS. That way, teams can focus on building and deploying applications rather than managing the underlying infrastructure.

Kubernetes networking is a crucial component of the platform that supports communication between various components that you manage, such as Pods, services, and external endpoints. However, the complexity involved makes troubleshooting networking issues difficult even for experienced administrators, particularly at a larger scale. For example, if networking isn't properly configured, then events, such as scaling up and scaling down, might lead to intermittent network connectivity, throttling, and disk issues.

The AWS Support team solves complex problems for customers across industries. This article explores some of the potential recurring issues that can happen when you don't properly configure Kubernetes networking. The article also outlines a preemptive approach to address these issues, regardless of your workload size or cloud footprint.

Problem

The customer noticed the following error in the application Pod logs:

Error: connect ETIMEDOUT 192.0.2.26:443

The customer informed us that they scaled down worker nodes and workload replicas during the weekend to reduce costs. Then, at the start of the week, they scaled up several nodes and Pods, overloading network resources. During the initial review, we identified that connection failures between services might be causing the issue. Some Pods might be stuck in the ContainerCreating state because of IP address assignment errors.

To address this issue, we suggested that the customer restart the Amazon Virtual Private Cloud (Amazon VPC) CNI Pods that manage Pod networking and Istio service mesh components on all nodes. However, this is a workaround to temporarily solve the connectivity issue. For a permanent solution, we needed to investigate further.
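As a sketch of that workaround, the Amazon VPC CNI Pods run as the aws-node DaemonSet in kube-system and can be restarted with standard kubectl commands. The istiod Deployment name below assumes a default Istio installation in the istio-system namespace:

```shell
# Identify Pods stuck in ContainerCreating; their events typically include
# an IP assignment failure from the CNI when the node runs out of Pod IPs.
kubectl get pods --all-namespaces | grep ContainerCreating

# Rolling restart of the Amazon VPC CNI Pods on all nodes.
kubectl -n kube-system rollout restart daemonset aws-node

# Rolling restart of the Istio control plane (name assumes a default install).
kubectl -n istio-system rollout restart deployment istiod
```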

Solution overview

To resolve the connectivity issues, we investigated both the control plane and data plane components of the customer's Amazon EKS cluster. To rule out Amazon EKS service-side issues, we started with the control plane. Specifically, we examined the API server to identify potential API throttling at the control plane level. Then, we investigated the data plane and its logs to check for issues at the kubelet level. The logs indicated throttling of Amazon Elastic Compute Cloud (Amazon EC2) API calls made by the Amazon VPC CNI, and IO throttling on the nodes' Amazon Elastic Block Store (Amazon EBS) volumes.

Step 1: Investigate the Amazon EKS control plane

First, we examined the control plane components, focusing on API throttling. We analyzed API Priority and Fairness (APF) to check whether a high number of simultaneously starting Pods had overloaded the API server. In our review of the Amazon EKS API server, we found no evidence of throttling. This indicated that the control plane didn't cause the issue. For more information, see API Priority and Fairness in the EKS Best Practices Guides. Also, the control plane scales automatically when metrics, such as the number of worker nodes and the size of the etcd database, exceed the defined limits. For more information, see Amazon EKS improves control plane scaling and update speed by up to 4x.

Step 2: Investigate the Amazon EKS data plane

The EKS Logs Collector is a useful tool to troubleshoot worker node issues in Amazon EKS. This script collects relevant logs and system information from worker nodes that you can use for problem identification and resolution. Because the script is open source and hosted on GitHub, the information that it collects is transparent. The next section discusses the different issues that we identified during the investigation of the Amazon EKS data plane.
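For reference, the collector is usually downloaded and run directly on the affected worker node. The URL below reflects the script's location in the awslabs/amazon-eks-ami repository at the time of writing and should be verified against the repository README:

```shell
# Download the EKS Logs Collector from the amazon-eks-ami repository.
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/main/log-collector-script/linux/eks-log-collector.sh

# Run as root on the worker node; it produces a tarball under /var/log
# that you can attach to a support case.
sudo bash eks-log-collector.sh
```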

Investigation and best practices

Step 1: Check node bootstrapping and user data script logs

We identified the following error that was related to Transport Layer Security (TLS) handshake failures from the cloud-init logs:

http: TLS handshake error from 198.51.100.139:45160: no serving certificate available for the kubelet

We noticed that the hostname was hardcoded in the bootstrap script, which caused issues during the initialization process. This hardcoded hostname might have led to problems with the node setup and the kubelet's certificate signing request (CSR) process, potentially causing delays or failures in registering the node with the Kubernetes cluster.

Although this issue wasn't directly related to the main problem that the customer reported, we followed a comprehensive approach to help them. Our objective was not only to resolve the specific issue at hand, but also to identify and address any potential future issues that we could help proactively prevent.

Other user data script issues included the handling of iptables rules and conflicts between the Docker and Containerd services. The user data script flushed the iptables rules and managed the Docker service, even though Containerd was the default runtime. These actions disrupted networking components, such as kube-proxy, which configures the iptables rules during Kubernetes worker node bootstrapping. This disruption affected Pod communication, routing, and the IP assignment functionality of kube-proxy and the VPC CNI.

Customer node logs indicated that the Amazon EC2 user data script included commands that resulted in errors. The VPC CNI plugin used the custom networking mode to increase the number of IPv4 addresses that are available for the Pods in the Amazon EKS clusters. When you turn on this mode, you must update the use-max-pods and max-pods values for the kubelet. This update prevents scheduling that exceeds the IP address resources that are available to the kubelet. This is because one of the elastic network interface attachments is used for the node and can't share the allocated IPs with Pods.

We noticed that use-max-pods was set to the default value of true and max-pods value was set to the default value of 110 in the customer's Amazon EC2 user data script. These values must be updated based on the instance type that the customer uses when the nodes are in a self-managed node group. Amazon EKS provides a script that you can download and run to determine the recommended maximum number of Pods for each instance type. For more information, see Amazon EKS recommended maximum Pods for each Amazon EC2 instance type.
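Besides the downloadable calculator script, the recommended value follows from the instance type's ENI limits. The sketch below applies the documented formula, using m5.large as an example (3 ENIs with 10 IPv4 addresses each, per the Amazon EC2 documentation):

```shell
# max-pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# Example instance type: m5.large (3 ENIs, 10 IPv4 addresses per ENI).
ENIS=3
IPS_PER_ENI=10

MAX_PODS=$(( ENIS * (IPS_PER_ENI - 1) + 2 ))
echo "max-pods without custom networking: $MAX_PODS"          # 29

# With VPC CNI custom networking, one ENI is reserved for the node and
# can't provide Pod IPs, so one ENI is subtracted:
MAX_PODS_CUSTOM=$(( (ENIS - 1) * (IPS_PER_ENI - 1) + 2 ))
echo "max-pods with custom networking: $MAX_PODS_CUSTOM"      # 20
```

The second value matches what the max-pods calculator script reports when you pass its custom networking flag.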

Best practices for node bootstrapping

It's important to make sure that the bootstrap script includes only the necessary configurations. Unnecessary customizations and hostname overrides might cause complexity and potential issues. Simplify the bootstrap script to reduce the risk of misconfigurations and improve node initialization reliability.

For the kubelet max-pods calculation, the customer used Karpenter as a cluster autoscaler. Therefore, we recommended that they configure the karpenter-global-settings configmap with aws.reservedENIs set to 1. For more information, see Pod density in Karpenter documentation.
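A minimal sketch of that setting, assuming the legacy karpenter-global-settings ConfigMap layout used by the Karpenter versions that supported it (newer Karpenter releases move these settings to Helm values or environment variables, so check your version's documentation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: karpenter-global-settings
  namespace: karpenter
data:
  # Reserve one ENI per node for VPC CNI custom networking so Karpenter's
  # Pod density calculation doesn't overcount the available Pod IPs.
  aws.reservedENIs: "1"
```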

Step 2: Check kubelet logs

Kubernetes has a default quota of five queries per second for pulling container images from registries. However, when many Pods from the same deployment are scheduled on a node at the same time, they try to concurrently pull the same container image. The customer's kubelet logs indicated that this occurred and caused the number of queries per second to exceed the default quota of five. As a result, some Pods failed to start with the "ErrImagePull: pull QPS exceeded" error.

Best practices for kubelet configuration

To resolve the "QPS exceeded" error, we suggested that the customer increase the kubelet configuration's registryPullQPS setting to a higher value, such as 50. Increasing the registry pulls per second limit can prevent issues when multiple Pods concurrently request image pulls.
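As an illustration, on EKS-optimized Amazon Linux AMIs the kubelet reads its configuration from /etc/kubernetes/kubelet/kubelet-config.json. The fragment below shows the relevant fields; the registryBurst value is a suggested companion setting for absorbing short pull spikes, not a value from the original case:

```json
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "registryPullQPS": 50,
  "registryBurst": 100
}
```

After changing the kubelet configuration, restart the kubelet service for the new limits to take effect.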

Step 3: Check Amazon EC2 API call throttling

We checked for potential throttling at the Amazon EC2 API level that might be caused by the simultaneous launch of numerous Pods. The Amazon VPC CNI calls APIs, such as AssignPrivateIpAddresses and CreateNetworkInterface, to set up Pod networking, and can potentially cause throttling.

AWS CloudTrail events and VPC CNI logs confirmed throttling of API calls during Pod launches. The CloudTrail events showed ThrottlingException errors, and the VPC CNI logs reported throttling during the same time frame.

Best practices to mitigate Amazon EC2 API call throttling

To mitigate Amazon EC2 API call throttling, we created a support case to request an increase in the rate quotas for the throttled APIs. For more information, see How do I increase the service quota of my Amazon EC2 resources?

To learn strategies for managing API throttling and to get guidance on creating dashboards for API metrics that are available in Amazon CloudWatch, see Managing and monitoring API throttling in your workloads.

Step 4: Check Amazon VPC CNI logs

Further log analysis revealed an issue with the Amazon VPC CNI's communication with the Kubernetes API server. During initialization, the L-IPAM daemon (IPAMD) requires access to the Kubernetes API server through the ClusterIP service. However, IPAMD relies on kube-proxy to establish this connectivity through iptables rules and routing.

The issue occurred because kube-proxy was still initializing the iptables rules when IPAMD tried to connect to the ClusterIP service. Because the iptables rules weren't yet set up on the instance, requests between IPAMD and the API server timed out. We noticed the following error in the logs:

Unable to reach API Server (...) net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We also noticed the following error in the logs:

Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused

This error indicated that IPAMD wasn't reachable by other components that tried to communicate with it over gRPC port 50051.

Best practices for the Amazon VPC CNI add-on

The cluster had the self-managed version of the Amazon VPC CNI add-on installed. To resolve the communication issues, we suggested that the customer configure the CLUSTER_ENDPOINT environment variable. This parameter specifies the cluster endpoint that is used to connect to the API server without relying on kube-proxy. Specifying this optional parameter might improve the initialization time of the Amazon VPC CNI. This environment variable is set by default for the managed Amazon VPC CNI add-on version.
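For the self-managed add-on, one way to set this variable is to patch the aws-node DaemonSet; the endpoint URL below is a placeholder for your cluster's actual API server endpoint:

```shell
# Point IPAMD directly at the cluster API endpoint so VPC CNI startup
# doesn't depend on kube-proxy having programmed the ClusterIP rules yet.
# Look up your endpoint with:
#   aws eks describe-cluster --name <cluster-name> --query cluster.endpoint
kubectl -n kube-system set env daemonset aws-node \
  CLUSTER_ENDPOINT=https://<your-cluster-endpoint>
```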

We also suggested that the customer configure another Amazon VPC CNI parameter IP_COOLDOWN_PERIOD. This parameter manages the release of IP addresses that are associated with Pods in Kubernetes clusters. Reducing the cooldown period releases IP addresses faster, particularly in scenarios where you frequently create and stop Kubernetes jobs. However, it's crucial to balance promptly releasing IP addresses and maintaining cluster stability.

We also recommended configuring the WARM_ENI_TARGET parameter to allocate sufficient IP addresses on worker node elastic network interfaces. This configuration helps reduce the delay in obtaining IP addresses from the Amazon EC2 APIs when you launch new Pods.
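Both tunables can be applied the same way. The values below are illustrative starting points rather than the customer's actual settings, and any change should be validated against the IP capacity of your subnets:

```shell
# IP_COOLDOWN_PERIOD: seconds a freed Pod IP waits before reuse
# (shorter values recycle IPs faster for churny Jobs).
# WARM_ENI_TARGET: keep one full warm ENI of IPs per node to absorb
# scale-up bursts without waiting on EC2 API calls.
kubectl -n kube-system set env daemonset aws-node \
  IP_COOLDOWN_PERIOD=10 \
  WARM_ENI_TARGET=1
```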

Also, we suggested that the customer upgrade the Kubernetes components and add-ons to versions that are aligned with the supported Kubernetes version. These upgrades make sure that the cluster benefits from the latest features, bug fixes, security updates, and performance improvements that AWS provides.

Step 5: Check Amazon EBS volume throttling

On further investigation, we also identified issues that were caused by Amazon EBS volume throttling during the Pod startup phase. The logs bundle's io_throttling.txt file indicated IO throttling on the Amazon EBS root volumes that might cause delays and disruptions during Pod startup. This throttling affected the Amazon VPC CNI Pods within the Kubernetes cluster. The CNI binary that sets up the Pod network for communication runs on the node root file system. The kubelet invokes the binary when you add a new Pod or remove an existing Pod from the node.

The IO Use Volume Percent metric represents the average percentage of IO operations per second (IOPS) that an Amazon EBS volume uses compared to the maximum IOPS that's provisioned for the volume. The following graph shows Amazon EBS IO usage reaching 99% during node boot. The spike in the graph correlates with the Amazon VPC CNI starting and the Pods failing with the IPAMD "connection refused" error.

[Graph: IO Use Volume Percent on the node's Amazon EBS root volume peaking at 99% during node boot]

Best practices to mitigate Amazon EBS throttling

We suggested the best practice of distributing IO across multiple EBS volumes on each worker node, for example by attaching a dedicated volume for the container state folder. Distributing IO operations across volumes reduces the likelihood of hitting an individual volume's throughput or IOPS limits and minimizes the risk of throttling.
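One way to implement this, sketched as a user data fragment; the device name and filesystem are assumptions that depend on your instance type and AMI:

```shell
# Format and mount a dedicated EBS data volume for the containerd state
# directory, so image pulls and container writable-layer IO don't compete
# with the root volume. /dev/nvme1n1 is an assumed device name for a
# second EBS volume on a Nitro-based instance; verify with lsblk first.
mkfs -t xfs /dev/nvme1n1
mkdir -p /var/lib/containerd
mount /dev/nvme1n1 /var/lib/containerd
echo '/dev/nvme1n1 /var/lib/containerd xfs defaults 0 2' >> /etc/fstab
```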

Lessons learned

Managing a Kubernetes cluster is challenging, but following best practices and keeping your cluster and add-ons up to date can significantly simplify the process. Adhering to the official documentation and recommended guidelines might help with a smooth and secure operation, minimizing potential issues and vulnerabilities. Keeping add-ons and worker nodes updated to the latest supported Kubernetes version maintains stability, security, and access to the latest features and bug fixes.

API throttling is a concern in large-scale deployments, requiring proactive monitoring and management to prevent performance bottlenecks. Closely tracking API usage patterns is crucial. In some cases, you might consider a multi-account strategy to isolate workloads and mitigate the effect of excessive API requests on critical components.

Finally, collaborating with AWS Support can be helpful, especially when you encounter complex challenges or need expert guidance. Our AWS Support team's expertise can help you accelerate problem-solving, gain best practice insights, and enhance the overall efficiency and reliability of your environment.

Conclusion

This article explored a customer journey through issues with Kubernetes networking, and shared solutions that AWS Support implemented to resolve these issues. It showcased the complexity of managing large-scale Kubernetes deployments and the importance of using expert guidance and best practices. The collaborative efforts of customers and the AWS Support team resulted in the successful resolution of these issues. We identified the root causes, implemented effective mitigations, and shared valuable lessons learned.

AWS Support engineers and Technical Account Managers (TAMs) can help you with general guidance, best practices, troubleshooting, and operational support on AWS. To learn more about our plans and offerings, see AWS Support.

About the authors


Henrique Santana Henrique Santana is a Containers Specialist with over 15 years of experience in infrastructure operations. He is skilled in automating workflows and solving problems through user-centered design and emerging technologies. He is currently focusing on containers and container orchestration. Henrique is adept at optimizing resource utilization for high availability and implementing CI/CD pipelines. He is interested in opportunities to further promote container adoption.


Ahmed Derbel Ahmed Derbel is an Amazon EKS subject matter expert (SME) at AWS. He uses his experience in container technologies to help Enterprise customers with technical issues, particularly in production Kubernetes clusters. His expertise shapes his approach to resolving customer challenges effectively.

2 Comments

Hi,

I was wondering how long it took from ticket open to issue completely resolved. I am assuming this was for a customer with Enterprise Support? Did you have pair programming sessions, or did you work separately? How did you interact: was it through the support ticket, a call, or a video meeting?

Thanks S

answered a year ago

Hello,

Thank you for your comment. The time required to resolve a support case varies based on the specific issue encountered. Application and service developers face a wide range of problems, making it challenging to predict resolution times accurately. However, we can make sure that our team will work closely with you to address your concern as promptly as possible. In this particular instance, the case involved a customer with an Enterprise Support plan.

For effective communication, we recommend utilizing the real-time interaction capabilities with our Cloud Support Engineers. They provide 24x7 global availability through phone, web, and chat for Enterprise Support customers. In this specific case, we conducted live meetings scheduled according to the customer's availability to enable seamless collaboration. Business, Enterprise On-Ramp, and Enterprise Support plans offer 24x7 phone, web, and chat access to Cloud Support Engineers. Meanwhile, the Developer support plan provides Business hours web access to Cloud Support Associates.

For more information on AWS Support Plans and frequently asked questions about AWS Support, please refer to the following resources:

AWS Support Plans: https://aws.amazon.com/premiumsupport/plans/

AWS Support FAQs: https://aws.amazon.com/premiumsupport/faqs/

These links provide comprehensive details and answers to common queries regarding AWS Support Plans and services.

AWS
EXPERT
answered a year ago