How can I troubleshoot etcdadm controller issues in EKS Anywhere?


I want to check the etcdadm pod logs for clues to troubleshoot etcd bootstrap failures.

Short description

If your Amazon Elastic Kubernetes Service (Amazon EKS) Anywhere nodes experience bootstrap failure, then check the logs for etcdadm pods. Also, check the logs for etcd pods that you created for a stacked etcd cluster.

To further debug, log in to any of the etcd virtual machines (VMs). If you didn't create an etcd VM before the bootstrap failure, then you can't use the VM to debug.

Resolution

Check the logs in etcdadm pods

Note: In the following commands, replace SUFFIX with your bootstrap provider's suffix. Replace KUBECONFIG_FILE with your kubeconfig file location.

After you create your cluster, you can use it with the generated KUBECONFIG file in your local directory:

export KUBECONFIG=${PWD}/${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig

To check the logs for etcdadm-bootstrap-provider-controller, run the following command:

kubectl -n etcdadm-bootstrap-provider-system logs etcdadm-bootstrap-provider-controller-SUFFIX \
    --kubeconfig=KUBECONFIG_FILE

To check the logs for etcdadm-controller-controller-manager, run the following command:

kubectl -n etcdadm-controller-system logs etcdadm-controller-controller-manager-SUFFIX --kubeconfig=KUBECONFIG_FILE
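
If you don't know the pod suffix, one option is to list the pods in a namespace and print recent logs for each pod that is found. The following is a minimal sketch and not part of the official procedure. The function name dump_namespace_logs and the 50-line tail are illustrative assumptions.

```shell
# Hypothetical helper: print recent logs for every pod in a namespace.
# Usage: dump_namespace_logs NAMESPACE KUBECONFIG_FILE
# Example: dump_namespace_logs etcdadm-controller-system KUBECONFIG_FILE
dump_namespace_logs() {
  ns="$1"
  kubeconfig="$2"
  # "get pods -o name" prints one "pod/NAME" entry per line.
  for pod in $(kubectl -n "$ns" get pods -o name --kubeconfig="$kubeconfig"); do
    echo "==> $ns/$pod"
    # --tail=50 limits the output to the last 50 log lines of each pod.
    kubectl -n "$ns" logs "$pod" --tail=50 --kubeconfig="$kubeconfig"
  done
}
```

Call the function once for etcdadm-bootstrap-provider-system and once for etcdadm-controller-system to cover both controllers.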

Check the logs for a stacked etcd

If you created pods for a stacked etcd cluster, then run the following command to check those pods' logs:

kubectl logs -n kube-system etcd-$CONTROL_PLANE_VM_IP --kubeconfig=KUBECONFIG_FILE

Note: Replace CONTROL_PLANE_VM_IP with the IP address of your control plane VM. Replace KUBECONFIG_FILE with your kubeconfig file location.
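
On kubeadm-based control planes, the static etcd pods usually carry the label component=etcd. Assuming that label is present in your cluster, a sketch like the following lists the stacked etcd pods so that you can confirm the exact pod name before you request its logs. The function name list_stacked_etcd_pods is hypothetical.

```shell
# Hypothetical helper: list stacked etcd pods in kube-system by label.
# Assumes the pods carry the kubeadm label component=etcd.
# Usage: list_stacked_etcd_pods KUBECONFIG_FILE
list_stacked_etcd_pods() {
  kubeconfig="$1"
  # -o wide adds the node name and pod IP to the listing.
  kubectl -n kube-system get pods -l component=etcd -o wide --kubeconfig="$kubeconfig"
}
```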

Check VM logs

After you use the kubectl client to check for logs, you can also check for logs on the VMs:

1.    Log in to the control plane VM:

ssh -i $PRIV_KEY ec2-user@$CONTROL_PLANE_VM_IP

Note: Replace PRIV_KEY with your private key. Replace CONTROL_PLANE_VM_IP with the IP address of your control plane's VM.

2.    For a Bottlerocket operating system (OS): To get a root shell on a node that runs Bottlerocket, run the sudo sheltie command.

3.    After you log in to the control plane VM, you can check additional related container logs at the following location:

cd /var/log/containers

Check the logs for an unstacked or external etcd

For an unstacked or external etcd, log in to any etcd VM that you created.

1.    To log in to your etcd VM, run the following command:

ssh -i $PRIV_KEY ec2-user@$ETCD_VM_IP

Note: Replace PRIV_KEY with your private key. Replace ETCD_VM_IP with the IP address of your VM.

2.    For a Bottlerocket operating system (OS): To get a root shell on a node that runs Bottlerocket, run the sudo sheltie command.

3.    After you log in to an etcd VM, you can check additional related container logs at the following location:

cd /var/log/containers  
bash-5.1# ls -lrt  
total 4  
lrwxrwxrwx. 1 root root 90 Apr 11 16:38 etcd-mgmt-etcd-4mt4g_kube-system_etcd-aa91be45434b920903e0630254cb5f319b86b50c56b87d8f992b0285272b93c4.log -> /var/log/pods/kube-system_etcd-mgmt-etcd-4mt4g_88b6dbc1be367b4ffc5a6bfab65e7376/etcd/0.log
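
To narrow down a large log file, you can filter the etcd container logs for lines that look like errors. The following is a rough sketch. The function name scan_etcd_logs, the search pattern, and the 20-line limit are assumptions for illustration, and reading the files typically requires a root shell.

```shell
# Hypothetical helper: show the most recent error-like lines from etcd container logs.
# Usage: scan_etcd_logs [LOG_DIR]   (LOG_DIR defaults to /var/log/containers)
scan_etcd_logs() {
  log_dir="${1:-/var/log/containers}"
  # -h omits file names, -i ignores case, -E enables the alternation pattern.
  grep -hiE 'error|fail|panic' "$log_dir"/etcd-*.log | tail -n 20
}
```

Run scan_etcd_logs without arguments on the VM, or pass a different directory if you copied the logs elsewhere.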

Run a health check

To check the health of the etcd endpoint, run the following commands on the etcd VM:

ETCD_CONTAINER_ID=$(ctr -n k8s.io c ls | grep -w "etcd-io" | cut -d " " -f1)  
ctr -n k8s.io t exec -t --exec-id etcd ${ETCD_CONTAINER_ID} etcdctl \  
     --cacert=/var/lib/etcd/pki/ca.crt \  
     --cert=/var/lib/etcd/pki/server.crt \  
     --key=/var/lib/etcd/pki/server.key \  
     endpoint health  

Example output:

127.0.0.1:2379 is healthy: successfully committed proposal: took = 10.241212ms
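
If you run the health check from a script, you might want a pass or fail result instead of reading the output by eye. The following sketch assumes that each healthy endpoint is reported on a line that contains "is healthy", as in the preceding output. The function name all_endpoints_healthy is hypothetical.

```shell
# Hypothetical helper: read "etcdctl endpoint health" output on stdin and
# return 0 only if there is at least one line and every non-empty line
# contains "is healthy".
all_endpoints_healthy() {
  awk 'NF { total++; if ($0 ~ /is healthy/) healthy++ }
       END { exit !(total > 0 && healthy == total) }'
}
```

Pipe the output of the health check command into all_endpoints_healthy and test its exit status.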

Example error messages

When you check your logs, you might see various error messages relating to your bootstrap failure. The following examples are some of the most common errors:

"Waiting for External ETCD to be ready."

To troubleshoot this error, refer to the troubleshooting documentation for Amazon EKS Anywhere.

"Kubelet of ETCD VMs Crashing."

Cluster provisioning doesn't continue after the creation of your ETCD VM. This issue also occurs with the following kubelet error message:

"Failed to start ContainerManager" err="invalid Node Allocatable configuration. Resource \"ephemeral-storage\" has an allocatable of {{1175812097 0} {<nil>} BinarySI}, capacity of {{-155109377 0} {<nil>} BinarySI}"

This indicates that the ephemeral storage on the node doesn't have sufficient free space. In each Kubernetes node, kubelet's root directory (/var/lib/kubelet by default) and log directory (/var/log) are on the root partition of the node.

To display the free disk space on each mounted file system, run the following command:

df -h

To display the disk usage of the directories under /var/lib, up to three levels deep, run the following command:

sudo du -d 3 /var/lib/

If you don't have enough free disk space, then clean up your node to free up space. Or, expand the storage capacity of your node's root partition.
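
To check the root partition quickly, you can compare its usage against a threshold. The following is a minimal sketch. The function name fs_usage_pct and the 85 percent threshold are illustrative assumptions, not EKS Anywhere recommendations.

```shell
# Hypothetical helper: print the used percentage of the file system that holds a path.
# Usage: fs_usage_pct [PATH]   (PATH defaults to /)
fs_usage_pct() {
  # -P forces the POSIX single-line format; column 5 is the capacity, such as "42%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# THRESHOLD is an illustrative value only.
THRESHOLD=85
usage=$(fs_usage_pct /)
if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "Root file system is ${usage}% full; free up space or expand the partition."
else
  echo "Root file system is ${usage}% full."
fi
```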

Related information

Install etcd (etcd website)

How to check cluster status (etcd website)

AWS OFFICIAL
Updated a year ago