We have been running our site with no problems for a while.
On July 12, with no changes made to any AWS configuration, our site went down.
We were in the process of deploying an update, something that had never been a problem before.
In CircleCI, the Kubernetes deployment step failed with this error:
Waiting for deployment rollout to finish: 1 old replicas are pending termination...
error: deployment exceeded its progress deadline
Looking for problems in EKS, we discovered that every pod in kube-system was stuck at 0/1 ready:
kubectl get pods -n kube-system
NAME                                                  READY   STATUS                   RESTARTS   AGE
aws-node-xxxxx                                        0/1     Init:ImageInspectError   0          3d
coredns-xxxxx-drb7p                                   0/1     Pending                  0          3d
coredns-xxxxx-s58zq                                   0/1     Pending                  0          3d
kube-proxyxxxxx                                       0/1     ImageInspectError        0          3d
spotinst-kubernetes-cluster-controller-xxxxxx-h2mn5   0/1     Pending                  0          3d
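For anyone who wants to repeat this check without reading the table by eye, here is a minimal Python sketch that lists every pod whose READY count is short. It is only an illustration: the function name `unready_pods` and the `SAMPLE` constant are made up for this post, and the sample text is simply the output pasted above. In practice the text would come from `kubectl get pods -n kube-system`.

```python
from typing import List, Tuple


def unready_pods(table: str) -> List[Tuple[str, str]]:
    """Return (name, status) for each pod whose READY column is not n/n.

    Only the first three columns (NAME, READY, STATUS) are read, so
    extra whitespace-separated tokens later in a row do not matter.
    """
    rows = [line.split() for line in table.strip().splitlines()]
    header, body = rows[0], rows[1:]
    name_i = header.index("NAME")
    ready_i = header.index("READY")
    status_i = header.index("STATUS")

    result = []
    for row in body:
        ready, total = row[ready_i].split("/")
        if ready != total:
            result.append((row[name_i], row[status_i]))
    return result


# Sample input: the kubectl output from this post, copied as-is.
SAMPLE = """\
NAME READY STATUS RESTARTS AGE
aws-node-xxxxx 0/1 Init:ImageInspectError 0 3d
coredns-xxxxx-drb7p 0/1 Pending 0 3d
coredns-xxxxx-s58zq 0/1 Pending 0 3d
kube-proxyxxxxx 0/1 ImageInspectError 0 3d
spotinst-kubernetes-cluster-controller-xxxxxx-h2mn5 0/1 Pending 0 3d
"""

if __name__ == "__main__":
    for name, status in unready_pods(SAMPLE):
        print(f"{name}: {status}")
```

Run against our output, this reports all five pods, which matches what we saw: nothing in kube-system is ready.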
Looking at the pod shows the following event:
Warning  FailedScheduling  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Indeed, the node shows up as NotReady, and its condition indicates:
Ready: False
runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
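The same condition can be read from the node objects directly. Below is a rough Python sketch, assuming the JSON shape returned by `kubectl get nodes -o json` (an `items` list where each node carries `status.conditions` entries with `type`, `status`, and `message`). The function name `not_ready_nodes` and the node name `example-node` are hypothetical; the message is the one quoted above.

```python
from typing import Dict


def not_ready_nodes(node_list: dict) -> Dict[str, str]:
    """Map node name -> Ready-condition message for nodes that are not Ready.

    A node is treated as not Ready when its condition of type "Ready"
    has any status other than "True" (that is, "False" or "Unknown").
    """
    out = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") != "True":
                out[name] = cond.get("message", "")
    return out


# Hypothetical sample: one node whose Ready condition carries the
# message from this post. The node name is a placeholder.
SAMPLE_NODES = {
    "items": [
        {
            "metadata": {"name": "example-node"},
            "status": {
                "conditions": [
                    {
                        "type": "Ready",
                        "status": "False",
                        "message": (
                            "runtime network not ready: NetworkReady=false "
                            "reason:NetworkPluginNotReady message:docker: "
                            "network plugin is not ready: cni config uninitialized"
                        ),
                    }
                ]
            },
        }
    ]
}

if __name__ == "__main__":
    for name, message in not_ready_nodes(SAMPLE_NODES).items():
        print(f"{name}: {message}")
```

With real data, the dictionary would be loaded with `json.loads` from the command output rather than written by hand.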
Looking at the EC2 instances shows that they are running OK: Instance state: Running
We found a few articles describing similar problems, but following their suggestions did not provide a solution. The most puzzling piece is that we have changed nothing in AWS. Everything was working OK last week, but now our site is down and has been down for 3 days.
We tried to get AWS Business support, but they need to approve us first, and after 3 days we are at a loss.