
How do I troubleshoot DNS in Amazon EKS Auto Mode?

6 minute read
Content level: Intermediate

This comprehensive guide addresses DNS troubleshooting in Amazon EKS Auto Mode, a managed Kubernetes service configuration where AWS handles core components including DNS management. The article is particularly relevant for DevOps engineers and Kubernetes administrators who need to diagnose DNS issues in environments where traditional debugging methods (like SSH access) aren't available due to enhanced security measures in Auto Mode-launched EC2 instances.

Short description

Amazon EKS Auto Mode integrates core Kubernetes capabilities like DNS as built-in components that would otherwise have to be managed as add-ons. For Auto Mode-launched EC2 instances, AWS handles the complete lifecycle of nodes by leveraging Amazon EC2 managed instances. In EKS Auto Mode with Amazon EC2 managed instances, security is enhanced because direct host access over SSH or SSM isn't available. Instead, when you troubleshoot DNS issues, you can deploy debugging containers to access and analyze node logs. This maintains security while enabling effective diagnostics.

Resolution

Verify that DNS resolution works in a Pod

  1. To run commands inside your application Pods, run the following command:
kubectl exec -it your-pod-name -- sh

Note: Replace your-pod-name with your Pod name. Make sure that the Pod's image has an available shell binary.

  2. To verify that the kube-dns service's cluster IP address is in your Pod's /etc/resolv.conf file, run the following command in the Pod shell:
cat /etc/resolv.conf

The following example resolv.conf file shows a pod that's configured to point to 172.20.0.10 for DNS requests:

search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
nameserver 172.20.0.10
options ndots:5
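
The nameserver value should match the cluster IP address of the kube-dns service in the kube-system namespace (172.20.0.10 in this example). To confirm the value, you can run the following command:
kubectl get svc kube-dns -n kube-system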
  3. To verify that your Pod can use the nameserver value to resolve an internal domain, run the following command in the Pod shell:
nslookup kubernetes.default 172.20.0.10

Example output:

Server:         172.20.0.10
Address:        172.20.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 172.20.0.1
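
The short name kubernetes.default resolves because the search domains from /etc/resolv.conf are appended during the lookup. The equivalent query with the fully qualified name returns the same result:
nslookup kubernetes.default.svc.cluster.local 172.20.0.10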
  4. To verify that your Pod can use the nameserver value to resolve an internal service by its fully qualified domain name (FQDN), run the following command in the Pod shell:
# dig +short tester.default.svc.cluster.local
172.20.100.90

Replace tester.default.svc.cluster.local with any service in your cluster, where tester is the service name and default is the namespace.

You can also use nslookup to verify the same record:

# nslookup tester.default.svc.cluster.local 172.20.0.10
Server:         172.20.0.10
Address:        172.20.0.10#53

Name:   tester.default.svc.cluster.local
Address: 172.20.100.90
  5. To verify that your Pod can use the nameserver value to resolve a public domain, run the following command in the Pod shell:
nslookup amazon.com 172.20.0.10

Example output:

dns-test-pod:~# nslookup amazon.com 172.20.0.10
Server:         172.20.0.10
Address:        172.20.0.10#53

Name:   amazon.com
Address: 54.239.28.85
Name:   amazon.com
Address: 205.251.242.103
Name:   amazon.com
Address: 52.94.236.248

Additionally, you can use the telnet tool to verify outbound connectivity after the name resolves:

telnet google.com 80
Trying 64.233.180.102...
Connected to google.com.
Escape character is '^]'.
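
You can also use telnet to confirm that the Pod can reach the cluster DNS endpoint, because CoreDNS serves DNS over TCP port 53 in addition to UDP:
telnet 172.20.0.10 53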

Check CoreDNS logs from Auto Mode-launched EC2 nodes with debug containers and the kubectl CLI

If your nodes are short-lived or are being terminated because of Spot interruptions, you can reproduce the DNS queries on other available nodes in the cluster to troubleshoot.

  1. (Optional) Create a sample Pod that generates DNS queries. See the following example:
apiVersion: v1
kind: Pod
metadata:
  name: dns-test-pod
  labels:
    app: dns-test
spec:
  nodeName: <NODE_NAME>
  containers:
  - name: dns-tester
    image: curlimages/curl:latest
    command: ["/bin/sh"]
    args:
      - -c
      - |
        while true; do
          echo "Testing DNS resolution..."
          curl -Is --connect-timeout 5 http://google.com || echo "google.com failed"
          sleep 2
          curl -Is --connect-timeout 5 http://amazon.com || echo "amazon.com failed"
          sleep 2
          curl -Is --connect-timeout 5 http://example || echo "example failed"
          sleep 2
          curl -Is --connect-timeout 5 http://api || echo "api failed"
          sleep 10
        done
  restartPolicy: Always

Replace <NODE_NAME> in the sample manifest with the name of the node that you want to test.
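
For example, assuming that you save the manifest as dns-test-pod.yaml (a file name used here for illustration), you can list the nodes, apply the manifest, and then watch the Pod's output:
kubectl get nodes
kubectl apply -f dns-test-pod.yaml
kubectl logs -f dns-test-pod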

  2. Launch a debug container. The following command uses <NODE_NAME> for the instance ID of the node, -it allocates a TTY and attaches stdin for interactive usage, and --profile=sysadmin applies the sysadmin debugging profile so that the debug container runs with elevated privileges. Replace <NODE_NAME> with the node where the sample Pod is running:
kubectl debug node/<NODE_NAME> -it --profile=sysadmin --image=public.ecr.aws/amazonlinux/amazonlinux:2023

An example output is as follows.

Creating debugging pod node-debugger-i-0285feeceecfa12af-mxxqp with container debugger on node i-0285feeceecfa12af.
If you don't see a command prompt, try pressing enter.
bash-5.2# 
  3. From the shell, install util-linux-core, which provides the nsenter command. Use nsenter to enter the mount namespace of PID 1 (init) on the host, and then run the journalctl command to stream logs from the coredns service:
yum install -y util-linux-core
nsenter -t 1 -m journalctl -f -u coredns

An example output is as follows.

Oct 02 10:11:39 ip-10-0-17-209.ec2.internal coredns[1644]: [INFO] 10.0.16.19:43366 - 43927 "AAAA IN amazon.com. udp 39 false 1232" NOERROR qr,rd,ra 125 0.000687584s
Oct 02 10:11:41 ip-10-0-17-209.ec2.internal coredns[1644]: [INFO] 10.0.16.19:56306 - 47907 "AAAA IN example.default.svc.cluster.local. udp 74 false 1232" NXDOMAIN qr,aa,rd 144 0.000162575s
Oct 02 10:11:41 ip-10-0-17-209.ec2.internal coredns[1644]: [INFO] 10.0.16.19:56306 - 54143 "A IN example. udp 36 false 1232" SERVFAIL qr,aa,rd,ra 25 0.000310053s

The sample output shows:

  • NOERROR: The AAAA query for amazon.com succeeded.
  • NXDOMAIN: The service doesn't exist in the default namespace, so the AAAA query for example.default.svc.cluster.local failed.
  • SERVFAIL: The server failed to resolve the plain hostname example.
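
To narrow the log stream to failed queries, you can filter the journal output. For example, the following command (a sketch that assumes the failures occurred within the last 10 minutes) shows only NXDOMAIN and SERVFAIL responses:
nsenter -t 1 -m journalctl -u coredns --since "10 minutes ago" | grep -E "NXDOMAIN|SERVFAIL"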
  4. You can troubleshoot NXDOMAIN and SERVFAIL responses further by verifying whether the service exists in your Kubernetes cluster. Use the following commands:
kubectl get svc -A | grep example
kubectl get endpoints -A | grep example

Replace example with the name of the service that returns the NXDOMAIN or SERVFAIL message in the CoreDNS logs.

  5. Verify network policies to confirm that no policy blocks DNS traffic from your Pods. If an egress policy applies, it must allow DNS traffic, as shown in the sketch after these commands:
kubectl get networkpolicies -A
kubectl describe networkpolicy -A
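
The following is a minimal sketch of a network policy that permits DNS queries over UDP and TCP port 53. The policy name and the podSelector label are illustrative and must match your own workload:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress   # illustrative name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: dns-test        # illustrative label; match your own Pods
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53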

Verify that DNS resolution works on an Auto Mode EC2 managed instance

  1. To check the DNS (Domain Name System) configuration in your node's /etc/resolv.conf file, run the following command in the node debugger shell:
nsenter -t 1 -m cat /etc/resolv.conf

The following example resolv.conf file shows that DNSSEC validation is enabled (trust-ad), the ec2.internal domain is in the search path, and the node uses systemd-resolved (127.0.0.53) as its local DNS resolver:

...
nameserver 127.0.0.53
options edns0 trust-ad
search ec2.internal
  2. To verify that your node can use the nameserver value to resolve a public domain, run the following commands in the node debugger shell. The first command installs bind-utils, which provides the nslookup command:
yum install -y bind-utils 
nsenter -t 1 -n nslookup amazon.com 127.0.0.53

Example output:

Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
Name:   amazon.com
Address: 54.239.28.85
Name:   amazon.com
Address: 52.94.236.248
Name:   amazon.com
Address: 205.251.242.103
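
If the node's local resolver fails, you can also query the Amazon-provided VPC DNS resolver directly at its link-local address (169.254.169.253) to determine whether the issue is local to systemd-resolved on the node:
nsenter -t 1 -n nslookup amazon.com 169.254.169.253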

Related resources

Use a Kubernetes NodeDiagnostic resource to retrieve node logs by using the Node monitoring agent. For more steps, see Retrieve node logs for a managed node using kubectl and S3.
