How can I retrieve Amazon EKS control plane logs from CloudWatch Logs?

I want to troubleshoot an Amazon Elastic Kubernetes Service (Amazon EKS) issue. I need to collect logs from the components that run on the EKS control plane.

Short description

To view the logs in Amazon CloudWatch Logs, you must turn on Amazon EKS control plane logging. For more information, see Viewing cluster control plane logs.

You can use CloudWatch Logs Insights to search through the EKS control plane log data. For more information, see Analyzing log data with CloudWatch Logs Insights.

Important: You can view log events in CloudWatch Logs only after you turn on control plane logging in a cluster. Before you select a time range to run queries in CloudWatch Logs Insights, verify that you turned on control plane logging.
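
If you prefer to turn on control plane logging from a script instead of the console, the following Python sketch shows one possible approach. It assumes the boto3 SDK and calls the EKS UpdateClusterConfig API. The helper names, the cluster name my-cluster, and the Region are placeholders for illustration and aren't part of this article.

```python
from __future__ import annotations

from typing import Any

# The five control plane log types that Amazon EKS can send to CloudWatch Logs.
ALL_LOG_TYPES = ["api", "audit", "authenticator", "controllerManager", "scheduler"]


def build_logging_config(types: list[str], enabled: bool = True) -> dict[str, Any]:
    """Build the `logging` argument for the EKS UpdateClusterConfig API."""
    unknown = sorted(set(types) - set(ALL_LOG_TYPES))
    if unknown:
        raise ValueError(f"unknown log types: {unknown}")
    return {"clusterLogging": [{"types": list(types), "enabled": enabled}]}


def enable_control_plane_logging(eks_client: Any, cluster_name: str) -> dict[str, Any]:
    """Request that all control plane log types be turned on for one cluster."""
    return eks_client.update_cluster_config(
        name=cluster_name,
        logging=build_logging_config(ALL_LOG_TYPES),
    )


def main() -> None:
    # boto3 is a third-party SDK. The caller needs AWS credentials that allow
    # eks:UpdateClusterConfig. "my-cluster" and "us-east-1" are placeholders.
    import boto3

    client = boto3.client("eks", region_name="us-east-1")
    response = enable_control_plane_logging(client, "my-cluster")
    print(response["update"]["status"])
```

Call main() to submit the update. The update runs asynchronously, so log events appear in CloudWatch Logs only after it completes.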

Resolution

Search CloudWatch Logs Insights

Complete the following steps:

  1. Open the CloudWatch console.
  2. In the navigation pane, choose Logs, and then choose Logs Insights.
  3. In the Select log group(s) menu, select the cluster log group that you want to query.
  4. Choose Run to view the results.

Note: To export the results as a .csv file or to copy the results to the clipboard, choose Export results. You can change the sample query to get data for a specific use case. See these example queries for common EKS use cases.
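
The console steps can also be approximated from a script. The following Python sketch assumes the boto3 SDK and uses the CloudWatch Logs StartQuery and GetQueryResults APIs. The function names, the cluster name my-cluster, and the one-hour time range are illustrative assumptions. EKS control plane logs are delivered to the log group /aws/eks/<cluster-name>/cluster.

```python
from __future__ import annotations

import time
from typing import Any

# A sample Logs Insights query that searches authenticator logs for "denied".
DENIED_QUERY = """fields @logStream, @timestamp, @message
| filter @logStream like /^authenticator/
| filter @message like "denied"
| sort @timestamp desc
| limit 50"""


def log_group_for(cluster_name: str) -> str:
    """Return the log group that receives EKS control plane logs."""
    return f"/aws/eks/{cluster_name}/cluster"


def run_insights_query(logs_client: Any, log_group: str, query: str,
                       start: int, end: int,
                       poll_seconds: float = 1.0,
                       max_polls: int = 60) -> list[dict[str, str]]:
    """Start a query, poll until it finishes, and return one dict per row.

    `start` and `end` are epoch timestamps in seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")


def main() -> None:
    # boto3 is a third-party SDK. The caller needs logs:StartQuery and
    # logs:GetQueryResults permissions. "my-cluster" is a placeholder.
    import boto3

    logs = boto3.client("logs")
    end = int(time.time())
    rows = run_insights_query(logs, log_group_for("my-cluster"),
                              DENIED_QUERY, end - 3600, end)
    for row in rows:
        print(row.get("@timestamp"), row.get("@logStream"))
```

Passing the client as an argument keeps the polling logic independent of AWS, so you can exercise it with a stub before you run it against a real cluster.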

Sample queries for common EKS use cases

To find mutating changes made to the aws-auth ConfigMap, run a query similar to the following example:

fields @logStream, @timestamp, @message
| filter @logStream like /^kube-apiserver-audit/
| filter requestURI like /\/api\/v1\/namespaces\/kube-system\/configmaps/
| filter objectRef.name = "aws-auth"
| filter verb like /(create|delete|patch)/
| sort @timestamp desc
| limit 50

Example output:

@logStream,@timestamp,@message
kube-apiserver-audit-f01c77ed8078a670a2eb63af6f127163,2021-10-27 05:43:01.850,{""kind"":""Event"",""apiVersion"":""audit.k8s.io/v1"",""level"":""RequestResponse"",""auditID"":""8f9a5a16-f115-4bb8-912f-ee2b1d737ff1"",""stage"":""ResponseComplete"",""requestURI"":""/api/v1/namespaces/kube-system/configmaps/aws-auth?timeout=19s"",""verb"":""patch"",""responseStatus"": {""metadata"": {},""code"": 200 },""requestObject"": {""data"": { contents of aws-auth ConfigMap } },""requestReceivedTimestamp"":""2021-10-27T05:43:01.033516Z"",""stageTimestamp"":""2021-10-27T05:43:01.042364Z"" }

To find requests that were denied, run a query for messages that contain "denied", similar to the following example:

fields @logStream, @timestamp, @message
| filter @logStream like /^authenticator/
| filter @message like "denied"
| sort @timestamp desc
| limit 50

Example output:

@logStream,@timestamp,@message
authenticator-8c0c570ea5676c62c44d98da6189a02b,2021-08-08 20:04:46.282,"time=""2021-08-08T20:04:44Z"" level=warning msg=""access denied"" client=""127.0.0.1:52856"" error=""sts getCallerIdentity failed: error from AWS (expected 200, got 403)"" method=POST path=/authenticate"
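
The authenticator message in the preceding output is a series of key=value pairs. To work with the individual fields after you export the results, you can split the message with a few lines of Python. The following is a minimal sketch that assumes the message was already unescaped from the .csv export (each pair of double quotation marks collapsed to one). The function name is illustrative.

```python
import shlex


def parse_authenticator_message(message: str) -> dict:
    """Split a key=value log message into a dict, honoring quoted values."""
    fields = {}
    for token in shlex.split(message):
        key, separator, value = token.partition("=")
        if separator:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


sample = (
    'time="2021-08-08T20:04:44Z" level=warning msg="access denied" '
    'client="127.0.0.1:52856" '
    'error="sts getCallerIdentity failed: error from AWS (expected 200, got 403)" '
    'method=POST path=/authenticate'
)

parsed = parse_authenticator_message(sample)
print(parsed["msg"], "|", parsed["error"])
```

In this sample, the error field shows that the AWS Security Token Service (AWS STS) call returned HTTP 403, which points to a problem with the caller's credentials rather than with Kubernetes RBAC.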

To find the node that a pod was scheduled on, run a query similar to the following example:

fields  @timestamp, @message
| filter @logStream like /kube-scheduler/
| filter @message like "<Pod Name>"
| filter @message like "ip-"
| sort @timestamp asc
| limit 3

Example output:

@timestamp,@message
kube-scheduler-bb3ea89d63fd2b9735ba06b144377db6,2021-08-15 12:19:43.000,"I0915 12:19:43.933124       1 scheduler.go:604] ""Successfully bound pod to node"" pod=""kube-system/aws-6799fc88d8-jqc2r"" node=""ip-192-168-66-187.eu-west-1.compute.internal"" evaluatedNodes=3 feasibleNodes=2"
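
To extract the pod and node names from scheduler messages like the one in the preceding output, you can apply a regular expression to the exported messages. The following Python sketch assumes the "Successfully bound pod to node" message format shown in the example. The format can differ between Kubernetes versions, so treat the pattern as a starting point.

```python
import re

# Matches the structured scheduler message shown in the example output.
BOUND_PATTERN = re.compile(
    r'"Successfully bound pod to node" pod="(?P<pod>[^"]+)" node="(?P<node>[^"]+)"'
)


def bound_pod_and_node(message: str):
    """Return (pod, node) if the message records a binding, else None."""
    match = BOUND_PATTERN.search(message)
    if match is None:
        return None
    return match.group("pod"), match.group("node")


sample = (
    'I0915 12:19:43.933124       1 scheduler.go:604] '
    '"Successfully bound pod to node" pod="kube-system/aws-6799fc88d8-jqc2r" '
    'node="ip-192-168-66-187.eu-west-1.compute.internal" '
    'evaluatedNodes=3 feasibleNodes=2'
)

result = bound_pod_and_node(sample)
print(result)
```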

To find HTTP 5xx server errors for Kubernetes API server requests, run a query similar to the following example:

fields @logStream, @timestamp, responseStatus.code, @message
| filter @logStream like /^kube-apiserver-audit/
| filter responseStatus.code >= 500
| limit 50

Example output:

@logStream,@timestamp,responseStatus.code,@message
kube-apiserver-audit-4d5145b53c40d10c276ad08fa36d1f11,2021-08-04 07:22:06.518,503,"...""requestURI"":""/apis/metrics.k8s.io/v1beta1?timeout=32s"",""verb"":""get"",""user"":{""username"":""system:serviceaccount:kube-system:resourcequota-controller"",""uid"":""36d9c3dd-f1fd-4cae-9266-900d64d6a754"",""groups"":[""system:serviceaccounts"",""system:serviceaccounts:kube-system"",""system:authenticated""]},""sourceIPs"":[""12.34.56.78""],""userAgent"":""kube-controller-manager/v1.21.2 (linux/amd64) kubernetes/d2965f0/system:serviceaccount:kube-system:resourcequota-controller"",""responseStatus"":{""metadata"":{},""code"":503},..."}}"

To troubleshoot a CronJob object activation, run a query for API calls that the cronjob-controller made:

fields @logStream, @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter user.username like "system:serviceaccount:kube-system:cronjob-controller"
| display @logStream, @timestamp, @message, objectRef.namespace, objectRef.name
| sort @timestamp desc
| limit 50

Example output:

{ "kind": "Event", "apiVersion": "audit.k8s.io/v1", "objectRef": { "resource": "cronjobs", "namespace": "default", "name": "hello", "apiGroup": "batch", "apiVersion": "v1" }, "responseObject": { "kind": "CronJob", "apiVersion": "batch/v1", "spec": { "schedule": "*/1 * * * *" }, "status": { "lastScheduleTime": "2021-08-09T07:19:00Z" } } }

To find API calls that the replicaset-controller made, run a query similar to the following:

fields @logStream, @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter user.username like "system:serviceaccount:kube-system:replicaset-controller"
| display @logStream, @timestamp, requestURI, verb, user.username
| sort @timestamp desc
| limit 50

Example output:

@logStream,@timestamp,requestURI,verb,user.username
kube-apiserver-audit-8c0c570ea5676c62c44d98da6189a02b,2021-08-10 17:13:53.281,/api/v1/namespaces/kube-system/pods,create,system:serviceaccount:kube-system:replicaset-controller
kube-apiserver-audit-4d5145b53c40d10c276ad08fa36d1f11,2021-08-04 07:18:44.561,/apis/apps/v1/namespaces/kube-system/replicasets/coredns-6496b6c8b9/status,update,system:serviceaccount:kube-system:replicaset-controller

To retrieve a count of HTTP response codes for calls made to the Kubernetes API server, run a query similar to the following example:

fields @logStream, @timestamp, @message
| filter @logStream like /^kube-apiserver-audit/
| stats count(*) as count by responseStatus.code
| sort count desc

Example output:

responseStatus.code,count
200,35066
201,525
403,125
404,116
101,2
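
To see how the counts divide between successful responses and errors, you can group the exported .csv rows by status class. The following Python sketch uses only the standard library and the sample numbers from the preceding output. The function name is illustrative.

```python
import csv
import io
from collections import Counter


def counts_by_status_class(csv_text: str) -> dict:
    """Sum request counts per status class (1xx, 2xx, 4xx, and so on)."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = row["responseStatus.code"]
        if not code:
            continue  # audit events without a response code are skipped
        totals[f"{code[0]}xx"] += int(row["count"])
    return dict(totals)


sample = "\n".join([
    "responseStatus.code,count",
    "200,35066",
    "201,525",
    "403,125",
    "404,116",
    "101,2",
])

summary = counts_by_status_class(sample)
print(summary)
```

For the sample data, the 2xx class totals 35,591 requests and the 4xx class totals 241.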

To find changes that are made to DaemonSets or add-ons in the kube-system namespace, run a query similar to the following example:

filter @logStream like /^kube-apiserver-audit/
| fields @logStream, @timestamp, @message
| filter verb like /(create|update|patch|delete)/ and strcontains(requestURI,"/apis/apps/v1/namespaces/kube-system/daemonsets")
| sort @timestamp desc
| limit 50

Example output:

{ "kind": "Event", "apiVersion": "audit.k8s.io/v1", "level": "RequestResponse", "auditID": "93e24148-0aa6-4166-8086-a689b0031612", "stage": "ResponseComplete", "requestURI": "/apis/apps/v1/namespaces/kube-system/daemonsets/aws-node?fieldManager=kubectl-set", "verb": "patch", "user": { "username": "kubernetes-admin", "groups": [ "system:masters", "system:authenticated" ] }, "userAgent": "kubectl/v1.22.2 (darwin/amd64) kubernetes/8b5a191", "objectRef": { "resource": "daemonsets", "namespace": "kube-system", "name": "aws-node", "apiGroup": "apps", "apiVersion": "v1" }, "requestObject": { "REDACTED": "REDACTED" }, "requestReceivedTimestamp": "2021-08-09T08:07:21.868376Z", "stageTimestamp": "2021-08-09T08:07:21.883489Z", "annotations": { "authorization.k8s.io/decision": "allow", "authorization.k8s.io/reason": "" } }

Note: In the preceding example output, the kubernetes-admin user used kubectl v1.22.2 to patch the aws-node DaemonSet.
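
To read the same details from many audit events without scanning the JSON by eye, you can parse each @message value. The following Python sketch uses the standard json module on a trimmed copy of the preceding event. The function name and the choice of summary fields are illustrative assumptions.

```python
import json


def summarize_audit_event(message: str) -> dict:
    """Pick out who did what to which object from a Kubernetes audit event."""
    event = json.loads(message)
    object_ref = event.get("objectRef", {})
    annotations = event.get("annotations", {})
    return {
        "user": event.get("user", {}).get("username"),
        "verb": event.get("verb"),
        "resource": object_ref.get("resource"),
        "namespace": object_ref.get("namespace"),
        "name": object_ref.get("name"),
        # The first token of the user agent is the client and its version.
        "client": event.get("userAgent", "").split(" ")[0],
        "decision": annotations.get("authorization.k8s.io/decision"),
    }


# Trimmed copy of the example audit event.
sample = json.dumps({
    "kind": "Event",
    "verb": "patch",
    "user": {"username": "kubernetes-admin"},
    "userAgent": "kubectl/v1.22.2 (darwin/amd64) kubernetes/8b5a191",
    "objectRef": {"resource": "daemonsets", "namespace": "kube-system",
                  "name": "aws-node"},
    "annotations": {"authorization.k8s.io/decision": "allow"},
})

summary = summarize_audit_event(sample)
print(summary)
```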

To find all patch, update, create, and delete calls related to a specific deployment and deployment pods, run a query similar to the following example:

fields @timestamp,verb, objectRef.name, objectRef.resource, requestObject.message
| filter objectRef.name like /<Deployment Name>/
| filter objectRef.resource not like /serviceaccounts/
# To exclude events from the results, uncomment the following line
# | filter objectRef.resource not like /events/
| filter verb like /create|delete|patch|update/
| sort @timestamp asc

To find the user that deleted a node, run a query similar to the following example:

fields @logStream, @timestamp, @message
| filter @logStream like /^kube-apiserver-audit/
| filter verb == "delete" and requestURI like "/api/v1/nodes"
| sort @timestamp desc
| limit 10

Example output:

@logStream,@timestamp,@message
kube-apiserver-audit-e503271cd443efdbd2050ae8ca0794eb,2022-03-25 07:26:55.661,"{"kind":"Event","verb":"delete","user":{"username":"kubernetes-admin","groups":["system:masters","system:authenticated"],"arn":["arn:aws:iam::1234567890:user/awscli"],"canonicalArn":["arn:aws:iam::1234567890:user/awscli"],"sessionName":[""]}},"sourceIPs":["1.2.3.4"],"userAgent":"kubectl/v1.21.5 (darwin/amd64) kubernetes/c285e78","objectRef":{"resource":"nodes","name":"ip-192-168-37-22.eu-west-1.compute.internal","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestObject":{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Background"},"responseObject":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Success","details":{"name":"ip-192-168-37-22.eu-west-1.compute.internal","kind":"nodes","uid":"518ba070-154e-4400-883a-77a44a075bd0"}},"requestReceivedTimestamp":"2022-03-25T07:26:55.355378Z",}}"

To find the user that deleted a resource, such as a ConfigMap, pod, or deployment, run a query similar to the following example:

fields @timestamp,verb, user.username, user.extra.arn.0, user.extra.canonicalArn.0 
| filter  objectRef.name like /aws-auth/
# If you want to find delete call for a pod, replace aws-auth with the pod name
| filter verb like /delete/
| sort @timestamp asc

To find the image version of a deployment, run a query similar to the following example:

fields @timestamp, verb, objectRef.name,  objectRef.resource
| filter objectRef.name like /<deployment name>/
| filter @message like /image/
| filter objectRef.resource  like /deployments/
# The parse pattern in the following line might vary for your deployment
| parse requestObject.spec.template.spec 'image":*,' as image
| sort @timestamp asc
| limit 10000

To get the events for a specific node, run a query similar to one of the following examples.

Example 1:

fields @timestamp, @message, @logStream
| sort @timestamp asc
| filter @message like "node <node name> hasn't been updated for"

Example 2:

fields @timestamp
| parse responseObject.status.conditions.0 "lastTransitionTime*" as MemoryPressure
| parse responseObject.status.conditions.1 "lastTransitionTime*" as DiskPressure
| parse responseObject.status.conditions.2 "lastTransitionTime*" as PIDPressure
| parse responseObject.status.conditions.3 "lastTransitionTime*" as ReadyStatus
| parse responseObject.status.conditions.3 "lastTransitionTime*" as Timepass
| filter objectRef.name like /<Node name>/
| filter verb like /patch/
| filter @message like /lastTransitionTime/
| sort @timestamp asc

To find the user that cordoned specific nodes (marked them as unschedulable), run a query similar to the following example:

fields @timestamp, objectRef.name as node_name, verb,user.username, user.extra.sessionName.0 as name, requestObject.spec.unschedulable as unschedulable_flag
| filter @logStream like /kube-apiserver-audit/
| filter @message like /<NODE IP>/
| filter verb like /patch/
| filter requestObject.spec.unschedulable like /1/

To find the podIP of a deleted pod, run a query similar to the following example:

fields @timestamp,objectRef.name as pod, requestObject.status.podIP as podIP
| filter @logStream like /kube-apiserver-audit/
| filter objectRef.name = "<pod name>"
| filter verb like /patch/
| filter ispresent(requestObject.status.podIP)
| sort @timestamp asc

To find the describe output of an object when a pod is deleted and you don't have the name of the previous pod, run a query similar to the following example:

fields @timestamp, requestURI, requestObject.message
| filter requestURI like '/api/v1/namespaces/<namespace name>/events' 
| filter  responseObject.involvedObject.name like /<Object Name>/
| sort @timestamp asc

To check whether an eviction API appears in your audit logs, run a query similar to the following example:

filter @logStream like /kube-apiserver-audit/
| fields @timestamp, user.username,user.extra.canonicalArn.0, responseStatus.code, responseObject.status, responseStatus.message
| sort @timestamp asc
| filter verb == "create" and objectRef.subresource == 'eviction'

If AWS Fargate OS patching deleted your pods or nodes, then the eviction API appears in the audit logs. To view the audit logs and then find this event, run a query similar to the following example:

fields @logStream, @timestamp, @message 
| sort @timestamp asc 
| filter user.username == "eks:node-manager" and requestURI like "eviction" and requestURI like "pod"

To find the Task ID of the Fargate pod, run a query similar to the following example:

fields @timestamp, verb, responseObject.spec.providerID as InstanceID
| filter @message like /<Fargate Node IP>/
| filter ispresent(responseObject.spec.providerID)

To find the URIs that received the most 4xx or 5xx errors, run a query similar to the following example:

filter @logStream like /kube-apiserver-audit/
| filter responseStatus.code >= 400
| stats count(*) as count by requestURI, responseStatus.code
| filter count > 12
| sort count desc

To see whether there are any issues with webhooks, run a query similar to the following example:

fields @timestamp, @message
| filter @logStream like /kube-apiserver/ and @logStream not like /kube-apiserver-audit/
| filter @message like /failed calling webhook/
| sort @timestamp desc
| stats count(*) by bin(1m)

To list the API Server health checks that failed, run a query similar to the following example:

fields @message
| sort @timestamp asc
| filter @logStream like "kube-apiserver"
| filter @logStream not like "kube-apiserver-audit"
| filter @message like "healthz check failed"

To count requests by Kubernetes resource and userAgent, run a query similar to the following example:

filter @logStream like "kube-apiserver-audit"
| stats count(*) as count by objectRef.resource, userAgent
| sort count desc
| display objectRef.resource, userAgent, count

To view the most frequent logs, run a query similar to the following example:

fields @timestamp, @message, @logStream
| filter @logStream not like /kube-apiserver-audit/
| parse @message "*] *" as loggingTimeStamp, loggingMessage
| stats count(*) as count by loggingMessage 
| sort count desc

AWS OFFICIAL, updated 6 days ago