Proactive Monitoring and Alerting for EKS AWS API Calls Using CloudWatch

7 minute read
Content level: Intermediate

This article guides users through setting up an alerting system with CloudTrail, Athena, and CloudWatch to monitor and manage EKS API calls. The goal is to ensure smooth and reliable operations in EKS environments by anticipating potential API throttling issues.

Running your containers on Amazon EKS, a managed Kubernetes service, has clear benefits. AWS takes care of the undifferentiated heavy lifting of managing the control plane, and you can easily integrate with existing AWS services such as EFS, S3, ALB, or RDS. This integration works by granting EKS permission to call the AWS API on your behalf. Kubernetes applications can be chatty in their API usage. Because AWS APIs are protected by a throttling mechanism, calls are rejected once a quota is reached, and the Kubernetes operation that depends on them fails. Such scenarios are rare, but they can arise when a customer plans a large-scale event (LSE) such as a sale or the release of a new product.

In this article, I will show you how to set up an alerting system in your cloud environment using CloudTrail, Athena, and CloudWatch. With this setup, you can monitor and react when you are close to reaching service quota limits. Kubernetes makes many calls to the AWS API, but other services in your environment may be calling the same APIs, so you first need to identify which calls come from EKS. To do this, you will capture the CloudTrail events in your account, send them to a datastore (in this case, an S3 bucket connected to Athena), and then use SQL queries in the Athena console to filter for the API calls made by EKS. The steps are listed below:

CloudTrail Events Storage and Query:

  1. In the CloudTrail console, choose Event history from the sidebar menu to open the Event history page. On this page, select Create Athena table.

Enter image description here

  2. In the dialog that appears, select the storage location for the events. You can choose an existing S3 bucket. If there is no existing bucket, create a new trail, which lets you create a new S3 bucket that you can then select when creating the Athena table (see the CloudTrail documentation for the steps).
  3. After the table is created, open the Athena console at console.aws.amazon.com/athena and choose Query editor in the left-hand menu. Run the SQL query below, replacing the table name in the FROM clause with your own. The table name is listed under the Tables section of the Query editor.
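-- Drop the last 13 characters of the role name (the generated suffix)
-- so that calls are grouped by the base role name.
SELECT REVERSE(
        SUBSTRING(
            REVERSE(
                useridentity.sessioncontext.sessionissuer.username
            ),
            14
        )
    ) AS role_name,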
    eventsource,
    eventname,
    eventtype,
    awsregion,
    COUNT(eventname) AS eventname_count
FROM cloudtrail_logs_aws_cloudtrail_logs_111111111111_7a32ei43
WHERE eventtype = 'AwsApiCall'
    AND awsregion = 'us-east-1'
    AND (
        useridentity.sessioncontext.sessionissuer.username LIKE 'eksctl-playground-cluster-nodegrou-NodeInstanceRole-%'
        OR useridentity.sessioncontext.sessionissuer.username LIKE 'eksctl-playground-cluster-cluster-ServiceRole-%'
        OR useridentity.sessioncontext.sessionissuer.username LIKE 'eksctl-playground-cluster-addon-vpc-cni-Role1%'
    )
GROUP BY REVERSE(
        SUBSTRING(
            REVERSE(
                useridentity.sessioncontext.sessionissuer.username
            ),
            14
        )
    ),
    eventsource,
    eventname,
    eventtype,
    awsregion
ORDER BY eventname_count DESC
  4. The query should produce output similar to the image below: a list of all AWS API calls made by EKS in a particular region, sorted with the most frequently called APIs first. A few things to note about the query: because Amazon EKS uses IAM roles to call the AWS API on your behalf, the query retrieves the API calls made by these roles. They include, but are not limited to, the roles used by IRSA, EKS Pod Identities, and EKS add-ons. To find them, visit the IAM console at console.aws.amazon.com/iam/ and search the list of roles for names such as NodeInstanceRole, EKSClusterRole, and vpc-cni-Role. Depending on how the cluster was created, the roles might have different names, especially with IaC tools like eksctl. The LIKE patterns in the query must match these role names to capture all of the API calls.

Enter image description here
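Because the role names differ from cluster to cluster, it can help to generate the query text rather than edit it by hand. The following is a minimal Python sketch under stated assumptions: the helper names (`strip_role_suffix`, `sql_literal`, `build_eks_api_query`) and the cluster name `my-cluster` are hypothetical, and the table name is the placeholder used in the query above. It only builds the SQL string; running it in Athena is left to the console steps already described.

```python
import re

ROLE_COLUMN = "useridentity.sessioncontext.sessionissuer.username"
SUFFIX_LEN = 13  # characters dropped by REVERSE(SUBSTRING(REVERSE(...), 14))


def strip_role_suffix(role_name: str) -> str:
    """Python equivalent of the REVERSE/SUBSTRING/REVERSE expression."""
    return role_name[:-SUFFIX_LEN]


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_eks_api_query(table: str, region: str, role_patterns: list[str]) -> str:
    """Build the per-role API call count query for one table and region."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not role_patterns:
        raise ValueError("at least one role name pattern is required")
    role_expr = f"REVERSE(SUBSTRING(REVERSE({ROLE_COLUMN}), {SUFFIX_LEN + 1}))"
    role_filter = "\n        OR ".join(
        f"{ROLE_COLUMN} LIKE {sql_literal(p)}" for p in role_patterns
    )
    return (
        f"SELECT {role_expr} AS role_name,\n"
        "    eventsource,\n    eventname,\n    eventtype,\n    awsregion,\n"
        "    COUNT(eventname) AS eventname_count\n"
        f"FROM {table}\n"
        "WHERE eventtype = 'AwsApiCall'\n"
        f"    AND awsregion = {sql_literal(region)}\n"
        f"    AND (\n        {role_filter}\n    )\n"
        f"GROUP BY {role_expr},\n"
        "    eventsource,\n    eventname,\n    eventtype,\n    awsregion\n"
        "ORDER BY eventname_count DESC"
    )


# Hypothetical cluster name "my-cluster"; replace with your own role prefixes.
query = build_eks_api_query(
    table="cloudtrail_logs_aws_cloudtrail_logs_111111111111_7a32ei43",
    region="us-east-1",
    role_patterns=[
        "eksctl-my-cluster-nodegrou-NodeInstanceRole-%",
        "eksctl-my-cluster-cluster-ServiceRole-%",
    ],
)
```

Printing `query` gives text you can paste into the Athena Query editor. The `strip_role_suffix` helper is included only to show what the SQL expression does to a role name: it removes the final 13 characters, so roles that differ only in their generated suffix are counted together.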

Following the steps above, I obtained a list of AWS API calls to monitor and alert on (you can choose to focus on the APIs with the highest call counts). Next, use CloudWatch to graph the metrics for the API calls of interest and set an alarm threshold on each graph.

Setting up CloudWatch Graph and Alerting:

  1. Visit the CloudWatch console at console.aws.amazon.com/cloudwatch/ and choose Metrics > All metrics in the left-hand menu. Choose Usage, then select By AWS Resource. In the search box above the list of metrics, search for the AWS API call of interest.

Enter image description here

  2. Since the list is sorted by API call count, you can pick the top 10 and concentrate on graphing those in CloudWatch. For brevity, I will focus on DescribeAutoScalingGroups, DescribeInstances, and DescribeSecurityGroups. Select these APIs by name to add them to a graph in the CloudWatch console.
  3. After graphing these API calls, the result should be similar to the image below. You can also create a custom dashboard for the graphs; for guidance, see the CloudWatch documentation on creating a dashboard.

Enter image description here

  4. Finally, to set an alarm for each API call shown above, click the bell icon beside each API under the Actions column. This takes you to the page shown in the figure below, where you set the threshold and datapoints for the alarm. Setting the threshold requires the throttling limits published in each service's documentation (NB: your limits might differ if you have requested a limit increase). The throttling limits for the EC2 API calls above are in this guide: https://docs.aws.amazon.com/ec2/latest/devguide/ec2-api-throttling.html.

     When choosing the threshold, keep in mind that throttling uses the token bucket algorithm (see https://docs.aws.amazon.com/ec2/latest/devguide/ec2-api-throttling.html#throttling-how for more details). The shortest alarm period CloudWatch supports is 10 seconds. Take the worst case with DescribeInstances as an example: with a bucket capacity of 50 and a refill rate of 10 tokens per second, a client sending a steady 50 requests per second empties the bucket in the first second and can then use only the 10 refilled tokens in each of the remaining 9 seconds, so at most 140 requests are allowed in 10 seconds. Simply put, Max allowed requests = Bucket max capacity + ((CloudWatch period in seconds - 1) * Bucket refill rate) = 50 + (9 * 10) = 140.

     You can set the alarm threshold at or near this figure, depending on preference, with 1 out of 3 datapoints breaching before the alarm goes into the ALARM state. This alarm helps catch possible spikes. The figure below shows how to set the alarms. Another option for setting thresholds is CloudWatch Anomaly Detection, which sets an expected range based on historical trends.

Enter image description here
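The threshold arithmetic can be checked with a short Python sketch. It computes the formula, confirms it against a second-by-second token bucket simulation, and assembles the parameters that could be passed to the CloudWatch `put_metric_alarm` API call. Several things here are assumptions rather than part of the original procedure: the function names and the alarm name are hypothetical, the `AWS/Usage` namespace with the `CallCount` metric and its dimensions is assumed to be the metric graphed in the steps above, and whether a 10-second period is accepted for a given metric depends on that metric's resolution, which this sketch does not verify.

```python
def max_allowed_requests(bucket_capacity: int, refill_rate: int, period_seconds: int) -> int:
    """Bucket max capacity + ((period - 1) * refill rate), as in the formula above."""
    return bucket_capacity + (period_seconds - 1) * refill_rate


def simulate_token_bucket(bucket_capacity: int, refill_rate: int,
                          demand_per_second: int, period_seconds: int) -> int:
    """Count requests admitted over the period, starting from a full bucket."""
    tokens = bucket_capacity
    allowed_total = 0
    for _ in range(period_seconds):
        allowed = min(tokens, demand_per_second)  # requests beyond this are throttled
        allowed_total += allowed
        # Refill once per second, never exceeding the bucket capacity.
        tokens = min(bucket_capacity, tokens - allowed + refill_rate)
    return allowed_total


def usage_alarm_params(api_name: str, threshold: int, period_seconds: int,
                       service: str = "EC2") -> dict:
    """Keyword arguments for put_metric_alarm (metric identity is an assumption)."""
    return {
        "AlarmName": f"eks-api-{api_name}-call-count",  # hypothetical naming scheme
        "Namespace": "AWS/Usage",
        "MetricName": "CallCount",
        "Dimensions": [
            {"Name": "Type", "Value": "API"},
            {"Name": "Resource", "Value": api_name},
            {"Name": "Service", "Value": service},
            {"Name": "Class", "Value": "None"},
        ],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 3,   # look at the last 3 datapoints ...
        "DatapointsToAlarm": 1,   # ... and alarm if any 1 of them breaches
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


# Worked example from the text: capacity 50, refill 10 per second, 10-second period.
threshold = max_allowed_requests(50, 10, 10)
simulated = simulate_token_bucket(50, 10, demand_per_second=50, period_seconds=10)
params = usage_alarm_params("DescribeInstances", threshold, period_seconds=10)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. If you pick a longer period, recompute the threshold with the same period so that the alarm and the formula stay consistent.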

After completing the procedure above, you will have a solution that helps mitigate AWS API call failures on your EKS cluster and tells you when to request a quota increase from AWS. Note that some API calls are indirectly tied to AWS service resource usage, so setting up AWS Service Quotas alarms in CloudWatch is also beneficial. To do this, follow the guide Visualizing your service quotas and setting alarms.

By utilising CloudTrail, Athena, and CloudWatch, you can effectively monitor and manage AWS API calls in Amazon EKS. This will help boost the performance and reliability of your cloud infrastructure.