Troubleshooting at scale with AWS

13 minute read
Content level: Advanced

This article provides approaches that can help you with your journey to operational excellence and faster troubleshooting.

Introduction

As a senior principal engineer in AWS Support, I often work with customers to troubleshoot distributed systems that have grown over time. Some problems begin with significant ambiguity, but as I dig deeper, I learn more about how to help customers enhance their operational excellence. Transient issues that occur only at scale are particularly hard to troubleshoot. In this article, I share key learnings from troubleshooting at scale with AWS.

Detect and classify issues

You can't act on a problem until you detect it, so consistent monitoring is necessary in production systems. It's critical to know about problems before your end customers are impacted and report them. Be sure to set up metric coverage for known failure modes in Amazon CloudWatch or a monitoring tool of your choice. Try to classify common issues that have a clear signal and, if possible, automate their resolution. That way, you can prioritize your focus on issues that require complex decision-making. For example, Amazon DevOps Guru uses machine learning to analyze operational data and application metrics and identify behaviors that deviate from normal operating patterns. When DevOps Guru detects an operational issue or risk, it notifies users so that they can initiate a manual or automated action. AWS Health sends you notifications about AWS service health or scheduled changes so that you can prepare for events and quickly troubleshoot issues that might impact your workloads. When you troubleshoot recurring transient network drops, for example, DevOps Guru provides anomaly detection through metrics analysis, and AWS Health helps you rule out an AWS-side event. While these services can't provide a root cause, they give you the information you need for your next step: setting up detailed network monitoring.
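
As a minimal sketch of metric coverage for a known failure mode, the following example creates a CloudWatch alarm on a hypothetical custom 5XX error-rate metric with boto3. The namespace, metric name, and SNS topic ARN are placeholder assumptions that you would replace with your own.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a known failure mode: an elevated 5XX error rate.
# Namespace, metric name, and the SNS topic ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-5xx-error-rate",
    Namespace="MyApp",
    MetricName="5xxErrorRate",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    DatapointsToAlarm=3,
    Threshold=1.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],
    AlarmDescription="5XX error rate above 1% for 3 of the last 5 minutes",
)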

You can use chaos engineering methodologies to discover edge-case failure modes that aren't yet known. AWS Fault Injection Service (FIS) allows you to inject failures to test your applications. It's a best practice to run game days on your applications to simulate real-world event conditions. This exercise can help you make sure that detection is in place and your system is ready with a response. Effective canaries that exercise your systems with realistic traffic can also discover issues before your customers do. You can use Amazon CloudWatch Synthetics to create canaries as configurable scripts that run on a schedule to monitor endpoints and APIs.
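
The following is a minimal canary-style probe, sketched outside the CloudWatch Synthetics runtime: it calls a placeholder endpoint the way a client would and publishes success and latency as custom metrics that you can alarm on. The endpoint URL and metric namespace are assumptions; you could run a script like this on a schedule or implement the same idea as a Synthetics canary.

import time
import urllib.request

import boto3

cloudwatch = boto3.client("cloudwatch")
ENDPOINT = "https://api.example.com/health"  # placeholder endpoint


def probe() -> None:
    # Exercise the endpoint like a real client and record success and latency.
    start = time.time()
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=5) as response:
            success = 1 if response.status == 200 else 0
    except Exception:
        success = 0
    latency_ms = (time.time() - start) * 1000

    cloudwatch.put_metric_data(
        Namespace="Canary",  # hypothetical namespace
        MetricData=[
            {"MetricName": "Success", "Value": success, "Unit": "Count"},
            {"MetricName": "LatencyMs", "Value": latency_ms, "Unit": "Milliseconds"},
        ],
    )


if __name__ == "__main__":
    probe()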

It's important to log failure signals and debug information. For example, AWS X-Ray can analyze the behavior of distributed applications by providing request tracing, exception collection, and profiling capabilities. Set up X-Ray before issues start to occur so that you can use its distributed tracing capabilities effectively; make this instrumentation part of your environment from the start. Documenting resource identifiers and the specific timestamps of the issue is also critical. For detailed guidance on logging and instrumentation, see Instrumenting distributed systems for operational visibility.
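
As a sketch of instrumenting ahead of time, the following example uses the AWS X-Ray SDK for Python to trace an S3 call and record the request ID as an annotation. The service name, bucket, and key are placeholders, and it assumes the X-Ray daemon is available to receive trace segments.

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch supported libraries (boto3, requests, and others) so that downstream
# AWS calls show up as subsegments in the trace.
xray_recorder.configure(service="orders-worker")  # hypothetical service name
patch_all()


def fetch_order(bucket: str, key: str) -> bytes:
    s3 = boto3.client("s3")
    with xray_recorder.in_subsegment("fetch-order") as subsegment:
        response = s3.get_object(Bucket=bucket, Key=key)
        # Keep the request ID with the trace for later debugging.
        subsegment.put_annotation("s3_request_id", response["ResponseMetadata"]["RequestId"])
        return response["Body"].read()


if __name__ == "__main__":
    # Outside Lambda or an instrumented web framework, open a segment explicitly.
    with xray_recorder.in_segment("manual-run"):
        fetch_order("amzn-s3-demo-bucket", "orders/12345.json")  # placeholder names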

As explained in Building dashboards for operational visibility, dashboards are the human-facing views into our systems that provide concise summaries of system behavior with time series metrics, logs, traces, and alarms. Setting up useful dashboards and regularly reviewing them (for example, weekly) can help you discover the right alarm thresholds and find missing signals that need work. The key takeaway is that designing with operations in mind is important so that issues can be detected, their root causes identified, and remediation applied quickly, as covered in the AWS Well-Architected operational excellence pillar.
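
Dashboards can also be created and versioned as code. The following sketch publishes a single-widget CloudWatch dashboard with boto3; the dashboard name, region, and metric names are placeholder assumptions.

import json

import boto3

cloudwatch = boto3.client("cloudwatch")

# One time series widget; the namespace and metric names are placeholders.
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0,
            "y": 0,
            "width": 12,
            "height": 6,
            "properties": {
                "title": "Orders API errors",
                "region": "us-east-1",
                "stat": "Sum",
                "period": 60,
                "metrics": [
                    ["MyApp", "5xxErrorCount"],
                    ["MyApp", "RequestCount"],
                ],
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="orders-api-operations",
    DashboardBody=json.dumps(dashboard_body),
)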

Understand the problem by asking the right questions

To solve problems quickly, it's important to narrow down the scope and identify the specific situations where the issue manifests. Asking clarifying questions can aid in dealing with ambiguity when the root cause isn't clear. I tend to use the initial symptoms to derive the questions that I need to ask. I review the components that make up the services and understand their dependencies. Then, I check AWS re:Post to look for similar issues or solutions. It's important to be a healthy skeptic and check whether any changes in the environment caused the issue. Watch out for imprecise answers. For example, you might waste time if you incorrectly assume that a service is down when you get an error rate that's within a normal range. Use an imprecise answer as a good indicator to look deeper. For example, consider an application that sends network packets over a Container Network Interface (CNI) running on Amazon Elastic Kubernetes Service (Amazon EKS) and that reports network packet timeouts. Ask questions, such as the following:

  • Are only a few packets timing out, or are all packets to this destination timing out?
  • Is this issue occurring only with a specific pod on a worker node, or does it affect all pods that run the same type of software?

For example, you might waste time broadly troubleshooting network performance when the issue is actually a connectivity problem on a specific path. Asking the right questions helps you isolate the issue to specific components. For example, if all packets are timing out, then checking the network configuration can be a faster path to resolution than using performance tools. When the right people who can ask or answer these questions aren't engaged, escalate the issue to get the right resources and solve the problem quickly. Building a timeline of events greatly helps with understanding the problem. Documenting the timeline helps others better understand the circumstances, and it can help explain what could be an effect rather than a cause. We employ such mechanisms widely at Amazon.

Automate data collection

I was working on an issue related to elevated 5XX HTTP errors when getting objects from multiple Amazon Simple Storage Service (Amazon S3) buckets across accounts. To investigate deeper, I needed to analyze the specific requests that received the HTTP 5XX responses and their Amazon S3 request IDs. The S3 buckets didn't have S3 server access logging turned on. An AWS Systems Manager Automation document helped me configure logging for these buckets and for new buckets that I created. The main steps in the automation document are as simple as the following code snippet:

"mainSteps": [
  {
    "name": "PutBucketLoggingByUri",
    "isCritical": false,
    "action": "aws:executeAwsApi",
    "onFailure": "step:PutBucketLoggingById",
    "nextStep": "End",
    "inputs": {
      "Service": "s3",
      "Api": "PutBucketLogging",
      "Bucket": "{{BucketName}}",
      "BucketLoggingStatus": {
        "LoggingEnabled": {
          "TargetBucket": "{{TargetBucket}}",
          "TargetPrefix": "{{TargetPrefix}}",
          "TargetGrants": [
            {
              "Grantee": {
                "Type": "{{GranteeType}}",
                "URI": "{{GranteeUri}}"
              },
              "Permission": "{{GrantedPermission}}"

The automation document is illustrated in the following diagram:

[Diagram: flow of the Systems Manager Automation document that enables S3 server access logging]
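
To run an Automation document like this one at scale, you can start executions programmatically. The following sketch starts the document with boto3; the document name and bucket names are hypothetical, and the remaining grantee-related parameters are assumed to have defaults in the document.

import boto3

ssm = boto3.client("ssm")

# Start the automation for one source bucket; names are placeholders.
response = ssm.start_automation_execution(
    DocumentName="EnableS3ServerAccessLogging",  # hypothetical document name
    Parameters={
        "BucketName": ["amzn-s3-demo-source-bucket"],
        "TargetBucket": ["amzn-s3-demo-logging-bucket"],
        "TargetPrefix": ["access-logs/"],
    },
)
print(response["AutomationExecutionId"])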

You can also use AWS Systems Manager Automation documents to extract diagnostics from your Amazon Elastic Compute Cloud (Amazon EC2) instances across accounts. For detailed network monitoring, Amazon Virtual Private Cloud (Amazon VPC) Flow Logs and Elastic Load Balancing (ELB) access logs are useful for isolating the issue. To troubleshoot transient network errors in distributed networks, use the AWSSupport-SetupIPMonitoringFromVPC runbook to automate network diagnostics, such as traceroutes and VPC Flow Logs, at scale.
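
If flow logs aren't already enabled, you can turn them on programmatically before deeper analysis. This sketch publishes flow logs for a placeholder VPC to a placeholder S3 bucket so that Athena can query them later.

import boto3

ec2 = boto3.client("ec2")

# Publish flow logs for a VPC to an S3 bucket for later analysis with Athena.
# The VPC ID and the bucket ARN are placeholders.
ec2.create_flow_logs(
    ResourceIds=["vpc-0abc123def4567890"],
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::amzn-s3-demo-bucket/flow-logs/",
    MaxAggregationInterval=60,
)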

After you set up logging, you can use Amazon Athena to analyze the flow logs with SQL queries. To do so, first create an Athena table for Amazon VPC Flow Logs.

For example, run the following query to list all rejected TCP connections and use the newly created date partition column date to extract the day of the week when these events occurred:

SELECT day_of_week(date) AS day,
  date,
  interface_id,
  srcaddr,
  action,
  protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;

The output of the query looks like the following. It provides information on where and when the TCP connections were rejected:

[Query output: rejected TCP connections showing the day of week, date, interface ID, source address, action, and protocol]

After you identify the interface and location where the packets are dropping, you can mitigate the issue by moving away from unhealthy resources. You can also find the exact message exchanges that were captured in the VPC Flow Logs, with the source and destination details along with timestamps. I've found that conquering complexity requires simplification, which is why this approach of automated data collection and analysis works at scale. You can save time by analyzing large datasets with tools that are designed for the job, such as CloudWatch Logs Insights or Athena. To augment general local filtering tools, such as sed, awk, PowerShell, and Python scripts, you can use CloudWatch log metric filters to match terms in log events and convert log data into metrics at scale. In addition to helping with issue diagnosis, Amazon Q can help you understand and troubleshoot issues based on your log files. At scale, AI capabilities, such as CloudWatch Logs anomaly detection, make it easier to scan log events within a log group and find anomalies in the log data. Anomaly detection uses machine learning and pattern recognition to establish baselines of typical log content. For additional best practices for observability, see Instrumenting distributed systems for operational visibility.
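
As a sketch of converting log data into metrics, the following example creates a CloudWatch Logs metric filter with boto3. The log group name, filter pattern, and metric namespace are placeholder assumptions.

import boto3

logs = boto3.client("logs")

# Turn a recurring log pattern into a metric that can be graphed and alarmed on.
# The log group, filter pattern, and namespace are placeholders.
logs.put_metric_filter(
    logGroupName="/myapp/orders-api",
    filterName="timeout-errors",
    filterPattern='"Task timed out"',
    metricTransformations=[
        {
            "metricName": "TimeoutErrorCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",
            "defaultValue": 0,
        }
    ],
)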

To figure out the cause of an issue, you might require data, such as logs and traces, that might not be readily available. Optimizing the process of collecting the right data securely and accurately can save you time. We use tools, such as AWS Systems Manager or the CloudWatch agent, to collect and view debug logs, trace issues end to end with X-Ray, or review OS kernel crash dumps and system calls. You can reduce the number of steps in diagnostic collection or remediation to resolve the issue faster. For some examples, see the AWS Support Automation Workflows (SAW) Systems Manager documents. These automated runbooks make it easy for you to collect diagnostic data and automate resolution. For example, the EC2Rescue SAW document can help you diagnose and troubleshoot issues on your EC2 instances through Systems Manager automation and the AWSSupport-ExecuteEC2Rescue runbook.

Profiling is another useful technique that you can use to troubleshoot performance-related problems. Operating system tools, such as perf, provide several tracing and performance monitoring capabilities that can be visualized as heatmaps. These heatmaps can help you find the needle in the haystack that causes performance issues. For applications, Amazon CodeGuru Profiler collects runtime performance data from live applications and provides recommendations that can help fine-tune application performance.
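
As a minimal sketch, the CodeGuru Profiler agent can be attached to a Python application in a couple of lines; the profiling group name below is a placeholder and is assumed to already exist in the account.

# Minimal sketch of attaching the CodeGuru Profiler Python agent.
# The profiling group name is a placeholder and must already exist.
from codeguru_profiler_agent import Profiler

Profiler(profiling_group_name="orders-api-profiling").start()

# ... run the application's normal work here; the agent samples the process
# in the background and submits profiles to the profiling group.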

Test and replicate at scale

If an issue occurs randomly or only with production scale traffic, I use Systems Manager to try to replicate the issue across a test set of resources. To help prevent production impact, you can simulate real-world conditions through Systems Manager Automation and Run Command documents, or load testing software through Distributed Load Testing on AWS.

Troubleshooting critical operating system errors or kernel panics that occur randomly across a fleet of EC2 instances can be challenging. When I troubleshoot this type of issue, I need to avoid downtime on production instances. With the theory that stress conditions might cause the issue to manifest more often, I opt to scale out the test stage instances and configure kernel crash dumps across the fleet through Systems Manager Automation and Run Command. Introducing stress with load testing helps to reproduce the issue and capture a kernel crash dump. This helps me narrow down the issue to a specific pattern of IO traffic for a specific operating system kernel and driver combination. The blktrace utility is useful for capturing and analyzing production IO traffic, which you can then replay on the test system by using btreplay. You can run both of these utilities across EC2 instances with Systems Manager. Systems Manager documents can run on each EC2 instance to reproduce issues in a test fleet, as shown in the following image:

[Diagram: Systems Manager running documents on each EC2 instance to reproduce issues in a test fleet]
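
A sketch of this pattern with Run Command follows: it captures block IO traces on a tagged test fleet so that they can later be replayed with btreplay. The tag, device name, and trace duration are assumptions, and blktrace must already be installed on the instances.

import boto3

ssm = boto3.client("ssm")

# Capture block IO traces on a tagged test fleet; the tag, device, and trace
# duration are placeholders.
response = ssm.send_command(
    Targets=[{"Key": "tag:Stage", "Values": ["test"]}],
    DocumentName="AWS-RunShellScript",
    Parameters={
        "commands": ["blktrace -d /dev/nvme1n1 -o /tmp/trace -w 300"]
    },
    Comment="Capture 5 minutes of block IO traces for replay with btreplay",
)
print(response["Command"]["CommandId"])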

Validate the theory of probable cause

To find root causes faster, we document troubleshooting methods for common failure modes in playbooks or troubleshooting guides. Automation also simplifies investigations, and machine learning can accelerate that automation. For more information, see Five troubleshooting examples with Amazon Q. However, when the root cause of the issue isn't clear, it's important to seek diverse perspectives. Isolating specific components of the system and analyzing or testing how they contribute to the overall issue can help surface the problem sooner. When the issue is particularly ambiguous, it's useful to experiment with theories and test out quick solutions. If an experiment fails, iterate at low cost to validate the correctness of a theory of probable cause.

Verify the resolution

After you identify and review a fix, the next step is to test the fix in a sufficiently scaled system. Fixing one component can sometimes reveal new issues further down the path. Therefore, be sure to perform tests that validate the safety of the fix. It’s best to deploy the fix in stages and have your rollback procedures ready. For information on approaches for safe continuous deployments, see Automating safe, hands-off deployments.

Learn from past issues to avoid them in the future

You must design systems to be resilient and fault tolerant. However, at scale, faults eventually occur despite our best design efforts. Therefore, you must design your systems for that eventuality, with operations in mind, so that you can solve issues quickly and learn from them. Any lack of operational readiness, including gaps in troubleshooting approaches, can lead to extended impact for end customers. At Amazon, we learn from each issue through the Correction of Error (CoE) mechanism. We use the CoE process to improve quality by documenting all aspects of the issue in a detailed retrospective, which helps us avoid repeating past problems and address issues at scale. Failures are great opportunities for learning. Not only can they drive corrective actions in post-mortems to improve resilience, but they can also make us much better at troubleshooting.

By understanding the needs of workloads, predefining runbooks for routine activities and playbooks for guiding issue resolution, using the operations-as-code features in AWS, and maintaining situational awareness, operations teams can be better prepared and can respond effectively when incidents occur. AWS learns from common issues that you encounter and shares answers to these common questions in our troubleshooting resources and on re:Post. AWS also updates best practices in AWS Well-Architected and AWS Trusted Advisor.

Conclusion

These approaches can help you with your journey to operational excellence and faster troubleshooting. To learn more about how our plans and offerings can help you get the most out of your AWS environment, see AWS Support.


About the author


Tipu Qureshi

Tipu Qureshi is a Senior Principal Engineer with AWS Support. He works with customers on moving mission-critical workloads into AWS. He drives operational excellence across AWS and customer teams based on learnings from his many years of troubleshooting.