Designing for failures in the control plane


After last month's incident in us-east-1, I've been wondering what we could have done to avoid its impact.

We saw a minor impact on running services and some elevated API error rates on a few AWS services. Still, we were operating almost blind: no console was available, CloudWatch wasn't behaving as expected, and observability as a whole was quite restricted.

We currently have a single-Region setup, located in exactly the affected Region. Still, I'm not confident that a multi-Region setup would actually have helped: the control plane was misbehaving, which would have made changes or failover to other Regions cumbersome.

What kind of design strategies would have helped in such a scenario?

Incident Ref:

asked 2 years ago, 1053 views
1 Answer

First of all, you might want to reach out to your account team so that your account Solutions Architect can make suggestions specific to your workloads.

We believe that for the vast majority of customer workloads, AWS's highly resilient regional design is the best way for customers to achieve their availability goals. Operating across multiple Availability Zones within a Region is a best practice, and allows customers to achieve very high availability. This is how has run and still runs today.

For some customer workloads, taking advantage of AWS's unique, highly isolated Region architecture can allow customers to meet the highest availability goals. Determining the right design requires understanding an application's business criticality, dependencies, workload volumes, and the nature of the work it performs. We have customers with workloads that operate with multi-Region resiliency on AWS, and these customers have achieved extremely high availability. But this approach does require customers to make deeper investments to build their applications correctly and to regularly test their failover capabilities to ensure they meet their business needs.

We highly recommend engaging in a Well-Architected review to help you assess your application's resiliency posture.

That said, as mentioned in the RCA, some access to the control plane was impacted by network issues. More importantly, the impact was limited to the affected Region alone.

What that means is that, if you use a multi-Region architecture, you will still be able to reach the control plane endpoints of other Regions (for example, for EC2 you can refer here), which increases the resiliency of your application, especially when it comes to automation.
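To make the point about automation concrete, here is a minimal Python sketch of choosing which Region's control plane to talk to. It assumes the standard `ec2.<region>.amazonaws.com` endpoint pattern used by commercial Regions; the function names, the Region list, and the health probe are hypothetical and only there for illustration, not a recommended implementation.

```python
from typing import Callable, Iterable, Optional


def ec2_endpoint(region: str) -> str:
    """Build the regional EC2 API endpoint URL.

    Assumption: the Region follows the standard commercial pattern
    ec2.<region>.amazonaws.com (other partitions use different domains).
    """
    return f"https://ec2.{region}.amazonaws.com"


def pick_control_plane_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first Region whose endpoint passes the probe, else None.

    `is_healthy` is a hypothetical caller-supplied probe that receives the
    endpoint URL. Regions are tried in the order given, so list the
    preferred Region first.
    """
    for region in regions:
        try:
            if is_healthy(ec2_endpoint(region)):
                return region
        except Exception:
            # Treat any probe error (timeout, DNS failure, ...) as unhealthy
            # and move on to the next Region.
            continue
    return None
```

In automation you would supply a real probe (for instance a cheap, read-only API call with a short timeout) and then create your SDK client against the chosen Region, for example `boto3.client("ec2", region_name=chosen)` if you use boto3. This only helps if the workload and its data already exist in the other Region; it does not replace a tested failover plan.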

Lastly, through a Well-Architected review and your conversations with your account team, they will be able to help you understand your workloads' RTO and RPO, which should be the main driver for deciding between multi-AZ and multi-Region designs while balancing resiliency, cost, and operational overhead.

answered 2 years ago
AWS
reviewed 2 years ago
