Following best practices in designing resilient applications - Part 2

10 minute read
Content level: Advanced
This article is the second part of a series on resilience best practices and key design principles that can minimize business disruptions during outages.

Introduction

In part 1 of the resilience series, I covered the importance of designing resilient applications in the cloud. As a Senior Technical Account Manager (TAM) with AWS Support, I see customers design workloads without considering resilience. This can lead to business disruptions during outages.

I also explained the importance of resilience, and used examples from the automotive industry to show how resilience can affect critical operations and reputation. I outlined key strategies that included:

  • Know your workloads.

  • Use the AWS Well-Architected Framework to build resilient systems.

  • Strive for operational excellence.

I emphasized the importance of setting resilience goals, such as Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). I also introduced the first three of six design principles for building resilient applications:

  1. Design for failure, and nothing fails.

  2. Build security in every layer.

  3. Use many storage options.

These principles form the foundation to create robust, fault-tolerant applications in AWS. In part 2, I cover practical resilience strategies and best practices. In the following image, you can see how leveraging many options can help you build resilient applications.

[Image: using many options to build resilient applications]

Principle 4: Reduce dependency on control planes

Control plane

The control plane is the centralized management layer that oversees AWS resources and promotes efficient service delivery. It provisions resources, monitors them, and manages the cloud infrastructure. Because the control plane performs complex work and is prone to delays or failures during high-demand events, it's crucial to reduce control plane dependencies. When you minimize reliance on the control plane, especially for day-to-day operations, your applications can function more smoothly, even during control plane impairments.

For example, when you launch an Amazon Elastic Compute Cloud (Amazon EC2) instance, the control plane finds capacity and configures the instance. However, if your system relies too heavily on automated operations controlled by the AWS control plane, then it becomes vulnerable to interruptions. This vulnerability stems from delays or disruptions that the control plane can experience during rare Regional events or service issues. When you configure your system to operate independently, you enhance the resiliency of your system, even if control plane actions are impaired.

The following example shows the actions that the control plane takes when an EC2 instance is launched:

  1. Find a physical server that meets the placement group and virtual private cloud (VPC) tenancy requirements.

  2. Allocate a network interface out of the VPC subnet.

  3. Prepare an Amazon Elastic Block Store (Amazon EBS) volume.

  4. Generate AWS Identity and Access Management (IAM) role credentials.

  5. Install security group rules.

  6. Store the results in the data stores of the various downstream services.

  7. Propagate the needed configurations to the server in the VPC.

Data plane

The data plane processes and transmits data within AWS services. It facilitates the seamless movement of information, and provides reliable and efficient data processing across various AWS resources and services.

For example, the data plane performs the following tasks to keep existing EC2 instances running:

  • Routes packets according to the VPC's route tables.

  • Reads and writes from Amazon EBS volumes.

When you consider the availability and uptime of AWS services, it helps to distinguish which tasks are handled by the control plane and which are handled by the data plane. For example, consider Amazon EC2. The day-to-day operations of an EC2 instance, such as serving web traffic, are handled by the data plane. Less frequent actions that involve API calls to the Amazon EC2 service, such as launching or resizing instances, are handled by the control plane. The control plane serves the administrative API functions for create, read, update, delete, and list (CRUDL) operations and is essential for resource provisioning. It's also responsible for intricate tasks, such as finding capacity and configuring instances, and this intricacy can make the control plane less reliable than the data plane. After resources are provisioned, they operate independently of the control plane, which contributes to a more statically stable system.

The Amazon EC2 data plane gives the physical machine that hosts an EC2 instance local access to all necessary routing information, both inside and outside its VPC. If the Amazon EC2 control plane is impaired, then the existing traffic flow isn't affected.

Note: The physical server might not receive updates during the impairment. These updates might include the addition of a new EC2 instance to a VPC, or a new security group rule.
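
The split described above can be illustrated with a toy model. This is only a sketch, not how Amazon EC2 is implemented: the `Host` and `ControlPlane` classes, the route names, and the `ControlPlaneImpaired` exception are all hypothetical. It shows that a host that keeps its routing information locally can continue to serve existing traffic while a new configuration update fails to arrive.

```python
class ControlPlaneImpaired(Exception):
    """Raised when a configuration change cannot be delivered."""


class Host:
    """Hypothetical stand-in for a server that holds its routes locally."""

    def __init__(self, routes):
        self.routes = dict(routes)  # local copy used by the data plane

    def route(self, destination):
        # Data plane path: reads only local state, never calls the control plane.
        return self.routes.get(destination, "drop")


class ControlPlane:
    """Hypothetical stand-in for the management layer."""

    def __init__(self):
        self.impaired = False

    def add_route(self, host, destination, target):
        # Control plane path: changes host state and can fail during an impairment.
        if self.impaired:
            raise ControlPlaneImpaired("update not delivered")
        host.routes[destination] = target


host = Host({"10.0.1.0/24": "local"})
control_plane = ControlPlane()
control_plane.impaired = True  # simulate an impairment

try:
    control_plane.add_route(host, "10.0.2.0/24", "peering")
    update_applied = True
except ControlPlaneImpaired:
    update_applied = False

existing = host.route("10.0.1.0/24")  # existing traffic still flows
new = host.route("10.0.2.0/24")       # the new route never arrived
```

In this model, the existing route keeps working and only the new route is missing, which mirrors the note above about updates that might not be received during an impairment.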

Principle 5: Design for static stability

Static stability refers to a system's ability to maintain and sustain its normal operations without requiring modifications when faced with failures or when a dependent component isn't available. This characteristic makes sure that the system can continue to function reliably without the immediate need for adjustments. To enhance resilience and predictability, static stability minimizes the need for real-time modifications during adverse events and promotes dependable overall system performance.

For example, consider an aircraft in flight. The static stability of an aircraft determines how it responds to small changes in wind, turbulence, or other external forces. A highly stable aircraft automatically corrects itself and returns to its intended flight path. A less stable aircraft might diverge from its course and become more difficult to control.

When you build resilient systems, you must proactively address potential challenges and disruptions. You can create a stable operational environment when you separate the control planes and data planes, establish continuous data plane operation, and configure autonomous handling of failures.

The following guidelines can help you anticipate, manage, and mitigate impairments within a system.

Proactive preparation

  • Anticipate impairments: Conduct thorough risk assessments and identify potential vulnerabilities within the system.

  • Shift to proactive strategies: Move away from a reactive approach and implement preemptive measures based on identified risks. This makes sure that you're ready for potential impairments before they occur.

Separation of control and data planes

  • Clear architectural boundary: Design systems with a distinct separation between the control and data planes.

  • Preserve the data plane state: This separation allows the data plane to maintain its ongoing operations and existing state, even if the control plane is impaired.

Continuous data plane operation

  • Uninterrupted functionality: Despite challenges in the control plane, the data plane continues to operate without interruptions.

  • Preserve the operational state: Existing functions and operations persist, which maintains operational continuity.

Limited data plane updates

  • Temporarily halt updates: During periods of control plane impairment, updates to the data plane might be temporarily paused.

  • Unaffected existing operations: All previously established functionalities, configurations, and operational processes remain unaffected.

Autonomous handling of failures

  • Inherent resilience: To handle failures independently, equip infrastructure and instances with self-sufficiency. To do this, implement automated processes and mechanisms that respond to potential issues proactively. For example, use load balancers to evenly balance traffic across resources and mitigate the impact of failures on performance.

  • Auto scaling: Auto Scaling groups adjust resources dynamically based on demand, but they depend on the control plane, so scaling can be delayed when the control plane is impaired. Over-provisioning resources is a more reliable option for resilience and a key aspect of static stability. When you pre-provision additional capacity, the system can handle unexpected spikes in demand or hardware failures without relying on dynamic scaling.

  • Observability: Incorporate systems in your design that monitor for anomalies, alert you when they occur, and take automatic remediation actions. Examples include automation that restarts failed instances or diverts traffic to healthy instances.

  • Reduced dependencies: Minimize reliance on services or dependencies so that the system can function autonomously during disruptions.
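
The pre-provisioning idea in the list above can be sized with simple arithmetic. The following is a minimal sketch under stated assumptions: the workload is spread evenly across several isolated locations (for example, Availability Zones), the goal is to carry peak load after losing some of them without launching anything new, and the function name and numbers are hypothetical.

```python
import math


def instances_per_zone(peak_instances, zones, zones_lost=1):
    """Return how many instances to run in each zone so that the
    surviving zones can carry the full peak load on their own."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    # Each surviving zone must absorb an equal share of the peak load.
    return math.ceil(peak_instances / surviving)


# Assumed example: peak load needs 12 instances, spread over 3 zones.
zones = 3
per_zone = instances_per_zone(peak_instances=12, zones=zones)
total = per_zone * zones
```

With these assumed numbers, each zone runs 6 instances for a total of 18, compared with 12 when no headroom is planned. The extra 6 instances are the cost of not depending on scaling actions during an impairment.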

Local decision making

  • Utilization of local data: Systems make operational decisions based on locally stored data.

  • Enhanced autonomy: Reduce reliance on external communication so that systems can become more self-sufficient and operate and perform essential tasks independently.
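
One way to read the local decision making guidelines above is a "last known good" configuration that is kept on local disk. The following sketch assumes a hypothetical `fetch_remote` callable that stands in for any external configuration source. The file path, the function names, and the configuration key are illustrative only.

```python
import json
import os
import tempfile


def load_config(fetch_remote, cache_path):
    """Prefer the remote configuration, but fall back to the local copy
    when the remote source can't be reached."""
    try:
        config = fetch_remote()
    except ConnectionError:
        # Remote source unavailable: decide based on locally stored data.
        # If no local copy exists yet, the FileNotFoundError propagates.
        with open(cache_path, encoding="utf-8") as handle:
            return json.load(handle), "cache"
    # Remote source reachable: refresh the local copy for later use.
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump(config, handle)
    return config, "remote"


def healthy_fetch():
    # Hypothetical remote call that succeeds.
    return {"max_connections": 50}


def impaired_fetch():
    # Hypothetical remote call during a disruption.
    raise ConnectionError("configuration service unreachable")


cache_path = os.path.join(tempfile.mkdtemp(), "config.json")
first, first_source = load_config(healthy_fetch, cache_path)
second, second_source = load_config(impaired_fetch, cache_path)
```

The second call returns the same settings as the first, read from the local file, so the component keeps working with its last known configuration until the external source recovers.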

Principle 6: Don't fear constraints

This principle encourages a strategic re-evaluation of traditional architectural limitations, and can foster adaptability and resilience.

Rethink traditional architectural constraints

Traditionally, when faced with the need for more computing power, the immediate solution is to upgrade to a larger EC2 instance. However, if you rethink this traditional constraint, then you can explore alternative solutions that offer greater flexibility and cost-effectiveness.

For example, you can implement a microservices architecture. This approach involves breaking down the application into smaller, independent services that can be deployed and scaled individually. With microservices, you can distribute the workload across multiple smaller instances rather than relying on a single large instance. This provides greater scalability and enhances fault isolation, as failures in one service do not impact the entire application.

Also, you can use serverless computing options such as AWS Lambda. With serverless architecture, you don't need to manage infrastructure provisioning and scaling manually. Instead, the cloud provider automatically handles the scaling of resources based on demand. That way, you can focus on writing code and developing features, rather than worrying about infrastructure management.

Enhance database performance

Better database performance also increases resilience, because it enhances the database's ability to withstand failures and maintain operational stability during challenging conditions.

You can use multiple read replicas, sharding, or database clustering to implement redundancy and fault tolerance across the database infrastructure. If there's a failure or outage in one replica or shard, then the system can seamlessly switch to an alternate. This ability to switch minimizes downtime and supports continuous availability of data.
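
The switch to an alternate replica can be sketched as a simple endpoint selection. This is an illustration under assumptions, not a description of how any managed database service performs failover: the endpoint names, the `is_healthy` check, and the fallback to the primary are all hypothetical.

```python
def pick_read_endpoint(replicas, is_healthy, primary):
    """Return the first healthy read replica, or the primary endpoint
    when no replica passes the health check."""
    for endpoint in replicas:
        if is_healthy(endpoint):
            return endpoint
    return primary


# Hypothetical endpoints and a simulated outage of the first replica.
replicas = ["replica-a.example.internal", "replica-b.example.internal"]
primary = "primary.example.internal"
unavailable = {"replica-a.example.internal"}


def is_healthy(endpoint):
    # Stand-in for a real health check, such as a test query with a timeout.
    return endpoint not in unavailable


chosen = pick_read_endpoint(replicas, is_healthy, primary)
```

In practice, the health check result would usually be cached for a short time so that every read doesn't pay for a probe.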

You can also use strategies like Provisioned IOPS (PIOPS), SSD-backed instance storage, and database caching with Amazon ElastiCache to improve performance and scalability. When the database efficiently handles increased workloads and delivers faster response times, it can better withstand sudden spikes in traffic or demand without sacrificing stability or reliability.
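
The caching strategy mentioned above is often implemented as a cache-aside pattern. The following sketch uses an in-memory dictionary as a stand-in for a cache service such as Amazon ElastiCache. The `CacheAside` class, the `slow_query` function, and the 60-second time to live are assumptions made for illustration.

```python
import time


class CacheAside:
    """Check the cache first and query the database only on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no database call
        value = self._load(key)  # cache miss or expired entry
        self._entries[key] = (now + self._ttl, value)
        return value


db_calls = []


def slow_query(key):
    # Hypothetical database read. The list records how often it runs.
    db_calls.append(key)
    return f"row-for-{key}"


cache = CacheAside(slow_query)
first_read = cache.get("user:1")
second_read = cache.get("user:1")
```

Both reads return the same row, but only the first one reaches the database. During a traffic spike, this reduction in database calls is what leaves the database with headroom.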

Enhancements in database efficiency, such as improved IOPS and storage performance, can help maintain data integrity and durability. With faster and more reliable storage solutions, the risk of data corruption or loss during system failures or hardware issues is reduced. This promotes the resilience of the overall system.

Customized scaling

It's important to acknowledge the uniqueness of your system requirements. You can scale the instance size and type as needed with minimal or no downtime. This tailored approach enhances system resilience and adapts to varying workloads without sacrificing stability.

When you embrace constraints and adopt these strategies within the framework of static stability, you fortify the system against challenges.

Conclusion

This article series demonstrates that a resilient AWS application prioritizes security, optimizes storage, reduces dependencies on control planes, maintains static stability, and tackles constraints. This holistic approach makes sure that applications can withstand challenges and thrive in dynamic and unpredictable environments. AWS Support engineers and TAMs can help you with general guidance, best practices, troubleshooting, and operational support on AWS. To learn more about plans and offerings, see AWS Support.

About the author

Rav Bommakanti is a Senior TAM with AWS Energy. He's passionate about solving complex customer problems. With more than 16 years of experience in IT across various domains and technologies, he brings vast expertise in developing resilient, cost-effective, and innovative solutions. In his free time, he enjoys traveling and photography.