We don't supply estimated times for many things, and this is one of them. My best advice here is to test and take the numbers from there.
In the event of a large-scale failure, the numbers from your tests might not be realistic, particularly as there will be many other customers all trying to do the same thing. So definitely plan for the worst case.
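As a minimal sketch of turning test runs into a planning number: the durations below are made-up placeholders, and `planning_estimate` and its `safety_factor` parameter are hypothetical names chosen for illustration, not anything AWS provides. The idea is simply to start from the slowest run you measured and pad it, since a real event is likely to be slower than any test.

```python
def planning_estimate(durations_s, safety_factor=2.0):
    """Return a pessimistic planning figure from measured failover durations.

    Takes the slowest observed run and multiplies it by a safety factor,
    because a large-scale event is likely to be slower than any test run.
    """
    if not durations_s:
        raise ValueError("need at least one measured duration")
    if safety_factor < 1.0:
        raise ValueError("safety_factor should be >= 1.0")
    return max(durations_s) * safety_factor


# Placeholder measurements in seconds (illustrative only, not real data).
measured = [42.0, 55.5, 48.2, 61.0]
print(planning_estimate(measured))  # 122.0
```

The safety factor is a judgment call; the point is that the planning figure should never be lower than the worst run you have actually observed.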
If it were me, I would try to avoid making any changes in the event of a failure, because you don't know what has failed and therefore what will work and what won't. And it's next to impossible to test for that. Instead, build a system that is active in two (or three) AZs all the time. I appreciate that this is a non-trivial thing to ask and to design for, but if you have an application that must be up all of the time then this is a good strategy.
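To make the sizing consequence of "active in two (or three) AZs" concrete, here is a rough sketch of the arithmetic. It assumes the goal is to keep full capacity after losing any single AZ without launching anything new; `instances_per_az` and the figure of 12 instances are hypothetical, used only for illustration.

```python
import math


def instances_per_az(required_instances, az_count):
    """Instances to run in each AZ so that the remaining AZs still provide
    `required_instances` after any single AZ is lost."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    # The surviving (az_count - 1) AZs must carry the whole load.
    return math.ceil(required_instances / (az_count - 1))


# Hypothetical workload that needs 12 instances to serve peak traffic.
for azs in (2, 3):
    per_az = instances_per_az(12, azs)
    print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} in total")
# 2 AZs: 12 per AZ, 24 in total
# 3 AZs: 6 per AZ, 18 in total
```

Under these assumptions, spreading across three AZs needs less spare capacity than two, which is one reason the three-AZ variant is worth considering.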
Thanks for your quick reply!
This exact pattern is recommended in the "Limitations and Extensibility" section of the "Floating IP pattern for HA between active–standby stateful servers" page (https://docs.aws.amazon.com/whitepapers/latest/real-time-communication-on-aws/floating-ip-pattern-for-ha-between-activestandby-stateful-servers.html).
To your point, I have seen other AWS blog posts on DR warning against control plane changes during a failover. Is there a DR guide that covers core AWS services and the dos and don'ts of AZ-failure-resilient DR?
Thanks again for your response and your insight.
Reflecting more on your third paragraph: are you saying that active/active is the only truly reliable DR strategy for surviving an AZ failure? Or is it that, as we learn more from actual AZ outages, anything less than active/active turns out to be trickier to get right than previously thought (perhaps because everyone implements what worked and was recommended before, and that itself causes more problems during an outage), and might not be worth it unless you have an intimate understanding of those issues?
There's a famous quote from Werner Vogels (Amazon CTO): "Everything fails all the time". And to add to that: complex systems (of which an AWS AZ could be counted as one) fail in ways that are ... complex. So while some things may continue to work during a failure event, other things will not, and it's not possible to predict which will and which won't. And the next failure may not be like the last. There are many components that go into making up an AZ, and there could be any sort of failure there, from large-scale to small-scale. Planning is good! Great even! But in my opinion the best thing that you can do in these events is to be already running in another AZ.