Drivers for improvement


In context of Ops#10 of the Well-Architected Framework tool, please expand on "Identify drivers for improvement to help you evaluate and prioritize opportunities." An example would help-- is 'driver' meant to be an organizational goal, a key executive, an abstract principle?

1 Answer

Define drivers for improvement is not an abstract principle. When you look at the health of your operations, how do you choose what to improve? Drivers broadly fall into three categories: desired capabilities, unacceptable issues, and compliance requirements. Desired capabilities are new features, new processes, and new technologies. Unacceptable issues are things like persistent errors, known bugs in your processes, toil, and technical debt. Compliance requirements vary by industry and security level and can influence what you improve. Depending on the situation, you may need to focus on one category over the others. In that case you'd prioritize your operational improvements based on the driver that's most important to you at that time. Think of drivers as a filter that reduces the list of all possible improvements to a workable set that supports your business outcomes at this point in time.
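To make the "filter" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `Improvement` record, the `effort_days` estimate, the `workable_set` helper, and the sample backlog are assumptions, not part of the Well-Architected Framework or any AWS tool.

```python
from dataclasses import dataclass
from enum import Enum


class Driver(Enum):
    # The three driver categories described in the answer above.
    DESIRED_CAPABILITY = "desired capability"
    UNACCEPTABLE_ISSUE = "unacceptable issue"
    COMPLIANCE_REQUIREMENT = "compliance requirement"


@dataclass
class Improvement:
    title: str
    driver: Driver
    effort_days: int  # hypothetical rough estimate


def workable_set(backlog, current_driver, capacity_days):
    """Keep items that match the current driver, smallest first, within capacity."""
    matching = [item for item in backlog if item.driver is current_driver]
    selected, used = [], 0
    for item in sorted(matching, key=lambda i: i.effort_days):
        if used + item.effort_days <= capacity_days:
            selected.append(item)
            used += item.effort_days
    return selected


# Hypothetical backlog of possible operational improvements.
backlog = [
    Improvement("Adopt canary deployments", Driver.DESIRED_CAPABILITY, 8),
    Improvement("Fix flaky failover runbook", Driver.UNACCEPTABLE_ISSUE, 3),
    Improvement("Automate manual log rotation", Driver.UNACCEPTABLE_ISSUE, 2),
    Improvement("Add audit logging for admin actions", Driver.COMPLIANCE_REQUIREMENT, 5),
    Improvement("Pay down deploy-script debt", Driver.UNACCEPTABLE_ISSUE, 6),
]

# This period the team decides unacceptable issues matter most.
picked = workable_set(backlog, Driver.UNACCEPTABLE_ISSUE, capacity_days=6)
for item in picked:
    print(item.title)
```

Picking the smallest items first is just one possible tie-breaker. The point is that the driver you choose narrows a long backlog to a short list you can actually commit to.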

It's also important to understand where this best practice fits into the rest of the best practices in OPS 11, "How do you evolve operations?". We want to understand how a customer approaches continuous improvement, specifically around operations. "Improving operations" is a very broad thing to do, so we use the best practices in the question to build a foundation for making improvements in an informed, predictable way.

We start with Have a process for continuous improvement. It really means that you are making it a priority to improve your operations and you have ceremony around that, and that you commit to doing that periodically, no matter what happens. It could be as simple as having a meeting once a month to look at your operations metrics and picking one to improve.

Next up is Perform post-incident analysis, which gives us more inputs into the process for continuous improvement. Post-incident analysis gives us a way to measure our runbooks and playbooks and use that data to drive improvements. It also gives us insight into inter-team communication and incident handling, which are yet more areas we can improve.

Implement feedback loops falls into two categories: immediate feedback and retrospective analysis. Building feedback into your operations procedures is another way to capture data to drive improvements. This can be anything from customer or team member feedback to metrics around mean time to discover or resolve. Retrospective analysis means that you run team retrospectives periodically and include an improvement target to work on until the next retro.
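As a rough illustration of the metrics mentioned above, the sketch below computes mean time to discover and mean time to resolve from a handful of incident records. The record layout, the `mean_minutes` helper, and the timestamps are made up for the example, and it assumes both metrics are measured from the start of impact. Your own definitions may differ.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; all timestamps are invented for illustration.
incidents = [
    {"started": datetime(2024, 1, 3, 10, 0),
     "detected": datetime(2024, 1, 3, 10, 12),
     "resolved": datetime(2024, 1, 3, 11, 0)},
    {"started": datetime(2024, 1, 9, 14, 0),
     "detected": datetime(2024, 1, 9, 14, 4),
     "resolved": datetime(2024, 1, 9, 14, 30)},
    {"started": datetime(2024, 1, 20, 9, 0),
     "detected": datetime(2024, 1, 20, 9, 20),
     "resolved": datetime(2024, 1, 20, 10, 30)},
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    durations = [(r[end_key] - r[start_key]).total_seconds() / 60 for r in records]
    return mean(durations)


# Assumption: both metrics are measured from the start of impact.
mean_time_to_discover = mean_minutes(incidents, "started", "detected")
mean_time_to_resolve = mean_minutes(incidents, "started", "resolved")

print(f"Mean time to discover: {mean_time_to_discover:.0f} min")
print(f"Mean time to resolve: {mean_time_to_resolve:.0f} min")
```

Tracking these numbers from one retrospective to the next gives the team a simple way to check whether the chosen improvement target actually moved.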

Perform Knowledge Management is also a driver for improvement. Maintaining an organization's written knowledge is essential to sharing information. Is a runbook out of date? Has our production architecture diagram evolved? Updating these documents, or creating new documents, is another way we can improve our operations.

We've already covered Define drivers for improvement above.

The next three best practices cross the boundary from "your team" to "your organization." Validate insights is less about reviewing metrics and more about proving a hypothesis to other teams and business stakeholders. You should invite them to review your operational health improvement activities and provide input to those activities.

Perform operations metrics reviews is a formalized version of Validate insights. At Amazon we use operations metrics reviews to validate that our services behave the way we expect them to and, if something did happen, that we recovered as planned. The re:Invent talk below dives into how we do it starting at the 31:34 mark. Your operations metrics review should include business stakeholders and other teams.

Document and share lessons learned is about being open with your successes and failures and showing other teams, or the wider organization, what you learned along the way. Perhaps the way you solved a problem will help another team in a similar circumstance. This is necessary for building a learning culture.

We end with something simple but difficult: Allocate time to make improvements. Now that we have implemented all the rest of the best practices in OPS 10, we have to commit to making time to improve our operations. We should always devote a percentage of our capacity to making improvements, no matter what happens. Doing so signals that improving operational health is just as important as shipping new features.


answered 2 years ago
