AI Agents Operational Framework
Operational best practices for building, deploying and operating AI agents on AWS.
What are AI agents?
Artificial intelligence (AI) agents are autonomous software systems that use AI, such as foundation models (FMs), to reason (break down a user-requested task into multiple steps), plan (use the developer-provided instructions to create an orchestration plan), and complete tasks (invoke tools, APIs, and other agents) on behalf of humans or systems.
According to Gartner, 33% of enterprise software apps will include agentic AI by 2028, and 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028.
This article gives you a structured approach to building, deploying, and operating AI agents in your organization on AWS.
Benefits
The AI Agents operational framework provides several benefits for your organization:
- It increases the success rate of launching and scaling your agentic AI systems
- It increases reliability and mitigates risk in your agentic AI systems
- It accelerates issue resolution
- It helps you achieve continuous optimization
Choice of AI agents
You can generally use or build AI agents in three ways:
- Use fully managed specialized agents like Kiro, Amazon Q and AWS Transform.
- Build your own agents using managed services like Amazon Bedrock AgentCore and Amazon Bedrock Agents.
- Build your own agents using open source frameworks like Strands Agents and LangGraph.
In this article we focus on AI agents on Amazon Bedrock, the easiest way to build and scale generative AI applications. However, most of the operational best practices in the framework apply to the other types of AI agents as well.
5-stage operational framework for AI agents
The AI agents operational framework spans the agent’s lifecycle, which generally includes five stages.
1. Prepare
This is the scoping and planning stage. Start by defining a business problem that is feasible for AI agents to solve. Not every problem needs an AI agent: well-defined workflows or tasks can be implemented with workflow services such as Amazon Bedrock Flows, or AWS Step Functions with AWS Lambda. AI agents are better suited to use cases that require stronger reasoning, planning, dynamic routing, and multi-step tool calling. From an operations perspective, start with a simple use case, such as a single HR assistant agent that requests PTO.
To estimate the cost of the entire agentic AI system, include every feature and component used, such as the foundation model (FM), request and response tokens, the vector database, and any other services in the architecture, like Amazon API Gateway and AWS Lambda. For more details, refer to the Bedrock pricing page and this blog post. It’s also important to create a GenAI cost control strategy from the beginning, for example by implementing cost guardrails with resource tagging, cost monitoring reports, and budget alerts.
Review service quotas to ensure you can operate at every stage, from dev to prod, without errors or throttling. Plan for failure and design for resiliency, for example with a multi-AZ or multi-Region architecture. Review the FM throughput requirement to ensure the system can handle production traffic, and consider purchasing Provisioned Throughput to maintain a higher level of throughput.
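The cost estimate described above can be roughed out before any infrastructure exists. The following is a minimal sketch, not an official calculator: the `TokenPricing` class, the `estimate_monthly_cost` function, and every price and traffic figure are hypothetical placeholders, so substitute the values from the pricing pages for the model and services you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical per-1,000-token prices in USD (not real list prices)."""
    input_per_1k: float
    output_per_1k: float


def estimate_monthly_cost(
    requests_per_month: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    pricing: TokenPricing,
    other_monthly_costs: dict[str, float],
) -> dict[str, float]:
    """Return a per-component breakdown and a total, all in USD per month."""
    # Token costs scale with traffic: requests x tokens per request x unit price.
    input_cost = requests_per_month * avg_input_tokens / 1000 * pricing.input_per_1k
    output_cost = requests_per_month * avg_output_tokens / 1000 * pricing.output_per_1k

    breakdown = {"fm_input_tokens": input_cost, "fm_output_tokens": output_cost}
    # Other components (vector database, API layer, compute) are passed in
    # as already-estimated monthly figures.
    breakdown.update(other_monthly_costs)
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Example with made-up numbers for a small HR assistant agent.
estimate = estimate_monthly_cost(
    requests_per_month=100_000,
    avg_input_tokens=2_000,
    avg_output_tokens=500,
    pricing=TokenPricing(input_per_1k=0.003, output_per_1k=0.015),
    other_monthly_costs={"vector_database": 100.0, "api_gateway": 10.0},
)
for component, usd in estimate.items():
    print(f"{component}: ${usd:,.2f}")
```

Keeping the breakdown per component, rather than returning only a total, makes it easier to line the estimate up with resource tags and budget alerts later.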
We recommend engaging your AWS account team and/or AWS Support to plan your agentic AI system launch. We have subject matter experts (SMEs) who can guide you throughout the deployment process: reviewing resiliency, service quotas, and throughput; setting up cost controls and monitoring; accelerating issue resolution; and providing post-go-live guidance to optimize your agentic AI system over time. Consider using AWS Countdown Premium (CDP) to prepare for your critical agentic AI launches.
2. Build
When building AI agents, implement control logic such as timeouts and retries to handle idle or looping agents and to recover automatically from service errors. Create an API gateway (reference here) and/or load balancing (reference here) layer so that you can abstract the FMs, conduct A/B testing, and distribute traffic across different AZs and/or Regions (with a service like Amazon Route 53; refer to this blog post). You can enable agent memory to retain conversational context across multiple sessions, and define OpenAPI schemas if you want a structured way for your agent to invoke API operations and perform actions. Incorporate prompt engineering techniques in agent instructions and consider using advanced prompt templates. Consider incorporating human-in-the-loop (HITL) confirmation with Bedrock Agents, such as user confirmation.
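As an illustration of the timeout and retry control logic mentioned above, here is a minimal sketch in plain Python. It assumes a hypothetical `AgentStepError` that stands in for whatever retryable error your client raises (for example, throttling); `call_with_retry` and `run_steps` are illustrative names, not part of any AWS SDK. Note that the deadline bounds the total time spent retrying and does not interrupt a single call that hangs, so a per-request timeout on the client is still needed.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class AgentStepError(Exception):
    """Hypothetical stand-in for a retryable service error such as throttling."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    deadline_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with exponential backoff and jitter.

    Gives up when attempts run out or the overall deadline has passed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except AgentStepError:
            elapsed = time.monotonic() - start
            if attempt == max_attempts or elapsed >= deadline_s:
                raise
            # Full jitter: wait a random time up to the exponential cap,
            # but never sleep past the remaining deadline.
            delay = random.uniform(0, base_delay_s * 2 ** (attempt - 1))
            sleep(min(delay, deadline_s - elapsed))
    raise RuntimeError("unreachable: the loop always returns or raises")


def run_steps(next_step: Callable[[], bool], max_steps: int = 10) -> int:
    """Run agent steps until one reports completion (returns True).

    Stops a looping agent by raising TimeoutError after `max_steps` steps.
    Returns the number of steps taken.
    """
    for step in range(1, max_steps + 1):
        if next_step():
            return step
    raise TimeoutError(f"agent did not finish within {max_steps} steps")
```

The `sleep` parameter is injectable so the backoff can be tested without real waiting. Capping the number of steps is a simple guard against an agent that keeps calling tools without converging.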
3. Deploy
Once you’ve built your agents, automate agent deployment using infrastructure as code (IaC), like AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, or Terraform, and CI/CD pipelines like AWS CodePipeline or Jenkins. Create aliases and versions with a naming convention for your agents so that you can easily track all your agents.
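A naming convention is easier to enforce when it lives in code that your pipeline calls. The sketch below assumes one possible convention, `<team>-<use_case>-<env>` for agent names and `<env>-v<version>` for aliases. Both patterns, the allowed environments, and the function names are hypothetical choices for illustration; check the service's own naming constraints before adopting any scheme.

```python
import re

# Hypothetical convention: lowercase alphanumeric parts joined by hyphens.
ALLOWED_ENVS = ("dev", "test", "prod")
_PART = re.compile(r"^[a-z0-9]+$")


def _check_part(label: str, value: str) -> None:
    if not _PART.match(value):
        raise ValueError(f"{label} must be lowercase alphanumeric, got {value!r}")


def _check_env(env: str) -> None:
    if env not in ALLOWED_ENVS:
        raise ValueError(f"env must be one of {ALLOWED_ENVS}, got {env!r}")


def build_agent_name(team: str, use_case: str, env: str) -> str:
    """Build an agent name of the form <team>-<use_case>-<env>."""
    _check_part("team", team)
    _check_part("use_case", use_case)
    _check_env(env)
    return f"{team}-{use_case}-{env}"


def build_alias_name(env: str, version: int) -> str:
    """Build an alias name of the form <env>-v<version>."""
    _check_env(env)
    if version < 1:
        raise ValueError("version must be a positive integer")
    return f"{env}-v{version}"


def parse_agent_name(name: str) -> dict[str, str]:
    """Split an agent name back into its parts, rejecting names that
    do not follow the convention."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected <team>-<use_case>-<env>, got {name!r}")
    team, use_case, env = parts
    build_agent_name(team, use_case, env)  # reuse the validation rules
    return {"team": team, "use_case": use_case, "env": env}


print(build_agent_name("hr", "pto", "prod"))
print(build_alias_name("prod", 3))
```

Because each part excludes hyphens, a name can be parsed back unambiguously, which helps when building inventory reports across many agents.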
4. Operate
In the operation stage, to understand and troubleshoot AI agent behavior, keep a trace at every level of the agentic AI system, from the FM, vector database, and application to user feedback. Create a central dashboard in Amazon CloudWatch with performance and operational metrics (for example, latency, request counts, and errors) to monitor the health and performance of the entire system, and create alerts with Amazon CloudWatch alarms based on your business requirements. Store logs, such as Bedrock model invocation logs, in a central location like Amazon S3 or CloudWatch Logs so you can aggregate and query them easily. Create a prompt catalog, for example with prompt management in Bedrock, to manage your prompt templates. Set up Bedrock security roles and permissions using least-privilege access to limit the scope of agentic AI workflows, and implement responsible AI guardrails to detect and filter harmful content in prompt requests and AI agent responses.
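To make the metrics above concrete, the following sketch shows how latency, request counts, and errors might be summarized and compared against alert thresholds. It is a local illustration only: in practice the monitoring service computes these statistics and evaluates alarms for you. The `Invocation` record, the function names, and the threshold values are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """One agent request, as it might be reconstructed from logs (hypothetical shape)."""
    latency_ms: float
    succeeded: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(invocations: list[Invocation]) -> dict[str, float]:
    """Compute request count, error rate, and latency percentiles."""
    if not invocations:
        raise ValueError("invocations must not be empty")
    latencies = [inv.latency_ms for inv in invocations]
    errors = sum(1 for inv in invocations if not inv.succeeded)
    return {
        "request_count": len(invocations),
        "error_rate": errors / len(invocations),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }


def breached_alarms(summary: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics whose value exceeds its threshold."""
    return sorted(name for name, limit in thresholds.items() if summary[name] > limit)


# Example: ten requests, two of which failed.
sample = [Invocation(latency_ms=100.0 * i, succeeded=i not in (3, 7)) for i in range(1, 11)]
summary = summarize(sample)
print(summary)
print(breached_alarms(summary, {"error_rate": 0.05, "p95_latency_ms": 2000.0}))
```

Tracking percentiles rather than averages is a common choice for latency, because a small number of slow agent runs (for example, long tool-calling chains) can hide behind a healthy mean.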
5. Evolve
This is the continuous improvement and optimization stage. Once you have a simple AI agent in production and a well-architected operational framework in place, you can scale and optimize in several ways. Take on more complex use cases, like multi-agent collaboration. Use open source protocols, such as Model Context Protocol (MCP) and Agent2Agent protocol (A2A), to connect agents with tools and other agents. Evaluate new and more performant or optimized FMs using Bedrock Evaluations or an open source library like Ragas, or customize a model for your use case via fine-tuning or continued pre-training. Develop a robust ground truth dataset with prompts, session attributes, and responses so you can automate model evaluation for your agentic AI system. Consider optimization techniques such as model distillation, reducing the number of tokens, and/or reducing the vector size, while ensuring the overall system performance meets the business requirement. Refer to the blog post on effective cost optimization strategies for Amazon Bedrock for more details. The AI agents flywheel keeps spinning when the cost to build and deploy new AI agents is lower, AI agents are easier to monitor and troubleshoot across the organization, and return on investment (ROI) is higher.
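The ground truth idea above, pairing prompts and session attributes with expected responses, can be automated with a small harness. This is a minimal sketch under stated assumptions: `GroundTruthCase`, `evaluate`, and the stub agent are hypothetical, and exact matching after normalization is a deliberately crude scoring rule. A real evaluation would more likely use a dedicated evaluation service or library with semantic metrics.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GroundTruthCase:
    """One ground truth record: input prompt, session attributes, expected answer."""
    prompt: str
    expected_response: str
    session_attributes: dict[str, str] = field(default_factory=dict)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


def evaluate(
    agent: Callable[[str, dict[str, str]], str],
    cases: list[GroundTruthCase],
) -> dict[str, object]:
    """Run every case through `agent` and report the pass rate and failed prompts."""
    failed_prompts = []
    for case in cases:
        actual = agent(case.prompt, case.session_attributes)
        if normalize(actual) != normalize(case.expected_response):
            failed_prompts.append(case.prompt)
    passed = len(cases) - len(failed_prompts)
    return {
        "total": len(cases),
        "passed": passed,
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failed_prompts": failed_prompts,
    }


def stub_agent(prompt: str, session_attributes: dict[str, str]) -> str:
    """Hypothetical agent used only to exercise the harness."""
    if "pto balance" in prompt.lower():
        return f"You have {session_attributes.get('pto_days', '0')} days of PTO."
    return "I can't help with that."


cases = [
    GroundTruthCase("What is my PTO balance?", "You have 12 days of PTO.", {"pto_days": "12"}),
    GroundTruthCase("Book a flight to Paris", "Flight booked."),
]
print(evaluate(stub_agent, cases))
```

Running the same cases against each candidate model or prompt revision gives a consistent baseline for comparing changes before they reach production.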
Conclusion
This article explains the five stages of building and running AI agents on Amazon Bedrock, along with the operational considerations and best practices for each stage. Ready to get started with building AI agents on Bedrock? As a next step, refer to this tutorial to build a simple Bedrock agent. If you have other questions or recommendations on building or operating AI agents on AWS, feel free to leave a comment here or post on re:Post.
Happy building and operating AI agents!