The production agent stack for sensitive environments: a field guide for 2026
By Iain
What to actually deploy when mistakes carry consequences, and what to skip when they don’t.
The agent era has diversified its tooling. Eighteen months ago, most teams assembled whatever tools worked and hoped for the best. That approach was acceptable when agents only drafted emails and summarised meeting notes. It became problematic when those agents began executing trades, processing insurance claims, and triaging patient records. The market has responded by differentiating into genuine infrastructure categories, each with its own contenders, trade-offs, and failure modes.
This article outlines the current state across every layer of the production agent stack, identifies specific tools worth considering, and offers opinionated advice on when to build a modular stack versus when to delegate the entire problem to a hyperscaler platform. It also discusses operational and lifecycle concerns often overlooked in stack guides, but which sensitive environments cannot ignore — such as context management, error recovery, cost control, and human-in-the-loop patterns.
What counts as “sensitive”
A “sensitive environment” means any setting where the agent can do things, touches live data, and where mistakes carry weight. More specifically, an environment is considered sensitive when it meets one or more of the following five conditions:

- The agent executes irreversible actions such as sending money, modifying records, or publishing content.
- It processes untrusted content, including user input, third-party documents, or web pages.
- It accesses customer data or personally identifiable information.
- It operates across trust boundaries, moving between systems with different permission models.
- Failures would expose the organisation to regulatory, legal, or reputational risk.
The distinction between sensitive and non-sensitive is less about the technology and more about the blast radius. Think of it as the difference between cooking dinner for yourself and cooking dinner for a restaurant full of strangers with food allergies and lawyers.
The layers of the stack
A production agent setup for sensitive environments must address at least twelve concerns, though not every team will need a dedicated tool for each one.
- Orchestration defines how the agent reasons, plans, and sequences actions.
- Context management governs how the agent recalls what it has seen, done, and been told.
- Tool integration determines how agents discover, authenticate with, and call external services.
- Runtime and deployment cover where the agent code runs and how sessions are isolated from one another.
- Security and guardrails prevent the agent from doing things it should not.
- Observability enables you to see what the agent did and when.
- Evaluation tells you whether the agent’s actions were effective.
- Model drift detection catches the slow rot that degrades agent quality over weeks and months.
- Error handling and recovery determine what happens when something breaks mid-execution.
- Cost management prevents token spend from becoming a surprise line item.
- Human-in-the-loop patterns govern when and how a human intervenes before the agent acts.
- Identity and access management controls which systems the agent can access and with whose credentials.
The maturity of tooling varies significantly across these layers. Orchestration and observability have reliable open-source options. Identity management remains an unresolved issue. Context management, cost control, and human-in-the-loop patterns are areas where most teams still depend on custom engineering rather than off-the-shelf products.
Orchestration: LangGraph has won the production argument
Three frameworks merit serious evaluation, and one has pulled ahead for compliance-heavy workloads.
LangGraph reached version 1.1.0 in late 2025 and became the default agent framework within the LangChain environment. Its core abstraction is an explicit state machine rather than an implicit chain of prompts, which means you define nodes (the actions the agent performs), edges (the transitions between them), and state (the data that persists across steps). This is not a minor architectural preference. When a regulator asks you to explain why your agent took a particular action, the difference between “it followed this state graph” and “the LLM decided to do stuff” determines whether you pass or fail an audit. LangGraph supports parallel execution through fan-out and fan-in patterns, human-in-the-loop breakpoints, persistent memory, and durable execution with checkpointing. If your agent crashes midway through a ten-step workflow, it can resume from the last checkpoint rather than starting anew. Designing a LangGraph agent feels more like designing a factory floor than hiring a team of freelancers.
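The nodes-edges-state idea above can be sketched in a few lines. This is a deliberately minimal illustration of explicit state-machine orchestration with checkpointing, not the LangGraph API; the node functions and workflow are invented for the example.

```python
# Minimal state-machine sketch: nodes are named functions, edges map each
# node to its successor, and state is a dict that persists across steps.
# Checkpointing after every node is what lets a crashed run resume rather
# than restart. (Illustrative only -- not the LangGraph API.)

def fetch(state):
    state["record"] = {"id": 42}
    return state

def validate(state):
    state["valid"] = state["record"]["id"] > 0
    return state

def report(state):
    state["summary"] = f"record {state['record']['id']} valid={state['valid']}"
    return state

NODES = {"fetch": fetch, "validate": validate, "report": report}
EDGES = {"fetch": "validate", "validate": "report", "report": None}

def run(start="fetch", state=None, checkpoints=None):
    state = state or {}
    node = start
    while node is not None:
        state = NODES[node](state)
        if checkpoints is not None:
            # in production this would write to a durable store
            checkpoints.append((node, dict(state)))
        node = EDGES[node]
    return state

final = run(checkpoints=(cp := []))
```

The checkpoint list doubles as an audit trail: every node that ran, and the state after it, is recorded, which is exactly the "it followed this state graph" evidence a regulator would want to see.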
CrewAI adopts a different approach, organising agents around roles, goals, and tasks rather than explicit state graphs. Backed by Andrew Ng’s AI Fund, CrewAI’s role-based abstraction allows for faster prototyping of multi-agent workflows where agents collaborate as specialists. For production workloads requiring a lighter governance footprint, or for teams that find state machines excessive for their needs, CrewAI is a credible choice.
Strands Agents, the AWS entry, employs a model-first approach where the LLM itself drives tool selection and execution flow. It offers native MCP (Model Context Protocol) integration and integrates seamlessly into the Bedrock AgentCore platform. If your infrastructure is already committed to AWS, Strands is the path of least resistance.
The practical recommendation is LangGraph for any workload where compliance, auditability, or deterministic execution paths matter. CrewAI for teams requiring faster iteration cycles and comfortable with less explicit control flow. Strands for AWS-committed shops seeking the tightest possible integration with AgentCore.
Context management and memory
Every production agent guide spends pages on orchestration and security, then waves its hands about memory. This is a mistake. An agent that cannot manage its own context is an agent that forgets what the customer said three turns ago, repeats itself, hallucinates references to conversations that never happened, or silently drops critical instructions when the token budget runs dry.
Context management splits into two problems that sound similar but require different engineering. Short-term context is the conversation window, the rolling buffer of messages the model can see during a single interaction. Managing it means deciding when to summarise older messages to free up token budget, how aggressively to prune, and what summarisation strategy preserves the information the agent will actually need later in the session. Get this wrong, and the agent loses track of the user’s original request halfway through a multi-step task.
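A rolling-buffer strategy for the short-term window can be sketched as follows. The token counter and summariser here are crude stand-ins (word counts and naive truncation); a production system would use the model's tokenizer and an LLM call to compress evicted turns.

```python
# Sketch of short-term context management: keep recent messages under a
# token budget, folding evicted older turns into a summary message so the
# agent does not silently lose what the user said earlier in the session.

def count_tokens(text):
    return len(text.split())  # stand-in for a real tokenizer

def summarise(messages):
    # stand-in: a real system would call an LLM to compress these turns
    return "SUMMARY: " + " | ".join(m["content"][:20] for m in messages)

def prune_context(messages, budget):
    total = sum(count_tokens(m["content"]) for m in messages)
    dropped = []
    while total > budget and len(messages) > 1:
        oldest = messages.pop(0)          # evict the oldest turn first
        dropped.append(oldest)
        total -= count_tokens(oldest["content"])
    if dropped:
        # replace the evicted turns with one compressed summary message
        messages.insert(0, {"role": "system", "content": summarise(dropped)})
    return messages

history = [{"role": "user", "content": "word " * 50},
           {"role": "assistant", "content": "word " * 50},
           {"role": "user", "content": "latest question here"}]
pruned = prune_context(history, budget=60)
```

The key design decision is not the eviction order but what the summariser preserves: a summary that drops the user's original goal is as damaging as truncation.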
Long-term memory is everything that persists beyond a single session. This includes user preferences learned over time, facts extracted from previous interactions, episodic memories of past tasks and their outcomes, and semantic knowledge accumulated from documents the agent has processed. The engineering options here range from simple key-value stores to full vector databases with retrieval-augmented memory layers. MemGPT introduced a tiered memory architecture (inspired by operating system virtual memory) that moves information between a fast, small working context and a larger, slower archival store, paging facts in and out as needed. The pattern has influenced how several production teams think about agent memory, even if they do not use MemGPT directly.
The hardest subproblem is context window overflow. What happens when an agent executing a 15-step workflow accumulates so much intermediate state that it exceeds its context limit on step 12? Without a plan for this, the agent either crashes, silently truncates its history (losing critical context), or starts producing incoherent outputs. Production systems need a defined overflow strategy, whether that is aggressive mid-task summarisation, checkpointing to external storage, or splitting the workflow into stages with explicit context handoffs.
Cross-session persistence introduces its own difficulties. Agent memory must survive not just session boundaries but also restarts, failovers, and horizontal scaling events. If your orchestration layer spins up a new agent instance to handle load, that instance needs access to the same memory as the original. And in multi-tenant deployments, memory isolation is a security requirement, not a nice-to-have. One customer’s conversation history leaking into another customer’s session is the kind of incident that makes the news.
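One defensive pattern for the multi-tenant isolation problem is to namespace every memory read and write by tenant identifier, so cross-tenant access is structurally impossible rather than merely forbidden by convention. A minimal sketch, with an in-memory dict standing in for Redis or DynamoDB:

```python
# Sketch of multi-tenant memory isolation: every key is namespaced by
# tenant ID, so one customer's session state cannot be addressed from
# another customer's session, even by a buggy caller.

class TenantMemory:
    def __init__(self):
        self._store = {}  # stand-in for Redis / DynamoDB

    def _key(self, tenant_id, key):
        return f"{tenant_id}:{key}"   # namespace every key by tenant

    def put(self, tenant_id, key, value):
        self._store[self._key(tenant_id, key)] = value

    def get(self, tenant_id, key, default=None):
        return self._store.get(self._key(tenant_id, key), default)

mem = TenantMemory()
mem.put("acme", "last_order", "A-1001")
mem.put("globex", "last_order", "G-2002")
```

Because the tenant ID is part of the storage key itself, a new agent instance spun up during a scaling event sees the same memory as the original, and a request carrying the wrong tenant ID simply finds nothing.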
AWS AgentCore includes a Memory component for persistent context, but the options for framework-agnostic, vendor-neutral memory management remain thin. Most teams building outside a hyperscaler platform are writing custom memory layers, typically using Redis or DynamoDB for session state and a vector store such as Pinecone or pgvector for long-term retrieval.
Tool integration and MCP: the connective tissue
The Model Context Protocol (MCP) is mentioned in scattered places across agent framework documentation, often as a checkbox feature rather than an architectural decision. This undersells its importance. MCP is becoming the standard interface through which agents discover and call external tools, and the design choices around tool integration determine whether your agent system is a controlled, auditable platform or a spaghetti of bespoke API calls held together by optimism.
Tool integration in production requires solving at least four problems simultaneously. First, tool discovery and registration. An agent needs a way to know which tools exist, what they do, what inputs they expect, and when to use them. Some frameworks bake this into the prompt (listing the tools in the system message and letting the model pick), while MCP provides a more structured protocol for runtime discovery. The trade-off between MCP and native function calling comes down to standardisation versus performance. MCP gives you a vendor-neutral interface that works across models and frameworks. Native function calling (the tool-use APIs built into Claude, GPT-4, and Gemini) is faster and more tightly integrated, but it locks you into a specific model provider’s schema.
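The discovery-and-registration problem reduces to a registry: each tool declares a name, a description, and an input schema, so the agent (or an MCP-style discovery endpoint) can list what exists and validate arguments before dispatch. A minimal sketch with naive type checking; production systems would use JSON Schema validation and real tool backends.

```python
# Sketch of a tool registry: tools self-describe so the agent can discover
# them at runtime, and arguments are validated before the tool executes.

REGISTRY = {}

def register(name, description, params):
    def wrap(fn):
        REGISTRY[name] = {"fn": fn, "description": description, "params": params}
        return fn
    return wrap

@register("get_weather", "Current weather for a city", {"city": str})
def get_weather(city):
    return {"city": city, "temp_c": 18}   # stub response

def call_tool(name, **kwargs):
    tool = REGISTRY[name]
    for param, typ in tool["params"].items():   # validate before dispatch
        if not isinstance(kwargs.get(param), typ):
            raise TypeError(f"{name}: {param} must be {typ.__name__}")
    return tool["fn"](**kwargs)

available = sorted(REGISTRY)            # discovery: what tools exist
result = call_tool("get_weather", city="Oslo")
```

The validation step matters more than it looks: rejecting a malformed tool call at the boundary is far cheaper than letting it reach a downstream API and unwinding the consequences.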
Second, tool authentication and sandboxing. Every tool an agent calls is a potential vector for privilege escalation. If your agent has an MCP tool that can query a database, and that tool’s credentials have write access, then a prompt injection attack can turn a read-only agent into a data modification engine. Sandboxing tools with minimum-privilege permissions and rotating credentials so that a compromised token has a short half-life are not optional in sensitive environments.
Third, tool versioning. What happens when a tool’s underlying API changes while an agent is in the middle of a workflow? The agent calls version 2 of an endpoint expecting version 1’s response schema, and the whole execution path breaks. Pinning tool versions and testing version upgrades against evaluation suites before rolling them out to production agents is the same discipline as dependency management in software engineering, and just as often neglected.
Fourth, error handling at the tool boundary. Tool calls fail. APIs time out. Rate limits hit. A production agent needs retry logic with backoff, circuit breakers that prevent hammering a dead endpoint, and, ideally, fallback tools (if the primary weather API is down, try the backup) that maintain service quality without human intervention.
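The retry-plus-fallback pattern at the tool boundary can be sketched directly. The weather tools here are invented stubs, and the backoff delay is kept at zero so the example runs instantly; real deployments would use a non-zero base delay and jitter.

```python
# Sketch of tool-boundary resilience: retry a flaky primary tool with
# exponential backoff, then fall back to a secondary tool if the primary
# stays down, so service quality degrades gracefully.

import time

def with_retries(primary, fallback, attempts=3, base_delay=0.0):
    for attempt in range(attempts):
        try:
            return primary()
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))   # exponential backoff
    return fallback()                                  # fallback tool

calls = {"n": 0}

def flaky_weather():
    calls["n"] += 1
    raise ConnectionError("primary weather API down")

def backup_weather():
    return {"temp_c": 17, "source": "backup"}

result = with_retries(flaky_weather, backup_weather)
```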
Runtime and deployment: session isolation is non-negotiable
Session isolation means that one agent’s execution cannot observe, modify, or interfere with another agent’s execution. This is essential for any multi-tenant system, and it becomes a regulatory requirement when agents process customer data.
AgentCore Runtime represents the current high standard for isolation. Each agent session runs within its own Firecracker microVM, the same virtualisation technology that powers AWS Lambda. This provides hardware-level isolation instead of process-level isolation, which is important when your threat model includes a compromised agent attempting to escape its sandbox. AgentCore Runtime supports VPC connectivity and PrivateLink for agents that need to call internal services without traversing the public internet, and it scales dynamically according to demand.
Kubernetes with containerised agents is the most common deployment pattern for teams building outside hyperscaler platforms. Docker containers offer process-level isolation, and Kubernetes network policies can restrict which services each agent container can access. The isolation guarantees are weaker than those of microVMs (a container-escape vulnerability in the kernel exposes all co-located containers), but the operational tooling is mature and most platform engineering teams already know how to run Kubernetes.
gVisor sits between these two options. It is an application kernel that intercepts system calls before they reach the host kernel, providing a stronger isolation boundary than a standard container without the overhead of a full virtual machine. For teams that desire better-than-container isolation but cannot justify the overhead or cost of microVMs, gVisor is the pragmatic middle ground.
Security and guardrails: defence in depth, not a single checkpoint
Johann Rehberger’s writing on the normalisation of deviance in AI systems should be required reading for anyone deploying agents in production. The core argument is that teams gradually accept degraded safety margins because nothing bad has happened yet, until something bad does. Security for production agents is not a single layer you bolt on. It is a stack of overlapping defences, each catching what the others miss.
The defences are split into three categories that work together but serve different purposes.
Input and output filtering
Lakera Guard (now part of Check Point following a September 2025 acquisition) operates as an inline API firewall, inspecting prompts and responses for injection attacks, jailbreaks, and policy violations before they reach or leave the model. Their Q4 2025 threat report documented a notable increase in attacks targeting agentic workflows specifically. Think of Lakera as a front door lock on a house where the windows do not close. It is necessary but nowhere near sufficient. I have written about the insidiousness of prompt injection and why it remains the hardest unsolved problem in agent security.
NeMo Guardrails takes a different approach, using a domain-specific language called Colang to define conversational policies as explicit rules rather than model-based classifiers. The open-source core is free. Production deployment under the NVIDIA AI Enterprise licence adds support guarantees and enterprise features. Where Lakera is a firewall, NeMo Guardrails is a policy engine that can enforce complex conversational constraints such as “the agent must never discuss competitor products” or “the agent must escalate to a human if the customer mentions legal action.”
Azure AI Content Safety has expanded its scope to cover tool calls and tool responses, in addition to prompts and completions. This matters because an agent’s most dangerous actions happen when it calls tools, not when it generates text. AWS Bedrock Guardrails provides denied topic blocking, content filtering, and native PII redaction, though its granularity is coarser than that of the dedicated security tools.
Runtime policy enforcement: deterministic, not probabilistic
The key distinction here is between probabilistic enforcement—where a classifier predicts whether an action is safe, with some false-positive and false-negative rates—and deterministic enforcement, which is a policy engine that evaluates strict rules and either permits or blocks an action with no ambiguity. Both approaches have their place, but deterministic policies are essential for actions with regulatory consequences.
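What deterministic enforcement looks like in practice: each rule either permits or denies an action outright, with an explicit default-deny and a log-only mode for safe rollout. This is an illustrative sketch in Python; Cedar and other real policy engines use a dedicated policy language rather than callables, and the rules here are invented.

```python
# Sketch of a deterministic policy engine: strict rules, no classifier,
# default deny, and a log-only mode for tuning policies without blocking.

RULES = [
    # (name, predicate over the requested action, decision)
    ("block_payments_over_limit",
     lambda a: a["type"] == "payment" and a["amount"] > 1000, "deny"),
    ("allow_reads",
     lambda a: a["type"] == "read", "allow"),
]

def evaluate(action, log_only=False, audit=None):
    for name, predicate, decision in RULES:
        if predicate(action):
            if audit is not None:
                audit.append((name, decision, action))
            if log_only:
                return "allow"   # record what would happen, block nothing
            return decision
    return "deny"                # deterministic default: deny by default

audit = []
read_decision = evaluate({"type": "read", "table": "orders"})
pay_decision = evaluate({"type": "payment", "amount": 5000}, audit=audit)
dry_run = evaluate({"type": "payment", "amount": 5000}, log_only=True)
```

The same input always produces the same decision, which is precisely the property probabilistic classifiers cannot offer and regulators tend to ask about.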
AgentCore Policy utilises Cedar, a policy language originally developed for Amazon Verified Permissions. Cedar policies are evaluated at the AgentCore Gateway before requests reach the agent runtime, supporting a log-only testing mode that allows you to see what would be blocked without actually blocking it. This feature is particularly useful during the transition period when tuning policies and avoiding disruption to production workflows.
Microsoft Foundry AI Gateway builds on Azure API Management to offer rate limiting, backend routing, and policy enforcement. The Foundry Citadel Platform introduces a four-layer governance model covering prompt safety, tool call authorisation, response filtering, and audit logging.
For modular stacks that do not leverage either hyperscaler platform, the common approach is to implement a custom gateway proxy positioned between your orchestration layer and the LLM provider. This proxy inspects requests and responses, enforces policies, logs all activity, and routes traffic to different model backends according to your rules. While this involves more engineering effort than using a managed gateway, it prevents vendor lock-in and provides you with full control over the enforcement logic.
Observability: you cannot defend what you cannot see
I have previously argued that observability should be deployed as the first layer, not the last. The LangChain 2025 State of Agent Engineering report found that 89% of organisations already have some form of agent observability, but only 62% have detailed tracing at the individual step and tool-call levels. The remaining 27% are flying on basic instruments in foggy conditions.
Langfuse is the most robust open-source option. It is licensed under MIT, uses OpenTelemetry for trace ingestion, and maintains feature parity between its self-hosted and cloud versions. For regulated industries where data sovereignty is a concern, the ability to run the entire observability stack in your own infrastructure without feature degradation is a significant advantage.
Arize Phoenix combines tracing with drift detection, clustering, and anomaly detection. It features LLM-as-a-judge evaluation integrated into the observability process, allowing you to score agent responses in real time rather than in a separate batch process. Arize AX, the managed version, offers certifications for SOC 2, HIPAA, and ISO. Arize raised a $70 million Series C in February 2025, a sign of market confidence that observability combined with evaluation is an emerging category worth investing in.
Braintrust approaches this differently by treating observability and evaluation as a unified workflow. Traces in Braintrust can be converted into test cases with a single click, so a production failure automatically becomes a regression test. This closes the loop between an issue occurring and having a test that ensures it doesn’t happen again, more quickly than any other tool I have evaluated.
LangSmith offers near-zero-configuration tracing for LangChain and LangGraph applications, adding support for OpenTelemetry in March 2025, which broadens its compatibility to non-LangChain frameworks. OpenLIT and Traceloop target teams that already have observability infrastructure (like Datadog, Prometheus, Grafana) and want to integrate agent tracing into their existing dashboards rather than adopting yet another tool.
The practical advice for choosing between these tools follows a prioritised order. If you require self-hosted observability with full data control, start with Langfuse. If you prefer an all-in-one solution with observability and evaluation in a managed service, consider Braintrust. If drift detection and clustering are of high importance, explore Arize Phoenix. If you are already running LangChain or LangGraph, LangSmith is the fastest path to basic tracing. And if you have an existing Datadog or Grafana stack, OpenLIT or Traceloop will meet you where you are.
One caveat worth stressing is that observability tells you what happened, but not why. Knowing that an agent produced an incorrect output at step seven of a twelve-step workflow is useful. Understanding why it produced that output, whether the model hallucinated, the retrieved context was stale, the tool returned unexpected data, or the prompt was ambiguous, requires evaluation, not just tracing.
Evaluation: the layer the most failures trace back to
Observability answers “what did the agent do?” Evaluation answers “was it any good?” These are different questions, and confusing them is a common and costly mistake.
Evaluation in production occurs in three forms. Offline experiments run test cases against the agent before deployment, scoring results against expected outputs or quality criteria. Online scoring assesses responses in real time during production traffic, flagging those that fall below a quality threshold. Regression gates in CI/CD pipelines prevent deployment if evaluation scores drop below a set minimum.
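The regression-gate form is the easiest to make concrete. The sketch below runs a toy evaluation suite, compares the aggregate score against a floor, and returns a pass/fail result; in CI, a failing result is what blocks the merge. The agent, cases, and scorer are all invented for illustration.

```python
# Sketch of a CI regression gate: score the agent against a fixed test
# suite and fail the build if the aggregate score drops below a floor.

def run_eval_suite(agent, cases):
    scores = [1.0 if agent(c["input"]) == c["expected"] else 0.0
              for c in cases]
    return sum(scores) / len(scores)

def regression_gate(agent, cases, floor=0.9):
    score = run_eval_suite(agent, cases)
    return {"score": score, "floor": floor, "passed": score >= floor}

cases = [{"input": "2+2", "expected": "4"},
         {"input": "3+3", "expected": "6"},
         {"input": "5+5", "expected": "10"},
         {"input": "7+7", "expected": "14"}]

def toy_agent(prompt):                # stand-in for a real agent call
    a, b = prompt.split("+")
    return str(int(a) + int(b))

result = regression_gate(toy_agent, cases)
```

In a real pipeline the scorer would be an LLM-as-a-judge or task-specific metric rather than exact match, but the gate logic is the same: quality below the floor means the change does not ship.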
Braintrust has built its entire product around eval-driven development. Their Loop feature allows you to run evaluations on every code change, compare results across versions, and set up GitHub Actions to block merges that reduce quality. The automated failure-to-test-case conversion I mentioned in the observability section means your evaluation suite grows naturally from production incidents, creating a healthy feedback loop.
Arize AX offers session-level evaluation, assessing entire agent interactions rather than individual responses, and includes RAG retrieval diagnosis that helps determine whether a poor answer resulted from bad retrieval, a bad prompt, or a bad model. Langfuse includes LLM-as-a-Judge evaluators that score responses using a second model, which is useful for subjective quality aspects like helpfulness and tone.
The practical recommendation is to choose Braintrust if evaluation is your main concern and you want the closest integration between evaluation and development workflow. If you prefer to consolidate observability and evaluation into a single platform, Langfuse or Arize AX provide both with less operational overhead than managing separate tools.
Model drift: the slow rot that sets in while nobody is watching
I have written about drift in production AI systems, and it remains one of the least appreciated risks in the agent stack. A 2025 LLMOps report found that unmonitored models showed a 35% increase in error rates after six months. Insurance carriers have reported accuracy drops after just nine months of unmonitored operation. The most notable example comes from Stanford and UC Berkeley, whose researchers tested GPT-4’s ability to identify prime numbers and observed performance falling from 97.6% to 2.4% between March and June 2023. Code execution on the same model declined from 52% to 10% over that period.
Distinguishing between data drift and model drift is crucial for diagnosis. Data drift occurs when the inputs your agent receives in production have shifted from the training distribution. Model drift happens when the model’s behaviour changes despite unchanged inputs, often because providers push silent updates, retrain on new data, or deprecate model versions.
Arize Phoenix employs embedding-based clustering to identify when production inputs start to diverge from your evaluation dataset. Evidently and WhyLabs offer statistical drift tests that can run continuously and alert you when distributions shift beyond predefined thresholds.
The SAFi architecture (Session-Aware Fidelity) takes a different approach by monitoring drift within individual agent sessions rather than across aggregate distributions. It uses exponential moving averages to detect when an agent’s response quality degrades during a conversation and can inject corrective “coaching notes” into the agent’s context to steer it back on track. This is a newer pattern and not yet widely adopted, but the idea of within-session drift correction is compelling for long-running agent interactions where quality degradation compounds over dozens of turns.
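The within-session mechanism reduces to a small amount of arithmetic. The sketch below tracks an exponential moving average of per-turn quality scores and flags the session once the average falls below a threshold, which is the point at which a corrective note could be injected. The scores, smoothing factor, and threshold are illustrative, not taken from SAFi.

```python
# Sketch of within-session drift detection: an exponential moving average
# of per-turn quality scores, flagging the session when quality degrades.

def ema_monitor(scores, alpha=0.3, threshold=0.6):
    ema = scores[0]
    flags = []
    for s in scores[1:]:
        ema = alpha * s + (1 - alpha) * ema   # exponential moving average
        flags.append(ema < threshold)          # True = intervene this turn
    return ema, flags

# quality holds early in the conversation, then degrades over later turns
turn_scores = [0.9, 0.9, 0.8, 0.5, 0.4, 0.3]
final_ema, flags = ema_monitor(turn_scores)
```

The EMA's smoothing is what makes this robust to a single bad turn: one low score nudges the average down, but only a sustained decline trips the flag.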
The practical defence is a layered approach. Run continuous evaluation against a held-out test suite to catch model drift early. Deploy statistical drift detection on production inputs to catch data drift. Pin model versions so you control when model changes affect your agents rather than discovering them in production logs. And budget for ongoing maintenance, because keeping an agent working well costs nearly as much per year as building it in the first place.
Error handling and recovery: what happens when things break
Every agent stack guide includes guardrails to prevent poor outputs. However, far fewer cover what occurs when the system itself fails, and in production, systems fail constantly. Network requests time out. Model endpoints go down. APIs return unexpected schemas. Third-party services hit rate limits. An agent that cannot handle these failures gracefully will eventually produce a catastrophic result at three in the morning when nobody is watching.
Graceful degradation involves deciding, in advance, what the agent should do when a dependency is unavailable. If the primary model endpoint is down, does the agent queue the request and retry? Switch to a fallback model? Return a structured error to the user explaining that the service is temporarily unavailable? The worst possible outcome is “the agent hangs indefinitely and then returns a garbled partial response,” which is exactly what happens when one does not design for failure.
Retry logic and circuit breakers are essential. Exponential backoff prevents a flood of retries from overwhelming a recovering service. Circuit breakers stop the agent from repeatedly calling a failing endpoint, giving the downstream service time to recover before trying again. Dead letter queues capture failed agent runs, allowing them to be investigated and replayed later rather than remaining silent.
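The circuit-breaker part of that stack can be sketched in a few lines: after a configured number of consecutive failures the breaker opens and calls are rejected immediately, giving the downstream service time to recover; a successful call closes it again. The endpoint here is an invented stub.

```python
# Sketch of a circuit breaker: consecutive failures open the circuit and
# subsequent calls fail fast instead of hammering a dead endpoint.

class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: endpoint quarantined")
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            raise
        self.failures = 0            # a success closes the breaker again
        return result

breaker = CircuitBreaker(max_failures=2)

def dead_endpoint():
    raise ConnectionError("endpoint down")

for _ in range(2):                   # two failures trip the breaker
    try:
        breaker.call(dead_endpoint)
    except ConnectionError:
        pass
```

Production implementations usually add a half-open state that probes the endpoint after a cooldown, but the fail-fast behaviour above is the core of the pattern.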
Fallback model routing is a pattern that more teams should adopt. If your agent uses Claude as its primary model and Claude’s API experiences elevated latency, the system can route requests to GPT-4o or a smaller, faster model for simple subtasks while queuing complex reasoning tasks until the primary model recovers. This requires that prompts and tool schemas work across multiple model providers, which supports MCP-style standardisation over provider-specific function calling.
Partial failure handling is the most challenging problem. An agent completes seven of ten steps in a workflow, then fails on step eight. What happens? Without checkpointing, the entire workflow restarts from the beginning, re-executing seven steps that already succeeded and potentially producing duplicate effects (sending the same email twice, creating duplicate records, processing the same payment again). With checkpointing, the agent resumes from step eight, using the saved state from previous steps. LangGraph’s durable execution model supports this natively. Most other frameworks require custom implementation.
Poison message detection completes the error handling stack. Some inputs consistently crash agents, either because they trigger edge cases in the model’s reasoning, exploit parsing bugs in tool schemas, or exceed context limits, leading to corrupted state. Identifying these inputs, quarantining them, and routing them for human review, rather than allowing repeated crashes, is the agent-world equivalent of a dead-letter queue.
Cost management and optimisation
Agent costs can escalate rapidly, and the teams most surprised by the bill are often those processing the highest-value workloads. A healthcare agent that reads a 50-page medical record, reasons through a diagnosis, calls four specialist tools, and generates a structured report can burn through several pounds of token spend in a single interaction. Multiply that by thousands of daily interactions, and the annual cost begins to rival the salaries of the humans the agent was meant to support.
Token usage tracking with per-agent, per-user, and per-session budgets is the minimum viable approach to cost control. If a single-agent run exceeds its budget, the system should halt the run and route it for human review rather than allowing it to continue accruing charges. This is not purely a cost concern. A runaway agent that keeps calling tools in a loop is often a sign that something has gone wrong with the reasoning, and cutting it off early prevents both financial waste and garbage outputs.
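The budget guard is simple enough to show in full: accumulate token spend across steps and halt the run once the budget is exceeded, rather than letting a looping agent keep accruing charges. The per-step costs and limit below are illustrative.

```python
# Sketch of per-run budget enforcement: charge each step's token usage
# against the run budget and halt for human review when it is exceeded.

class BudgetExceeded(Exception):
    pass

class RunBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens):
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"run used {self.used} tokens (budget {self.max_tokens}); "
                "halting for human review")

budget = RunBudget(max_tokens=10_000)
halted = False
for step_tokens in [3_000, 4_000, 5_000]:   # a run that overruns on step 3
    try:
        budget.charge(step_tokens)
    except BudgetExceeded:
        halted = True
        break
```

The same hook is a cheap anomaly detector: a run that blows through its budget is very often a run whose reasoning has gone wrong.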
Model routing based on task difficulty is the most effective cost optimisation available today. Not every subtask within an agent workflow requires the most capable (and most expensive) model. Extracting structured data from a well-formatted document is work that a smaller, cheaper model handles well. Complex multi-step reasoning over ambiguous inputs justifies the cost of a frontier model. Routing each subtask to the cheapest model that can handle it reliably, and using evaluation scores to verify that the cheaper model performs well enough, can cut token expenditure by 40-60% without measurable quality degradation for many workloads.
Semantic caching of common queries avoids redundant model calls altogether. If 200 users ask the same question about your return policy within an hour, calling the model 200 times is wasteful. A semantic cache that recognises similar (not just identical) queries and serves cached responses where appropriate can dramatically reduce both cost and latency. The engineering challenge is cache invalidation: ensuring the cache does not serve stale answers when the underlying information changes.
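A semantic cache can be sketched as a similarity lookup in front of the model call. The sketch below uses difflib string similarity as a stand-in for the embedding cosine similarity a production cache would use, and the threshold is illustrative.

```python
# Sketch of a semantic cache: before calling the model, look for a
# previously answered query that is sufficiently similar and serve its
# cached response. difflib stands in for embedding similarity.

from difflib import SequenceMatcher

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []            # (query, response) pairs

    def lookup(self, query):
        for cached_query, response in self.entries:
            sim = SequenceMatcher(None, query.lower(),
                                  cached_query.lower()).ratio()
            if sim >= self.threshold:
                return response      # near-duplicate: skip the model call
        return None

    def store(self, query, response):
        self.entries.append((query, response))

cache = SemanticCache()
cache.store("what is your return policy?",
            "Returns accepted within 30 days.")

hit = cache.lookup("What is your return policy")
miss = cache.lookup("how do I reset my password?")
```

The invalidation problem shows up here as a missing feature: real caches attach a TTL or a source-document version to each entry so a policy change evicts the stale answer.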
Cost observability involves linking token expenditure to business outcomes. Knowing that your agent spent $12,000 on tokens last month is less useful than knowing that $8,000 of that spend went to the customer onboarding agent, which successfully completed 95% of onboarding flows and reduced human processing time by 60%. The first figure is a cost; the second is an ROI calculation.
Human-in-the-loop patterns: when and how humans intervene
The phrase “human-in-the-loop” appears in nearly every enterprise AI presentation, usually as a single bullet point that gets a knowing nod and no further elaboration. In practice, designing effective human intervention for production agents is one of the hardest engineering and UX problems in the stack.
Approval workflows define which agent actions require human sign-off before execution. The challenge is calibrating the approval surface. Require approval for too many actions, and the agent becomes a glorified suggestion engine that creates more work than it saves. Require approval for too few, and you have an autonomous system taking consequential actions without oversight. The right calibration depends on the blast radius of each action type. An agent that drafts an email can operate autonomously. An agent that sends the email to 10,000 customers needs a human to press the button.
Escalation triggers determine when an agent routes a task to a human, even when it could technically continue on its own. Confidence thresholds are the most common trigger, where the agent escalates when its uncertainty about the correct action exceeds a configured limit. Policy violations (the agent detects that the user’s request would require violating a business rule) and domain-specific rules (a financial agent must escalate any transaction above a certain value) are the other standard triggers. The difficult design question is how the agent communicates uncertainty. A blunt “I’m not sure, sending to a human” erodes user trust. A well-structured escalation that explains what the agent has done so far, what it is uncertain about, and what the human needs to decide is far more useful but harder to implement.
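The trigger logic above can be sketched as a routing function that checks each condition and, when escalating, returns a structured payload recording what the agent did and why it stopped, rather than a bare "not sure". The thresholds and action fields are illustrative.

```python
# Sketch of escalation triggers: route to a human on low confidence or a
# domain rule (here, a transaction value limit), with a structured payload
# that gives the reviewer the context they need to decide.

def decide(action, confidence, steps_so_far,
           confidence_floor=0.75, amount_limit=10_000):
    reasons = []
    if confidence < confidence_floor:
        reasons.append(f"confidence {confidence:.2f} below {confidence_floor}")
    if action.get("amount", 0) > amount_limit:
        reasons.append(f"amount {action['amount']} exceeds {amount_limit}")
    if not reasons:
        return {"route": "auto", "action": action}
    return {"route": "human",
            "action": action,
            "completed_steps": steps_so_far,   # context for the reviewer
            "reasons": reasons}

ok = decide({"type": "refund", "amount": 50}, confidence=0.92,
            steps_so_far=["verified order", "checked policy"])
escalated = decide({"type": "refund", "amount": 25_000}, confidence=0.6,
                   steps_so_far=["verified order"])
```

Because the escalation payload carries the completed steps and the specific reasons, the human reviewer resumes from where the agent stopped instead of reconstructing the whole task.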
Review queues and SLAs are crucial at scale. If your agent escalates 500 tasks daily and three humans review them, you need a queue management system with prioritisation, assignment, and time-based SLAs. Otherwise, high-priority escalations get delayed behind lower-priority ones, diminishing the value of having an agent.

Feedback loops close the circle: when a human overrides an agent’s decision, that override becomes training data. Recording what the agent proposed, what the human did instead, and why (even a brief note) creates a dataset to improve the agent through fine-tuning, prompt updates, or expanded evaluation. Without this, the agent will keep making the same mistakes.

Audit trails for overrides are essential in regulated industries. When an auditor asks why a decision was made, you must show whether the agent acted independently, was approved by a human, or was overruled, along with the reasons.

Identity and access management: the unresolved challenge

The question of “who is the agent, and what is it permitted to do?” seems straightforward until implementation begins. An agent booking meetings needs access to the user’s calendar; one querying a database needs credentials; filing regulatory reports requires system permissions. Each pattern has specific security, credential, and audit implications.

AgentCore Identity offers OAuth-based authentication with automatic token refresh and scope limitations, supporting Okta, Entra, and Cognito as identity providers. This covers the “agent authenticates to a service” flow, but it is AWS-specific.
Microsoft’s Entra Agent ID adopts a Zero Trust architecture, treating agent identities as first-class security principals with unified discovery and governance. If your organisation already uses Entra for human identity management, extending it to agent identities provides a consistent security model.
Nango abstracts away the authentication layer entirely, handling credential vaulting, token lifecycle management, and the provider-specific quirks that make OAuth integrations painful. For modular stacks that connect to many different services, Nango reduces the per-integration engineering burden from days to hours.
For teams building outside these platforms, the standard approach is to use custom OAuth per service, with credentials stored in HashiCorp Vault or a similar secrets manager. This works but requires ongoing maintenance as APIs change their authentication requirements, tokens expire, and new services are added to the agent’s toolkit.
The lack of a standardised, vendor-neutral protocol for agent identity is one of the most conspicuous gaps in the current stack. We have MCP for tool integration and OpenTelemetry for observability, but nothing equivalent for identity. This means every deployment requires bespoke identity plumbing, which increases cost, slows adoption, and expands the security surface.
When a hyperscaler platform makes sense
The modular stack provides control, prevents vendor lock-in, and allows you to swap individual layers as better tools become available. However, it also demands considerable engineering effort to connect, operate, and maintain. The hyperscaler platforms (AWS and Microsoft, with Google trailing) offer a pre-wired alternative where the layers are designed to function together from the outset, albeit at the expense of flexibility and portability.
AWS Bedrock AgentCore
AgentCore achieved general availability in October 2025, and its SDK has been downloaded over two million times in five months. The platform encompasses Runtime (serverless Firecracker microVMs), Gateway (with MCP support and Cedar policy enforcement), Policy, Memory (persistent context management), Identity (OAuth with multi-provider support), Observability (CloudWatch plus OpenTelemetry export), and Evaluations (currently in preview). It is framework-agnostic, with official support for CrewAI, LangGraph, LlamaIndex, Strands, and custom Python agents.
The case studies are persuasive. Robinhood scaled from 500 million to five billion tokens daily while reducing costs by 80%. Ericsson processes millions of lines of telecom code. The PGA TOUR increased content production speed by 1,000% and cut costs by 95%. A cohort of Latin American banks has adopted AgentCore for regulatory compliance workflows. These figures are vendor-reported and should be approached with some scepticism, but the pattern of large-scale adoption by regulated industries is real.
Microsoft Foundry
Foundry (formerly Azure AI Foundry) boasts the most robust enterprise integration story, which is expected given Microsoft’s dominance in enterprise software. Entra ID for identity, Defender for threat detection, M365 for productivity data, and Purview for data governance all connect natively. The Control Plane oversees agent deployments, and the four-point guardrail model covers tool calls, tool responses, prompts, and completions.
The AI Gateway provides rate limiting, policy enforcement, and backend routing. Foundry remains the only cloud platform offering both OpenAI and Anthropic models natively, with a model router that can direct requests to different models depending on the task. On the framework side, Semantic Kernel and AutoGen are open-source options, while Copilot Studio offers a managed, low-code experience.
Which platform, then?
AWS for teams that want the best session isolation, framework-agnostic deployment, and the flexibility to bring their own orchestration framework. Microsoft for organisations already invested in the Entra, Defender, M365, and Purview stack, where the integration benefits outweigh the platform-specific constraints. Google Vertex AI’s agent platform is less integrated than either AWS or Microsoft at the time of writing, though this is a fast-moving space.
I have written about the market bifurcation between hyperscaler platforms and independent middleware. The prediction holds. The hyperscalers will capture the “just make it work” segment, while the middleware layer (independent companies building specific layers of the stack) is where the most interesting engineering and differentiated products will be built.
When you do not need any of this
The full production stack described above is excessive for most agent use cases, and deploying it unnecessarily wastes engineering time and money.
If you’re creating internal tools for your own staff, Claude via claude.ai with Projects usually suffices. Projects provide persistent context, custom instructions, and file uploads within an interface your team can operate without any infrastructure work.
For internal workflows requiring more structure, such as multi-step processes with clear handoffs, CrewAI Studio and Relevance AI offer visual builders allowing non-engineers to configure agent workflows. Helicone adds monitoring and cost tracking with approximately 15 minutes of setup time.
A straightforward rule of thumb is: if the worst-case outcome is that someone on your team has to redo some work, you do not need production-grade guardrails. Reserve engineering resources for scenarios where the worst-case involves customers, regulators, or media coverage.
Further operational considerations
The layers discussed above warrant dedicated tools and architectural focus. However, a production agent stack in a sensitive environment presents a second tier of operational issues that, although less likely to require standalone products, can cause problems if ignored.
Multi-tenancy extends beyond runtime session isolation. Tenant-specific data access ensures that Agent A operating for Customer X cannot access, reason about, or inadvertently display Customer Y’s data. Per-tenant model configuration allows different clients to specify models, guardrails, or policies aligned with their risk tolerances. Resource quotas per tenant prevent noisy neighbour problems, where heavy agent usage by one customer hampers performance for others. Furthermore, tenant-specific compliance requirements (such as EU data residency or HIPAA controls) demand a flexible multi-tenancy model capable of enforcing varied rules for different tenants.
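A per-tenant policy object is one way to make several of these requirements concrete at once: model choice, data residency, and quota enforcement all resolve from the same record. This is a sketch with invented tenant names and an in-memory store; a real system would back this with a configuration database and atomic counters.

```python
from dataclasses import dataclass

# Illustrative only: tenant policies would normally live in a config store,
# and usage counters in a database with atomic increments.
@dataclass
class TenantPolicy:
    model: str                # per-tenant model choice
    data_region: str          # e.g. "eu-west-1" for EU residency requirements
    daily_token_quota: int    # prevents noisy-neighbour resource exhaustion
    tokens_used_today: int = 0

POLICIES = {
    "customer_x": TenantPolicy(model="small-fast", data_region="eu-west-1",
                               daily_token_quota=1_000_000),
    "customer_y": TenantPolicy(model="large-reasoning", data_region="us-east-1",
                               daily_token_quota=5_000_000),
}

def check_request(tenant_id: str, estimated_tokens: int) -> TenantPolicy:
    """Resolve the tenant's policy and enforce its quota before running the agent."""
    policy = POLICIES[tenant_id]  # unknown tenant -> hard failure, never a default
    if policy.tokens_used_today + estimated_tokens > policy.daily_token_quota:
        raise RuntimeError(f"{tenant_id}: daily token quota exceeded")
    policy.tokens_used_today += estimated_tokens
    return policy
```

Note the deliberate hard failure for unknown tenants: falling back to a default policy is exactly the kind of silent cross-tenant leak the isolation requirement is meant to prevent.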
Data and RAG architecture underpin most production agents, even though it is rarely treated as part of the agent stack. Ingestion process design, chunking and embedding strategies, vector store selection (Pinecone, Weaviate, pgvector, and Qdrant each have different trade-offs around scale, cost, and managed hosting), retrieval quality evaluation, index freshness, and hybrid search combining vector similarity with keyword matching all directly affect agent output quality. A badly chunked knowledge base produces poor retrievals, which yield poor agent answers, and no amount of prompt engineering or guardrail fixes addresses that root cause.
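To make the chunking point concrete, here is a deliberately naive overlapping-window splitter. It counts whitespace-separated words rather than real tokens, which is an assumption for brevity; a production pipeline would count tokens with the embedding model's own tokenizer and would usually respect sentence or section boundaries rather than fixed windows.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-window chunks.

    Naive whitespace "tokens" for illustration only. The overlap means
    a fact straddling a window boundary still appears whole in at least
    one chunk, which is what fixed non-overlapping windows get wrong.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

The overlap parameter is the lever that trades index size against retrieval recall: bigger overlap means more redundant storage but fewer facts severed at chunk boundaries.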
Prompt management and versioning are the dark matter of the agent stack. In production, prompts are code and should be treated with the same discipline. Prompt registries provide centralised storage and versioning. A/B testing different prompt versions and measuring their impact on evaluation scores lets you iterate on prompt quality with data rather than intuition. Prompt rollback (reverting to a previous version when a new prompt degrades quality) and environment-specific prompts (different system prompts for staging and production) are the kind of operational hygiene that separates reliable agent systems from fragile ones.
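The registry, rollback, and environment-scoping ideas fit in a few dozen lines. This is an in-memory sketch with invented method names; a real registry would persist versions to a database and record who changed what and when.

```python
# Minimal in-memory prompt registry; illustrative names throughout.
class PromptRegistry:
    def __init__(self):
        self._versions: dict[str, list[str]] = {}  # name -> full version history
        self._active: dict[str, int] = {}          # name -> active version index

    def register(self, name: str, text: str) -> int:
        """Store a new version and activate it; returns the version number."""
        self._versions.setdefault(name, []).append(text)
        version = len(self._versions[name]) - 1
        self._active[name] = version
        return version

    def get(self, name: str) -> str:
        """Fetch the currently active version of a prompt."""
        return self._versions[name][self._active[name]]

    def rollback(self, name: str, version: int) -> None:
        """Revert to an earlier version when a new prompt degrades quality."""
        if version >= len(self._versions[name]):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self._active[name] = version
```

Environment-specific prompts fall out of the same structure by namespacing the key (e.g. `"staging/support"` vs `"prod/support"`), and A/B testing is a matter of serving two active versions and comparing their evaluation scores.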
CI/CD and deployment processes for agents do not map cleanly onto traditional software deployment because an agent is not a single artefact. It is a composition of prompts, tool configurations, model versions, guardrail policies, and orchestration logic, all of which can change independently. Versioning this composite artefact, running blue-green or canary deployments to roll out changes safely, setting up automated evaluation gates that block deployment if quality scores drop, and having tested rollback procedures are all necessary. Infrastructure-as-code (Terraform, CDK, Pulumi) for agent infrastructure is not yet standard practice, but it should be.
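An automated evaluation gate can be as simple as comparing candidate scores against a stored baseline with a tolerated regression margin. In this sketch the baseline, metric names, and margin are all assumptions; the candidate scores would come from your actual eval harness (Braintrust or an in-house runner), and the boolean result would fail the CI job.

```python
# Hypothetical baseline scores, as would be recorded from the last good release.
BASELINE_SCORES = {"accuracy": 0.91, "groundedness": 0.88}
MAX_REGRESSION = 0.02  # block the deploy if any metric drops more than this

def gate(candidate_scores: dict[str, float]) -> bool:
    """Return True if the candidate build may be promoted to production."""
    for metric, baseline in BASELINE_SCORES.items():
        floor = baseline - MAX_REGRESSION
        if candidate_scores.get(metric, 0.0) < floor:
            print(f"BLOCKED: {metric} fell below {floor:.2f}")
            return False
    return True
```

The important design choice is that a *missing* metric counts as a failure (`get(metric, 0.0)`): a candidate that silently stops reporting groundedness should not sail through the gate.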
Testing extends well beyond evaluation. Unit tests for individual tools and functions, integration tests for the full agent-to-tool-to-API chain, load and stress tests to understand behaviour under production traffic, chaos engineering to deliberately inject failures and verify resilience, and regression tests to ensure new changes do not break existing agent behaviours are all part of a mature testing strategy.
Compliance and audit logging in regulated industries requires immutable, tamper-proof records of every agent action and decision. Data retention policies must balance the need for audit trails with the obligation to delete personal data on request under regulations such as the GDPR. Chain-of-thought logging, which records the model’s reasoning process, is extremely useful for debugging and auditing but creates tension with data minimisation principles when the reasoning contains personal information.
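Hash chaining is one common technique for making an append-only log tamper-evident: each record commits to the hash of its predecessor, so altering any entry invalidates everything after it. This is a self-contained sketch, not a compliance-certified design; regulated deployments would anchor the chain in write-once storage.

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only log where each record commits to its predecessor's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.records: list[dict] = []
        self._prev_hash = self.GENESIS

    def append(self, actor: str, action: str, detail: str) -> dict:
        record = {
            "ts": time.time(), "actor": actor,
            "action": action, "detail": detail,
            "prev": self._prev_hash,
        }
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self._prev_hash = record["hash"]
        self.records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute the chain; any edited record breaks every later link."""
        prev = self.GENESIS
        for r in self.records:
            body = {k: v for k, v in r.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if r["prev"] != prev or r["hash"] != expected:
                return False
            prev = r["hash"]
        return True
```

The `actor` field is where the override distinction lives: `"agent-7"` versus `"human:ops"` is exactly the evidence an auditor asks for when establishing whether a decision was autonomous, approved, or overruled.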
Agent-to-agent communication becomes relevant when your system grows beyond a single agent. Multi-agent coordination patterns (publish/subscribe, shared blackboards, negotiation protocols), context transfer during handoffs between agents, conflict resolution when two agents produce contradictory outputs, and shared state management for agents working on the same task are all design problems that appear once you have more than one agent in production.
Latency and performance optimisation matter more than most teams realise. Streaming tokens to the user as they are generated (rather than waiting for full completion) dramatically improves perceived responsiveness. Parallel tool execution, running independent tool calls concurrently rather than sequentially, can cut end-to-end latency by half or more for multi-tool workflows. Speculative execution (starting likely next steps before the current step completes) trades compute cost for speed. And selecting models based on latency requirements, using smaller, faster models for time-sensitive paths and reserving larger models for reasoning-heavy steps, keeps the system responsive without sacrificing quality where it counts.
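The parallel-tool-execution point is easy to demonstrate with `asyncio.gather`. The two "tools" here are stand-ins that sleep for 200 ms each; running them concurrently finishes in roughly the time of one call rather than the sum of both.

```python
import asyncio
import time

# Stand-in tools: real implementations would be HTTP or database calls.
async def fetch_account(user_id: str) -> dict:
    await asyncio.sleep(0.2)  # simulated network latency
    return {"user": user_id, "balance": 100}

async def fetch_history(user_id: str) -> list[str]:
    await asyncio.sleep(0.2)
    return ["txn-1", "txn-2"]

async def run_parallel(user_id: str):
    # Independent calls run concurrently: ~0.2 s total instead of ~0.4 s.
    return await asyncio.gather(fetch_account(user_id), fetch_history(user_id))

start = time.perf_counter()
account, history = asyncio.run(run_parallel("u-42"))
elapsed = time.perf_counter() - start
```

The precondition is independence: if the second call needs the first call's output, you are back to sequential execution, which is why orchestration frameworks that model tool dependencies explicitly can parallelise automatically.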
Structured output and output parsing are the last-mile problem. Schema enforcement ensures agent outputs conform to expected JSON or XML schemas. Output validation catches type errors, out-of-range values, and broken references. Graceful handling of malformed model output, with retry strategies that include increasingly explicit formatting instructions, prevents a single parsing failure from crashing the entire workflow.
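A validate-and-retry loop with escalating formatting instructions might look like the following sketch. The schema, hint wording, and `model_call` interface are all assumptions; in practice the callable would wrap your model client and append the hint to the prompt.

```python
import json

# Expected output shape; tuples allow either int or float for numeric fields.
SCHEMA_FIELDS = {"decision": str, "amount": (int, float)}

def validate(raw: str):
    """Parse and type-check against the expected fields; None if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, types in SCHEMA_FIELDS.items():
        if field not in data or not isinstance(data[field], types):
            return None
    return data

def call_with_retries(model_call, max_attempts: int = 3) -> dict:
    """Retry with increasingly explicit formatting instructions."""
    hints = [
        "",
        "Respond with JSON only, no prose.",
        'Respond with exactly: {"decision": "<string>", "amount": <number>}',
    ]
    for attempt in range(max_attempts):
        raw = model_call(hints[min(attempt, len(hints) - 1)])
        parsed = validate(raw)
        if parsed is not None:
            return parsed
    raise ValueError("model output never matched the schema")
```

The final `ValueError` matters: after exhausting retries, the workflow should fail loudly (or escalate to a human) rather than pass a malformed object downstream.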
Assembling the modular stack: a reference architecture
For teams building outside hyperscaler platforms, the modular stack is assembled as follows.
- Orchestration: LangGraph or CrewAI, depending on whether you need explicit state machine control or faster role-based iteration.
- Context management: a custom memory layer, typically Redis or DynamoDB for session state, paired with a vector store for long-term retrieval, until better off-the-shelf options arrive.
- Tool integration: MCP for standardised tool discovery and calling, with native function calling as a fallback for latency-sensitive paths.
- Runtime: Kubernetes with gVisor for better-than-container isolation, or AgentCore Runtime if you are on AWS and want microVM-grade isolation.
- Security: Lakera Guard for input/output filtering, NeMo Guardrails for policy-based conversational constraints, and a custom gateway proxy for deterministic policy enforcement.
- Observability: Langfuse for self-hosted deployments, or Arize Phoenix/AX for managed deployments with drift detection.
- Evaluation: Braintrust for eval-driven development and automated regression gates.
- Error handling: custom retry logic, circuit breakers, fallback model routing, and checkpointing for long-running workflows.
- Cost management: per-session token budgets, model routing by task difficulty, and semantic caching for high-frequency queries.
- Human-in-the-loop: approval workflows calibrated to action blast radius, escalation triggers based on confidence thresholds, and feedback loops that turn human overrides into evaluation data.
- Identity: Nango for third-party service authentication, and your cloud provider’s IAM for internal resources.
This is not a cheap stack to operate. It is not a simple one to build. But for sensitive environments, the cost of building it is lower than the cost of the incident you will have without it.
The final assessment
Gartner projects that 40% of agent initiatives will be cancelled or significantly scaled back by 2027. The primary cause will not be that the technology does not work; it will be that organisations underestimate the engineering discipline required to run non-deterministic systems in production. The normalisation of deviance that Rehberger describes will claim more agent deployments than any technical limitation.
The expanded stack mapped in this article, from orchestration through context management, tool integration, security, observability, evaluation, drift detection, error handling, cost control, human-in-the-loop governance, and identity, represents the full engineering surface area of a production agent deployment in a sensitive environment. Not every team needs every layer. But every team needs to have made a conscious decision about each one, even if that decision is “we accept the risk of not covering this.”
Build for your actual risk profile. Start with observability because you cannot fix what you cannot see. Add guardrails for anything that touches untrusted input. Add evaluation so you know whether the agent is doing a good job, not just a job. Build error handling so failures degrade gracefully rather than catastrophically. Track costs so the CFO does not shut the project down in Q3. And design your human-in-the-loop patterns before the first production incident forces you to improvise them.
If your agent does not touch customer data or take consequential actions, close this article and go build something with Claude and a good prompt. For everyone else, the stack is waiting, and the engineering effort you invest in it now will determine whether your agent deployment is in Gartner’s surviving 60% or its discarded 40%.