Claude Opus 4.6 just shipped agent teams. But can you trust them?

By Iain, 06 Feb 2026

Anthropic shipped Claude Opus 4.6 this week. The headline features are strong: a 1M token context window (a first for Opus models), 128K output tokens, adaptive thinking that adjusts reasoning depth to the task, and top-of-the-table benchmark scores across coding, finance, and long-context retrieval. It scored 65.4% on Terminal-Bench 2.0, the highest ever recorded on that agentic coding evaluation (Anthropic, 2026).

But the feature that should arguably be getting the most scrutiny is agent teams in Claude Code. The following is a broad summary of the state of multi-agent approaches across multiple platforms and the security concerns you should be aware of if deploying them in sensitive environments.

In Opus 4.6, agent teams let you spin up multiple Claude agents that work in parallel and coordinate autonomously. Anthropic describes this as particularly useful for “read-heavy work like codebase reviews.” Instead of one agent handling everything sequentially, you split the work across multiple agents that each take a piece and bring their findings together.

This is a meaningful capability improvement. It is also, from a security standpoint, an architecture that the industry has barely started to think about defending. Anthropic isn’t alone here, the same coordination patterns showing up in Claude Code exist across AutoGen, CrewAI, LangGraph, and OpenAI’s Agents SDK. The security gaps are structural, not vendor-specific.

How Multi-Agent Coordination Works

The implementations differ across frameworks, but the core pattern is consistent. A coordinating agent divides a task, delegates portions to specialist agents, and synthesises the results.

In Claude Code’s new agent teams, you spawn subagents that operate in parallel. Each gets a portion of the task, works through it independently, and produces output that feeds back into the coordinator. You can take over any subagent via Shift+Up/Down or tmux (Anthropic, 2026).

AutoGen (Microsoft) uses a conversational model where agents interact through structured turn-taking, each posting messages and reacting to others’ outputs (Wu et al., 2023). CrewAI mirrors a human organisation, with a manager agent delegating to specialists and aggregating results. LangGraph treats agent interactions as nodes in a directed graph with explicit state management (DataCamp, 2025). OpenAI’s Agents SDK (which evolved from their experimental Swarm framework) uses explicit handoff functions where one agent transfers control to another, keeping just one agent in charge at any time (OpenAI, 2025).

The efficiency gains are real across all of these systems. A codebase review that took an hour with a single agent could take fifteen minutes with four agents running in parallel. One of the Opus 4.6 launch testimonials describes a “multi-million-line codebase migration” that finished in half the expected time. But every one of these frameworks shares a common assumption about how agents within a team relate to each other, and that assumption is where the problems begin.

The Trust-Vulnerability Paradox

Xu et al.’s research on multi-agent trust, published in October 2025, gives this problem a formal name: the Trust-Vulnerability Paradox. Their findings are worth paying attention to. They demonstrated empirically that increasing inter-agent trust to improve coordination simultaneously expands risks of over-exposure and over-authorisation (Xu et al., 2025).

Their experiments across multiple model backends (DeepSeek, Qwen, GPT, Llama-3-8B) and orchestration frameworks showed consistent results, higher trust improved task success but also heightened exposure risks. They measured this using two metrics, Over-Exposure Rate (how often agents share information beyond what’s necessary) and Authorisation Drift (how much leakage sensitivity changes with trust levels). The relationship was monotonic i.e. more trust and more coordination led to more vulnerability. There was no configuration sweet spot where you get the benefits without the costs.

This maps directly onto what’s happening with agent teams. Every framework listed above operates on implicit mutual trust between agents. CrewAI’s manager trusts its specialists’ outputs. AutoGen’s conversational agents trust each other’s messages. LangGraph’s nodes trust the state passed between them. Claude Code’s coordinator trusts its subagents’ assessments. The trust is there because it has to be for coordination to work, and because all agents are presumed to be instances of the same system working toward the same goal.

If you’ve spent time in enterprise network security, this architecture should feel uncomfortably familiar. It mirrors the flat corporate network of the early 2000s, where every machine trusted every other machine because they were all “inside the perimeter.” The entire zero trust movement exists because that model fails badly when any single node gets compromised. Once an attacker is inside, lateral movement is straightforward because nothing challenges the assumption that internal traffic is legitimate. Agent teams have the same structural weakness.

What Compromise Looks Like

A penetration testing study by researchers examining AutoGen and CrewAI found that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. Grok 2 running on CrewAI rejected only 2 of 13 attacks, a 15.4% refusal rate. The overall refusal rate across all tested configurations was 41.5% (arxiv, 2024). These numbers are bad enough for single-agent deployments. In a multi-agent team, the consequences multiply.

The most straightforward compromise scenario involves prompt injection through the content being processed. If an agent team is reviewing a codebase, each subagent reads files, processes their contents, and forms conclusions. If any of those files contain adversarial content designed to manipulate LLM behaviour, the subagent processing that file is the one exposed.

With a single agent, you might notice something odd in its output. With a team, the compromised subagent’s output flows into the coordinator’s synthesis alongside legitimate outputs from other agents. The coordinator has no reliable mechanism for distinguishing a genuine finding from a manipulated one, because the architecture assumes all subagent outputs are trusted input.

A compromised subagent doesn’t need to do anything dramatic. It doesn’t need to execute malicious code or access external systems directly. It just needs to subtly influence its own output in ways that affect the coordinator’s conclusions.

Consider a codebase review where one subagent has been manipulated through an injected comment in a source file. That agent might report that a particular authentication implementation is “well-structured and follows best practices” when it actually contains a vulnerability.

Or it might flag a series of false positives to create noise that distracts from a real issue. The coordinator, receiving these assessments alongside legitimate ones, has to weigh and synthesise them without any ground truth about which subagent to trust more.

There’s a subtler variant too. A compromised subagent doesn’t have to lie about its findings. It can tell the truth selectively, emphasising cosmetic issues like naming conventions and comment formatting while leaving serious vulnerabilities unmentioned.

The coordinator receives a report that looks thorough and legitimate but that consumed the agent’s attention on trivialities. The final review then has a blind spot exactly where the attacker wanted one, and nothing in the output looks wrong because every individual claim is accurate.

This kind of misdirection through selective reporting is particularly hard to defend against because it produces no false statements. The attack is in the gaps, in what the agent chose not to say, and detecting omissions requires knowing what should have been there in the first place.

Lateral Movement Between Agents

In network security, lateral movement is the process by which an attacker who has compromised one system moves through a network to reach additional systems. Research on multi-agent propagation confirms this applies to LLM systems too. NetSafe’s analysis showed how hallucinations and misinformation propagate across multi-agent topologies, with structural dependencies determining how quickly corruption spreads (Yu et al., 2024).

The mechanism works as follows. Agent A processes a file containing adversarial content. The injected instructions tell Agent A to include specific phrasing in its report to the coordinator. The coordinator reads Agent A’s report, and the phrasing is crafted to influence how the coordinator interprets other agents’ reports or frames its final output. The attack has moved from Agent A to the coordinator without any direct connection between the attacker and the coordinator.

In a more complex team with multiple coordination layers, this chain could extend further. A compromised agent at the leaf level could influence an intermediate coordinator, which then influences the top-level coordinator. Each hop adds noise and reduces the attacker’s control, but it also adds distance between the original injection and the final output, making detection harder.

The natural response to “what if an agent gets compromised” is usually “we’ll review the output.” But if the compromise propagates through the coordination chain before reaching output, the final result might look perfectly reasonable while being subtly wrong in ways that serve the attacker’s goals.

The Coordination Channel as an Attack Surface

The communication between agents is itself a vector worth examining. When a coordinator sends task descriptions to subagents, and subagents return their findings, those messages carry implicit authority. This is analogous to a man-in-the-middle attack on a coordination protocol, except the “protocol” is natural language and the “messages” are context window contents.

The TRiSM (Trust, Risk, and Security Management) framework for agentic AI, published by researchers in mid-2025, identifies this as a fundamental gap. Their taxonomy of threats explicitly calls out prompt injection, memory poisoning, collusive failure, and emergent misbehaviour as risks that expand when multiple agents interact, and concludes that “defenses designed for single-model LLM applications are not sufficient” (TRiSM survey, 2025).

In Claude Code’s current implementation, coordination happens through the local environment, which limits the attack surface. But as agent teams get deployed in more distributed architectures, and as MCP servers get integrated into the workflow, the coordination boundary expands.

An MCP server providing data to one subagent could inject instructions that influence that agent’s report. The agent doesn’t know the difference between legitimate tool output and adversarial tool output, because at the model level, everything is text in the context window.

Each framework handles this boundary differently, and none of them yet handle it well enough. AutoGen’s conversational model means every message in the agent dialogue is a potential injection point. CrewAI’s hierarchical model concentrates trust at the manager level, creating a single point of failure.

LangGraph’s graph-based approach provides the most structural control (you can define explicit validation at each edge), but validation logic is left entirely to the developer. OpenAI’s Agents SDK has guardrails as a first-class concept, with input and output validation running in parallel with agent execution (OpenAI, 2025), but these guardrails operate at the agent boundary, not at the inter-agent communication boundary. None of these frameworks validate what agents say to each other.

What We Can Do About This?

The honest answer is that defences for multi-agent coordination security are underdeveloped. But research is catching up, and some approaches show promise.

Output validation between agents. Rather than allowing the coordinator to accept subagent outputs as raw text injected into its context, there should be structured formats for inter-agent communication with validation at each boundary. If a subagent’s output is supposed to be a code review assessment, it should conform to a schema that limits the types of content it can contain. This doesn’t prevent semantic attacks (an agent can still say “this code is fine” when it isn’t), but it prevents the most direct forms of instruction injection through the coordination channel.

Differential analysis. If multiple agents review overlapping portions of a codebase, their findings can be compared for consistency. Significant disagreements between agents examining related code could trigger additional scrutiny or a fresh review by a separate agent that wasn’t exposed to the same input. This borrows from Byzantine fault tolerance in distributed systems, where you need agreement among a majority of nodes to accept a result.

The BlockAgents framework (Chen et al., 2024) takes this idea seriously, using a proof-of-thought consensus mechanism with multi-round debate-style voting to prevent Byzantine attacks. Their experiments showed the framework reduced the impact of poisoning attacks on accuracy to less than 3% and the success rate of backdoor attacks to less than 5% (BlockAgents, 2024).

More recently, DecentLLMs proposed a leaderless architecture where worker agents generate answers in parallel and evaluator agents independently score and rank them, avoiding the single-point-of-failure problem in leader-based approaches (Jo et al., 2025). These are academic implementations and add significant overhead, but they demonstrate that the problem is solvable with the right architectural choices.

Abandoning flat trust. Each agent’s output should be treated as potentially influenced, and the coordinator should apply its own judgment rather than simply aggregating subagent reports. This is zero trust applied to agent architectures i.e. never trust, always verify, even when the traffic comes from inside the team. Xu et al. ‘s research found that two specific defences reduced exposure: Sensitive Information Repartitioning (dividing sensitive data so no single agent holds the complete picture) and Guardian-Agent enablement (a dedicated oversight agent monitoring inter-agent exchanges). Both reduced Over-Exposure Rate and attenuated Authorisation Drift (Xu et al., 2025).

Permission scoping. In most current frameworks, all agents in a team inherit the same permission set. A more defensive architecture would scope permissions per agent based on its assigned task. An agent reviewing documentation doesn’t need write access to source code. An agent checking test coverage doesn’t need network access. CrewAI’s task-level tool scoping is the closest any major framework gets to this, allowing developers to restrict which tools each agent can access during specific tasks (CrewAI, 2025). But even CrewAI doesn’t implement full role-based access control, which remains an open area for development.

Temporal isolation. Agents in a team shouldn’t influence each other’s behaviour in real time during execution. If Agent A can modify a shared resource that Agent B reads during its own processing, you’ve created a side channel that bypasses the coordination protocol entirely. Subagents should operate on snapshots of the relevant data, produce their outputs independently, and only the coordinator should see all the results. This prevents a compromised agent from poisoning the input data for other agents mid-execution, which is a more direct attack vector than the output poisoning discussed above.

Observability. Every inter-agent message, every tool call, every context window update should be logged and available for audit. LangGraph has the most developed story here through its LangSmith integration, providing trace IDs, latency breakdowns, and cost attribution (Thread Transfer, 2025). AutoGen requires custom instrumentation. CrewAI logs to console by default. Anthropic’s Claude Code provides some visibility through tmux access to subagents, but this is manual inspection rather than systematic monitoring. None of these match what you’d expect from a production grade system.

The Framework Gap

What’s striking about the current state of multi-agent security is how differently the frameworks handle coordination, and how uniformly they fail to address inter-agent trust.

OpenAI’s approach with the Agents SDK is probably the most conservative. By keeping one agent in charge at any time and using explicit handoff functions, they limit the coordination surface. The tradeoff is that you lose the parallelism that makes agent teams attractive in the first place. OpenAI’s guardrails run concurrently with agent execution and can halt processing if constraints are breached, which is good, but they validate the agent’s relationship with the outside world rather than agents’ relationships with each other (OpenAI, 2025).

LangGraph provides the most structural control through its graph-based architecture. You can define explicit validation nodes between agents, checkpoint state for rollback, and encode failures directly as graph edges. But this capability is opt-in and nothing in the framework forces or even encourages developers to build validation into their agent coordination flows. The graph model gives you the tools to build a secure architecture, but it gives you equal tools to build an insecure one (Galileo, 2025).

CrewAI’s hierarchical model, with its task-level tool scoping, comes closest to implementing least-privilege principles. But the hierarchy also means the manager agent is a high-value target. If the manager’s context is poisoned, every downstream agent’s work is affected. And CrewAI’s observability is limited to console logging, making it harder to detect compromise after the fact (Thread Transfer, 2025).

AutoGen’s conversational approach offers flexibility but creates a large attack surface. Every message in the multi-agent dialogue is a potential injection vector, and the framework’s reliance on conversational retries means a compromised agent gets multiple attempts to influence the conversation (Galileo, 2025).

Claude Code’s agent teams sit somewhere in the middle. The local execution environment provides meaningful containment, and the tmux-based access gives developers direct visibility into each agent’s activity. But the coordination model is still flat trust, and as the capability matures, the pressure to extend it beyond local development will be considerable.

Why This Matters

There’s a pattern in technology adoption where convenience features ship before security features, and the architecture solidifies around the convenient version before anyone builds the secure version. We saw it with web applications (SQL injection was trivial because nobody sanitised inputs). We saw it with cloud computing (S3 buckets were public by default). We saw it with MCP (the protocol shipped without authentication and people connected it to production systems anyway).

Agent teams are at the beginning of this curve. The capability is real and useful, and teams are going to adopt it because the productivity gains are immediate and the security risks are theoretical (until they aren’t). Anthropic, OpenAI, Microsoft, and the open-source frameworks have each shipped versions that work well for their intended use cases within their intended boundaries.

But capabilities like this don’t stay contained. The enterprise adoption curve will follow the same trajectory we saw with MCP. Early adopters will use it carefully, with human oversight at every step. Then someone will build an orchestration layer that makes deployment easier. Then someone else will connect that layer to production systems because the demo was impressive. And somewhere in that progression, the gap between what the architecture assumes (all agents are trustworthy) and what it needs to handle (agents processing untrusted input) becomes a real problem rather than a theoretical one.

The academic community is already treating this seriously. The Trust-Vulnerability Paradox paper, the TRiSM framework, BlockAgents, DecentLLMs, the penetration testing studies on AutoGen and CrewAI are all from the past eighteen months. The research is there, what’s missing is the translation of that research into practical tooling that developers actually use when building multi-agent systems.

The model improvements in Opus 4.6 are tangible. The coding gains are strong, the context window expansion is useful, and adaptive thinking is a smart approach to cost-quality tradeoffs. But the most consequential feature in this release might be agent teams, precisely because it changes the security model in ways we haven’t fully worked through yet, and because it arrives alongside similar capabilities from every other major provider.

If you’re building on top of any multi-agent framework, the question to ask isn’t whether your agents can be compromised individually. That’s been true since the first LLM connected to a tool. The question is what happens when a compromised agent is part of a team, and whether your architecture assumes the answer is “nothing, because they’re all on the same side.” That simple assumption is the core vulnerability.

Like this? Get email updates or grab the RSS feed.

More from the blog:

AI slop: psychology, history, and the problem of the ersatz
07 Jan 2026
In 2025, the term “slop” emerged as the dominant descriptor for low-quality AI-generated output. It has quickly joined our shared lexicon, and Merriam-Webster’s human editors chose it as their Word of the Year. As a techno-optimist, I am at worst ambivalent about AI outputs, so…
The missiles are the destination
28 Oct 2025
One of my uncommon enjoyments is the work that happens right in the middle of a big problem that needs to be solved, or even a nosedive. A calmness kicks in, the path gets clearer and I can usually tunnel vision my way through to course correction. I used to think this was spec…
Fall back
17 Oct 2025
What creative studios and dev shops (and probably everyone else, too) need to do to stay relevant in the AI era without becoming commoditized slop. What’s covered: Your people are your moat · Easy to do, hard to be the best · Quality and simplicity · Never look to others · Don…
On getting paid faster
29 Sep 2025
These five cashflow levers are arguably the quickest, easiest wins when optimizing your service business. Frequency of online payment deposits Update your online payment system to deposit into your bank account daily instead of weekly. Or whatever the quickest interval availab…
Reading between the lines
23 Sep 2025
There’s a lot to learn in a professional services business between the day you open your doors and the day you sell or retire or whatever it is you do next. And there are a million different angles you can take to improve the future based on the past. Here is a line graph of my…

All blog posts

Claude Opus 4.6 just shipped agent teams. But can you trust them?

How Multi-Agent Coordination Works

The Trust-Vulnerability Paradox

What Compromise Looks Like

Lateral Movement Between Agents

The Coordination Channel as an Attack Surface

What We Can Do About This?

The Framework Gap

Why This Matters

More from the blog:

AI slop: psychology, history, and the problem of the ersatz

The missiles are the destination

Fall back

On getting paid faster

Reading between the lines

Let’s chat