Claude Opus 4.6: Agent Teams, 1M Context & Effort Tuning
Claude Opus 4.6 brings major upgrades for Claude Code users: Agent Teams (swarms), 1M token context, effort tuning, and state-of-the-art benchmarks across coding, reasoning, and agentic tasks.
Anthropic just dropped Claude Opus 4.6 — and it’s not just an incremental upgrade. This is a fundamentally more capable model that plans more carefully, sustains agentic tasks longer, operates reliably in massive codebases, and catches its own mistakes. For Claude Code users, it also comes with two game-changing features: Agent Teams and Effort Tuning.
Here’s everything you need to know.
What Makes Opus 4.6 Different
Claude Code developer Boris Cherny summed it up:
“I’ve been using Opus 4.6 for a bit — it is our best model yet. It is more agentic, more intelligent, runs for longer, and is more careful and exhaustive.”
The key improvements:
- Better planning: Opus 4.6 deliberates more carefully before acting, reducing wasted steps in complex tasks
- Longer agentic sessions: The model sustains focus across extended multi-step operations without degrading
- Massive codebase reliability: Operates effectively in large, real-world codebases where earlier models would lose context
- Self-correction: Superior code review and debugging capabilities — it catches its own mistakes
- 1M token context (beta): The first Opus-class model with a million-token context window, with dramatically reduced “context rot” — 76% accuracy on MRCR v2’s 8-needle 1M variant vs Sonnet 4.5’s 18.5%
Benchmark Results: State-of-the-Art
Opus 4.6 leads or matches the best results across nearly every major benchmark. Here’s the full comparison against Opus 4.5, Sonnet 4.5, Gemini 3 Pro, and GPT-5.2:
Agentic Coding
| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| Terminal-Bench 2.0 (Agentic terminal coding) | 65.4% | 59.8% | 51.0% | 56.2% | 64.7% (Codex CLI) |
| SWE-bench Verified (Agentic coding) | 80.8% | 80.9% | 77.2% | 76.2% | 80.0% |
| OSWorld (Agentic computer use) | 72.7% | 66.3% | 61.4% | — | — |
Agentic Tool Use
| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| t2-bench Retail | 91.9% | 88.9% | 86.2% | 85.3% | 82.0% |
| t2-bench Telecom | 99.3% | 98.2% | 98.0% | 98.0% | 98.7% |
| MCP Atlas (Scaled tool use) | 59.5% | 62.3% | 43.8% | 54.1% | 60.6% |
| BrowseComp (Agentic search) | 84.0% | 67.8% | 43.9% | 59.2% | 77.9% |
Reasoning & Knowledge
| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| Humanity’s Last Exam (no tools) | 40.0% | 30.8% | 17.7% | 37.5% | 36.6% |
| Humanity’s Last Exam (with tools) | 53.1% | 43.4% | 33.6% | 45.8% | 50.0% |
| ARC AGI 2 (Novel problem-solving) | 68.8% | 37.6% | 13.6% | 45.1% | 54.2% |
| GPQA Diamond (Graduate-level reasoning) | 91.3% | 87.0% | 83.4% | 91.9% | 93.2% |
| MMMLU (Multilingual Q&A) | 91.1% | 90.8% | 89.5% | 91.8% | 89.6% |
Other Domains
| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| Finance Agent (Agentic financial analysis) | 60.7% | 55.9% | 54.2% | 44.1% | 56.6% |
| GDPVal-AA (Office tasks, Elo) | 1606 | 1416 | 1277 | 1195 | 1462 |
| MMMU Pro (Visual reasoning, no tools) | 73.9% | 70.6% | 63.4% | 81.0% | 79.5% |
| MMMU Pro (Visual reasoning, with tools) | 77.3% | 73.9% | 68.9% | — | 80.4% |
Standout results:
- ARC AGI 2: 68.8% — nearly double Opus 4.5’s 37.6%, a massive jump in novel problem-solving
- BrowseComp: 84.0% — 6.1 points ahead of GPT-5.2, the nearest competitor, on hard-to-find information retrieval
- Terminal-Bench 2.0: 65.4% — highest score, narrowly beating GPT-5.2’s Codex CLI at 64.7%
- GDPVal-AA: 1606 Elo — outperforms GPT-5.2 by 144 points and Opus 4.5 by 190 points
Additionally, Opus 4.6 is 2x better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics. On BigLaw Bench, it scores 90.2% with 40% perfect scores. It also outperforms all competitors in cybersecurity vulnerability detection.
New for Claude Code: Agent Teams (Research Preview)
This is the biggest Claude Code feature drop in a while. Agent Teams (aka swarms) let you orchestrate multiple Claude Code instances working together as a team.
How It Works
- A lead agent coordinates work, spawns teammates, and synthesizes results
- Teammates work independently, each in its own context window
- Teammates can message each other directly — not just report back to the lead
- A shared task list coordinates work across the team
- Tasks have dependency tracking with automatic unblocking
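Anthropic hasn't published the internals of the shared task list, but the dependency-tracking behavior described above can be sketched conceptually. The following Python is an illustration of the idea (the class and field names are invented for this example, not Claude Code's actual schema): a task becomes "ready" automatically as soon as everything it depends on is complete.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # Illustrative stand-in for one entry in the team's shared task list.
    name: str
    depends_on: set = field(default_factory=set)
    done: bool = False

class SharedTaskList:
    def __init__(self, tasks):
        self.tasks = {t.name: t for t in tasks}

    def ready(self):
        # Tasks that aren't done yet and whose dependencies are all complete.
        return [
            t.name for t in self.tasks.values()
            if not t.done and all(self.tasks[d].done for d in t.depends_on)
        ]

    def complete(self, name):
        # Marking a task done "unblocks" its dependents: they show up as
        # ready on the very next poll, with no explicit notification step.
        self.tasks[name].done = True
        return self.ready()

tasks = SharedTaskList([
    Task("design-api"),
    Task("implement-endpoints", depends_on={"design-api"}),
    Task("write-tests", depends_on={"implement-endpoints"}),
])

print(tasks.ready())                 # ['design-api']
print(tasks.complete("design-api"))  # ['implement-endpoints']
```

In a real team, the lead agent and teammates would poll a list like this between turns; the key property is that unblocking is a side effect of completion, not a separate coordination step.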
Agent Teams vs Sub-Agents
| | Sub-Agents | Agent Teams |
|---|---|---|
| Context | Own window; results return to caller | Own window; fully independent |
| Communication | Report results back only | Message each other directly |
| Coordination | Main agent manages all | Shared task list, self-coordination |
| Best for | Focused tasks | Complex collaborative work |
| Token cost | Lower | Higher |
Use Cases
- Parallel code review: Security reviewer + performance checker + test coverage validator, all working simultaneously
- Competing hypothesis debugging: 5 agents investigating different theories, debating each other like a scientific panel
- Cross-layer features: Frontend, backend, and test agents each owning their domain
- Research tasks: Multiple agents investigating different aspects and synthesizing findings
How to Enable
Add to your settings.json:
```json
{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}
```
Then tell Claude to create a team:
```
Create an agent team to refactor the authentication module.
Spawn three teammates: one for the API layer, one for the database layer,
one for test coverage. Require plan approval before changes.
```
Display Modes
- In-process (default): All teammates run in your terminal. Use `Shift+Up/Down` to select and message teammates directly.
- Split panes: Each teammate gets its own pane via tmux or iTerm2, so you can see all output at once.
Key Controls
- `Shift+Up/Down` — Select a teammate
- `Shift+Tab` — Toggle delegate mode (lead coordinates only, no coding)
- `Ctrl+T` — Toggle task list
- `Enter` on a teammate — View their session
- `Escape` — Interrupt a teammate’s current turn
New: Effort Tuning in Claude Code
You can now precisely control how much the model thinks. Run `/model` and use the left/right arrow keys to tune effort:
- Less effort = Faster responses, lower token cost
- More effort = Longer thinking, better results for complex tasks
This maps to the API’s new Adaptive Thinking with four effort levels:
| Level | Behavior |
|---|---|
| Low | Fast, minimal thinking |
| Medium | Balanced |
| High (default) | Standard extended thinking |
| Max | Maximum deliberation for hardest problems |
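On the API side, the exact request shape for Adaptive Thinking hadn't been spelled out in the material above, so the sketch below is an assumption: it builds a Messages API payload with a hypothetical top-level `effort` field mirroring the four levels in the table. Treat the field name and placement as illustrative, not confirmed.

```python
# Sketch of a Messages API request payload with an effort level.
# ASSUMPTION: the "effort" field name/shape is hypothetical -- check
# Anthropic's API docs for the real Adaptive Thinking parameters.
def build_request(prompt: str, effort: str = "high") -> dict:
    assert effort in {"low", "medium", "high", "max"}
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "effort": effort,  # hypothetical parameter mirroring /model's four levels
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor this function for clarity.", effort="low")
print(req["effort"])  # low
```

The useful pattern regardless of the final field name: pick the effort level per request, defaulting high and dialing down for quick, mechanical tasks.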
API & Developer Details
Model ID
claude-opus-4-6
Pricing
Unchanged from Opus 4.5:
- Input: $5 / million tokens
- Output: $25 / million tokens
- Prompts exceeding 200k tokens: $10 / $37.50 per million
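The rates above translate into a simple cost estimate. One caveat: this sketch assumes the long-context premium applies to the entire request once the prompt crosses 200k tokens, which is an interpretation of the pricing line above rather than a confirmed billing rule.

```python
def opus_46_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate Opus 4.6 API cost in dollars.
    ASSUMPTION: the long-context rates ($10 / $37.50 per million) apply
    to the whole request whenever the prompt exceeds 200k tokens."""
    long_context = input_tokens > 200_000
    in_rate = 10.00 if long_context else 5.00    # $ per million input tokens
    out_rate = 37.50 if long_context else 25.00  # $ per million output tokens
    return round(input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate, 4)

print(opus_46_cost_usd(100_000, 4_000))  # 0.6
print(opus_46_cost_usd(300_000, 4_000))  # 3.15
```

A 100k-token prompt with a 4k response costs about $0.60; the same response against a 300k-token prompt jumps to $3.15 because both rates step up past the 200k threshold.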
New API Features
- Context Compaction (beta): Automatically summarizes older context at configurable thresholds — essential for long-running agentic operations
- 128k output tokens supported
- US-only inference option at 1.1x pricing for data residency requirements
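To make the compaction idea concrete: the sketch below is not the API's beta feature itself, just a conceptual client-side illustration of what "summarize older context at a configurable threshold" means. The token heuristic, function names, and summary format are all invented for this example.

```python
def compact(messages, threshold_tokens=150_000, keep_recent=10, summarize=None):
    """Conceptual illustration of context compaction (NOT the actual API):
    once the estimated token count crosses a threshold, older messages are
    collapsed into a single summary so the conversation can keep growing."""
    est = lambda m: len(m["content"]) // 4  # rough ~4-chars-per-token heuristic
    if sum(est(m) for m in messages) <= threshold_tokens:
        return messages  # under the threshold: leave history untouched
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # In the real feature the model writes the summary; here it's a stub.
    summary = summarize(old) if summarize else f"[summary of {len(old)} earlier messages]"
    return [{"role": "user", "content": summary}] + recent

history = [{"role": "user", "content": "x" * 4_000} for _ in range(200)]
print(len(compact(history)))  # 11  (1 summary + 10 recent messages)
```

The payoff for long-running agents is that the context window stops being a hard ceiling: the agent keeps its recent working state verbatim and only loses fidelity on the oldest turns.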
Availability
- claude.ai — available now
- Claude API — `claude-opus-4-6`
- Amazon Bedrock — available
- Google Cloud Vertex AI — available
What This Means for Claude Code Users
Opus 4.6 is the default model for Claude Code users on the Max plan. Here’s the practical impact:
- Longer, more reliable coding sessions: The model won’t degrade mid-task. It sustains focus across complex, multi-file refactors.
- Agent Teams unlock true parallelism: Instead of one agent working sequentially, you can now have a team of specialists working simultaneously and talking to each other.
- Effort tuning saves tokens: Quick fixes? Dial down. Complex architecture? Dial up. You control the cost-intelligence tradeoff per task.
- 1M context means fewer “I’ve lost track” moments: With the beta 1M context window, you can work with massive codebases without the model forgetting earlier context.
- Better self-correction: The model is more likely to catch bugs in its own code before you have to point them out.
Safety & Alignment
Opus 4.6 maintains Anthropic’s strong safety profile:
- Matches or exceeds Opus 4.5’s alignment record
- Lowest over-refusal rate among recent Claude models
- Six new cybersecurity probes to monitor its enhanced capabilities
- Interpretability research integrated into evaluation methodology
Also New: Claude in Office Tools
Worth mentioning for completeness:
- Claude in Excel: Improved long-running task handling, pre-planning, unstructured data ingestion, multi-step changes in single pass
- Claude in PowerPoint (research preview): Generates from descriptions or templates, respects layouts and fonts
The Bottom Line
Opus 4.6 represents a significant step forward, particularly for agentic use cases. The combination of a smarter model + Agent Teams + effort tuning makes Claude Code substantially more powerful for real-world software engineering.
If you’re already on Claude Code, update and try:
- Enable Agent Teams for your next complex task
- Experiment with effort tuning via `/model` + arrow keys
- Push the limits of the 1M context window on a large codebase
The future of AI-assisted coding isn’t a single agent doing everything sequentially — it’s a coordinated team of specialists working in parallel. Opus 4.6 + Agent Teams is the first real implementation of that vision.