Claude Opus 4.6: Agent Teams, 1M Context & Effort Tuning
Claude Opus 4.6 brings major upgrades for Claude Code users: Agent Teams (swarms), 1M token context, effort tuning, and state-of-the-art benchmarks across coding, reasoning, and agentic tasks.
Anthropic just dropped Claude Opus 4.6 — and it’s not just an incremental upgrade. This is a fundamentally more capable model that plans more carefully, sustains agentic tasks longer, operates reliably in massive codebases, and catches its own mistakes. For Claude Code users, it also comes with two game-changing features: Agent Teams and Effort Tuning.
Here’s everything you need to know.
What Makes Opus 4.6 Different
Claude Code developer Boris Cherny summed it up:
“I’ve been using Opus 4.6 for a bit — it is our best model yet. It is more agentic, more intelligent, runs for longer, and is more careful and exhaustive.”
The key improvements:
- Better planning: Opus 4.6 deliberates more carefully before acting, reducing wasted steps in complex tasks
- Longer agentic sessions: The model sustains focus across extended multi-step operations without degrading
- Massive codebase reliability: Operates effectively in large, real-world codebases where earlier models would lose context
- Self-correction: Superior code review and debugging capabilities — it catches its own mistakes
- 1M token context (beta): The first Opus-class model with a million-token context window, with dramatically reduced “context rot” — 76% accuracy on MRCR v2’s 8-needle 1M variant vs Sonnet 4.5’s 18.5%
Benchmark Results: State-of-the-Art
Opus 4.6 leads or matches the best results across nearly every major benchmark. Here’s the full comparison against Opus 4.5, Sonnet 4.5, Gemini 3 Pro, and GPT-5.2:
Agentic Coding
| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| Terminal-Bench 2.0 (Agentic terminal coding) | 65.4% | 59.8% | 51.0% | 56.2% | 64.7% (Codex CLI) |
| SWE-bench Verified (Agentic coding) | 80.8% | 80.9% | 77.2% | 76.2% | 80.0% |
| OSWorld (Agentic computer use) | 72.7% | 66.3% | 61.4% | — | — |
Agentic Tool Use
| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| t2-bench Retail | 91.9% | 88.9% | 86.2% | 85.3% | 82.0% |
| t2-bench Telecom | 99.3% | 98.2% | 98.0% | 98.0% | 98.7% |
| MCP Atlas (Scaled tool use) | 59.5% | 62.3% | 43.8% | 54.1% | 60.6% |
| BrowseComp (Agentic search) | 84.0% | 67.8% | 43.9% | 59.2% | 77.9% |
Reasoning & Knowledge
| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| Humanity’s Last Exam (no tools) | 40.0% | 30.8% | 17.7% | 37.5% | 36.6% |
| Humanity’s Last Exam (with tools) | 53.1% | 43.4% | 33.6% | 45.8% | 50.0% |
| ARC AGI 2 (Novel problem-solving) | 68.8% | 37.6% | 13.6% | 45.1% | 54.2% |
| GPQA Diamond (Graduate-level reasoning) | 91.3% | 87.0% | 83.4% | 91.9% | 93.2% |
| MMMLU (Multilingual Q&A) | 91.1% | 90.8% | 89.5% | 91.8% | 89.6% |
Other Domains
| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| Finance Agent (Agentic financial analysis) | 60.7% | 55.9% | 54.2% | 44.1% | 56.6% |
| GDPVal-AA (Office tasks, Elo) | 1606 | 1416 | 1277 | 1195 | 1462 |
| MMMU Pro (Visual reasoning, no tools) | 73.9% | 70.6% | 63.4% | 81.0% | 79.5% |
| MMMU Pro (Visual reasoning, with tools) | 77.3% | 73.9% | 68.9% | — | 80.4% |
Standout results:
- ARC AGI 2: 68.8% — nearly double Opus 4.5’s 37.6%, a massive jump in novel problem-solving
- BrowseComp: 84.0% — 6.1 points ahead of GPT-5.2, the nearest competitor, on hard-to-find information retrieval
- Terminal-Bench 2.0: 65.4% — highest score, narrowly beating GPT-5.2’s Codex CLI at 64.7%
- GDPVal-AA: 1606 Elo — outperforms GPT-5.2 by 144 points and Opus 4.5 by 190 points
Additionally, Opus 4.6 is 2x better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics. On BigLaw Bench, it scores 90.2% with 40% perfect scores. It also outperforms all competitors in cybersecurity vulnerability detection.
New for Claude Code: Agent Teams (Research Preview)
This is the biggest Claude Code feature drop in a while. Agent Teams (aka swarms) let you orchestrate multiple Claude Code instances working together as a team.
How It Works
- A lead agent coordinates work, spawns teammates, and synthesizes results
- Teammates work independently, each in its own context window
- Teammates can message each other directly — not just report back to the lead
- A shared task list coordinates work across the team
- Tasks have dependency tracking with automatic unblocking
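Anthropic hasn't published the internals of the shared task list, but the dependency-tracking behavior described above can be sketched conceptually. The following Python is an illustration of the idea (the class and field names are invented for this example, not Claude Code's actual schema): a task becomes "ready" automatically as soon as everything it depends on is complete.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # Illustrative stand-in for one entry in the team's shared task list.
    name: str
    depends_on: set = field(default_factory=set)
    done: bool = False

class SharedTaskList:
    def __init__(self, tasks):
        self.tasks = {t.name: t for t in tasks}

    def ready(self):
        # Tasks that aren't done yet and whose dependencies are all complete.
        return [
            t.name for t in self.tasks.values()
            if not t.done and all(self.tasks[d].done for d in t.depends_on)
        ]

    def complete(self, name):
        # Marking a task done "unblocks" its dependents: they show up as
        # ready on the very next poll, with no explicit notification step.
        self.tasks[name].done = True
        return self.ready()

tasks = SharedTaskList([
    Task("design-api"),
    Task("implement-endpoints", depends_on={"design-api"}),
    Task("write-tests", depends_on={"implement-endpoints"}),
])

print(tasks.ready())                 # ['design-api']
print(tasks.complete("design-api"))  # ['implement-endpoints']
```

In a real team, the lead agent and teammates would poll a list like this between turns; the key property is that unblocking is a side effect of completion, not a separate coordination step.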
Agent Teams vs Sub-Agents
| | Sub-Agents | Agent Teams |
|---|---|---|
| Context | Own window; results return to caller | Own window; fully independent |
| Communication | Report results back only | Message each other directly |
| Coordination | Main agent manages all | Shared task list, self-coordination |
| Best for | Focused tasks | Complex collaborative work |
| Token cost | Lower | Higher |
Use Cases
- Parallel code review: Security reviewer + performance checker + test coverage validator, all working simultaneously
- Competing hypothesis debugging: 5 agents investigating different theories, debating each other like a scientific panel
- Cross-layer features: Frontend, backend, and test agents each owning their domain
- Research tasks: Multiple agents investigating different aspects and synthesizing findings
How to Enable
Add to your settings.json:
```json
{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}
```
Then tell Claude to create a team:
```
Create an agent team to refactor the authentication module.
Spawn three teammates: one for the API layer, one for the database layer,
one for test coverage. Require plan approval before changes.
```
Display Modes
- In-process (default): All teammates run in your terminal. Use `Shift+Up/Down` to select and message teammates directly.
- Split panes: Each teammate gets its own pane via tmux or iTerm2, so you can see all output at once.
Key Controls
- `Shift+Up/Down` — Select a teammate
- `Shift+Tab` — Toggle delegate mode (lead coordinates only, no coding)
- `Ctrl+T` — Toggle task list
- `Enter` on a teammate — View their session
- `Escape` — Interrupt a teammate’s current turn
New: Effort Tuning in Claude Code
You can now precisely control how much the model thinks. Run `/model` and use the left/right arrow keys to tune effort:
- Less effort = Faster responses, lower token cost
- More effort = Longer thinking, better results for complex tasks
This maps to the API’s new Adaptive Thinking with four effort levels:
| Level | Behavior |
|---|---|
| Low | Fast, minimal thinking |
| Medium | Balanced |
| High (default) | Standard extended thinking |
| Max | Maximum deliberation for hardest problems |
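On the API side, the exact request shape for Adaptive Thinking hadn't been spelled out in the material above, so the sketch below is an assumption: it builds a Messages API payload with a hypothetical top-level `effort` field mirroring the four levels in the table. Treat the field name and placement as illustrative, not confirmed.

```python
# Sketch of a Messages API request payload with an effort level.
# ASSUMPTION: the "effort" field name/shape is hypothetical -- check
# Anthropic's API docs for the real Adaptive Thinking parameters.
def build_request(prompt: str, effort: str = "high") -> dict:
    assert effort in {"low", "medium", "high", "max"}
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "effort": effort,  # hypothetical parameter mirroring /model's four levels
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor this function for clarity.", effort="low")
print(req["effort"])  # low
```

The useful pattern regardless of the final field name: pick the effort level per request, defaulting high and dialing down for quick, mechanical tasks.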
API & Developer Details
Model ID
claude-opus-4-6
Pricing
Unchanged from Opus 4.5:
- Input: $5 / million tokens
- Output: $25 / million tokens
- Prompts exceeding 200k tokens: $10 / $37.50 per million
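The rates above translate into a simple cost estimate. One caveat: this sketch assumes the long-context premium applies to the entire request once the prompt crosses 200k tokens, which is an interpretation of the pricing line above rather than a confirmed billing rule.

```python
def opus_46_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate Opus 4.6 API cost in dollars.
    ASSUMPTION: the long-context rates ($10 / $37.50 per million) apply
    to the whole request whenever the prompt exceeds 200k tokens."""
    long_context = input_tokens > 200_000
    in_rate = 10.00 if long_context else 5.00    # $ per million input tokens
    out_rate = 37.50 if long_context else 25.00  # $ per million output tokens
    return round(input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate, 4)

print(opus_46_cost_usd(100_000, 4_000))  # 0.6
print(opus_46_cost_usd(300_000, 4_000))  # 3.15
```

A 100k-token prompt with a 4k response costs about $0.60; the same response against a 300k-token prompt jumps to $3.15 because both rates step up past the 200k threshold.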
New API Features
- Context Compaction (beta): Automatically summarizes older context at configurable thresholds — essential for long-running agentic operations
- 128k output tokens supported
- US-only inference option at 1.1x pricing for data residency requirements
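To make the compaction idea concrete: the sketch below is not the API's beta feature itself, just a conceptual client-side illustration of what "summarize older context at a configurable threshold" means. The token heuristic, function names, and summary format are all invented for this example.

```python
def compact(messages, threshold_tokens=150_000, keep_recent=10, summarize=None):
    """Conceptual illustration of context compaction (NOT the actual API):
    once the estimated token count crosses a threshold, older messages are
    collapsed into a single summary so the conversation can keep growing."""
    est = lambda m: len(m["content"]) // 4  # rough ~4-chars-per-token heuristic
    if sum(est(m) for m in messages) <= threshold_tokens:
        return messages  # under the threshold: leave history untouched
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # In the real feature the model writes the summary; here it's a stub.
    summary = summarize(old) if summarize else f"[summary of {len(old)} earlier messages]"
    return [{"role": "user", "content": summary}] + recent

history = [{"role": "user", "content": "x" * 4_000} for _ in range(200)]
print(len(compact(history)))  # 11  (1 summary + 10 recent messages)
```

The payoff for long-running agents is that the context window stops being a hard ceiling: the agent keeps its recent working state verbatim and only loses fidelity on the oldest turns.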
Availability
- claude.ai — available now
- Claude API — `claude-opus-4-6`
- Amazon Bedrock — available
- Google Cloud Vertex AI — available
What This Means for Claude Code Users
Opus 4.6 is the default model for Claude Code users on the Max plan. Here’s the practical impact:
- Longer, more reliable coding sessions: The model won’t degrade mid-task. It sustains focus across complex, multi-file refactors.
- Agent Teams unlock true parallelism: Instead of one agent working sequentially, you can now have a team of specialists working simultaneously and talking to each other.
- Effort tuning saves tokens: Quick fixes? Dial down. Complex architecture? Dial up. You control the cost-intelligence tradeoff per task.
- 1M context means fewer “I’ve lost track” moments: With the beta 1M context window, you can work with massive codebases without the model forgetting earlier context.
- Better self-correction: The model is more likely to catch bugs in its own code before you have to point them out.
Safety & Alignment
Opus 4.6 maintains Anthropic’s strong safety profile:
- Matches or exceeds Opus 4.5’s alignment record
- Lowest over-refusal rate among recent Claude models
- Six new cybersecurity probes to monitor its enhanced capabilities
- Interpretability research integrated into evaluation methodology
Also New: Claude in Office Tools
Worth mentioning for completeness:
- Claude in Excel: Improved long-running task handling, pre-planning, unstructured data ingestion, multi-step changes in single pass
- Claude in PowerPoint (research preview): Generates from descriptions or templates, respects layouts and fonts
The Bottom Line
Opus 4.6 represents a significant step forward, particularly for agentic use cases. The combination of a smarter model + Agent Teams + effort tuning makes Claude Code substantially more powerful for real-world software engineering.
If you’re already on Claude Code, update and try:
- Enable Agent Teams for your next complex task
- Experiment with effort tuning via `/model` + arrow keys
- Push the limits of the 1M context window on a large codebase
The future of AI-assisted coding isn’t a single agent doing everything sequentially — it’s a coordinated team of specialists working in parallel. Opus 4.6 + Agent Teams is the first real implementation of that vision.