
Claude Opus 4.6: Agent Teams, 1M Context & Effort Tuning

Claude Opus 4.6 brings major upgrades for Claude Code users: Agent Teams (swarms), 1M token context, effort tuning, and state-of-the-art benchmarks across coding, reasoning, and agentic tasks.

February 6, 2026 · 10 min read · By Claude World

Anthropic just dropped Claude Opus 4.6 — and it’s not just an incremental upgrade. This is a fundamentally more capable model that plans more carefully, sustains agentic tasks longer, operates reliably in massive codebases, and catches its own mistakes. For Claude Code users, it also comes with two game-changing features: Agent Teams and Effort Tuning.

Here’s everything you need to know.


What Makes Opus 4.6 Different

Claude Code developer Boris Cherny summed it up:

“I’ve been using Opus 4.6 for a bit — it is our best model yet. It is more agentic, more intelligent, runs for longer, and is more careful and exhaustive.”

The key improvements:

  • Better planning: Opus 4.6 deliberates more carefully before acting, reducing wasted steps in complex tasks
  • Longer agentic sessions: The model sustains focus across extended multi-step operations without degrading
  • Massive codebase reliability: Operates effectively in large, real-world codebases where earlier models would lose context
  • Self-correction: Superior code review and debugging capabilities — it catches its own mistakes
  • 1M token context (beta): The first Opus-class model with a million-token context window, with dramatically reduced “context rot” — 76% accuracy on MRCR v2’s 8-needle 1M variant vs Sonnet 4.5’s 18.5%

Benchmark Results: State-of-the-Art

Opus 4.6 leads or matches the best results across nearly every major benchmark. Here’s the full comparison against Opus 4.5, Sonnet 4.5, Gemini 3 Pro, and GPT-5.2:

Agentic Coding

| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.0 (Agentic terminal coding) | 65.4% | 59.8% | 51.0% | 56.2% | 64.7% (Codex CLI) |
| SWE-bench Verified (Agentic coding) | 80.8% | 80.9% | 77.2% | 76.2% | 80.0% |
| OSWorld (Agentic computer use) | 72.7% | 66.3% | 61.4% | — | — |

Agentic Tool Use

| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
| --- | --- | --- | --- | --- | --- |
| t2-bench Retail | 91.9% | 88.9% | 86.2% | 85.3% | 82.0% |
| t2-bench Telecom | 99.3% | 98.2% | 98.0% | 98.0% | 98.7% |
| MCP Atlas (Scaled tool use) | 59.5% | 62.3% | 43.8% | 54.1% | 60.6% |
| BrowseComp (Agentic search) | 84.0% | 67.8% | 43.9% | 59.2% | 77.9% |

Reasoning & Knowledge

| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
| --- | --- | --- | --- | --- | --- |
| Humanity’s Last Exam (no tools) | 40.0% | 30.8% | 17.7% | 37.5% | 36.6% |
| Humanity’s Last Exam (with tools) | 53.1% | 43.4% | 33.6% | 45.8% | 50.0% |
| ARC AGI 2 (Novel problem-solving) | 68.8% | 37.6% | 13.6% | 45.1% | 54.2% |
| GPQA Diamond (Graduate-level reasoning) | 91.3% | 87.0% | 83.4% | 91.9% | 93.2% |
| MMMLU (Multilingual Q&A) | 91.1% | 90.8% | 89.5% | 91.8% | 89.6% |

Other Domains

| Benchmark | Opus 4.6 | Opus 4.5 | Sonnet 4.5 | Gemini 3 Pro | GPT-5.2 |
| --- | --- | --- | --- | --- | --- |
| Finance Agent (Agentic financial analysis) | 60.7% | 55.9% | 54.2% | 44.1% | 56.6% |
| GDPVal-AA (Office tasks, Elo) | 1606 | 1416 | 1277 | 1195 | 1462 |
| MMMU Pro (Visual reasoning, no tools) | 73.9% | 70.6% | 63.4% | 81.0% | 79.5% |
| MMMU Pro (Visual reasoning, with tools) | 77.3% | 73.9% | 68.9% | 80.4% | — |

Standout results:

  • ARC AGI 2: 68.8% — nearly double Opus 4.5’s 37.6%, a massive jump in novel problem-solving
  • BrowseComp: 84.0% on hard-to-find information retrieval, 6.1 points ahead of the nearest competitor (GPT-5.2 at 77.9%)
  • Terminal-Bench 2.0: 65.4% — highest score, narrowly beating GPT-5.2’s Codex CLI at 64.7%
  • GDPVal-AA: 1606 Elo — outperforms GPT-5.2 by 144 points and Opus 4.5 by 190 points
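Elo gaps translate into head-to-head win probabilities via the standard Elo expected-score formula. A quick calculation (the formula is the standard one, not something from the announcement) shows what the GDPVal-AA margins above imply:

```python
# Standard Elo expected score: probability the higher-rated side wins
# a head-to-head comparison, given the rating difference.
def elo_expected(diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Opus 4.6 (1606) vs GPT-5.2 (1462): a 144-point gap -> ~70% win rate
p_vs_gpt52 = elo_expected(1606 - 1462)
# Opus 4.6 vs Opus 4.5 (1416): a 190-point gap -> ~75% win rate
p_vs_opus45 = elo_expected(1606 - 1416)
```

In other words, a 144-point Elo lead means Opus 4.6 would be preferred roughly 7 times out of 10 in pairwise comparisons on these office tasks.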

Additionally, Opus 4.6 is 2x better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics. On BigLaw Bench, it scores 90.2% with 40% perfect scores. It also outperforms all competitors in cybersecurity vulnerability detection.


New for Claude Code: Agent Teams (Research Preview)

This is the biggest Claude Code feature drop in a while. Agent Teams (aka swarms) let you orchestrate multiple Claude Code instances working together as a team.

How It Works

  • A lead agent coordinates work, spawns teammates, and synthesizes results
  • Teammates work independently, each in its own context window
  • Teammates can message each other directly — not just report back to the lead
  • A shared task list coordinates work across the team
  • Tasks have dependency tracking with automatic unblocking
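The shared task list with dependency tracking and automatic unblocking can be sketched as a small data structure. This is a conceptual illustration of the mechanism described above, not Claude Code's actual implementation:

```python
# Sketch of a shared task list: tasks are blocked until their
# dependencies complete, and completing a task automatically
# unblocks anything that depended on it.
class TaskList:
    def __init__(self):
        self.deps = {}     # task -> set of unfinished dependencies
        self.done = set()

    def add(self, task, deps=()):
        """Register a task; it stays blocked until all deps complete."""
        self.deps[task] = {d for d in deps if d not in self.done}

    def ready(self):
        """Tasks a teammate can claim now (no unfinished dependencies)."""
        return [t for t, d in self.deps.items()
                if not d and t not in self.done]

    def complete(self, task):
        """Mark a task done and unblock its dependents."""
        self.done.add(task)
        for d in self.deps.values():
            d.discard(task)

tasks = TaskList()
tasks.add("api-refactor")
tasks.add("db-refactor")
tasks.add("tests", deps=["api-refactor", "db-refactor"])
# "tests" stays blocked until both refactors are complete
```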

Agent Teams vs Sub-Agents

| | Sub-Agents | Agent Teams |
| --- | --- | --- |
| Context | Own window; results return to caller | Own window; fully independent |
| Communication | Report results back only | Message each other directly |
| Coordination | Main agent manages all | Shared task list, self-coordination |
| Best for | Focused tasks | Complex collaborative work |
| Token cost | Lower | Higher |

Use Cases

  • Parallel code review: Security reviewer + performance checker + test coverage validator, all working simultaneously
  • Competing hypothesis debugging: 5 agents investigating different theories, debating each other like a scientific panel
  • Cross-layer features: Frontend, backend, and test agents each owning their domain
  • Research tasks: Multiple agents investigating different aspects and synthesizing findings

How to Enable

Add to your settings.json:

{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}

Then tell Claude to create a team:

Create an agent team to refactor the authentication module.
Spawn three teammates: one for the API layer, one for the database layer,
one for test coverage. Require plan approval before changes.

Display Modes

  • In-process (default): All teammates in your terminal. Use Shift+Up/Down to select and message teammates directly.
  • Split panes: Each teammate gets its own pane via tmux or iTerm2. See all output at once.

Key Controls

  • Shift+Up/Down — Select a teammate
  • Shift+Tab — Toggle delegate mode (lead coordinates only, no coding)
  • Ctrl+T — Toggle task list
  • Enter on a teammate — View their session
  • Escape — Interrupt a teammate’s current turn

New: Effort Tuning in Claude Code

You can now precisely control how much the model thinks. Run /model and use the left/right arrow keys to tune effort:

  • Less effort = Faster responses, lower token cost
  • More effort = Longer thinking, better results for complex tasks

This maps to the API’s new Adaptive Thinking with four effort levels:

| Level | Behavior |
| --- | --- |
| Low | Fast, minimal thinking |
| Medium | Balanced |
| High (default) | Standard extended thinking |
| Max | Maximum deliberation for hardest problems |
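Client code can treat effort levels as thinking-token budgets. The mapping below is a hypothetical sketch: the level names come from the table above, but the specific budget numbers are illustrative assumptions, not documented values:

```python
# Hypothetical effort-level -> thinking-token-budget mapping.
# Level names match the table above; the budget numbers are
# ASSUMED for illustration, not official values.
EFFORT_BUDGETS = {
    "low": 1_024,      # fast, minimal thinking
    "medium": 8_192,   # balanced
    "high": 32_768,    # standard extended thinking (default)
    "max": 131_072,    # maximum deliberation
}

def thinking_budget(effort: str = "high") -> int:
    """Return the assumed thinking-token budget for an effort level."""
    try:
        return EFFORT_BUDGETS[effort.lower()]
    except KeyError:
        raise ValueError(f"unknown effort level: {effort!r}")
```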

API & Developer Details

Model ID

claude-opus-4-6

Pricing

Unchanged from Opus 4.5:

  • Input: $5 / million tokens
  • Output: $25 / million tokens
  • Prompts exceeding 200k tokens: $10 input / $37.50 output per million
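A quick cost helper based on the rates above. It assumes, as the wording suggests, that once a prompt crosses 200k tokens the entire request is billed at the long-context rate rather than only the portion above the threshold:

```python
# Estimate Opus 4.6 API cost in USD from the published rates.
# ASSUMPTION: requests with prompts over 200k tokens are billed
# entirely at the long-context rate ($10 / $37.50 per million),
# not just the tokens above the threshold.
def opus_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # long-context pricing
    else:
        in_rate, out_rate = 5.00, 25.00    # standard pricing
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. a 100k-token prompt with a 10k-token reply costs $0.75
```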

New API Features

  • Context Compaction (beta): Automatically summarizes older context at configurable thresholds — essential for long-running agentic operations
  • 128k output tokens supported
  • US-only inference option at 1.1x pricing for data residency requirements
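Context compaction can be pictured as a threshold check that collapses older turns into a summary. The sketch below is a conceptual illustration only; the real beta feature performs summarization inside the API, and `summarize()` here is a stand-in for a model-generated summary:

```python
# Conceptual sketch of context compaction: once the estimated token
# count crosses a configurable threshold, older messages collapse
# into a summary stub while recent turns are kept verbatim.
def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # Stand-in for a model-generated summary of the dropped turns.
    return {"role": "user",
            "content": f"[summary of {len(messages)} earlier messages]"}

def compact(messages, threshold=150_000, keep_recent=10):
    """Collapse older messages into one summary when over threshold."""
    if estimate_tokens(messages) <= threshold or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent
```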

Availability

  • claude.ai — available now
  • Claude API: claude-opus-4-6
  • Amazon Bedrock — available
  • Google Cloud Vertex AI — available

What This Means for Claude Code Users

Opus 4.6 is the default model for Claude Code users on the Max plan. Here’s the practical impact:

  1. Longer, more reliable coding sessions: The model won’t degrade mid-task. It sustains focus across complex, multi-file refactors.

  2. Agent Teams unlock true parallelism: Instead of one agent working sequentially, you can now have a team of specialists working simultaneously and talking to each other.

  3. Effort tuning saves tokens: Quick fixes? Dial down. Complex architecture? Dial up. You control the cost-intelligence tradeoff per task.

  4. 1M context means fewer “I’ve lost track” moments: With the beta 1M context window, you can work with massive codebases without the model forgetting earlier context.

  5. Better self-correction: The model is more likely to catch bugs in its own code before you have to point them out.


Safety & Alignment

Opus 4.6 maintains Anthropic’s strong safety profile:

  • Matches or exceeds Opus 4.5’s alignment record
  • Lowest over-refusal rate among recent Claude models
  • Six new cybersecurity probes to monitor its enhanced capabilities
  • Interpretability research integrated into evaluation methodology

Also New: Claude in Office Tools

Worth mentioning for completeness:

  • Claude in Excel: Improved long-running task handling, pre-planning, unstructured data ingestion, multi-step changes in single pass
  • Claude in PowerPoint (research preview): Generates from descriptions or templates, respects layouts and fonts

The Bottom Line

Opus 4.6 represents a significant step forward, particularly for agentic use cases. The combination of a smarter model + Agent Teams + effort tuning makes Claude Code substantially more powerful for real-world software engineering.

If you’re already on Claude Code, update and try:

  1. Enable Agent Teams for your next complex task
  2. Experiment with effort tuning via /model + arrow keys
  3. Push the limits of the 1M context window on a large codebase

The future of AI-assisted coding isn’t a single agent doing everything sequentially — it’s a coordinated team of specialists working in parallel. Opus 4.6 + Agent Teams is the first real implementation of that vision.