MiniMax M2.5 vs Claude Opus 4.6: The $1/Hour Model That Shook the AI World
MiniMax M2.5 claims to match Opus 4.6 in coding and agentic tasks at 1/20th the cost. We break down the benchmarks, architecture, and what this means for developers choosing their AI coding model.
On February 12, 2026 — just six days after Anthropic released Claude Opus 4.6 — Chinese AI company MiniMax dropped M2.5, a model that claims to match Opus 4.6’s coding performance at 1/20th the price. MiniMax’s stock surged 35% in a single day, pushing its market cap past HK$210 billion.
Is M2.5 really an “Opus killer”? Or is the story more nuanced? Let’s dig into the data.
MiniMax: Who Are They?
MiniMax (Shanghai Xiyu Jizhi Technology) is one of China’s leading AI startups, founded in 2021. They’ve released three models in 108 days — M2, M2.1, and M2.5 — maintaining a nearly monthly release cadence that outpaces most competitors.
Key facts about M2.5:
- Architecture: Sparse MoE (Mixture of Experts), 230B total parameters, only 10B activated
- Release Date: February 12, 2026
- Open Source: Yes, weights available on ModelScope
- Agent-Native: MiniMax bills it as the first model explicitly designed for agent workflows from the ground up
- Training: RL across 200,000+ real-world environments, 10+ programming languages
Benchmark Showdown: M2.5 vs Opus 4.6
Coding Benchmarks
| Benchmark | MiniMax M2.5 | Claude Opus 4.6 | Winner |
|---|---|---|---|
| SWE-Bench Verified (Claude Code scaffold) | 80.2% | 80.8% | Opus 4.6 (marginal) |
| SWE-Bench Verified (Droid scaffold) | 79.7% | 78.9% | M2.5 |
| SWE-Bench Verified (OpenCode scaffold) | 76.1% | 75.9% | M2.5 (marginal) |
| Multi-SWE-Bench | 51.3% | — | M2.5 (SOTA) |
| VIBE-Pro (internal) | ~Opus 4.5 level | — | Comparable |
Agentic & Search
| Benchmark | MiniMax M2.5 | Claude Opus 4.6 |
|---|---|---|
| BrowseComp (with context mgmt) | 76.3% | — |
| Wide Search | SOTA | — |
| RISE (internal search eval) | SOTA | — |
Task Completion Speed
| Metric | MiniMax M2.5 | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench avg time | 22.8 min | 22.9 min |
| Tokens per task | 3.52M | — |
| Improvement vs M2.1 | 37% faster | — |
The speed parity is remarkable — M2.5 completes SWE-Bench tasks in essentially the same time as Opus 4.6, despite being a much smaller model.
The Cost Revolution
This is where M2.5 truly differentiates itself:
| Model | Input Price | Output Price | Cost per Hour of Continuous Use |
|---|---|---|---|
| MiniMax M2.5-Lightning (100 TPS) | $0.30/M tokens | $2.40/M tokens | ~$1.00 |
| MiniMax M2.5 (50 TPS) | $0.15/M tokens | $1.20/M tokens | ~$0.30 |
| Claude Opus 4.6 (at 100 TPS) | $15/M tokens | $75/M tokens | ~$27.00 |
That’s a 27x price difference for comparable performance. As MiniMax puts it: “$10,000 can run 4 M2.5 agents continuously for an entire year.”
For agent-heavy workloads where token consumption is massive — think SWE-Bench tasks consuming 3.5M tokens each — this cost advantage is transformative.
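The hourly figures in the pricing table follow from simple arithmetic. The back-of-envelope sketch below is ours, assuming agent runtime cost is dominated by output tokens generated at the listed speed:

```python
# Back-of-envelope check of the hourly costs in the table above.
# Assumption (ours): agent runtime cost is dominated by output tokens,
# so we price one hour of continuous generation at the listed speed.

def hourly_cost(output_price_per_m: float, tps: float) -> float:
    """Cost of generating tokens for one hour at `tps` tokens/sec."""
    tokens_generated = tps * 3600          # tokens produced in one hour
    return tokens_generated / 1e6 * output_price_per_m

print(f"M2.5-Lightning (100 TPS): ${hourly_cost(2.40, 100):.2f}")  # $0.86
print(f"M2.5 standard   (50 TPS): ${hourly_cost(1.20, 50):.2f}")   # $0.22
print(f"Opus 4.6       (100 TPS): ${hourly_cost(75.0, 100):.2f}")  # $27.00
```

Adding input-token costs nudges the M2.5 figures toward the table's rounded $1.00 and $0.30, and dividing $27 by $1 yields the roughly 27x gap cited above.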
Architecture Deep Dive: Why 10B Activated Parameters?
M2.5 uses a Sparse MoE architecture with 230B total parameters but only 10B activated at inference time. This is the secret behind its speed and cost efficiency:
- 100 TPS throughput: Nearly 2x faster than most frontier models
- Low VRAM footprint: Makes private deployment feasible
- Inference efficiency: Lower compute per token means lower API costs
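A toy top-k router makes the activated-vs-total distinction concrete: every token's compute touches only the few experts the router selects, not the full parameter count. This is an illustrative sketch with made-up dimensions, not MiniMax's actual layer:

```python
import numpy as np

# Illustrative sparse-MoE sketch (ours, not MiniMax's architecture):
# each token is routed to its top-k experts, so per-token compute scales
# with the activated parameters (k experts), not the total expert count.

rng = np.random.default_rng(0)

def moe_layer(x, expert_weights, router, k=2):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                     # softmax over selected experts
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ expert_weights[e])
    return out

d, n_experts, tokens = 16, 8, 4
experts = rng.standard_normal((n_experts, d, d)) / np.sqrt(d)
router = rng.standard_normal((d, n_experts))
x = rng.standard_normal((tokens, d))
y = moe_layer(x, experts, router, k=2)
print(y.shape)  # each token touched only 2 of the 8 experts
```

Scaling this shape up is what lets a 230B-parameter model run with roughly the inference cost of a 10B dense one.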
Forge: The Agent-Native RL Framework
MiniMax built Forge, an in-house agent-native reinforcement learning framework with several innovations:
- Decoupled Architecture: Separates the training engine from the agent scaffold, enabling optimization across arbitrary tool configurations
- CISPO Algorithm: Ensures MoE model stability during large-scale training
- Process Rewards: End-to-end quality monitoring for long agent trajectories, solving the credit assignment problem
- Tree-Structured Sample Merging: Achieves ~40x training speedup
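MiniMax has not published Forge's internals, but a ~40x speedup from sample merging is plausibly rooted in prefix sharing: many rollouts from the same environment start identically, and a trie lets the trainer process each shared prefix once. The sketch below is our reconstruction of that idea, not Forge code:

```python
# Illustrative sketch (ours, not Forge's implementation) of
# tree-structured sample merging: rollouts that share a prefix are
# merged into a trie so each shared step is processed only once.

def count_unique_steps(rollouts):
    """Steps processed when shared prefixes are merged into a trie."""
    trie = {}
    steps = 0
    for rollout in rollouts:
        node = trie
        for action in rollout:
            if action not in node:
                node[action] = {}
                steps += 1      # only new (unshared) steps cost compute
            node = node[action]
    return steps

rollouts = [
    ["read", "plan", "edit", "test"],
    ["read", "plan", "edit", "fix"],
    ["read", "plan", "search"],
]
flat = sum(len(r) for r in rollouts)       # 11 steps replayed naively
merged = count_unique_steps(rollouts)      # 6 unique steps in the trie
print(flat, merged)
```

The gap widens with rollout count and prefix length, which is where large multiples like 40x would come from.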
Emergent “Spec-Writing” Behavior
During training, M2.5 developed an emergent behavior: before writing any code, it actively plans and decomposes features like a software architect. This “spec-writing tendency” — planning structure, features, and UI design before implementation — emerged naturally from RL training rather than being explicitly programmed.
Where Opus 4.6 Still Wins
While M2.5 is impressive, Opus 4.6 retains clear advantages in several areas:
1. Context Window
Opus 4.6 offers 1M token context (beta) — the first Opus-class model with this capability. M2.5’s context management uses a different approach (discarding history when usage exceeds 30% of max context), which is less sophisticated.
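The 30% discard policy the article attributes to M2.5 can be sketched in a few lines. This is our illustration; the context limit and the `len`-based token counting are stand-ins, not MiniMax's implementation:

```python
# Hedged sketch (ours) of the policy described above: once the running
# transcript exceeds 30% of the maximum context, discard the oldest
# turns. Counting tokens with len() is a stand-in for a real tokenizer.

MAX_CONTEXT = 200_000            # assumed context limit, in tokens
BUDGET = int(MAX_CONTEXT * 0.30)

def trim_history(turns, count_tokens=len):
    """Drop the oldest turns until the transcript fits the 30% budget."""
    turns = list(turns)
    while turns and sum(count_tokens(t) for t in turns) > BUDGET:
        turns.pop(0)             # discard the oldest turn first
    return turns

history = ["x" * 45_000, "y" * 30_000, "z" * 20_000]
print([len(t) for t in trim_history(history)])  # → [30000, 20000]
```

Dropping whole turns is simple and fast, but it loses information that summarization-based or retrieval-based context management would keep, which is the "less sophisticated" trade-off noted above.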
2. Self-Correction & Code Review
Opus 4.6’s “catches its own mistakes” capability — validated across Terminal-Bench 2.0 (65.4%) and real-world codebase testing — remains industry-leading. M2.5 hasn’t published comparable data on self-correction.
3. Reasoning Depth
On ARC-AGI, Opus 4.6 scored 68%, far exceeding other mainstream models. M2.5’s reasoning capabilities, while improved, still have gaps — multiple reviewers noted it failed common-sense logic tests that Opus handles correctly.
4. Agent Teams & Ecosystem
Opus 4.6 comes with Agent Teams — native multi-agent collaboration with shared task lists, mailbox messaging, and independent context windows. Plus the entire Claude Code ecosystem: hooks, skills, MCP servers, and IDE integrations. M2.5 has its own “Expert” system (10,000+ Experts built on launch day), but the ecosystem is younger.
5. Safety & Alignment
Anthropic’s focus on constitutional AI and safety research gives Opus 4.6 a more mature safety profile. MiniMax’s safety approach is less documented publicly.
Real-World Usage: What Developers Say
M2.5 Strengths (from community testing)
- Excellent in multi-language coding — supports Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, Ruby
- Full-stack capability — Web, Android, iOS, Windows across server APIs, business logic, databases
- Strong Office automation — Word, PowerPoint, Excel financial modeling at professional quality
- Works well with Claude Code, Trae, VS Code as the underlying model
- OpenRouter daily volume exceeded 100B tokens post-launch
M2.5 Weaknesses (from community testing)
- Failed the “car wash” logic test and “Ding Zhen” common-sense test
- Logic reasoning still has room for improvement vs frontier models
- Ecosystem and tooling are less mature than Claude's
- Less proven in long-running, complex agentic sessions
The Bigger Picture: What M2.5 Means for AI Coding
The Cost Floor Has Dropped
M2.5 proves that frontier-level coding performance is no longer gated by cost. At $1/hour, agent-heavy workflows become economically viable for startups and individual developers. This shifts the competition from “which model can do it” to “which model does it best within your workflow.”
Open Source Catches Up
M2.5 is the first open-source model to credibly match Claude Opus on coding benchmarks. Combined with its small activated parameter count (10B), it’s deployable on-premise — a critical requirement for enterprises with data sovereignty concerns.
RL Scaling is the New Battleground
MiniMax’s 108-day evolution from M2 to M2.5 — driven by RL scaling across 200,000+ real-world environments — demonstrates that post-training optimization is the new frontier. The rate of improvement, not the absolute benchmark score, may be the more important metric.
The Speed of Chinese AI
M2.5 launched alongside GLM-5 (Zhipu AI) and news of DeepSeek’s upcoming V4 — all during China’s Spring Festival. The “AI Spring Festival season” has become a concentrated showcase of China’s AI capabilities, with multiple models simultaneously targeting Opus 4.6-level performance.
Practical Recommendations
Use M2.5 When:
- Cost is the primary constraint — agent workflows, batch processing, or continuous operation
- Private deployment is required — the 10B activated-parameter footprint makes local hosting feasible
- Multi-language coding — M2.5’s training across 10+ languages and 200K environments shows strong multilingual coverage
- Office automation — Word, PPT, Excel tasks where M2.5 achieves SOTA
Use Opus 4.6 When:
- Maximum reliability matters — self-correction, long context, and proven real-world performance
- Complex multi-step agentic tasks — Agent Teams, effort tuning, and 1M context window
- Ecosystem integration — Claude Code, hooks, skills, MCP servers, IDE plugins
- Safety-critical applications — Anthropic’s constitutional AI approach provides stronger alignment guarantees
Consider Both:
Many developers are already using M2.5 for cost-sensitive agent tasks while reserving Opus 4.6 for critical work. OpenHands summarized it well: “It’s essentially a two-horse race — Opus represents the highest capability ceiling, M2.5 offers high capability at much lower cost.”
Conclusion
MiniMax M2.5 is a genuinely impressive model that narrows the gap with Opus 4.6 on coding benchmarks while offering a 20x cost advantage. But benchmarks don’t tell the whole story. Opus 4.6’s advantages in context length, self-correction, ecosystem depth, and safety alignment remain significant for production use cases.
The real winner? Developers. Competition this fierce — with frontier-level models available at $1/hour — means AI coding assistance is rapidly becoming accessible to everyone. The question is no longer “can AI code?” but “which AI should I code with, and when?”
The AI model wars of 2026 are just getting started.