MiniMax M2.5 vs Claude Opus 4.6: The $1/Hour Model That Shook the AI World
MiniMax M2.5 claims to match Opus 4.6 in coding and agentic tasks at 1/20th the cost. We break down the benchmarks, architecture, and what this means for developers choosing their AI coding model.
On February 12, 2026 — just six days after Anthropic released Claude Opus 4.6 — Chinese AI company MiniMax dropped M2.5, a model that claims to match Opus 4.6’s coding performance at 1/20th the price. MiniMax’s stock surged 35% in a single day, pushing its market cap past HK$210 billion.
Is M2.5 really an “Opus killer”? Or is the story more nuanced? Let’s dig into the data.
MiniMax: Who Are They?
MiniMax (Shanghai Xiyu Jizhi Technology) is one of China’s leading AI startups, founded in 2021. They’ve released three models in 108 days — M2, M2.1, and M2.5 — maintaining a nearly monthly release cadence that outpaces most competitors.
Key facts about M2.5:
- Architecture: Sparse MoE (Mixture of Experts), 230B total parameters, only 10B activated
- Release Date: February 12, 2026
- Open Source: Yes, weights available on ModelScope
- Agent-Native: MiniMax bills it as the first model explicitly designed for agent workflows from the ground up
- Training: RL across 200,000+ real-world environments, 10+ programming languages
Benchmark Showdown: M2.5 vs Opus 4.6
Coding Benchmarks
| Benchmark | MiniMax M2.5 | Claude Opus 4.6 | Winner |
|---|---|---|---|
| SWE-Bench Verified (Claude Code scaffold) | 80.2% | 80.8% | Opus 4.6 (marginal) |
| SWE-Bench Verified (Droid scaffold) | 79.7% | 78.9% | M2.5 |
| SWE-Bench Verified (OpenCode scaffold) | 76.1% | 75.9% | M2.5 (marginal) |
| Multi-SWE-Bench | 51.3% | — | M2.5 (SOTA) |
| VIBE-Pro (internal) | ~Opus 4.5 level | — | Comparable |
Agentic & Search
| Benchmark | MiniMax M2.5 | Claude Opus 4.6 |
|---|---|---|
| BrowseComp (with context mgmt) | 76.3% | — |
| Wide Search | SOTA | — |
| RISE (internal search eval) | SOTA | — |
Task Completion Speed
| Metric | MiniMax M2.5 | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench avg time | 22.8 min | 22.9 min |
| Tokens per task | 3.52M | — |
| Improvement vs M2.1 | 37% faster | — |
The speed parity is remarkable — M2.5 completes SWE-Bench tasks in essentially the same time as Opus 4.6, despite being a much smaller model.
The Cost Revolution
This is where M2.5 truly differentiates itself:
| Model | Input Price | Output Price | Cost per Hour of Continuous Use |
|---|---|---|---|
| MiniMax M2.5-Lightning (100 TPS) | $0.30/M tokens | $2.40/M tokens | ~$1.00 |
| MiniMax M2.5 (50 TPS) | $0.15/M tokens | $1.20/M tokens | ~$0.30 |
| Claude Opus 4.6 (at 100 TPS) | $15/M tokens | $75/M tokens | ~$27.00 |
That’s a 27x price difference for comparable performance. As MiniMax puts it: “$10,000 can run 4 M2.5 agents continuously for an entire year.”
For agent-heavy workloads where token consumption is massive — think SWE-Bench tasks consuming 3.5M tokens each — this cost advantage is transformative.
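The hourly figures in the pricing table follow from simple arithmetic. The back-of-envelope sketch below is ours, assuming agent runtime cost is dominated by output tokens generated at the listed speed:

```python
# Back-of-envelope check of the hourly costs in the table above.
# Assumption (ours): agent runtime cost is dominated by output tokens,
# so we price one hour of continuous generation at the listed speed.

def hourly_cost(output_price_per_m: float, tps: float) -> float:
    """Cost of generating tokens for one hour at `tps` tokens/sec."""
    tokens_generated = tps * 3600          # tokens produced in one hour
    return tokens_generated / 1e6 * output_price_per_m

print(f"M2.5-Lightning (100 TPS): ${hourly_cost(2.40, 100):.2f}")  # $0.86
print(f"M2.5 standard   (50 TPS): ${hourly_cost(1.20, 50):.2f}")   # $0.22
print(f"Opus 4.6       (100 TPS): ${hourly_cost(75.0, 100):.2f}")  # $27.00
```

Adding input-token costs nudges the M2.5 figures toward the table's rounded $1.00 and $0.30, and dividing $27 by $1 yields the roughly 27x gap cited above.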
Architecture Deep Dive: Why 10B Activated Parameters?
M2.5 uses a Sparse MoE architecture with 230B total parameters but only 10B activated at inference time. This is the secret behind its speed and cost efficiency:
- 100 TPS throughput: Nearly 2x faster than most frontier models
- Low VRAM footprint: Makes private deployment feasible
- Inference efficiency: Lower compute per token means lower API costs
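A toy top-k router makes the activated-vs-total distinction concrete: every token's compute touches only the few experts the router selects, not the full parameter count. This is an illustrative sketch with made-up dimensions, not MiniMax's actual layer:

```python
import numpy as np

# Illustrative sparse-MoE sketch (ours, not MiniMax's architecture):
# each token is routed to its top-k experts, so per-token compute scales
# with the activated parameters (k experts), not the total expert count.

rng = np.random.default_rng(0)

def moe_layer(x, expert_weights, router, k=2):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                     # softmax over selected experts
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ expert_weights[e])
    return out

d, n_experts, tokens = 16, 8, 4
experts = rng.standard_normal((n_experts, d, d)) / np.sqrt(d)
router = rng.standard_normal((d, n_experts))
x = rng.standard_normal((tokens, d))
y = moe_layer(x, experts, router, k=2)
print(y.shape)  # each token touched only 2 of the 8 experts
```

Scaling this shape up is what lets a 230B-parameter model run with roughly the inference cost of a 10B dense one.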
Forge: The Agent-Native RL Framework
MiniMax built Forge, an in-house agent-native reinforcement learning framework with several innovations:
- Decoupled Architecture: Separates the training engine from the agent scaffold, enabling optimization across arbitrary tool configurations
- CISPO Algorithm: Ensures MoE model stability during large-scale training
- Process Rewards: End-to-end quality monitoring for long agent trajectories, solving the credit assignment problem
- Tree-Structured Sample Merging: Achieves ~40x training speedup
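MiniMax has not published Forge's internals, but a ~40x speedup from sample merging is plausibly rooted in prefix sharing: many rollouts from the same environment start identically, and a trie lets the trainer process each shared prefix once. The sketch below is our reconstruction of that idea, not Forge code:

```python
# Illustrative sketch (ours, not Forge's implementation) of
# tree-structured sample merging: rollouts that share a prefix are
# merged into a trie so each shared step is processed only once.

def count_unique_steps(rollouts):
    """Steps processed when shared prefixes are merged into a trie."""
    trie = {}
    steps = 0
    for rollout in rollouts:
        node = trie
        for action in rollout:
            if action not in node:
                node[action] = {}
                steps += 1      # only new (unshared) steps cost compute
            node = node[action]
    return steps

rollouts = [
    ["read", "plan", "edit", "test"],
    ["read", "plan", "edit", "fix"],
    ["read", "plan", "search"],
]
flat = sum(len(r) for r in rollouts)       # 11 steps replayed naively
merged = count_unique_steps(rollouts)      # 6 unique steps in the trie
print(flat, merged)
```

The gap widens with rollout count and prefix length, which is where large multiples like 40x would come from.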
Emergent “Spec-Writing” Behavior
During training, M2.5 developed an emergent behavior: before writing any code, it actively plans and decomposes features like a software architect. This “spec-writing tendency” — planning structure, features, and UI design before implementation — emerged naturally from RL training rather than being explicitly programmed.
Where Opus 4.6 Still Wins
While M2.5 is impressive, Opus 4.6 retains clear advantages in several areas:
1. Context Window
Opus 4.6 offers 1M token context (beta) — the first Opus-class model with this capability. M2.5’s context management uses a different approach (discarding history when usage exceeds 30% of max context), which is less sophisticated.
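The 30% discard policy the article attributes to M2.5 can be sketched in a few lines. This is our illustration; the context limit and the `len`-based token counting are stand-ins, not MiniMax's implementation:

```python
# Hedged sketch (ours) of the policy described above: once the running
# transcript exceeds 30% of the maximum context, discard the oldest
# turns. Counting tokens with len() is a stand-in for a real tokenizer.

MAX_CONTEXT = 200_000            # assumed context limit, in tokens
BUDGET = int(MAX_CONTEXT * 0.30)

def trim_history(turns, count_tokens=len):
    """Drop the oldest turns until the transcript fits the 30% budget."""
    turns = list(turns)
    while turns and sum(count_tokens(t) for t in turns) > BUDGET:
        turns.pop(0)             # discard the oldest turn first
    return turns

history = ["x" * 45_000, "y" * 30_000, "z" * 20_000]
print([len(t) for t in trim_history(history)])  # → [30000, 20000]
```

Dropping whole turns is simple and fast, but it loses information that summarization-based or retrieval-based context management would keep, which is the "less sophisticated" trade-off noted above.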
2. Self-Correction & Code Review
Opus 4.6’s “catches its own mistakes” capability — validated across Terminal-Bench 2.0 (65.4%) and real-world codebase testing — remains industry-leading. M2.5 hasn’t published comparable data on self-correction.
3. Reasoning Depth
On ARC-AGI, Opus 4.6 scored 68%, far exceeding other mainstream models. M2.5’s reasoning capabilities, while improved, still have gaps — multiple reviewers noted it failed common-sense logic tests that Opus handles correctly.
4. Agent Teams & Ecosystem
Opus 4.6 comes with Agent Teams — native multi-agent collaboration with shared task lists, mailbox messaging, and independent context windows. Plus the entire Claude Code ecosystem: hooks, skills, MCP servers, and IDE integrations. M2.5 has its own “Expert” system (10,000+ Experts built on launch day), but the ecosystem is younger.
5. Safety & Alignment
Anthropic’s focus on constitutional AI and safety research gives Opus 4.6 a more mature safety profile. MiniMax’s safety approach is less documented publicly.
Real-World Usage: What Developers Say
M2.5 Strengths (from community testing)
- Excellent in multi-language coding — supports Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, Ruby
- Full-stack capability — Web, Android, iOS, Windows across server APIs, business logic, databases
- Strong Office automation — Word, PowerPoint, Excel financial modeling at professional quality
- Works well with Claude Code, Trae, VS Code as the underlying model
- OpenRouter daily volume exceeded 100B tokens post-launch
M2.5 Weaknesses (from community testing)
- Failed the “car wash” logic test and “Ding Zhen” common-sense test
- Logic reasoning still has room for improvement vs frontier models
- Ecosystem and tooling are less mature than Claude's
- Less proven in long-running, complex agentic sessions
The Bigger Picture: What M2.5 Means for AI Coding
The Cost Floor Has Dropped
M2.5 proves that frontier-level coding performance is no longer gated by cost. At $1/hour, agent-heavy workflows become economically viable for startups and individual developers. This shifts the competition from “which model can do it” to “which model does it best within your workflow.”
Open Source Catches Up
M2.5 is the first open-source model to credibly match Claude Opus on coding benchmarks. Combined with its small activated parameter count (10B), it’s deployable on-premise — a critical requirement for enterprises with data sovereignty concerns.
RL Scaling is the New Battleground
MiniMax’s 108-day evolution from M2 to M2.5 — driven by RL scaling across 200,000+ real-world environments — demonstrates that post-training optimization is the new frontier. The rate of improvement, not the absolute benchmark score, may be the more important metric.
The Speed of Chinese AI
M2.5 launched alongside GLM-5 (Zhipu AI) and news of DeepSeek’s upcoming V4 — all during China’s Spring Festival. The “AI Spring Festival season” has become a concentrated showcase of China’s AI capabilities, with multiple models simultaneously targeting Opus 4.6-level performance.
Practical Recommendations
Use M2.5 When:
- Cost is the primary constraint — agent workflows, batch processing, or continuous operation
- Private deployment is required — the 10B activated-parameter footprint makes local hosting feasible
- Multi-language coding — M2.5’s training across 10+ languages and 200K environments shows strong multilingual coverage
- Office automation — Word, PPT, Excel tasks where M2.5 achieves SOTA
Use Opus 4.6 When:
- Maximum reliability matters — self-correction, long context, and proven real-world performance
- Complex multi-step agentic tasks — Agent Teams, effort tuning, and 1M context window
- Ecosystem integration — Claude Code, hooks, skills, MCP servers, IDE plugins
- Safety-critical applications — Anthropic’s constitutional AI approach provides stronger alignment guarantees
Consider Both:
Many developers are already using M2.5 for cost-sensitive agent tasks while reserving Opus 4.6 for critical work. OpenHands summarized it well: “It’s essentially a two-horse race — Opus represents the highest capability ceiling, M2.5 offers high capability at much lower cost.”
Conclusion
MiniMax M2.5 is a genuinely impressive model that narrows the gap with Opus 4.6 on coding benchmarks while offering a 20x cost advantage. But benchmarks don’t tell the whole story. Opus 4.6’s advantages in context length, self-correction, ecosystem depth, and safety alignment remain significant for production use cases.
The real winner? Developers. Competition this fierce — with frontier-level models available at $1/hour — means AI coding assistance is rapidly becoming accessible to everyone. The question is no longer “can AI code?” but “which AI should I code with, and when?”
The AI model wars of 2026 are just getting started.