From Userland Hack to SDK Native: The CMS Migration Blueprint
The complete architecture for migrating CMS Multi-Session from an external CLI orchestrator to SDK-native IterationEngine — replacing shell scripts with query() calls while keeping every advantage that made CMS work.
Knowledge Comic Series — Maximum technical depth, minimum words.
Why Migrate
CMS works. 50+ iterations, zero context rot, checkpoint recovery. So why touch it?
CMS Today:
Shell script → spawns Claude CLI process → parses stdout → writes checkpoint
├── Depends on CLI flags that can change any release
├── Parses <report> tags from unstructured text output
├── Cannot integrate with SDK's type system
└── Cannot be unit tested without a running Claude process
The Claude Agent SDK gives us query() — a single function call that creates a fresh conversation. Same isolation. But now it’s a Python function, not a subprocess. Testable. Typed. Version-pinned.
The goal: Replace the transport, keep the architecture.
The Core Insight
CMS’s power comes from one architectural decision:
The orchestrator is not an LLM.
A shell script loops. An LLM iterates. The shell script never fills its context window because it doesn’t have one.
The SDK migration must preserve this property:
WRONG approach:
Orchestrator = long-running query() session
→ Accumulates results from sub-queries
→ Same context rot problem
→ You just rebuilt Agent Teams Lead
RIGHT approach:
Orchestrator = Python class (IterationEngine)
→ Calls query() per iteration (no session resume)
→ Each query() gets a fresh context
→ IterationEngine has zero token footprint
→ You preserved CMS's core advantage
The key line:
result = await query(prompt=iteration_prompt, session_id=None)
# ^^^^^^^^^^^^
# No resume. Fresh context. Every time.
Architecture: Before and After
CMS External (Current)
═══════════════════════
┌─────────────────┐
│ cms-iterate.sh │ ← Shell script (not an LLM)
│ │
│ for i in 1..N: │
│ claude --print │─────→ [Claude CLI Process]
│ --prompt ... │ │
│ │ │ stdout
│ parse output │←──────────┘
│ write ckpt │───→ checkpoint.json
│ done │
└─────────────────┘
SDK Native (Target)
═══════════════════
┌──────────────────────┐
│ IterationEngine │ ← Python class (not an LLM)
│ │
│ for i in 1..N: │
│ await query( │─────→ [SDK query() call]
│ prompt=..., │ │
│ session_id=None │ │ ResultMessage
│ ) │←──────────┘
│ parse report │
│ checkpoint.save() │───→ checkpoint.json
│ done │
└──────────────────────┘
Same loop. Same checkpoint. Same isolation. Different transport.
What Changes
| Layer | CMS External | SDK Native |
|---|---|---|
| Orchestrator | Shell script | IterationEngine Python class |
| Spawn | claude --print --prompt | await query(prompt, session_id=None) |
| Output parsing | Regex on stdout <report>...</report> | IterationReport.parse(result.result) |
| Prompt template | .cms-iterate/prompts/iterator-1.md | prompt_loader.get("iterator-system", **vars) |
| Configuration | Hardcoded in shell | IterationConfig dataclass + defaults.yaml |
| Query options | CLI flags | orchestrator.create_fresh_query() |
| Error handling | Process crash = next iteration | try/except with typed errors |
| Testing | Integration only (needs running CLI) | Unit testable (mock query()) |
What Does NOT Change
| Component | Why |
|---|---|
| .cms-iterate/ directory structure | Checkpoint, reports, prompts, backups — all same paths |
| checkpoint.json format (v1.1.0) | Backward compatible — old checkpoints load into new engine |
| Decision logic | Same 4 stop conditions, same continue rules |
| External CMS Skills | Kept for now, deprecated later — engine runs alongside |
| Existing orchestrator methods | chat(), run(), run_skill() untouched |
| Agent definitions | All .claude/agents/ untouched |
The Four Design Decisions
Decision 1: Public API for Fresh Queries
The engine needs to call query() without accumulating context. This requires a clean public method on the orchestrator — not reaching into private internals.
WRONG:
engine calls orchestrator._build_options(session_id=None)
└── Private method. Breaks when orchestrator refactors internals.
RIGHT:
engine calls orchestrator.create_fresh_query(prompt, max_turns=30)
└── Public contract. Engine doesn't know how options are built.
Why it matters: The orchestrator owns the SDK client, API keys, model selection, and tool configuration. The engine shouldn’t duplicate any of that. It should say “give me a fresh conversation with this prompt” and get back a result.
One method. Clean boundary. The engine never imports the SDK directly.
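A minimal sketch of what that boundary could look like. This is an illustration, not the SDK's actual API: `_query_fn` stands in for the real `query()` call, and the constructor signature is hypothetical — the real orchestrator also owns API keys, model selection, and tool configuration.

```python
import asyncio
from typing import Any, Awaitable, Callable


class Orchestrator:
    """Sketch only: `_query_fn` stands in for the SDK's query();
    the real orchestrator builds options internally."""

    def __init__(self, query_fn: Callable[..., Awaitable[Any]]):
        self._query_fn = query_fn

    async def create_fresh_query(self, prompt: str, max_turns: int = 30) -> Any:
        # Public contract: always a fresh conversation, never a resume.
        return await self._query_fn(
            prompt=prompt, session_id=None, max_turns=max_turns
        )


# The engine only ever touches this one public method:
calls: list = []


async def stub_query(**kwargs):
    calls.append(kwargs)
    return "ResultMessage"


result = asyncio.run(Orchestrator(stub_query).create_fresh_query("run task A"))
```

Because the engine depends only on this one method, the same stub pattern makes it unit testable with no running Claude process.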
Decision 2: Parallel Execution from Day One
CMS already runs independent tasks in parallel:
CMS parallel (current):
cms-iterate.sh spawns 3 CLI processes simultaneously
└── Process 1: Task A (no dependencies)
└── Process 2: Task B (no dependencies)
└── Process 3: Task C (no dependencies)
Wait for all → continue with Task D (depends on A, B)
The SDK migration must preserve this. Not as a v2 feature. As a v1 parameter.
SDK parallel (target):
engine.start(request, parallel=True)
└── asyncio.gather(
query(task_A_prompt, session_id=None),
query(task_B_prompt, session_id=None),
query(task_C_prompt, session_id=None),
)
Wait for all → continue with Task D
Config interface:
iteration:
parallel: false # v1 default: sequential
max_parallel_queries: 3 # When parallel=true
Even if v1 ships with parallel=false, the parameter exists. No API signature change for v2.
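The fan-out above could be sketched with `asyncio.gather` plus a semaphore to honor `max_parallel_queries`. The helper names and `fake_query` stand-in are assumptions for illustration; only the fresh-context call shape (`session_id=None`) comes from the document.

```python
import asyncio


async def run_parallel(prompts, query_fn, max_parallel_queries=3):
    """Run independent tasks concurrently, each with a fresh context
    (session_id=None), capped at max_parallel_queries in flight."""
    sem = asyncio.Semaphore(max_parallel_queries)

    async def one(prompt):
        async with sem:
            return await query_fn(prompt=prompt, session_id=None)

    # gather() preserves input order, so results line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


# Stand-in for the SDK's query():
async def fake_query(prompt, session_id=None):
    await asyncio.sleep(0)
    return f"done: {prompt}"


results = asyncio.run(run_parallel(["Task A", "Task B", "Task C"], fake_query))
```

With `parallel=false` the engine would simply await each query in sequence; the semaphore only matters once the flag flips.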
Decision 3: Evolve Hook Points
CMS’s Self-Evolving Loop is its biggest differentiator over Agent Teams:
Iteration fails → extract learning → evolve skills → retry with better tools
The engine must have hook points for this, even if v1 doesn’t implement the full loop.
_run_loop pseudocode:
for iteration in range(max):
report = await _run_single_iteration(checkpoint)
if report.status == "completed":
_apply_report(checkpoint, report)
checkpoint.recovery.failure_count = 0 ← reset on success
elif report.status == "failed":
checkpoint.recovery.failure_count += 1
if config.enable_evolving: ← HOOK POINT
await _evolve(checkpoint, report) ← extract + evolve
if failure_count >= threshold:
break
checkpoint.save()
v1 implementation of _evolve(): pass. Four characters. But the hook is in the loop, the config flag exists, and the IterationReport already captures the failure context needed for learning.
v2 plugs in: experience-extractor agent → skill-evolver agent → retry with evolved skills. Zero changes to the loop structure.
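The failure branch of that loop could be sketched like this. `EvolveHook` and `on_failure` are hypothetical names for illustration; the `enable_evolving` flag and failure-threshold behavior are from the pseudocode above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class IterationConfig:
    enable_evolving: bool = False   # v1 default: hook present but dormant
    failure_threshold: int = 3


class EvolveHook:
    """Failure path only; on_failure returns True when the loop should stop."""

    def __init__(self, config: IterationConfig):
        self.config = config
        self.evolve_calls = []

    async def _evolve(self, checkpoint, report):
        # v1: effectively `pass`. v2 plugs in experience-extractor ->
        # skill-evolver here with zero changes to the loop structure.
        self.evolve_calls.append(report)

    async def on_failure(self, checkpoint, report):
        checkpoint["failure_count"] += 1
        if self.config.enable_evolving:          # <- the hook point
            await self._evolve(checkpoint, report)
        return checkpoint["failure_count"] >= self.config.failure_threshold


hook = EvolveHook(IterationConfig(enable_evolving=True, failure_threshold=2))
cp = {"failure_count": 0}
first_stop = asyncio.run(hook.on_failure(cp, "report-1"))
second_stop = asyncio.run(hook.on_failure(cp, "report-2"))
```

The key design property: whether `_evolve` is a no-op or a full extract-and-evolve pipeline, the loop around it never changes.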
Decision 4: Defensive Error Handling
CMS processes are naturally isolated — a crashed CLI process doesn’t take down the shell script. SDK query() calls run in the same Python process. An unhandled exception kills the engine.
Three error tiers:
Tier 1 — Expected failures (rate limit, timeout):
→ Return IterationReport(status="failed", errors=[...])
→ Engine continues to next iteration or stops based on threshold
→ Checkpoint saved before and after
Tier 2 — Query anomalies (no <report> tag in output):
→ Return IterationReport(status="partial", raw_output=text)
→ Engine logs warning, attempts next iteration
→ Human can inspect raw output in reports/
Tier 3 — Infrastructure failure (SDK crash, network down):
→ Save checkpoint immediately
→ Re-raise exception (engine stops)
→ Resume picks up from last saved checkpoint
Critical invariant: Checkpoint is always saved before any operation that might fail. The engine never loses more than one iteration’s work.
_run_single_iteration:
checkpoint.save() ← Save BEFORE query (crash protection)
result = await query(...) ← This might fail
report = parse(result) ← This might fail
checkpoint.save() ← Save AFTER success
return report
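The three tiers plus the save-before/save-after invariant could be sketched together. The exception classes and the `saved` list (standing in for disk writes) are illustrative assumptions; the tier behavior follows the description above.

```python
import asyncio


class QueryRateLimited(Exception):
    pass  # Tier 1 stand-in (rate limit, timeout)


class SDKCrash(Exception):
    pass  # Tier 3 stand-in (SDK crash, network down)


async def run_single_iteration(query_fn, checkpoint, saved):
    """Checkpoint saved before the risky query, and again only after
    full success. `saved` stands in for checkpoint.save() to disk."""
    saved.append(dict(checkpoint))               # save BEFORE query
    try:
        text = await query_fn()
    except QueryRateLimited as exc:              # Tier 1: expected failure
        return {"status": "failed", "errors": [str(exc)]}
    except SDKCrash:                             # Tier 3: save, then re-raise
        saved.append(dict(checkpoint))
        raise
    if "<report>" not in text:                   # Tier 2: query anomaly
        return {"status": "partial", "raw_output": text}
    checkpoint["current_iteration"] += 1
    saved.append(dict(checkpoint))               # save AFTER success
    return {"status": "completed"}


# Exercising the three non-crash outcomes with stand-in query functions:
saved, cp = [], {"current_iteration": 0}


async def ok():
    return "<report>{}</report>"


async def limited():
    raise QueryRateLimited("rate limited")


async def anomaly():
    return "output without report tags"


completed = asyncio.run(run_single_iteration(ok, cp, saved))
failed = asyncio.run(run_single_iteration(limited, cp, saved))
partial = asyncio.run(run_single_iteration(anomaly, cp, saved))
```

Note that even the Tier 1 and Tier 2 paths saved the checkpoint before the query ran, so a later crash can never lose more than one iteration's work.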
Data Models
Checkpoint (v1.1.0 Compatible)
Checkpoint
├── version: "1.1.0"
├── iteration_type: "auto-cycle" | "auto-explore" | "custom"
├── request: str
├── current_iteration: int
├── max_iterations: int
├── status: "running" | "completed" | "failed" | "stopped"
│
├── original_context
│ ├── goal: str
│ └── acceptance_criteria_file: str
│
├── context_summary
│ ├── current: str
│ ├── key_decisions: list[str]
│ ├── blockers: list[str]
│ └── next_action: str
│
├── completed_items: list[dict]
├── pending_items: list[dict]
├── history: list[dict] ← One entry per iteration
│
├── progress
│ ├── percent: int
│ └── estimated_remaining: int
│
└── recovery
├── last_successful_iteration: int
└── failure_count: int
Serialization contract: Checkpoint.to_dict() output is byte-identical to what the CMS shell script writes. Old checkpoints load with Checkpoint.from_file(). New checkpoints remain readable by old CMS.
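A sketch of the roundtrip contract on a subset of the fields above. The `from_dict` helper is an assumption (the document names `from_file`/`to_dict`); the point is that serialize-then-deserialize must be lossless.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Recovery:
    last_successful_iteration: int = 0
    failure_count: int = 0


@dataclass
class Checkpoint:
    """Subset of the v1.1.0 schema for illustration."""
    version: str = "1.1.0"
    request: str = ""
    current_iteration: int = 0
    max_iterations: int = 10
    status: str = "running"
    recovery: Recovery = field(default_factory=Recovery)

    def to_dict(self) -> dict:
        return asdict(self)  # recursively converts nested dataclasses

    @classmethod
    def from_dict(cls, d: dict) -> "Checkpoint":
        d = dict(d)
        d["recovery"] = Recovery(**d.get("recovery", {}))
        return cls(**d)


# Roundtrip through JSON must be lossless so old checkpoints keep loading:
cp = Checkpoint(request="Refactor auth", current_iteration=3)
restored = Checkpoint.from_dict(json.loads(json.dumps(cp.to_dict())))
```

A byte-compare of `to_dict()` output against a real CMS checkpoint (Phase 4 below) is what actually proves the contract holds.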
IterationReport
IterationReport
├── task_id: str
├── iteration: int
├── status: "completed" | "partial" | "failed" | "blocked"
│
├── iteration_result
│ ├── action_taken: str
│ ├── files_changed: list[str]
│ ├── tests_passed: bool
│ └── errors: list[str]
│
├── checkpoint_update
│ ├── completed_items: list[dict]
│ ├── pending_items: list[dict]
│ ├── progress_percent: int
│ └── context_summary: str
│
└── continue_decision
├── should_continue: bool
└── reason: str
Parsed from: <report>JSON</report> tags in query() output — same format CMS already uses.
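Extracting that payload could look like the following sketch. The regex and helper name are assumptions; the `<report>JSON</report>` envelope is the format the document describes.

```python
import json
import re
from typing import Optional

# DOTALL lets the JSON payload span multiple lines.
REPORT_RE = re.compile(r"<report>\s*(\{.*\})\s*</report>", re.DOTALL)


def parse_report(text: str) -> Optional[dict]:
    """Pull the JSON payload out of <report>...</report> in query() output.
    Returns None when no tag is found (the Tier-2 'partial' case)."""
    match = REPORT_RE.search(text)
    if match is None:
        return None
    return json.loads(match.group(1))


payload = parse_report(
    'preamble text <report>{"status": "completed", "iteration": 4}</report> trailer'
)
```

Malformed JSON inside the tags would raise `json.JSONDecodeError`, which the engine would treat like a Tier-2 anomaly rather than crash on.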
Engine Public API
IterationEngine
├── start(request, type, max_iterations, ...) → Checkpoint
│ Creates fresh checkpoint, runs loop
│
├── resume() → Checkpoint
│ Loads checkpoint from disk, continues from last iteration
│
├── stop() → None
│ Sets flag, loop exits after current iteration completes
│
└── status() → Checkpoint
Reads checkpoint from disk (no side effects)
Exposed as 3 MCP tools: iteration_start, iteration_resume, iteration_status
Also exposed as orchestrator method: orchestrator.run_iterations(request, ...)
Two entry points, same engine. MCP tools for Claude-driven workflows. Orchestrator method for programmatic use.
Decision Logic (Unchanged from CMS)
CONTINUE if ALL true:
├── pending_items is not empty
├── current_iteration < max_iterations
├── recovery.failure_count < failure_threshold (default: 3)
└── stop() was not called
STOP if ANY true:
├── pending_items is empty → status: "completed"
├── current_iteration >= max_iterations → status: "stopped"
├── recovery.failure_count >= threshold → status: "failed"
└── stop() was called → status: "stopped"
Same four conditions. Same behavior. Now testable without running Claude.
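As a pure function, the four conditions become trivially unit testable. The function name and tuple return are illustrative assumptions; the conditions and resulting statuses are exactly those listed above.

```python
def should_continue(pending_items, current_iteration, max_iterations,
                    failure_count, stop_requested, failure_threshold=3):
    """The four CMS stop conditions, checked in order.
    Returns (continue?, status)."""
    if not pending_items:
        return False, "completed"
    if current_iteration >= max_iterations:
        return False, "stopped"
    if failure_count >= failure_threshold:
        return False, "failed"
    if stop_requested:
        return False, "stopped"
    return True, "running"
```

Each stop condition gets its own one-line test, with no Claude process anywhere in sight.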
File Map
NEW FILES:
src/core/iteration_engine.py ~400 lines ← Engine + data models
config/prompts/iterator-system.md ~250 lines ← Ported from CMS prompt
tests/core/test_iteration_engine.py ~300 lines ← Unit + integration tests
MODIFIED FILES:
src/core/orchestrator.py +30 lines ← run_iterations() + create_fresh_query()
src/tools/self_dev_tools.py +90 lines ← 3 MCP tools
config/defaults.yaml +6 lines ← iteration config section
UNTOUCHED:
.cms-iterate/ ← Same directory, same format
.claude/agents/ ← All agent definitions
.claude/skills/cms-* ← Kept as fallback
src/core/orchestrator.py (rest) ← chat(), run(), run_skill()
Total new code: ~1,070 lines. Total modified: ~126 lines. Risk surface: 2 existing files get small additions.
Implementation Order
Phase 1 — Foundation (no integration, pure unit testable)
1. iteration_engine.py — Checkpoint + IterationReport data models only
2. test_iteration_engine.py — Data model tests + backward compat with real checkpoint
3. Run: pytest tests/core/test_iteration_engine.py -v
Phase 2 — Engine Logic (mock query, still no real integration)
4. iteration_engine.py — IterationEngine class with decision logic
5. iterator-system.md — Port prompt template
6. test_iteration_engine.py — Decision logic + mocked iteration tests
7. Run: pytest tests/core/test_iteration_engine.py -v
Phase 3 — Integration (wire into orchestrator)
8. orchestrator.py — create_fresh_query() + run_iterations()
9. self_dev_tools.py — 3 MCP tools
10. defaults.yaml — iteration config
11. Run: pytest tests/core/ -v (full suite, no regression)
Phase 4 — Smoke Test
12. Load real .cms-iterate/checkpoint.json → verify all fields
13. Checkpoint roundtrip: from_file() → to_dict() → byte compare
14. IterationReport parse real CMS report text
Each phase is independently shippable. Phase 1 alone provides typed checkpoint handling. Phase 2 adds testable decision logic. Phase 3 connects everything. Phase 4 validates backward compatibility.
Migration Timeline
Week 1: Phase 1-2 (data models + engine logic)
└── CMS external still runs all production workloads
└── Engine exists but isn't wired in
Week 2: Phase 3-4 (integration + smoke test)
└── Both paths available: orchestrator.run_iterations() AND /cms skill
└── Run same task through both, compare checkpoints
Week 3: Shadow mode
└── Engine runs alongside CMS for real tasks
└── Compare checkpoint outputs
└── If divergence: investigate, fix engine, keep CMS as source of truth
Week 4: Switch
└── Default to engine
└── CMS skills marked deprecated (not removed)
└── Remove CMS skills in v2 after 1 month with no issues
What This Enables
Once the engine is SDK-native, three things become possible that CMS couldn’t do:
1. Programmatic Iteration from Any Python Code
# Before: only through Claude CLI / Skill invocation
# After:
engine = IterationEngine(orchestrator, config)
checkpoint = await engine.start("Refactor auth module to use JWT")
print(f"Completed in {checkpoint.current_iteration} iterations")
Iteration loops become a library call. Tests, CI pipelines, external tools can all trigger them.
2. Typed Error Handling
# Before: parse stdout text for error patterns
# After:
if report.status == "failed":
for error in report.iteration_result.errors:
logger.error(f"Iteration {report.iteration}: {error}")
No more regex. No more “did the CLI output something unexpected?” Structured data all the way down.
3. Composable with Agent Teams
# The endgame: Agent Teams teammate that runs iterations
Task(
subagent_type="general-purpose",
team_name="dev-team",
prompt="Use iteration_start tool to run 10 iterations on the auth module"
)
An Agent Teams teammate can invoke the IterationEngine through MCP tools. CMS inside Agent Teams. Best of both architectures.
The Principle
Build userland solutions with clean interfaces. When the platform catches up, your migration is a transport swap, not a rewrite.
CMS → IterationEngine is 1,070 lines of new code and 126 lines of modifications. The checkpoint format doesn’t change. The directory structure doesn’t change. The decision logic doesn’t change.
Because CMS was built with the right abstractions from the start.
Further Reading
- We Built Agent Teams Before Agent Teams Existed — Why CMS and Agent Teams converged on the same architecture
- Agent Teams: Mesh Topology and Five Ways to Die — Communication topology and failure modes
- Multi-Agent Architecture: Parallel Execution Patterns — Sub-agent fundamentals