From Userland Hack to SDK Native: A CMS Migration Blueprint

Knowledge Comic series: maximum technical depth, minimum words.

Why Migrate

CMS works. 50+ iterations, zero context rot, checkpoint recovery. So why touch it?

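CMS today:
  Shell script → launch Claude CLI process → parse stdout → write checkpoint
  ├── Depends on CLI flags that may change with every release
  ├── Parses <report> tags out of unstructured text output
  ├── Cannot integrate with the SDK's type system
  └── Cannot be unit-tested without a running Claude process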

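The Claude Agent SDK gives us query(): a function call that starts a brand-new conversation. Same isolation. But now it is a Python function, not a subprocess. Testable, typed, version-pinned.

Goal: replace the transport layer, keep the architecture.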

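The Core Insight

CMS's power comes from one architectural decision:

The orchestrator is not an LLM.

The shell script loops. The LLM iterates. The shell script can never fill a context window, because it does not have one.

The SDK migration must preserve this property: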

Wrong:
  Orchestrator = a long-running query() session
  → accumulates sub-query results
  → same context rot problem
  → you have just rebuilt the Agent Teams Lead

Right:
  Orchestrator = Python class (IterationEngine)
  → calls query() on every iteration (never resumes a session)
  → every query() gets a fresh context
  → IterationEngine's token footprint is zero
  → you keep CMS's core advantage

The line that matters:

result = await query(prompt=iteration_prompt, session_id=None)
#                                             ^^^^^^^^^^^^^^^
#                             No resume. Fresh context. Every time.
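A minimal sketch of what "no resume" means for the loop. `fresh_query` is a hypothetical stand-in for whatever call the orchestrator makes into the SDK; its signature (prompt in, text out) is an assumption, not the SDK's API. The point is only that the engine carries a bounded summary string between iterations and never a conversation handle.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-in for the SDK-backed call: prompt in, text out.
QueryFn = Callable[[str], Awaitable[str]]


class IterationEngine:
    """Plain Python loop: holds a small summary string, never a conversation."""

    def __init__(self, fresh_query: QueryFn) -> None:
        self._fresh_query = fresh_query

    async def run(self, goal: str, max_iterations: int) -> list[str]:
        summary = "nothing done yet"
        outputs: list[str] = []
        for i in range(1, max_iterations + 1):
            # The prompt is rebuilt from the goal and the summary each time;
            # no session handle is carried across iterations.
            prompt = f"Goal: {goal}\nIteration: {i}\nState: {summary}"
            text = await self._fresh_query(prompt)
            outputs.append(text)
            summary = text[:200]  # keep only a bounded summary
        return outputs


async def fake_query(prompt: str) -> str:
    # Test double: echoes the iteration line so the loop can be observed.
    return prompt.splitlines()[1]


results = asyncio.run(IterationEngine(fake_query).run("refactor auth", 3))
print(results)
```

Because the engine's state is a short string plus a list it owns, its size is governed by Python code rather than by a model's context window.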

Architecture: Before and After

CMS External (today)
════════════════════

  ┌──────────────────┐
  │  cms-iterate.sh  │  ← Shell script (not an LLM)
  │                  │
  │  for i in 1..N:  │
  │    claude --print│─────→  [Claude CLI Process]
  │      --prompt ...│           │
  │                  │           │ stdout
  │    parse output  │←──────────┘
  │    write ckpt    │───→  checkpoint.json
  │  done            │
  └──────────────────┘

SDK Native (target)
═══════════════════

  ┌──────────────────────┐
  │  IterationEngine     │  ← Python class (not an LLM)
  │                      │
  │  for i in 1..N:      │
  │    await query(      │─────→  [SDK query() call]
  │      prompt=...,     │           │
  │      session_id=None │           │ ResultMessage
  │    )                 │←──────────┘
  │    parse report      │
  │    checkpoint.save() │───→  checkpoint.json
  │  done                │
  └──────────────────────┘

Same loop. Same checkpoints. Same isolation. Different transport layer.

What Changed

Layer	CMS External	SDK Native
Orchestrator	Shell script	`IterationEngine` Python class
Launch	`claude --print --prompt`	`await query(prompt, session_id=None)`
Output parsing	Regex over stdout for `<report>...</report>`	`IterationReport.parse(result.result)`
Prompt templates	`.cms-iterate/prompts/iterator-1.md`	`prompt_loader.get("iterator-system", **vars)`
Configuration	Hard-coded in the shell script	`IterationConfig` dataclass + `defaults.yaml`
Query options	CLI flags	`orchestrator.create_fresh_query()`
Error handling	Process crash = move on to the next iteration	`try/except` with typed errors
Testing	Integration tests only (needs a running CLI)	Unit-testable (mock `query()`)
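The last row of the table is the practical payoff, so here is a hedged sketch of what "mock `query()`" could look like. `run_one_iteration` is a hypothetical helper invented for this example; only `unittest.mock.AsyncMock` and `asyncio.run` are real library calls.

```python
import asyncio
from unittest.mock import AsyncMock


async def run_one_iteration(query, prompt: str) -> dict:
    """Hypothetical single step: call the injected query and wrap the result."""
    text = await query(prompt)
    status = "completed" if "<report>" in text else "partial"
    return {"status": status, "raw": text}


def check_iteration_without_claude() -> dict:
    # AsyncMock replaces the real query(); no CLI process or network is involved.
    fake_query = AsyncMock(return_value="<report>{}</report>")
    outcome = asyncio.run(run_one_iteration(fake_query, "do the thing"))
    # Verify the engine step passed the prompt through exactly once.
    fake_query.assert_awaited_once_with("do the thing")
    return outcome


outcome = check_iteration_without_claude()
print(outcome["status"])
```

Under pytest the same body would live in a `test_*` function; it is called directly here so the snippet runs on its own.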

What Did Not Change

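Component	Reason
`.cms-iterate/` directory layout	Checkpoints, reports, prompts, backups: all at the same paths
`checkpoint.json` format (v1.1.0)	Backward compatible: old checkpoints load into the new engine
Decision logic	Same 4 stop conditions, same continue rules
External CMS skills	Kept for now, deprecated later; the engine runs alongside them
Existing orchestrator methods	`chat()`, `run()`, `run_skill()` untouched
Agent definitions	Everything under `.claude/agents/` untouched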

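Four Design Decisions

Decision 1: A Public API for Fresh Queries

The engine needs to call query() without accumulating context. That calls for a clean public method on the orchestrator, not poking at private internals.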

Wrong:
  engine calls orchestrator._build_options(session_id=None)
  └── Private method. Breaks the moment the orchestrator is refactored.

Right:
  engine calls orchestrator.create_fresh_query(prompt, max_turns=30)
  └── Public contract. The engine does not know how the options are built.

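Why it matters: the orchestrator owns the SDK client, API keys, model selection, and tool configuration. The engine should not duplicate any of that. It only says "give me a fresh conversation with this prompt" and gets a result back.

One method. A clean boundary. The engine never imports the SDK directly.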
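One way to make that boundary explicit is a structural type. The sketch below assumes `create_fresh_query` returns plain text; the real return type (the diagrams mention a ResultMessage) and everything named `FreshQueryProvider`, `Engine`, or `StubOrchestrator` are illustrative assumptions.

```python
import asyncio
from typing import Protocol


class FreshQueryProvider(Protocol):
    """The only thing the engine needs from the orchestrator (assumed shape)."""

    async def create_fresh_query(self, prompt: str, max_turns: int = 30) -> str: ...


class Engine:
    def __init__(self, orchestrator: FreshQueryProvider) -> None:
        # The engine depends on the protocol only; no SDK import in this module.
        self._orchestrator = orchestrator

    async def step(self, prompt: str) -> str:
        return await self._orchestrator.create_fresh_query(prompt, max_turns=30)


class StubOrchestrator:
    """Test double; a real orchestrator would build SDK options internally."""

    async def create_fresh_query(self, prompt: str, max_turns: int = 30) -> str:
        return f"[{max_turns} turns] {prompt}"


reply = asyncio.run(Engine(StubOrchestrator()).step("hello"))
print(reply)
```

The stub satisfies the protocol without inheriting from anything, which is what keeps the engine testable in isolation.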

Decision 2: Parallel Execution from Day One

CMS can already run independent tasks in parallel:

CMS parallel (today):
  cms-iterate.sh launches 3 CLI processes at once
  ├── Process 1: Task A (no dependencies)
  ├── Process 2: Task B (no dependencies)
  └── Process 3: Task C (no dependencies)
  Wait for all to finish → continue with Task D (depends on A, B)

The SDK migration must preserve this. It is not a v2 feature. It is a v1 parameter.

SDK parallel (target):
  engine.start(request, parallel=True)
  └── asyncio.gather(
        query(task_A_prompt, session_id=None),
        query(task_B_prompt, session_id=None),
        query(task_C_prompt, session_id=None),
      )
  Wait for all to finish → continue with Task D

Configuration surface:

iteration:
  parallel: false          # v1 default: serial
  max_parallel_queries: 3  # used when parallel=true

Even if v1 ships with parallel=false, the parameter already exists. v2 does not need to change the API signature.
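A sketch of how the two config keys could map onto asyncio, assuming a hypothetical `query` callable that takes a prompt and returns text. `run_batch` and the demo's `fake_query` are invented names; `asyncio.Semaphore` and `asyncio.gather` are the standard-library pieces doing the work.

```python
import asyncio
from typing import Awaitable, Callable

QueryFn = Callable[[str], Awaitable[str]]  # hypothetical fresh-query callable


async def run_batch(query: QueryFn, prompts: list[str],
                    parallel: bool = False,
                    max_parallel_queries: int = 3) -> list[str]:
    if not parallel:
        # Serial path: same signature, one query at a time.
        return [await query(p) for p in prompts]

    gate = asyncio.Semaphore(max_parallel_queries)

    async def bounded(p: str) -> str:
        async with gate:  # at most max_parallel_queries in flight
            return await query(p)

    # gather() returns results in the order the prompts were given.
    return list(await asyncio.gather(*(bounded(p) for p in prompts)))


stats = {"active": 0, "peak": 0}


async def fake_query(prompt: str) -> str:
    # Test double that records how many calls overlap.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return prompt.upper()


results = asyncio.run(
    run_batch(fake_query, ["a", "b", "c", "d"], parallel=True, max_parallel_queries=2)
)
print(results, stats["peak"])
```

Flipping `parallel` changes scheduling only; callers see the same list of results either way, which is what lets v1 ship serial without locking the API.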

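Decision 3: Evolution Hook Points

CMS's Self-Evolving Loop is its biggest differentiator from Agent Teams:

Iteration fails → extract learnings → evolve skills → retry with better tools

The engine must have hook points for this, even if v1 does not implement the full loop.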

_run_loop pseudocode:

  for iteration in range(config.max_iterations):
      report = await _run_single_iteration(checkpoint)

      if report.status == "completed":
          _apply_report(checkpoint, report)
          checkpoint.recovery.failure_count = 0    ← reset on success

      elif report.status == "failed":
          checkpoint.recovery.failure_count += 1

          if config.enable_evolving:               ← HOOK POINT
              await _evolve(checkpoint, report)     ← extract + evolve

          if checkpoint.recovery.failure_count >= config.failure_threshold:
              break

      checkpoint.save()

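The v1 implementation of _evolve(): pass. Four characters. But the hook is in the loop, the config flag exists, and IterationReport already captures the failure context that learning needs.

v2 plugs in: experience-extractor agent → skill-evolver agent → retry with the evolved skills. Zero changes to the loop structure.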
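To show that the hook costs nothing when disabled and needs no loop changes when enabled, here is a runnable reduction of the loop. It replaces real iterations with a list of canned statuses; `Config`, `LoopState`, and the call counter inside `evolve` are assumptions made for the demo (the source's v1 hook body is just `pass`).

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Config:
    max_iterations: int = 10
    failure_threshold: int = 3
    enable_evolving: bool = False


@dataclass
class LoopState:
    failure_count: int = 0
    evolve_calls: int = 0
    history: list = field(default_factory=list)


async def evolve(state: LoopState, status: str) -> None:
    # Demo-only bookkeeping so the hook's invocation can be observed.
    state.evolve_calls += 1


async def run_loop(statuses: list, config: Config) -> LoopState:
    state = LoopState()
    for status in statuses[: config.max_iterations]:
        state.history.append(status)
        if status == "completed":
            state.failure_count = 0  # reset on success
        elif status == "failed":
            state.failure_count += 1
            if config.enable_evolving:  # hook point
                await evolve(state, status)
            if state.failure_count >= config.failure_threshold:
                break
    return state


canned = ["failed", "completed", "failed", "failed", "failed", "completed"]
with_hook = asyncio.run(run_loop(canned, Config(enable_evolving=True)))
without_hook = asyncio.run(run_loop(canned, Config(enable_evolving=False)))
print(with_hook.evolve_calls, without_hook.evolve_calls, len(with_hook.history))
```

Both runs stop at the same iteration; only the number of hook invocations differs.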

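Decision 4: Defensive Error Handling

CMS processes are naturally isolated: a crashing CLI process does not take down the shell script. SDK query() calls run inside the same Python process. An unhandled exception kills the engine.

Three error tiers: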

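Tier 1: expected failures (rate limit, timeout)
  → return IterationReport(status="failed", errors=[...])
  → engine continues to the next iteration or stops, depending on the threshold
  → checkpoint is saved both before and after the operation

Tier 2: query anomalies (no <report> tag in the output)
  → return IterationReport(status="partial", raw_output=text)
  → engine logs a warning and tries the next iteration
  → a human can inspect the raw output under reports/

Tier 3: infrastructure failures (SDK crash, network down)
  → save the checkpoint immediately
  → re-raise the exception (engine stops)
  → resume picks up from the last saved checkpoint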

Key invariant: the checkpoint is always saved before any operation that can fail. The engine never loses more than one iteration of work.

_run_single_iteration：
  checkpoint.save()          ← save before the query (crash protection)
  result = await query(...)  ← this can fail
  report = parse(result)     ← this can fail
  checkpoint.save()          ← save after success
  return report
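The three tiers and the save-first invariant can be sketched together. `ExpectedQueryError` is a hypothetical exception type standing in for whatever typed errors the real transport raises; plain dicts stand in for IterationReport, and `save` is any callable that persists the checkpoint.

```python
import asyncio
import re


class ExpectedQueryError(Exception):
    """Hypothetical Tier 1 error type (rate limit, timeout)."""


REPORT_RE = re.compile(r"<report>(.*?)</report>", re.DOTALL)


async def run_single_iteration(query, prompt: str, save) -> dict:
    save()  # crash protection: persist before anything that can fail
    try:
        text = await query(prompt)
    except (ExpectedQueryError, asyncio.TimeoutError) as exc:
        # Tier 1: report the failure and let the loop apply its threshold.
        return {"status": "failed", "errors": [str(exc)]}
    except Exception:
        # Tier 3: persist immediately, then let the engine stop.
        save()
        raise
    if REPORT_RE.search(text) is None:
        # Tier 2: keep the raw output for a human to inspect.
        return {"status": "partial", "raw_output": text}
    save()  # save after success
    return {"status": "completed", "raw_output": text}


saves = []


def save() -> None:
    saves.append("saved")


async def rate_limited(prompt: str) -> str:
    raise ExpectedQueryError("rate limit")


async def no_tag(prompt: str) -> str:
    return "free-form text"


async def broken(prompt: str) -> str:
    raise ConnectionError("network down")


tier1 = asyncio.run(run_single_iteration(rate_limited, "p", save))
tier2 = asyncio.run(run_single_iteration(no_tag, "p", save))
try:
    asyncio.run(run_single_iteration(broken, "p", save))
    tier3 = "no error"
except ConnectionError:
    tier3 = "re-raised"
print(tier1["status"], tier2["status"], tier3, len(saves))
```

Catching the narrow, expected types first and re-raising everything else is what keeps an unknown failure from being silently downgraded to a retry.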

Data Model

Checkpoint (v1.1.0 compatible)

Checkpoint
├── version: "1.1.0"
├── iteration_type: "auto-cycle" | "auto-explore" | "custom"
├── request: str
├── current_iteration: int
├── max_iterations: int
├── status: "running" | "completed" | "failed" | "stopped"
│
├── original_context
│   ├── goal: str
│   └── acceptance_criteria_file: str
│
├── context_summary
│   ├── current: str
│   ├── key_decisions: list[str]
│   ├── blockers: list[str]
│   └── next_action: str
│
├── completed_items: list[dict]
├── pending_items: list[dict]
├── history: list[dict]         ← one entry per iteration
│
├── progress
│   ├── percent: int
│   └── estimated_remaining: int
│
└── recovery
    ├── last_successful_iteration: int
    └── failure_count: int

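Serialization contract: the output of Checkpoint.to_dict(), once written to disk, is byte-identical to what the CMS shell script writes. Old checkpoints load via Checkpoint.from_file(). New checkpoints remain readable by the old CMS.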
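A byte-level round trip depends on two things: stable key order and identical JSON formatting. The sketch below uses a deliberately small subset of the fields (`MiniCheckpoint` is not the real class), and the `indent=2` formatting is an assumption about what the shell script emits.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Recovery:
    last_successful_iteration: int
    failure_count: int


@dataclass
class MiniCheckpoint:
    """Small subset of the checkpoint fields, for illustration only."""

    version: str
    current_iteration: int
    max_iterations: int
    status: str
    recovery: Recovery

    @classmethod
    def from_json(cls, raw: str) -> "MiniCheckpoint":
        data = json.loads(raw)
        data["recovery"] = Recovery(**data["recovery"])
        return cls(**data)

    def to_json(self) -> str:
        # asdict() emits keys in field declaration order, including nested ones.
        return json.dumps(asdict(self), indent=2)


original = json.dumps(
    {
        "version": "1.1.0",
        "current_iteration": 4,
        "max_iterations": 10,
        "status": "running",
        "recovery": {"last_successful_iteration": 3, "failure_count": 1},
    },
    indent=2,
)
restored = MiniCheckpoint.from_json(original).to_json()
print(restored == original)
```

If the real files use different whitespace, a trailing newline, or non-ASCII escaping, the `json.dumps` arguments would have to match those choices for the comparison to hold.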

IterationReport

IterationReport
├── task_id: str
├── iteration: int
├── status: "completed" | "partial" | "failed" | "blocked"
│
├── iteration_result
│   ├── action_taken: str
│   ├── files_changed: list[str]
│   ├── tests_passed: bool
│   └── errors: list[str]
│
├── checkpoint_update
│   ├── completed_items: list[dict]
│   ├── pending_items: list[dict]
│   ├── progress_percent: int
│   └── context_summary: str
│
└── continue_decision
    ├── should_continue: bool
    └── reason: str

Parsed from: the <report>JSON</report> tag in query() output, the same format CMS already uses.

Engine Public API
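A minimal sketch of that parsing step, assuming the tag wraps a single JSON object. `ParsedReport` carries only a few of the fields from the tree above, and treating malformed JSON the same as a missing tag is a choice made for this example, not something the source specifies.

```python
import json
import re
from dataclasses import dataclass, field

REPORT_RE = re.compile(r"<report>(.*?)</report>", re.DOTALL)


@dataclass
class ParsedReport:
    status: str
    iteration: int = 0
    errors: list = field(default_factory=list)
    raw_output: str = ""


def parse_report(text: str) -> ParsedReport:
    match = REPORT_RE.search(text)
    if match is None:
        # No tag at all: keep the raw text for later inspection.
        return ParsedReport(status="partial", raw_output=text)
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        # Assumption: malformed JSON is handled like a missing tag.
        return ParsedReport(status="partial", raw_output=text)
    result = data.get("iteration_result", {})
    return ParsedReport(
        status=data.get("status", "partial"),
        iteration=data.get("iteration", 0),
        errors=list(result.get("errors", [])),
        raw_output=text,
    )


sample = (
    "Working...\n<report>\n"
    '{"status": "failed", "iteration": 7, '
    '"iteration_result": {"errors": ["tests red"]}}\n'
    "</report>\nDone."
)
good = parse_report(sample)
bad = parse_report("no tags here")
print(good.status, good.iteration, good.errors, bad.status)
```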

IterationEngine
├── start(request, type, max_iterations, ...) → Checkpoint
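│     Creates a fresh checkpoint and runs the loop
│
├── resume() → Checkpoint
│     Loads the checkpoint from disk and continues from the last iteration
│
├── stop() → None
│     Sets a flag; the loop exits after the current iteration finishes
│
└── status() → Checkpoint
      Reads the checkpoint from disk (no side effects)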

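Exposed as 3 MCP tools: iteration_start, iteration_resume, iteration_status

Also exposed as an orchestrator method: orchestrator.run_iterations(request, ...)

Two entry points, one engine. The MCP tools serve Claude-driven workflows. The orchestrator method serves programmatic use.

Decision Logic (unchanged from CMS)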

Continue when (all true):
  ├── pending_items is not empty
  ├── current_iteration < max_iterations
  ├── recovery.failure_count < failure_threshold (default: 3)
  └── stop() has not been called

Stop when (any true):
  ├── pending_items is empty                   → status: "completed"
  ├── current_iteration >= max_iterations      → status: "stopped"
  ├── recovery.failure_count >= threshold      → status: "failed"
  └── stop() was called                        → status: "stopped"

Same four conditions. Same behavior. Now testable without running Claude.

File Map
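Because the four conditions only read checkpoint fields, they fit in a pure function. The sketch below checks them in the order listed above; which status wins when several conditions hold at once is not stated in the source, so that precedence is an assumption.

```python
from typing import Optional


def stop_status(pending_items: list,
                current_iteration: int,
                max_iterations: int,
                failure_count: int,
                failure_threshold: int = 3,
                stop_requested: bool = False) -> Optional[str]:
    """Return the final status if the loop should stop, or None to continue."""
    if not pending_items:
        return "completed"
    if current_iteration >= max_iterations:
        return "stopped"
    if failure_count >= failure_threshold:
        return "failed"
    if stop_requested:
        return "stopped"
    return None


keep_going = stop_status(["task"], current_iteration=2, max_iterations=10, failure_count=0)
finished = stop_status([], current_iteration=2, max_iterations=10, failure_count=0)
too_many_failures = stop_status(["task"], current_iteration=2, max_iterations=10, failure_count=3)
print(keep_going, finished, too_many_failures)
```

No model, network, or filesystem is touched, so every branch can be covered by plain assertions.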

New files:
  src/core/iteration_engine.py         ~400 lines  ← engine + data models
  config/prompts/iterator-system.md    ~250 lines  ← ported from the CMS prompt
  tests/core/test_iteration_engine.py  ~300 lines  ← unit + integration tests

Modified files:
  src/core/orchestrator.py          +30 lines  ← run_iterations() + create_fresh_query()
  src/tools/self_dev_tools.py       +90 lines  ← 3 MCP tools
  config/defaults.yaml              +6 lines   ← iteration config section

Untouched:
  .cms-iterate/                     ← same directory, same format
  .claude/agents/                   ← all agent definitions
  .claude/skills/cms-*              ← kept as a fallback
  src/core/orchestrator.py (rest)   ← chat(), run(), run_skill()

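Total new code: ~950 lines (400 + 250 + 300). Total modified code: ~126 lines (30 + 90 + 6). Risk surface: small additions to 2 existing source files plus one config file.

Implementation Order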

Phase 1: Foundation (no integration, purely unit-testable)
  1. iteration_engine.py: Checkpoint + IterationReport data models only
  2. test_iteration_engine.py: data model tests + backward compatibility with real checkpoints
  3. Run: pytest tests/core/test_iteration_engine.py -v

Phase 2: Engine logic (mocked query, still no real integration)
  4. iteration_engine.py: IterationEngine class with decision logic
  5. iterator-system.md: port the prompt template
  6. test_iteration_engine.py: decision logic + mocked iteration tests
  7. Run: pytest tests/core/test_iteration_engine.py -v

Phase 3: Integration (wire into the orchestrator)
  8. orchestrator.py: create_fresh_query() + run_iterations()
  9. self_dev_tools.py: 3 MCP tools
  10. defaults.yaml: iteration config
  11. Run: pytest tests/core/ -v (full suite, no regressions)

Phase 4: Smoke Test
  12. Load a real .cms-iterate/checkpoint.json → verify every field
  13. Checkpoint round trip: from_file() → to_dict() → byte comparison
  14. IterationReport parses real CMS report text

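Every phase is independently shippable. Phase 1 alone provides typed checkpoint handling. Phase 2 adds testable decision logic. Phase 3 wires everything together. Phase 4 verifies backward compatibility.

Migration Timeline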

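Week 1: Phases 1-2 (data models + engine logic)
  ├── CMS external still runs all production work
  └── The engine exists but is not wired in

Week 2: Phases 3-4 (integration + smoke test)
  ├── Both paths available: orchestrator.run_iterations() and the /cms skill
  └── Run the same task through both and compare checkpoints

Week 3: Shadow mode
  ├── The engine runs real tasks in parallel with CMS
  ├── Compare checkpoint output
  └── On any difference: investigate and fix the engine; CMS remains the source of truth

Week 4: Cutover
  ├── The engine becomes the default
  ├── CMS skills are marked deprecated (not removed)
  └── CMS skills are removed in v2 after 1 month without issues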

What This Enables

Once the engine is SDK native, three things CMS cannot do become possible:

1. Programmatic iteration from any Python code

# Before: reachable only through the Claude CLI or a skill invocation
# After:
engine = IterationEngine(orchestrator, config)
checkpoint = await engine.start("Refactor the auth module to use JWT")
print(f"Finished in {checkpoint.current_iteration} iterations")

The iteration loop becomes a library call. Tests, CI pipelines, and external tools can all trigger it.

2. Typed error handling

# Before: regex-matching error patterns in stdout text
# After:
if report.status == "failed":
    for error in report.iteration_result.errors:
        logger.error(f"Iteration {report.iteration}: {error}")

No more regex. No more "what unexpected thing did the CLI print?" Structured data end to end.

3. Composable with Agent Teams

# End goal: an Agent Teams teammate that runs iterations
Task(
    subagent_type="general-purpose",
    team_name="dev-team",
    prompt="Use the iteration_start tool to run 10 iterations on the auth module"
)

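An Agent Teams teammate can invoke IterationEngine through the MCP tools. CMS inside Agent Teams. The best of both architectures.

The Principle

Build userland solutions behind clean interfaces. When the platform catches up, your migration is a transport swap, not a rewrite.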

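CMS → IterationEngine is ~950 lines of new code and ~126 lines of modifications. The checkpoint format is unchanged. The directory layout is unchanged. The decision logic is unchanged.

Because CMS used the right abstractions from the start.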