Agent Teams: The Afterlife — Six Fractures in Session Management
The last article covered five ways agents die. This one covers what happens after — when agents resurrect into amnesia, when the living inherit the chaos, and how six systemic fractures tear apart the session lifecycle.
Knowledge Comic Series — Maximum technical depth, minimum words.
This is the sequel to Mesh Topology and Five Ways to Die.
Previously
The last article analyzed five ways teammate processes terminate. Conclusion: only clean shutdown has a complete protocol. The other four are silent disappearances.
But termination is only the beginning.
When agents die, the real story starts: their residual effects tear open six fractures across the entire session lifecycle.
Fracture 1: The First-Come, First-Served Trap
Agent Teams is fully asynchronous. You dispatch 5 agents. They work independently, report independently.
The problem: the main context doesn’t wait.
Timeline:
t=0 Lead dispatches Agent A, B, C, D, E simultaneously
t=10s Agent A completes, reports result
t=11s Lead sees A's result → "Phase 1 has output" → enters Phase 2
t=15s Agent B completes, reports result
t=16s Lead receives B → but already working on Phase 2
→ B's result is ignored or processed redundantly
t=30s Agent C, D, E finish one by one...
Lead may have already dispatched new agents for the same work
No barrier. No waitAll(). No “wait for everyone to arrive.”
The inbox is a JSON file with append writes:
~/.claude/teams/{team}/inboxes/{lead}.json
Lead consumes the inbox at the end of each turn. First in, first processed. Latecomers may be treated as new tasks by the “Phase 2 Lead.”
Observed: a 5-agent team dispatched the same work 3 times because Lead lost track of who did what.
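The missing primitive is small. A minimal sketch of what a barrier could look like; `DispatchBarrier` is a hypothetical helper, not part of Agent Teams:

```python
import threading

# Hypothetical barrier: Lead blocks until every dispatched agent reports,
# instead of advancing phases on the first result to arrive.
class DispatchBarrier:
    def __init__(self, agent_ids):
        self.pending = set(agent_ids)
        self.results = {}
        self.cv = threading.Condition()

    def report(self, agent_id, result):
        # Called from each agent's reporting path.
        with self.cv:
            self.pending.discard(agent_id)
            self.results[agent_id] = result
            self.cv.notify_all()

    def wait_all(self, timeout=None):
        # Lead calls this before entering the next phase.
        with self.cv:
            self.cv.wait_for(lambda: not self.pending, timeout=timeout)
            return dict(self.results)

barrier = DispatchBarrier(["A", "B", "C"])
for aid in ["A", "B", "C"]:
    barrier.report(aid, f"{aid} done")
results = barrier.wait_all()  # returns only once all three have reported
```

With a barrier in place, a latecomer like Agent B at t=15s lands in `results` instead of being mistaken for a new task by a Lead already in Phase 2.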
Fracture 2: Amnesia After Compaction
Claude Code’s auto-compact triggers at 80% context window usage (raised from 60% in v1.0.51).
What compaction does: compress the entire conversation into a text summary.
Before compaction: full conversation history (all tool calls, agent reports, task states)
After compaction: a text summary + "continue from here"
Critical issue: the summary is plain text. It carries no structured state.
After compaction, the model doesn’t know:
- Which agents are still running
- Their agent IDs
- What tasks they’re responsible for
- Which tasks are done vs. pending
So when an in-flight agent finally returns:
Agent D (t=45s): "Finished the API schema design, here are the results..."
Lead (post-compact): "API schema? Let me see... Alright, I'll assign someone
to design the API schema."
→ Dispatches Agent F for the same work
Compaction kills memory. The agent rises from the dead, only to find the master no longer recognizes it.
CHANGELOG fixes confirm the severity:
| Version | Fix |
|---|---|
| v2.1.0 | Fixed files and skills not being properly discovered when resuming |
| v2.0.7 | Fixed sub-agents using the wrong model during conversation compaction |
| v1.0.51 | Increased auto-compact warning threshold from 60% to 80% |
| v1.0.12 | Improved todo list handling during compaction |
Raising the threshold isn’t a fix. It just delays the amnesia.
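One mitigation is to persist a structured agent registry outside the conversation, so post-compact recovery doesn't depend on the text summary. A minimal sketch; the file name and schema are illustrative, not a Claude Code feature:

```python
import json
import os
import tempfile

# Hypothetical registry file living outside the context window.
REGISTRY = os.path.join(tempfile.mkdtemp(), "agent-registry.json")

def load():
    try:
        with open(REGISTRY) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def register(agent_id, task):
    # Record dispatch before the conversation can be compacted away.
    state = load()
    state[agent_id] = {"task": task, "status": "running"}
    with open(REGISTRY, "w") as f:
        json.dump(state, f)

register("agent-D", "API schema design")
# After compaction, reload structured state instead of trusting the summary:
inflight = [a for a, s in load().items() if s["status"] == "running"]
```

A post-compact Lead consulting this file would know Agent D is still out on the API schema task, instead of dispatching Agent F for the same work.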
Fracture 3: Token Detonation
Each sub-agent’s transcript is a standalone JSONL file with no size limit.
Real-world measurements (anonymized):
Agent transcript sizes for a single session:
agent-aaa.jsonl → 425 MB
agent-bbb.jsonl → 256 MB
agent-ccc.jsonl → 19 MB
agent-ddd.jsonl → 12 MB
Main session → 22 MB, 1,533 lines
Total sub-agent transcripts: over 700 MB for one session.
v2.1.0 added a 30K character truncation — but that only truncates what’s returned to the main context. The agent’s own transcript grows without bound.
Worse scenario: multiple parallel agents complete and report simultaneously.
Agent A: returns 25K results
Agent B: returns 28K results
Agent C: returns 30K results
─── all written to Lead's inbox simultaneously ───
Lead's next turn: consumes 83K of new content
+ existing context ≈ 75%
= instantly exceeds 80% → triggers compact
→ back to Fracture 2
Parallel agent reports are burst traffic. No backpressure. No flow control. No queuing.
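What backpressure could look like: drain the inbox in bounded chunks per turn, so a burst of parallel reports cannot push the context past the compact threshold. The 30K budget mirrors the v2.1.0 truncation limit; the `drain` helper is hypothetical:

```python
from collections import deque

CHUNK_BUDGET = 30_000  # characters consumed per Lead turn (illustrative)

def drain(inbox: deque, budget: int = CHUNK_BUDGET):
    # Take messages in arrival order until the next one would bust the budget.
    taken, used = [], 0
    while inbox and used + len(inbox[0]) <= budget:
        msg = inbox.popleft()
        taken.append(msg)
        used += len(msg)
    return taken  # the remainder stays queued for the next turn

# The burst from the scenario above: 25K + 28K + 30K arriving at once.
inbox = deque(["a" * 25_000, "b" * 28_000, "c" * 30_000])
first_turn = drain(inbox)  # only the 25K report fits this turn
```

Instead of 83K landing in one turn, the Lead absorbs one report per turn and never crosses the 80% compact trigger in a single step.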
Fracture 4: The Resume Black Hole
Session persistence is an append-only .jsonl file. Resuming requires reloading the entire file.
A 2-hour session with 10+ agents can produce a 20+ MB JSONL.
claude --resume <session-id>
↓
Load 22MB JSONL
↓
Case A: previous compact exists → load compact summary (correct)
Case B: compact boundary corrupted → load full history (22MB)
Case C: orphaned tool_result exists → resume fails entirely
The CHANGELOG documents repeated fixes:
| Version | Fix |
|---|---|
| v2.1.9 | Fixed long sessions with parallel tool calls failing (orphan tool_result) |
| v2.1.7 | Fixed orphaned tool_result errors when sibling tools fail |
| v2.1.0 | Fixed session resume failures caused by orphaned tool results |
| v2.0.1 | Fixed session persistence stuck after transient server errors |
See the pattern? “Orphaned tool result” keeps recurring.
The cause: when one parallel tool call fails, the other tool_results become “orphans” — results without a matching tool_use. The API requires strict tool_use/tool_result pairing. Orphans cause immediate errors on resume.
The more parallelism, the more orphans. The more orphans, the more fragile resume becomes.
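The pairing invariant is easy to check offline. A sketch of an orphan scanner over a simplified transcript shape (the real JSONL schema carries more fields than shown here):

```python
import json

def find_orphans(transcript_lines):
    # Every tool_result must reference a tool_use id seen earlier in the file.
    tool_uses, orphans = set(), []
    for line in transcript_lines:
        msg = json.loads(line)
        if msg["type"] == "tool_use":
            tool_uses.add(msg["id"])
        elif msg["type"] == "tool_result" and msg["tool_use_id"] not in tool_uses:
            orphans.append(msg["tool_use_id"])
    return orphans

lines = [
    '{"type": "tool_use", "id": "t1"}',
    '{"type": "tool_result", "tool_use_id": "t1"}',
    '{"type": "tool_result", "tool_use_id": "t9"}',  # no matching tool_use
]
orphans = find_orphans(lines)
```

Running a check like this before `--resume` would surface the orphan that otherwise fails the whole resume.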
Fracture 5: Wandering Ghosts
After force-killing a session (Ctrl+C, close terminal, token exhaustion), the residue:
~/.claude/teams/
├── team-alpha/ ← last week's team
│ ├── config.json
│ └── inboxes/
│ ├── agent-1.json ← still has unread messages
│ ├── agent-2.json
│ └── agent-3.json
├── team-beta/ ← three days ago
│ └── inboxes/
│ └── ... 11 inbox files
├── default/ ← unknown origin
│ └── inboxes/ ← no config.json (incomplete state)
└── ... 6 more
The system has no:
- PID tracking (doesn’t know if processes are alive)
- Orphan detection (doesn’t know which teams are active)
- Auto-cleanup (no GC for expired team directories)
- Team state recovery on resume (doesn’t know where the team left off)
Every Agent Teams session leaves a sediment layer on disk. Cleanup is manual only.
# The only "cleanup mechanism" available today
rm -rf ~/.claude/teams/team-*
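A slightly safer alternative to the blanket `rm -rf` is age-based GC: delete only team directories untouched for some retention window. Everything here (helper name, threshold) is illustrative; the demo uses a temp directory rather than `~/.claude/teams`:

```python
import os
import shutil
import tempfile
import time

def gc_teams(root, max_age=7 * 24 * 3600, now=None):
    # Remove team directories whose mtime is older than max_age seconds.
    now = time.time() if now is None else now
    removed = []
    for entry in os.scandir(root):
        if entry.is_dir() and now - entry.stat().st_mtime > max_age:
            shutil.rmtree(entry.path)
            removed.append(entry.name)
    return sorted(removed)

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "team-alpha"))
os.utime(os.path.join(root, "team-alpha"), (0, 0))  # mtime: epoch ("last week")
os.makedirs(os.path.join(root, "team-fresh"))       # mtime: now
removed = gc_teams(root)  # only the stale team is swept
```

Real GC would also need the missing PID tracking to avoid sweeping a team whose processes are still alive, which is exactly the gap listed above.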
Fracture 6: The Idle Storm
Agent Teams’ idle notification design rests on an implicit assumption: each agent goes idle once, at the end of its work.
Reality: every time an agent’s turn ends (no tool call), it triggers an idle notification.
Real-world data (10-second window from a single team):
agent-1: idle (t=0.000s)
agent-1: idle (t=0.514s) ← 0.5s gap
agent-1: idle (t=0.951s) ← 0.4s gap
agent-1: idle (t=1.505s)
agent-1: idle (t=1.960s)
agent-1: idle (t=2.533s)
agent-1: idle (t=3.034s)
agent-1: idle (t=3.522s)
agent-1: idle (t=5.076s)
One agent sent 9 idle notifications in 5 seconds.
In a 6-agent team, 69% of Lead’s inbox was idle noise.
Each idle notification can wake Lead → Lead consumes inbox → triggers new turn → burns tokens → accelerates context bloat → faster compact → back to Fracture 2.
idle storm → token bloat → compact → amnesia → duplicate dispatch → more agents → more idle
↑ │
└──────────────── positive feedback loop ────────────────┘
v2.0.14 shipped one fix here: “Fixed how idleness is computed for notifications.”
But the root issue remains — idle fires per-turn, not per-session.
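Per-session coalescing would collapse the storm: deliver the first idle notification per agent, suppress repeats until that agent becomes active again. A hypothetical sketch:

```python
class IdleCoalescer:
    def __init__(self):
        self.idle_agents = set()

    def on_idle(self, agent_id):
        # Deliver only the first idle per idle period; suppress the rest.
        if agent_id in self.idle_agents:
            return False  # suppressed
        self.idle_agents.add(agent_id)
        return True       # delivered to Lead's inbox

    def on_active(self, agent_id):
        # Agent did real work again; re-arm its idle notification.
        self.idle_agents.discard(agent_id)

c = IdleCoalescer()
# The 9-notification burst from the measurement above:
delivered = sum(c.on_idle("agent-1") for _ in range(9))
```

The 9-in-5-seconds burst collapses to a single inbox message, cutting the 69% noise figure and starving the feedback loop of its trigger.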
How the Six Fractures Interact
These six problems are not independent. They amplify each other:
┌──── Fracture 1 (first come, first served)
│ ↓
│ duplicate dispatch
│ ↓
Fracture 6 (idle) → token explosion ← Fracture 3 (token detonation)
│ ↓
│ Fracture 2 (amnesia compact)
│ ↓
│ more duplicate dispatch
│ ↓
│ Fracture 4 (resume failure)
│ ↓
│ forced restart
│ ↓
└──── Fracture 5 (wandering ghosts)
A complex 2-hour session can experience all six fractures.
Root Cause: Three Missing Infrastructure Pillars
| Infrastructure | Current State | Fractures Caused |
|---|---|---|
| State machine management | No structured agent registry | 1, 2, 4 |
| Process supervision | No PID tracking, no heartbeat | 5, 6 |
| Synchronization barriers | No barrier / waitAll | 1, 3 |
Agent Teams is a loose collaboration framework: message passing plus a shared task list. It lacks all three pillars. In distributed-systems terms:
Has message passing → but no delivery guarantee
Has task queue → but no exactly-once processing
Has process spawn → but no supervisor tree
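The supervision pillar, for instance, reduces to a heartbeat registry: agents check in periodically, and anyone silent past a timeout is presumed dead. A minimal sketch with illustrative timeouts:

```python
import time

class Supervisor:
    def __init__(self, timeout=30.0):
        self.timeout = timeout   # seconds of silence before presumed dead
        self.last_seen = {}

    def heartbeat(self, agent_id, now=None):
        # Each agent calls this periodically while alive.
        self.last_seen[agent_id] = time.time() if now is None else now

    def dead_agents(self, now=None):
        # Anyone silent longer than the timeout is flagged for restart/cleanup.
        now = time.time() if now is None else now
        return [a for a, t in self.last_seen.items() if now - t > self.timeout]

sup = Supervisor(timeout=30.0)
sup.heartbeat("agent-1", now=0.0)
sup.heartbeat("agent-2", now=25.0)
dead = sup.dead_agents(now=40.0)  # agent-1 has been silent for 40s
```

With this in place, Fractures 5 and 6 become detectable events instead of invisible residue: a dead agent gets flagged, its inbox gets swept, its task gets reassigned.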
Contrast: Why Process Isolation Avoids These
| Fracture | Agent Teams | Process Isolation (e.g., independent CLI sessions) |
|---|---|---|
| First come, first served | Lead doesn’t wait | Orchestrator waits for all workers before advancing |
| Amnesia compact | Summary loses in-flight state | Each worker has independent context |
| Token detonation | Agent transcript unbounded | Each task gets its own context window |
| Resume black hole | Loads massive JSONL | Each worker persists independently |
| Wandering ghosts | No process tracking | Independent processes, exit = cleanup |
| Idle storm | Per-turn notification | Worker completes → process exits |
The core difference: Agent Teams is in-process concurrency (coroutines). Process Isolation is true parallelism (processes).
The former shares fate — one dies, all affected. The latter stands independent — one failure doesn’t cascade.
Conclusion
The last article concluded: star topology is not a technical limitation. It’s risk management.
This article concludes: Agent Teams’ session management lacks distributed systems infrastructure.
It’s well-suited for short, low-complexity team tasks — 3 agents, 10 minutes, no compaction needed.
But once you extend to hour-long sessions, 5+ agents, multi-phase coordination, the six fractures start resonating with each other.
Agent Teams sweet spot:
✅ 3-5 agents
✅ 10-15 minutes
✅ Single-phase tasks
✅ No resume needed
Danger zone:
❌ 5+ agents
❌ 30+ minutes
❌ Multi-phase dependencies
❌ Requires compact or resume
This isn’t saying Agent Teams is bad. It’s saying it has a scope.
Beyond that scope, you need process-level isolation — each worker as an independent CLI session, the orchestrator only coordinating, never sharing a context window.
Further Reading
- Agent Teams: Mesh Topology and Five Ways to Die — Prequel to this article
- From CMS to Agent SDK: Migration in Practice — Process Isolation in practice
- Multi-Agent Architecture: Parallel Execution Patterns — Sub-agent fundamentals