Agent Teams: The Afterlife — Six Fractures in Session Management
The last article covered five ways agents die. This one covers what happens after — when agents resurrect into amnesia, when the living inherit the chaos, and how six systemic fractures tear apart the session lifecycle.
Knowledge Comic Series — Maximum technical depth, minimum words.
This is the sequel to Mesh Topology and Five Ways to Die.
Previously
The last article analyzed five ways teammate processes terminate. Conclusion: only clean shutdown has a complete protocol. The other four are silent disappearances.
But termination is only the beginning.
When agents die, the real story starts: their residual effects tear open six fractures across the entire session lifecycle.
Fracture 1: The First-Come, First-Served Trap
Agent Teams is fully asynchronous. You dispatch 5 agents. They work independently, report independently.
The problem: the main context doesn’t wait.
Timeline:
t=0 Lead dispatches Agent A, B, C, D, E simultaneously
t=10s Agent A completes, reports result
t=11s Lead sees A's result → "Phase 1 has output" → enters Phase 2
t=15s Agent B completes, reports result
t=16s Lead receives B → but already working on Phase 2
→ B's result is ignored or processed redundantly
t=30s Agent C, D, E finish one by one...
Lead may have already dispatched new agents for the same work
No barrier. No waitAll(). No “wait for everyone to arrive.”
The inbox is a JSON file with append writes:
~/.claude/teams/{team}/inboxes/{lead}.json
Lead consumes the inbox at the end of each turn. First in, first processed. Latecomers may be treated as new tasks by the “Phase 2 Lead.”
Observed: a 5-agent team dispatched the same work 3 times because Lead lost track of who did what.
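The missing primitive is small. A minimal sketch of what a barrier could look like; `DispatchBarrier` is a hypothetical helper, not part of Agent Teams:

```python
import threading

# Hypothetical barrier: Lead blocks until every dispatched agent reports,
# instead of advancing phases on the first result to arrive.
class DispatchBarrier:
    def __init__(self, agent_ids):
        self.pending = set(agent_ids)
        self.results = {}
        self.cv = threading.Condition()

    def report(self, agent_id, result):
        # Called from each agent's reporting path.
        with self.cv:
            self.pending.discard(agent_id)
            self.results[agent_id] = result
            self.cv.notify_all()

    def wait_all(self, timeout=None):
        # Lead calls this before entering the next phase.
        with self.cv:
            self.cv.wait_for(lambda: not self.pending, timeout=timeout)
            return dict(self.results)

barrier = DispatchBarrier(["A", "B", "C"])
for aid in ["A", "B", "C"]:
    barrier.report(aid, f"{aid} done")
results = barrier.wait_all()  # returns only once all three have reported
```

With a barrier in place, a latecomer like Agent B at t=15s lands in `results` instead of being mistaken for a new task by a Lead already in Phase 2.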
Fracture 2: Amnesia After Compaction
Claude Code’s auto-compact triggers at 80% context window usage (raised from 60% in v1.0.51).
What compaction does: compress the entire conversation into a text summary.
Before compaction: full conversation history (all tool calls, agent reports, task states)
After compaction: a text summary + "continue from here"
Critical issue: the summary is plain text. It carries no structured state.
After compaction, the model doesn’t know:
- Which agents are still running
- Their agent IDs
- What tasks they’re responsible for
- Which tasks are done vs. pending
So when an in-flight agent finally returns:
Agent D (t=45s): "Finished the API schema design, here are the results..."
Lead (post-compact): "API schema? Let me see... Alright, I'll assign someone
to design the API schema."
→ Dispatches Agent F for the same work
Compaction kills memory. The agent rises from the dead, only to find the master no longer recognizes it.
CHANGELOG fixes confirm the severity:
| Version | Fix |
|---|---|
| v2.1.0 | Fixed files and skills not being properly discovered when resuming |
| v2.0.7 | Fixed sub-agents using the wrong model during conversation compaction |
| v1.0.51 | Increased auto-compact warning threshold from 60% to 80% |
| v1.0.12 | Improved todo list handling during compaction |
Raising the threshold isn’t a fix. It just delays the amnesia.
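One mitigation is to persist a structured agent registry outside the conversation, so post-compact recovery doesn't depend on the text summary. A minimal sketch; the file name and schema are illustrative, not a Claude Code feature:

```python
import json
import os
import tempfile

# Hypothetical registry file living outside the context window.
REGISTRY = os.path.join(tempfile.mkdtemp(), "agent-registry.json")

def load():
    try:
        with open(REGISTRY) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def register(agent_id, task):
    # Record dispatch before the conversation can be compacted away.
    state = load()
    state[agent_id] = {"task": task, "status": "running"}
    with open(REGISTRY, "w") as f:
        json.dump(state, f)

register("agent-D", "API schema design")
# After compaction, reload structured state instead of trusting the summary:
inflight = [a for a, s in load().items() if s["status"] == "running"]
```

A post-compact Lead consulting this file would know Agent D is still out on the API schema task, instead of dispatching Agent F for the same work.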
Fracture 3: Token Detonation
Each sub-agent’s transcript is a standalone JSONL file with no size limit.
Real-world measurements (anonymized):
Agent transcript sizes for a single session:
agent-aaa.jsonl → 425 MB
agent-bbb.jsonl → 256 MB
agent-ccc.jsonl → 19 MB
agent-ddd.jsonl → 12 MB
Main session → 22 MB, 1,533 lines
Total sub-agent transcripts: over 700 MB for one session.
v2.1.0 added a 30K character truncation — but that only truncates what’s returned to the main context. The agent’s own transcript grows without bound.
Worse scenario: multiple parallel agents complete and report simultaneously.
Agent A: returns 25K results
Agent B: returns 28K results
Agent C: returns 30K results
─── all written to Lead's inbox simultaneously ───
Lead's next turn: consumes 83K of new content
+ existing context ≈ 75%
= instantly exceeds 80% → triggers compact
→ back to Fracture 2
Parallel agent reports are burst traffic. No backpressure. No flow control. No queuing.
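What backpressure could look like: drain the inbox in bounded chunks per turn, so a burst of parallel reports cannot push the context past the compact threshold. The 30K budget mirrors the v2.1.0 truncation limit; the `drain` helper is hypothetical:

```python
from collections import deque

CHUNK_BUDGET = 30_000  # characters consumed per Lead turn (illustrative)

def drain(inbox: deque, budget: int = CHUNK_BUDGET):
    # Take messages in arrival order until the next one would bust the budget.
    taken, used = [], 0
    while inbox and used + len(inbox[0]) <= budget:
        msg = inbox.popleft()
        taken.append(msg)
        used += len(msg)
    return taken  # the remainder stays queued for the next turn

# The burst from the scenario above: 25K + 28K + 30K arriving at once.
inbox = deque(["a" * 25_000, "b" * 28_000, "c" * 30_000])
first_turn = drain(inbox)  # only the 25K report fits this turn
```

Instead of 83K landing in one turn, the Lead absorbs one report per turn and never crosses the 80% compact trigger in a single step.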
Fracture 4: The Resume Black Hole
Session persistence is an append-only .jsonl file. Resuming requires reloading the entire file.
A 2-hour session with 10+ agents can produce a 20+ MB JSONL.
claude --resume <session-id>
↓
Load 22MB JSONL
↓
Case A: previous compact exists → load compact summary (correct)
Case B: compact boundary corrupted → load full history (22MB)
Case C: orphaned tool_result exists → resume fails entirely
The CHANGELOG documents repeated fixes:
| Version | Fix |
|---|---|
| v2.1.9 | Fixed long sessions with parallel tool calls failing (orphan tool_result) |
| v2.1.7 | Fixed orphaned tool_result errors when sibling tools fail |
| v2.1.0 | Fixed session resume failures caused by orphaned tool results |
| v2.0.1 | Fixed session persistence stuck after transient server errors |
See the pattern? “Orphaned tool result” keeps recurring.
The cause: when one parallel tool call fails, the other tool_results become “orphans” — results without a matching tool_use. The API requires strict tool_use/tool_result pairing. Orphans cause immediate errors on resume.
The more parallelism, the more orphans. The more orphans, the more fragile resume becomes.
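The pairing invariant is easy to check offline. A sketch of an orphan scanner over a simplified transcript shape (the real JSONL schema carries more fields than shown here):

```python
import json

def find_orphans(transcript_lines):
    # Every tool_result must reference a tool_use id seen earlier in the file.
    tool_uses, orphans = set(), []
    for line in transcript_lines:
        msg = json.loads(line)
        if msg["type"] == "tool_use":
            tool_uses.add(msg["id"])
        elif msg["type"] == "tool_result" and msg["tool_use_id"] not in tool_uses:
            orphans.append(msg["tool_use_id"])
    return orphans

lines = [
    '{"type": "tool_use", "id": "t1"}',
    '{"type": "tool_result", "tool_use_id": "t1"}',
    '{"type": "tool_result", "tool_use_id": "t9"}',  # no matching tool_use
]
orphans = find_orphans(lines)
```

Running a check like this before `--resume` would surface the orphan that otherwise fails the whole resume.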
Fracture 5: Wandering Ghosts
After force-killing a session (Ctrl+C, close terminal, token exhaustion), the residue:
~/.claude/teams/
├── team-alpha/ ← last week's team
│ ├── config.json
│ └── inboxes/
│ ├── agent-1.json ← still has unread messages
│ ├── agent-2.json
│ └── agent-3.json
├── team-beta/ ← three days ago
│ └── inboxes/
│ └── ... 11 inbox files
├── default/ ← unknown origin
│ └── inboxes/ ← no config.json (incomplete state)
└── ... 6 more
The system has no:
- PID tracking (doesn’t know if processes are alive)
- Orphan detection (doesn’t know which teams are active)
- Auto-cleanup (no GC for expired team directories)
- Team state recovery on resume (doesn’t know where the team left off)
Every Agent Teams session leaves a sediment layer on disk. Cleanup is manual only.
# The only "cleanup mechanism" available today
rm -rf ~/.claude/teams/team-*
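A slightly safer alternative to the blanket `rm -rf` is age-based GC: delete only team directories untouched for some retention window. Everything here (helper name, threshold) is illustrative; the demo uses a temp directory rather than `~/.claude/teams`:

```python
import os
import shutil
import tempfile
import time

def gc_teams(root, max_age=7 * 24 * 3600, now=None):
    # Remove team directories whose mtime is older than max_age seconds.
    now = time.time() if now is None else now
    removed = []
    for entry in os.scandir(root):
        if entry.is_dir() and now - entry.stat().st_mtime > max_age:
            shutil.rmtree(entry.path)
            removed.append(entry.name)
    return sorted(removed)

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "team-alpha"))
os.utime(os.path.join(root, "team-alpha"), (0, 0))  # mtime: epoch ("last week")
os.makedirs(os.path.join(root, "team-fresh"))       # mtime: now
removed = gc_teams(root)  # only the stale team is swept
```

Real GC would also need the missing PID tracking to avoid sweeping a team whose processes are still alive, which is exactly the gap listed above.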
Fracture 6: The Idle Storm
Agent Teams’ idle notification design rests on an implicit assumption: each agent goes idle once, at the end of its work.
Reality: every time an agent’s turn ends (no tool call), it triggers an idle notification.
Real-world data (10-second window from a single team):
agent-1: idle (t=0.000s)
agent-1: idle (t=0.514s) ← 0.5s gap
agent-1: idle (t=0.951s) ← 0.4s gap
agent-1: idle (t=1.505s)
agent-1: idle (t=1.960s)
agent-1: idle (t=2.533s)
agent-1: idle (t=3.034s)
agent-1: idle (t=3.522s)
agent-1: idle (t=5.076s)
One agent sent 9 idle notifications in 5 seconds.
In a 6-agent team, 69% of Lead’s inbox was idle noise.
Each idle notification can wake Lead → Lead consumes inbox → triggers new turn → burns tokens → accelerates context bloat → faster compact → back to Fracture 2.
idle storm → token bloat → compact → amnesia → duplicate dispatch → more agents → more idle
↑ │
└──────────────── positive feedback loop ────────────────┘
v2.0.14 shipped one fix here: “Fixed how idleness is computed for notifications.”
But the root issue remains — idle fires per-turn, not per-session.
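Per-session coalescing would collapse the storm: deliver the first idle notification per agent, suppress repeats until that agent becomes active again. A hypothetical sketch:

```python
class IdleCoalescer:
    def __init__(self):
        self.idle_agents = set()

    def on_idle(self, agent_id):
        # Deliver only the first idle per idle period; suppress the rest.
        if agent_id in self.idle_agents:
            return False  # suppressed
        self.idle_agents.add(agent_id)
        return True       # delivered to Lead's inbox

    def on_active(self, agent_id):
        # Agent did real work again; re-arm its idle notification.
        self.idle_agents.discard(agent_id)

c = IdleCoalescer()
# The 9-notification burst from the measurement above:
delivered = sum(c.on_idle("agent-1") for _ in range(9))
```

The 9-in-5-seconds burst collapses to a single inbox message, cutting the 69% noise figure and starving the feedback loop of its trigger.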
How the Six Fractures Interact
These six problems are not independent. They amplify each other:
┌──── Fracture 1 (first come, first served)
│ ↓
│ duplicate dispatch
│ ↓
Fracture 6 (idle) → token explosion ← Fracture 3 (token detonation)
│ ↓
│ Fracture 2 (amnesia compact)
│ ↓
│ more duplicate dispatch
│ ↓
│ Fracture 4 (resume failure)
│ ↓
│ forced restart
│ ↓
└──── Fracture 5 (wandering ghosts)
A complex 2-hour session can experience all six fractures.
Root Cause: Three Missing Infrastructure Pillars
| Infrastructure | Current State | Fractures Caused |
|---|---|---|
| State machine management | No structured agent registry | 1, 2, 4 |
| Process supervision | No PID tracking, no heartbeat | 5, 6 |
| Synchronization barriers | No barrier / waitAll | 1, 3 |
Agent Teams is a loose collaboration framework: message passing plus a shared task list. It lacks all three pillars. In distributed-systems terms:
Has message passing → but no delivery guarantee
Has task queue → but no exactly-once processing
Has process spawn → but no supervisor tree
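The supervision pillar, for instance, reduces to a heartbeat registry: agents check in periodically, and anyone silent past a timeout is presumed dead. A minimal sketch with illustrative timeouts:

```python
import time

class Supervisor:
    def __init__(self, timeout=30.0):
        self.timeout = timeout   # seconds of silence before presumed dead
        self.last_seen = {}

    def heartbeat(self, agent_id, now=None):
        # Each agent calls this periodically while alive.
        self.last_seen[agent_id] = time.time() if now is None else now

    def dead_agents(self, now=None):
        # Anyone silent longer than the timeout is flagged for restart/cleanup.
        now = time.time() if now is None else now
        return [a for a, t in self.last_seen.items() if now - t > self.timeout]

sup = Supervisor(timeout=30.0)
sup.heartbeat("agent-1", now=0.0)
sup.heartbeat("agent-2", now=25.0)
dead = sup.dead_agents(now=40.0)  # agent-1 has been silent for 40s
```

With this in place, Fractures 5 and 6 become detectable events instead of invisible residue: a dead agent gets flagged, its inbox gets swept, its task gets reassigned.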
Contrast: Why Process Isolation Avoids These
| Fracture | Agent Teams | Process Isolation (e.g., independent CLI sessions) |
|---|---|---|
| First come, first served | Lead doesn’t wait | Orchestrator waits for all workers before advancing |
| Amnesia compact | Summary loses in-flight state | Each worker has independent context |
| Token detonation | Agent transcript unbounded | Each task gets its own context window |
| Resume black hole | Loads massive JSONL | Each worker persists independently |
| Wandering ghosts | No process tracking | Independent processes, exit = cleanup |
| Idle storm | Per-turn notification | Worker completes → process exits |
The core difference: Agent Teams is in-process concurrency (coroutines). Process Isolation is true parallelism (processes).
The former shares fate — one dies, all affected. The latter stands independent — one failure doesn’t cascade.
Conclusion
The last article concluded: star topology is not a technical limitation. It’s risk management.
This article concludes: Agent Teams’ session management lacks distributed systems infrastructure.
It’s well-suited for short, low-complexity team tasks — 3 agents, 10 minutes, no compaction needed.
But once you extend to hour-long sessions, 5+ agents, multi-phase coordination, the six fractures start resonating with each other.
Agent Teams sweet spot:
✅ 3-5 agents
✅ 10-15 minutes
✅ Single-phase tasks
✅ No resume needed
Danger zone:
❌ 5+ agents
❌ 30+ minutes
❌ Multi-phase dependencies
❌ Requires compact or resume
This isn’t saying Agent Teams is bad. It’s saying it has a scope.
Beyond that scope, you need process-level isolation — each worker as an independent CLI session, the orchestrator only coordinating, never sharing a context window.
Further Reading
- Agent Teams: Mesh Topology and Five Ways to Die — Prequel to this article
- From CMS to Agent SDK: Migration in Practice — Process Isolation in practice
- Multi-Agent Architecture: Parallel Execution Patterns — Sub-agent fundamentals