
Workers and signal files

When /catalyst-dev:orchestrate runs a wave, it dispatches one worker per ticket. Each worker is a separate Claude Code subprocess running /catalyst-dev:oneshot inside a dedicated git worktree. The worker communicates progress back to the orchestrator exclusively through its signal file.

Orchestrator                        Worker subprocess
│                                   │
│─ creates signal file ────────────>│
│─ launches humanlayer ────────────>│ starts /catalyst-dev:oneshot
│                                   │ Phase 1: researching
│                                   │ Phase 2: planning
│                                   │ Phase 3: implementing
│                                   │ Phase 4: validating
│                                   │ Phase 5: shipping (opens PR, arms auto-merge)
│                                   │
│<─ signal file: pr-created ────────│
│                                   │ exits
│                                   │
│ (Phase 4 poll loop, orchestrator side)
│ gh pr view → merged? ──> yes: writes mergedAt, Linear=Done

The split between worker and orchestrator matters: the worker subprocess reliably exits at its final tool-use, before merge completes. Polling until merged is the orchestrator’s responsibility, not the worker’s. A worker that claims to poll-until-merged burns tokens and produces false signals.
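
The orchestrator-side poll described above can be sketched as follows. This is a minimal illustration, not the actual Catalyst implementation: the function names, the 30-second interval, and the poll cap are assumptions, and the only external call is `gh pr view --json state,mergedAt`. The fetcher is injectable so the loop can be exercised without the GitHub CLI.

```python
import json
import subprocess
import time
from typing import Callable, Optional


def fetch_pr_with_gh(pr_number: int) -> dict:
    """Ask the GitHub CLI for the PR's state and merge timestamp."""
    out = subprocess.run(
        ["gh", "pr", "view", str(pr_number), "--json", "state,mergedAt"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def poll_until_merged(
    pr_number: int,
    fetch_pr: Callable[[int], dict] = fetch_pr_with_gh,
    interval_s: float = 30.0,  # assumed cadence, not taken from the docs
    max_polls: int = 120,      # assumed cap so the loop always terminates
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return mergedAt once the PR is merged, or None if polling gives up."""
    for _ in range(max_polls):
        pr = fetch_pr(pr_number)
        if pr.get("state") == "MERGED":
            return pr.get("mergedAt")
        if pr.get("state") == "CLOSED":
            return None  # closed without merging: leave the decision to a human
        sleep(interval_s)
    return None  # still open after max_polls attempts
```

Because this loop runs in the orchestrator, the worker can exit right after opening the PR and no model tokens are spent waiting on CI.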

The signal file lives at <orchestrator-dir>/workers/<ticket>.json. The orchestrator creates an empty skeleton; the worker writes into it.

| Field | Type | Description |
|---|---|---|
| ticket | string | The ticket ID (e.g., CTL-48) |
| orchestrator | string | Orchestrator ID |
| workerName | string | Human-readable worker name (used for tmux titles, etc.) |
| status | string | Current status; see the state machine below |
| phase | number | 0–6, matching oneshot phases |
| startedAt | ISO string | When the worker was dispatched |
| updatedAt | ISO string | Last write to the signal file |
| lastHeartbeat | ISO string | Most recent heartbeat (~60s cadence during long phases) |
| completedAt | ISO string or null | Set when a terminal state is reached |
| worktreePath | string | Absolute path to the worker’s git worktree |
| phaseTimestamps | object | Map of status → ISO timestamp; populated at each transition |
| pr | object or null | Populated at Phase 5 PR creation |
| linearState | string or null | Current Linear state name |
| definitionOfDone | object | Populated at Phase 4 + 5 with real results |
| pid | number | Worker’s Claude process PID |

Once populated, the pr object has this shape:

    {
      "number": 123,
      "url": "https://github.com/org/repo/pull/123",
      "ciStatus": "pending",
      "prOpenedAt": "2026-04-14T19:15:30Z",
      "autoMergeArmedAt": "2026-04-14T19:15:32Z",
      "mergedAt": null
    }

  • ciStatus: pending | passing | failing | unknown | merged
  • prOpenedAt — set by worker the moment the PR is created
  • autoMergeArmedAt — set by worker after gh pr merge --squash --auto
  • mergedAt: always set by the orchestrator (or standalone /merge-pr), never by the worker
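
To make the division of writes concrete, here is a rough sketch of how a skeleton might be built and how a worker might fill in the pr object. The function names are hypothetical, and the skeleton's initial values (status `dispatched`, phase 0, empty `definitionOfDone`, `pid` unknown until launch) are assumptions; the field names come from the table above.

```python
from datetime import datetime, timezone
from typing import Optional


def iso_now() -> str:
    """Current UTC time in the same ISO format as the examples above."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def new_signal_skeleton(ticket: str, orchestrator: str, worker_name: str,
                        worktree_path: str, pid: Optional[int] = None,
                        now: Optional[str] = None) -> dict:
    """Orchestrator side: build the skeleton before the worker starts."""
    now = now or iso_now()
    return {
        "ticket": ticket,
        "orchestrator": orchestrator,
        "workerName": worker_name,
        "status": "dispatched",          # assumed initial status
        "phase": 0,                      # assumed initial phase
        "startedAt": now,
        "updatedAt": now,
        "lastHeartbeat": now,
        "completedAt": None,
        "worktreePath": worktree_path,
        "phaseTimestamps": {"dispatched": now},
        "pr": None,
        "linearState": None,
        "definitionOfDone": {},          # assumed empty until Phase 4 + 5
        "pid": pid,                      # assumed unknown until launch
    }


def record_pr_opened(signal: dict, number: int, url: str,
                     now: Optional[str] = None) -> dict:
    """Worker side: fill in pr the moment the PR exists. mergedAt stays null."""
    now = now or iso_now()
    signal["pr"] = {
        "number": number, "url": url, "ciStatus": "pending",
        "prOpenedAt": now, "autoMergeArmedAt": None, "mergedAt": None,
    }
    signal["updatedAt"] = now
    return signal


def record_auto_merge_armed(signal: dict, now: Optional[str] = None) -> dict:
    """Worker side: stamp autoMergeArmedAt after auto-merge has been armed."""
    if signal.get("pr") is None:
        raise ValueError("cannot arm auto-merge before the PR is recorded")
    now = now or iso_now()
    signal["pr"]["autoMergeArmedAt"] = now
    signal["updatedAt"] = now
    return signal
```

Neither worker-side helper touches `mergedAt`; in this sketch that field can only change through orchestrator code.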
dispatched → researching → planning → implementing → validating → shipping → pr-created
                                                                                 │
                                                                       (orchestrator polls)
                                                                                 v
                                                                           merging → done

(at any stage) → failed | stalled

The worker writes statuses up through pr-created. The orchestrator writes merging and done (or failed/stalled if the wave times out or verification fails).
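
One way to enforce this ownership is a small transition check. This is a sketch under assumptions: the function name is hypothetical, and it assumes statuses advance one step at a time. It also lets the worker write `failed`, which matches the terminal-state table later in this page.

```python
# Statuses in the order the state machine lists them.
WORKER_PATH = ["dispatched", "researching", "planning", "implementing",
               "validating", "shipping", "pr-created"]
ORCHESTRATOR_PATH = ["pr-created", "merging", "done"]
TERMINAL = {"done", "failed", "stalled"}


def can_transition(current: str, new: str, writer: str) -> bool:
    """Return True if `writer` may move a worker from `current` to `new`."""
    if writer not in ("worker", "orchestrator"):
        raise ValueError(f"unknown writer: {writer}")
    if current in TERMINAL:
        return False                      # terminal states are final
    if new == "failed":
        return True                       # either side may report failure
    if new == "stalled":
        return writer == "orchestrator"   # only the orchestrator detects stalls
    path = WORKER_PATH if writer == "worker" else ORCHESTRATOR_PATH
    if current not in path or new not in path:
        return False                      # status is outside this writer's lane
    # Assumption: only single forward steps are legal.
    return path.index(new) == path.index(current) + 1
```

For example, `can_transition("pr-created", "merging", "worker")` is rejected, while the same transition with `writer="orchestrator"` is allowed.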

In parallel with the signal file, workers also write to ~/catalyst/state.json via catalyst-state.sh worker. This is the fleet-wide aggregate that the dashboard reads — it unifies workers across multiple orchestrators. Writes are atomic (jq + flock).
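
The real tooling does this in shell with jq and flock. As an illustration of the same idea in Python, the sketch below takes an exclusive lock for the whole read-modify-write cycle so that concurrent workers cannot overwrite each other's updates. The helper names and the sidecar `.lock` file are assumptions, and `fcntl` is POSIX-only.

```python
import fcntl
import json
import os
from pathlib import Path
from typing import Callable


def update_state(state_path: Path, mutate: Callable[[dict], None]) -> dict:
    """Apply `mutate` to the state file while holding an exclusive lock."""
    state_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = state_path.with_name(state_path.name + ".lock")  # assumed sidecar lock
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until no other writer holds it
        try:
            if state_path.exists():
                state = json.loads(state_path.read_text())
            else:
                state = {"orchestrators": {}}
            mutate(state)
            # Write a sibling temp file, then swap it in with one rename.
            tmp = state_path.with_name(state_path.name + ".tmp")
            tmp.write_text(json.dumps(state, indent=2))
            os.replace(tmp, state_path)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return state


def upsert_worker(state_path: Path, orch_id: str, ticket: str, signal: dict) -> dict:
    """Record one worker's latest signal under its orchestrator."""
    def mutate(state: dict) -> None:
        orch = state["orchestrators"].setdefault(
            orch_id, {"workers": {}, "attention": []})
        orch.setdefault("workers", {})[ticket] = signal
    return update_state(state_path, mutate)
```

The lock serializes writers, and the final rename means a reader that does not take the lock (a dashboard, for instance) still sees either the old file or the new one, never a partial write.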

Schema:

    {
      "orchestrators": {
        "orch-2026-04-14-abc123": {
          "project": "CTL",
          "startedAt": "...",
          "lastHeartbeat": "...",
          "wave": 2,
          "totalWaves": 3,
          "workers": {
            "CTL-48": { /* same shape as signal file */ }
          },
          "attention": [
            { "type": "verification-failed", "ticket": "CTL-48", "message": "..." }
          ]
        }
      }
    }

The attention array is the orchestrator’s way of flagging something that needs a human decision. The orchestrator never auto-resolves these items itself.

During long phases (implementation, CI waits), workers update lastHeartbeat without changing status. The orch-monitor treats a worker as stalled if now - lastHeartbeat > 15 minutes. A stalled worker is never auto-restarted — it becomes an attention item.

If you’re writing a custom worker, heartbeat every ~60s at minimum. More often is fine; less often trips false stalled detections.
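
The stall rule can be written as a short predicate. This is a sketch, not the orch-monitor's code: the function names are hypothetical. Skipping workers in `pr-created` or `merging` is an assumption, made because those workers have already exited and no longer heartbeat.

```python
from datetime import datetime, timedelta, timezone

STALL_THRESHOLD = timedelta(minutes=15)      # threshold stated in the text
ORCHESTRATOR_OWNED = {"pr-created", "merging"}  # assumed exempt: worker has exited


def parse_iso(ts: str) -> datetime:
    """Parse timestamps like 2026-04-14T19:15:30Z into aware datetimes."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def is_stalled(signal: dict, now: datetime) -> bool:
    """True if the worker has been silent for longer than the threshold."""
    if signal.get("completedAt"):
        return False  # already terminal
    if signal.get("status") in ORCHESTRATOR_OWNED:
        return False  # no heartbeats are expected after the worker exits
    return now - parse_iso(signal["lastHeartbeat"]) > STALL_THRESHOLD
```

With a 60-second heartbeat, a worker has to miss roughly fifteen consecutive beats before this returns True, which leaves room for slow tool calls without hiding a genuinely hung process.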

A worker reaches a terminal state in one of three ways:

| State | Means | Signal writer |
|---|---|---|
| done | PR merged, Linear=Done | Orchestrator (after observing merge) |
| failed | Unrecoverable error, quality gates exhausted, or human escalation | Worker (writes attention reason) |
| stalled | No heartbeat / no progress for 15+ min | Orchestrator |

Reaching a terminal state sets completedAt. Once a worker is terminal, nothing writes to its signal file again.

File-based signals are intentionally boring:

  • Debuggable with cat workers/*.json | jq
  • Survive process death on both sides — neither the worker crashing nor the orchestrator restarting destroys state
  • Atomic on POSIX via tmp+rename writes
  • Pickled history — old signal files live in archived orchestrator dirs, so you can audit past waves
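
The tmp+rename pattern from the list above might look like this for a custom worker. It is a minimal sketch with a hypothetical function name: the temp file is created in the same directory as the target so the rename stays on one filesystem, and its name ends in `.tmp` so a `workers/*.json` glob never picks up a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def write_signal_atomic(path: Path, signal: dict) -> None:
    """Replace the signal file in one step so readers never see a partial write."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target: rename is only atomic within one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(signal, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())    # push the bytes to disk before the swap
        os.replace(tmp_name, path)    # atomic rename over the old file on POSIX
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader that opens the file at any moment gets either the previous complete JSON document or the new one.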

The cost is polling latency — the orch-monitor uses fs.watch to avoid polling, but some consumers do poll. That’s fine for a one-machine setup; for multi-host you’d front this with a real event bus.