Symptom
On a box with two Node versions (an interactive shell on Node A, the Tower daemon + builders on Node B), afx spawn <id> --protocol … half-completes: worktree created, porch initialized, builder PTY running — but the builder is never registered (absent from afx status, no spawned_by_architect row). Every subsequent afx in that shell then fails with:
better_sqlite3.node compiled against NODE_MODULE_VERSION 147, this Node requires 127
Root cause (verified in code)
A native-module ABI split over one shared better-sqlite3, made fatal by the spawn writing state.db directly from the CLI process:
- Terminal creation: startBuilderSession → getTowerClient().createTerminal() (spawn-worktree.ts) goes through the Tower daemon. Runs on the daemon's Node, succeeds → builder PTY is alive.
- Ownership write: upsertBuilder (agent-farm/state.ts:165) opens the workspace state.db in the afx CLI process via import Database from 'better-sqlite3' (state.ts:10). When the CLI's Node ABI ≠ the ABI the global better-sqlite3 was built for, this throws before the row is written.
Net: terminal exists in Tower, builders row never lands → the builder is live but invisible to afx, and the CLI is broken for the rest of that shell.
The ABI flips whenever something rebuilds better-sqlite3 from source (node-gyp → classic ABI pinned to one Node) instead of using the version-independent N-API prebuilt. New/bleeding-edge Node versions (where a prebuilt may be missing) are exactly where this bites.
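As a minimal sketch of how the mismatch could be recognized, the helper below pulls the two ABI numbers out of a loader error like the one quoted under Symptom and turns them into a readable message. The names parseAbiMismatch and explainLoadFailure are hypothetical and do not exist in the codebase; the regular expression is an assumption based on the quoted error text and on the longer wording Node prints for the same failure.

```typescript
// Hypothetical helper, not part of the codebase: extracts the ABI numbers
// from a native-module load error so the CLI can report them clearly.
interface AbiMismatch {
  builtFor: number; // ABI the .node binary was compiled against
  running: number;  // ABI the current Node process requires
}

function parseAbiMismatch(message: string): AbiMismatch | null {
  // Accepts both "... NODE_MODULE_VERSION 147, this Node requires 127" and
  // the longer form where NODE_MODULE_VERSION is repeated after "requires".
  const pattern =
    /NODE_MODULE_VERSION\s+(\d+)[\s\S]*?requires\s+(?:NODE_MODULE_VERSION\s+)?(\d+)/;
  const match = pattern.exec(message);
  if (match === null) return null;
  return { builtFor: Number(match[1]), running: Number(match[2]) };
}

function explainLoadFailure(message: string): string {
  const abi = parseAbiMismatch(message);
  if (abi === null) return message; // not an ABI error: pass through unchanged
  return (
    `better-sqlite3 was built for ABI ${abi.builtFor}, ` +
    `but this Node requires ABI ${abi.running}. ` +
    `Run afx under the Node it was installed with.`
  );
}
```

The running ABI is also available without parsing anything: Node exposes it as process.versions.modules, so a CLI could print it alongside the message.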
Recovery gap
afx spawn <id> --resume is not safe against the live-but-unregistered builder: on resume, spawn still calls startBuilderSession unconditionally (there is no existing-terminal/already-running guard), so it spawns a second PTY on the same worktree, leaving two builders racing one porch/git state. There is no command to re-register an existing terminal (writing just the missing row), and hand-editing state.db is disallowed. The current safe path is to stop the orphan and then run --resume under the daemon's Node, which is fragile and non-obvious.
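The missing guard can be sketched as a pure decision function: given the terminals Tower reports as live and the builder rows in state.db, decide what a resume should do for a worktree. Everything here is hypothetical (the planResume name, the LiveTerminal and BuilderRow shapes, and the assumption that a terminal can be matched to its worktree); it only illustrates the three cases described above.

```typescript
// Hypothetical shapes: the real Tower and state.db records may differ.
interface LiveTerminal {
  id: string;
  worktree: string; // assumed: a terminal can be tied back to its worktree
}

interface BuilderRow {
  builderId: string;
  terminalId: string;
}

type ResumeAction = "start-session" | "re-register" | "reuse";

function planResume(
  worktree: string,
  terminals: LiveTerminal[],
  rows: BuilderRow[],
): ResumeAction {
  const live = terminals.find((t) => t.worktree === worktree);
  if (live === undefined) {
    return "start-session"; // nothing running: starting a PTY is safe
  }
  const registered = rows.some((r) => r.terminalId === live.id);
  // A live terminal without a row is the half-spawn: write the row,
  // do not start a second PTY on the same worktree.
  return registered ? "reuse" : "re-register";
}
```

With such a check in front of startBuilderSession, the half-spawn case resolves to "re-register" rather than to a second PTY.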
Proposed fixes
- Route the ownership write through the Tower daemon (like terminal-creation already is), instead of opening better-sqlite3 directly in the CLI. This is the root fix:
  - the CLI never loads the native module → its Node version cannot break afx;
  - spawn becomes atomic: the daemon owns both terminal creation and registration, eliminating the terminal-exists-but-no-row half-state;
  - it aligns with the single-owner invariant (state.db has one owner: Tower).
- Fail loud on ABI mismatch. When better-sqlite3 fails to load, emit an actionable error ("built for ABI X, running ABI Y: use the install-time Node / npm rebuild") instead of the raw native error.
- Never rebuild the shared global better-sqlite3 during a spawn. If a worktree needs its own copy, build it worktree-local; a spawn must not mutate /…/node_modules/@cluesmith/codev/node_modules/better-sqlite3 out from under a running CLI.
- Recovery primitive (smaller): an afx path to re-register an already-running terminal (write the missing builders row, no new PTY), so a half-spawn is recoverable without stop-then-resume.
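To illustrate what "atomic" would mean for the first fix, here is a small in-memory model of a daemon-owned spawn. It is a sketch under stated assumptions, not the Tower implementation: TowerSketch, RowWriter and spawnBuilder are hypothetical names, and the injected writeRow callback stands in for the state.db write.

```typescript
// Hypothetical in-memory model; none of these names exist in the codebase.
// RowWriter stands in for the state.db write that registers a builder.
type RowWriter = (builderId: string, terminalId: string) => void;

class TowerSketch {
  private readonly terminals = new Set<string>();
  private readonly registered = new Map<string, string>(); // builderId -> terminalId
  private readonly writeRow: RowWriter;
  private nextTerminal = 1;

  constructor(writeRow: RowWriter) {
    this.writeRow = writeRow;
  }

  // Create the terminal and register it as one unit: if the row write fails,
  // the terminal is torn down so no live-but-unregistered builder remains.
  spawnBuilder(builderId: string): string {
    const terminalId = `term-${this.nextTerminal++}`;
    this.terminals.add(terminalId);
    try {
      this.writeRow(builderId, terminalId);
    } catch (err) {
      this.terminals.delete(terminalId); // roll back the half-spawn
      throw err;
    }
    this.registered.set(builderId, terminalId);
    return terminalId;
  }

  liveTerminals(): number {
    return this.terminals.size;
  }

  isRegistered(builderId: string): boolean {
    return this.registered.has(builderId);
  }
}
```

Because both steps run in the same process, a failed write leaves nothing behind, and the CLI only ever sees success or a clean failure.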
Notes
- Reported by an external adopter running a multi-architect workspace; specifics scrubbed.
- Empirically, refreshing the global better-sqlite3 to its N-API prebuilt (a normal reinstall) makes it load under both Node versions and clears the symptom, but that is incidental, not a structural fix: the next from-source rebuild re-breaks it.
- Code refs: agent-farm/state.ts:10,165; agent-farm/commands/spawn.ts (resume path, no existing-terminal guard); agent-farm/commands/spawn-worktree.ts (startBuilderSession → Tower client).