
afx spawn: CLI-side state.db write crashes on node ABI mismatch → half-spawned, unregistered builder #1093

Description

@waleedkadous

Symptom

On a box with two Node versions (an interactive shell on Node A, the Tower daemon + builders on Node B), afx spawn <id> --protocol … half-completes: worktree created, porch initialized, builder PTY running — but the builder is never registered (absent from afx status, no spawned_by_architect row). Every subsequent afx in that shell then fails with:

better_sqlite3.node compiled against NODE_MODULE_VERSION 147, this Node requires 127

Root cause (verified in code)

A native-module ABI split over one shared better-sqlite3, made fatal by the spawn writing state.db directly from the CLI process:

  • Terminal creation: startBuilderSession → getTowerClient().createTerminal() (spawn-worktree.ts) goes through the Tower daemon. Runs on the daemon's Node, succeeds → builder PTY is alive.
  • Ownership write: upsertBuilder (agent-farm/state.ts:165) opens the workspace state.db in the afx CLI process via import Database from 'better-sqlite3' (state.ts:10). When the CLI's Node ABI ≠ the ABI the global better-sqlite3 was built for, this throws before the row is written.

Net: terminal exists in Tower, builders row never lands → the builder is live but invisible to afx, and the CLI is broken for the rest of that shell.
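
A minimal model of the failure sequence above, for illustration only. The types and function names here (TowerModel, StateDbModel, spawnBuilder) are hypothetical in-memory stand-ins, not the real codev APIs; the point is only that step 1 has no rollback when step 2 throws.

```typescript
// Hypothetical in-memory stand-ins; none of these names come from the codebase.
interface TowerModel {
  terminals: Set<string>; // builder ids with a live PTY in the daemon
}

interface StateDbModel {
  builders: Set<string>; // builder ids with a row in state.db
  cliAbiMatches: boolean; // false models the CLI failing to load the native module
}

type SpawnOutcome = "registered" | "half-spawned";

function createTerminal(tower: TowerModel, builderId: string): void {
  // Runs inside the daemon, so the CLI's Node version is irrelevant here.
  tower.terminals.add(builderId);
}

function upsertBuilder(db: StateDbModel, builderId: string): void {
  if (!db.cliAbiMatches) {
    // Stands in for the native-module load failure in the CLI process.
    throw new Error("NODE_MODULE_VERSION mismatch");
  }
  db.builders.add(builderId);
}

function spawnBuilder(
  tower: TowerModel,
  db: StateDbModel,
  builderId: string,
): SpawnOutcome {
  createTerminal(tower, builderId); // step 1: succeeds via the daemon
  try {
    upsertBuilder(db, builderId); // step 2: runs in the CLI, may throw
    return "registered";
  } catch {
    // Nothing undoes step 1: the terminal stays alive with no builders row.
    return "half-spawned";
  }
}
```

With cliAbiMatches set to false, the model ends with the builder id present in terminals and absent from builders, which is the live-but-invisible state described above.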

The ABI flips whenever something rebuilds better-sqlite3 from source (node-gyp → classic ABI pinned to one Node) instead of using the version-independent N-API prebuilt. New/bleeding-edge Node versions (where a prebuilt may be missing) are exactly where this bites.

Recovery gap

afx spawn <id> --resume is not safe against the live-but-unregistered builder: on resume, spawn still calls startBuilderSession unconditionally (there is no existing-terminal/already-running guard), so it spawns a second PTY on the same worktree, leaving two builders racing one porch/git state. There is no command to re-register an existing terminal (writing just the missing row), and hand-editing state.db is disallowed. The current safe path is to stop the orphaned builder and then run --resume under the daemon's Node, which is fragile and non-obvious.
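
One way to picture the missing guard is as a small decision table over two observations: whether Tower reports a live terminal, and whether the builders row exists. The sketch below is hypothetical (BuilderObservation, ResumeAction and decideResume are illustrative names, and the chosen actions are an assumption about what a guarded resume might do), not the current spawn.ts behaviour.

```typescript
// Hypothetical model of a guarded resume; not the actual spawn.ts logic.
type ResumeAction = "start-session" | "re-register" | "already-running";

interface BuilderObservation {
  terminalAlive: boolean; // Tower reports a live PTY for this worktree
  rowExists: boolean; // state.db has a builders row for it
}

function decideResume(obs: BuilderObservation): ResumeAction {
  if (obs.terminalAlive && obs.rowExists) {
    return "already-running"; // fully registered, nothing to start
  }
  if (obs.terminalAlive) {
    return "re-register"; // half-spawn: write the missing row, no new PTY
  }
  return "start-session"; // no live terminal, so starting one cannot race
}
```

Today's behaviour corresponds to returning "start-session" in every case, which is what produces the second PTY when the terminal is already alive.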

Proposed fixes

  1. Route the ownership write through the Tower daemon (like terminal-creation already is), instead of opening better-sqlite3 directly in the CLI. This is the root fix:
    • the CLI never loads the native module → its Node version cannot break afx;
    • spawn becomes atomic: the daemon owns both terminal creation and registration, eliminating the terminal-exists-but-no-row half-state;
    • it aligns with the single-owner invariant (state.db has one owner: Tower).
  2. Fail loud on ABI mismatch. When better-sqlite3 fails to load, emit an actionable error ("built for ABI X, running ABI Y — use the install-time Node / npm rebuild") instead of the raw native error.
  3. Never rebuild the shared global better-sqlite3 during a spawn. If a worktree needs its own copy, build it worktree-local; a spawn must not mutate /…/node_modules/@cluesmith/codev/node_modules/better-sqlite3 out from under a running CLI.
  4. Recovery primitive (smaller): an afx path to re-register an already-running terminal (write the missing builders row, no new PTY), so a half-spawn is recoverable without stop-then-resume.
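
A rough sketch of what fix 2 could look like: translate the raw native-module error into the actionable message. The function names (parseAbiMismatch, describeLoadFailure) are hypothetical, and the regular expressions assume the error text contains "compiled against … NODE_MODULE_VERSION X" and "requires … Y" as in the symptom above; the exact wording Node emits may differ between versions.

```typescript
// Hypothetical helpers; the message patterns are assumptions based on the
// error text quoted in this issue.
interface AbiMismatch {
  builtFor: number; // ABI the native module was compiled against
  running: number; // ABI the current Node requires
}

function parseAbiMismatch(message: string): AbiMismatch | null {
  const built = /compiled against [\s\S]*?NODE_MODULE_VERSION\s+(\d+)/.exec(message);
  const running = /requires\s+(?:NODE_MODULE_VERSION\s+)?(\d+)/.exec(message);
  if (!built || !running) {
    return null; // not an ABI mismatch, or wording we do not recognise
  }
  return { builtFor: Number(built[1]), running: Number(running[1]) };
}

function describeLoadFailure(err: unknown): string {
  const message = err instanceof Error ? err.message : String(err);
  const abi = parseAbiMismatch(message);
  if (abi === null) {
    return message; // fall back to the raw error unchanged
  }
  return (
    `better-sqlite3 was built for ABI ${abi.builtFor}, but this Node runs ABI ${abi.running}. ` +
    `Use the install-time Node, or run npm rebuild.`
  );
}
```

A caller would wrap the point where the native module is first loaded in try/catch and print describeLoadFailure(err) instead of the raw stack. Unrecognised errors pass through unchanged, so the wrapper never hides an unrelated failure.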

Notes

  • Reported by an external adopter running a multi-architect workspace; specifics scrubbed.
  • Empirically, refreshing the global better-sqlite3 to its N-API prebuilt (a normal reinstall) makes it load under both Node versions and clears the symptom — but that is incidental, not a structural fix: the next from-source rebuild re-breaks it.
  • Code refs: agent-farm/state.ts:10,165; agent-farm/commands/spawn.ts (resume path, no existing-terminal guard); agent-farm/commands/spawn-worktree.ts (startBuilderSession → Tower client).

Metadata


Labels: area/tower (Area: Tower server / agent farm CLI)
