[Bugfix #1094] Fail loud on unverifiable builder id instead of silent misroute to main#1095
Merged
Conversation
…ilent misroute detectCurrentBuilderId() silently fell back to the bare worktree directory name (e.g. bugfix-2461) whenever it could not read the workspace state.db inside a confirmed .builders/<id>/ context. That non-canonical id matches no builders.id, so Tower's affinity resolver (lookupBuilderSpawningArchitect → undefined) drops to the "non-builder sender → main first" branch and a builder's `afx send architect` is silently delivered to main. The real trigger was a Node ABI mismatch making `new Database()` throw, but any state.db open failure (corruption, permissions, lock) produced the same silent misroute. Fix: - detectCurrentBuilderId now throws BuilderIdResolutionError on all three unverifiable paths (state.db missing / unopenable / no matching row). DB-open failures get an actionable message that names the likely better-sqlite3 ABI mismatch and points at reinstalling codev. - send() wraps the detection and calls fatal() on throw, so afx send aborts loudly rather than shipping an unverified `from`. - Tower-side defense-in-depth: resolveAgentInWorkspace warns when a builder-shaped sender has no state.db row (and isn't a registered architect) before falling through to main. Tests: - bugfix-774 fallback tests now assert throws; added DB-open-failure regression + describeStateDbOpenFailure message tests. - new bugfix-1094-tower-guard test for looksLikeBuilderId + the warn path. - send.test.ts: isolate from CWD-based builder detection (chdir to tmpdir) so `from` resolves deterministically; it was built on the old silent fallback. Fixes #1094
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
detectCurrentBuilderId()(packages/codev/src/agent-farm/commands/send.ts) silently fell back to the bare worktree directory name (e.g.bugfix-2461) whenever it could not resolve the canonical builder id from the workspacestate.db— even though CWD was confirmed to be inside.builders/<id>/. That bare name matches nobuilders.id(canonical:builder-bugfix-2461), so Tower's affinity resolver (lookupBuilderSpawningArchitect→undefined) drops to the "non-builder sender → main first" branch and the builder'safx send architectis silently delivered tomaininstead of its spawning architect.This violates "fail fast, never implement fallbacks": a fatal environmental fault was laundered into a subtle, plausible-looking misroute.
Root Cause
In a confirmed
.builders/<id>/context, three paths returned the bare worktree name:!existsSync(dbPath)— workspacestate.dbmissingcatch { return worktreeDirName }—new Database()threw (the real incident: a Node ABI mismatch, ABI 127 on PATH vs ABI 147 the native module was built for)buildersrow"Can't read the DB" is an error condition, not a "this isn't a builder" condition — but the function treated both identically.
Fix
detectCurrentBuilderIdfails loud. It now throwsBuilderIdResolutionErroron all three unverifiable paths instead of returning a bare name. The DB-open-failure message names the likely better-sqlite3 ABI mismatch (NODE_MODULE_VERSION) and points at reinstalling codev under the current node. Contract is now crisp: returns the canonical id, returnsnull(not a builder), or throws (is a builder but id unverifiable).send()aborts loudly. It wraps the detection and callsfatal()on throw, soafx senderrors with a clear message rather than shipping an unverifiedfromthat Tower silently routes tomain.resolveAgentInWorkspacenow emits aconsole.warnwhen a builder-shapedsenderhas nostate.dbrow (and isn't a currently-registered architect) before falling through tomain— so any other source of an unverified builder-like sender becomes visible instead of silent.The happy path is unchanged: with a readable
state.db, the canonical id resolves exactly as before (verified against the real main workspacestate.db— this worktree resolves tobuilder-bugfix-1094).Out of scope (noted for follow-up): proposal item #4's broader "one clear ABI message across the whole CLI (e.g.
afx status)" — this PR surfaces the actionable ABI message on theafx sendpath where the bug lives.Test Plan
bugfix-774-detect-builder-id.test.ts: the two tests that encoded the bare-name fallback now assert aBuilderIdResolutionErrorthrow; added a DB-open-failure regression test (unopenablestate.db⇒ throws, does not return the bare name) plusdescribeStateDbOpenFailuremessage tests (ABI hint vs generic hint).bugfix-1094-tower-guard.test.ts(new):looksLikeBuilderIdheuristic + the warn-then-main-fallback path; asserts no spurious warning for architect senders or no-sender.send.test.ts: isolated from CWD-based builder detection (runs fromtmpdir()) sofromresolves deterministically toarchitect; it was previously built on the now-removed silent fallback.porch check(build + tests) passes.Related: #758 (deferred end-to-end Tower-process affinity-routing test) — an E2E test exercising
afx send architectfrom a real builder worktree would have caught this.Fixes #1094