diff --git a/.changeset/t12009-gateway-autostart-entry-path.md b/.changeset/t12009-gateway-autostart-entry-path.md new file mode 100644 index 000000000..1326bd7ca --- /dev/null +++ b/.changeset/t12009-gateway-autostart-entry-path.md @@ -0,0 +1,18 @@ +--- +id: t12009-gateway-autostart-entry-path +tasks: [T12009] +kind: fix +summary: Layout-proof CLI entry resolution for gateway auto-start — fixes MODULE_NOT_FOUND in packaged installs +--- + +Fixes a P0 regression where `cleo` (bare, on a TTY) always printed "daemon gateway is not reachable" in every packaged npm install. The auto-started gateway child died instantly with `Cannot find module '/home/.../.npm-global/lib/dist/cli/index.js'`. + +**Root cause (T12009):** `resolveCliEntryPath()` used `join(dirname(import.meta.url), '..', '..', '..', '..', '..')` — correct in the source tree where `src/cli/lib/` is 3 directories deep, but the esbuild bundle inlines ALL source into a single file at `/dist/cli/index.js`. Inside the bundle `import.meta.url` IS that entry file, so `dirname` is `/dist/cli/` (only 2 hops to the package root). Five `..` from there escapes the package entirely, landing at the npm prefix `lib/` directory. + +**Fix:** Replace the fixed-depth `..` arithmetic with a walk up the directory tree, looking for the nearest ancestor `package.json` with `name === "@cleocode/cleo"`. The resolver also calls `realpathSync` on the start path so symlinked global-bin installs (`.npm-global/bin/cleo → ../lib/node_modules/@cleocode/cleo/bin/cleo.js`) resolve correctly. A descriptive error is thrown (and caught/surfaced gracefully) when the bundle is missing. + +**Observability (T12009 AC):** `runCockpit` previously discarded `spawnResult.reason` entirely. It now surfaces it as ` auto-start: ` and also appends the last line of `/gateway.err` when readable — that one line would have made T12009 a 1-minute diagnosis. + +**Regression tests:** Four new tests in `gateway-auto-start.test.ts` exercise the resolver against synthetic package hierarchies in a tmp directory: (a) bundled `dist/cli/` layout, (b) symlinked global-bin entry, (c) missing `dist/cli/index.js` → descriptive throw, (d) no `@cleocode/cleo` package found → descriptive throw. All 26 tests pass. + +**Live proof:** Bare `cleo` in a TTY on this machine boots the gateway (`ss -ltn` confirms port 7777 accepting) and renders the full Kanban board ("CLEO Cockpit — Kanban (498 tasks)") without the "not reachable" message. diff --git a/docs/plan/dogfood-harness-question-ledger.md b/docs/plan/dogfood-harness-question-ledger.md index 557948fcf..5eb5d2b1f 100644 --- a/docs/plan/dogfood-harness-question-ledger.md +++ b/docs/plan/dogfood-harness-question-ledger.md @@ -1282,3 +1282,21 @@ Owner surface: CORE provider/model selection — `selectProviderModel` candidate Observed: the provisioning-aware selector (DHQ-081's fix, `T11978`) chose `wouldPickModel=gpt-5.5-pro` (catalog-newest ranking) but the `codex_responses` ChatGPT-account wire rejects it with a 400 — the capability table for that transport only supports `gpt-5.5` and the `gpt-5.5-pro` variant is not served on the ChatGPT-account-backed Codex endpoint. Fix-gen degraded to no-patch as a result. Workaround active: `cleo llm use openai --model gpt-5.5` (a config pin that wins unconditionally in the selector). Answer vehicle: a wire/account capability table that constrains the candidate model set before `release_date` ranking — each transport/wire reports which model IDs it can actually serve; the cross-provider selector filters to the intersection before picking the newest. Remove the manual pin once capability-aware selection is in place. + +### DHQ-096 — bare `cleo` auto-start dead in packaged installs — `resolveCliEntryPath` hardcoded source-tree depth, child died with `MODULE_NOT_FOUND` — **FIXED in this PR** (`T12009`) + +Owner surface: CORE gateway auto-start — `packages/cleo/src/cli/lib/gateway-auto-start.ts` `resolveCliEntryPath`. + +Observed live on v2026.6.15 (`~/.local/state/cleo/gateway.err`): `Cannot find module '/home/keatonhoskins/.npm-global/lib/dist/cli/index.js'`. The auto-started gateway child died instantly on every packaged install; `cleo` (bare, TTY) always printed "daemon gateway is not reachable". Root cause: `resolveCliEntryPath` computed `join(dirname(import.meta.url), '..', '..', '..', '..', '..')` — written for the source layout (`src/cli/lib/` is 3 dirs inside the package), but the esbuild bundle inlines ALL source into `/dist/cli/index.js`. From inside the bundle `import.meta.url` IS `/dist/cli/index.js`, so 5 `..` escapes the package and lands at `.npm-global/lib/`, not the package root. A secondary observability gap: `spawnResult.reason` was discarded by `runCockpit`, so the error was invisible even when looking at the TUI output. + +Fix (`T12009` / PR): replaced fixed-depth arithmetic with a walk up the directory tree to the nearest ancestor `package.json` with `name === "@cleocode/cleo"`, then return `/dist/cli/index.js`. `realpathSync` resolves symlinked global-bin layouts before the walk. Descriptive error surfaced when dist is absent or package not found. `runCockpit` now emits ` auto-start: ` and the last line of `gateway.err` when the gateway stays unreachable. Four regression tests cover (a) bundled layout, (b) symlinked-bin, (c) missing-dist → throw, (d) no package → throw. Live proof: `CLEO Cockpit — Kanban (498 tasks)` rendered; port 7777 confirmed accepting. + +Meta-lesson (DHQ-086 class): merging a batteries-included AC (`T11980`) without a packaged-install e2e of the new spawn path means the regression only surfaces in production. + +### DHQ-097 — `cleo login` hangs after successful OAuth — process never exits — **OPEN** (`T12010` ← `T11679`) + +Owner surface: CORE OAuth login — process exit after token storage. + +Observed: after `cleo login anthropic` completes the browser-approve → token-exchange flow and stores the `sk-ant-oat` credential, the process hangs indefinitely. The user must Ctrl-C to return to the shell. The token IS stored (subsequent `cleo llm test` works), but the CLI process never calls `process.exit()` after the flow completes. Likely cause: an open handle (event emitter, timer, or HTTP server from the OAuth callback listener) that is not torn down after token storage. + +Answer vehicle: identify the open handle via `--expose-gc` / `wtfnode` in dev, ensure the OAuth callback HTTP server is `server.close()`d and all timers are cleared after the token is stored, and add a process exit path (`process.exit(0)`) as a final backstop if handles remain open after a 500 ms grace period. Regression test: run `cleo login anthropic` with a mock token-exchange endpoint and assert the process exits within 2 seconds of the exchange completing. diff --git a/packages/cleo/src/cli/lib/__tests__/gateway-auto-start.test.ts b/packages/cleo/src/cli/lib/__tests__/gateway-auto-start.test.ts index 722d389da..7bf51b852 100644 --- a/packages/cleo/src/cli/lib/__tests__/gateway-auto-start.test.ts +++ b/packages/cleo/src/cli/lib/__tests__/gateway-auto-start.test.ts @@ -1,11 +1,12 @@ /** - * Tests for the gateway auto-start-on-demand helper (T11980). + * Tests for the gateway auto-start-on-demand helper (T11980 · T12009). * * Covers: * - {@link shouldAutoStartGateway} — pure config-flag reader * - {@link probePort} — TCP port availability check * - {@link pollPort} — exponential-backoff polling loop * - {@link spawnGatewayIfDown} — spawn path (mocked child_process) + * - {@link resolveCliEntryPath} — layout-proof CLI entry resolution (T12009) * * Integration: the actual spawn is NOT exercised in unit tests (that would * require a compiled CLI bundle and a free port). The spawn path is exercised @@ -13,16 +14,22 @@ * binds a TCP server so the port probe succeeds. * * @task T11980 + * @task T12009 */ +import * as fs from 'node:fs'; import * as net from 'node:net'; -import { afterAll, beforeAll, describe, expect, it } from 'vitest'; +import * as os from 'node:os'; +import * as path from 'node:path'; +import { pathToFileURL } from 'node:url'; +import { afterAll, afterEach, beforeAll, describe, expect, it } from 'vitest'; import { GATEWAY_DEFAULT_HOST, GATEWAY_DEFAULT_PORT, GATEWAY_WAIT_TIMEOUT_MS, pollPort, probePort, + resolveCliEntryPath, shouldAutoStartGateway, spawnGatewayIfDown, } from '../gateway-auto-start.js'; @@ -264,6 +271,127 @@ describe('spawnGatewayIfDown', () => { }); }); +// --------------------------------------------------------------------------- +// resolveCliEntryPath — layout-proof entry resolution (T12009) +// +// These tests create minimal synthetic package hierarchies in a tmp directory +// to verify the resolver works in the two layouts that matter: +// (a) packaged/bundled layout: /dist/cli/index.js (the bundle itself) +// (b) symlinked global-bin: .npm-global/bin/cleo.js → real pkg entry +// (c) missing dist → descriptive throw +// +// We inject the `startUrl` parameter added for testability — the resolver +// uses `import.meta.url` by default (the real production path). +// --------------------------------------------------------------------------- + +describe('resolveCliEntryPath (T12009 regression)', () => { + /** Temporary root directory cleaned up after each test. */ + let tmpRoot = ''; + + /** + * Create a minimal synthetic @cleocode/cleo package under `root` at the + * given relative path. The package.json is written with `name: "@cleocode/cleo"`. + * If `withDist` is true, the `dist/cli/index.js` placeholder is also created. + * + * @returns The absolute path to the package root. + */ + function makePackage(root: string, relativePkgPath: string, withDist = true): string { + const pkgRoot = path.join(root, relativePkgPath); + fs.mkdirSync(pkgRoot, { recursive: true }); + fs.writeFileSync( + path.join(pkgRoot, 'package.json'), + JSON.stringify({ name: '@cleocode/cleo', version: '0.0.0' }), + ); + if (withDist) { + const distCli = path.join(pkgRoot, 'dist', 'cli'); + fs.mkdirSync(distCli, { recursive: true }); + fs.writeFileSync(path.join(distCli, 'index.js'), '// placeholder'); + } + return pkgRoot; + } + + beforeAll(() => { + tmpRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'cleo-entry-test-')); + }); + + afterEach(() => { + // Clean up sub-directories created inside tmpRoot between tests, but keep + // the root so beforeAll's mkdtemp does not need to be re-run. + for (const entry of fs.readdirSync(tmpRoot)) { + fs.rmSync(path.join(tmpRoot, entry), { recursive: true, force: true }); + } + }); + + afterAll(() => { + fs.rmSync(tmpRoot, { recursive: true, force: true }); + }); + + it('(a) resolves dist/cli/index.js from inside a bundled dist/cli/ layout', () => { + // Simulate the esbuild bundle layout: + // /lib/node_modules/@cleocode/cleo/{package.json,dist/cli/index.js} + // The "calling module" is the bundle itself: dist/cli/index.js (2 hops to pkg root). + const pkgRoot = makePackage(tmpRoot, 'lib/node_modules/@cleocode/cleo'); + const bundleFile = path.join(pkgRoot, 'dist', 'cli', 'index.js'); + const startUrl = pathToFileURL(bundleFile).href; + + const resolved = resolveCliEntryPath(startUrl); + expect(resolved).toBe(bundleFile); + }); + + it('(b) resolves correctly when the start URL is a symlinked global-bin entry', () => { + // Simulate: + // .npm-global/lib/node_modules/@cleocode/cleo/ ← real package + // .npm-global/bin/cleo.js → ../lib/node_modules/@cleocode/cleo/bin/cleo.js + // The resolver must follow the symlink and walk up to find the package root. + const pkgRoot = makePackage(tmpRoot, 'npm-global/lib/node_modules/@cleocode/cleo'); + + // Create a "bin" entry inside the package (real file, not the symlink target itself). + const binDir = path.join(pkgRoot, 'bin'); + fs.mkdirSync(binDir, { recursive: true }); + const binEntry = path.join(binDir, 'cleo.js'); + fs.writeFileSync(binEntry, '#!/usr/bin/env node\n// bin shim'); + + // Create the global-bin symlink pointing at the bin entry. + const globalBinDir = path.join(tmpRoot, 'npm-global', 'bin'); + fs.mkdirSync(globalBinDir, { recursive: true }); + const symlink = path.join(globalBinDir, 'cleo.js'); + fs.symlinkSync(binEntry, symlink); + + // The resolver receives the symlink path (as node would see import.meta.url + // of a file reached via a symlinked bin). + const startUrl = pathToFileURL(symlink).href; + const expected = path.join(pkgRoot, 'dist', 'cli', 'index.js'); + + const resolved = resolveCliEntryPath(startUrl); + expect(resolved).toBe(expected); + }); + + it('(c) throws a descriptive error when dist/cli/index.js is absent', () => { + // Package exists but dist/ was never built. + const pkgRoot = makePackage(tmpRoot, 'no-dist-pkg', /* withDist */ false); + // The calling file is inside the package src/ tree. + const fakeSrc = path.join(pkgRoot, 'src', 'cli', 'lib', 'gateway-auto-start.js'); + fs.mkdirSync(path.dirname(fakeSrc), { recursive: true }); + fs.writeFileSync(fakeSrc, ''); + const startUrl = pathToFileURL(fakeSrc).href; + + expect(() => resolveCliEntryPath(startUrl)).toThrow( + /dist\/cli\/index\.js is missing|dist\/cli\/index\.js/, + ); + }); + + it('(d) throws a descriptive error when no @cleocode/cleo package.json is found', () => { + // Isolated tmp dir with no package.json containing the right name. + const orphanDir = path.join(tmpRoot, 'orphan', 'deep', 'nested'); + fs.mkdirSync(orphanDir, { recursive: true }); + const fakeSrc = path.join(orphanDir, 'index.js'); + fs.writeFileSync(fakeSrc, ''); + const startUrl = pathToFileURL(fakeSrc).href; + + expect(() => resolveCliEntryPath(startUrl)).toThrow(/@cleocode\/cleo|package root/); + }); +}); + // --------------------------------------------------------------------------- // Constants sanity // --------------------------------------------------------------------------- diff --git a/packages/cleo/src/cli/lib/gateway-auto-start.ts b/packages/cleo/src/cli/lib/gateway-auto-start.ts index 2de725f40..4489f209e 100644 --- a/packages/cleo/src/cli/lib/gateway-auto-start.ts +++ b/packages/cleo/src/cli/lib/gateway-auto-start.ts @@ -28,10 +28,11 @@ * * @module * @task T11980 + * @task T12009 */ import { once } from 'node:events'; -import { createWriteStream } from 'node:fs'; +import { createWriteStream, existsSync, readFileSync, realpathSync } from 'node:fs'; import { mkdir } from 'node:fs/promises'; import * as net from 'node:net'; import { dirname, join } from 'node:path'; @@ -60,6 +61,9 @@ const POLL_INITIAL_DELAY_MS = 100; /** Maximum poll interval cap in ms. */ const POLL_MAX_DELAY_MS = 1_000; +/** The npm package name this resolver targets. */ +const CLEO_PACKAGE_NAME = '@cleocode/cleo'; + // --------------------------------------------------------------------------- // Types // --------------------------------------------------------------------------- @@ -185,21 +189,73 @@ export async function pollPort(port: number, host: string, timeoutMs: number): P /** * Resolve the absolute path to the compiled CLI bundle (`dist/cli/index.js`). * - * Walks upward from this compiled module to find `dist/cli/index.js` within - * the same package. Works in both source (`src/`) and compiled (`dist/`) trees - * because the relative distance from this file to the package root is the same. + * Walks **upward** from the given `startUrl` until it finds a `package.json` + * whose `"name"` field equals `"@cleocode/cleo"`, then returns + * `/dist/cli/index.js`. + * + * ## Why the walk, not fixed-depth `..` arithmetic? + * + * The original implementation used `join(thisDir, '..', '..', '..', '..', '..')` + * — valid in the **source tree** (`src/cli/lib/` is 3 dirs deep inside the + * package), but the esbuild **bundle** inlines ALL modules into a single file + * at `/dist/cli/index.js`. From inside the bundle, + * `import.meta.url` IS `/dist/cli/index.js`, so `dirname` is + * `/dist/cli/` (only 2 hops to the package root). Five `..` from there + * escapes the package entirely and reaches the npm prefix's `lib/` directory, + * giving the child process a path that does not exist (T12009). + * + * Walking to the nearest ancestor `package.json` with the right name is + * layout-proof: it finds the same root regardless of whether the module is + * bundled or unbundled, and regardless of the npm prefix or symlink topology. * - * @returns Absolute path to the CLI entry-point JS file. - * @internal + * The start-path is resolved through `realpathSync` so symlinked global-bin + * installs (`.npm-global/bin/cleo → ../lib/node_modules/@cleocode/cleo/bin/cleo.js`) + * anchor to the real package directory before the walk begins. + * + * @param startUrl - `import.meta.url` of the calling module. Injectable for + * tests so the resolver can be exercised against a synthetic file hierarchy + * without a real install. Defaults to this module's own `import.meta.url`. + * @returns Absolute path to `/dist/cli/index.js`. + * @throws {Error} When no `package.json` with `name === "@cleocode/cleo"` is + * found within 20 ancestor directories, or when the resolved + * `dist/cli/index.js` does not exist on disk. */ -function resolveCliEntryPath(): string { - // __dirname in ESM → dirname(fileURLToPath(import.meta.url)) - const thisDir = dirname(fileURLToPath(import.meta.url)); - // From packages/cleo/src/cli/lib/ → packages/cleo/ is 4 levels up - // From packages/cleo/dist/cli/lib/ → packages/cleo/ is 4 levels up - // We go up to the package root then descend to dist/cli/index.js - const pkgRoot = join(thisDir, '..', '..', '..', '..', '..'); - return join(pkgRoot, 'dist', 'cli', 'index.js'); +export function resolveCliEntryPath(startUrl: string = import.meta.url): string { + // Resolve symlinks so packaged global-bin layouts anchor to the real file. + const startFile = realpathSync(fileURLToPath(startUrl)); + let current = dirname(startFile); + + for (let i = 0; i < 20; i++) { + const pkgJsonPath = join(current, 'package.json'); + if (existsSync(pkgJsonPath)) { + let pkgName: string | undefined; + try { + const raw = readFileSync(pkgJsonPath, 'utf8'); + const parsed = JSON.parse(raw) as Record; + pkgName = typeof parsed['name'] === 'string' ? parsed['name'] : undefined; + } catch { + // Malformed package.json — skip this candidate and keep walking. + } + if (pkgName === CLEO_PACKAGE_NAME) { + const entry = join(current, 'dist', 'cli', 'index.js'); + if (!existsSync(entry)) { + throw new Error( + `@cleocode/cleo package root found at ${current} but dist/cli/index.js is missing — ` + + `run \`pnpm --filter @cleocode/cleo run build\` to generate it`, + ); + } + return entry; + } + } + const parent = dirname(current); + if (parent === current) break; // filesystem root — nowhere left to walk + current = parent; + } + + throw new Error( + `could not locate ${CLEO_PACKAGE_NAME} package root starting from ${startFile} — ` + + `checked up to 20 ancestor directories`, + ); } // --------------------------------------------------------------------------- diff --git a/packages/cleo/src/cli/lib/tui/cockpit.ts b/packages/cleo/src/cli/lib/tui/cockpit.ts index efcb7933c..4d15c634f 100644 --- a/packages/cleo/src/cli/lib/tui/cockpit.ts +++ b/packages/cleo/src/cli/lib/tui/cockpit.ts @@ -33,6 +33,8 @@ * @epic T11916 */ +import { readFileSync } from 'node:fs'; +import { join } from 'node:path'; import { createCleoClient } from '@cleocode/core/gateway-client'; import { type SpawnGatewayOptions, spawnGatewayIfDown } from '../gateway-auto-start.js'; import { @@ -573,6 +575,9 @@ export async function runCockpit( // Derive port/host from baseUrl for the auto-start probe. let gatewayAutoStarted = false; const autoStart = options.autoStart !== false; // default true + // Capture the spawn failure reason so we can surface it when the gateway + // remains unreachable (T12009: previously discarded, making diagnosis hard). + let autoStartFailReason: string | undefined; // 1. Fetch home data through the SDK. null ⇒ daemon unreachable. let rows = await fetchTaskRows(baseUrl); @@ -599,12 +604,35 @@ export async function runCockpit( gatewayAutoStarted = spawnResult.spawned; // Re-fetch now that the gateway is up. rows = await fetchTaskRows(baseUrl); + } else { + // Preserve the failure reason for the user-facing error block below. + autoStartFailReason = spawnResult.reason; } } if (rows === null) { sink('CLEO cockpit: the daemon gateway is not reachable.'); sink(` Tried: ${baseUrl}/v1`); + // Surface the auto-start failure reason so the next such defect is + // diagnosable in one run (T12009 — previously this was silently discarded). + if (autoStartFailReason !== undefined) { + sink(` auto-start: ${autoStartFailReason}`); + } + // Append the last line of gateway.err (the child's stderr) when readable — + // that one line would have made T12009 a 1-minute diagnosis. + try { + const { getCleoLogDir } = await import('@cleocode/core'); + const errPath = join(getCleoLogDir(), 'gateway.err'); + const errContent = readFileSync(errPath, 'utf8').trimEnd(); + if (errContent.length > 0) { + const lastLine = errContent.split('\n').at(-1) ?? ''; + if (lastLine.length > 0) { + sink(` gateway.err: ${lastLine}`); + } + } + } catch { + // Non-fatal: log file absent or unreadable — skip silently. + } sink(' Start it with: cleo daemon serve'); sink(' (override the target with: cleo tui --base-url )'); return { outcome: 'daemon-down', baseUrl, piTui: false, gatewayAutoStarted };