bug: sf agent start fails with 'Session exited before init' when invoked from inside a Claude Code session (CLAUDECODE inherited) #103

@komoreka

Summary

`sf agent start <id> --prompt "..."` (and any daemon-driven worker spawn) fails with `Failed to start agent: Session exited before init` when stoneforge is invoked from inside a Claude Code session. This covers any call to `sf` from the Claude Code bash tool and any director/orchestrator running inside Claude Code that dispatches workers.

The cryptic message hides a specific subprocess error from the Anthropic SDK:

Error: Claude Code cannot be launched inside another Claude Code session.
Nested sessions share runtime resources and will crash all active sessions.
To bypass this check, unset the CLAUDECODE environment variable.

The actual cause: packages/smithy/src/providers/claude/headless.ts:334 and packages/smithy/src/providers/claude/interactive.ts:123 explicitly set `CLAUDECODE: '1'` in the env passed to the spawned claude subprocess. Recent versions of the claude binary (verified on 2.1.123) read CLAUDECODE and refuse to start with the error above. The Anthropic SDK surfaces it as `Claude Code process exited with code 1`, and the stoneforge spawner's `waitForInit` swallows it as `Session exited before init` (#102 covers the diagnostic gap).

Why this regressed

Issue #32 (closed 2026-03-22) introduced `CLAUDECODE: '1'` to fix Windows session permission/spawn issues. The intent was "signal to claude that this is a managed session". The current claude binary interprets the variable the opposite way: "if CLAUDECODE is set, refuse to nest". Since stoneforge IS the parent (the spawned claude is a fresh top-level session, not nested), having CLAUDECODE in the child env is wrong.

This regression became visible as Claude Code adoption grew among contributors and operators. Anyone running `sf` from Claude Code's bash tool — which includes most developers using Claude Code Desktop or CLI — hits this on every spawn.

Repro

From inside a Claude Code session (any platform with a recent claude):

cd /path/to/any-stoneforge-project
sf agent register testworker --role worker
sf agent start <returned-id> --prompt "hi"
# Error: Failed to start agent: Session exited before init

To see the underlying error, wrap the SDK spawn:

import { query } from '@anthropic-ai/claude-agent-sdk';
import { spawn as cpSpawn } from 'node:child_process';

// ... InputQueue boilerplate ...

const opts = {
  cwd: '/path/to/project',
  env: { ...process.env, CLAUDECODE: '1' },  // reproduces stoneforge's env
  permissionMode: 'bypassPermissions',
  // Spawn the claude process ourselves so its stderr can be observed.
  spawnClaudeCodeProcess: (spawnOpts) => {
    const child = cpSpawn(spawnOpts.command, spawnOpts.args, {
      cwd: spawnOpts.cwd, env: spawnOpts.env, stdio: ['pipe', 'pipe', 'pipe'],
    });
    // Echo the child's stderr, which otherwise only surfaces as an exit code.
    child.stderr.on('data', (d) => console.error('[stderr]:', d.toString()));
    return child;
  },
};

// Then pass opts as `options` in the query({ prompt, options }) call.

stderr shows the nested-session error verbatim.

Fix

Stop setting `CLAUDECODE: '1'` explicitly, and strip any inherited CLAUDECODE from the env before passing it to the SDK / node-pty:

const env: Record<string, string> = {
  ...(process.env as Record<string, string>),
  ...options.environmentVariables,
};
delete env.CLAUDECODE;

This needs to happen in both `headless.ts` and `interactive.ts`. Stoneforge IS the parent; the spawned claude is a fresh top-level session.

If #32's Windows-specific symptoms reappear without CLAUDECODE, gate the inclusion on `process.platform === 'win32'`. (I have not retested #32's Windows scenario; if a Windows contributor can confirm whether the bare strip is sufficient or platform-gating is needed, that would be valuable.)
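The strip and the optional Windows gate could live in one shared helper so that `headless.ts` and `interactive.ts` stay in sync. The following is a minimal sketch under stated assumptions: `buildChildEnv`, its parameters, and the `gateOnWindows` flag are hypothetical names for illustration, not existing stoneforge code, and whether the gate is needed at all is still unverified.

```typescript
// Hypothetical helper: name, signature, and gateOnWindows flag are illustrative only.
type ParentEnv = Record<string, string | undefined>;

function buildChildEnv(
  parentEnv: ParentEnv,
  overrides: Record<string, string>,
  platform: string,
  gateOnWindows: boolean = false,
): Record<string, string> {
  const env: Record<string, string> = {};
  // Merge parent env and overrides, skipping undefined entries
  // (process.env values are typed as string | undefined).
  for (const [key, value] of Object.entries({ ...parentEnv, ...overrides })) {
    if (value !== undefined) env[key] = value;
  }
  // Drop the marker so the spawned claude does not treat itself as nested.
  delete env.CLAUDECODE;
  // Assumed fallback: restore it on Windows only if #32's symptoms reappear.
  if (gateOnWindows && platform === 'win32') env.CLAUDECODE = '1';
  return env;
}
```

A caller would pass `process.env`, `options.environmentVariables`, and `process.platform`. Because the delete runs after the merge, a CLAUDECODE value supplied through the overrides is removed as well.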

Why this matters

This blocks every dispatch operation for users running stoneforge from inside Claude Code. The director use case described in the README ("orchestrate workers from a Claude Code session") cannot work without this fix on macOS/Linux. The cryptic error makes it nearly impossible to diagnose without instrumentation.

With #102 (spawner exit diagnostic enrichment) in place, the spawner's exit message would have surfaced the underlying claude error, turning this from a multi-hour investigation into a five-minute one.
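As a rough illustration of the kind of enrichment #102 asks for, the spawner could keep a bounded tail of the child's stderr and append it to the exit message. This is only a sketch: `StderrTail`, `describeExit`, and the message layout are hypothetical, and #102 may settle on a different design.

```typescript
// Hypothetical sketch: class and method names are illustrative, not stoneforge code.
class StderrTail {
  private text = '';
  private readonly maxChars: number;

  constructor(maxChars: number = 2000) {
    this.maxChars = maxChars;
  }

  // Append a chunk, keeping only the last maxChars characters.
  push(chunk: string): void {
    this.text = (this.text + chunk).slice(-this.maxChars);
  }

  // Build an exit message that includes the captured stderr, if there is any.
  describeExit(exitCode: number | null): string {
    const base = `Session exited before init (exit code ${exitCode ?? 'unknown'})`;
    const tail = this.text.trim();
    return tail ? `${base}\n--- claude stderr ---\n${tail}` : base;
  }
}
```

Feeding it from `child.stderr.on('data', (d) => tail.push(d.toString()))` and calling `describeExit` when the process exits would put the nested-session error directly in the failure message.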

Environment

  • stoneforge: master @ 0a7052a
  • macOS Darwin 25.3.0
  • Claude Code 2.1.123
  • @anthropic-ai/claude-agent-sdk 0.2.45
  • Reproduced 100% of the time when invoked from any Claude Code shell

I have a fix on branch `komoreka:fix/claudecode-nesting-spawn`. Happy to open a PR if there is interest. References #32, #102.
