Skip to content

CRITICAL BUGFIX (daemon,app): Session spawning via tmux always fails — PID mismatch timeout, hidden errors, orphaned windows (#272)#790

Open
ahundt wants to merge 14 commits intoslopus:mainfrom
ahundt:fix/tmux-session-wizard-bugs
Open

CRITICAL BUGFIX (daemon,app): Session spawning via tmux always fails — PID mismatch timeout, hidden errors, orphaned windows (#272)#790
ahundt wants to merge 14 commits intoslopus:mainfrom
ahundt:fix/tmux-session-wizard-bugs

Conversation

@ahundt
Copy link
Contributor

@ahundt ahundt commented Feb 28, 2026

Problem

Tmux session spawning completely fails: sessions timeout during creation, error messages are hidden from users, and orphaned windows accumulate during cleanup.

Context: PR #272 added the new session wizard GUI with tmux session integration for the mobile app. PR #487 was intended to resolve #272's issues, but expanded scope significantly. When #487 scope diverged from the original discussion, development continued as a separate fork (https://github.com/happier-dev/happier). This PR provides the minimal functionality fixes needed for #272 to work reliably in the main project, focusing on root causes (PID mismatch timeout, error suppression, orphaned window cleanup, async safety, multi-flavor inconsistencies) rather than architectural re-scoping.

Root Causes & Fixes

1. PID mismatch timeout (15s always fires)

  • Daemon tracked shell PID from tmux, but spawned node process reported its own PID in webhook
  • Solution: Use HAPPY_SPAWN_TOKEN (random token) instead — passed via env var, reported in webhook, matched by daemon

2. Hidden daemon error messages

  • GUI threw generic "Session spawning failed" instead of showing daemon's actual error (e.g., missing Z_AI_AUTH_TOKEN)
  • Solution: Propagate daemon error message to GUI modal

3. Orphaned tmux windows

  • stopSession only killed the shell process, never closed the tmux window
  • Solution: Call killWindow() when tmux session exists. Check if process still alive with process.kill(pid, 0) — if Claude running, kill window; if user already exited, leave window alone (becomes independent terminal)

4. Silent Codex auth failure

  • fs.writeFile() was not awaited before spawning process
  • Solution: Add await

5. Timeout could hang daemon

  • execFile timeout option doesn't reject Promise
  • Solution: Explicit timeout with process.kill() and rejection

6. Inconsistent multi-flavor support

  • Tmux path used ternary, non-tmux path used switch for agent selection → divergence risk
  • Gemini auth was wrong (fell through to Claude path)
  • Solution: Unified AGENT_COMMAND_MAP/getAgentCommand() utility; proper Gemini auth

Secondary Improvements

  • Shell detection via #{pane_current_command} (works with custom prompts, not just regex)
  • Tmux -t flag insertion: moved from end (broke display-message -p FORMAT -t session) to after command name
  • Session auto-detection from daemon's own $TMUX env var
  • Absolute process.execPath (fixes NVM/asdf/nix PATH issues)
  • Quoted paths in send-keys (handles spaces)
  • Background window creation with -d (no focus stealing)

No Regressions

  • All 450 existing tests pass
  • New tests added for token matching, shell detection, multi-flavor spawning
  • Linux CI: ✅ PASS (both Node 20 and 24)
  • Windows CI: Pre-existing infrastructure issue, fixed here by cherry picking fix(ci): fix Windows smoke test build failures #792

…s to user

Three bugs fixed:

1. Tmux sessions always timed out (critical):
   - spawnInTmux returns #{pane_pid} = shell PID; spawned node reports
     process.pid = different PID; webhook lookup by hostPid never matched.
   - Fix: generate HAPPY_SPAWN_TOKEN (random hex) before tmux spawn, pass
     it in tmuxEnv, store token→session in tokenToTrackedSession map.
     createSessionMetadata.ts emits spawnToken from process.env.
     onHappySessionWebhook resolves by token first, PID as fallback.
   - Files: run.ts, types.ts, createSessionMetadata.ts

2. New session wizard silently swallowed all daemon errors:
   - else branch at line 1065 threw a generic Error, discarding
     result.errorMessage. Users saw "make sure daemon is running"
     for all failures including z.ai missing Z_AI_AUTH_TOKEN.
   - Fix: check result.type === 'error' and throw result.errorMessage;
     catch block falls through to show actual message when not a
     known network pattern.
   - File: new/index.tsx

3. stopSession left orphaned tmux windows after stop:
   - childProcess is undefined for tmux sessions so kill fell through
     to process.kill(shellPid) without closing the tmux window.
   - Fix: check session.tmuxSessionId first, call tmux.killWindow()
     fire-and-forget to keep stopSession synchronous.
   - File: run.ts

Also carries over prior branch fixes:
- runAcp.ts: mode.description null→undefined for formatOptionalDetail
- run.ts: strip CLAUDECODE env before spawning to prevent nested session error
TEMPORARY HACK: Always spawn new sessions into the 'main' tmux session
instead of using the profile's TMUX_SESSION_NAME setting. This makes
tmux integration testing straightforward.

TODO: Remove this hardcoding and use profile.tmuxConfig.sessionName
after testing is complete.
…s to user

Three bugs fixed:

1. Tmux sessions always timed out (critical):
   - spawnInTmux returns #{pane_pid} = shell PID; spawned node reports
     process.pid = different PID; webhook lookup by hostPid never matched.
   - Fix: generate HAPPY_SPAWN_TOKEN (random hex) before tmux spawn, pass
     it in tmuxEnv, store token→session in tokenToTrackedSession map.
     createSessionMetadata.ts emits spawnToken from process.env.
     onHappySessionWebhook resolves by token first, PID as fallback.
   - Files: run.ts, types.ts, createSessionMetadata.ts

2. New session wizard silently swallowed all daemon errors:
   - else branch at line 1065 threw a generic Error, discarding
     result.errorMessage. Users saw "make sure daemon is running"
     for all failures including z.ai missing Z_AI_AUTH_TOKEN.
   - Fix: check result.type === 'error' and throw result.errorMessage;
     catch block falls through to show actual message when not a
     known network pattern.
   - File: new/index.tsx

3. stopSession didn't kill tmux windows (orphaned windows):
   - childProcess undefined for tmux sessions so kill fell through
     to process.kill(shellPid) without closing the tmux window.
   - Fix: check session.tmuxSessionId first, call tmux.killWindow()
     fire-and-forget to keep stopSession synchronous.
   - File: run.ts

Also fixed tmux command syntax:
- move -P -F flags before other arguments in new-window command
- File: tmux.ts
Three root causes prevented daemon-spawned tmux sessions from working:

1. Env var quoting corruption: spawn() with shell:false passes args
   directly to tmux, so wrapping -e values in double quotes made them
   literal (HOME="/Users/foo" instead of HOME=/Users/foo). This broke
   credential file resolution, causing the spawned process to prompt
   for interactive authentication instead of using existing keys.

2. Shell redirect failure: zsh noclobber prevents >> on non-existent
   files. Changed to > since these are always fresh temp files.

3. Missing spawnToken in Claude path: runClaude.ts built its own
   Metadata object without reading HAPPY_SPAWN_TOKEN from env,
   so the daemon's token-based webhook matching never resolved.

Additional reliability improvements:
- Replace fixed 200ms delay with waitForShellReady() that polls
  capture-pane for a prompt character (up to 5s)
- Split send-keys into literal text (-l flag) + separate Enter key
  to prevent tmux from interpreting >> as key names
- Add debug logging for send-keys results and pane verification
Remove temporary debugging code that was used during development:
- Remove temp log file redirect (output now visible in tmux pane)
- Remove timeout handler file reader (no temp file to read)
- Remove send-keys result logging and pane verification capture
- Remove unnecessary 50ms/200ms delays between send-keys calls
- Revert timeout callback from async to sync (no longer reads files)
…etection

Replace capture-pane regex matching against prompt characters (%$#>±❯→)
with tmux's #{pane_current_command} format variable. Poll until it reports
a known shell name (zsh, bash, fish, etc.), which is deterministic and
works with any custom prompt theme (Prezto, Oh My Zsh, Starship, Powerlevel10k).
- Use process.execPath instead of bare `node` in tmux send-keys command
  to avoid PATH issues on systems using NVM, asdf, or other version managers
  where node isn't available until .zshrc finishes sourcing

- Fix killWindow to use executeTmuxCommand's window parameter for proper
  `-t session:window` targeting instead of passing window name as a
  positional argument (which tmux ignores)

- Pass parsed session name to getTmuxUtilities in stopSession to ensure
  the correct tmux session is targeted

- Set CLAUDECODE='' in tmux env to prevent spawned Claude Code from
  refusing to start with "cannot be launched inside another session"
  (CLAUDECODE is set by Claude Code to detect nesting; the tmux server
  may inherit it even if we filter it from the -e flags)

- Add nushell (nu), elvish, pwsh to knownShells for waitForShellReady
- Add -d flag to new-window so spawned sessions don't switch the user's
  active tmux window away from what they're working on

- Add tmux version check (>= 3.0) before spawning, since the -e flag
  for per-window environment variables was added in tmux 3.0. Fails with
  a clear error message instead of a cryptic tmux parse error on older
  versions.
Quote process.execPath and cliPath so paths with spaces work.
Point directly at dist/index.mjs to skip bin/happy.mjs re-exec.
Re-add --no-warnings --no-deprecation flags.
Fix stale comment in waitForShellReady.
…x spawn

The non-tmux daemon spawn path does not set this flag either. Remote mode
sessions handle permissions through the mobile app RPC flow, not terminal
prompts, so forcing bypass is unnecessary and should be left to the user
profile or mobile app selection.
Resolve session with priority:
1. Profile TMUX_SESSION_NAME env var (explicit user choice)
2. Daemon's own tmux session (if daemon runs inside tmux)
3. Session with the most windows (heuristic for main workspace)
4. spawnInTmux default (first existing session or create 'happy')

Non-tmux fallback path is unchanged: if tmux is unavailable or spawn
fails, regular detached process spawning is used.
executeTmuxCommand appended -t target AFTER all args, which broke
commands with positional arguments (e.g., display-message -p FORMAT).
tmux rejects: `display-message -p #{fmt} -t target` ("too many arguments")
but accepts: `display-message -t target -p #{fmt}`.

This silently broke waitForShellReady — every spawn wasted 5s on a
timeout because display-message with window target always returned
exit code 1. Now shell detection is instant (~0ms).

Also fix resolveTmuxSessionName priority 2 (daemon's own tmux session):
used executeTmuxCommand which forced -t happy, querying the wrong
session. Now uses execFile directly without -t so tmux reads $TMUX.

Tests added:
- 8 unit tests for tmux version string regex parsing
- 8 unit tests for session list parsing (resolveTmuxSessionName logic)
- 3 unit tests for spawnToken in createSessionMetadata
- 7 integration tests exercising real tmux: -d flag, env vars, killWindow,
  pane_current_command detection, PID extraction, CLAUDECODE='', paths
Four critical improvements to ensure proper behavior across all flavors (claude, gemini, codex):

A. Add missing await on Codex fs.writeFile (line 367)
   - Codex auth.json file was not being awaited before process spawn
   - This caused silent failures in Codex session spawning

B. Fix execFile Promise timeout handling (lines 83-97)
   - tmux session resolution could hang indefinitely if tmux didn't respond
   - Added explicit timeout rejection with process.kill on timeout
   - Used pattern already in codebase for process health checks

C. Unify agent command mapping with data-driven approach (lines 28-48)
   - Replace inconsistent ternary (tmux) and switch (non-tmux) patterns
   - Add AGENT_COMMAND_MAP constant and getAgentCommand() utility
   - Use same logic for both tmux and non-tmux spawn paths

D. Complete Gemini authentication support (line 372-374)
   - Was falling through to Claude path (CLAUDE_CODE_OAUTH_TOKEN)
   - Add proper Gemini auth handling (GOOGLE_API_KEY)
   - Now handles all three flavors: claude, codex, gemini

Additional improvements:
- Smart window cleanup in stopSession (lines 745-766)
  - Check if process still alive via process.kill(pid, 0) before killing window
  - Only kill window if Claude still running, leave alone if user exited
  - Prevents loss of user's terminal after they quit the CLI

All tests pass (450 pass, 1 pre-existing failure unrelated to these changes)
Windows CI fails with:
  Cannot find module 'D:\a\happy\happy\node_modules\node_modules\typescript\bin\tsc'

Root cause: npx on Windows resolves tsc via a double-nested node_modules path in
Yarn 1.22 workspace context. The postinstall script also runs build scripts from
the wrong directory on Windows (Yarn 1.22 bug, GitHub Issue #6175).

Fix:
- packages/happy-wire/package.json: use 'tsc' directly instead of 'npx tsc'
  (matches existing 'typecheck' script which already works on all platforms)
- packages/happy-cli/package.json: same change for consistency
- .github/workflows/cli-smoke-test.yml (Windows job only):
  - SKIP_HAPPY_WIRE_BUILD=1 during install to skip broken postinstall on Windows
  - explicit 'cd packages/happy-wire && yarn build' step after install so
    happy-coder build can find happy-wire's dist files
@ahundt ahundt force-pushed the fix/tmux-session-wizard-bugs branch from 58c01a6 to e79aa55 Compare March 1, 2026 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant