diff --git a/PLAN.md b/PLAN.md new file mode 100644 index 00000000..84558f2d --- /dev/null +++ b/PLAN.md @@ -0,0 +1,554 @@ +# Apra Fleet -- OS Service Lifecycle Implementation Plan + +> Make apra-fleet behave like a normal OS service: top-level start/stop/restart/status +> verbs, per-user service registration folded into install/uninstall, and cross-platform +> support for Windows (schtasks), Linux (systemd --user), and macOS (launchd LaunchAgent) +> -- all without elevation or admin rights. Extends PR #273 on feat/mcp-sse-transport. + +--- + +## Design Summary + +The implementation adds three new layers to the existing codebase: + +1. **Service Manager Adapter** (`src/services/service-manager/`) -- a single TypeScript + interface (`ServiceManager`) with three OS-specific implementations (Windows schtasks, + Linux systemd, macOS launchd). The factory selects the right adapter at runtime via + `process.platform`. All adapters operate in per-user scope with no elevation. + +2. **Graceful Shutdown Endpoint** -- a `POST /shutdown` handler on the existing HTTP + server (localhost-only, same trust boundary as `/mcp`). This enables cross-platform + graceful stop without relying on OS signal semantics (Windows cannot send SIGTERM to + external processes). + +3. **CLI Verbs** (`src/cli/start.ts`, `stop.ts`, `restart.ts`, `status.ts`) -- thin + command modules wired into the existing dispatch table in `src/index.ts`. Each verb + is idempotent. `start` goes through the service manager when a unit is installed, + otherwise spawns the process directly. `stop` always uses HTTP POST /shutdown directly + -- it is service-agnostic (stopping the process is the same whether a service unit + exists or not) and never routes through the adapter. `status` queries both + server.json/health and the service manager. + +**Stop call path (unified):** The CLI `stop` verb, and all internal stop flows +(adapter unregister, uninstall cleanup), use the same mechanism: read server.json for +the server URL, POST /shutdown to trigger graceful exit (code 0), poll pid for up to 5s, +fallback force-kill. Because the process exits cleanly (code 0), service managers +configured with Restart=on-failure (systemd) and KeepAlive.SuccessfulExit=false (launchd) +will NOT auto-restart. The adapter interface includes a `stop()` method for use within +`unregister()` and for completeness, but the CLI stop verb bypasses the adapter -- it +calls the POST /shutdown flow directly since the mechanism is identical regardless of +whether a service unit is installed. + +**Service unit configuration by OS:** +- **Windows:** Per-user Scheduled Task "ApraFleet" with at-logon trigger, /rl limited + (no elevation). A wrapper .bat script in BIN_DIR handles stdout/stderr redirection to + the log file (schtasks cannot redirect output natively). +- **Linux:** systemd user unit at `~/.config/systemd/user/apra-fleet.service`. + Type=simple, Restart=on-failure, StandardOutput/StandardError=append:logPath. + loginctl enable-linger attempted with a warning on failure. +- **macOS:** LaunchAgent plist at `~/Library/LaunchAgents/com.apra-fleet.server.plist`. + RunAtLoad=true, KeepAlive.SuccessfulExit=false (restart on crash, not on clean exit). + StandardOutPath/StandardErrorPath point to the log file. + +**Graceful stop mechanism:** The server's existing SIGINT/SIGTERM handlers exit with +code 0 after cleaning up server.json, lock file, and connections. Service managers +configured with Restart=on-failure (systemd) and KeepAlive.SuccessfulExit=false (launchd) +will NOT restart the process after a clean exit. This means the CLI `stop` command +(which triggers a clean exit via /shutdown) is compatible with managed services. + +--- + +## Verb x OS Matrix + +The table below defines the exact behavior for every verb on every OS, both when a +service unit is installed and when it is not. No "and similarly for X" -- each cell is +explicit. + +### start + +| OS | Service Installed | No Service Installed | +|---------|------------------------------------------------------------|---------------------------------------------------------| +| Windows | `schtasks /run /tn "ApraFleet"` | Spawn detached: `apra-fleet.exe --transport http` | +| Linux | `systemctl --user start apra-fleet` | Spawn detached: `apra-fleet --transport http` | +| macOS | `launchctl kickstart gui//com.apra-fleet.server` | Spawn detached: `apra-fleet --transport http` | + +All: Idempotent -- checkRunningInstance() first; if already running, report and exit 0. +When spawning directly, stdout/stderr redirect to ~/.apra-fleet/data/fleet.log. + +### stop + +| OS | Behavior | +|---------|---------------------------------------------------------------------------------------| +| Windows | Read server.json -> POST /shutdown -> wait up to 5s for exit -> fallback taskkill /F | +| Linux | Read server.json -> POST /shutdown -> wait up to 5s for exit -> fallback kill -TERM | +| macOS | Read server.json -> POST /shutdown -> wait up to 5s for exit -> fallback kill -TERM | + +All: Idempotent -- if not running (server.json missing or pid dead), report and exit 0. +Clean up stale server.json and lock file if found. The HTTP /shutdown approach is used +on all OSes for consistency; service managers do not restart because the process exits 0. + +### restart + +| OS | Behavior | +|---------|-----------------------| +| Windows | stop (above) then start (above) | +| Linux | stop (above) then start (above) | +| macOS | stop (above) then start (above) | + +### status + +| OS | Behavior | +|---------|---------------------------------------------------------------------------------------| +| Windows | server.json + GET /health + `schtasks /query /tn "ApraFleet" /fo csv /nh` | +| Linux | server.json + GET /health + `systemctl --user is-active` + `is-enabled` | +| macOS | server.json + GET /health + `launchctl print gui//com.apra-fleet.server` | + +All: Works whether or not service unit is installed. Reports: running/stopped, pid, port, +url, version, uptime, active sessions, service unit state (installed/not, enabled/not). + +### install (extended) + +| OS | Additional Steps (after existing install) | +|---------|---------------------------------------------------------------------------------------| +| Windows | Write wrapper.bat to BIN_DIR. `schtasks /create /tn "ApraFleet" /tr "" /sc onlogon /rl limited /f`. `schtasks /run /tn "ApraFleet"`. | +| Linux | Write unit file to ~/.config/systemd/user/apra-fleet.service. `systemctl --user daemon-reload`. `systemctl --user enable apra-fleet`. `systemctl --user start apra-fleet`. Attempt `loginctl enable-linger $USER` (warn on failure). | +| macOS | Write plist to ~/Library/LaunchAgents/com.apra-fleet.server.plist. `launchctl bootout gui//com.apra-fleet.server` (tolerate "not loaded" error). Then `launchctl bootstrap gui/ `. | + +All: Only when --transport http (default). Skipped for --transport stdio. Skipped in +dev mode (non-SEA). Server is running immediately after install. + +### uninstall (extended) + +| OS | Additional Steps (before existing uninstall) | +|---------|---------------------------------------------------------------------------------------| +| Windows | POST /shutdown (graceful stop). `schtasks /delete /tn "ApraFleet" /f`. Remove wrapper.bat. | +| Linux | `systemctl --user stop apra-fleet`. `systemctl --user disable apra-fleet`. Remove unit file. `systemctl --user daemon-reload`. | +| macOS | `launchctl bootout gui//com.apra-fleet.server`. Remove plist file. | + +All: Idempotent -- each step tolerates "not found" errors. Replaces the existing +isApraFleetRunning/killApraFleet approach with graceful /shutdown + service cleanup. + +--- + +## Tasks + +### Phase 1: Platform Service Foundation + +Front-loads the two riskiest assumptions: (a) per-user service management without +elevation on all three OSes, (b) cross-platform graceful stop. If schtasks/systemctl/ +launchctl cannot be called without elevation, this phase fails immediately -- before +any CLI verb or install integration work is done. + +#### Task 1: Shutdown endpoint and service constants + +- **Change:** Add a POST /shutdown endpoint to the HTTP server in http-transport.ts. + When hit, send a 200 JSON response (`{ "status": "shutting-down" }`), then trigger + graceful shutdown after a 100ms delay by emitting the process SIGINT event (which + fires the existing shutdown handler chain in index.ts). Add LOG_FILE_PATH constant + to paths.ts (`~/.apra-fleet/data/fleet.log`). Create the service-manager types file + with service name constants: WINDOWS_TASK_NAME="ApraFleet", + LINUX_UNIT_NAME="apra-fleet.service", + MACOS_PLIST_LABEL="com.apra-fleet.server". +- **Files:** src/services/http-transport.ts, src/paths.ts, + src/services/service-manager/types.ts (new) +- **Tier:** cheap +- **Done when:** POST to /shutdown triggers clean server shutdown (server.json deleted, + lock released, process exits 0). LOG_FILE_PATH and service name constants exported. +- **Blockers:** None -- builds on existing HTTP handler infrastructure. + +#### Task 2: ServiceManager interface and factory + +- **Change:** Define the ServiceManager interface with methods: register(binaryPath, + args, logPath), unregister(), start(), stop(), query() returning ServiceStatus, + isInstalled() returning boolean. ServiceStatus includes fields: installed, running, + pid (optional), enabled (optional). Create a factory function getServiceManager() + that returns the correct adapter based on process.platform ('win32' -> Windows, + 'linux' -> Linux, 'darwin' -> macOS). For unsupported platforms, return a no-op stub + that logs a warning and returns safe defaults (installed=false, running=false). +- **Files:** src/services/service-manager/types.ts (extend), + src/services/service-manager/index.ts (new) +- **Tier:** standard +- **Done when:** Interface compiles. Factory returns per-platform implementation. Stub + adapter returns safe defaults without throwing on unsupported platforms. +- **Blockers:** None. + +#### Task 3: Windows Scheduled Task adapter + +- **Change:** Implement WindowsServiceManager class. + - register(binaryPath, args, logPath): Write a wrapper batch script to BIN_DIR + (`apra-fleet-service.bat`) that runs the binary with args and redirects + stdout/stderr to logPath. Create a per-user Scheduled Task via + `schtasks /create /tn "ApraFleet" /tr "" /sc onlogon /rl limited /f`. + No elevation required for per-user tasks. + - unregister(): `schtasks /delete /tn "ApraFleet" /f`. Remove wrapper script. + Tolerate "task not found" error. + - start(): `schtasks /run /tn "ApraFleet"`. + - stop(): Read server.json for URL. POST /shutdown. Wait up to 5s for process exit + (poll pid). Fallback: `taskkill /F /PID `. + - query(): Parse `schtasks /query /tn "ApraFleet" /fo csv /nh` output. Extract + status (Running/Ready/Disabled) and combine with server.json data. + - isInstalled(): Run `schtasks /query /tn "ApraFleet"` -- success means installed. +- **Files:** src/services/service-manager/windows.ts (new) +- **Tier:** standard +- **Done when:** All methods implemented. No UAC prompt triggered. Commands use + child_process.execFile (not shell) where possible for safety. +- **Blockers:** "Log on as a batch job" right -- may be restricted on domain-joined + machines. See risk register. + +#### Task 4: Linux systemd user unit adapter + +- **Change:** Implement LinuxServiceManager class. + - register(binaryPath, args, logPath): Write a systemd user unit file to + `~/.config/systemd/user/apra-fleet.service` with [Unit] Description, [Service] + Type=simple, ExecStart= , Restart=on-failure, + StandardOutput=append:, StandardError=append:, [Install] + WantedBy=default.target. Run `systemctl --user daemon-reload` then + `systemctl --user enable apra-fleet`. Attempt `loginctl enable-linger $USER` and + warn (not error) if it fails. + - unregister(): First stop the server via POST /shutdown (same as stop() above). + Then `systemctl --user disable apra-fleet`, + `systemctl --user stop apra-fleet` (tolerate not-running -- informs systemd the + unit is being removed), remove unit file, `systemctl --user daemon-reload`. + - start(): `systemctl --user start apra-fleet`. + - stop(): Read server.json for URL. POST /shutdown. Wait up to 5s for process exit + (poll pid). Fallback: kill -TERM . This matches the Windows and macOS adapters + and the CLI stop verb for cross-platform consistency. (systemctl --user stop would + also work since it sends SIGTERM, but POST /shutdown is preferred so all three + adapters share the same contract.) Tolerate not-running. + - query(): `systemctl --user is-active apra-fleet` (active/inactive/failed), + `systemctl --user is-enabled apra-fleet` (enabled/disabled). + - isInstalled(): Check if unit file exists at the expected path. + - Non-systemd detection: Before any operation, check for systemd user bus + (XDG_RUNTIME_DIR + /run/user//systemd). If absent, throw with clear message: + "systemd user mode is not available. Service management requires systemd." +- **Files:** src/services/service-manager/linux.ts (new) +- **Tier:** standard +- **Done when:** All methods implemented. Non-systemd systems get a clear, actionable + error. loginctl linger is attempted with a non-fatal warning on failure. +- **Blockers:** loginctl enable-linger may need root. See risk register. + +#### Task 5: macOS launchd LaunchAgent adapter + +- **Change:** Implement MacOSServiceManager class. + - register(binaryPath, args, logPath): Write a plist to + `~/Library/LaunchAgents/com.apra-fleet.server.plist` with Label, ProgramArguments + (array: [binaryPath, ...args]), RunAtLoad=true, KeepAlive with + SuccessfulExit=false, StandardOutPath=logPath, StandardErrorPath=logPath. Before + loading, call `launchctl bootout gui//com.apra-fleet.server` and tolerate + "not loaded" / "no such process" errors -- this makes register() idempotent + (launchctl bootstrap fails with "service already loaded" if called twice without + bootout). Then load via `launchctl bootstrap gui/ `. + - unregister(): `launchctl bootout gui//com.apra-fleet.server`. Remove plist. + Tolerate "not loaded" error. + - start(): `launchctl kickstart gui//com.apra-fleet.server`. + - stop(): POST /shutdown to server URL from server.json (same as Windows approach -- + clean exit 0 prevents KeepAlive restart). Wait up to 5s, fallback kill -TERM. + - query(): Parse output of `launchctl print gui//com.apra-fleet.server` for + pid and state. If command fails (not loaded), return installed=false. + - isInstalled(): Check plist file exists at expected path. + - uid retrieval: Use `id -u` or process.getuid() to get the current user's uid for + the gui/ domain specifier. +- **Files:** src/services/service-manager/macos.ts (new) +- **Tier:** standard +- **Done when:** All methods implemented. No elevation required. bootstrap/bootout + API used (available since macOS 10.10). +- **Blockers:** None significant. See risk register for macOS version note. + +#### Task 6: Service manager unit tests + +- **Change:** Write vitest tests for all three adapters. Use vi.mock to mock + child_process.execFile and child_process.execFileSync. For each adapter, test: + register (verifies correct command/args), unregister (verifies cleanup commands), + start (verifies start command), stop (verifies graceful shutdown attempt), query + (mock command output, verify parsed ServiceStatus), isInstalled (mock success/failure). + Test edge cases: already registered (idempotent register), not installed (idempotent + unregister), process not running (stop is no-op), non-systemd Linux (clear error + thrown). Use vi.hoisted for mock definitions per existing test conventions. +- **Files:** tests/service-manager.test.ts (new) +- **Tier:** standard +- **Done when:** Tests cover all adapter methods and key error paths. All pass. + Existing test suite (npm test) stays fully green. +- **Blockers:** None -- tests mock OS commands, no real services created. + +#### Task 6.5: MCP session capability logging (apra-fleet-projects-78g) + +- **Change:** In http-transport.ts, extract clientInfo and capabilities from the + initialize request body (parsedBody is already in scope in the isInitializeRequest + block). After sessions.set() in onsessioninitialized, call logLine to record: + session ID, client name, client version, client capability keys, and whether + experimental['claude/channel'] was declared. Import logLine from + utils/log-helpers.js (already used in the file). Example format: + `logLine('session', 'new sid= client=/ caps= channel=true')`. + Store clientInfo on a local variable captured in the initialize block closure so it + is available in onsessioninitialized. Also log session close with the same sid so + sessions can be correlated in logs. +- **Files:** src/services/http-transport.ts +- **Tier:** cheap +- **Done when:** Each new MCP session logs client name, version, capabilities, and + channel flag. Session close is logged. Existing tests pass. ASCII-only. +- **Blockers:** None. + +#### VERIFY: Platform Service Foundation +- Run full test suite (npm test) +- Confirm all Phase 1 changes compile cleanly +- Confirm no regressions in existing tests +- Report: tests passing, adapter coverage, any issues + +--- + +### Phase 2: CLI Verbs + +Build the four new top-level commands. Each is a thin module in src/cli/ wired into +the dispatch table in src/index.ts. + +#### Task 7: start and stop commands + +- **Change:** Create src/cli/start.ts with exported runStart(args). Logic: + (1) checkRunningInstance() -- if running, log "Server already running at + pid=" and exit 0 (idempotent). (2) Get service manager via getServiceManager(). + If service is installed, call serviceManager.start(). (3) If no service installed, + spawn the binary in detached mode with stdout/stderr redirected to LOG_FILE_PATH. + **Binary path resolution:** In SEA mode, the binary is at the stable installed path + (`BIN_DIR + 'apra-fleet'` or `'apra-fleet.exe'` from src/cli/config.ts). In dev mode + (non-SEA), the command is `process.execPath` (the Node.js binary) with args + `[path.join(findProjectRoot(), 'dist', 'index.js'), '--transport', 'http']` -- using + the same `findProjectRoot()` function from src/cli/install.ts that walks up from + __dirname looking for version.json. Import `findProjectRoot` from install.ts (it is + already exported) or extract it to a shared util. Both modes append + `['--transport', 'http']` to the args. Wait 2s then verify server started via + checkRunningInstance. Report success or failure. + Create src/cli/stop.ts with exported runStop(args). Logic: (1) checkRunningInstance() + -- if not running, log "Server is not running." and exit 0 (idempotent). (2) Read URL + from server.json. POST /shutdown to the URL. The stop verb does NOT go through the + service manager adapter -- stopping the running process is service-agnostic. (3) Poll + pid alive every 500ms for up to 5s. (4) If process still alive after timeout (or if + /shutdown returned an error -- e.g. the running binary predates the /shutdown endpoint), + force kill: process.kill(pid, 'SIGTERM') on Unix, taskkill /F /PID on Windows. + (5) Clean up stale server.json and lock file. Report "Server stopped." + Note on version skew: if an older binary without the /shutdown endpoint is running, the + POST will fail (404 or connection error). The fallback force-kill path handles this + correctly -- the 5s poll detects the process is still alive and proceeds to kill it. + Wire both commands into src/index.ts dispatch: `arg === 'start'` and `arg === 'stop'` + with dynamic imports, same pattern as existing install/uninstall/secret/auth dispatch. +- **Files:** src/cli/start.ts (new), src/cli/stop.ts (new), src/index.ts +- **Tier:** cheap +- **Done when:** `apra-fleet start` starts the server (or reports already running). + `apra-fleet stop` stops the server gracefully (or reports not running). Both are + idempotent with exit code 0. +- **Blockers:** Depends on Phase 1 (service manager, /shutdown endpoint). + +#### Task 8: restart command + +- **Change:** Create src/cli/restart.ts with exported runRestart(args). Import and call + runStop(args) then runStart(args). Wire into src/index.ts dispatch table. +- **Files:** src/cli/restart.ts (new), src/index.ts +- **Tier:** cheap +- **Done when:** `apra-fleet restart` stops then starts the server. Works whether or not + the server was running (stop is idempotent). +- **Blockers:** Depends on T7. + +#### Task 9: status command + +- **Change:** Create src/cli/status.ts with exported runStatus(args). Logic: + (1) Read server.json -- if present and pid alive, GET /health to obtain version, + uptime, sessions, port, url. (2) Query service manager via getServiceManager().query() + for unit state (installed, enabled, running from service perspective). (3) Format + output: + ``` + apra-fleet status + State: running | stopped + PID: + Port: + URL: + Version: + Uptime: + Sessions: + Service: installed (enabled) | installed (disabled) | not installed + ``` + If server is not running, show "State: stopped" and omit pid/port/url/uptime/sessions. + Service line always shown regardless of server state. + Wire into src/index.ts dispatch table. +- **Files:** src/cli/status.ts (new), src/index.ts +- **Tier:** standard +- **Done when:** `apra-fleet status` shows all required fields. Works correctly whether + server is running or not, and whether service unit is installed or not. +- **Blockers:** Depends on Phase 1 (service manager query). + +#### Task 10: CLI verb tests and --help update + +- **Change:** Update the --help output in src/index.ts to include the four new verbs: + ``` + apra-fleet start Start the fleet server + apra-fleet stop Stop the fleet server + apra-fleet restart Restart the fleet server + apra-fleet status Show server and service status + ``` + Write tests in tests/cli-verbs.test.ts covering: start when already running (idempotent), + start when not running (spawns process or uses service manager), stop when running + (sends /shutdown), stop when not running (idempotent), restart (stop then start), + status with running server (full output), status with stopped server (partial output), + status with/without service installed. Mock checkRunningInstance, HTTP calls, service + manager, and child_process.spawn. +- **Files:** src/index.ts, tests/cli-verbs.test.ts (new) +- **Tier:** standard +- **Done when:** --help lists all verbs. Tests cover all verb logic and edge cases. All + tests pass. Pre-commit ASCII hook passes. +- **Blockers:** None. + +#### VERIFY: CLI Verbs +- Run full test suite (npm test) +- Verify `apra-fleet --help` includes all new verbs +- Confirm no regressions in existing tests +- Report: tests passing, verb behavior verified + +--- + +### Phase 3: Install/Uninstall Integration + +Wire the service manager adapter into the existing install and uninstall commands. The +existing install steps (binary, hooks, scripts, settings, MCP, skills) are unchanged; +service registration is additive. For uninstall, service removal is prepended. + +#### Task 11: Extend install to register and start service + +- **Change:** In src/cli/install.ts, after the existing final step (Beads tracker + install + permissions + install-config.json), add a new step: + (1) If transport === 'http' and isSea() (installed binary exists), call + serviceManager.register(binaryPath, ['--transport', 'http'], LOG_FILE_PATH) then + serviceManager.start(). (2) If transport === 'stdio', skip service registration (stdio + transport is per-client, not a persistent service). (3) In dev mode (!isSea()), skip + service registration but optionally start the server directly. + Update the install output to include a "Service: registered and running" line. + Update totalSteps calculation to include the new step. +- **Files:** src/cli/install.ts +- **Tier:** standard +- **Done when:** `apra-fleet install` registers the per-user service and the server is + running immediately afterward. A fresh MCP client connects without any manual step. + The existing install behavior (binary, hooks, settings, MCP, skills) is unchanged. +- **Blockers:** Service manager adapter must be complete (Phase 1). + +#### Task 12: Extend uninstall to stop and remove service + +- **Change:** In src/cli/uninstall.ts, before the existing provider cleanup loop, add: + (1) If server is running, stop it gracefully via POST /shutdown (replacing the existing + isApraFleetRunning/killApraFleet approach -- which does a hard kill -- with the graceful + /shutdown endpoint). Wait for exit. (2) Call serviceManager.unregister() to remove the + service unit on the current OS. Tolerate "not installed" (idempotent). + The existing cleanup steps (settings cleanup, skill removal, binary removal) remain + unchanged. The --force flag triggers the graceful /shutdown approach instead of the + old killApraFleet hard kill. +- **Files:** src/cli/uninstall.ts +- **Tier:** standard +- **Done when:** `apra-fleet uninstall` stops the server gracefully, removes the service + unit, and removes MCP config. No orphaned service units, plist files, or scheduled + tasks remain. +- **Blockers:** Depends on T11 (service registration during install). + +#### Task 13: Install/uninstall service integration tests + +- **Change:** Add or extend tests covering: (1) install with HTTP transport calls + serviceManager.register and start, (2) install with stdio transport skips service + registration, (3) install in dev mode skips service registration, (4) uninstall calls + graceful /shutdown and serviceManager.unregister, (5) uninstall with no service + installed is idempotent (unregister tolerates "not found"), (6) uninstall with server + not running skips /shutdown (idempotent). Mock service manager and HTTP calls. +- **Files:** tests/install.test.ts (extend or new), tests/uninstall.test.ts (new) +- **Tier:** standard +- **Done when:** Tests verify service lifecycle during install/uninstall. Existing + install tests remain unchanged and passing. +- **Blockers:** None. + +#### VERIFY: Install/Uninstall Integration +- Run full test suite (npm test) +- Confirm install registers service and server starts +- Confirm uninstall removes service cleanly with no orphans +- Report: tests passing, no regressions + +--- + +### Phase 4: Documentation + +#### Task 14: Update README with service model and verbs + +- **Change:** Add a "Service Management" section to README.md documenting: + - The four new verbs: start, stop, restart, status (with usage examples) + - Automatic service registration during install (per-user, no elevation) + - Per-OS mechanisms at a glance (schtasks, systemd, launchd) + - Log file location (~/.apra-fleet/data/fleet.log) + - Troubleshooting: how to check logs, restart after issues, verify service state + Update the existing command reference table to include the new verbs. Update the + install/uninstall sections to mention service registration/removal. +- **Files:** README.md +- **Tier:** cheap +- **Done when:** README documents all service verbs and behavior. ASCII-only. +- **Blockers:** None. + +#### Task 15: Update architecture docs with service manager adapter + +- **Change:** Add a "Service Manager" section to docs/architecture.md documenting: + - The adapter pattern: ServiceManager interface + per-OS implementations + - How install/uninstall interact with the service manager + - The /shutdown endpoint and why it exists (cross-platform graceful stop) + - The verb -> adapter -> OS command flow + - How the singleton lifecycle interacts with service management (startup lock, + server.json, clean exit preventing auto-restart) + Update the existing "Singleton lifecycle" paragraph to reference service management. + Update the ASCII diagram to show the service manager layer. +- **Files:** docs/architecture.md +- **Tier:** cheap +- **Done when:** Architecture docs explain the service manager design. ASCII-only. +- **Blockers:** None. + +#### VERIFY: Documentation +- Confirm ASCII-only in all doc files (pre-commit hook) +- Confirm docs accurately reflect the planned implementation +- Report: docs updated, hook passes + +--- + +## Risk Register + +| Risk | Impact | Mitigation | +|------|--------|------------| +| Windows schtasks /end is TerminateProcess (hard kill, not SIGTERM) | High | Never use schtasks /end for graceful stop. Always use HTTP /shutdown endpoint. Force kill via taskkill only as last-resort fallback. | +| loginctl enable-linger may require root on some Linux distros | Medium | Attempt and warn (non-fatal). Server starts on login but may not persist across reboots without an active session on those systems. Document the manual sudo command in README. | +| Non-systemd Linux (Alpine, older distros, containers, WSL1) | Medium | Detect systemd absence at register time. Return clear, actionable error. start/stop/status CLI verbs still work via direct process management and HTTP /shutdown -- only automatic service registration is unavailable. | +| Windows "Log on as a batch job" right restricted (domain-joined) | Medium | Detect schtasks /create failure. Provide actionable error naming the specific right. start/stop/status still work via direct process management. | +| launchctl API differences across macOS versions | Low | Use bootstrap/bootout/kickstart API (available since macOS 10.10 Yosemite, 2014). All currently-supported macOS versions have this API. | +| Binary path changes after update break service unit | Medium | install command re-registers the service unit (updates binary path). update command calls install --force, which also re-registers. Document this interaction. | +| Backward compat: existing install/uninstall behavior changes | Medium | Service registration is purely additive -- all existing install steps unchanged. Service removal is prepended to uninstall. All existing tests must pass. | +| Concurrent start race (two starts at the same time) | Low | Existing claimStartupLock prevents double-start. The binary exits 0 when another instance is running (checkRunningInstance). No change needed. | +| /shutdown endpoint security | Low | Localhost-only binding (127.0.0.1). Same trust boundary as the /mcp endpoint, which has full tool access. No auth token needed (parity with existing MCP surface). | +| Stale server.json after crash or kill -9 | Low | Existing checkRunningInstance validates pid + /health and cleans up stale files. No change needed. | + +--- + +## Deferred Items + +Items explicitly out of scope for this sprint but tracked for follow-up: + +- **Log rotation:** The fleet service log (~/.apra-fleet/data/fleet.log) will grow + unboundedly. A future task should add log rotation -- either a size-based rotation at + server startup (rename fleet.log to fleet.log.1, cap at N files) or integration with + OS-native log rotation (logrotate on Linux, newsyslog on macOS). For this sprint, the + log file is append-only with no rotation. +- **TLS / auth on HTTP endpoint:** The /shutdown and /mcp endpoints are localhost-only + (127.0.0.1) and share the same trust boundary. No authentication is added in this + sprint. A follow-up could add a bearer token if the threat model changes. + +--- + +## Notes + +- Each task should result in a git commit +- Verify tasks are checkpoints -- stop and report after each one +- Base branch: feat/mcp-sse-transport (extends PR #273 -- no new branch) +- Implementation branch: feat/mcp-sse-transport (commit directly onto this branch) +- Service name constants: + - Windows task: "ApraFleet" + - Linux unit: "apra-fleet.service" + - macOS label: "com.apra-fleet.server" +- Log file: ~/.apra-fleet/data/fleet.log +- /shutdown endpoint: POST http://127.0.0.1:/shutdown (localhost-only) +- The /shutdown endpoint reuses the existing SIGINT handler chain in index.ts +- ASCII-only in all committed files (pre-commit hook enforced) diff --git a/README.md b/README.md index f8fe4529..c94e8cd8 100644 --- a/README.md +++ b/README.md @@ -233,6 +233,105 @@ reviewer Opus 4.7 final review Provider strengths, role recommendations, and gotchas: [docs/provider-guide.md](docs/provider-guide.md). +## Transport + +Fleet runs as a singleton service on your machine. When you start it, the server +listens on port 7523 by default and multiple LLM clients (Claude Code, Gemini, +Copilot, Codex) connect concurrently to the same fleet instance. + +### HTTP+SSE Transport (default) + +By default, fleet uses the **HTTP+SSE transport** -- clients connect over HTTP and +receive server-push notifications over Server-Sent Events (SSE). + +```bash +apra-fleet # Start HTTP server (default) +apra-fleet --transport http # Explicitly use HTTP +``` + +When the server starts, it writes a `server.json` file to `~/.apra-fleet/` containing: +```json +{ + "pid": 12345, + "port": 7523, + "url": "http://localhost:7523/mcp", + "version": "x.y.z", + "startedAt": "2026-05-19T..." +} +``` + +If port 7523 is busy, the server falls back to port 0 (OS-assigned random port) and +records the actual port in `server.json`. You can override the default port with the +`APRA_FLEET_PORT` environment variable. + +**Multiple clients, one server.** When a second LLM client starts, it reads +`server.json`, detects the running server, and connects to it. All clients share the +same fleet instance -- no restart needed. When you close all clients, the server +keeps running (as a singleton service on your machine). It shuts down on explicit +exit (`apra-fleet --shutdown` tool) or on system reboot. + +**Re-register with HTTP.** When you upgrade or re-install Fleet, run: +```bash +apra-fleet install # Registers fleet with HTTP transport (default) +``` + +### Event Bus + +The event bus is an internal notification system. When a subsystem (like credential +storage) completes an operation, it emits an event, and the HTTP server broadcasts +the notification to all connected clients via SSE. This lets clients respond +immediately to fleet events without polling. + +### Backward Compatibility: stdio Transport + +Existing fleets can continue using the stdio transport: + +```bash +apra-fleet --transport stdio # Use legacy stdio transport +apra-fleet --stdio # Alias for --transport stdio +``` + +When you run `apra-fleet install --transport stdio`, the MCP config keeps the old +command-based format (no HTTP URL). The server's behavior is identical to pre-HTTP +versions: it reads JSON-RPC from stdin, writes responses to stdout, and communicates +with one client at a time via the stdio pipe. + +If you want to stay on stdio for now, run: +```bash +apra-fleet install --transport stdio +``` + +If you later switch back to HTTP, re-run the default install: +```bash +apra-fleet install # Switches to HTTP transport +``` + +## Service Mode + +Fleet keeps a singleton server running so all your LLM clients share one instance. +Registering it as an OS service keeps it alive across terminal sessions -- the server +survives terminal close and restarts automatically on login: + +- Windows: a per-user Scheduled Task (Task Scheduler, OnLogon trigger) +- Linux: a systemd user unit (`systemctl --user`) +- macOS: a LaunchAgent in `~/Library/LaunchAgents/` + +Four verbs manage the lifecycle directly: + +``` +apra-fleet start # start the server (idempotent -- exits cleanly if already running) +apra-fleet stop # graceful shutdown: POST /shutdown, poll, force-kill fallback +apra-fleet restart # stop then start +apra-fleet status # state, PID, port, uptime, version, and OS service status +``` + +`install` and `uninstall` include service registration. Running +`apra-fleet install` on a packaged binary with the HTTP transport (the default) +registers and starts the OS service automatically -- no extra step. +`apra-fleet uninstall` stops and deregisters the service before removing files. +Service registration failures are non-fatal: a warning is printed and the install +continues. + ## The PM skill The **PM skill** is Fleet's reference workflow for **software development** diff --git a/docs/architecture.md b/docs/architecture.md index 32afcc55..6cd5f223 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -1,4 +1,4 @@ - + @@ -6,20 +6,20 @@ ## Why This Exists -AI coding agents are powerful on a single machine. But real work spans many machines — a dev server, a staging box, a GPU trainer, a production host. Today, if you want Claude Code working across all of them, you SSH in manually, run prompts one at a time, and copy files by hand. There's no single pane of glass. +AI coding agents are powerful on a single machine. But real work spans many machines - a dev server, a staging box, a GPU trainer, a production host. Today, if you want Claude Code working across all of them, you SSH in manually, run prompts one at a time, and copy files by hand. There's no single pane of glass. -Apra Fleet gives one Claude instance the ability to orchestrate many. Register machines, push files, run prompts, monitor health — all through natural language from your terminal. One master, many members. +Apra Fleet gives one Claude instance the ability to orchestrate many. Register machines, push files, run prompts, monitor health - all through natural language from your terminal. One master, many members. ## Conceptual Model The system has three layers of abstraction: -**Fleet** → **Members** → **Sessions** +**Fleet** -> **Members** -> **Sessions** -A *fleet* is the collection of all registered machines. A *member* is one machine with a working directory — the unit you talk to. A *session* is a conversation thread on a member — Claude remembers context across prompts within a session, and you can reset it to start fresh. +A *fleet* is the collection of all registered machines. A *member* is one machine with a working directory - the unit you talk to. A *session* is a conversation thread on a member - Claude remembers context across prompts within a session, and you can reset it to start fresh. Members come in two flavors: -- **Remote members** communicate over SSH. They can be any machine you can reach — Linux VMs, macOS servers, Windows boxes. +- **Remote members** communicate over SSH. They can be any machine you can reach - Linux VMs, macOS servers, Windows boxes. - **Local members** run on the same machine as the master, in a different folder. No SSH needed. Useful for isolating work into separate project directories without spinning up another machine. This distinction is hidden behind a **Strategy pattern**: every tool interacts with members through a uniform interface. The strategy implementation (remote via SSH, or local via child process) is selected at runtime based on member type. Tools never know or care which kind of member they're talking to. @@ -27,32 +27,32 @@ This distinction is hidden behind a **Strategy pattern**: every tool interacts w ## How It Fits Together ``` -┌────────────────────────────────────────────────────┐ -│ Master Machine │ -│ │ -│ Claude Code CLI ◄──stdio──► Apra Fleet Server │ -│ │ │ -│ ┌──────────┴──────────┐ │ -│ │ Member Strategy │ │ -│ │ (uniform interface)│ │ -│ └──┬─────────────┬───┘ │ -│ │ │ │ -│ Remote Strategy Local Strategy │ -│ (ssh2 + sftp) (child_process + fs) │ -│ │ │ │ -│ SSH│ local exec │ -└───────────────────────┼─────────────┼──────────────┘ - │ │ - ┌────────────┘ └──► /other/project/ - ▼ (same machine) - ┌──────────────┐ - │ Remote Member │ - │ (any OS, │ - │ any provider)│ - └──────────────┘ -``` - -The MCP server speaks **stdio** — the standard transport for Claude Code MCP servers. Claude sends JSON-RPC tool calls, the server executes them, returns results. No HTTP, no ports to open. ++------------------------------------------------------+ +| Master Machine | +| | +| Claude Code CLI <--stdio--> Apra Fleet Server | +| | | +| +---------+---------+ | +| | Member Strategy | | +| | (uniform interface)| | +| +--+------------+---+ | +| | | | +| Remote Strategy Local Strategy | +| (ssh2 + sftp) (child_process + fs) | +| | | | +| SSH| local exec | ++-------------------+------------+--+---------------+ + | | + +--------+ +---> /other/project/ + | (same machine) + +------+----------+ + | Remote Member | + | (any OS, | + | any provider) | + +----------------+ +``` + +The MCP server speaks **stdio** - the standard transport for Claude Code MCP servers. Claude sends JSON-RPC tool calls, the server executes them, returns results. No HTTP, no ports to open. ## Layers @@ -70,6 +70,161 @@ The codebase follows a strict layering: Each layer only depends on the layers below it. Tools never import other tools. Services don't know about the MCP protocol. +## Transport Layer + +Fleet supports two MCP transports: HTTP+SSE (default) and stdio (legacy). + +### HTTP+SSE Transport (Default) + +The HTTP transport runs as a **singleton service** on your machine. A single fleet +server listens on port 7523 and multiple LLM clients connect concurrently. Each +client gets its own session with a dedicated `McpServer` instance inside the fleet +process, so tool calls and state are isolated per client. + +``` + Client 1 (Claude Code) Client 2 (Gemini) + | | + +-------------+---+----------+ + | + HTTP + SSE | + | + +-------+-------+ + | Singleton | + | Fleet Server | + | (port 7523) | + +-------+-------+ + | + +-----------|----------+ + | | | + McpServer McpServer Tool Registry + (Session 1) (Session 2) (shared) + | | + +------+----+ + | + Event Bus (notifications) +``` + +**Per-session McpServer model:** When a client connects, the fleet creates a new +`McpServer` instance for that session. This isolates tool call state, session storage, +and concurrent requests. Multiple clients can call the same tool simultaneously +without interfering with each other. + +**Event bus:** The fleet's internal event bus (`FleetEventMap`) carries notifications +from subsystems (e.g., `credential:stored` when out-of-band auth completes) to all +connected clients via SSE `notifications/message`. This is the publish-subscribe +mechanism for server-initiated events. + +**Singleton lifecycle:** The server starts on-demand the first time an LLM client +connects. Subsequent clients reuse the running server. The server keeps running until +explicitly shut down (via `shutdown_server` tool, SIGINT, SIGTERM, or system reboot). +This is intentional - the singleton is a long-lived service, not a per-request +process. Restarting it has a cost (tool re-registration, SSH connection repool, +stall detector restart). + +**server.json discovery:** When the server starts, it writes `~/.apra-fleet/server.json` +with `{ pid, port, url, version, startedAt }`. Clients discover the running instance +by reading this file and verifying the process is alive and the port responds to +`/health` endpoint. The double-check (process.kill(pid, 0) + HTTP health request) +detects stale entries and cleans them up. + +**Localhost-only binding:** The fleet server binds to `127.0.0.1` only, never +`0.0.0.0`. This ensures only local processes can connect -- no network exposure. + +### Stdio Transport (Legacy) + +When `--transport stdio` is used, the fleet runs in the legacy mode: one MCP server +process per client connection. The server reads JSON-RPC from stdin, writes responses +to stdout, and terminates when the client disconnects. No HTTP, no singleton, no +event bus. Tools work identically; the transport layer differs. + +### Event Flow Subsystem -> Notification + +When an event is emitted on the event bus: + +1. **Subsystem** (e.g., `auth-socket.ts`) calls `fleetEvents.emit('credential:stored', { name: ... })` +2. **Event Bus** (`event-bus.ts`) delivers the event to all registered subscribers +3. **HTTP Transport** (`http-transport.ts`) receives the event in its subscriber callback +4. **Per-session McpServer** sends a `notifications/message` to each connected client over SSE +5. **Client** receives the notification in its SSE stream handler + +This is the publish-subscribe pattern: producers emit to the bus, subscribers (the +HTTP transport) are notified, and the transport broadcasts to all session clients. + +## Service Manager + +The `ServiceManager` component registers and controls the fleet server as an OS +background service. It uses an adapter pattern so the CLI verbs (`start`, `stop`, +`restart`, `status`) and the `install`/`uninstall` commands work identically on every +platform. + +### Interface + +`src/services/service-manager/types.ts` defines the contract: + +``` +interface ServiceManager { + register(binaryPath, args, logPath): Promise + unregister(): Promise + start(): Promise + stop(): Promise + query(): Promise + isInstalled(): Promise +} + +interface ServiceStatus { + installed: boolean + running: boolean + pid?: number + enabled?: boolean +} +``` + +Service name constants are also in `types.ts`: `WINDOWS_TASK_NAME`, +`LINUX_UNIT_NAME`, `MACOS_PLIST_LABEL`. + +### Platform Adapters + +``` +src/services/service-manager/ + types.ts - ServiceManager interface, ServiceStatus, service name constants + index.ts - getServiceManager() factory, gracefulStopByServerJson(), NoopServiceManager + windows.ts - WindowsServiceManager (schtasks per-user Scheduled Task) + linux.ts - LinuxServiceManager (systemd --user unit) + macos.ts - MacOSServiceManager (launchd LaunchAgent plist) +``` + +- **WindowsServiceManager**: writes a wrapper `.bat` file and creates a per-user + Scheduled Task with an `OnLogon` trigger via `schtasks /create`. `start`, `stop`, + and `query` use `schtasks /run`, `/end`, and `/query`. +- **LinuxServiceManager**: writes a systemd user unit file, then runs `daemon-reload`, + `enable`, and `loginctl enable-linger`. `start`, `stop`, and `query` use + `systemctl --user`. +- **MacOSServiceManager**: writes a plist to `~/Library/LaunchAgents/` and bootstraps + it with `launchctl bootstrap`. `KeepAlive.SuccessfulExit=false` prevents launchd + from restarting on a clean exit. `start`, `stop`, and `query` use `launchctl`. + +### Factory + +`getServiceManager()` in `index.ts` selects the right adapter at runtime via a +dynamic `import()` keyed on `process.platform`: + +``` +win32 -> WindowsServiceManager +linux -> LinuxServiceManager +darwin -> MacOSServiceManager +other -> NoopServiceManager (warns once; all methods are safe no-ops) +``` + +`NoopServiceManager` ensures the CLI verbs work on unsupported platforms without +crashing -- they simply have no effect. + +### Graceful Stop + +`gracefulStopByServerJson()` (exported from `index.ts`) reads +`~/.apra-fleet/server.json`, POSTs to the `/shutdown` endpoint, then polls the +process at 500 ms intervals for up to 5 s. If the process does not exit in time, +it falls back to `taskkill /F` on Windows or `SIGTERM` on Unix. + ## Provider Abstraction Fleet supports five LLM providers: Claude Code, Google Antigravity CLI (agy), OpenAI Codex CLI, GitHub Copilot CLI, and Gemini CLI. Members can mix providers within a single fleet. @@ -79,18 +234,18 @@ Fleet supports five LLM providers: Claude Code, Google Antigravity CLI (agy), Op Each member has an optional `llmProvider` field (`'claude' | 'agy' | 'codex' | 'copilot' | 'gemini'`). When absent, it defaults to `'claude'` for backwards compatibility. Every tool that interacts with the member's LLM CLI resolves the provider via `getProvider(agent.llmProvider)` and delegates CLI-specific concerns to the `ProviderAdapter` interface. ``` -┌──────────┐ getProvider() ┌─────────────────┐ -│ Tool │ ───────────────────► │ ProviderAdapter │ -│ (generic)│ │ (per-provider) │ -└──────────┘ └────────┬─────────┘ - │ supplies: - cliCommand() - buildPromptCommand() - parseResponse() - classifyError() - authEnvVar - processName - ... ++----------+ getProvider() +----------------+ +| Tool | --------+----------> | ProviderAdapter | +| (generic)| | (per-provider) | ++----------+ +--------+--------+ + | supplies: + cliCommand() + buildPromptCommand() + parseResponse() + classifyError() + authEnvVar + processName + ... ``` The `OsCommands` layer sits below this: it handles OS-specific shell wrapping (PATH prepend, PowerShell syntax, base64 decode) and delegates CLI-specific parts (binary name, flags, JSON format) to the provider. @@ -110,7 +265,7 @@ src/providers/ ### Mix-and-Match Fleet -A fleet can have members on different providers simultaneously. The PM dispatches work to members by name — it doesn't need to know which LLM backend each member uses. The fleet server resolves the correct CLI commands per member at runtime. +A fleet can have members on different providers simultaneously. The PM dispatches work to members by name - it doesn't need to know which LLM backend each member uses. The fleet server resolves the correct CLI commands per member at runtime. ``` PM (orchestrator, Claude) @@ -136,11 +291,11 @@ See `docs/provider-matrix.md` for the full comparison table. ### Strategy Pattern for Member Types -Rather than scattering `if (agent.agentType === 'local')` checks across every tool, the local/remote distinction lives in a single place: the strategy factory. Tools call `getStrategy(agent).execCommand(...)` and get back the same result shape regardless of how it was executed. Adding a third member type (e.g., Docker containers, cloud VMs with API-based access) means writing one new strategy class — no tool changes. +Rather than scattering `if (agent.agentType === 'local')` checks across every tool, the local/remote distinction lives in a single place: the strategy factory. Tools call `getStrategy(agent).execCommand(...)` and get back the same result shape regardless of how it was executed. Adding a third member type (e.g., Docker containers, cloud VMs with API-based access) means writing one new strategy class - no tool changes. ### Passwords Encrypted at Rest -SSH passwords are encrypted with AES-256-GCM before being written to the registry file. The encryption key is derived from the machine's identity (hostname + OS username), so the registry file is meaningless if copied to another machine. This isn't meant to stop a determined attacker with root access — it prevents accidental plaintext exposure in backups, screenshots, or config file shares. +SSH passwords are encrypted with AES-256-GCM before being written to the registry file. The encryption key is derived from the machine's identity (hostname + OS username), so the registry file is meaningless if copied to another machine. This isn't meant to stop a determined attacker with root access - it prevents accidental plaintext exposure in backups, screenshots, or config file shares. ### Connection Pooling with Idle Timeout @@ -148,15 +303,15 @@ SSH connections are expensive to establish (TCP + key exchange + auth). The serv ### Base64 Prompt Encoding -Prompts sent to remote members are base64-encoded before being passed through SSH. This sidesteps the shell escaping nightmare of nested quoting across SSH → bash → claude CLI, across different operating systems. The remote member decodes before passing to Claude. +Prompts sent to remote members are base64-encoded before being passed through SSH. This sidesteps the shell escaping nightmare of nested quoting across SSH -> bash -> claude CLI, across different operating systems. The remote member decodes before passing to Claude. ### Session Persistence -Each member stores an optional `sessionId` — a Claude conversation thread ID. When `resume=true` (the default), subsequent prompts continue the same conversation, so the remote Claude has full context of prior exchanges. Resetting a session is an explicit action, not an accident. +Each member stores an optional `sessionId` - a Claude conversation thread ID. When `resume=true` (the default), subsequent prompts continue the same conversation, so the remote Claude has full context of prior exchanges. Resetting a session is an explicit action, not an accident. ### File-Based Registry -All fleet state lives in `~/.apra-fleet/data/registry.json` — a single JSON file in the user's home directory. It's deliberately not in the project directory (won't be git-committed accidentally) and not in a database (no server to run, no migrations). For a fleet of dozens of members, JSON is more than sufficient. +All fleet state lives in `~/.apra-fleet/data/registry.json` - a single JSON file in the user's home directory. It's deliberately not in the project directory (won't be git-committed accidentally) and not in a database (no server to run, no migrations). For a fleet of dozens of members, JSON is more than sufficient. ### Duplicate Folder Prevention @@ -166,18 +321,18 @@ Two members cannot share the same working directory on the same device. For remo The tools break into natural groups. Each group has detailed documentation: -**[Lifecycle](tools-lifecycle.md)** — `register_member`, `list_members`, `update_member`, `remove_member`, `shutdown_server` +**[Lifecycle](tools-lifecycle.md)** - `register_member`, `list_members`, `update_member`, `remove_member`, `shutdown_server` Manage the fleet roster and server lifecycle. Registration validates connectivity, detects the OS, and checks that Claude CLI is available. Removal includes best-effort cleanup of auth credentials on the member. -**[Work](tools-work.md)** — `send_files`, `execute_prompt`, `execute_command`, `reset_session` +**[Work](tools-work.md)** - `send_files`, `execute_prompt`, `execute_command`, `reset_session` The core workflow. Push files to a member, run prompts against it, run shell commands directly, manage conversation sessions. -**[Infrastructure](tools-infrastructure.md)** — `provision_llm_auth`, `setup_ssh_key`, `update_llm_cli` +**[Infrastructure](tools-infrastructure.md)** - `provision_llm_auth`, `setup_ssh_key`, `update_llm_cli` One-time setup and maintenance. Provision auth (copy OAuth credentials or deploy API key for any provider), migrate from password to key auth, update the LLM CLI on members. -**[Observability](tools-observability.md)** — `fleet_status`, `member_detail` +**[Observability](tools-observability.md)** - `fleet_status`, `member_detail` Two-layer monitoring. `fleet_status` gives a quick summary table across all members with fleet-aware busy detection (distinguishes between Claude processes serving this member vs unrelated Claude activity). `member_detail` drills into one member with connectivity, CLI version, session state, and system resource metrics. ## Cross-Platform Support -Members can run Windows, macOS, or Linux. The `platform.ts` utility generates the right shell commands for each OS — different commands for checking processes, reading memory, setting environment variables. The OS is auto-detected during registration (`uname -s` on Unix, `cmd /c ver` on Windows) and stored in the member record so subsequent tool calls don't need to re-detect. +Members can run Windows, macOS, or Linux. The `platform.ts` utility generates the right shell commands for each OS - different commands for checking processes, reading memory, setting environment variables. The OS is auto-detected during registration (`uname -s` on Unix, `cmd /c ver` on Windows) and stored in the member record so subsequent tool calls don't need to re-detect. diff --git a/feedback.md b/feedback.md new file mode 100644 index 00000000..1c445cfb --- /dev/null +++ b/feedback.md @@ -0,0 +1,657 @@ +# OS Service Lifecycle -- Plan Review + +**Reviewer:** rbnvk +**Date:** 2026-05-19 12:38:29-0400 +**Verdict:** CHANGES NEEDED + +> See the recent git history of this file to understand the context of this review. + +--- + +## 1. Template Checklist + +### 1.1 Does every task have clear "done" criteria? + +**PASS.** Every task (T1--T15) has an explicit "Done when" block with testable conditions. +T6 and T10 (test tasks) specify coverage targets and the requirement that `npm test` stays +green. T14 and T15 (docs) specify ASCII-only enforcement. No ambiguity. + +### 1.2 High cohesion within each task, low coupling between tasks? + +**PASS.** Each task has a single concern: T1 is the shutdown endpoint + constants, T2 is +the interface + factory, T3--T5 are one adapter each, T7 bundles start+stop (these share +the same server.json/PID lifecycle and are natural counterparts), T9 is status, T11 is +install extension, T12 is uninstall extension. Clean boundaries. + +### 1.3 Are key abstractions and shared interfaces in the earliest tasks? + +**PASS.** The ServiceManager interface (T2) and service constants (T1) land in Phase 1 +before any consumer task. The /shutdown endpoint (T1) is also correctly front-loaded since +it is consumed by adapters (T3 Windows stop, T5 macOS stop) and the stop CLI verb (T7). + +### 1.4 Is the riskiest assumption validated in Task 1? + +**PASS.** Phase 1 front-loads the two riskiest assumptions: (a) per-user service management +without elevation across all three OSes (Tasks 3--5), and (b) cross-platform graceful stop +via the /shutdown endpoint (Task 1). The plan explicitly acknowledges this sequencing in +the Phase 1 preamble: "If schtasks/systemctl/launchctl cannot be called without elevation, +this phase fails immediately." + +T1 is the shutdown endpoint + constants (cheap), which is the foundation but not itself the +risky assumption. The risky per-user-no-elevation validation happens in T3--T5. This is +acceptable because the interface (T2) needs to exist before the adapters can be written, +and T1 provides the /shutdown endpoint that T3 and T5 depend on for their stop +implementations. + +### 1.5 Later tasks reuse early abstractions (DRY)? + +**PASS.** T7 (start/stop CLI) calls `getServiceManager()` from T2. T9 (status) calls +`serviceManager.query()`. T11 (install) calls `serviceManager.register()` + +`serviceManager.start()`. T12 (uninstall) calls `serviceManager.unregister()`. The adapter +pattern is consistently reused throughout. + +### 1.6 Phase boundaries at cohesion boundaries? + +**PASS.** Phase 1 (adapter layer) is self-contained -- produces a testable service manager +with mocked tests. Phase 2 (CLI verbs) builds on Phase 1 and produces functional commands. +Phase 3 (install/uninstall integration) wires everything together. Phase 4 (docs) is +standalone. Each phase is reviewable and testable independently. + +### 1.7 Are tiers monotonically non-decreasing within each phase? + +**FAIL (HIGH-1).** Phase 2 tiers are: T7 cheap, T8 cheap, T9 standard, T10 standard. +That is monotonically non-decreasing -- fine. Phase 1 tiers: T1 cheap, T2 standard, +T3 standard, T4 standard, T5 standard, T6 standard -- fine. Phase 3: T11 standard, +T12 standard, T13 standard -- fine. Phase 4: T14 cheap, T15 cheap -- fine. + +Actually, on closer inspection this is all compliant. **Changing to PASS.** Retracted. + +**PASS.** All phases have monotonically non-decreasing tiers. + +### 1.8 Each task completable in one session? + +**PASS.** All tasks are scoped to a single file or a small set of related files. The +largest tasks (T3--T5, one adapter each) are well-bounded: a single class implementing a +known interface with 5--6 methods, each method being a shell command wrapper. T7 bundles +start+stop but these are thin CLI modules (~60 lines each). Reasonable for one session. + +### 1.9 Dependencies satisfied in order? + +**PASS.** T1 has no dependencies. T2 depends on T1 (types file). T3--T5 depend on T2 +(interface). T6 depends on T3--T5. T7 depends on Phase 1. T8 depends on T7. T9 depends +on Phase 1. T10 depends on T7--T9. T11 depends on Phase 1. T12 depends on T11. T13 +depends on T11--T12. T14--T15 have no code dependencies. All valid. + +### 1.10 Any vague tasks that two developers would interpret differently? + +**FAIL (HIGH-2).** Task 7 (start command) says: "for dev mode, use process.execPath (node) +with args [dist/index.js, --transport, http]" but does not specify how to determine the +path to dist/index.js in dev mode. In install.ts, the existing code uses `findProjectRoot()` +to locate the project root. T7 needs to specify whether to use the same mechanism or +hardcode a relative path. Two developers would make different choices here. + +Additionally, T7's stop command says "POST /shutdown to the URL" but the /shutdown endpoint +is defined in T1 as being added to http-transport.ts. The stop command must also handle +the case where the server is running but the /shutdown endpoint is not yet deployed (e.g., +an older version of the binary is running from a previous install). The plan does not +address this version-skew scenario. **NOTE** (not blocking): the fallback kill path +covers this, but it should be explicitly called out. + +### 1.11 Any hidden dependencies between tasks? + +**PASS.** No hidden dependencies found. T7's stop command depends on the /shutdown endpoint +from T1, which is correctly listed as a Phase 1 dependency. T11's service registration +depends on knowing the binary path, which is already computed in install.ts. + +### 1.12 Does the plan include a risk register? + +**PASS.** The risk register covers 10 risks with impact and mitigation. It addresses: +schtasks /end hard kill, loginctl linger requiring root, non-systemd Linux, Windows batch +job rights, macOS API versions, binary path on update, backward compat, concurrent start +race, /shutdown security, and stale server.json. This is thorough. + +### 1.13 Does the plan align with requirements.md intent? + +**FAIL (HIGH-3).** The Notes section states "Base branch: main" but requirements.md +explicitly says "Base Branch: feat/mcp-sse-transport" and "Commit directly onto that +branch; no new branch." This is a direct contradiction with the requirements. The +implementation branch line is correct (feat/mcp-sse-transport) but the base branch line +is wrong. This must be fixed to avoid confusion -- a doer might create the PR against +main instead of extending PR #273. + +--- + +## 2. Risk Checklist (from prep) + +### Risk 1: Per-user service registration with NO elevation + +**PASS.** The plan explicitly addresses all three OSes: +- **Windows:** `schtasks /create ... /rl limited` (no elevation). Risk register notes + "Log on as a batch job" right restriction on domain-joined machines with mitigation. +- **Linux:** `systemctl --user` (no elevation). loginctl enable-linger attempted with + non-fatal warning. Non-systemd detection throws actionable error. +- **macOS:** `launchctl bootstrap gui/` (no elevation, LaunchAgent not LaunchDaemon). + +All three are well-specified. The dual-path (service vs. direct) fallback is defined. + +### Risk 2: Graceful shutdown on all platforms + +**PASS.** The plan introduces a POST /shutdown endpoint (T1) that triggers the existing +SIGINT handler chain. This is the correct solution for Windows where schtasks /end does +TerminateProcess. The risk register explicitly calls out "Never use schtasks /end for +graceful stop." Stop always goes through HTTP /shutdown on all OSes, with a force-kill +fallback after 5s timeout. This ensures server.json and lock cleanup. + +The plan correctly configures service managers to NOT restart after clean exit: +- systemd: Restart=on-failure (exit 0 = no restart) +- launchd: KeepAlive.SuccessfulExit=false (exit 0 = no restart) +- Windows: schtasks at-logon trigger only (no restart semantics) + +### Risk 3: Interplay with server.json and singleton lock + +**PASS.** The plan reuses the existing checkRunningInstance() (validates pid + /health) for +the start command's idempotency check. The stop command cleans up stale server.json and +lock file after the server exits. The risk register notes that the existing stale-file +cleanup mechanism is sufficient. The existing claimStartupLock prevents concurrent starts. + +### Risk 4: start with vs. without a service unit installed + +**PASS.** The Verb x OS Matrix explicitly defines both columns ("Service Installed" vs. +"No Service Installed") for the start verb. When no service is installed, the binary is +spawned detached with stdout/stderr redirected to LOG_FILE_PATH. T7 spells out both paths +with the `isInstalled()` check determining which path to take. + +### Risk 5: Idempotency of every verb x every OS + +**PASS.** The plan states idempotency for each verb: +- start: checkRunningInstance() first, exit 0 if running. +- stop: if not running, report and exit 0. Clean up stale files. +- install: schtasks /create uses /f (force/overwrite). systemctl enable is idempotent. + launchctl bootstrap may need bootout first -- see HIGH-4 below. +- uninstall: each step tolerates "not found" errors. + +**FAIL (HIGH-4).** The install verb for macOS calls `launchctl bootstrap gui/ +` but does NOT call `launchctl bootout` first. If the service is already +loaded (e.g., user runs `install` twice), `launchctl bootstrap` will fail with "service +already loaded." The plan must specify that install either (a) calls bootout before +bootstrap, or (b) catches the "already loaded" error and proceeds. This is a real +idempotency gap. Windows schtasks uses /f which handles re-registration. Linux systemctl +enable is inherently idempotent. Only macOS bootstrap has this issue. + +### Risk 6: No regression to existing install/uninstall MCP-config behaviour + +**PASS.** T11 explicitly states: "after the existing final step (Beads tracker install + +permissions + install-config.json), add a new step." The existing install steps are +unchanged. T12 states service removal is "prepended" to the existing uninstall steps. +The risk register includes "backward compat" and notes service registration is "purely +additive." + +### Risk 7: Log file redirection + +**PASS.** LOG_FILE_PATH is defined in T1 as `~/.apra-fleet/data/fleet.log`. Per-OS +mechanisms are specified: +- Windows: wrapper.bat handles redirection (schtasks cannot redirect natively). +- Linux: StandardOutput=append:, StandardError=append:. +- macOS: StandardOutPath=logPath, StandardErrorPath=logPath. +- Direct spawn (no service): stdout/stderr redirect to LOG_FILE_PATH. + +**NOTE:** No log rotation strategy is mentioned. The log file will grow unboundedly. This +is not blocking for this sprint but should be tracked as a follow-up. + +### Risk 8: Binary path in service units + +**PASS.** T11 specifies `serviceManager.register(binaryPath, ...)` where binaryPath is +the installed binary from install.ts (BIN_DIR + binary name). The risk register notes +"install command re-registers the service unit (updates binary path)" and "update command +calls install --force, which also re-registers." + +### Risk 9: status command richness with/without a unit installed + +**PASS.** T9 specifies the full output format with all required fields (pid, port, url, +version, uptime, sessions, service state). The "Service" line is "always shown regardless +of server state" and covers installed/enabled/disabled/not-installed states. When server +is stopped, pid/port/url/uptime/sessions are omitted. This matches requirements. + +### Risk 10: Port fallback interaction + +**PASS.** The stop command reads the URL from server.json (which contains the actual port +the server bound to, whether 7523 or a fallback port). The status command also reads +server.json. This correctly handles the port fallback case without assuming the default +port. + +--- + +## 3. Verb x OS Matrix Completeness + +**PASS.** The plan includes an explicit Verb x OS Matrix section with tables for start, +stop, restart, status, install, and uninstall. Each cell specifies the exact OS command +or behavior. No "and similarly for X" is used. Each OS is explicitly covered for every +verb. This directly satisfies the requirements.md mandate: "The plan MUST explicitly walk +through all three OSes for every verb." + +--- + +## 4. Acceptance Criteria Mapping + +Mapping each acceptance criterion from requirements.md to plan tasks: + +1. "install registers a per-user service and server is running immediately" -> T11. **PASS.** +2. "Server comes back after reboot/re-login on all three OSes" -> T3 (at-logon trigger), + T4 (WantedBy=default.target + linger), T5 (RunAtLoad=true). **PASS.** +3. "start/stop/restart/status work idempotently on all OSes" -> T7, T8, T9. **PASS.** +4. "status reports pid/port/url/version/uptime/sessions and service-unit state" -> T9. + **PASS.** +5. "uninstall stops server and removes service unit and MCP config" -> T12. **PASS.** +6. "No elevation/admin/root or UAC" -> All adapter tasks (T3--T5) use per-user commands. + **PASS.** +7. "Tests cover verb logic and per-OS adapter" -> T6, T10, T13. **PASS.** +8. "Docs updated" -> T14, T15. **PASS.** + +All acceptance criteria map to at least one task. No gaps. + +--- + +## 5. VERIFY Checkpoint Placement + +**PASS.** VERIFY checkpoints are placed at the end of each phase: +- After Phase 1 (T6): run tests, confirm compile, no regressions. +- After Phase 2 (T10): run tests, verify --help, no regressions. +- After Phase 3 (T13): run tests, confirm install/uninstall lifecycle, no regressions. +- After Phase 4 (T15): confirm ASCII-only, docs accurate. + +Each checkpoint specifies what to verify and asks for a report. Correct placement. + +--- + +## 6. Additional Findings + +### HIGH-5: Linux adapter stop() uses systemctl --user stop, not /shutdown + +Task 4 (Linux adapter) defines stop() as: "`systemctl --user stop apra-fleet` (sends +SIGTERM, handled gracefully by existing handler)." However, the Verb x OS Matrix for +`stop` says all OSes use "Read server.json -> POST /shutdown -> wait -> fallback." +Meanwhile, Task 7 (stop CLI verb) always uses the HTTP /shutdown approach regardless of +OS. + +There is a contradiction: the Linux adapter's `stop()` method uses `systemctl --user stop` +(which sends SIGTERM), but the CLI stop verb bypasses the adapter entirely and uses HTTP +/shutdown. This means `serviceManager.stop()` on Linux differs from the CLI stop behavior. + +This matters because T12 (uninstall) calls the graceful /shutdown approach, but T11 +(install) calls `serviceManager.start()` which could later be stopped by either path. + +The plan needs to clarify: does the CLI `stop` verb call `serviceManager.stop()` or does +it always go through HTTP /shutdown directly? If the latter, what is +`serviceManager.stop()` used for? If the adapter's stop() is never called by any CLI verb, +it is dead code. If it IS called, then the Linux adapter's use of `systemctl --user stop` +is fine (SIGTERM is handled gracefully), but the Windows adapter's stop() also uses HTTP +/shutdown, creating an inconsistency in the adapter interface contract. + +**Resolution needed:** Either (a) make all adapters' stop() use HTTP /shutdown for +consistency and have the CLI call `serviceManager.stop()`, or (b) have the CLI always +bypass the adapter for stop and document that `serviceManager.stop()` is an internal +method for the unregister flow only, or (c) clarify the exact call path in T7 and T12. + +--- + +## Summary + +The plan is well-structured with clear task boundaries, proper dependency ordering, +front-loaded risk validation, and a thorough Verb x OS Matrix. The ServiceManager adapter +pattern is sound and the /shutdown endpoint elegantly solves the cross-platform graceful +stop problem. All acceptance criteria map to tasks. + +**Three blocking items must be resolved:** + +- **HIGH-2:** T7 (start/stop CLI) is underspecified for dev-mode binary path resolution. + Clarify how dist/index.js is located. +- **HIGH-3:** Notes section says "Base branch: main" -- contradicts requirements.md which + mandates feat/mcp-sse-transport. Fix the Notes. +- **HIGH-4:** macOS install idempotency gap -- `launchctl bootstrap` will fail if already + loaded. Must bootout first or handle the error. +- **HIGH-5:** Contradictory stop() semantics between the Linux adapter (systemctl stop), + the CLI stop verb (HTTP /shutdown), and the Verb x OS Matrix. Clarify the call path. + +**Non-blocking notes:** +- No log rotation strategy (track as follow-up). +- Version-skew scenario for /shutdown endpoint not explicitly addressed (fallback kill + covers it, but should be noted). + +--- +--- + +# OS Service Lifecycle -- Plan Re-Review + +**Reviewer:** rbnvk +**Date:** 2026-05-19 12:50:00-0400 +**Verdict:** APPROVED + +> See the recent git history of this file to understand the context of this review. +> Prior review: 2026-05-19 12:38:29-0400 -- CHANGES NEEDED with 4 HIGH findings. +> Doer revised PLAN.md in commit 7f712fb to address all findings. + +--- + +## Prior HIGH Findings -- Resolution Verification + +### HIGH-2: Task 7 dev-mode binary path resolution + +**RESOLVED.** T7 now explicitly specifies: "In dev mode (non-SEA), the command is +`process.execPath` (the Node.js binary) with args +`[path.join(findProjectRoot(), 'dist', 'index.js'), '--transport', 'http']` -- using the +same `findProjectRoot()` function from src/cli/install.ts that walks up from __dirname +looking for version.json. Import `findProjectRoot` from install.ts (it is already exported) +or extract it to a shared util." + +This is unambiguous -- two developers would make the same choice. The version-skew concern +from the NOTE is also now addressed: T7 explicitly documents the fallback: "if an older +binary without the /shutdown endpoint is running, the POST will fail (404 or connection +error). The fallback force-kill path handles this correctly." + +### HIGH-3: Base branch contradiction + +**RESOLVED.** Notes section now reads: "Base branch: feat/mcp-sse-transport (extends +PR #273 -- no new branch)" and "Implementation branch: feat/mcp-sse-transport (commit +directly onto this branch)." This matches requirements.md exactly. No remaining references +to "main" as base branch anywhere in PLAN.md. + +### HIGH-4: macOS install idempotency gap + +**RESOLVED.** T5 register() now includes: "Before loading, call `launchctl bootout +gui//com.apra-fleet.server` and tolerate 'not loaded' / 'no such process' errors -- +this makes register() idempotent." The Verb x OS Matrix install row for macOS also +reflects this: "launchctl bootout ... (tolerate 'not loaded' error). Then launchctl +bootstrap ..." Both the adapter task and the matrix are consistent. + +### HIGH-5: Contradictory stop() semantics + +**RESOLVED.** The Design Summary now includes a dedicated "Stop call path (unified)" +paragraph that clarifies the architecture: + +1. The CLI `stop` verb bypasses the adapter entirely and calls POST /shutdown directly + (since stopping the process is service-agnostic). +2. All three adapters' stop() methods use the same POST /shutdown mechanism (not + systemctl stop or OS-specific commands) for cross-platform consistency. +3. serviceManager.stop() exists for use within unregister() and for interface + completeness, but the CLI never routes through it. + +T4 (Linux adapter) stop() now reads: "Read server.json for URL. POST /shutdown. Wait up +to 5s for process exit (poll pid). Fallback: kill -TERM . This matches the Windows +and macOS adapters." The prior contradiction with `systemctl --user stop` is fully +eliminated. All three adapters share the same contract. + +--- + +## Structural Re-Verification + +### Task slicing / ordering / dependencies + +**PASS.** No changes to task boundaries or ordering. The revisions were surgical -- +clarifications within T4, T5, and T7 without altering the phase structure. + +### Tier assignments and monotonicity + +**PASS.** No tier changes. Phase 1: cheap -> standard (5x). Phase 2: cheap (2x) -> +standard (2x). Phase 3: standard (3x). Phase 4: cheap (2x). All monotonically +non-decreasing within each phase. + +### VERIFY checkpoint placement + +**PASS.** All four VERIFY blocks remain at phase boundaries. No changes. + +### Acceptance criteria mapping + +**PASS.** All 8 acceptance criteria from requirements.md still map to tasks. The +revisions did not remove or alter any task's scope. Verified: + +1. install -> service running immediately: T11. Covered. +2. Reboot/re-login persistence: T3 (at-logon), T4 (linger), T5 (RunAtLoad). Covered. +3. Verb idempotency on all OSes: T7, T8, T9. Covered. +4. status richness: T9. Covered. +5. uninstall cleanup: T12. Covered. +6. No elevation: T3--T5 per-user commands. Covered. +7. Test coverage: T6, T10, T13. Covered. +8. Docs: T14, T15. Covered. + +### Deferred Items section + +**PASS.** New "Deferred Items" section exists and tracks: (1) log rotation -- noted as +append-only with no rotation for this sprint, with follow-up approaches listed +(size-based rotation, OS-native logrotate/newsyslog); (2) TLS/auth on HTTP endpoint. +Both are correctly scoped out. + +### New problems introduced by revision + +**NONE FOUND.** The revisions are clean clarifications that do not introduce new +ambiguity, contradictions, or gaps. The Verb x OS Matrix, risk register, and task +descriptions remain internally consistent after the changes. + +--- + +## Summary + +All four HIGH findings from the initial review are fully resolved. The plan is +well-structured with clear task boundaries, proper dependency ordering, front-loaded risk +validation, a complete Verb x OS Matrix, and a unified stop call path. The Deferred Items +section tracks log rotation and TLS/auth as follow-ups. All acceptance criteria map to +tasks. No new issues introduced by the revision. + +**Verdict: APPROVED.** The plan is ready for implementation. + +--- +--- + +# Phase 1 (Platform Service Foundation) -- Code Review + +**Reviewer:** rbnvk +**Date:** 2026-05-19 18:15:00-0400 +**Verdict:** APPROVED + +> Commits reviewed: 9963198 (T2), 98115b9 (T3), 93da1fa (T4), 1be25ed (T5), 490ead1 (T6), 224cd11 (T6.5) +> Build: PASS (tsc clean). Tests: 1372 passed, 6 skipped. ASCII: clean (all reviewed files). + +--- + +## 1. Platform Command Correctness + +### Windows (src/services/service-manager/windows.ts) + +**PASS.** `schtasks /create /tn ApraFleet /tr /sc onlogon /rl limited /f` is +correct: `/sc onlogon` triggers at user login, `/rl limited` runs without elevation, `/f` +forces overwrite for idempotency. `/run` starts the task. `/delete /f` removes it without +confirmation. `/query /fo csv /nh` returns machine-parseable output. All flags verified +against schtasks documentation. + +Stop path correctly uses `gracefulStopByServerJson` with a `taskkill /F /PID` fallback, +avoiding `schtasks /end` (which does TerminateProcess). Matches the plan's unified stop +semantics. + +### Linux (src/services/service-manager/linux.ts) + +**PASS.** All `systemctl` calls use `--user` flag throughout. `daemon-reload` after writing +the unit file. `enable` to set up WantedBy symlink. `loginctl enable-linger` attempted +with `console.warn` on failure (non-fatal, as specified in plan). The `checkSystemd()` +guard validates `/run/user//systemd` exists before any systemd operation. + +Unit file content is correct: `Type=simple`, `Restart=on-failure` (no restart on clean +exit), `StandardOutput=append:`, `StandardError=append:`, +`WantedBy=default.target`. + +### macOS (src/services/service-manager/macos.ts) + +**PASS.** `launchctl bootstrap gui/ ` for registration, `launchctl bootout +gui//