diff --git a/docs/memory/architecture.md b/docs/memory/architecture.md
index 25cd0115..b7b92a05 100644
--- a/docs/memory/architecture.md
+++ b/docs/memory/architecture.md
@@ -62,7 +62,7 @@
 | `logging_config.py` | Structured JSON logging (captured by Vector); OTel trace ID in log entries for log-trace correlation (RELIABILITY-002) |
 | `redis_breaker_util.py` | Shared Redis plumbing (fail-open client, Lua `ScriptCache`, decode helpers) used by both circuit breakers |

-**OpenTelemetry tracing** (RELIABILITY-002): auto-instrumentation for FastAPI/httpx/Redis; `traceparent` propagated through inter-agent calls; OTLP/gRPC export to `trinity-otel-collector:4317`; `OTEL_ENABLED=1`, sampling via `OTEL_SAMPLE_RATE` (default 10%).
+**OpenTelemetry tracing** (RELIABILITY-002): auto-instrumentation for FastAPI/httpx/Redis; `traceparent` propagated through inter-agent calls; OTLP/gRPC export to `trinity-otel-collector:4317`; `OTEL_ENABLED=1`, `OTEL_SAMPLE_RATE` (default 10%).

 **Routers (`routers/`)** — 53 router modules:

@@ -150,7 +150,7 @@
 - `operator_intake_service.py` - Fire-and-forget, once-per-install opt-in operator intake POST at first-run; owns `installation_id` (trinity-enterprise#38)

 *Execution & Scheduling:*
-- `task_execution_service.py` - Unified task execution lifecycle: slot mgmt, activity tracking, sanitization (EXEC-024). Runs the #678 reader-race auto-retry: on an empty result (502 dict body with `num_turns < 5`, `raw_message_count == 0`, `parse_failure_count == 0`) it fires one in-line retry with the **same** `execution_id` capped at 300s, persisting `retry_count` and rolling previous-attempt cost into the terminal write. Records dispatch-breaker outcomes — see [Circuit Breakers](#circuit-breakers-transport--dispatch-526). Hosts `apply_result` (the shared terminal applier) and the 202 dispatch-and-return path — see [Fire-and-Forget Dispatch](#fire-and-forget-dispatch-1083)
+- `task_execution_service.py` - Unified task execution lifecycle: slot mgmt, activity tracking, sanitization (EXEC-024); #678 reader-race auto-retry (see [task-execution-service.md](feature-flows/task-execution-service.md)); records dispatch-breaker outcomes (see [Circuit Breakers](#circuit-breakers-transport--dispatch-526)); hosts `apply_result` + the 202 dispatch path (see [Fire-and-Forget Dispatch](#fire-and-forget-dispatch-1083))
 - `capacity_manager.py` - Unified capacity facade for admit/release/status — see [Capacity & Backlog](#capacity--backlog-428)
 - `slot_service.py` - Internal to `CapacityManager`: atomic N-ary capacity counter (Redis ZSET, dynamic per-agent TTL) (CAPACITY-001)
 - `backlog_service.py` - Internal to `CapacityManager`: persistent SQLite FIFO overflow store with drain-on-release (BACKLOG-001)
@@ -232,13 +232,13 @@ Channel DB modules: `db/slack_channels.py` (workspace connections, channel-agent

 **Real-time:** WebSocket client at `utils/websocket.js` with auto-reconnect; tracks `_eid` and replays via `last-event-id` — see [Real-time Delivery](#real-time-delivery-reliability-003-306).

-**Top-nav IA — Operations (#1109):** former Health (`/monitoring`), Ops (`/operating-room`), and Executions (`/executions`) nav entries are one **Operations** entry (`views/Operations.vue`, route `/operations`) — a `?tab=`-driven tabbed view: Needs Response · Notifications · Health · Executions · Resolved. Tab content lives in embeddable `components/MonitoringPanel.vue` / `ExecutionsPanel.vue`; tabs toggle by `v-if` so store-owned polling tears down on tab-leave. Health tab is admin-gated (non-admin `?tab=health` deep links coerced to default). NavBar carries one unified badge (pending operator-queue + notifications, critical-pulse). Legacy `/monitoring`, `/executions`, `/operating-room`, `/events` routes redirect (query-preserving) to the matching tab. Per-execution detail route (`/agents/:name/executions/:executionId`) unchanged.
+**Top-nav IA — Operations (#1109):** former Health/Ops/Executions nav entries are one **Operations** entry (`views/Operations.vue`, `/operations`) — a `?tab=`-driven view: Needs Response · Notifications · Health · Executions · Resolved. Tab content in embeddable `components/MonitoringPanel.vue` / `ExecutionsPanel.vue`; tabs toggle by `v-if` so store-owned polling tears down on leave. Health tab admin-gated. NavBar carries one unified badge (pending operator-queue + notifications, critical-pulse). Legacy `/monitoring`, `/executions`, `/operating-room`, `/events` redirect (query-preserving) to the matching tab.

-**Tab overflow — `components/OverflowTabs.vue` (#1114):** reusable "priority+" tab strip used by Agent Detail: a hidden mirror row measures every tab's width (incl. badge) plus a worst-case "More" button; the visible row renders what fits and collapses the trailing remainder into a "More ▾" disclosure menu. Re-measures on container resize and `document.fonts.ready`; defaults to all-inline before first measure (no first-paint snap). When the active tab is overflowed, the trigger reflects it (active underline + dot). Keyboard/touch accessible, dark-mode aware; `v-model` over `activeTab` so `?tab=` deep-linking is unaffected. Generic by design so `Operations.vue` can adopt it.
+**Tab overflow — `components/OverflowTabs.vue` (#1114):** reusable "priority+" tab strip for Agent Detail: a hidden mirror row measures each tab's width plus a worst-case "More" button; the visible row renders what fits and collapses the rest into a "More ▾" menu. Re-measures on resize + `document.fonts.ready`; all-inline before first measure (no first-paint snap). The trigger reflects an overflowed active tab. Keyboard/touch accessible, dark-mode aware; `v-model` over `activeTab` so `?tab=` deep-linking is unaffected.

-**Agent Detail Overview tab (#1107):** `components/OverviewPanel.vue` is the default landing tab — owns "trend over the last few days" while the persistent `AgentHeader` owns "now + cost" (Overview does not re-render the header's live gauges/cost/git/autonomy/circuit surfaces). Sections: About lead, needs-attention count + Operations link (hidden at zero), trend charts, health panel (uptime/latency lines clamped to ≤7d by `agent_health_checks` retention), recent-activity drill-in, footprint chips. Charts: `StackedBarChart.vue` (CSS/flexbox stacked bars — correct-by-construction per-segment tooltips) for executions-by-type; `TrendLineChart.vue` (uPlot, dark-mode-aware) for line series. `InfoPanel.vue` leads with About + "What You Can Ask" and tucks `template.yaml` metadata behind a `<details>` disclosure.
+**Agent Detail Overview tab (#1107):** `components/OverviewPanel.vue` is the default landing tab — owns "trend over the last few days" while the persistent `AgentHeader` owns "now + cost" (no duplicate live gauges). Sections: About lead, needs-attention count + Operations link (hidden at zero), trend charts, health panel (uptime/latency clamped ≤7d by `agent_health_checks` retention), recent-activity drill-in, footprint chips. Charts: `StackedBarChart.vue` (CSS/flexbox) for executions-by-type; `TrendLineChart.vue` (uPlot) for line series. `InfoPanel.vue` leads with About + "What You Can Ask", `template.yaml` metadata behind a `<details>`.

-**Collaboration Dashboard** (`views/AgentCollaboration.vue`, `components/AgentNode.vue`, `stores/collaborations.js`): Vue Flow node graph of agent-to-agent communication. Draggable status-colored nodes, edges animated 3s on collaboration, real-time activity feed, replay mode with time-range filtering over the database-backed activity timeline, localStorage node-position persistence. Detection: the backend chat endpoint accepts an `X-Source-Agent` header and broadcasts `agent_collaboration` WebSocket events; `activity_service` broadcasts `agent_activity` events (`activity_type`: chat_start/chat_end/tool_call/schedule_start/schedule_end/agent_collaboration; `activity_state`: started/completed/failed).
+**Collaboration Dashboard** (`views/AgentCollaboration.vue`, `components/AgentNode.vue`, `stores/collaborations.js`): Vue Flow node graph of agent-to-agent communication — draggable status-colored nodes, edges animated 3s on collaboration, real-time activity feed, replay with time-range filtering, localStorage node positions. Detection: the backend chat endpoint accepts `X-Source-Agent` and broadcasts `agent_collaboration` WS events; `activity_service` broadcasts `agent_activity` (`activity_type`: chat_start/chat_end/tool_call/schedule_start/schedule_end/agent_collaboration; `activity_state`: started/completed/failed).

 ### MCP Server (`src/mcp-server/`)

@@ -336,19 +336,19 @@ Canonical home for each multi-component feature. Endpoint signatures live in [AP

 A Trinity **harness IS an `AgentRuntime`** — the pluggable execution engine inside the agent container. Three ship today: **Claude Code** (default), **Gemini CLI**, and **OpenAI Codex** (#1187). `AGENT_RUNTIME` (container env, set from `template.yaml runtime:` via `crud.py`; also a `trinity.agent-runtime` label) selects one; `runtime_adapter.get_runtime()` is the factory — it **validates** the value against `KNOWN_RUNTIMES` and raises on an unknown one rather than silently defaulting to Claude.
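The validate-then-select behavior described for the runtime factory can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `runtime_adapter.py`: the runtime identifier strings, the `DEFAULT_RUNTIME` constant, and the `resolve_runtime_name` helper are hypothetical; only `AGENT_RUNTIME` and `KNOWN_RUNTIMES` are names taken from the text, and the real factory returns a runtime object rather than a name.

```python
# Hypothetical identifiers: the actual values in KNOWN_RUNTIMES are not given in the text.
KNOWN_RUNTIMES = frozenset({"claude-code", "gemini-cli", "codex"})
DEFAULT_RUNTIME = "claude-code"  # assumption: used only when AGENT_RUNTIME is unset


def resolve_runtime_name(env):
    """Return the runtime name selected by AGENT_RUNTIME in the mapping `env`.

    An unset or empty value falls back to the default. A set but unrecognised
    value raises instead of quietly falling back, so a typo in a template
    surfaces as an error rather than as an agent running the wrong harness.
    """
    raw = env.get("AGENT_RUNTIME")
    if raw is None or raw == "":
        return DEFAULT_RUNTIME
    if raw not in KNOWN_RUNTIMES:
        known = ", ".join(sorted(KNOWN_RUNTIMES))
        raise ValueError(f"unknown AGENT_RUNTIME {raw!r}; expected one of: {known}")
    return raw
```

In a container this would typically be called with `os.environ`; passing the mapping in explicitly keeps the sketch easy to exercise without touching process state.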

-**ABC** (`agent_server/services/runtime_adapter.py`): `execute` (chat), `execute_headless` (stateless task), `configure_mcp`, `is_available`, `get_default_model`, `get_context_window`, plus a non-abstract `capabilities()` classmethod returning a `RuntimeCapabilities` dataclass (`chat_continuity`, `session_tab_resume`, `mcp_support`, `cost_reporting: "native"|"estimated"`) — conservative by default so a new runtime that forgets to override is treated as least-capable. Each runtime is a singleton (`get__runtime()`).
+**ABC** (`agent_server/services/runtime_adapter.py`): `execute` (chat), `execute_headless` (stateless task), `configure_mcp`, `is_available`, `get_default_model`, `get_context_window`, plus a non-abstract `capabilities()` returning a `RuntimeCapabilities` dataclass (`chat_continuity`, `session_tab_resume`, `mcp_support`, `cost_reporting: "native"|"estimated"`) — conservative by default (an un-overridden runtime is least-capable). Each runtime is a singleton (`get__runtime()`).

-**Codex** (`codex_runtime.py`, built independently on the per-runtime primitives — NOT a shared helper, so it never inherits Gemini's blanket `kill_cgroup_orphans()`): `codex exec --json` → JSONL events (`thread.started`→session id, `turn.completed.usage`→tokens where `reasoning_output_tokens` is a SUBSET of `output_tokens`, `item.completed`→response/tool activity, `turn.failed`/`error`); `-o/--output-last-message` is the authoritative result (read-then-delete in `finally`); `codex exec resume ` for chat continuity; cost estimated via `CODEX_PRICING`. Concurrency-safe orphan cleanup via `_drain_bounded` (`kill_cgroup_orphans(extra_pids=…)` preserves sibling executions). Error→HTTP: auth→503, rate→429, runtime-unavailable→**500** (not 503 — avoids the AUTH collision), pipe-drop→**502** (SUB-003 guard).
+**Codex** (`codex_runtime.py`, built independently on the per-runtime primitives — NOT a shared helper, so it never inherits Gemini's blanket `kill_cgroup_orphans()`): `codex exec --json` → JSONL events (`thread.started`→session id, `turn.completed.usage`→tokens with `reasoning_output_tokens` ⊂ `output_tokens`, `item.completed`→activity, `turn.failed`/`error`); `-o/--output-last-message` is the authoritative result (read-then-delete in `finally`); `codex exec resume ` for continuity; cost estimated via `CODEX_PRICING`. Concurrency-safe orphan cleanup via `_drain_bounded` (`kill_cgroup_orphans(extra_pids=…)` preserves siblings). Error→HTTP: auth→503, rate→429, runtime-unavailable→**500** (avoids the AUTH collision), pipe-drop→**502** (SUB-003 guard).

-**Parity surface** (every runtime must wire these — see the [Harness Authoring Guide](harness-authoring-guide.md)): platform **system prompt** (Codex prepends it + mirrors `CLAUDE.md`→`AGENTS.md` at startup), **sandbox** (`_resolve_sandbox_mode`: normal mode → `--sandbox danger-full-access` — Codex's own bubblewrap sandbox can't create a user namespace inside the hardened agent container (`bwrap: No permissions to create a new namespace`), which blocks every shell tool, so it's dropped and the Trinity container is the sole boundary, same posture as Claude/Gemini; **read-only mode** → `--sandbox read-only`, read from `~/.trinity/read-only-config.json` since the Claude PreToolUse hook doesn't apply — a fail-closed read-only enforcement story for Codex is a fast-follow), **guardrails** (`_load_guardrails()`; unmappable Claude tool-names are surfaced in logs, not silently dropped), and **credential sanitization** (`utils/credential_sanitizer` over response + logs). Codex credentials: `OPENAI_API_KEY` from process env else parsed from `/home/developer/.env` (CRED-002; not exported into the agent-server process), injected into the subprocess; `CODEX_HOME` is relocated under `$TMPDIR` (gitignored) so codex state never dirties the repo. Codex agents skip Claude-subscription auto-assign in `crud.py`/`lifecycle.py` (`is_claude_runtime`). Backend reads nothing runtime-specific in MVP: it still infers AUTH from HTTP 503; `ExecutionMetadata.status`/`error_code` ship unused (fast-follow). The **Session tab** is gated off for runtimes lacking `session_tab_resume` (one backend constant `RUNTIMES_WITHOUT_SESSION_TAB_RESUME` in `sessions.py` runs a stateless turn; frontend hides the tab). MCP: `_configure_codex_mcp_servers`/`_inject_codex_mcp` write `$CODEX_HOME/config.toml` directly, the Trinity HTTP MCP referencing the token via `bearer_token_env_var` (never persisted as a literal). The platform prompt is **runtime-aware** (`platform_prompt_service.get_platform_system_prompt(runtime=…)`/`compose_system_prompt(runtime=…)`, threaded from `routers/chat.py` + `task_execution_service.py` via the `trinity.agent-runtime` label resolved best-effort by `docker_service.get_agent_runtime`): for Codex it strips the Claude-only `mcp__trinity__` tool-name prefix (which otherwise made Codex emit `unknown MCP server`) and references the auto-discovered `trinity` tools by bare name; Claude/Gemini/unknown keep the canonical naming. Frontend: `RuntimeBadge.vue` codex case, `AgentDetail.vue` default model + terminal map, `AgentTerminal.vue` `codex` mode.
+**Parity surface** — every runtime must wire these (Codex specifics in [codex-runtime.md](feature-flows/codex-runtime.md); contract in the [Harness Authoring Guide](harness-authoring-guide.md)): platform **system prompt**, **sandbox** (`_resolve_sandbox_mode`: normal → `--sandbox danger-full-access` since Codex's bubblewrap can't namespace inside the hardened container; read-only → `--sandbox read-only` from `~/.trinity/read-only-config.json` — fail-closed Codex read-only is a fast-follow), **guardrails** (`_load_guardrails()`; unmapped Claude tool-names logged, not dropped), **credential sanitization** (`utils/credential_sanitizer`). The **Session tab** is gated off for runtimes lacking `session_tab_resume` (backend constant `RUNTIMES_WITHOUT_SESSION_TAB_RESUME` in `sessions.py` → stateless turn; frontend hides the tab). The platform prompt is **runtime-aware** (`platform_prompt_service.get_platform_system_prompt(runtime=…)`/`compose_system_prompt(runtime=…)`, threaded from `routers/chat.py` + `task_execution_service.py` via the `trinity.agent-runtime` label): for Codex it strips the Claude-only `mcp__trinity__` prefix (else `unknown MCP server`) and uses bare `trinity` tool names. Backend reads nothing runtime-specific in MVP (infers AUTH from HTTP 503; `ExecutionMetadata.status`/`error_code` unused — fast-follow). Codex agents skip Claude-subscription auto-assign (`is_claude_runtime`).

 ### Capacity & Backlog (#428)

-`CapacityManager` (CAPACITY-CONSOLIDATE) is the single public API for admit/release/status across `/chat` (`max_concurrent=max_parallel_tasks`, `queue_in_memory` policy) and `/task` (`queue_persistent` policy). It composes two private internals — `slot_service.py` (atomic N-ary counter, Redis ZSET `agent:slots:{name}`, dynamic per-agent TTL) and `backlog_service.py` (SQLite FIFO over `schedule_executions.status='queued'`, drain-on-release) — and owns the in-memory overflow store (Redis LIST, depth 3).
+`CapacityManager` (CAPACITY-CONSOLIDATE) is the single public API for admit/release/status across `/chat` (`max_concurrent=max_parallel_tasks`, `queue_in_memory`) and `/task` (`queue_persistent`). It composes two private internals — `slot_service.py` (atomic N-ary counter, Redis ZSET `agent:slots:{name}`, dynamic per-agent TTL) and `backlog_service.py` (SQLite FIFO over `schedule_executions.status='queued'`, drain-on-release) — and owns the in-memory overflow store (Redis LIST, depth 3). See [capacity-management.md](feature-flows/capacity-management.md).

-`run_maintenance()` every 60s: expires stale queued tasks (>24h), drains orphans after restart, runs the #526 breaker-aware backstop (below), and on each successful sweep writes a unix-timestamp heartbeat to Redis `canary:drain_tick_at` (read by canary B-02 to distinguish stuck drains from "drain just hasn't run yet"; written at sweep END so a mid-sweep crash leaves the cursor stale and trips the check).
+`run_maintenance()` every 60s: expires stale queued tasks (>24h), drains orphans after restart, runs the #526 breaker-aware backstop, and on each successful sweep writes a unix-timestamp heartbeat to Redis `canary:drain_tick_at` (read by canary B-02; written at sweep END so a mid-sweep crash leaves the cursor stale and trips the check).

-**Status-as-projection (#1082):** `schedule_executions.status` is a CAS-guarded *projection* of an execution's terminal event — the agent process registry is the runtime authority for "is running"; no backend reader treats `status='running'` as the standalone authority (cleanup-watchdog readers use it only as a candidate filter, then confirm against the agent registry / Redis before any destructive write). In the **backend `db/schedules.py` module**, every `update(schedule_executions)` that writes `status` carries a status precondition in its `WHERE`, including `update_execution_to_queued` (the overflow re-queue), whose `AND status == RUNNING` guard (#1082) closes the one gap where a stale re-queue could resurrect a terminal row into `queued` (E-02 phantom-reversal class). A static + behavioural regression guard in `tests/unit/test_schedule_status_observability.py` keeps the invariant (the static guard is file-scoped to `db/schedules.py` — see the test's blind-spots note). **Not yet covered (tracked #1082 follow-up):** the standalone scheduler process (`src/scheduler/`) writes the same `trinity.db` with raw-SQL status writers (`scheduler/database.py::update_execution_status`, `::schedule_retry`) that are *not* CAS-guarded; the cron path compensates with a non-atomic read-then-check (`scheduler/service.py` SCHED-ASYNC-001) but the retry-failure path does not, so a late backend `SUCCESS` can still be clobbered there. See [status-as-projection.md](feature-flows/status-as-projection.md).
+**Status-as-projection (#1082):** `schedule_executions.status` is a CAS-guarded *projection* of an execution's terminal event — the agent process registry is the runtime authority for "is running"; no backend reader treats `status='running'` as standalone authority (cleanup-watchdog readers use it as a candidate filter, then confirm against the registry/Redis before any destructive write). In `db/schedules.py` every `update(schedule_executions)` writing `status` carries a status precondition in its `WHERE` (incl. `update_execution_to_queued`'s `AND status == RUNNING` guard, closing the E-02 phantom-reversal gap); kept by `tests/unit/test_schedule_status_observability.py`. **Not yet covered (#1082 follow-up):** the standalone scheduler (`src/scheduler/`) writes the same DB with raw-SQL, non-CAS status writers — a late backend `SUCCESS` can still be clobbered on the retry-failure path. See [status-as-projection.md](feature-flows/status-as-projection.md).

 ### Circuit Breakers (transport + dispatch, #526)

@@ -358,10 +358,10 @@ Two independent per-agent breakers, separate Redis namespaces and separate Lua,

 **Dispatch breaker** (`dispatch_breaker.py`, key `agent:dispatch:{name}`, RELIABILITY-007): producer-side, fed *only* by execution outcomes in `task_execution_service` — counts **AUTH only** (`error_code == AUTH`, agent answers HTTP 503), NOT TIMEOUT/AGENT_ERROR (D10). Consecutive-failure machine `closed → open → half-open(probe) → closed`; default threshold 3, base cooldown 30s, exponential backoff (D9). `record_outcome(error_code)` returns the `(prior, new)` transition; the **caller** backgrounds the drain on `→open` (no `capacity`/`db` import in the breaker → no circular dep, D3). Never raises. `record_failure("missed_heartbeat")` is the #307 seam. `record_success` is a no-op write (Lua early-return) when already closed with zero failures, so healthy agents don't churn Redis. Gating: per-agent `circuit_breaker_enabled` (default OFF) AND global `DISPATCH_BREAKER_ENABLED` must both be on.

-**Execution-path flow:**
-- `CapacityManager.acquire(...)` gates on the dispatch breaker at the TOP (before the overflow branch). A deny (open within cooldown, or a sibling holds the probe) raises `CircuitOpen` before any slot/overflow work — a doomed task is never enqueued (**no-enqueue invariant**, D2). When open and the call holds the half-open **probe**, the probe is admitted ONLY into a free slot — if slots are full it fast-fails rather than enqueuing, so the probe always leads to a recorded dispatch instead of a verdict-less backlog row that would stall backoff (F1).
-- `task_execution_service` is the single execution path, so it records every outcome: `record_outcome(None)` at the success terminal (resets), `record_outcome(AUTH)` at the HTTP-error terminal (counts). On `→open` it backgrounds `_fail_backlog_and_audit` via `_spawn_bg` (holds a strong task ref so the fire-and-forget drain can't be GC'd mid-flight): `db.fail_queued_for_agent` → FAILED + clear in-memory queue + audit. Catches `CircuitOpen` from `acquire` → `TaskExecutionResult(CIRCUIT_OPEN)` + FAILED row. The step-3b pre-dispatch check also fast-fails on a non-probe-consuming `state == "open"` read, but ONLY on the backlog-drain path (`slot_already_held and not dispatch_gate_checked`) so it never blocks a probe an upstream `acquire` already admitted.
-- **Backstop**: if the inline drain task is lost or its DB write throws, the 60s `run_maintenance` sweep (`_backstop_open_breaker_backlog`) re-fails the queued backlog for any still-open breaker (~60s worst case, not the 24h generic expiry; bounded to agents with queued rows).
+**Execution-path flow** (details in [dispatch-circuit-breaker.md](feature-flows/dispatch-circuit-breaker.md)):
+- `CapacityManager.acquire(...)` gates the breaker at the TOP (before overflow). A deny raises `CircuitOpen` before any slot/overflow work — a doomed task is never enqueued (**no-enqueue invariant**, D2). A half-open **probe** is admitted ONLY into a free slot (full → fast-fail, never a verdict-less backlog row that stalls backoff, F1).
+- `task_execution_service` (single execution path) records every outcome: `record_outcome(None)` at success (resets), `record_outcome(AUTH)` at the HTTP-error terminal (counts). On `→open` it backgrounds `_fail_backlog_and_audit` via `_spawn_bg` (`db.fail_queued_for_agent` → FAILED + clear queue + audit); catches `CircuitOpen` → `TaskExecutionResult(CIRCUIT_OPEN)` + FAILED row. The step-3b pre-dispatch check fast-fails on `state == "open"` only on the backlog-drain path (`slot_already_held and not dispatch_gate_checked`), never blocking an already-admitted probe.
+- **Backstop**: if the inline drain is lost, the 60s `run_maintenance` sweep (`_backstop_open_breaker_backlog`) re-fails queued backlog for any still-open breaker (~60s, not the 24h generic expiry).

 API: `GET`/`PUT /api/agents/{name}/circuit-breaker` (owner-only toggle), `POST .../circuit-breaker/reset` (admin-only; resets BOTH breakers) — see API Endpoints.

@@ -381,9 +381,9 @@ Removes backend-thread pinning for autonomous turns by construction: an eligible

 Additive push-heartbeat layer; the 30s `monitoring_service` loop (lifespan-resumed, default-off, #1121) stays authoritative for aggregate status when enabled.

-**Agent side** (`agent_server/heartbeat.py`): 5s loop, gated on both `TRINITY_BACKEND_URL` and `TRINITY_MCP_API_KEY` being present. POSTs `{memory_mb, active_executions, uptime_s}` to `POST /api/agents/{name}/heartbeat`, authenticated with the agent's own agent-scoped MCP key (Option B — least privilege, no master secret injected). `memory_mb` from `/proc/self/status` VmRSS (no psutil). Sleeps-first and swallows **all** exceptions — a failed beat is silent by design; the backend watch loop acts on absence.
+**Agent side** (`agent_server/heartbeat.py`): 5s loop, gated on `TRINITY_BACKEND_URL` + `TRINITY_MCP_API_KEY`. POSTs `{memory_mb, active_executions, uptime_s}` to `POST /api/agents/{name}/heartbeat`, authenticated with the agent's own agent-scoped MCP key (least privilege, no master secret). `memory_mb` from `/proc/self/status` VmRSS (no psutil). Sleeps-first and swallows **all** exceptions — a failed beat is silent by design; the backend watch loop acts on absence.
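As a rough illustration of the agent-side payload described above, the sketch below derives `memory_mb` from the `VmRSS` line of `/proc/self/status` text without psutil. The helper names, the one-decimal rounding, and the kB/1024 conversion are assumptions for illustration; only the three payload field names and the `VmRSS` source come from the text.

```python
def parse_vmrss_mb(status_text):
    """Return resident memory in MB from /proc/<pid>/status text, or None.

    The kernel reports the field as e.g. "VmRSS:     204800 kB".
    Dividing kB by 1024 and rounding to one decimal is an assumption here.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            parts = line.split()
            if len(parts) >= 2 and parts[1].isdigit():
                return round(int(parts[1]) / 1024, 1)
    return None


def build_heartbeat_payload(status_text, active_executions, uptime_s):
    """Assemble the JSON-serialisable body with the three documented fields."""
    return {
        "memory_mb": parse_vmrss_mb(status_text),
        "active_executions": active_executions,
        "uptime_s": uptime_s,
    }
```

On Linux the `status_text` argument would come from reading `/proc/self/status`; taking it as a parameter keeps the parsing testable on any platform.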

-**Backend side** (`heartbeat_service.py`): owns all Redis heartbeat keys — `record_heartbeat` (SETEX 15s + persistent `seen` marker), `read_heartbeat`, `heartbeat_status`/`heartbeat_status_bulk` (one pipelined round-trip, D4), `authorize_heartbeat` (403 unless the key is agent-scoped and its `agent_name` matches the path; user/system/null keys rejected; validated with `track_usage=False` so a 5s beat doesn't amplify `usage_count`). Keys:
+**Backend side** (`heartbeat_service.py`): owns all Redis heartbeat keys — `record_heartbeat` (SETEX 15s + persistent `seen` marker), `read_heartbeat`, `heartbeat_status`/`heartbeat_status_bulk` (one pipelined round-trip, D4), `authorize_heartbeat` (403 unless the key is agent-scoped and its `agent_name` matches the path; user/system/null rejected; validated `track_usage=False`). Keys:

 ```
 agent:heartbeat:{name} → STRING, 15s TTL. JSON {ts, memory_mb, active_executions, uptime_s}
@@ -392,35 +392,35 @@ agent:heartbeat:seen:{name} → STRING "1", no TTL. Absent ⇒ unsupported (ol
 agent:heartbeat:misses:{name} → STRING(int), ~60s TTL. Consecutive-miss counter; never persisted to SQLite
 ```

-**Watch loop**: 5s (staggered +10s), batched Redis pipeline over `seen`-marked agents, 3-miss guard. Fires a soft, cooldown-debounced operator alert (via the existing `monitoring_alerts` path) **only on the alive→stale transition**, and a recovery notification when beats resume (only after a prior downgrade) — one alert per loss episode. Writes no health-check rows. `clear_heartbeat(name)` deletes all three keys; called best-effort on agent delete and rename (old name) — `seen` has no TTL, so without this it would leak one permanent key per agent and orphan renamed names. The five `heartbeat_*` fields surface on `GET /api/monitoring/status` via a single batched Redis read.
+**Watch loop**: 5s (staggered +10s), batched Redis pipeline over `seen`-marked agents, 3-miss guard. Fires a soft, cooldown-debounced operator alert (`monitoring_alerts` path) **only on the alive→stale transition**, plus a recovery notification when beats resume after a prior downgrade — one alert per loss episode. Writes no health-check rows. `clear_heartbeat(name)` deletes all three keys, best-effort on agent delete and rename (old name) — `seen` has no TTL, so otherwise it leaks one permanent key per agent. The five `heartbeat_*` fields surface on `GET /api/monitoring/status` via one batched Redis read.

 ### Idempotency (RELIABILITY-006, #525)

-Trigger-boundary dedup — policy in Architectural Invariant #18, table DDL under `idempotency_keys`. `services/idempotency_service.py` (key derivation + `begin`/`complete`/`fail`) over `db/idempotency.py`. The `(scope, key)` PRIMARY KEY is the atomic claim: `claim()` INSERTs an `in_flight` row; a concurrent loser catches `IntegrityError` and reads the surviving row — cross-process safe across uvicorn workers and the standalone scheduler (shared SQLite file). Lifecycle: `claim` → (`attach_execution`) → `complete` (stores `response_snapshot` for replay) or `release` (deletes the in_flight row so a failed attempt can retry; never deletes a `completed` row). Rows older than 24h are treated as expired and re-claimed; the cleanup service purges them (`idempotency_purge_expired`). Duplicates within 24h short-circuit with the original result + `X-Idempotent-Replay: true`; an in-flight duplicate returns 409. Fail-open — a key never blocks a real execution.
+Trigger-boundary dedup — policy in Architectural Invariant #18, table DDL under `idempotency_keys`, details in [idempotency-keys.md](feature-flows/idempotency-keys.md). `services/idempotency_service.py` (`begin`/`complete`/`fail`) over `db/idempotency.py`. The `(scope, key)` PRIMARY KEY is the atomic claim: `claim()` INSERTs an `in_flight` row; a concurrent loser catches `IntegrityError` and reads the surviving row (cross-process safe over the shared SQLite file). Lifecycle: `claim` → (`attach_execution`) → `complete` (stores `response_snapshot` for replay) or `release` (deletes the in_flight row so retry is possible; never deletes a `completed` row). Rows >24h expire and re-claim (cleanup purges via `idempotency_purge_expired`). Duplicates within 24h short-circuit with the original result + `X-Idempotent-Replay: true`; an in-flight duplicate returns 409. Fail-open.

 ### Subscription Token Rotation via Hot-Reload (#1089)

-Rotating an agent's subscription token used to recreate the container, making "rotate a credential" and "kill every in-flight turn" the same operation (#1037). Token rotation now hot-reloads the running container; recreate is reserved for image/template/auth-**mode** changes (TARGET_ARCHITECTURE §Agent Runtime). The agent server authenticates Claude purely from `CLAUDE_CODE_OAUTH_TOKEN` (no `.credentials.json`) and is a single uvicorn worker, so mutating its process env makes the **next** subprocess use the new token while in-flight subprocesses finish on the old one.
+Rotating an agent's subscription token used to recreate the container, making "rotate a credential" and "kill every in-flight turn" the same operation (#1037). Rotation now hot-reloads the running container; recreate is reserved for image/template/auth-**mode** changes. The agent server authenticates Claude purely from `CLAUDE_CODE_OAUTH_TOKEN` and is a single uvicorn worker, so mutating its process env makes the **next** subprocess use the new token while in-flight ones finish on the old.

-Backend orchestration in `services/subscription_auto_switch.py`: `_hot_reload_subscription_token(agent_name)` POSTs the agent's current DB token to the agent-server `POST /api/credentials/reload-token`, falling back to `_restart_agent` on a 404 (old base image), transport failure, or missing token (`no_container`/`not_running` short-circuit otherwise). Three producer paths converted, all under the #799 `agent_switch_lock`: **auto-switch** (`_perform_auto_switch`, SUB-003), **manual sub→sub reassignment** (`PUT /api/subscriptions/agents/{name}` — auth-mode changes none/api-key→sub still recreate), and **key rollover** (`reload_subscription_for_all_agents(sub_id)` fans a best-effort reload across every running agent on a re-registered subscription). Durable override (`/var/lib/trinity/oauth-token`) + `startup.sh` read make a rotation survive a plain restart — see the agent-server [Durable subscription-token override](#agent-containers) note. Agent-server endpoint mirroring follows Invariant #5.
+Backend orchestration in `services/subscription_auto_switch.py`: `_hot_reload_subscription_token(agent_name)` POSTs the DB token to the agent-server `POST /api/credentials/reload-token`, falling back to `_restart_agent` on 404/transport failure/missing token. Three producer paths converted, all under the #799 `agent_switch_lock`: **auto-switch** (`_perform_auto_switch`, SUB-003), **manual sub→sub reassignment** (`PUT /api/subscriptions/agents/{name}`; auth-mode changes still recreate), and **key rollover** (`reload_subscription_for_all_agents(sub_id)` fans a best-effort reload across running agents). Durable override (`/var/lib/trinity/oauth-token`) + `startup.sh` read make a rotation survive a plain restart. Agent-server mirroring follows Invariant #5.

 ### Real-time Delivery (RELIABILITY-003, #306)

-**Transport** (`event_bus.py`): Redis Streams. `ConnectionManager`/`FilteredWebSocketManager` are thin shims that `XADD` to the MAXLEN-trimmed `trinity:events` stream; one `StreamDispatcher` per backend process runs `XREAD BLOCK` and fans out to registered clients, evicting a client after 3 consecutive delivery failures. New broadcast sites keep calling `manager.broadcast(...)` / `filtered_manager.broadcast_filtered(...)` — never publish to the stream directly (Invariant #10).
+**Transport** (`event_bus.py`, details in [websocket-event-bus.md](feature-flows/websocket-event-bus.md)): Redis Streams. `ConnectionManager`/`FilteredWebSocketManager` are thin shims that `XADD` to the MAXLEN-trimmed `trinity:events` stream; one `StreamDispatcher` per backend process runs `XREAD BLOCK` and fans out, evicting a client after 3 consecutive delivery failures. New broadcast sites keep calling `manager.broadcast(...)` / `filtered_manager.broadcast_filtered(...)` — never publish to the stream directly (Invariant #10). -**Reconnect replay**: `/ws` and `/ws/events` accept `?last-event-id=`, regex-gated (`^\d+-\d+$`) by `validate_last_event_id()` before reaching `XRANGE`; malformed input is ignored. Catchup capped at `REPLAY_GAP_LIMIT=5000` — larger gaps return `{"type": "resync_required", "reason": "gap_too_large"}`. Authorization (`accessible_agents` for `/ws/events`) is re-applied on replay, not just live fan-out. The frontend tracks `_eid` on every message, appends `&last-event-id=` on reconnect, and on `resync_required` clears the cursor and refetches authoritative state via REST. +**Reconnect replay**: `/ws` and `/ws/events` accept `?last-event-id=`, regex-gated (`^\d+-\d+$`) by `validate_last_event_id()` before `XRANGE`. Catchup capped at `REPLAY_GAP_LIMIT=5000` — larger gaps return `{"type": "resync_required", "reason": "gap_too_large"}`. Authorization (`accessible_agents` for `/ws/events`) is re-applied on replay. The frontend tracks `_eid` per message, appends `&last-event-id=` on reconnect, and on `resync_required` clears the cursor and refetches via REST. -**WebSocket auth** (C-002, #550): `/ws` uses single-use opaque tickets instead of a JWT in the URL: authenticated `POST /api/ws/ticket` mints a 32-byte urlsafe ticket (Redis, 30s TTL); the client connects with `/ws?ticket=...`; backend atomically `GETDEL`s (single-use) and only then accepts. 
Closes the JWT-leak surface (nginx logs, browser history, proxies); CSWSH mitigated because minting requires the JWT in an `Authorization` header, which CORS blocks cross-origin. `/ws/events` still accepts `?token=trinity_mcp_*` for documented external scripts — MCP keys are scoped, named, revocable. `mint_ticket` takes optional `ttl_seconds` (default 30s, ceiling 600s); VoIP mints call-bound tickets (`scope="voip:{call_id}"`, 180s) since Twilio can't send a JWT and PSTN dial+ring exceeds 30s. Implementation: `services/ws_ticket_service.py` + `routers/ws_tickets.py`. +**WebSocket auth** (C-002, #550): `/ws` uses single-use opaque tickets, not a JWT in the URL: `POST /api/ws/ticket` mints a 32-byte urlsafe ticket (Redis, 30s TTL); client connects `/ws?ticket=...`; backend atomically `GETDEL`s then accepts. Closes the JWT-leak surface (nginx logs, history, proxies); CSWSH mitigated because minting needs the JWT in an `Authorization` header (CORS-blocked cross-origin). `/ws/events` still accepts `?token=trinity_mcp_*` for external scripts (scoped, revocable). `mint_ticket` optional `ttl_seconds` (default 30s, ceiling 600s); VoIP mints call-bound tickets (`scope="voip:{call_id}"`, 180s) since PSTN dial+ring exceeds 30s. Impl: `services/ws_ticket_service.py` + `routers/ws_tickets.py`. ### Soft Delete, Retention & Recovery (#834, #772) -**Agent soft-delete (Phase 1a):** `DELETE /api/agents/{name}` sets `agent_ownership.deleted_at` instead of hard-deleting; child rows are preserved (recoverable until purge). `is_agent_name_reserved()` sees soft-deleted rows, so the name can't be reused before purge. The scheduler's `list_all_enabled_schedules()` joins `agent_ownership` and filters `deleted_at IS NULL`, so a soft-deleted agent's schedules stop firing immediately. +**Agent soft-delete (Phase 1a):** `DELETE /api/agents/{name}` sets `agent_ownership.deleted_at` (child rows preserved, recoverable until purge). 
`is_agent_name_reserved()` sees soft-deleted rows, so the name can't be reused before purge. The scheduler's `list_all_enabled_schedules()` filters `deleted_at IS NULL`, so schedules stop firing immediately. -**Schedule soft-delete (Phase 1b):** `DELETE .../schedules/{id}` sets `agent_schedules.deleted_at`; the row and its `schedule_executions` are preserved. All schedule read paths — including cron firing in both the backend and the standalone scheduler process — filter `deleted_at IS NULL`. `delete_schedule()` is idempotent on an already-soft-deleted row. +**Schedule soft-delete (Phase 1b):** `DELETE .../schedules/{id}` sets `agent_schedules.deleted_at` (row + `schedule_executions` preserved). All read paths — incl. cron firing in backend and standalone scheduler — filter `deleted_at IS NULL`. `delete_schedule()` is idempotent on an already-soft-deleted row. -**Admin recovery (Phase 1c):** metadata-only (`deleted_at → NULL`) via the `/api/admin/soft-deleted/*` endpoints. Agent recovery does NOT recreate the container (`needs_container_recreate=true`; operator runs `POST /api/agents/{name}/start`); schedule recovery rejoins the firing list next poll if enabled. Audit events `agent_lifecycle:recover` / `schedule_recover`. Response models `SoftDeletedAgent`/`SoftDeletedSchedule` in `models.py`. +**Admin recovery (Phase 1c):** metadata-only (`deleted_at → NULL`) via `/api/admin/soft-deleted/*`. Agent recovery does NOT recreate the container (`needs_container_recreate=true`; operator runs `POST /api/agents/{name}/start`); schedule recovery rejoins the firing list next poll if enabled. Audit `agent_lifecycle:recover` / `schedule_recover`. Models `SoftDeletedAgent`/`SoftDeletedSchedule`. 
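The soft-delete and recovery rules above can be illustrated with a minimal sketch against an in-memory SQLite database. The table and column names (`agent_ownership`, `agent_schedules`, `deleted_at`) come from the text; the reduced schemas, the function names, and the exact join are assumptions made for illustration, not the project's actual code.

```python
import sqlite3
from datetime import datetime, timezone


def make_db() -> sqlite3.Connection:
    """Create a toy schema with only the columns this sketch needs (assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE agent_ownership (
            agent_name TEXT PRIMARY KEY,
            deleted_at TEXT
        );
        CREATE TABLE agent_schedules (
            id INTEGER PRIMARY KEY,
            agent_name TEXT NOT NULL,
            enabled INTEGER NOT NULL,
            deleted_at TEXT
        );
        """
    )
    return conn


def soft_delete_agent(conn: sqlite3.Connection, name: str) -> None:
    """Mark the agent deleted; child rows are left untouched."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE agent_ownership SET deleted_at = ? "
        "WHERE agent_name = ? AND deleted_at IS NULL",
        (now, name),
    )


def recover_agent(conn: sqlite3.Connection, name: str) -> None:
    """Metadata-only recovery: deleted_at goes back to NULL."""
    conn.execute(
        "UPDATE agent_ownership SET deleted_at = NULL WHERE agent_name = ?",
        (name,),
    )


def is_name_reserved(conn: sqlite3.Connection, name: str) -> bool:
    """Soft-deleted rows still reserve the name until they are purged."""
    row = conn.execute(
        "SELECT 1 FROM agent_ownership WHERE agent_name = ?", (name,)
    ).fetchone()
    return row is not None


def list_enabled_schedules(conn: sqlite3.Connection) -> list:
    """Return ids of schedules that may fire: enabled and not soft-deleted."""
    rows = conn.execute(
        """
        SELECT s.id
        FROM agent_schedules s
        JOIN agent_ownership o ON o.agent_name = s.agent_name
        WHERE s.enabled = 1
          AND s.deleted_at IS NULL
          AND o.deleted_at IS NULL
        ORDER BY s.id
        """
    ).fetchall()
    return [r[0] for r in rows]
```

Because the firing query filters on `deleted_at IS NULL`, a soft delete takes effect on the next poll without touching the schedule rows, and recovery is a single `UPDATE`.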
-**Cleanup Service sweeps** (every 5 min): #772 retention — nulls `schedule_executions.execution_log` past `execution_log_retention_days` (default 30), DELETEs terminal `schedule_executions` past `execution_row_retention_days` (default 90), DELETEs `agent_health_checks` past `health_check_retention_days` (default 7). #834 purges — hard-deletes `agent_ownership` rows soft-deleted longer than `agent_soft_delete_retention_days` (default 180, `0`=disabled), cascading children via the #816 `purge_agent_ownership`/`cascade_delete` primitive; hard-deletes `agent_schedules` rows past `schedule_soft_delete_retention_days` (default 30, `0`=disabled) via `purge_schedule()`, which cascades the row's `schedule_executions` (no #816 chain — schedules have no registered children). Each sweep capped at 5000 rows/cycle (first post-deploy backfill spans hours, not minutes); `0` disables a sweep; `PRAGMA wal_checkpoint(TRUNCATE)` when any sweep reclaims rows. Also purges expired `idempotency_keys`. **Startup hook (#740):** one-shot `mark_orphan_loops_interrupted()` flips `agent_loops` rows left `queued`/`running` after a restart to `interrupted` (`stop_reason="interrupted"`); loops do not auto-resume. +**Cleanup Service sweeps** (every 5 min): #772 retention — nulls `schedule_executions.execution_log` past `execution_log_retention_days` (default 30), DELETEs terminal `schedule_executions` past `execution_row_retention_days` (default 90), DELETEs `agent_health_checks` past `health_check_retention_days` (default 7). #834 purges — hard-deletes `agent_ownership` rows soft-deleted past `agent_soft_delete_retention_days` (default 180, `0`=off), cascading children via the #816 `purge_agent_ownership`/`cascade_delete` primitive; hard-deletes `agent_schedules` past `schedule_soft_delete_retention_days` (default 30, `0`=off) via `purge_schedule()`, cascading its `schedule_executions`. 
Each sweep capped at 5000 rows/cycle; `0` disables; `PRAGMA wal_checkpoint(TRUNCATE)` when any sweep reclaims rows. Also purges expired `idempotency_keys`. **Startup hook (#740):** one-shot `mark_orphan_loops_interrupted()` flips `agent_loops` rows left `queued`/`running` after a restart to `interrupted` (`stop_reason="interrupted"`); no auto-resume. ### Sequential Agent Loops (#740, UI #1106) @@ -432,45 +432,45 @@ Bounded sequential task execution against one agent. Runner is an in-process `as `--resume`-default chat surface: each turn reattaches via `claude --print --resume `, preserving tool-result memory, mid-skill state, and reasoning state across turns. Strictly parallel to `chat_sessions`/`chat_messages` — no FK, no shared state; separate router (`routers/sessions.py`), store (`stores/sessions.js`), component (`SessionPanel.vue`). `cached_claude_session_id` is the load-bearing field. -**Unified Chat tab (#1112):** Agent Detail shows a single **Chat** tab (no separate Session tab) carrying a **Session-mode toggle**, default ON. ON renders `SessionPanel.vue` (`--resume` continuity); OFF renders the legacy stateless `ChatPanel.vue`. The toggle swaps the surface in-place (`v-if` on `effectiveChatMode`); the choice persists per-user in `localStorage['trinity.chatMode']`. Session mode is available only when `sessionsStore.sessionTabEnabled` AND the runtime has `--resume` (not Codex) — otherwise the toggle is hidden and the tab falls back to legacy (never zero chat surfaces). Routing: legacy `?tab=session` aliases to the `chat` tab (`TAB_ALIASES`, `AgentDetail.vue`) and hints session mode; ExecutionDetail "continue as chat" (`?tab=chat&resumeSessionId=…`) forces legacy for that landing via a transient, non-persisted `routeForcedMode` so the legacy `ChatPanel` owns the resume without rewriting the saved preference. +**Unified Chat tab (#1112):** Agent Detail shows a single **Chat** tab carrying a **Session-mode toggle**, default ON. 
ON → `SessionPanel.vue` (`--resume` continuity); OFF → legacy stateless `ChatPanel.vue`. The toggle swaps in-place (`v-if` on `effectiveChatMode`); choice persists per-user in `localStorage['trinity.chatMode']`. Session mode available only when `sessionsStore.sessionTabEnabled` AND the runtime has `--resume` (not Codex) — else the toggle hides and the tab falls back to legacy (never zero chat surfaces). Routing: legacy `?tab=session` aliases to `chat` (`TAB_ALIASES`) and hints session mode; ExecutionDetail "continue as chat" (`?tab=chat&resumeSessionId=…`) forces legacy via a transient `routeForcedMode` without rewriting the saved preference. -**Turn semantics** (`POST .../sessions/{id}/message`, synchronous): always passes `persist_session=True` to the agent. Resume-failure fallback: if the cached UUID's JSONL is missing, clear the cache, increment `consecutive_resume_failures`, retry once cold (counter reset on next success). Two Redis gates, both with dynamic TTL = `db.get_execution_timeout(agent) + 30s` capped at 7230s: (1) per-`(agent, claude_uuid)` resume lock `session_lock:{agent}:{uuid}` (async wait, 30s ceiling, 429 on contention) serialises concurrent `--resume` calls to prevent JSONL corruption; keyed `session_lock:cold:{session_id}` for cold turns (#779); (2) per-session in-flight sentinel `session_inflight:{session_id}` drives `turn_in_progress` on the GET endpoint so the UI can reattach on KeepAlive activation (#759). +**Turn semantics** (`POST .../sessions/{id}/message`, synchronous; details in [session-tab.md](feature-flows/session-tab.md)): always passes `persist_session=True`. Resume-failure fallback: missing cached-UUID JSONL → clear cache, increment `consecutive_resume_failures`, retry once cold (reset on next success). 
Two Redis gates, dynamic TTL = `get_execution_timeout(agent) + 30s` capped 7230s: (1) resume lock `session_lock:{agent}:{uuid}` (`session_lock:cold:{session_id}` for cold, #779) serialises `--resume` to prevent JSONL corruption (429 on contention); (2) in-flight sentinel `session_inflight:{session_id}` drives `turn_in_progress` for UI reattach (#759). **Access & gating:** all endpoints per-user scoped (owners cannot see other users' sessions) and return 404 — not 403 — on mismatch to avoid leaking session-id existence. All return 404 when `is_session_tab_enabled()` is false; flag `system_settings.session_tab_enabled` (or `SESSION_TAB_ENABLED` env), default ON. -**JSONL reaping** (`session_cleanup_service.py`): default 6h cycle diffs each running agent's `~/.claude/projects/-home-developer/.jsonl` set against `agent_sessions.cached_claude_session_id` and deletes JSONLs outside the keep set with mtime older than `min_age_seconds` (default 1h race guard). Synchronous best-effort `reap_jsonl()` also fires on user-initiated reset/delete. Uses `execute_command_in_container` (no agent-server endpoint). Headless-task JSONLs (timeout > 600s auto-enables persistence for the #678 stdout-race recovery in `agent_server/services/jsonl_recovery.py`) aren't in `agent_sessions`, so they fall out of the keep set and the same sweep removes them. +**JSONL reaping** (`session_cleanup_service.py`): default 6h cycle diffs each running agent's `~/.claude/projects/-home-developer/.jsonl` set against `agent_sessions.cached_claude_session_id`, deleting JSONLs outside the keep set with mtime older than `min_age_seconds` (default 1h race guard). Synchronous `reap_jsonl()` also fires on reset/delete. Uses `execute_command_in_container` (no agent-server endpoint). Headless-task JSONLs (timeout > 600s auto-enables persistence for the #678 stdout-race recovery in `agent_server/services/jsonl_recovery.py`) aren't in `agent_sessions`, so they fall out of the keep set and the same sweep removes them. 
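The keep-set rule in the JSONL reaping paragraph can be sketched as a pure function. This is illustrative only: `select_reapable` is a hypothetical name, and the sketch assumes each JSONL file is named after its Claude session id, which the text does not state explicitly.

```python
def select_reapable(
    jsonl_mtimes: dict,
    keep_ids: set,
    now: float,
    min_age_seconds: float = 3600.0,
) -> list:
    """Pick JSONL files that are safe to delete.

    jsonl_mtimes maps file name -> mtime (epoch seconds).
    keep_ids holds the cached_claude_session_id values still referenced.
    Assumption: the file stem equals the session id.
    """
    reap = []
    for filename, mtime in jsonl_mtimes.items():
        session_id = filename.removesuffix(".jsonl")
        if session_id in keep_ids:
            continue  # still referenced by agent_sessions
        if now - mtime <= min_age_seconds:
            continue  # too fresh: race guard for a turn that is still being written
        reap.append(filename)
    return sorted(reap)
```

Headless-task JSONLs never appear in `keep_ids`, so under this rule they are removed by the same pass once they are older than the race guard.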
### Outbound File Sharing (FILES-001) -Per-agent opt-in (`agent_ownership.file_sharing_enabled`). The agent writes to `/home/developer/public/` (Docker volume `agent-{name}-public`); on share, the backend extracts the named file via Docker SDK `get_archive` — it never mounts the agent workspace (filesystem-isolated blast radius) — and stores bytes at `/data/agent-files/{file_id}` under the existing `trinity-data` volume (no compose changes). `agent_shared_files_service.py` handles path validation, MIME blocklist, quota, extraction, URL building. +Per-agent opt-in (`agent_ownership.file_sharing_enabled`). The agent writes to `/home/developer/public/` (Docker volume `agent-{name}-public`); on share, the backend extracts the named file via Docker SDK `get_archive` (never mounts the workspace — isolated blast radius) and stores bytes at `/data/agent-files/{file_id}`. `agent_shared_files_service.py` handles path validation, MIME blocklist, quota, extraction, URL building. -Download URL: `{public_chat_url}/api/files/{file_id}?sig={token}` — the param is `?sig=` (NOT `?download_token=`) so the credential sanitizer's `.*TOKEN.*` pattern doesn't redact it in agent transcripts; `/api/*` rides existing Vite/nginx proxy rules. Cascades are manual per platform convention: the agent delete handler removes rows + on-disk files + volume; `rename_agent()` updates `agent_name` across 17 tables. MCP tool `share_file`; endpoints under [Outbound File Sharing](#outbound-file-sharing-files-001-1) in API Endpoints. +Download URL: `{public_chat_url}/api/files/{file_id}?sig={token}` — `?sig=` (NOT `?download_token=`) so the credential sanitizer's `.*TOKEN.*` pattern doesn't redact it in transcripts. Cascades manual per platform convention: agent delete removes rows + files + volume; `rename_agent()` updates `agent_name` across 17 tables. MCP tool `share_file`. 
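To see why the parameter name matters, here is a small sketch of how a `.*TOKEN.*` pattern would treat the two candidate names. How the real sanitizer applies the pattern is not described in this document; the sketch assumes it is matched case-insensitively against key names, and `redacted_params` is a hypothetical helper.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Pattern from the text; case-insensitive matching is an assumption.
SANITIZER_PATTERN = re.compile(r".*TOKEN.*", re.IGNORECASE)


def redacted_params(url: str) -> list:
    """Return the query parameter names the pattern would flag for redaction."""
    query = urlsplit(url).query
    return [name for name, _ in parse_qsl(query) if SANITIZER_PATTERN.fullmatch(name)]
```

Under these assumptions a URL ending in `?sig=abc` yields no flagged parameters, while `?download_token=abc` is flagged, which is the redaction the naming choice is meant to avoid.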
### Agent Runtime Data — `data_paths` + Snapshot/Export (#1169) Declared runtime data (SQLite DBs, datasets) over the **existing durable home volume** — **no separate volume, no platform schema change** (snapshots are filesystem artifacts; audit rides `audit_log`). The agent home (`/home/developer`) is already a persistent named Docker volume (`agent-{name}-workspace`) that survives recreate/upgrade/template-repull/sub-switch, so data under `/home/developer/data` is already durable; this feature adds only the **declaration + export/import** surface. -**Declaration:** a template's `template.yaml data_paths:` (globs under `data/`) is surfaced by `template_service` (github + local builders) and materialized at creation by `crud.py` → `git_service.materialize_data_paths()`: writes `~/.trinity/data-paths.yaml` (quoted-heredoc, glob-safe) AND appends `data/` + each declared path to the agent's **own** `.gitignore` (idempotent `grep -qxF`, never the fleet-wide `_GITIGNORE_PATTERNS`). Opt-in — an empty list is a complete no-op. The S4 persistent-state functions and data_paths now share one extracted primitive (`materialize_trinity_yaml_list`/`_read_trinity_yaml_list`, heredoc delimiter parameterized so persistent_state keeps `PSTATE_EOF`). +**Declaration:** a template's `template.yaml data_paths:` (globs under `data/`) is surfaced by `template_service` and materialized at creation by `crud.py` → `git_service.materialize_data_paths()`: writes `~/.trinity/data-paths.yaml` (quoted-heredoc) AND appends `data/` + each path to the agent's **own** `.gitignore` (idempotent `grep -qxF`, never the fleet-wide `_GITIGNORE_PATTERNS`). Opt-in — empty list is a no-op. Shares one primitive with S4 persistent-state (`materialize_trinity_yaml_list`/`_read_trinity_yaml_list`, heredoc delimiter parameterized). 
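The idempotent `.gitignore` append described above (`grep -qxF` semantics: fixed string, whole line) can be sketched in Python. The real implementation runs inside the agent container; `append_gitignore_lines` is a hypothetical name used only for illustration.

```python
import tempfile
from pathlib import Path


def append_gitignore_lines(gitignore: Path, patterns: list) -> list:
    """Append each pattern unless an identical whole line already exists.

    Mirrors `grep -qxF`: exact, whole-line, fixed-string comparison.
    An empty pattern list is a complete no-op (the file is not created).
    Returns the patterns that were actually appended.
    """
    text = gitignore.read_text() if gitignore.exists() else ""
    existing = text.splitlines()
    added = []
    for pattern in patterns:
        if pattern in existing or pattern in added:
            continue  # exact-line match, never a substring match
        added.append(pattern)
    if added:
        # Make sure we start on a fresh line if the file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with gitignore.open("a") as handle:
            handle.write(prefix + "\n".join(added) + "\n")
    return added


if __name__ == "__main__":
    target = Path(tempfile.mkdtemp()) / ".gitignore"
    print(append_gitignore_lines(target, ["data/", "data/*.db"]))  # first run appends
    print(append_gitignore_lines(target, ["data/", "data/*.db"]))  # second run adds nothing
```

Matching whole lines rather than substrings means an existing entry such as `data/cache/` does not suppress adding `data/`.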
-**Export** (`routers/agent_data.py`, `POST /api/agents/{name}/data/export`, owner/admin): streams `container_get_archive("/home/developer/data")` (workspace never mounted) → temp file under `/data/agent-data-tmp` → `StreamingResponse` (temp removed via `BackgroundTask`); `AGENT_DATA_EXPORT_MAX_BYTES` (default 5 GiB) → 413; the tar embeds a self-describing `manifest.json` (data-paths + agent/version). Missing `data/` → a manifest-only tar, not a 500. `?format=base64` returns the tar inline as JSON up to `AGENT_DATA_INLINE_MAX_BYTES` (default 10 MiB) for the MCP surface. Export is a naturally-idempotent read (accepts `Idempotency-Key` for contract consistency; creates no execution). +**Export** (`routers/agent_data.py`, `POST /api/agents/{name}/data/export`, owner/admin): streams `container_get_archive("/home/developer/data")` → temp file under `/data/agent-data-tmp` → `StreamingResponse` (temp removed via `BackgroundTask`); `AGENT_DATA_EXPORT_MAX_BYTES` (default 5 GiB) → 413; the tar embeds a self-describing `manifest.json`. Missing `data/` → manifest-only tar, not 500. `?format=base64` returns the tar inline as JSON up to `AGENT_DATA_INLINE_MAX_BYTES` (default 10 MiB) for MCP. Naturally-idempotent read (accepts `Idempotency-Key`; creates no execution). -**Import** (`POST /api/agents/{name}/data/import`, owner/admin): proxies the uploaded tar to the agent-server `POST /api/agent-server/restore` primitive, whose `restore_from_tar` enforces the `data/**` allowlist and rejects absolute/`..` traversal; deduped via `Idempotency-Key` (Invariant #18). Both endpoints are serialized per agent by a cross-worker Redis op lock (`agent:data_op:{name}`, SETNX+TTL, fail-open, 409 on contention). MCP tools `export_agent_data`/`import_agent_data` (Invariant #13) carry the base64 tar — "move an agent" = template URL + `.credentials.enc` + data tar. System agents are out of scope. 
**PR2 (deferred):** scheduled snapshots + `~/.trinity/pre-snapshot` SQLite-quiesce hook + retention + rename/purge snapshot-dir cascade. +**Import** (`POST /api/agents/{name}/data/import`, owner/admin): proxies the uploaded tar to the agent-server `POST /api/agent-server/restore` primitive (`restore_from_tar` enforces the `data/**` allowlist, rejects absolute/`..`); deduped via `Idempotency-Key`. Both endpoints serialized per agent by a cross-worker Redis op lock (`agent:data_op:{name}`, SETNX+TTL, fail-open, 409 on contention). MCP tools `export_agent_data`/`import_agent_data` carry the base64 tar — "move an agent" = template URL + `.credentials.enc` + data tar. System agents out of scope. **PR2 (deferred):** scheduled snapshots + `~/.trinity/pre-snapshot` quiesce hook + retention + rename/purge cascade. ### Git Sync Health (#389/#390) **Agent side:** 15-min `auto_sync` heartbeat loop in the agent server (gated by `GIT_SYNC_AUTO`; default-on for non-source-mode GitHub-template agents) stages/commits/pushes in-container changes and writes the outcome to `.trinity/sync-state.json` (S1a). -**Backend side:** `SyncHealthService` polls git-enabled agents every 60s, upserts `agent_sync_state` (`consecutive_failures` incremented on fail, reset on success; `ahead_working`/`behind_working` make external writes to the working branch visible — P6), and emits `sync_failing` operator-queue entries at ≥3 consecutive failures (S1). Powers the dashboard sync-health dot, `GET /api/agents/sync-health`, and `GET /api/fleet/sync-audit` — whose `duplicate_binding` flag marks agents sharing a `(github_repo, working_branch)` pair with another non-source-mode agent (the §P5 silent-clobber setup) (S6, #390). 
+**Backend side** (details in [git-sync-health.md](feature-flows/git-sync-health.md)): `SyncHealthService` polls git-enabled agents every 60s, upserts `agent_sync_state` (`consecutive_failures` ++ on fail / reset on success; `ahead_working`/`behind_working` expose working-branch divergence, P6), emits `sync_failing` operator-queue entries at ≥3 failures (S1). Powers the dashboard sync dot, `GET /api/agents/sync-health`, and `GET /api/fleet/sync-audit` — whose `duplicate_binding` flag marks agents sharing a `(github_repo, working_branch)` pair (§P5 silent-clobber setup) (S6, #390). **Recovery (S3, #384):** `POST /api/agents/{name}/git/reset-to-main-preserve-state` adopts `origin/main`, snapshots the S4 persistent-state allowlist first, overlays it back, force-with-lease pushes — safe recovery for parallel-history deadlock (P2/P3). 409 with `X-Conflict-Type: agent_busy | no_git_config | no_remote_main`. Per-agent toggles: auto-sync flag and freeze-schedules-if-sync-failing flag (see API Endpoints). ### VoIP Telephony (VOIP-001, #1056) -Outbound phone calls from agents via Twilio Media Streams + Gemini Live. Feature-flag gated: `voip_available = VOIP_ENABLED && bool(GEMINI_API_KEY)`, default OFF; also requires a per-agent `voip_bindings` row (Twilio-voice creds, validated via Twilio Account fetch, AuthToken AES-256-GCM encrypted). A voice transport, NOT a text `ChannelAdapter`. +Outbound phone calls from agents via Twilio Media Streams + Gemini Live (details in [voip-telephony.md](feature-flows/voip-telephony.md)). Feature-flag gated: `voip_available = VOIP_ENABLED && bool(GEMINI_API_KEY)`, default OFF; also requires a per-agent `voip_bindings` row (Twilio-voice creds, validated via Twilio Account fetch, AuthToken AES-256-GCM encrypted). A voice transport, NOT a text `ChannelAdapter`. 
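The two-level gate above (platform flag plus per-agent binding) can be sketched as follows. The flag expression comes from the text; how `VOIP_ENABLED` is parsed is an assumption, and `can_place_call` is a hypothetical helper.

```python
from typing import Mapping

# Assumed truthy spellings for VOIP_ENABLED; the real parsing may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def voip_available(env: Mapping[str, str]) -> bool:
    """voip_available = VOIP_ENABLED && bool(GEMINI_API_KEY); default OFF."""
    enabled = env.get("VOIP_ENABLED", "").strip().lower() in _TRUTHY
    return enabled and bool(env.get("GEMINI_API_KEY"))


def can_place_call(env: Mapping[str, str], has_binding: bool) -> bool:
    """A call additionally requires a per-agent voip_bindings row."""
    return voip_available(env) and has_binding
```

With an empty environment both functions return `False`, which matches the default-OFF behaviour described above.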
**Call flow:** MCP tool `call_user` → `POST /api/agents/{name}/voip/call` → `voip_service.py`: gate checks (flag/binding) + abuse controls (rate limit per `(owner, destination)`, durable per-agent daily cap), stages a Gemini session intent in Redis keyed by `call_id` (distinct from the `vs_` VoiceSession id), mints a call-bound WSS ticket, calls Twilio `calls.create()`. Never calls `connect_and_stream` itself (cross-worker safety — the WS handler does). Optional `Idempotency-Key` honored (Invariant #18). -**Media bridge** (`transports/twilio_media_stream.py`, WS `/api/voip/voice/{call_id}`): `accept()`-then-authenticate — Twilio does NOT forward the `` query string, so the call-bound ticket arrives as a `` in the first `start` frame (`start.customParameters.ticket`), read after handshake (#1073); `?ticket=` honored as fallback for non-Twilio/diagnostic clients. Then scope check (`voip:{call_id}`), `GETDEL` staged intent (consume-once), create the Gemini `VoiceSession` on the connecting worker, run the unmodified `connect_and_stream`. Per-connection `_CallBridge`: inbound μ-law→PCM resample, outbound queue + paced 20ms 160-byte μ-law sender, `clear`-on-barge-in, `streamSid` capture; teardown ties Gemini-end→Twilio-close + SETNX-guarded single transcript save (`source="voice"`) + post-call dispatch. Codec helpers in `transports/voip_audio.py` — pure stdlib `audioop` (`ulaw8k_to_pcm16k`, `pcm24k_to_ulaw8k` direct 3:1, `pop_frames`), per-direction `ratecv` state carried across chunks (anti-click); `audioop-lts` pinned for Python ≥ 3.13. +**Media bridge** (`transports/twilio_media_stream.py`, WS `/api/voip/voice/{call_id}`): `accept()`-then-authenticate — Twilio does NOT forward the `` query string, so the call-bound ticket arrives as `start.customParameters.ticket` in the first `start` frame, read after handshake (#1073); `?ticket=` fallback for non-Twilio clients. 
Then scope check (`voip:{call_id}`), `GETDEL` staged intent (consume-once), create the Gemini `VoiceSession` on the connecting worker, run the unmodified `connect_and_stream`. Per-connection `_CallBridge`: inbound μ-law→PCM resample, outbound paced 20ms 160-byte μ-law sender, `clear`-on-barge-in, `streamSid` capture; teardown ties Gemini-end→Twilio-close + SETNX-guarded single transcript save + post-call dispatch. Codec helpers in `transports/voip_audio.py` (stdlib `audioop`, per-direction `ratecv` state for anti-click; `audioop-lts` pinned for Python ≥ 3.13). **Post-call:** transcript persisted to `chat_messages` (`source="voice"`) and dispatched to the main agent via `task_execution_service.execute_task(triggered_by="voip")` (default ON). Phase 2 column `inbound_number` reserved in `voip_bindings`. @@ -503,14 +503,12 @@ idiom), via `GET /api/agents/{name}/compatibility`, and the MCP tool `docs/agent-validation-spec.md` (single source of truth, sync-tested against `spec.py`). -Package `services/compatibility/` mirrors the deterministic `canary/` library: -`spec.py` (catalog: id→severity/type/category/runtime/auto_fixable), `collector.py`, -`static_checks.py`, `ai_checks.py`, `fixes.py`, `__init__.py` (`build_report`/`apply_fix`). +Package `services/compatibility/` mirrors the deterministic `canary/` library (`spec.py` catalog, `collector.py`, `static_checks.py`, `ai_checks.py`, `fixes.py`, `__init__.py` → `build_report`/`apply_fix`). Details in [agent-compatibility-validation.md](feature-flows/agent-compatibility-validation.md). -- **Collector**: ONE `docker exec` runs an in-container `python3` script (base64-injected) that walks a FIXED path allowlist and emits ONE JSON snapshot — per-file `{exists,size,binary,truncated,content}` with 256 KB/file + 2 MB/total caps; secret-bearing files (`.env`, generated `.mcp.json`) are **existence-only** (content never leaves the container). Backend `json.loads` once → `unavailable` on any failure (never 500). 
Stopped container is detected via `docker_service` *before* exec → degraded report showing the last persisted result. Legacy `workspace/` dir via `git_service._detect_git_dir`. -- **Checks**: pure `(snapshot)→[Check]` functions, unit-testable with fixtures. `[STATIC]` deterministic (run always, free); `[AI]` LLM-judged (Claude Haiku, batched by category, tool-use structured output, **iterate-expected** so an omitted check becomes `skipped` not vanished, fail-open on no-key/error). **AI severity is capped at SOFT** — HARD is reserved for STATIC. Claude-only checks (`CLAUDE.md`, `.claude/` skills) are skipped for non-Claude runtimes (#1187). Secret values are never echoed; AI payloads are redacted and exclude secret files. -- **Persistence** (`agent_compatibility_results`, latest-snapshot-per-agent, upsert): STATIC recomputes live each read; persisted AI verdicts merge in so findings show on every Overview load without re-spending tokens (`?include_ai=true` / "Re-run" forces fresh AI). Departs from the issue's original "no DB table" note — see requirements §41. Cascade/rename via the `AGENT_REFS` registry. -- **Auto-fix** (`POST .../compatibility/fix`, owner/admin): the 10 gitignore checks; reuses `git_service._GITIGNORE_PATTERNS`; per-agent Redis lock (`compat_fix:{name}`); atomic base64 write-back; G-001 removes a blanket `.claude/` line by exact-line match (never substring). **No auto-commit** — uncommitted until the agent's next git sync. Creates no execution, so Invariant #18 doesn't apply. +- **Collector**: ONE `docker exec` runs a base64-injected `python3` script walking a FIXED path allowlist → ONE JSON snapshot (per-file `{exists,size,binary,truncated,content}`, 256 KB/file + 2 MB/total caps); secret-bearing files (`.env`, `.mcp.json`) are **existence-only**. Backend `json.loads` once → `unavailable` on any failure (never 500); a stopped container → degraded report from the last persisted result. +- **Checks**: pure `(snapshot)→[Check]` functions. 
`[STATIC]` deterministic (always, free); `[AI]` LLM-judged (Claude Haiku, batched by category, tool-use structured output, **iterate-expected**, fail-open on no-key/error). **AI severity capped at SOFT** — HARD reserved for STATIC. Claude-only checks skipped for non-Claude runtimes (#1187). Secret values never echoed; AI payloads redacted. +- **Persistence** (`agent_compatibility_results`, latest-snapshot-per-agent, upsert): STATIC recomputes live; persisted AI verdicts merge in so findings show on every Overview load without re-spending tokens (`?include_ai=true` / "Re-run" forces fresh AI; requirements §41). Cascade/rename via `AGENT_REFS`. +- **Auto-fix** (`POST .../compatibility/fix`, owner/admin): the 10 gitignore checks; reuses `git_service._GITIGNORE_PATTERNS`; per-agent Redis lock (`compat_fix:{name}`); atomic base64 write-back; G-001 removes a blanket `.claude/` line by exact-line match. **No auto-commit** — uncommitted until next git sync. Creates no execution. --- @@ -638,8 +636,8 @@ The per-agent VoIP config + voice-picker UI lives in the agent Settings/Sharing | GET/PUT/DELETE | `/api/agents/{name}/schedules/{id}` | Get / update (same 400 on timeout) / soft-delete | | POST | `/api/agents/{name}/schedules/{id}/enable` · `/disable` · `/trigger` | Enable / disable / manual trigger | | GET | `/api/agents/{name}/schedules/{id}/executions` | Execution history | -| GET | `/api/agents/{name}/schedules/analytics-summary` | **Per-schedule performance rollups for the whole agent** in one compact call (#1115). `?window=` ∈ {7d,14d,30d}→168/336/720h (422 otherwise). One row per **non-deleted** schedule (zero-run schedules included): terminal `success_rate` (`None`→`—` when zero terminal), `avg_duration_ms` (NULL-skip), `cost_total`, `context_avg`, `tool_call_total`, last-run outcome. Backs BOTH the Overview "Schedules performance" section AND the Schedules-tab inline stats from a single fetch (no N round-trips). 
**Declared before `/{id}` in `routers/schedules.py`** so the literal `analytics-summary` isn't captured as a `schedule_id` (Invariant #4). DB: `get_agent_schedules_summary` (generalises #868). Tool-call totals parsed over the newest 5,000 rows agent-wide (`tool_calls_sampled` flag) | -| GET | `/api/agents/{name}/schedules/{id}/analytics` | Per-schedule analytics: counts, success rate, duration p50/p95/p99, cost, tool-call top-5, daily timeline. `?window_hours=` ∈ {24,168,720}, default 168 (#868). Percentiles Python-side over the newest 5,000 success rows (`sampled:true` reported when capped); counts + timeline full-set; UTC day buckets gap-filled. Tenant boundary in the DB layer (`agent_name` passed through) — `AuthorizedAgent` validates only the path agent name, NOT that `schedule_id` belongs to it, so the DB-layer filter is the actual boundary. Soft-deleted schedules 404 | +| GET | `/api/agents/{name}/schedules/analytics-summary` | **Per-schedule performance rollups for the whole agent** in one call (#1115). `?window=` ∈ {7d,14d,30d}→168/336/720h (422 else). One row per **non-deleted** schedule (zero-run included): terminal `success_rate` (`None`→`—`), `avg_duration_ms` (NULL-skip), `cost_total`, `context_avg`, `tool_call_total`, last-run outcome. Backs both the Overview "Schedules performance" section and the Schedules-tab inline stats from one fetch. **Declared before `/{id}`** so `analytics-summary` isn't captured as a `schedule_id` (Invariant #4). DB `get_agent_schedules_summary`; tool-call totals over the newest 5,000 rows (`tool_calls_sampled`) | +| GET | `/api/agents/{name}/schedules/{id}/analytics` | Per-schedule analytics: counts, success rate, duration p50/p95/p99, cost, tool-call top-5, daily timeline. `?window_hours=` ∈ {24,168,720}, default 168 (#868). Percentiles Python-side over the newest 5,000 success rows (`sampled:true` when capped); counts + timeline full-set; UTC buckets gap-filled. 
Tenant boundary in the DB layer (`agent_name` passed through) — `AuthorizedAgent` validates only the path agent name, not that `schedule_id` belongs to it. Soft-deleted schedules 404 | | POST/GET/DELETE | `/api/agents/{name}/schedules/{id}/webhook` | Generate/rotate token · status + URL · revoke (WEBHOOK-001) | ### Webhook Triggers (WEBHOOK-001) @@ -757,7 +755,7 @@ Coverage: agent lifecycle, auth, sharing, credentials, settings, rename; request | Method | Path | Description | |--------|------|-------------| | GET/PUT/DELETE | `/api/settings/mcp-url` | Get (any auth user) / set / reset-to-auto-detect (admin-only) MCP server URL | -| GET | `/api/settings/feature-flags` | Public-safe UI gating flags (any auth user): `session_tab_enabled`, `voice_available` (`VOICE_ENABLED && GEMINI_API_KEY`), `workspace_available` (voice AND `WORKSPACE_ENABLED`, opt-in #860), `voip_available` (#1056), `mcp_agent_chat_pull_enabled` (#946 pull-pilot routing; observability-only — the routing gate is the MCP server's own read of `MCP_AGENT_CHAT_PULL_ENABLED`; default OFF, not a UI surface), `enterprise_features` (registered enterprise modules; empty in OSS-only builds or under `TRINITY_OSS_ONLY=1`) (#847) | +| GET | `/api/settings/feature-flags` | Public-safe UI gating flags (any auth user): `session_tab_enabled`, `voice_available` (`VOICE_ENABLED && GEMINI_API_KEY`), `workspace_available` (voice AND `WORKSPACE_ENABLED`, #860), `voip_available` (#1056), `mcp_agent_chat_pull_enabled` (#946 observability-only; routing gate is the MCP server's own `MCP_AGENT_CHAT_PULL_ENABLED`, default OFF), `enterprise_features` (registered enterprise modules; empty in OSS-only or `TRINITY_OSS_ONLY=1`) (#847) | | GET/PUT | `/api/settings/agent-defaults/resources` | Fleet-wide default CPU/memory for new containers (admin-only; CPU 1/2/4/8/16, memory 1g–32g) (RES-001) | | GET/PUT | `/api/settings/agent-defaults/access-policy` | Fleet-wide default `require_email` for new agents (admin-only, #1129). 
Stored in `system_settings`, **secure-by-default ON** (code fallback when unset — no migration); seeds `agent_ownership.require_email` at creation (`register_agent_owner`) for **new** agents only, never rewrites existing rows; owners still override per agent via `PUT /api/agents/{name}/access-policy` | @@ -785,7 +783,7 @@ These are structural patterns that must be preserved. Breaking them causes casca 2. **DB Layer: Class-per-domain with Mixin Composition** — Each `db/` file defines an `XOperations` class. Agent-specific settings use mixins (`db/agent_settings/`) composed into `AgentOperations`. New agent settings → new mixin, not a bigger class. -3. **Schema in `db/schema.py`, Migrations in `db/migrations.py`** — All OSS table DDL lives in `schema.py`. Schema changes require a versioned migration in `migrations.py` (tracked in the `schema_migrations` table). Never create tables ad-hoc in service code. **Runner safety (#1160):** `init_database()` wraps both migration passes + `init_schema` in a cross-process `flock` (`db/migration_lock.py`) so concurrent uvicorn workers + the scheduler container can't race the suite; table-rebuild migrations use `_atomic_rebuild` (rename-swap inside an explicit `BEGIN`/`COMMIT`) so a crash mid-rebuild rolls back instead of losing rows between an autocommitted `DROP` and the re-insert; a failed migration is named on its traceback (`add_note`) and surfaced as `first_pending` in the `/health` 503. **Backend split (#1183):** the `db/migrations.py` bespoke runner (PRAGMA + `INSERT OR IGNORE`) is **SQLite-only**. The PostgreSQL backend is owned by **Alembic** — `init_database()`'s non-SQLite branch calls `db/alembic_runner.upgrade_to_head()` (`src/backend/migrations/` + `alembic.ini`; `env.py` targets the `db/tables.py` MetaData). A fresh PG DB is built by the `0001_baseline` revision (which reuses the exact `init_schema_postgres` head DDL); a pre-Alembic PG DB is stamped at baseline, not rebuilt. 
The two systems coexist during the SQLite→Postgres transition, so a schema change currently lands in **both** `migrations.py` (SQLite) and a new Alembic revision (Postgres) until SQLite reaches **end-of-support on September 1, 2026** (decision #1278; migration guide `docs/migrations/SQLITE_TO_POSTGRES.md`) — after which the long-term goal (#746) is `tables.py` MetaData as the single source with autogenerated revisions. **Two-track migrations (open-core):** enterprise modules own only `enterprise_*` tables and migrate them through a **separate** runner (`enterprise/backend/_migrations.py`) tracked in `enterprise_schema_migrations` — never the OSS `schema_migrations`, so the two version-lines can't collide. Enterprise authors one file per migration in the module's `migrations/` package (`NNNN_slug.py` with `NAME` + `upgrade(cursor, conn)`, auto-discovered in filename order). Enterprise migrations may FK-into OSS tables but must **never ALTER** an OSS table — anything OSS must enforce goes through an OSS migration as an edition-agnostic primitive (e.g. `users.suspended_at`, #995). The enterprise runner is invoked from `register_enterprise` *after* OSS `init_database`, so OSS tables already exist. +3. **Schema in `db/schema.py`, Migrations in `db/migrations.py`** — All OSS table DDL lives in `schema.py`. Schema changes require a versioned migration in `migrations.py` (tracked in the `schema_migrations` table). Never create tables ad-hoc in service code. **Runner safety (#1160):** `init_database()` wraps both migration passes + `init_schema` in a cross-process `flock` (`db/migration_lock.py`) so workers + scheduler can't race; table-rebuild migrations use `_atomic_rebuild` (rename-swap inside `BEGIN`/`COMMIT`) so a crash mid-rebuild rolls back; a failed migration is named via `add_note` and surfaced as `first_pending` in the `/health` 503. 
**Backend split (#1183):** the `db/migrations.py` runner (PRAGMA + `INSERT OR IGNORE`) is **SQLite-only**; PostgreSQL is owned by **Alembic** — `init_database()`'s non-SQLite branch calls `db/alembic_runner.upgrade_to_head()` (`src/backend/migrations/` + `alembic.ini`; `env.py` targets `db/tables.py` MetaData). Fresh PG built by the `0001_baseline` revision (reuses `init_schema_postgres` DDL); pre-Alembic PG stamped at baseline. Both coexist during transition, so a schema change lands in **both** `migrations.py` (SQLite) and a new Alembic revision (Postgres) until SQLite **end-of-support September 1, 2026** (#1278; guide `docs/migrations/SQLITE_TO_POSTGRES.md`) — after which the goal (#746) is `tables.py` MetaData as single source with autogenerated revisions. **Two-track (open-core):** enterprise owns only `enterprise_*` tables via a **separate** runner (`enterprise/backend/_migrations.py`, tracked in `enterprise_schema_migrations`, never OSS `schema_migrations`); one file per migration (`NNNN_slug.py` with `NAME` + `upgrade(cursor, conn)`, filename order). Enterprise migrations may FK-into OSS tables but must **never ALTER** one — OSS enforcement goes through an OSS migration as an edition-agnostic primitive (e.g. `users.suspended_at`, #995). The enterprise runner runs from `register_enterprise` *after* OSS `init_database`. 4. **Router Registration Order Matters** — In `main.py`, static routes like `/api/agents/context-stats` must come before `/{name}` catch-all. New collection-level agent endpoints must be registered before parameterized routes. @@ -813,7 +811,7 @@ These are structural patterns that must be preserved. Breaking them causes casca 16. **Time-Window SQL uses `iso_cutoff()`, not `datetime('now', ...)`** — Columns written via `utc_now_iso()` are ISO-Z strings (`T` separator, `Z` suffix); SQLite's `datetime('now', ...)` emits a different format (space separator, no suffix), making lexicographic comparison silently incorrect (#476). 
For rolling-window filters on ISO-Z TEXT columns, compute the cutoff in Python via `iso_cutoff(hours)` from `utils/helpers.py` and pass it as a bound parameter. -17. **Non-root containers** — every Trinity-built image MUST end with a `USER` directive switching to a non-root user. Backend additionally requires `group_add: ${DOCKER_GID:-999}` in compose for Docker socket access on Linux. New service Dockerfiles failing this invariant are rejected at review. Established by #874. CI guards in `.github/workflows/container-security.yml` (path-filtered, runs unconditionally on `docker/**`, `docker-compose*.yml`, `scripts/deploy/start.sh`, `src/mcp-server/Dockerfile` changes — independent of the `ui`-label-gated e2e workflow so backend infra PRs can't silently skip them): `verify-non-root` execs the running backend/scheduler/mcp-server containers (those hold the credentials and the `docker.sock` mount), asserts UID 1000, and proves `group_add` is wired through on Linux by running `docker.from_env().ping()` from inside the backend (NOT a `/api/agents` HTTP probe — `list_all_agents_fast` swallows Docker exceptions and returns `[]`, which made the original gate a false positive); `verify-prod-frontend-uid` builds the prod frontend image out-of-band (start.sh boots the Vite-dev image) and asserts its UID is 101 (`nginxinc/nginx-unprivileged`). Dev-only images (`docker/frontend/Dockerfile`) are intentionally exempt — they have no production attack surface. Existing deployments upgrading through this change must re-own their data path and `agent-configs` volume per [docs/migrations/NON_ROOT_CONTAINERS_2026-05.md](../migrations/NON_ROOT_CONTAINERS_2026-05.md). +17. **Non-root containers** — every Trinity-built image MUST end with a `USER` directive switching to a non-root user; the backend additionally requires `group_add: ${DOCKER_GID:-999}` in compose for Docker socket access on Linux. New Dockerfiles failing this are rejected at review (#874). 
CI guards in `.github/workflows/container-security.yml` (path-filtered on `docker/**`, `docker-compose*.yml`, `scripts/deploy/start.sh`, `src/mcp-server/Dockerfile`, independent of the `ui`-gated e2e workflow): `verify-non-root` execs the backend/scheduler/mcp-server containers, asserts UID 1000, and proves `group_add` works by running `docker.from_env().ping()` from inside the backend (not a `/api/agents` probe — `list_all_agents_fast` swallows Docker errors, a false-positive trap); `verify-prod-frontend-uid` builds the prod frontend out-of-band and asserts UID 101 (`nginxinc/nginx-unprivileged`). Dev-only `docker/frontend/Dockerfile` is exempt. Upgrading deployments must re-own their data path and `agent-configs` volume per [docs/migrations/NON_ROOT_CONTAINERS_2026-05.md](../migrations/NON_ROOT_CONTAINERS_2026-05.md). 18. **Trigger boundaries accept `Idempotency-Key`** (RELIABILITY-006, #525) — every producer boundary that creates an execution accepts an optional `Idempotency-Key` header and routes it through `services/idempotency_service.py` (`begin`/`complete`/`fail`) backed by the `idempotency_keys` table. The same `(scope, key)` within 24h yields one execution; duplicates short-circuit with the original result + `X-Idempotent-Replay: true` (in-flight duplicate → 409). Enforcement lives at the **router** layer, not solely in `TaskExecutionService`, because sync `/chat` runs an inline path and `/api/webhooks/{token}` creates no execution. Wired boundaries: `/chat`, `/task`, `/api/internal/execute-task`, `/api/webhooks/{token}` (auto-derives `(token, body_hash)`), `/api/agents/{name}/fan-out`, and the scheduler (`Idempotency-Key: sched:{execution_id}`) + MCP `chat_with_agent`/`fan_out` (deterministic key over call args). **Any new trigger type must accept an idempotency key before merge** — the dedup layer is fail-open (a key never blocks a real execution), so the cost of adding it is one `begin/complete/fail` triple. 
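
The `begin`/`complete`/`fail` contract in Invariant 18 can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the code in `services/idempotency_service.py`: the class name, the return shape of `begin`, and the choice that `fail` frees the key for a retry are all hypothetical, and the real service is backed by the `idempotency_keys` table and also fails open on storage errors, which this sketch omits.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # the 24h dedup window named in Invariant 18


@dataclass
class _Entry:
    created_at: float
    status: str = "in_flight"  # "in_flight" or "done"
    result: Any = None


class InMemoryIdempotency:
    """Hypothetical stand-in for the real service; illustration only."""

    def __init__(self, now: Callable[[], float] = time.time) -> None:
        self._rows: Dict[Tuple[str, str], _Entry] = {}
        self._now = now

    def begin(self, scope: str, key: Optional[str]) -> Tuple[str, Any]:
        """Return ("new" | "replay" | "conflict", stored_result)."""
        if not key:
            # The header is optional: no key means no dedup at all.
            return ("new", None)
        row = self._rows.get((scope, key))
        if row is not None and self._now() - row.created_at >= TTL_SECONDS:
            row = None  # outside the 24h window, treat as unseen
        if row is None:
            self._rows[(scope, key)] = _Entry(created_at=self._now())
            return ("new", None)
        if row.status == "done":
            # Router would return this result with X-Idempotent-Replay: true.
            return ("replay", row.result)
        # Same key still running: the router would answer 409.
        return ("conflict", None)

    def complete(self, scope: str, key: Optional[str], result: Any) -> None:
        row = self._rows.get((scope, key)) if key else None
        if row is not None:
            row.status = "done"
            row.result = result

    def fail(self, scope: str, key: Optional[str]) -> None:
        # Assumption: a failed attempt releases the key so a retry can run.
        if key:
            self._rows.pop((scope, key), None)
```

A router-level caller would invoke `begin` before creating the execution, then exactly one of `complete` or `fail` once the outcome is known.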
@@ -1589,7 +1587,7 @@ Bridges (members of **both** networks): `backend` (primary HTTP API — Redis on - **Non-root execution** (Invariant #17, #874): backend and scheduler as `trinity` (UID 1000), MCP server as `node` (UID 1000), frontend as `nginx` (UID 101), agents as `developer` (UID 1000). Backend needs `group_add: ${DOCKER_GID:-999}` for Docker socket access on Linux. - `CAP_DROP: ALL` + `CAP_ADD: NET_BIND_SERVICE`; `security_opt: no-new-privileges:true`; tmpfs `/tmp` with `noexec,nosuid` (RAM-backed, default 512 MB — operator-tunable via `AGENT_TMP_SIZE` on the backend service, validated `^\d+[mg]$` with invalid→default; `noexec,nosuid` stay fixed; counts against the agent memory cgroup; creation-time, so existing agents pick up a change on recreate not restart, #1231. Heavy scratch like pip/npm/ML wheels is redirected via a default `TMPDIR=/home/developer/.tmp` on the disk-backed home volume, created at start by `startup.sh`; mount spec + TMPDIR default live in `services/agent_service/capabilities.py` so create/recreate/system-agent can't drift, #1098); no external UI port exposure; network isolation per Network Topology above. - **Internal API security (C-003)**: `/api/internal/` endpoints (scheduler, agent containers) require the `X-Internal-Secret` header; falls back to `SECRET_KEY` if `INTERNAL_API_SECRET` unset. -- **Agent-server inbound auth (#1159)**: the in-container agent server (`:8000`) historically had zero auth on the shared agent network. Every backend→agent call now carries a per-agent `X-Trinity-Agent-Token` = `HMAC-SHA256(AGENT_AUTH_SECRET, "trinity-agent-auth:v1:"+name)` — *derived*, not stored (no DB column); the master lives only in the backend env, so a compromised agent holds its own token but cannot compute a sibling's. 
A **pure-ASGI** middleware (`docker/base-image/agent_server/middleware/auth.py`) enforces it on **all** HTTP **and** WebSocket routes via constant-time compare, exempting only exact `/health` (+ `OPTIONS`) — pure-ASGI (not `BaseHTTPMiddleware`) so the boundary is scope-complete (it also gates WebSocket scopes, which `BaseHTTPMiddleware` can't see) and never buffers SSE streams. The dead, unauthenticated `/ws/chat` route (it ran `runtime.execute` — arbitrary Claude — and slipped past the original HTTP-only middleware) was additionally removed; WS coverage now keeps any future WS route authenticated by default. CORS (`allow_origins=["*"]` + credentials) was removed (agent server is internal-only). Grace path: empty `TRINITY_AGENT_AUTH_TOKEN` env → allow (old-image rollout); `check_agent_auth_token_env_matches` forces a one-pass recreate (deterministic derive → loop-safe) so a missing/stale (renamed-agent) token re-injects. Backend is fail-closed (`derive_agent_token` raises on empty `AGENT_AUTH_SECRET`; `start.sh` auto-generates the hex32 master like `SECRET_KEY`/`INTERNAL_API_SECRET`, both compose files forward it to the backend). Callers route through `services/agent_auth.py` helpers (`agent_httpx_client` / `build_agent_auth_headers` / `merge_auth_headers`); a static guard (`tests/unit/test_agent_auth_header_guard.py`) fails any new raw `agent-{name}:8000` caller that skips them. +- **Agent-server inbound auth (#1159)** (details in [agent-server-authentication.md](feature-flows/agent-server-authentication.md)): every backend→agent call carries a per-agent `X-Trinity-Agent-Token` = `HMAC-SHA256(AGENT_AUTH_SECRET, "trinity-agent-auth:v1:"+name)` — *derived*, not stored; the master lives only in backend env, so a compromised agent can't compute a sibling's token. 
A **pure-ASGI** middleware (`docker/base-image/agent_server/middleware/auth.py`) enforces it on **all** HTTP **and** WS routes via constant-time compare, exempting only exact `/health` (+ `OPTIONS`) — pure-ASGI (not `BaseHTTPMiddleware`) so it gates WS scopes too and never buffers SSE. The dead unauthenticated `/ws/chat` route (ran arbitrary Claude) was removed; CORS dropped (internal-only). Grace path: empty `TRINITY_AGENT_AUTH_TOKEN` → allow (old-image); `check_agent_auth_token_env_matches` forces a one-pass recreate so a missing/stale token re-injects. Backend fail-closed (`derive_agent_token` raises on empty secret; `start.sh` auto-generates the hex32 master, both compose files forward it). Callers route through `services/agent_auth.py`; a static guard (`tests/unit/test_agent_auth_header_guard.py`) fails any raw `agent-{name}:8000` caller that skips them. - **WebSocket security (C-002, #550)**: single-use ticket auth — see [Real-time Delivery](#real-time-delivery-reliability-003-306). - **Frontend XSS (H-005)**: all markdown rendering uses DOMPurify via `utils/markdown.js`; no direct `v-html` with unsanitized content. - **Rate limiting (#1023)**: shared sliding-window limiter `services/rate_limiter.py` — Redis sorted-set rolling window (no fixed-window boundary burst), fail-open with bounded per-worker in-process fallback; `enforce(key, limit, window)` raises 429 + `Retry-After`. New request-rate limits reuse this primitive — don't hand-roll Redis counters. Intentionally NOT unified under it: the auth login/OTP limiters in `routers/auth.py` are failure-counters (increment on failure, reset on success) — a different pattern. A global ASGI middleware with a route→policy table is a tracked follow-up.
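
The per-agent token derivation described under agent-server inbound auth (#1159) can be sketched as follows. The HMAC construction, the `trinity-agent-auth:v1:` prefix, the fail-closed behaviour on an empty secret, and the constant-time compare come from the text above; the function signatures, the hex output encoding, and the `token_matches` helper name are assumptions for illustration and may differ from `services/agent_auth.py`.

```python
import hashlib
import hmac

TOKEN_CONTEXT = "trinity-agent-auth:v1:"  # version-tagged prefix from the doc


def derive_agent_token(master_secret: str, agent_name: str) -> str:
    """Derive the per-agent token from the master secret (nothing is stored).

    Assumption: the token is hex-encoded; the source does not state the encoding.
    """
    if not master_secret:
        # Fail closed: the backend refuses to derive from an empty secret.
        raise ValueError("AGENT_AUTH_SECRET is empty")
    message = (TOKEN_CONTEXT + agent_name).encode("utf-8")
    return hmac.new(master_secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def token_matches(expected: str, presented: str) -> bool:
    """Constant-time comparison, as the agent-side middleware is said to use."""
    return hmac.compare_digest(expected.encode("utf-8"), presented.encode("utf-8"))
```

Because the agent name is part of the HMAC message, an agent that leaks its own token reveals nothing usable for a sibling: computing another agent's token requires the master secret, which stays in the backend environment.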