| 1 | +# SnapAgent Observability + Doctor Architecture |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +Build a lightweight but operable observability architecture so production issues can be diagnosed directly from chat channels (`/doctor`) without coupling diagnosis logic to core runtime internals. |
| 6 | + |
| 7 | +## Design Principles |
| 8 | + |
| 9 | +1. Thin connector, strong evidence. |
| 10 | +2. Session-scoped control, no global pause. |
| 11 | +3. Read-only diagnostics first, no risky auto-fix by default. |
| 12 | +4. Separate deterministic data collection from model reasoning. |
| 13 | + |
| 14 | +## Four Surfaces |
| 15 | + |
| 16 | +### 1) Event Backbone (Track 0) |
| 17 | + |
| 18 | +- Unified event model: `DiagnosticEvent`. |
| 19 | +- Correlation fields: `session_key`, `run_id`, `turn_id`. |
| 20 | +- Message bus emits structured inbound/outbound/runtime events. |
| 21 | + |
| 22 | +Role: |
| 23 | +- Foundation for cross-surface correlation and postmortem timeline reconstruction. |
| 24 | + |
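
A minimal sketch of what a `DiagnosticEvent` with the three correlation fields might look like. Only `session_key`, `run_id`, and `turn_id` come from the design; the `name`, `ts`, and `payload` fields and the `to_jsonl` helper are assumptions made for illustration, not the project's actual schema.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DiagnosticEvent:
    # Correlation fields named in the design.
    session_key: str
    run_id: Optional[str] = None
    turn_id: Optional[str] = None
    # Hypothetical fields; the design does not specify them.
    name: str = "runtime.unknown"
    ts: float = field(default_factory=time.time)
    payload: Dict[str, Any] = field(default_factory=dict)

    def to_jsonl(self) -> str:
        """Serialize the event to a single JSON line (no trailing newline)."""
        return json.dumps(asdict(self), sort_keys=True)
```

Carrying all three identifiers on every event is what would let a timeline be rebuilt by grouping on `session_key` and then ordering by run and turn.
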
| 25 | +### 2) Health Surface (Track 1) |
| 26 | + |
| 27 | +- CLI: `snapagent health --json`, `snapagent status --deep --json`. |
| 28 | +- Aggregates provider/config/workspace/channel/runtime queue evidence. |
| 29 | + |
| 30 | +Role: |
| 31 | +- Fast readiness/liveness check and root-cause narrowing. |
| 32 | + |
| 33 | +### 3) Logging Surface (Track 2) |
| 34 | + |
| 35 | +- Structured JSONL sink (`diagnostic.jsonl`) with rotation/follow. |
| 36 | +- CLI: `snapagent logs --json --session ... --run ... --follow`. |
| 37 | + |
| 38 | +Role: |
| 39 | +- Session/run-scoped evidence retrieval for operational debugging. |
| 40 | + |
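
To illustrate session/run-scoped retrieval, here is a hedged sketch of filtering JSONL lines by correlation fields. The function `filter_events` is hypothetical and is not the implementation behind `snapagent logs`; it assumes each line is a JSON object that may carry `session_key` and `run_id`.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional


def filter_events(
    lines: Iterable[str],
    session_key: Optional[str] = None,
    run_id: Optional[str] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield parsed events that match the given correlation filters."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            # Skip truncated lines, e.g. one that is still being written.
            continue
        if not isinstance(event, dict):
            continue
        if session_key is not None and event.get("session_key") != session_key:
            continue
        if run_id is not None and event.get("run_id") != run_id:
            continue
        yield event
```

Because the function accepts any iterable of strings, the same filter could be applied to a finished file or to a generator that follows a growing one.
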
| 41 | +### 4) Doctor Surface (Codex-Driven) |
| 42 | + |
| 43 | +- Chat commands: |
| 44 | + - `/doctor` |
| 45 | + - `/doctor status` |
| 46 | + - `/doctor cancel` |
| 47 | + - `/doctor resume` |
| 48 | +- `/doctor` first pauses the current session's tasks (reusing the existing stop/cancel path). |
| 49 | +- Provider precheck before diagnostics: |
| 50 | + - if the provider is not ready, return setup guidance and block doctor mode. |
| 51 | + - the guidance includes OAuth/API-key paths and a validation command. |
| 52 | +- Diagnostic execution is model-driven via a read-only tool: |
| 53 | + - `doctor_check(check=health|status|logs|events, session_key?, run_id?, lines?)` |
| 54 | + |
| 55 | +Role: |
| 56 | +- Turn observability data into interactive diagnosis in user channels (Feishu/Telegram/CLI). |
| 57 | + |
| 58 | +## End-to-End Flow |
| 59 | + |
| 60 | +1. User sends `/doctor` in a chat session. |
| 61 | +2. Agent cancels active tasks for this session only. |
| 62 | +3. Agent runs provider precheck. |
| 63 | +4. If precheck fails: return setup guidance and stop. |
| 64 | +5. If precheck passes: enter doctor mode and start diagnostic turn. |
| 65 | +6. Codex decides which `doctor_check` calls to run and synthesizes conclusions. |
| 66 | +7. User can continue with follow-up questions, or use `/doctor cancel` or `/doctor resume`. |
| 67 | + |
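
Steps 2 through 5 of the flow above can be sketched as a small handler. The `Session` class, the `handle_doctor` function, and the returned labels are hypothetical names used only for illustration; the real runtime types are not shown in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Session:
    key: str
    active_tasks: List[str] = field(default_factory=list)
    doctor_mode: bool = False


def handle_doctor(session: Session, provider_ready: Callable[[], bool]) -> str:
    """Run the /doctor entry sequence for one session."""
    # Step 2: stop work for this session only; other sessions are not touched.
    session.active_tasks.clear()
    # Steps 3-4: a failed precheck returns setup guidance and blocks doctor mode.
    if not provider_ready():
        return "setup_guidance"
    # Step 5: enter doctor mode and hand over to the diagnostic turn.
    session.doctor_mode = True
    return "diagnostic_turn"
```

Passing the precheck in as a callable keeps the handler deterministic and easy to test without a live provider.
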
| 68 | +## Safety Boundaries |
| 69 | + |
| 70 | +- Session-local interruption only; other sessions and cron jobs are unaffected by default. |
| 71 | +- Diagnostics are read-only (`health/status/logs/events`). |
| 72 | +- No automatic code mutation/restart in M0. |
| 73 | + |
| 74 | +## Why This Shape |
| 75 | + |
| 76 | +- Deterministic observability primitives stay in SnapAgent. |
| 77 | +- Dynamic diagnosis stays in the Codex reasoning layer. |
| 78 | +- This split keeps code volume low while preserving operational control and auditability. |