diff --git a/README.md b/README.md
index bf787ae..27d3a75 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,14 @@ A lightweight personal AI assistant framework built on Python.
 
 ## Changelog
 
+### v0.4 — Observability Surfaces & Doctor Mode
+
+- **Event Backbone (Track 0)**: standardized diagnostic events with `session_key` / `run_id` / `turn_id`.
+- **Health Surface (Track 1)**: machine-readable health/readiness snapshots (`health`, `status --deep --json`).
+- **Logging Surface (Track 2)**: structured diagnostic logs with session/run filters and follow mode.
+- **Doctor Surface**: chat-driven diagnosis commands (`/doctor`, `/doctor status`, `/doctor cancel`, `/doctor resume`) with provider precheck guidance and evidence-first diagnostics through `doctor_check`.
+- Architecture doc: [`docs/architecture/doctor-observability-architecture.md`](docs/architecture/doctor-observability-architecture.md).
+
 ### v0.3 — ReAct Core, Prompt Security & Async Interrupt Handling
 
 - **ReAct tracing core**: orchestrator now records thought/action/observation style steps with stronger iteration-cap handling.
@@ -379,6 +387,15 @@ User Message → [Channel] → [MessageBus] → [AgentLoop] → [ConversationOrc
 | **SubagentManager** | `agent/subagent.py` | Background sub-task execution |
 | **ProviderRegistry** | `providers/registry.py` | Provider metadata (17 specs) |
 
+### Observability And Doctor
+
+See [`docs/architecture/doctor-observability-architecture.md`](docs/architecture/doctor-observability-architecture.md) for the full design of:
+
+- Event Backbone (Track 0)
+- Health Surface (Track 1)
+- Logging Surface (Track 2)
+- Doctor Surface (Codex-driven diagnosis)
+
 ## Project Structure
 
 ```
diff --git a/docs/architecture/doctor-observability-architecture.md b/docs/architecture/doctor-observability-architecture.md
new file mode 100644
index 0000000..4fc7412
--- /dev/null
+++ b/docs/architecture/doctor-observability-architecture.md
@@ -0,0 +1,78 @@
+# SnapAgent Observability + Doctor Architecture
+
+## Goal
+
+Build a lightweight but operable observability architecture so production issues can be diagnosed directly from chat channels (`/doctor`) without coupling diagnosis logic to core runtime internals.
+
+## Design Principles
+
+1. Thin connector, strong evidence.
+2. Session-scoped control, no global pause.
+3. Read-only diagnostics first, no risky auto-fix by default.
+4. Separate deterministic data collection from model reasoning.
+
+## Four Surfaces
+
+### 1) Event Backbone (Track 0)
+
+- Unified event model: `DiagnosticEvent`.
+- Correlation fields: `session_key`, `run_id`, `turn_id`.
+- Message bus emits structured inbound/outbound/runtime events.
+
+Role:
+- Foundation for cross-surface correlation and postmortem timeline reconstruction.
+
+### 2) Health Surface (Track 1)
+
+- CLI: `snapagent health --json`, `snapagent status --deep --json`.
+- Aggregates provider/config/workspace/channel/runtime queue evidence.
+
+Role:
+- Fast readiness/liveness check and root-cause narrowing.
+
+### 3) Logging Surface (Track 2)
+
+- Structured JSONL sink (`diagnostic.jsonl`) with rotation/follow.
+- CLI: `snapagent logs --json --session ... --run ... --follow`.
+
+Role:
+- Session/run-scoped evidence retrieval for operational debugging.
+
+### 4) Doctor Surface (Codex-Driven)
+
+- Chat commands:
+  - `/doctor`
+  - `/doctor status`
+  - `/doctor cancel`
+  - `/doctor resume`
+- `/doctor` first pauses current session tasks (reuse stop/cancel path).
+- Provider precheck before diagnostics:
+  - if provider not ready, return setup guidance and block doctor mode.
+  - guidance includes OAuth/API-key paths and validation command.
+- Diagnostic execution is model-driven via read-only tool:
+  - `doctor_check(check=health|status|logs|events, session_key?, run_id?, lines?)`
+
+Role:
+- Turn observability data into interactive diagnosis in user channels (Feishu/Telegram/CLI).
+
+## End-to-End Flow
+
+1. User sends `/doctor` in a chat session.
+2. Agent cancels active tasks for this session only.
+3. Agent runs provider precheck.
+4. If precheck fails: return setup guidance and stop.
+5. If precheck passes: enter doctor mode and start diagnostic turn.
+6. Codex decides which `doctor_check` calls to run and synthesizes conclusions.
+7. User can continue with follow-up questions, or `/doctor cancel`/`/doctor resume`.
+
+## Safety Boundaries
+
+- Session-local interruption only; other sessions/cron are unaffected by default.
+- Diagnostics are read-only (`health/status/logs/events`).
+- No automatic code mutation/restart in M0.
+
+## Why This Shape
+
+- Deterministic observability primitives stay in SnapAgent.
+- Dynamic diagnosis stays in Codex reasoning layer.
+- Keeps code volume low while preserving operational control and auditability.
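Reviewer note, outside the patch: steps 2 to 5 of the End-to-End Flow in the new architecture doc, together with the `doctor_check` signature, might be sketched roughly as below. `Session`, `handle_doctor`, `provider_ready`, and the returned strings are hypothetical names used only for illustration. Only `doctor_check` and its four check names come from the doc, and the body shown here is a stand-in that echoes the request, not SnapAgent's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Read-only checks named in the architecture doc.
ALLOWED_CHECKS = ("health", "status", "logs", "events")


@dataclass
class Session:
    """Hypothetical per-session state; not SnapAgent's real session type."""

    session_key: str
    active_tasks: list[str] = field(default_factory=list)
    doctor_mode: bool = False


def doctor_check(
    check: str,
    session_key: str | None = None,
    run_id: str | None = None,
    lines: int | None = None,
) -> dict:
    """Stand-in for the read-only tool: validates the check name and echoes the request."""
    if check not in ALLOWED_CHECKS:
        raise ValueError(f"unsupported check: {check!r}")
    return {"check": check, "session_key": session_key, "run_id": run_id, "lines": lines}


def handle_doctor(session: Session, provider_ready: Callable[[], bool]) -> str:
    """Steps 2-5 of the flow: cancel session tasks, run the precheck, enter doctor mode."""
    session.active_tasks.clear()  # step 2: only this session's tasks are touched
    if not provider_ready():  # step 3: provider precheck
        return "setup_guidance"  # step 4: doctor mode stays blocked
    session.doctor_mode = True  # step 5: diagnostic turn can start
    return "doctor_mode"
```

Keeping the precheck as an injected callable mirrors the doc's split between deterministic data collection and model reasoning: the handler itself never decides which checks to run, it only gates entry into doctor mode.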
diff --git a/pyproject.toml b/pyproject.toml
index 86fae04..4a63fbd 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "snapagent-ai"
-version = "0.1.4.post2"
+version = "0.1.4.post3"
 description = "A lightweight personal AI assistant framework"
 requires-python = ">=3.11"
 license = {text = "MIT"}
diff --git a/snapagent/__init__.py b/snapagent/__init__.py
index d2446dc..96306dc 100644
--- a/snapagent/__init__.py
+++ b/snapagent/__init__.py
@@ -1,5 +1,5 @@
 """SnapAgent - A lightweight AI agent framework."""
 
-__version__ = "0.1.4.post2"
+__version__ = "0.1.4.post3"
 __logo__ = "🐈"
 __app_name__ = "SnapAgent"