feat(services): periodic health monitoring for subsystems #42

Merged
Mathews-Tom merged 2 commits into main from feat/health-monitoring on Mar 25, 2026

@Mathews-Tom

Summary

Add periodic health monitoring for VaultMind's subsystems, inspired by SivaRamSV/paaw's 30-second heartbeat pattern. The HealthMonitor checks six subsystems (vault access, graph file, ChromaDB, SQLite, LLM, bot), computes an overall health status, and emits alerts on state transitions — enabling proactive detection of silent degradation.

How It Works

HealthMonitor.run_check()
  ├─ _check_vault_access()    → vault dir exists + readable + writable
  ├─ _check_graph_file()      → graph JSON exists + non-empty
  ├─ _check_chromadb()        → store.search() latency within threshold
  ├─ _check_sqlite()          → ~/.vaultmind/data/ dir exists
  ├─ _check_llm()             → LLM client configured
  └─ _check_bot()             → Telegram bot token present
  → Aggregate into HealthReport (HEALTHY / DEGRADED / CRITICAL)
  → Compare vs. previous report → emit alerts on transitions
  → dispatch_alerts() → async handlers (log, Telegram notification)
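
As a rough illustration of what two of these checks might look like, the sketch below implements a vault access check and a latency-timed search check. The HealthSignal field names, the severity values, and the `search("health check")` call signature are assumptions made for illustration; they are not taken from health.py.

```python
import os
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class HealthSignal:
    # Field names are assumed from "status, latency, and severity".
    check: str
    passed: bool
    latency_ms: float
    severity: float  # 0.0-1.0, only meaningful when the check fails
    detail: str = ""


def _elapsed_ms(start: float) -> float:
    return (time.perf_counter() - start) * 1000.0


def check_vault_access(vault_dir: Path) -> HealthSignal:
    """Pass only if the vault directory exists and is readable and writable."""
    start = time.perf_counter()
    ok = vault_dir.is_dir() and os.access(vault_dir, os.R_OK | os.W_OK)
    detail = "" if ok else f"{vault_dir} is missing or not readable/writable"
    # 0.9 is a hypothetical severity for losing vault access.
    return HealthSignal("vault_access", ok, _elapsed_ms(start), 0.0 if ok else 0.9, detail)


def check_search_latency(store: Any, warn_ms: float) -> HealthSignal:
    """Time one real search() call; fail on an exception or slow response."""
    start = time.perf_counter()
    try:
        store.search("health check")  # assumed signature: search(query)
    except Exception as exc:
        return HealthSignal("chromadb", False, _elapsed_ms(start), 0.8, f"search failed: {exc}")
    latency = _elapsed_ms(start)
    ok = latency <= warn_ms
    detail = "" if ok else f"search took {latency:.0f} ms (threshold {warn_ms:.0f} ms)"
    return HealthSignal("chromadb", ok, latency, 0.0 if ok else 0.4, detail)
```

The store is typed as Any here for the same reason the PR gives: it keeps the health module free of imports from the indexer.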

Overall Status Logic

| Condition | Status |
| --- | --- |
| 2+ failures with severity ≥ 0.7 | CRITICAL |
| 3+ failures (any severity) | DEGRADED |
| Otherwise | HEALTHY |
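
The rules above can be sketched as a small pure function. This is a minimal illustration that takes only the severities of the failed checks as input; the function name, the enum values, and the evaluation order (CRITICAL tested before DEGRADED) are assumptions, and the real module uses StrEnum rather than the `str`/`Enum` mix shown here.

```python
from enum import Enum
from typing import Sequence


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    CRITICAL = "critical"


def overall_status(
    failed_severities: Sequence[float], critical_threshold: float = 0.7
) -> HealthStatus:
    """Map the severities of failed checks to an overall status."""
    high = sum(1 for s in failed_severities if s >= critical_threshold)
    if high >= 2:
        return HealthStatus.CRITICAL
    if len(failed_severities) >= 3:
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY
```

Under these rules a single failure, even a severe one, leaves the overall status at HEALTHY, which matches the "1-fail HEALTHY" case in the test plan.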

Alert Transitions

  • pass → fail: emits error (severity ≥ 0.7) or warning alert with suggested remediation
  • fail → pass: emits warning recovery alert
  • same status: no alert (avoids noise)
  • first report: no alerts (no baseline to compare against)
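
A minimal sketch of this transition logic, assuming each report can be reduced to a mapping from check name to a pass/fail flag and a severity. The Check and Alert types, the message wording, and the choice to skip checks that were absent from the previous report are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Check:
    passed: bool
    severity: float


@dataclass(frozen=True)
class Alert:
    check: str
    level: str  # "error" or "warning"
    message: str


def detect_transitions(
    previous: Optional[Dict[str, Check]], current: Dict[str, Check]
) -> List[Alert]:
    """Return alerts only for checks whose pass/fail state changed."""
    if previous is None:
        return []  # first report: no baseline to compare against
    alerts: List[Alert] = []
    for name, now in current.items():
        before = previous.get(name)
        if before is None or before.passed == now.passed:
            continue  # new check or unchanged state: stay silent
        if now.passed:
            alerts.append(Alert(name, "warning", f"{name} recovered"))
        else:
            level = "error" if now.severity >= 0.7 else "warning"
            alerts.append(Alert(name, level, f"{name} failed"))
    return alerts
```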

Suggested Remediation Actions

| Check | Suggestion |
| --- | --- |
| ChromaDB | Check ChromaDB disk space and connectivity |
| SQLite | Verify ~/.vaultmind/data/ directory permissions |
| Graph file | Run vaultmind graph-maintain to rebuild |
| LLM | Check API key and network connectivity |
| Bot | Verify VAULTMIND_TELEGRAM__BOT_TOKEN is set |
| Vault access | Check vault directory permissions |
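
One simple way to attach these suggestions to failure alerts is a lookup table keyed by check type. The sketch below assumes string keys such as "chromadb"; the actual CheckType values in health.py may differ.

```python
from typing import Dict, Optional

# Keys are hypothetical check identifiers; the suggestion text comes from the table above.
REMEDIATION: Dict[str, str] = {
    "chromadb": "Check ChromaDB disk space and connectivity",
    "sqlite": "Verify ~/.vaultmind/data/ directory permissions",
    "graph_file": "Run vaultmind graph-maintain to rebuild",
    "llm": "Check API key and network connectivity",
    "bot": "Verify VAULTMIND_TELEGRAM__BOT_TOKEN is set",
    "vault_access": "Check vault directory permissions",
}


def suggestion_for(check: str) -> Optional[str]:
    """Return the remediation hint for a check, or None if there is none."""
    return REMEDIATION.get(check)
```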

Changes

New Files

  • src/vaultmind/services/health.py (308 lines): defines the HealthStatus and CheckType StrEnums; the frozen dataclasses HealthSignal, HealthReport, and HealthAlert; and the HealthMonitor class, which provides per-check enable flags, a configurable latency threshold, transition detection, and async alert dispatch via registered handlers
  • tests/test_health.py (442 lines): 20 tests across 4 classes

Modified Files

  • src/vaultmind/config.py: added a HealthConfig class with 10 fields: enabled, check_interval_seconds, six per-check toggles (check_chromadb, check_sqlite, check_graph_file, check_llm, check_bot, check_vault_access), chromadb_latency_warn_ms, and retention_days. Also added a health field to Settings
  • config/default.toml: added a [health] section with all config entries
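
For orientation, the ten fields could be modelled roughly as follows. This sketch uses a plain dataclass to stay self-contained; the project's actual base class is not shown in this PR, and every default value below is a placeholder rather than the shipped default. The enabled_checks helper is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HealthConfig:
    # Field names come from the PR description; all defaults are hypothetical.
    enabled: bool = True
    check_interval_seconds: int = 30
    check_chromadb: bool = True
    check_sqlite: bool = True
    check_graph_file: bool = True
    check_llm: bool = True
    check_bot: bool = True
    check_vault_access: bool = True
    chromadb_latency_warn_ms: int = 500
    retention_days: int = 30


_CHECK_FLAGS = (
    "check_vault_access",
    "check_graph_file",
    "check_chromadb",
    "check_sqlite",
    "check_llm",
    "check_bot",
)


def enabled_checks(config: HealthConfig) -> List[str]:
    """List the checks that would run under this config (hypothetical helper)."""
    if not config.enabled:
        return []
    return [flag[len("check_"):] for flag in _CHECK_FLAGS if getattr(config, flag)]
```

An environment without ChromaDB or Telegram would then set check_chromadb and check_bot to false and still run the remaining four checks.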

Design Decisions

  • Sync checks, async dispatch: run_check() is synchronous, so it is safe to call from any context. Alerts are buffered in _pending_alerts and sent later by dispatch_alerts(), which is async because Telegram notification delivery is async
  • No real LLM calls in health checks: the LLM check only verifies that the client object is configured, which avoids spending tokens on health pings. The ChromaDB check does make a real search() call, since it is local and free
  • Subsystem references typed as Any: avoids circular imports between services/ and indexer/ or llm/
  • Per-check enable flags: each check can be independently disabled via config, so environments without ChromaDB or Telegram can skip those checks
  • Severity-based classification: each signal carries a 0.0-1.0 severity score that drives the HEALTHY/DEGRADED/CRITICAL computation
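
The sync-check, async-dispatch split can be sketched as a small buffer class. The names `_pending_alerts` and `dispatch_alerts()` follow the PR; the AlertBuffer class, `queue()`, `register_handler()`, the use of plain strings as alerts, and the choice to swallow handler exceptions are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List

AlertHandler = Callable[[str], Awaitable[None]]


class AlertBuffer:
    def __init__(self) -> None:
        self._pending_alerts: List[str] = []
        self._handlers: List[AlertHandler] = []

    def register_handler(self, handler: AlertHandler) -> None:
        self._handlers.append(handler)

    def queue(self, alert: str) -> None:
        """Synchronous: safe to call from a sync check, with or without an event loop."""
        self._pending_alerts.append(alert)

    async def dispatch_alerts(self) -> int:
        """Send every buffered alert to every handler; return the number sent."""
        # Swap the buffer out first so alerts queued during dispatch are kept.
        pending, self._pending_alerts = self._pending_alerts, []
        for alert in pending:
            # return_exceptions=True keeps one failing handler from blocking the rest.
            await asyncio.gather(
                *(handler(alert) for handler in self._handlers),
                return_exceptions=True,
            )
        return len(pending)
```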

Test plan

  • 20 new tests in test_health.py across 4 classes:
    • Individual checks (10): vault access pass/fail, graph file pass/fail/empty, ChromaDB skip/pass/exception, bot token pass/fail
    • Overall status (4): all-pass HEALTHY, 1-fail HEALTHY, 3-fail DEGRADED, 2-critical CRITICAL
    • Alert transitions (4): pass-to-fail alert, fail-to-pass recovery, no-transition silence, first-report baseline
    • HealthConfig (2): default validity, field presence
  • Full suite: 918/918 tests pass, 0 regressions
  • ruff check — clean
  • mypy --ignore-missing-imports — clean
  • Integration: wire into bot command's asyncio loop with configurable interval
  • Manual: verify alert dispatch to Telegram via Notifier when a check fails
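
The integration item could be wired roughly as below: a coroutine that runs one tick, then waits for either the interval to elapse or a stop signal. The run_periodically helper and its parameters are hypothetical; in the bot, the tick would call run_check() and then await dispatch_alerts().

```python
import asyncio
from typing import Awaitable, Callable


async def run_periodically(
    tick: Callable[[], Awaitable[None]],
    interval_seconds: float,
    stop: asyncio.Event,
) -> int:
    """Call tick() every interval_seconds until stop is set; return the run count."""
    runs = 0
    while not stop.is_set():
        await tick()
        runs += 1
        try:
            # Waiting on the event instead of sleeping lets shutdown interrupt the wait.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            pass  # interval elapsed without a stop request: run the next tick
    return runs
```

Passing check_interval_seconds from the health config as interval_seconds would make the period configurable, as the test plan describes.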

New module services/health.py with HealthMonitor that checks vault
access, graph file, ChromaDB connectivity, SQLite data dir, LLM
client, and bot token. Each check returns a HealthSignal with status,
latency, and severity. Overall status computed as HEALTHY, DEGRADED
(3+ failures), or CRITICAL (2+ high-severity failures).

Alert transitions detected between consecutive reports — pass-to-fail
emits error/warning alerts with suggested remediation, fail-to-pass
emits recovery alerts. Async dispatch to registered handlers.

Add HealthConfig with per-check toggles, latency threshold, and
retention settings.

20 tests across 4 classes: individual checks (10) covering vault
access, graph file, ChromaDB, bot token with pass/fail scenarios;
overall status computation (4) for healthy/degraded/critical; alert
transitions (4) for pass-to-fail, recovery, no-transition, first
report; and HealthConfig validation (2).
@Mathews-Tom Mathews-Tom merged commit 17c698d into main Mar 25, 2026
3 checks passed
@Mathews-Tom Mathews-Tom deleted the feat/health-monitoring branch March 25, 2026 15:48