feat(services): periodic health monitoring for subsystems #42
Merged
New module services/health.py with HealthMonitor that checks vault access, graph file, ChromaDB connectivity, SQLite data dir, LLM client, and bot token. Each check returns a HealthSignal with status, latency, and severity. Overall status computed as HEALTHY, DEGRADED (3+ failures), or CRITICAL (2+ high-severity failures). Alert transitions detected between consecutive reports — pass-to-fail emits error/warning alerts with suggested remediation, fail-to-pass emits recovery alerts. Async dispatch to registered handlers. Add HealthConfig with per-check toggles, latency threshold, and retention settings.
20 tests across 4 classes: individual checks (10) covering vault access, graph file, ChromaDB, bot token with pass/fail scenarios; overall status computation (4) for healthy/degraded/critical; alert transitions (4) for pass-to-fail, recovery, no-transition, first report; and HealthConfig validation (2).
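As a rough illustration of how a single check might produce a `HealthSignal` carrying status, latency, and severity, here is a minimal sketch. The field names (`check`, `passed`, `latency_ms`, `detail`), the 0.0 to 1.0 severity scale, and the `check_vault_access` helper are assumptions made for this example, not the code in `services/health.py`.

```python
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class HealthSignal:
    """Result of one subsystem check (field names are assumptions)."""

    check: str
    passed: bool
    latency_ms: float
    severity: float  # assumed scale: 0.0 (minor) to 1.0 (blocking)
    detail: str = ""


def check_vault_access(vault_path: Path, severity: float = 0.9) -> HealthSignal:
    """Time a cheap directory probe and wrap the outcome in a HealthSignal."""
    start = time.perf_counter()
    try:
        passed = vault_path.is_dir()
        detail = "" if passed else f"{vault_path} is not a directory"
    except OSError as exc:
        # Treat filesystem errors as a failed check rather than crashing.
        passed, detail = False, str(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return HealthSignal("vault_access", passed, latency_ms, severity, detail)
```

Returning a frozen dataclass keeps each signal immutable, which makes it safe to hold on to the previous report for later comparison.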
Summary
Add periodic health monitoring for VaultMind's subsystems, inspired by SivaRamSV/paaw's 30-second heartbeat pattern. The `HealthMonitor` checks six subsystems (vault access, graph file, ChromaDB, SQLite, LLM, bot), computes an overall health status, and emits alerts on state transitions, enabling proactive detection of silent degradation.
How It Works
Overall Status Logic
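The PR description gives the thresholds as DEGRADED at 3+ failures and CRITICAL at 2+ high-severity failures. A minimal sketch of that rule follows. It assumes that "high severity" means severity ≥ 0.7 (borrowed from the alert threshold), that CRITICAL takes precedence when both conditions hold, and that anything below both thresholds counts as HEALTHY. The `HealthSignal` fields and the `compute_overall_status` name are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    # The PR uses StrEnum; (str, Enum) is used here so the sketch also runs
    # on Python versions older than 3.11.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    CRITICAL = "critical"


@dataclass(frozen=True)
class HealthSignal:
    check: str
    passed: bool
    severity: float  # assumed scale: 0.0 to 1.0


def compute_overall_status(
    signals: Iterable[HealthSignal],
    high_severity: float = 0.7,  # assumption: same cutoff as error alerts
    critical_min: int = 2,
    degraded_min: int = 3,
) -> HealthStatus:
    """Map a set of check results to one overall status."""
    failures = [s for s in signals if not s.passed]
    high = [s for s in failures if s.severity >= high_severity]
    # Assumption: CRITICAL is evaluated first so it wins over DEGRADED.
    if len(high) >= critical_min:
        return HealthStatus.CRITICAL
    if len(failures) >= degraded_min:
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY
```

Under these assumptions a single high-severity failure still reports HEALTHY overall, so the per-check alerts remain the main signal for isolated failures.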
Alert Transitions
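Transition detection compares each check's result with the previous report: pass-to-fail yields an `error` or `warning` alert, fail-to-pass yields a recovery alert. The sketch below is one way to express that. The `detect_transitions` name, the dataclass fields, the message text, and the choice to emit nothing on the first report are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSignal:
    check: str
    passed: bool
    severity: float


@dataclass(frozen=True)
class HealthAlert:
    check: str
    level: str  # "error" or "warning"
    message: str


def detect_transitions(
    previous: dict[str, HealthSignal] | None,
    current: dict[str, HealthSignal],
    error_threshold: float = 0.7,
) -> list[HealthAlert]:
    """Return alerts only for checks whose pass/fail state changed."""
    if previous is None:
        # Assumption: the first report has no baseline, so no alerts.
        return []
    alerts: list[HealthAlert] = []
    for name, now in current.items():
        before = previous.get(name)
        if before is None or before.passed == now.passed:
            continue  # new check or unchanged state: no transition
        if before.passed:
            # pass -> fail: severity decides between error and warning
            level = "error" if now.severity >= error_threshold else "warning"
            alerts.append(HealthAlert(name, level, f"{name} check started failing"))
        else:
            # fail -> pass: recovery alert at warning level
            alerts.append(HealthAlert(name, "warning", f"{name} check recovered"))
    return alerts
```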
- Pass to fail: `error` (severity ≥ 0.7) or `warning` alert with suggested remediation
- Fail to pass: `warning` recovery alert
Suggested Remediation Actions
- `vaultmind graph-maintain` to rebuild
Changes
New Files
- `src/vaultmind/services/health.py` (308 lines): `HealthStatus` and `CheckType` StrEnums. `HealthSignal`, `HealthReport`, `HealthAlert` frozen dataclasses. `HealthMonitor` class with per-check enable flags, configurable latency threshold, transition detection, and async alert dispatch via registered handlers
- `tests/test_health.py` (442 lines): 20 tests across 4 classes
Modified Files
- `src/vaultmind/config.py`: Added `HealthConfig` class with 10 fields: `enabled`, `check_interval_seconds`, per-check toggles (`check_chromadb`, `check_sqlite`, `check_graph_file`, `check_llm`, `check_bot`, `check_vault_access`), `chromadb_latency_warn_ms`, `retention_days`. Added `health` field to `Settings`
- `config/default.toml`: Added `[health]` section with all config entries
Design Decisions
- `run_check()` is synchronous (safe to call from any context); alerts are buffered in `_pending_alerts` and dispatched via `dispatch_alerts()`, which is async (for Telegram notification delivery)
- […] `search()` call since it's local and free
- […] `Any`: avoids circular imports between services/ and indexer/ or llm/
Test plan
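- 20 tests in `test_health.py` across 4 classes: individual checks (10), overall status computation (4), alert transitions (4), `HealthConfig` validation (2)
- `ruff check`: clean
- `mypy --ignore-missing-imports`: clean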
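The design decision to keep `run_check()` synchronous while dispatching alerts asynchronously can be sketched as a small buffer. Only `_pending_alerts` and `dispatch_alerts()` come from the PR description; the `AlertBuffer` class, `register_handler()`, `record()`, and the use of plain strings for alerts are hypothetical simplifications.

```python
import asyncio
from typing import Awaitable, Callable, List

# Assumption: a handler is any async callable that accepts one alert.
AlertHandler = Callable[[str], Awaitable[None]]


class AlertBuffer:
    """Collect alerts synchronously, deliver them later from async code."""

    def __init__(self) -> None:
        self._pending_alerts: List[str] = []
        self._handlers: List[AlertHandler] = []

    def register_handler(self, handler: AlertHandler) -> None:
        self._handlers.append(handler)

    def record(self, alert: str) -> None:
        # Synchronous: safe to call from a sync check run, no event loop needed.
        self._pending_alerts.append(alert)

    async def dispatch_alerts(self) -> int:
        """Send every buffered alert to every handler; return the alert count."""
        # Swap the buffer out first so alerts recorded during dispatch are kept
        # for the next round instead of being lost or sent twice.
        alerts, self._pending_alerts = self._pending_alerts, []
        for alert in alerts:
            for handler in self._handlers:
                await handler(alert)
        return len(alerts)
```

A caller with an event loop (for example a bot's scheduler) would run the checks, then `await buffer.dispatch_alerts()`; a caller without one can still run checks and leave delivery for later.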