All notable changes to this project are documented in this file.
Dashboard UX improvements, source visibility, manual test suite, and setup documentation hardening.
- COPY buttons: per-panel clipboard buttons on AGENT REASONING and TOOL CALLS panels — copy full panel content as plain text
- Session hold + DISMISS: when a session ends, panels remain visible (status → "Completed — Monitoring"). A ✕ DISMISS button returns to the idle overlay on demand. A new session auto-dismisses any held content.
- MCP tool badge styling: MCP tools (e.g.
get_ospf,traceroute) rendered with a distinct.tool-name.mcpbadge to visually distinguish them from built-in Claude tools (Read, Edit, Bash) - MCP params unwrapping: tool input display strips the outer
{params: ...}wrapper that MCP tools use internally, showing clean parameter key/value pairs - Font size: bumped +1px across all UI elements for improved readability
- New SOURCE meta-group in the dashboard header: shows
<inventory> · <credentials>(e.g.NetBox · VaultorNETWORK.json · .env) when a session is active - Same info added to the Discord "Investigation Started" embed (
📦 Inventory: X · 🔑 Credentials: Yline) core/inventory.py: exposesinventory_sourcemodule-level string ("NetBox"or"NETWORK.json")core/vault.py: newcredential_source()— probes Vault for theainoc/routersecret if not yet cached, returns"Vault"or".env". Self-sufficient in any process regardless of import order.oncall/watcher.py: passes both sources to_write_dashboard_state()andpost_investigation_started()at session start; logs both at watcher startup
- 55 OSPF/BGP troubleshooting scenarios added to
testing/manual_testing.md:- 27 OSPF tests: covers all 7 adjacency criteria (hello/dead timers, area ID/type, network type, authentication, passive interface, MTU, Router ID uniqueness)
- 28 BGP tests: covers all 6 session formation criteria + 11 path selection attributes (Weight through Neighbor IP tie-break)
- Operator applies break configs via SSH; agent diagnoses and proposes fix
metadata/vault/vault_setup.md: added Production: Initialization section — switching vault.hcl to HTTP listener,vault operator init,vault operator unseal, KV engine enable, and unseal-on-reboot caveatmetadata/netbox/netbox_setup.md: added Production: Boot Persistence section — restart policies for Docker Compose containers and optional systemd unit
- UT-026 schema guard updated for new
inventory_sourceandcredential_sourcefields in active session state - 610 unit tests passing
Real-time agent observability dashboard, session stop mechanism, SSH timeout optimizations.
- New
dashboard/ws_bridge.py— always-on WebSocket bridge: tail-follows the watcher's NDJSON session file (stream-json format), parses events, and broadcasts to browser clients. Serves HTTP + WebSocket on a single port (DASHBOARD_PORTenv var, default 5555) using websockets 16.0'sprocess_requestcallback — no separate HTTP server needed - New
dashboard/index.html— single-file browser UI (dark NOC theme): two panels — AGENT REASONING (markdown-rendered, streamed incrementally) and TOOL CALLS (collapsible timeline with tool name, inputs, and output). Auto-scroll with lock toggle, session header with live elapsed timer and cost display, idle overlay between sessions - New
dashboard/oncall-dashboard.service— systemd unit (independent ofoncall-watcher.service; communication is filesystem-only viadata/dashboard_state.json+ session NDJSON files) - Late-joining browser clients receive a full replay buffer (up to 200 events) on connect — no missed events
- WebSocket event types surfaced:
init,session_start,session_idle,reasoning,tool_start,tool_input_complete,tool_result,session_end - Tool input JSON streamed incrementally (
input_json_deltachunks) and accumulated per content-block index before display - Remote access: SSH tunnel (
ssh -L 5555:localhost:5555) or direct LAN
- Sentinel file pattern (
data/stop_session): any actor creates it → watcher detects within 2s → kills agent tmux session and posts Discord error notification - Dashboard "■ STOP" button (red, visible only during active sessions): sends
{"action": "stop"}via WebSocket → bridge writes sentinel → watcher acts within 2s - CLI stop:
touch /home/mcp/aiNOC/data/stop_sessionduring an active session — same effect - Stale sentinel cleared automatically at session start so it never blocks the next session
- Switched
--output-format json→stream-json --verbose --include-partial-messageswithstdbuf -oL(forces line-buffered stdout to file; prevents block-buffering delay that would stall the dashboard) --verboseis required — without it, nostream_eventobjects are emitted- New
_write_dashboard_state()helper — writesdata/dashboard_state.jsonat session start (withsession_name,session_file,state: "active") and session end (state: "idle") - Cost parsing updated: reverse-scans NDJSON lines for
{"type": "result", "total_cost_usd": ...}(was a single JSON object) DASHBOARD_RETAIN_LOGS=1env var — when set, session NDJSON files are kept after session end for post-mortem review (default: deleted)
SSH_TIMEOUT_TRANSPORT: 30 → 15s — SSH handshake; devices respond in <5s or are unreachableSSH_TIMEOUT_OPS_LONG: 60 → 45s — traceroute; IOS XE completes in ~30sSSH_RETRIES: 2 → 1 — 2 total attempts instead of 3- Worst-case per MCP call: 3 × 30s + 2 × 2s = 94s → 2 × 15s + 1 × 2s = 32s
- Per-command
timeout_opsmoved from Scrapli connection constructor (_connection_params) tosend_command(command, timeout_ops=timeout_ops)— connection setup always usesSSH_TIMEOUT_OPS(30s); only the traceroute command execution uses the longer 45s timeout where it is actually needed
- UT-026 (
testing/agent-testing/unit/test_ws_bridge.py): 39 tests —_strip_tool_prefix, NDJSON parsing (malformed/empty input, result line, text_delta, full tool_use lifecycle via content_block_start/delta/stop, tool_result, incomplete partial JSON), event buffer ring behavior, session state broadcast - Watcher stop-sentinel detection and dashboard state file tests added to existing watcher test files
- 472 → 577 total tests passing
- New:
websockets>=16.0,<17.0
- New
core/vault.py— thin Vault KV v2 client withget_secret(path, key, fallback_env): reads secrets from Vault, caches per-path, falls back toos.getenv()when Vault is not configured or unreachable - Vault paths:
ainoc/router(username, password),ainoc/jira(api_token),ainoc/discord(bot_token) - Consumers updated to use
get_secret():core/settings.py— router credentialscore/jira_client.py— Jira API tokencore/discord_approval.py— Discord bot token
- New env vars:
VAULT_ADDR,VAULT_TOKEN(both optional — Vault is fully optional) - New dependency:
hvac>=2.3,<3.0 - Setup guide:
metadata/vault/vault_setup.md
- New
core/netbox.py— NetBox device inventory loader via pynetbox: maps NetBox devices to the same{host, platform, transport, cli_style, location}schema asNETWORK.json core/inventory.pyrewritten — tries NetBox first, falls back toNETWORK.jsonwhen NetBox is not configured, unreachable, or returns no valid devices- NetBox custom fields on Device model:
transport(asyncssh/restconf),cli_style(ios) - New
metadata/netbox/populate_netbox.py— idempotent pynetbox script that creates all prerequisite objects (custom fields, manufacturer, device types, platform, roles, sites) and 9 devices with management interfaces and IPs - New dependency:
pynetbox>=7.4,<8.0 - Setup guide:
metadata/netbox/netbox_setup.md
core/vault.py: INFO log on first Vault read per path; DEBUG log when Vault not configuredcore/netbox.py: INFO log with device count on successful loadcore/inventory.py: INFO log showing which source loaded the inventory (NetBox vs NETWORK.json)
- UT-019 (
test_vault.py): 9 tests — env var fallback, Vault reads with mock hvac, caching, error fallback - UT-020 (
test_netbox.py): 9 tests — None on missing config, pynetbox exceptions, schema mapping, CIDR stripping, field validation - 454 → 472 total tests passing (includes test_watcher_discord_notifications.py now registered as UT-021)
- Transient false positive: the agent incorrectly concluded "transient — recovered without intervention" when the SLA path recovered via an alternate ISP (IBN) rather than the expected path (IAN). IAN Eth0/3 was still admin-down. Root cause: LLM non-determinism in applying Principle 2 (off-path detection). Fixed by two complementary changes:
- Prompt enrichment:
invoke_claude()now looks up the SLA path insla_paths/paths.jsonby source device and injectsscope_devices+ expected path description directly into the prompt. The agent has the scope list immediately, without needing to recall it from an earlier file read. - Oncall skill: Step 1 now includes an explicit mandatory scope check (between traceroute call and outcome bullets) that forces hop-by-hop comparison against
scope_devices. Step 1a Branch A condition updated to explicitly require all hops within scope.
- Prompt enrichment:
--output-format json: Claude is now invoked with--output-format json, with stdout redirected tologs/session-oncall-<timestamp>.md. This replaces all failed terminal-capture approaches (pipe-pane, capture-pane, capture-pane+alternate-screen-off). The JSON envelope containstotal_cost_usd,num_turns,usage, and the fullresulttext — reliable, no escape code issues.- Removed dead code:
_ANSI_REregex and_clean_session_log()function removed (ANSI stripping was only needed for pipe-pane output).set-option -g history-limit 50000andset-option alternate-screen offremoved from tmux setup. - tmux attach link removed: the
📺 Session details: tmux attach -t <session>line is removed from the Discord investigation-started embed. It was useless — shows empty terminal during investigation (alternate screen), "no sessions" after (killed in finally).
- Session duration and exit: "Agent session ended." now includes duration and exit classification:
"Agent session ended. Duration: 2m46s, exit: normal"(orcrash (code N)/timeout (force-killed)). - Session cost: after session end,
total_cost_usdandnum_turnsare parsed from the JSON output file and logged:"Session cost: $0.1141 | turns: 5". - Approval audit: after
_post_discord_session_notification, watcher readsdata/pending_approval.jsonand logs approval status, decided_by, risk level, and devices. If no approval was requested, logs"No approval requested this session (transient/recovered)". - Session cost in Discord embeds:
post_session_completeandpost_session_errornow acceptsession_costand display a 💰 Cost inline field when available.
- Duplicate Discord notifications: fixed a
try/except/elsesemantics bug ininvoke_claude()that caused both a red "Agent Session Error" embed and a green "Session Complete — transient" embed to be posted when the agent crashed. Python'stry/except/elsefires theelsewhenever thetrybody raises no exception — not only when noif/elifbranch matched. Fixed by movingpost_session_completeinto theelseof theif/elifchain inside thetry. The notification block is now extracted into_post_discord_session_notification()for testability. - Crash cooldown UnboundLocalError:
main()was missingglobal _last_crash_tsdeclaration. The_last_crash_ts = Noneassignment (cooldown expiry clear) caused Python to treat the variable as local throughout the function, crashing withUnboundLocalErroron every SLA Down event. Recovery events were unaffected (theycontinuebefore the cooldown check). Result: watcher appeared to run but silently crashed on every Down event.
- Crash cooldown: after an agent crash (non-zero exit code), new sessions are suppressed for
CRASH_COOLDOWN_MINUTES(default 5) to prevent wasting API calls when the failure is systemic (e.g. API credit limits, authentication errors). The cooldown timestamp is cleared automatically once the window expires. The cooldown state is module-level and independent of Discord configuration. - Agent timeout:
_wait_for_tmux_process_exit()now enforces a deadline (default 30 min). If Claude doesn't exit within the timeout, the tmux session is force-killed viatmux kill-session, the watcher logs a warning, and the lock file is released so new sessions can proceed. Configurable viaAGENT_TIMEOUT_MINUTESenv var. - tmux session cleanup: after the agent exits, the tmux session is explicitly killed (
tmux kill-session). Sessions no longer accumulate indefinitely.
- Investigation-started notification: when the watcher spawns an agent session, it immediately posts a blue informational embed to Discord ("🚨 NEW ISSUE: DEVICE {name} — Investigation Started") so the operator is notified before the investigation even begins.
- Progress updates: after 60 seconds of active investigation the watcher posts 🔍 "Still investigating network state..." and after 120 seconds 🔍 "Investigation ongoing, please wait..." to Discord. Each message only fires if the agent is still running at that mark — crashes before the threshold produce no progress message.
- Acknowledgment messages: after the operator reacts with ✅ or ❌, a confirmation reply is posted ("Approval received from @user — aiNOC is proceeding with the fix." / "Rejection received from @user — aiNOC will not apply the fix.").
- Jira ticket in outcome embeds:
post_approval_outcomenow reads the issue key from the approval state file and includes a "Ticket SUP-xx updated" field in the Discord outcome embed. - Removed duplicate expiry message:
request_approvalno longer auto-posts an expiry outcome. All outcome posts (approved, rejected, expired) are handled by the agent viapost_approval_outcome, which includes the Jira ticket reference. Previously, expiry caused two identical-looking Discord messages.
- When Discord is not configured,
request_approvalwritesstatus: "SKIPPED"(previously was writingAPPROVED). Thepush_configgate rejects SKIPPED status — no Discord = no push, enforced at code level. - Integration tests updated:
_approve_devices()helper writes a valid APPROVED record before eachpush_configcall so the gate passes in test context. - New env var:
AGENT_TIMEOUT_MINUTES=30
- UT-017 (
test_approval.py): SKIPPED status assertion added;post()method added to MockSessions for ack message tests;post_approval_outcometests updated for Jiraissue_keyparam. - UT-018 (
test_config_approval_gate.py): SKIPPED added to bad-status parametrize list. - UT-021 (
test_watcher_discord_notifications.py): 10 tests covering Discord notification exclusivity (crash/timeout/watcher-exc → error only; normal exit → complete only; approval-requested → neither) and crash cooldown behaviour (timestamp set on crash, skips within window, clears after expiry, not set on normal exit). - Integration tests (
test_mcp_tools.py): all 8 push_config tests now call_approve_devices()before each push. - 443 → 452 total tests passing.
- New
core/discord_approval.pymodule — Discord bot API integration:post_approval_request(),poll_for_reaction(),post_outcome() - New
tools/approval.py— two MCP tools registered:request_approval: posts a rich embed to a configured Discord channel with findings, commands, devices, and risk level. Adds ✅/❌ reactions and polls for operator response. Returns"approved"/"rejected"/"expired"/"skipped"decision.post_approval_outcome: posts the final outcome (approved+verified, rejected, expired) as a Discord reply after fix + verification
- Discord-primary: when Discord is configured, the operator approves via Discord embed. When Discord is not configured, the agent logs to Jira that no approval channel is available and exits without pushing config.
- No Discord = no push: if Discord not configured (
DISCORD_BOT_TOKEN/DISCORD_CHANNEL_IDabsent),request_approvalreturns"skipped"— the agent must proceed to Session Closure without applying any fix - Audit trail: every approval request and outcome written to
data/pending_approval.json(runtime state, gitignored) - New env vars:
DISCORD_BOT_TOKEN,DISCORD_CHANNEL_ID,APPROVAL_TIMEOUT_MINUTES(default 10) - Setup guide:
metadata/discord/discord_setup.md - 13 → 15 MCP tools registered
push_confignow verifiesdata/pending_approval.jsonbefore executing any commands — architectural enforcement independent of prompt instructions- Requirements: record must exist with
status: "APPROVED"and device list must exactly match the push targets (sorted comparison). Pushing to unapproved devices is blocked even if an approval record exists for different devices. - Blocks with a descriptive error: no record, wrong status (
REJECTED,EXPIRED,PENDING,SKIPPED), device mismatch, orEXECUTEDreplay - After a successful push, record is marked
EXECUTED— a second push on the same approval is blocked; a newrequest_approvalcall is required - When Discord is not configured,
request_approvalwrites aSKIPPEDrecord —push_configis blocked at the code gate, enforcing the same policy as the prompt instructions - Previously: approval was prompt-level only (CLAUDE.md Pitfall #16). Now: two independent enforcement layers — code gate (architectural) + prompt instructions (behavioral)
- Interactive mode removed — the watcher always runs Claude in tmux + print mode (
-p). Claude processes its prompt, uses MCP tools, and exits automatically when done. No/exitneeded, no operator at the CLI required. - Single code path:
--serviceflag removed fromwatcher.pyand systemdExecStart. tmux is now a hard requirement checked at startup. - Session output logging: each session's full output is streamed via
tmux pipe-panetologs/session-oncall-<timestamp>.mdfor post-incident review. - Watcher resumes monitoring immediately after Claude exits —
_wait_for_tmux_process_exit()pollspane_deadsoremain-on-exit onsessions don't block the watcher. - tmux session cleanup: after Claude exits and the session log is cleaned, the watcher kills the tmux session (
tmux kill-session). Full session output is preserved inlogs/session-oncall-<timestamp>.md. (Note:remain-on-exit onis still set to keep the pane alive until the watcher's cleanup runs.)
invoke_deferred_review()deleted — no second agent session is spawned for deferred failures.- Deferred documentation: after the primary session ends,
watcher.pyscans for concurrent failures, adds a Jira comment to the original ticket, and posts an informational Discord embed. No agent cost, no autonomous investigation. - New
_document_deferred_events()helper inwatcher.py. Newpost_deferred_list()incore/discord_approval.py. - Removed:
PENDING_EVENTS_FILE,DEFERRED_FILE,save_pending_events(), stale file cleanup at startup. .gitignore: removedoncall/pending_events.json+oncall/deferred.json; addedlogs/session-*.md.
- Added Step 4: Approval, Remediation & Session Closure to
skills/oncall/SKILL.md— the skill is now a complete end-to-end workflow. Previously it ended at "Presenting Findings" with no bridge to the approval/remediation lifecycle (CLAUDE.md steps 4–6). An agent following the skill alone could skip Discord approval entirely. - Updated CLAUDE.md: "user is supervising the workflow via the Claude Code console" → "operator supervises via the Claude Code console and/or Discord remote approval" (accurate for remote approval scenarios)
- Added CLAUDE.md Pitfall #16 (never call
push_configwithout approval) and Pitfall #17 (always callpost_approval_outcomeafter resolution)
metadata/about/guardrails.md— expanded Agent Autonomy Approval section: documents code-level gate, Discord-primary approval model, exact device match requirement, and replay prevention. Replaces the single-line "no-auto-push rule" with a full three-layer description.
- UT-018 (
test_config_approval_gate.py): 10 unit tests covering all gate scenarios — no record, bad status (4 variants), replay, device mismatch, superset mismatch, successful push, EXECUTED marking - UT-014 (
test_config_push.py): updated to bypass approval gate via_NO_APPROVAL_ERRORmock (gate tested separately in UT-018) - 408 → 430 unit tests
- Maintenance window feature removed entirely — aiNOC runs fully in on-call context (interactive or service mode), so time-gated change control was functionally inert
- Deleted
policy/MAINTENANCE.json - Deleted
tools/state.check_maintenance_window()andpytzdependency - Removed
on_callparameter fromConfigCommandmodel andpush_config() - Unregistered
check_maintenance_windowMCP tool (13 tools now) - Deleted
testing/agent-testing/unit/test_maintenance_window.py(UT-007) - 415 → 401 unit tests
- Deleted
- Deleted
metadata/transports/transports.txt(orphaned reference note) - Deleted empty
vendors/directory placeholder - Deleted
transport/pool.py(no-op stub —async def close_sessions(): pass); simplifiedMCPServer.pylifespan to none - Removed NETCONF legacy acceptance branch from
ShowCommand.must_be_read_only({"filter":...}/{"get":...}) — NETCONF was removed in v5.0; these forms are now rejected like any other unknown JSON key - Removed dead constants
_OSPF_OPER_KEY/_BGP_OPER_KEYfromtools/protocol.py— leftover YANG path strings from before platform_map.py owned URL building - Deleted
testing/agent-testing/cookie.txt(libcurl artifact) and stale.pycbytecache files - Added 3 missing MCP tool allow rules to
.claude/settings.local.json/.claude/settings.local.example.json:get_routing_policies,run_show,get_intent - Updated
skills/redistribution/SKILL.mddevice names to current topology (D1C/R3C/R8C/B1C/B2C → E1C/C1C/C2C/A1C/A2C) - Fixed stale label in
test_mcp_tools.py:test_push_config_ios_netconf→test_push_config_ios_restconf - Rewrote
metadata/about/scalability.md— comprehensive contributor guide for adding protocols/vendors, synchronized with current implementation - Added scalability guide link to
README.md; fixed pitfall count infile_roles.md(15→14); added IT-005 to test tables
- BGP skill (
skills/bgp/SKILL.md): Added OpenConfirm state (RFC 4271 §8); fixed Active/Connect state descriptions; reordered Session Checklist (AS numbers before timers — more fundamental); added "Session Established but Zero Prefixes" section (address-family activation); added "Session Flapping / Reset Reasons" table; updated iBGP/RR section scope note (current topology is eBGP-only); added community handling omission note; fixed RR cluster-id explanation (RFC 4456 §8 accuracy); documentedclear ip bgpFORBIDDEN limitation in Verification Checklist - OSPF skill (
skills/ospf/SKILL.md): Added LOADING state to neighbor table; added P2MP timers (Hello 30s/Dead 120s); added NSSA Totally Stubby area type; added ABR route summarization (area range) section; added distribute-list filtering section (LSA-present/route-absent symptom); enhanced INIT state description (asymmetric link cause); added RFC 3101 §2.3 inline reference - Routing skill (
skills/routing/SKILL.md): Removed misleading "ios only" annotations; added distribute-list cross-reference to OSPF skill; added BGPmaximum-pathsdefault note (defaults to 1 — no ECMP without explicit config); cleaned up Query Reference table (removed redundant Platform support column) - On-Call skill (
skills/oncall/SKILL.md): Added Terminology section defining primary vs deferred review session (anchorspending_events.jsonconcept); clarified Step 2 ECMP precondition; addedlessons.mdread reminder before Step 0 - CLAUDE.md: Added Pitfall #15 (
clearcommands FORBIDDEN); added redistribution showcase entry to Skills Library table metadata/about/file_roles.md: Removed stalepool.pyreference; updated pitfall count (14→15)
Cisco-only architecture with 2-tier transport (RESTCONF→SSH). 9 devices, all Cisco IOS/IOS-XE. Other vendors available as customizable modules per client need.
- 9-device Cisco IOS/IOS-XE topology (2 platforms: cisco_iol, cisco_c8000v)
- OSPF Area 0 + Area 1 stub, BGP dual-ISP (AS1010↔AS4040/AS5050), BGP AS2020 at X1C
- 5 SLA paths: OSPF cost-based primary/backup ABR selection (A1C/A2C via C1C/C2C)
- Full redundancy across the Access, Collapsed Core, Edge-to-ISP layers
- 2-tier for c8000v: RESTCONF (primary, httpx/JSON) → SSH (fallback, Scrapli/CLI)
- SSH-only for IOL: A1C, A2C, IAN, IBN
ActionChainclass inplatform_map.pyfor ordered transport fallback- Config push: all devices use SSH CLI
- Clear separation between Interactive and Service modes
PLATFORM_MAPwith 2 distinct sections:ios,ios_restconftransport/restconf.py— httpx AsyncClient for RESTCONF reads;- RESTCONF now has dedicated BGP/OSPF trim functions to reduce token cost
- Transport dispatcher with ActionChain fallback iteration +
_transport_usedtag
- Deferred-event scanner deduplicates by
(device, msg)— SLA oscillation (Down→Up→Down) no longer triggers false deferred sessions - Jira client: module-level globals replaced with
_config()helper that reads env vars at call time (fixes stale-config under systemd) oncall-watcher.service: EnvironmentFile commented out — systemd doesn't strip inline comments from.env, corrupting values that python-dotenv handles correctly
- Result dict includes
_commandfield (actual CLI command or RESTCONF URL) right afterdevicefor inline visibility in Claude Code - Debug logging added to SSH and RESTCONF executors
- 416 unit + watcher-events tests (up from 244)
- 16 unit test files, covering: transport dispatch, RESTCONF/SSH executors, config push, tool layer, Jira tools
- New integration tests: full MCP tool coverage, transport layer, platform coverage
- New test: deferred excludes trigger device's repeated SLA oscillation events
On-Call-first architecture. Standalone mode retired as an official mode. Tool set simplified.
- Retired Standalone Mode as an official workflow — On-Call is now the primary mode; ad-hoc console troubleshooting remains supported via the 6 Core Principles
- Removed
snapshot_statetool and all snapshot infrastructure (feature was write-only — no programmatic reader existed) - Added
on_call: boolparameter topush_config— bypasses the maintenance window whenTrue(On-Call fixes apply at any hour) - Risk assessment (
assess_risk) now surfaced before user approval: agent calls it in On-Call step 4 and includes risk level in the findings table
- 281 tests: removed 6 snapshot input validation tests, added 3 on_call model/bypass tests
- Manual E2E: retired ST-00x Standalone test suite; OC-001, MW-001, and WB-001–004 remain
- 15 MCP tools registered (snapshot_state removed)
Major quality, reliability, and security release. No new protocols or vendors — hardened foundation for v4.5.
- Enforced maintenance windows in
push_config(blocked outside policy) - Restricted
run_showto read-only commands (no config bypass) - RouterOS REST validation (forbidden paths blocked, POST rejected)
- Syslog prompt injection mitigation (sanitize + delimiter)
- Expanded forbidden command set (5 → 14 patterns)
- Configurable TLS/SSL per transport:
VERIFY_TLSROUTEROS_USE_HTTPSSSH_STRICT_HOST_KEY
- Decomposed monolithic
MCPServer.py(798 lines) into:tools/transport/core/input_models/
- Implemented bounded LRU cache (256 entries, TTL-based eviction)
- Added connection pooling for eAPI and REST transports
- Enforced HTTP timeouts on all device and Jira connections
- Added structured JSON logging with configurable levels
- Introduced 6 Core Troubleshooting Principles (mandatory, ordered) — see
CLAUDE.md - Rewrote Standalone Mode into 10 deterministic steps with decision gates
- Added protocol skill prerequisite gates (interfaces + neighbors verified before deep investigation)
- Implemented role-aware risk assessment using
INTENT.jsonand SLA paths
- SLA recovery (Up) event detection and logging
- Added service mode (
--serviceflag, renamed from-d/--daemon) with tmux session support andwallnotification - Added systemd service file (
oncall/oncall-watcher.service) for production deployment Added pre-change snapshot support in(removed in v4.5 — feature was write-only)push_config- Generated rollback advisory for all config changes
- 230 unit tests across 9 test files (up from 3 in v3.0) (229 in v4.5 after snapshot tests removed)
- 4 integration test files with
NO_LABskip guards - 12 manual E2E scenarios:
- 7 standalone
- 1 on-call
- 1 maintenance window
- 3 watcher
- Enforced Pydantic
Literalvalidation on all query parameters
Focus: Multi-mode operations, improved diagnosis flow, optimized AI performance, reduced hallucinations and costs.
- Added
mcp_tool_map.jsonfor improved MCP tool selection - Updated
INTENT.jsonfor cleaner network context - Added
CLAUDE.mdwith defined workflows and guidance - Added troubleshooting skills for improved coherence
- Added
cases.mdandlessons.md(seecases/) - aiNOC now documents cases and curates reusable lessons
- Well-defined test suites
- Regression test checklist
- Added MikroTik API reference
- Minor bug fixes
Focus: Topology expansion, MCP toolset improvements, optimized AI performance, reduced hallucinations and costs, beyond SSH connectivity.
- Structured outputs:
- Cisco: Genie
- Arista: eAPI
- MikroTik: REST API
- Strict command determinism:
platform_map- Query enums in input models
- Platform-aware commands
- Tool caching to prevent duplicate commands and troubleshooting loops
- Protocol-specific MCP tools
- Targeted config sections (avoiding full
show rundumps) - Updated
INTENT.jsonandNETWORK.json - Legacy
run_showtool now fallback-only - Improved tool docstrings
- Routers: 20
- MCP tools: 14
- New vendor: MikroTik
- New protocol: BGP
- Cisco: Genie parsing
- Arista: eAPI (replacing SSH)
- MikroTik: REST API queries
- Platform command map
- Improved topology diagram
- Cleaner, modular codebase
- Enhanced
INTENT.json - Minor bug fixes
- Routers: 11
- Protocols: 2 (OSPF, EIGRP)
- Vendors: 2 (Arista, Cisco)
- MCP server tools: 6