
fix(events): follow-up to #310 — AI agent event filtering, Path B re-enable, and correlation backfill #311

@alexandervazquez98

Description

Pre-flight Checks

  • I have searched existing issues and this is not a duplicate
  • I understand this issue needs status:approved before a PR can be opened

Problem Description

Issue #310 closes the write-side RCA gap (Path A legacy engine, Path C leased worker, and CLI poll alerts) and gates the two most critical downstream consumers, escalation_notifier.py and routers/events.py (ITSM ticket creation), so they no longer fire on or surface PROPAGATED events as independent incidents. Three additional gaps remain; they are explicitly out of scope for #310 and are tracked here.

Items pending (will become follow-up PRs)

1. AI agent event filtering

  • Where: backend/services/ai_chat_service.py and any related AI harness that consumes events.
  • Why pending: #310 deliberately does not touch the AI agent because it is the most complex consumer (RAG, conditional tool calls, memory), and changing what events it "sees" deserves its own analysis and test set. The agent currently surfaces ROOT and PROPAGATED events as if they were independent failures.
  • Proposed approach: gate ai_chat_service.py on a shared _is_authoritative_event() helper (analogous to _is_authoritative_availability_event at backend/services/event_service.py:466), expose correlation_type and root_cause_ci_id in the AI tool input, and decide whether to filter PROPAGATED by default or surface it as secondary context.
  • Acceptance criteria:
    • AI agent queries that return events no longer include PROPAGATED events unless explicitly requested.
    • When a chain has multiple events, the AI response identifies the root cause by root_cause_ci_id.
    • Tests cover: chain with mixed ROOT + PROPAGATED; agent asked about a specific ci_id; agent asked about a chain.

2. Path B re-enable decision (architectural)

  • Where: backend/services/snmp_service.py:snmp_collector_loop + backend/main.py:298-302 + docker-compose.yml:81.
  • Why pending: Path B (in-process snmp_collector_loop) is the only event-write path that already performs RCA correctly (snmp_service.py:548-558). It is currently disabled in production via DISABLE_BACKEND_COLLECTOR=true. Once #310 lands and Path A, Path C, and CLI poll alerts all perform RCA inline, Path B becomes architecturally redundant. The decision to keep it dormant, deprecate it, or bring it back as the primary path is out of scope for #310.
  • Proposed approach: open a dedicated architectural review. Compare (a) keeping all three paths (legacy engine, in-process, leased polling) with Path B dormant; (b) deprecating Path B entirely; (c) re-enabling Path B and removing the other two.
  • Acceptance criteria:
    • Decision recorded in an ADR or design doc.
    • If Path B is deprecated: code removed, references cleaned, tests adjusted.
    • If Path B is re-enabled: docker-compose env contract updated, monitor confirms production traffic moves through Path B.

3. Correlation backfill migration

  • Where: backend/scripts/ (new migration script) + Neo4j Event nodes.
  • Why pending: Existing events in Neo4j written before #310 lands have correlation_type = 'ROOT' (or no correlation_type at all for legacy CLI_POLL_ALERT events) even when they belong to a dependency chain. After #310, new events will be correctly tagged, but historical events will still look like independent failures.
  • Proposed approach: write an idempotent migration that:
    1. For each OPEN Event node, runs the same topology traversal find_open_parent_event does.
    2. If a parent OPEN event is found, updates correlation_type to PROPAGATED, sets propagated_from and root_cause_ci_id.
    3. Adds correlation_type = 'ROOT' to legacy CLI_POLL_ALERT events that have no correlation_type set.
  • Acceptance criteria:
    • Migration script is idempotent (running twice produces the same result).
    • Migration preserves correlation_type for events that are already correctly tagged.
    • Migration emits a count summary (X events updated to PROPAGATED, Y events backfilled to ROOT).
    • Migration is wired into the deploy pipeline OR documented as a one-shot operator step.
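The three migration steps and the idempotency requirement can be sketched in memory as follows. This is an illustration under stated assumptions, not the migration itself: events are plain dicts with hypothetical `id`, `ci_id`, `status`, and `event_type` fields, `find_parent` is a stand-in callable for whatever find_open_parent_event does against Neo4j, and `propagated_from` is assumed to hold the parent event's id.

```python
from typing import Any, Callable, Optional

Event = dict[str, Any]  # assumed shape of an Event node's properties
ParentLookup = Callable[[Event], Optional[Event]]  # stand-in for the topology traversal


def backfill_correlation(events: list[Event], find_parent: ParentLookup) -> dict[str, int]:
    """Tag historical events in place and return a count summary."""
    summary = {"propagated": 0, "root_backfilled": 0}
    for event in events:
        if event.get("correlation_type") == "PROPAGATED":
            continue  # already tagged: skipping these keeps reruns a no-op

        # Step 1: only OPEN events are checked for an open parent.
        parent = find_parent(event) if event.get("status") == "OPEN" else None

        if parent is not None:
            # Walk up the chain so the result does not depend on processing order.
            root, seen = parent, {event["id"], parent["id"]}
            while True:
                nxt = find_parent(root)
                if nxt is None or nxt["id"] in seen:  # top of chain or cycle guard
                    break
                seen.add(nxt["id"])
                root = nxt
            # Step 2: mark as PROPAGATED and record the parent and root cause.
            event["correlation_type"] = "PROPAGATED"
            event["propagated_from"] = parent["id"]
            event["root_cause_ci_id"] = root["ci_id"]
            summary["propagated"] += 1
        elif event.get("correlation_type") is None and event.get("event_type") == "CLI_POLL_ALERT":
            # Step 3: legacy CLI poll alerts with no tag become explicit ROOT events.
            event["correlation_type"] = "ROOT"
            summary["root_backfilled"] += 1
    return summary
```

Running the function a second time over the same events should report zero updates, which is one way to assert the idempotency criterion in a test. A real script would replace the in-memory list with Neo4j reads and writes inside a transaction.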

Out of scope (intentional, not pending)

  • Audit log behavior: backend/services/audit_service.py (or equivalent) keeps logging both ROOT and PROPAGATED events as separate records. This is the correct behavior for audit (forensic completeness) and is NOT a bug.
