Pre-flight Checks
I have searched existing issues and this is not a duplicate
I understand this issue needs status:approved before a PR can be opened
Problem Description
Issue #310 closes the write-side RCA gap (Path A legacy engine, Path C leased worker, and CLI poll alerts) and gates the two most critical downstream consumers, escalation_notifier.py and routers/events.py (ITSM ticket creation), so they no longer fire on or surface PROPAGATED events as independent incidents. Three additional gaps remain that are explicitly out of scope for #310 and are tracked here.
Items pending (will become follow-up PRs)
1. AI agent event filtering
Where: backend/services/ai_chat_service.py and any related AI harness that consumes events.
Why pending: #310 deliberately does not touch the AI agent because it is the most complex consumer (RAG, conditional tool calls, memory), and changing which events it "sees" deserves its own analysis and test set. The agent currently surfaces ROOT and PROPAGATED events as if they were independent failures.
Proposed approach: gate ai_chat_service.py on a shared _is_authoritative_event() helper (analogous to _is_authoritative_availability_event at backend/services/event_service.py:466), expose correlation_type and root_cause_ci_id in the AI tool input, and decide whether to filter PROPAGATED by default or surface it as secondary context.
Acceptance criteria:
AI agent queries that return events no longer include PROPAGATED events unless explicitly requested.
When a chain has multiple events, the AI response identifies the root cause by root_cause_ci_id.
Tests cover: chain with mixed ROOT + PROPAGATED; agent asked about a specific ci_id; agent asked about a chain.
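2. Path B re-enable decision (architectural)
Where: backend/services/snmp_service.py:snmp_collector_loop + backend/main.py:298-302 + docker-compose.yml:81.
Why pending: Path B (snmp_collector_loop) is the only event-write path that already performs RCA correctly (snmp_service.py:548-558). It is currently disabled in production via DISABLE_BACKEND_COLLECTOR=true. Once #310 lands and Path A / Path C / CLI poll alerts all perform RCA inline, Path B becomes architecturally redundant. The decision to keep it dormant, deprecate it, or bring it back as the primary path is out of scope for #310.
Proposed approach: open a dedicated architectural review. Compare (a) keeping all three paths (legacy engine, in-process, leased polling) with Path B dormant; (b) deprecating Path B entirely; (c) re-enabling Path B and removing the other two.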
Acceptance criteria:
Decision recorded in an ADR or design doc.
If Path B is deprecated: code removed, references cleaned, tests adjusted.
If Path B is re-enabled: docker-compose env contract updated, monitor confirms production traffic moves through Path B.
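For reference when reviewing the env contract, a flag such as DISABLE_BACKEND_COLLECTOR (which this issue says currently disables Path B in production) might be read along these lines. This is only a sketch: the function name and the set of accepted truthy strings are assumptions, and it is not the code at backend/main.py:298-302.

```python
import os
from typing import Mapping, Optional

# Assumption: these spellings count as "true"; the real contract may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def backend_collector_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the in-process collector (Path B) should start.

    The collector stays off only when DISABLE_BACKEND_COLLECTOR is truthy.
    """
    env = os.environ if env is None else env
    raw = env.get("DISABLE_BACKEND_COLLECTOR", "")
    return raw.strip().lower() not in _TRUTHY


# Production-like setting from the issue: Path B stays dormant.
dormant = backend_collector_enabled({"DISABLE_BACKEND_COLLECTOR": "true"})
# Flag absent: Path B would start.
active = backend_collector_enabled({})
```

Whichever option the review picks, stating the accepted values and the default explicitly makes the "docker-compose env contract updated" criterion checkable.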
3. Correlation backfill migration
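Where: backend/scripts/ (new migration script) + Neo4j Event nodes.
Why pending: events written before #310 have correlation_type = 'ROOT' (or no correlation_type at all for legacy CLI_POLL_ALERT events) even when they belong to a dependency chain. After #310, new events will be correctly tagged, but historical events will still look like independent failures.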
Proposed approach: write an idempotent migration that:
For each OPEN Event node, runs the same topology traversal find_open_parent_event does.
If a parent OPEN event is found, updates correlation_type to PROPAGATED, sets propagated_from and root_cause_ci_id.
Adds correlation_type = 'ROOT' to legacy CLI_POLL_ALERT events that have no correlation_type set.
Acceptance criteria:
Migration script is idempotent (running twice produces the same result).
Migration preserves correlation_type for events that are already correctly tagged.
Migration emits a count summary (X events updated to PROPAGATED, Y events backfilled to ROOT).
Migration is wired into the deploy pipeline OR documented as a one-shot operator step.
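The migration steps and the idempotency criterion can be illustrated with an in-memory sketch. Everything here is an assumption made for the example: events are plain dicts rather than Neo4j nodes, topology is a simple child-to-parent map, open_ancestors is a stand-in for whatever find_open_parent_event actually does, the root cause is taken to be the topmost ancestor with an OPEN event, and the ROOT backfill is applied to any untagged event rather than only to CLI_POLL_ALERT ones.

```python
from typing import Any, Dict, List, Optional, Set


def open_ancestors(
    ci_id: str, parents: Dict[str, str], open_ci_ids: Set[str]
) -> List[str]:
    """Walk up the assumed parent map; return ancestors with an OPEN event,
    nearest first. The `seen` set guards against cycles in the topology."""
    found: List[str] = []
    seen = {ci_id}
    current: Optional[str] = parents.get(ci_id)
    while current is not None and current not in seen:
        seen.add(current)
        if current in open_ci_ids:
            found.append(current)
        current = parents.get(current)
    return found


def backfill(events: List[Dict[str, Any]], parents: Dict[str, str]) -> Dict[str, int]:
    """Tag events in place and return a count summary of what changed."""
    open_ci_ids = {e["ci_id"] for e in events if e["status"] == "OPEN"}
    summary = {"to_propagated": 0, "to_root": 0}
    for event in events:
        ancestors = (
            open_ancestors(event["ci_id"], parents, open_ci_ids)
            if event["status"] == "OPEN"
            else []
        )
        if ancestors:
            # Already PROPAGATED events are left untouched (idempotency).
            if event.get("correlation_type") != "PROPAGATED":
                event["correlation_type"] = "PROPAGATED"
                event["propagated_from"] = ancestors[0]
                event["root_cause_ci_id"] = ancestors[-1]
                summary["to_propagated"] += 1
        elif event.get("correlation_type") is None:
            # Untagged legacy event with no open parent: backfill to ROOT.
            event["correlation_type"] = "ROOT"
            summary["to_root"] += 1
    return summary


# Three-level chain: access-sw-7 -> dist-sw-2 -> core-sw-1.
parents = {"access-sw-7": "dist-sw-2", "dist-sw-2": "core-sw-1"}
events = [
    {"ci_id": "core-sw-1", "status": "OPEN", "correlation_type": None},
    {"ci_id": "dist-sw-2", "status": "OPEN", "correlation_type": "ROOT"},
    {"ci_id": "access-sw-7", "status": "OPEN", "correlation_type": "ROOT"},
]

first_run = backfill(events, parents)
second_run = backfill(events, parents)
```

The first run retags the two downstream events and backfills the untagged one; the second run changes nothing, which is the property the first acceptance criterion asks for. A real script would run the equivalent reads and writes against Neo4j and print the summary for the operator.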
Out of scope (intentional, not pending)
Audit log behavior: backend/services/audit_service.py (or equivalent) keeps logging both ROOT and PROPAGATED events as separate records. This is the correct behavior for audit (forensic completeness) and is NOT a bug.
Related
openspec/changes/fix-310-event-correlation-rca/ for the active SDD change and its explore artifact.