[FEATURE] Runbook-aware reasoning — ingest runbooks and surface remediation steps #1073

@devankitjuneja

Problem statement

The agent diagnoses root causes but never produces actionable remediation steps. The remediation_steps field exists throughout the stack — in state, report context, and the Slack/terminal report formatters — but every code path in the diagnosis node hardcodes it to an empty list. Users see a root cause but no "what to do next."

Beyond that, the investigation planner has no awareness of team-specific runbooks. If an org's runbook says "when latency spikes, check DB connection pool before checking deployments," the planner can't use that signal — it relies solely on alert keywords and source detection.

The SREGuidanceTool knowledge base covers generic pipeline patterns from the Google SRE book, but not web-service, Kubernetes, or org-specific failure patterns.

Proposed solution

Two-phase implementation:

Phase 1 — Populate remediation steps (smaller, standalone):

  • Add an explicit instruction for the LLM to return ordered remediation steps during diagnosis.
  • Replace the hardcoded empty remediation_steps list with the parsed LLM output.
  • Update synthetic scoring rubrics to assert non-empty remediation_steps.
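
A minimal sketch of the parsing step in Phase 1, assuming the diagnosis prompt asks the LLM to reply with a JSON object that has a `remediation_steps` array. The function name `parse_remediation_steps` and the JSON response shape are hypothetical, not taken from the existing codebase; the real diagnosis node may use a different output format.

```python
import json
import re


def parse_remediation_steps(llm_output: str) -> list[str]:
    """Extract ordered remediation steps from a raw LLM response.

    Assumption: the prompt instructs the model to return a JSON object
    containing a "remediation_steps" array of strings.
    """
    # Take the span from the first "{" to the last "}" so that surrounding
    # prose or code fences in the response are tolerated.
    match = re.search(r"\{.*\}", llm_output, flags=re.DOTALL)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    steps = payload.get("remediation_steps", [])
    if not isinstance(steps, list):
        return []
    # Keep the model's ordering; drop non-string and blank entries.
    return [s.strip() for s in steps if isinstance(s, str) and s.strip()]
```

Falling back to an empty list on malformed output keeps today's behavior as the worst case, so the report formatters need no changes.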

Phase 2 — Runbook ingestion and retrieval at planning time:

  • Add a local runbook store mapping service tags and trigger keywords to runbook content.
  • Add a CLI command to register a runbook from a file.
  • At planning time, retrieve the top matching runbook excerpt for the current alert and inject it into the planning prompt.
  • Use runbook context during diagnosis so remediation steps are runbook-backed, not generic.
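
The store and retrieval steps in Phase 2 could look roughly like the sketch below. Everything here is an assumption for illustration: the `Runbook` and `RunbookStore` names, the in-memory list, the keyword-overlap scoring, and the `build_planning_prompt` helper are hypothetical and do not describe existing code. A real implementation would persist runbooks to disk and might use a better ranking method.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Runbook:
    name: str
    service_tags: set[str]
    trigger_keywords: set[str]
    content: str


@dataclass
class RunbookStore:
    """Hypothetical in-memory store; persistence is out of scope here."""

    runbooks: list[Runbook] = field(default_factory=list)

    def register(self, runbook: Runbook) -> None:
        self.runbooks.append(runbook)

    def top_match(self, alert_text: str, service: str | None = None) -> Runbook | None:
        """Return the best-scoring runbook, or None if nothing matches."""
        words = set(re.findall(r"[a-z0-9_-]+", alert_text.lower()))
        best, best_score = None, 0
        for rb in self.runbooks:
            # One point per trigger keyword that appears in the alert text.
            score = len(words & {k.lower() for k in rb.trigger_keywords})
            # Arbitrary bonus when the alert's service matches a tag.
            if service is not None and service in rb.service_tags:
                score += 2
            if score > best_score:
                best, best_score = rb, score
        return best


def build_planning_prompt(base_prompt: str, runbook: Runbook | None,
                          max_chars: int = 500) -> str:
    """Append a truncated runbook excerpt to the planning prompt, if any."""
    if runbook is None:
        return base_prompt
    excerpt = runbook.content[:max_chars]
    return f"{base_prompt}\n\nRelevant runbook excerpt ({runbook.name}):\n{excerpt}"
```

For example, registering a runbook with trigger keywords `latency` and `p99` and the content "Check DB connection pool before checking deployments." would let an alert such as "High latency on checkout API p99" pull that excerpt into the planning prompt, while an unrelated alert leaves the prompt unchanged.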

Phase 1 is a self-contained improvement that can ship independently of the runbook storage work.
