Problem statement
The agent diagnoses root causes but never produces actionable remediation steps. The remediation_steps field exists throughout the stack — in state, report context, and the Slack/terminal report formatters — but every code path in the diagnosis node hardcodes it to an empty list. Users see a root cause but no "what to do next."
Beyond that, the investigation planner has no awareness of team-specific runbooks. If an org's runbook says "when latency spikes, check DB connection pool before checking deployments," the planner can't use that signal — it relies solely on alert keywords and source detection.
The SREGuidanceTool knowledge base covers generic Google SRE book pipeline patterns but not web service, Kubernetes, or org-specific failure patterns.
Proposed solution
Two-phase implementation:
Phase 1 — Populate remediation steps (smaller, standalone):
- Add an explicit instruction for the LLM to return ordered remediation steps during diagnosis.
- Replace the hardcoded empty
remediation_steps list with the parsed LLM output.
- Update synthetic scoring rubrics to assert non-empty
remediation_steps.
Phase 2 — Runbook ingestion and retrieval at planning time:
- Add a local runbook store mapping service tags and trigger keywords to runbook content.
- Add a CLI command to register a runbook from a file.
- At planning time, retrieve the top matching runbook excerpt for the current alert and inject it into the planning prompt.
- Use runbook context during diagnosis so remediation steps are runbook-backed, not generic.
Phase 1 is a self-contained improvement that can ship independently of the runbook storage work.
Problem statement
The agent diagnoses root causes but never produces actionable remediation steps. The
remediation_stepsfield exists throughout the stack — in state, report context, and the Slack/terminal report formatters — but every code path in the diagnosis node hardcodes it to an empty list. Users see a root cause but no "what to do next."Beyond that, the investigation planner has no awareness of team-specific runbooks. If an org's runbook says "when latency spikes, check DB connection pool before checking deployments," the planner can't use that signal — it relies solely on alert keywords and source detection.
The
SREGuidanceToolknowledge base covers generic Google SRE book pipeline patterns but not web service, Kubernetes, or org-specific failure patterns.Proposed solution
Two-phase implementation:
Phase 1 — Populate remediation steps (smaller, standalone):
remediation_stepslist with the parsed LLM output.remediation_steps.Phase 2 — Runbook ingestion and retrieval at planning time:
Phase 1 is a self-contained improvement that can ship independently of the runbook storage work.