feat(runbooks): local runbook store with runbook-aware diagnosis grounding (#1073 phase 2a)#2029

Open
devankitjuneja wants to merge 12 commits into Tracer-Cloud:main from devankitjuneja:feature/1073-phase-2a-runbook-store

Conversation

@devankitjuneja
Contributor

@devankitjuneja devankitjuneja commented May 14, 2026

Relates to #1073

Describe the changes you have made in this PR -

Adds a local markdown runbook store and runbook-aware reasoning to the investigation pipeline (Phase 2a).

What's included:

  • `app/runbooks/` — disk-backed store with YAML frontmatter parsing and deterministic top-1 retrieval (service match +2, keyword overlap +1 per trigger)
  • `app/pipeline/pipeline.py` — `_retrieve_runbook()` runs between `extract_alert` and the ReAct loop, writes `matched_runbook` to state
  • `app/agent/prompt.py` — matched runbook body appended to `format_alert_context()` so the LLM grounds remediation steps in team playbooks
  • `app/delivery/publish_findings/` — `runbook_provenance` on `ReportContext`; renders `Source: runbooks/<slug>.md` below Recommended Actions
  • `opensre runbook add|list|remove` CLI + `/runbook` REPL parity
  • 46 new tests; `docs/runbooks.mdx`

Runbook format:

---
service: payments-api
triggers:
  - oom
  - memory
---
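
A minimal sketch of how a file in this format could be split into frontmatter and body. The PR says the store parses YAML frontmatter; the stand-in below uses only the standard library and handles just the flat `key: value` and `- item` shapes shown above. The names `ParsedRunbook` and `parse_runbook`, and the one-line body in the example, are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedRunbook:
    """Hypothetical container; the PR's real Runbook type may differ."""
    service: str = ""
    triggers: list[str] = field(default_factory=list)
    body: str = ""


def parse_runbook(text: str) -> ParsedRunbook:
    """Split '---' delimited frontmatter from the markdown body.

    Handles only scalar keys and '- item' lists; a real store would
    likely delegate the frontmatter to a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("runbook must start with a '---' frontmatter fence")
    try:
        end = lines.index("---", 1)
    except ValueError:
        raise ValueError("frontmatter is missing its closing '---'") from None

    parsed = ParsedRunbook(body="\n".join(lines[end + 1:]).strip())
    current_key = ""
    for raw in lines[1:end]:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            # List item: belongs to the most recently seen key.
            if current_key == "triggers":
                parsed.triggers.append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            if current_key == "service":
                parsed.service = value.strip()
    return parsed


example = """---
service: payments-api
triggers:
  - oom
  - memory
---
Restart the worker pool, then check heap usage.
"""
runbook = parse_runbook(example)
```

Here `runbook.service` is `"payments-api"` and `runbook.triggers` is `["oom", "memory"]`; everything after the closing fence becomes the body that would later be injected into the alert context.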

Demo/Screenshot for feature changes and bug fixes -

image

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

Retrieval is deterministic (no LLM, no vector DB) — score = service name match + keyword overlap against `triggers:` frontmatter. Top-1 result is written to state before the ReAct loop so `format_alert_context()` can append the runbook body to the user message. Injection is in the alert context (not system prompt) because `build_system_prompt` is stateless. Template fallback from Phase 1 stays active when no runbook matches.
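
The scoring described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `app/runbooks/retrieval.py`: the `Runbook` fields, the function names `score` and `retrieve_top1`, the case-insensitive comparison, and the first-wins tie-break are all hypothetical. The weights (+2 for a service match, +1 per matching trigger) come from the PR description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Runbook:
    """Hypothetical shape; field names are assumptions."""
    slug: str
    service: str
    triggers: list[str] = field(default_factory=list)


def score(runbook: Runbook, keywords: set[str], service: str) -> int:
    """+2 for a service match, +1 for each trigger found in the keywords."""
    total = 0
    if service and runbook.service.lower() == service.lower():
        total += 2
    for trigger in runbook.triggers:
        tokens = trigger.lower().split()
        # A multi-word trigger counts only if every token is present.
        if tokens and all(tok in keywords for tok in tokens):
            total += 1
    return total


def retrieve_top1(
    runbooks: list[Runbook], keywords: set[str], service: str
) -> Optional[Runbook]:
    """Return the highest-scoring runbook, or None when nothing scores above 0."""
    best: Optional[Runbook] = None
    best_score = 0
    for rb in runbooks:
        s = score(rb, keywords, service)
        if s > best_score:  # strict '>' keeps the earlier runbook on ties
            best, best_score = rb, s
    return best


books = [
    Runbook("payments-oom", "payments-api", ["oom", "memory"]),
    Runbook("checkout-latency", "checkout", ["latency"]),
]
match = retrieve_top1(books, {"oom", "killed", "memory"}, "payments-api")
```

In this example the first runbook scores 4 (service match plus two triggers) and is selected. A query that matches neither service nor triggers returns `None`, which is the case where the Phase 1 template fallback would apply.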


Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

@github-actions
Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

Comment thread tests/runbooks/test_store.py Fixed
Comment thread tests/runbooks/test_store.py Fixed
@greptile-apps
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR adds a local markdown runbook store and runbook-aware reasoning to the investigation pipeline (Phase 2a of #1073). All previously flagged issues from earlier review rounds have been addressed in this revision.

  • app/runbooks/: New disk-backed store (store.py) with YAML frontmatter parsing, slug validation enforced symmetrically in both save() and remove(), and a deterministic top-1 retrieval engine (retrieval.py) that scores by service match (+2) and multi-word trigger overlap (+1 per trigger using all()-token matching).
  • app/pipeline/pipeline.py: _retrieve_runbook() inserted between extract_alert and the ReAct loop, using len(w) >= 3 (consistent with docs and tests), with commonLabels: null handled via or {}, wrapped in a broad except so a broken runbook store never blocks an investigation.
  • Delivery + prompt layers: Matched runbook body injected into format_alert_context() for LLM grounding; runbook_provenance written to ReportContext and rendered as a _Source: runbooks/<slug>.md_ line in Slack messages and a Block Kit context block.
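
A rough sketch of the kind of slug check the summary refers to. The pattern `[\w-]+` is the one quoted in the file overview for `app/agent/prompt.py`; the function name `validate_slug` is hypothetical, and raising a plain `ValueError` is an assumption (the PR also mentions a `RunbookValidationError`).

```python
import re

# Word characters and hyphens only, so '.', '/' and whitespace are rejected.
_SLUG_RE = re.compile(r"[\w-]+")


def validate_slug(slug: str) -> str:
    """Return the slug unchanged, or raise if it contains other characters."""
    if not _SLUG_RE.fullmatch(slug):
        raise ValueError(f"invalid runbook slug: {slug!r}")
    return slug
```

Calling the same check from both `save()` and `remove()` means a value such as `../secrets` is rejected before it is ever joined onto the runbook directory path.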

Confidence Score: 5/5

Safe to merge — all previously identified defects are resolved and the new runbook path is wrapped to never block an investigation.

Every previously flagged issue has been corrected: the len(w) >= 3 keyword filter is consistent across pipeline, test suite, and docs; multi-word triggers are matched with the walrus-operator all() approach; commonLabels: null is handled by or {}; save() now validates the slug before copying; and the CLI remove command converts both ValueError and RunbookValidationError to clean ClickExceptions. The new code is defensive, test coverage is thorough (46 new tests including a synthetic scenario), and no new defects were found.

No files require special attention.
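
The "never block an investigation" behaviour could look something like the sketch below. It is an assumption-laden illustration: `retrieve_fail_open` and its callable argument are hypothetical names, and the PR's `_retrieve_runbook()` may log or structure this differently.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def retrieve_fail_open(
    load_and_match: Callable[[], Optional[dict[str, Any]]],
) -> Optional[dict[str, Any]]:
    """Run runbook retrieval, returning None instead of raising on failure."""
    try:
        return load_and_match()
    except Exception:  # deliberately broad: retrieval is best-effort
        logger.warning(
            "runbook retrieval failed; continuing without one", exc_info=True
        )
        return None
```

The trade-off is that a malformed runbook surfaces only as a log line, so the investigation proceeds as if no runbook had matched.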

Important Files Changed

| Filename | Overview |
| --- | --- |
| app/runbooks/store.py | New disk-backed runbook store with YAML frontmatter parsing, slug validation, and save/remove/load_all APIs. All previously flagged issues addressed: save() now validates the slug before copying, and invalid slugs passed to remove() are caught in the CLI layer. |
| app/runbooks/retrieval.py | Deterministic top-1 scoring engine. Multi-word trigger matching now uses the walrus operator + all() to split trigger tokens and require all parts to appear in keyword_set; the previously flagged silent-zero bug is resolved. |
| app/pipeline/pipeline.py | _retrieve_runbook() inserted between extract_alert and the ReAct loop. Uses len(w) >= 3 (matching docs/tests), handles commonLabels: null via or {}, and wraps the call in a broad except so it never blocks an investigation. |
| app/agent/prompt.py | Adds _build_runbook_section(), which appends the matched runbook body (truncated at 2000 chars at a newline boundary) to format_alert_context(). Slug validated to [\w-]+, so no injection risk. |
| app/cli/commands/runbook.py | New CLI group with add/list/remove subcommands. remove() now catches both ValueError and RunbookValidationError and converts them to a clean ClickException; the previously flagged traceback issue is resolved. |
| app/delivery/publish_findings/report_context.py | Adds a runbook_provenance field to the ReportContext TypedDict and build_report_context(). Only populated when matched_runbook is a dict with a non-empty slug. |
| app/delivery/publish_findings/formatters/report.py | Appends Source: runbooks/<slug>.md as both a text line in format_slack_message and a Block Kit context block in build_slack_blocks. Correctly guarded with a None-check on runbook_provenance and slug. |
| tests/synthetic/runbooks/test_runbook_suite.py | Fixture-driven synthetic suite. Now uses len(w) >= 3 consistently with production; the previously flagged > 3 mismatch is fixed. |

Sequence Diagram

sequenceDiagram
    participant P as pipeline.py
    participant RS as runbooks/store.py
    participant RR as runbooks/retrieval.py
    participant PR as prompt.py
    participant RC as report_context.py
    participant RF as report.py (Slack)

    P->>RS: load_all() → list[Runbook]
    P->>RR: retrieve_matching_runbook(runbooks, keywords, service, pipeline_name)
    RR-->>P: "Runbook | None"
    P->>P: matched.to_dict() → dict
    Note over P: _merge(state, {matched_runbook: dict})

    P->>PR: format_alert_context(state)
    PR->>PR: _build_runbook_section(state[matched_runbook])
    PR-->>P: alert context + runbook block

    P->>RC: build_report_context(state)
    RC->>RC: extract runbook_provenance from matched_runbook
    RC-->>P: ReportContext with runbook_provenance

    P->>RF: format_slack_message(ctx) / build_slack_blocks(ctx)
    RF->>RF: "append _Source: runbooks/<slug>.md_ line/context block"
    RF-->>P: Slack message with runbook provenance

Reviews (9): Last reviewed commit: "fix(runbooks): handle null commonLabels ..." | Re-trigger Greptile
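
The last two hops of the diagram (provenance extraction and the Slack source line) can be sketched as below. The function names and the `{"slug": ...}` shape of the provenance value are assumptions for illustration; only the guard conditions and the `_Source: runbooks/<slug>.md_` text come from the review summary.

```python
from typing import Any, Optional


def extract_provenance(state: dict[str, Any]) -> Optional[dict[str, str]]:
    """Return provenance only for a dict-shaped match with a non-empty slug."""
    matched = state.get("matched_runbook")
    if not isinstance(matched, dict):
        return None
    slug = matched.get("slug")
    if not isinstance(slug, str) or not slug:
        return None
    return {"slug": slug}


def append_source_line(message: str, provenance: Optional[dict[str, str]]) -> str:
    """Append the italic Slack source line when provenance is present."""
    if not provenance or not provenance.get("slug"):
        return message
    return f"{message}\n_Source: runbooks/{provenance['slug']}.md_"


state = {"matched_runbook": {"slug": "payments-oom", "body": "..."}}
rendered = append_source_line("Recommended Actions", extract_provenance(state))
```

When no runbook matched, `extract_provenance` returns `None` and the message is returned unchanged, so reports without a runbook look the same as before.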

Comment thread app/pipeline/pipeline.py Outdated
Comment thread app/cli/commands/runbook.py
@devankitjuneja
Contributor Author

@greptile review

Comment thread tests/synthetic/runbooks/test_runbook_suite.py Outdated
@devankitjuneja
Contributor Author

@greptile review

Comment thread app/runbooks/retrieval.py Outdated
@devankitjuneja devankitjuneja marked this pull request as draft May 14, 2026 18:48
@devankitjuneja devankitjuneja marked this pull request as ready for review May 15, 2026 08:51
@devankitjuneja devankitjuneja force-pushed the feature/1073-phase-2a-runbook-store branch from cf6595d to 6ab2a18 Compare May 15, 2026 08:54
@devankitjuneja
Contributor Author

@greptile review

@devankitjuneja
Contributor Author

@greptile review

Comment thread app/runbooks/store.py
Contributor

@VibhorGautam VibhorGautam left a comment

this is a good shape overall - store, CLI, prompt section, and report provenance all tied together

two prompt-section nits:

_build_runbook_section cuts the body at 2000 chars, so it can split a sentence or code block. id look for the last newline before the cutoff and add a truncated marker so the agent doesnt treat a partial runbook as complete

_build_runbook_section falls back to unknown for slug, but report_context.py uses an empty string in provenance. small thing, but the report could look weird if a runbook is missing a slug

i couldnt tell from this diff where _retrieve_runbook decides which runbook matches. is that already in main, or coming in another PR? thats the part id want to sanity-check since the prompt asks the agent to prefer runbook actions

tmp_path + monkeypatch coverage looks right
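
The truncation suggestion above could be sketched like this. The 2000-character limit is from the comment; the helper name `truncate_at_newline` and the marker text are hypothetical, and the hard-cut fallback for bodies without a line break is an assumption.

```python
MAX_RUNBOOK_CHARS = 2000
TRUNCATION_MARKER = "\n[runbook truncated]"  # hypothetical marker text


def truncate_at_newline(body: str, limit: int = MAX_RUNBOOK_CHARS) -> str:
    """Cut at the last newline before `limit` and flag that a cut happened."""
    if len(body) <= limit:
        return body
    cut = body.rfind("\n", 0, limit)
    if cut <= 0:
        # No usable line break before the limit: fall back to a hard cut.
        cut = limit
    return body[:cut].rstrip() + TRUNCATION_MARKER
```

Cutting on a line boundary avoids splitting a sentence mid-line, and the marker tells the agent that the runbook continues beyond what it was shown. A fenced code block spanning the boundary could still be left open, so a stricter version might also track fence state.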

@devankitjuneja
Contributor Author

> this is a good shape overall - store, CLI, prompt section, and report provenance all tied together
>
> two prompt-section nits:
>
> _build_runbook_section cuts the body at 2000 chars, so it can split a sentence or code block. id look for the last newline before the cutoff and add a truncated marker so the agent doesnt treat a partial runbook as complete
>
> _build_runbook_section falls back to unknown for slug, but report_context.py uses an empty string in provenance. small thing, but the report could look weird if a runbook is missing a slug
>
> i couldnt tell from this diff where _retrieve_runbook decides which runbook matches. is that already in main, or coming in another PR? thats the part id want to sanity-check since the prompt asks the agent to prefer runbook actions
>
> tmp_path + monkeypatch coverage looks right

Hi @VibhorGautam
Thanks for pointing out these issues. This is a good review :)

  • Both points are valid and will be addressed in the next commit.
  • _retrieve_runbook is in app/pipeline/pipeline.py lines 197–223, already in this PR.

@VibhorGautam
Contributor

ah missed that, thanks for the pointer - looks like the matching logic is solid then

@devankitjuneja
Contributor Author

@greptile review

@devankitjuneja
Contributor Author

@greptile review

Comment thread app/pipeline/pipeline.py Outdated
@VibhorGautam
Contributor

good fix on the ValueError catch, clean and minimal

greptile's right about the commonLabels null case too. .get("commonLabels", {}) only falls back when the key is missing, not when the alert source sends it as explicit null. or {} covers both paths

worth scanning the rest of _retrieve_runbook for similar .get(..., {}) usage on external json fields while you're in there, same footgun anywhere the upstream payload can send explicit nulls
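
A small demonstration of the difference described above, using only built-in `dict` behaviour. The helper name `extract_labels` and the `service` label key are hypothetical.

```python
from typing import Any


def extract_labels(alert: dict[str, Any]) -> dict[str, Any]:
    """Return commonLabels as a dict even when the payload sends null."""
    # alert.get("commonLabels", {}) would return None for an explicit null,
    # because the default applies only when the key is absent.
    return alert.get("commonLabels") or {}


payload_missing: dict[str, Any] = {}
payload_null = {"commonLabels": None}
payload_ok = {"commonLabels": {"service": "payments-api"}}

# The default-argument form leaks None through for the explicit-null payload.
leaked = payload_null.get("commonLabels", {})
```

Here `leaked` is `None`, so a following `.get("service")` would raise `AttributeError`, while `extract_labels` returns `{}` for both the missing and the null payloads. One caveat: `or {}` also replaces any other falsy value (an empty string, `0`, an empty list) with `{}`, which is usually acceptable for a field that is expected to be a mapping.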

@devankitjuneja
Contributor Author

@greptile review

@muddlebee
Collaborator

hey @devankitjuneja

thank you for the PR, we will need some sort of confirmation from @VaibhavUpreti and @davincios before going with the merge and approval and reviews.

@devankitjuneja
Contributor Author

> hey @devankitjuneja
>
> thank you for the PR, we will need some sort of confirmation from @VaibhavUpreti and @davincios before going with the merge and approval and reviews.

Sure :)

@devankitjuneja
Contributor Author

Hi @VaibhavUpreti
Need your inputs on this.

Comment thread app/cli/commands/runbook.py Outdated
@runbook.command("add")
@click.argument("path", type=click.Path(exists=True, dir_okay=False, path_type=Path))
def runbook_add(path: Path) -> None:
    """Copy a markdown runbook into ~/.config/opensre/runbooks/."""
Member

Let's update all the runbook DIR to .opensre/runbooks

Member

@VaibhavUpreti VaibhavUpreti left a comment

@devankitjuneja great work so far, could you please add a demo video of a grafana alert by running opensre investigate -i <file_path>, before and after you added the runbook.

After loading integration the first step should be to load the runbook in the planning step.

@devankitjuneja
Contributor Author

> @devankitjuneja great work so far, could you please add a demo video of a grafana alert by running opensre investigate -i <file_path>, before and after you added the runbook.
>
> After loading integration the first step should be to load the runbook in the planning step.

Sure @VaibhavUpreti

6 participants