feat(rca): confidence-aware RCA with sufficiency gate by devankitjuneja · Pull Request #2159 · Tracer-Cloud/opensre

devankitjuneja · 2026-05-18T05:26:49Z

Fixes #1368

Describe the changes you have made in this PR -

Adds a confidence band (HIGH/MEDIUM/LOW) to every investigation output, backed by a deterministic sufficiency gate.

Band thresholds: score ≥ 0.75 → HIGH, 0.40 ≤ score < 0.75 → MEDIUM, score < 0.40 → LOW
Sufficiency gate: weak conclusions (low score or insufficient validated claims) are prefixed with "Most likely:" to signal uncertainty
Per-claim evidence attribution: each validated claim tracks the specific evidence keys that supported it
Fallback fields: ranked_hypotheses and missing_evidence populated when confidence is MEDIUM/LOW
Confidence band surfaces in terminal output, Slack, Telegram, and delivery payloads
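
As a rough illustration of the band thresholds listed above, here is a minimal sketch. The function name `classify_confidence_band` appears in this PR, but the signature, the constant names, and the handling of non-finite scores are assumptions made for this example, not the code in `result.py`.

```python
import math

# Thresholds as stated in the PR description.
HIGH_THRESHOLD = 0.75
MEDIUM_THRESHOLD = 0.40


def classify_confidence_band(validity_score: float) -> str:
    """Map a 0-1 validity score to 'high', 'medium', or 'low'."""
    # Assumption for this sketch: a NaN or infinite score is treated as low.
    if not math.isfinite(validity_score):
        return "low"
    if validity_score >= HIGH_THRESHOLD:
        return "high"
    if validity_score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"
```

Because the mapping is a pure function of the score, the same input always yields the same band and no additional LLM call is involved.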

Demo/Screenshot for feature changes and bug fixes -

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The confidence band is computed deterministically from validity_score (LLM-assigned 0–1) using fixed thresholds — no second LLM call for the band itself. The sufficiency gate runs after parse_diagnosis() and checks both score and validated claim count, because a high LLM-reported score with zero validated claims should not pass on its own. Per-claim evidence attribution uses a _ValidatedClaimSchema Pydantic model so the LLM attributes specific evidence keys per claim. The terminal renderer uses a filter pass to prevent the inline confidence section (embedded in the Slack message string) from double-printing alongside the styled confidence block.
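
To make the "score plus validated claim count" check concrete, here is a hedged sketch. The PR uses a Pydantic model named `_ValidatedClaimSchema`; the dataclass below is only a stand-in, and its field names (`claim`, `evidence_keys`) and the `check_sufficiency` signature are assumptions. The pass conditions follow the review flowchart in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class ValidatedClaim:
    """Stand-in for the PR's Pydantic claim schema (field names are hypothetical)."""
    claim: str
    evidence_keys: list[str] = field(default_factory=list)


def check_sufficiency(
    validity_score: float,
    validated_claims: list[ValidatedClaim],
    category: str = "",
) -> bool:
    """Return True when the conclusion is backed by enough validated claims."""
    # Healthy and unknown results bypass the gate, per the review summary.
    if category in {"healthy", "unknown"}:
        return True
    claim_count = len(validated_claims)
    # A high score still needs at least one validated claim.
    if validity_score >= 0.75 and claim_count >= 1:
        return True
    # A medium score needs at least two validated claims.
    if validity_score >= 0.40 and claim_count >= 2:
        return True
    return False
```

The point of checking both inputs is that a high LLM-reported score with zero validated claims returns `False` here, so the score alone cannot carry a conclusion through the gate.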

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

github-actions · 2026-05-18T05:26:57Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

greptile-apps · 2026-05-18T05:31:51Z

Greptile Summary

This PR adds a deterministic confidence band (high/medium/low) to every investigation output, backed by a sufficiency gate that checks both validity_score and validated-claim count. All previously flagged edge cases — the "unknown" category receiving a contradictory "Most likely:" prefix, the LLM-written prefix not being stripped on a sufficient result, and MEDIUM/HIGH bands not being downgraded when the gate fires — are correctly handled in the updated code.

result.py introduces classify_confidence_band and check_sufficiency, and the InvestigationResult dataclass gains confidence_band, ranked_hypotheses, and missing_evidence; the "unknown" and "healthy" factory methods set their bands correctly and both now bypass the gate.
investigation.py applies the gate after parse_diagnosis, unconditionally sets confidence_band = "low" when the gate fires (covering both HIGH and MEDIUM starting bands), and strips a pre-existing "Most likely:" prefix when evidence is sufficient.
Delivery layer (formatters/report.py, renderers/terminal.py, node.py, remote/renderer.py) propagates the band through Slack strings, Block Kit blocks, Telegram HTML, terminal rich and plain renderers, and streaming payloads; _sanitize_for_slack and html.escape are applied correctly in every path.
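
As an illustration of the escaping concern in the delivery layer, the sketch below formats a confidence line and a missing-evidence list for Telegram's HTML parse mode. `html.escape` is the standard-library call named in the summary; the function name `format_confidence_for_telegram` and the exact output layout are hypothetical and not taken from `formatters/report.py`.

```python
import html


def format_confidence_for_telegram(
    band: str, validity_score: float, missing_evidence: list[str]
) -> str:
    """Build an HTML snippet with every LLM-sourced string escaped."""
    lines = [f"<b>Confidence:</b> {html.escape(band.upper())} ({validity_score:.0%})"]
    # Fallback details are only shown for MEDIUM/LOW results.
    if band != "high" and missing_evidence:
        lines.append("<b>Missing evidence:</b>")
        lines.extend(f"• {html.escape(item)}" for item in missing_evidence)
    return "\n".join(lines)
```

Escaping each item individually keeps the intentional `<b>` tags intact while neutralising any angle brackets or ampersands that come from model output.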

Confidence Score: 5/5

Safe to merge — all gate/prefix/band-downgrade corrections are in place, test coverage is thorough, and no correctness regressions were found.

The sufficiency gate logic is deterministic and well-tested; the delivery layer changes are additive and isolated; all previously raised correctness issues have been addressed in the current diff.

No files require special attention.

Important Files Changed

Filename	Overview
app/agent/result.py	Adds `_ValidatedClaimSchema`, `classify_confidence_band`, and `check_sufficiency`; unknown/healthy factory classmethods correctly set their bands; previously-flagged edge cases (unknown category, LLM prefix stripping, band downgrade) are all resolved.
app/agent/investigation.py	Applies sufficiency gate after `parse_diagnosis`, unconditionally downgrades band to "low" and adds "Most likely:" when gate fires; strips existing prefix when gate passes; error paths hard-code "low".
app/delivery/publish_findings/formatters/report.py	Renders confidence band + ranked_hypotheses + missing_evidence in Slack string, Slack Block Kit, and Telegram; `_sanitize_for_slack` is called for all LLM-sourced items in both the string and block-kit paths; Telegram pipeline correctly HTML-escapes through `_to_telegram_html_body`.
app/delivery/publish_findings/renderers/terminal.py	New `_filter_confidence_sections` and `_render_rich_confidence_block` work correctly for rich mode; plain renderer has a cosmetic issue where `.strip()` eats the intended blank-line separator before the confidence output.
app/state/agent_state.py	Adds `confidence_band`, `ranked_hypotheses`, `missing_evidence` to both `AgentState` TypedDict and `AgentStateModel` Pydantic model with correct defaults.
tests/agent/test_confidence_gating.py	Comprehensive new test file covering band thresholds, factory classmethod bands, all sufficiency gate scenarios (high-score/no-claims, medium-score/1-claim, medium-score/2-claims, healthy, unknown, prefix stripping), and band downgrade on gate fire.
app/pipeline/runners.py	Streams three new fields (`confidence_band`, `ranked_hypotheses`, `missing_evidence`) in the `astream_investigation` payload.
app/remote/renderer.py	Updates the diagnose node status string to include the confidence band (e.g. `validity:HIGH(85%)`) and passes new fields to `render_report`.
app/utils/openclaw_delivery.py	Appends bracketed band label (e.g. `[HIGH]`) to the confidence line in the OpenClaw report body.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[parse_diagnosis / LLM structured output] --> B[classify_confidence_band\nvalidity_score → high/medium/low]
    B --> C[InvestigationResult\nconfidence_band\nranked_hypotheses\nmissing_evidence]
    C --> D{check_sufficiency}
    D -->|"healthy / unknown category"| E[Gate PASSES\nstrip 'Most likely:' if present]
    D -->|"score ≥ 0.75 + ≥1 validated claim"| E
    D -->|"score ≥ 0.40 + ≥2 validated claims"| E
    D -->|otherwise| F[Gate FIRES\nadd 'Most likely:' prefix\nforce band = low]
    E --> G[_result_to_state]
    F --> G
    G --> H[build_report_context]
    H --> I[format_slack_message\nbuild_slack_blocks\nformat_telegram_message]
    H --> J[render_report\nrich terminal / plain terminal]
    I --> K[Slack / Telegram delivery]
    J --> L[Terminal output]
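
The two gate outcomes in the flowchart above (strip the prefix when the gate passes, add the prefix and force the band to low when it fires) could be sketched roughly as follows. The helper name `apply_sufficiency_gate` and its signature are hypothetical; the real logic lives in `investigation.py` and may differ in detail.

```python
PREFIX = "Most likely:"


def apply_sufficiency_gate(
    root_cause: str, band: str, sufficient: bool
) -> tuple[str, str]:
    """Return (root_cause, band) after applying the gate outcome."""
    text = root_cause.strip()
    has_prefix = text.lower().startswith(PREFIX.lower())
    if sufficient:
        # Gate passes: drop a hedge the LLM may have written on its own.
        if has_prefix:
            text = text[len(PREFIX):].lstrip()
        return text, band
    # Gate fires: hedge the conclusion (without doubling the prefix)
    # and override whatever band the score alone produced.
    if not has_prefix:
        text = f"{PREFIX} {text}"
    return text, "low"
```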

Reviews (6): Last reviewed commit: "fix(rca): downgrade confidence_band to l..."

devankitjuneja · 2026-05-18T05:50:15Z

@greptile review

devankitjuneja · 2026-05-18T07:48:37Z

@greptile review

devankitjuneja · 2026-05-18T08:04:01Z

@greptile review

devankitjuneja · 2026-05-18T08:11:17Z

@greptile review

devankitjuneja · 2026-05-18T08:20:43Z

@greptile review

VibhorGautam

one edge worth documenting: the confidence band and sufficiency gate can intentionally disagree. a result can score high but still fail sufficiency because the evidence is thin, and investigation.py then downgrades the emitted band from high to medium while adding the "Most likely:" qualifier

that cross-file override is the important behavior but it's not obvious from result.py alone. could you add a focused test for high-score + insufficient-evidence asserting the downgrade, and maybe a short inline note at the downgrade branch in investigation.py

devankitjuneja · 2026-05-18T13:12:44Z

Valid ask. An inline comment would definitely make our lives easier.
Also, that case is already covered by test_gate_downgrades_band_to_low_when_fired.
Thanks for the suggestion :)
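
For readers without the diff open, a focused test of the high-score, thin-evidence case might look something like this. It is a self-contained sketch, not the body of `test_gate_downgrades_band_to_low_when_fired`; `emitted_band` is a hypothetical stand-in that combines classification with the gate override, using the thresholds and claim counts stated in this PR.

```python
def emitted_band(validity_score: float, validated_claim_count: int) -> str:
    """Hypothetical stand-in: classify the score, then apply the gate override."""
    if validity_score >= 0.75:
        band = "high"
    elif validity_score >= 0.40:
        band = "medium"
    else:
        band = "low"
    sufficient = (validity_score >= 0.75 and validated_claim_count >= 1) or (
        validity_score >= 0.40 and validated_claim_count >= 2
    )
    # When the gate fires, the emitted band is forced to low.
    return band if sufficient else "low"


def test_high_score_without_claims_is_downgraded() -> None:
    # High score, no validated claims: the band and the gate disagree.
    assert emitted_band(0.90, validated_claim_count=0) == "low"
    # Same score with one validated claim keeps its band.
    assert emitted_band(0.90, validated_claim_count=1) == "high"
    # Medium score needs two validated claims to keep its band.
    assert emitted_band(0.60, validated_claim_count=1) == "low"
    assert emitted_band(0.60, validated_claim_count=2) == "medium"
```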

Add confidence band (high/medium/low) to every investigation output, a sufficiency gate that prefixes weak conclusions with "Most likely:", per-claim evidence attribution, and ranked_hypotheses/missing_evidence fields. Confidence band and uncertainty signals rendered in terminal, Slack, Telegram, and delivery channels.

cerencamkiran · 2026-05-22T11:47:31Z

One thing I’m a bit unsure about is that the sufficiency gate depends on validated_claims count + the LLM’s validity_score.

Right now, something like “2 validated claims + medium score” can still pass even if the actual evidence is fairly weak.

A few claims can still come from the same noisy signal or the same source, and the current logic doesn’t fully separate:

independent evidence vs repeating the same signal in different ways,
directly validated findings vs inferred correlations,
or conflicting evidence vs just missing evidence.

For a confidence-aware RCA system, I think the actual evidence quality and source diversity matter more than the raw claim count.

Good work overall. Thanks for your effort.

devankitjuneja · 2026-05-22T12:04:19Z

Hi @cerencamkiran, thanks for the thorough review.
You're right: for a full confidence-aware RCA system, evidence quality and source diversity are non-negotiable.
This PR is an intentional MVP. Happy to open a separate issue that extends it.
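
One possible starting point for such a follow-up is to count distinct evidence sources rather than raw claims. This is purely illustrative and not part of this PR: it assumes each claim is a dict with an `evidence_keys` list and that keys look like `"source:detail"`, both of which are hypothetical conventions chosen for the example.

```python
def count_independent_sources(claims: list[dict]) -> int:
    """Count distinct evidence sources across all validated claims."""
    sources: set[str] = set()
    for claim in claims:
        for key in claim.get("evidence_keys", []):
            # Assumed key format "source:detail"; keys without a colon
            # are treated as their own source.
            sources.add(key.split(":", 1)[0])
    return len(sources)
```

Under this assumption, two claims that both cite `logs:*` keys count as one source, which addresses the "same signal repeated in different ways" case raised above but not the validated-versus-inferred or conflicting-versus-missing distinctions.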

greptile-apps Bot reviewed May 18, 2026

Comment thread app/agent/result.py

Comment thread app/agent/investigation.py Outdated

greptile-apps Bot reviewed May 18, 2026

Comment thread app/agent/investigation.py Outdated

greptile-apps Bot reviewed May 18, 2026

Comment thread app/delivery/publish_findings/formatters/report.py

greptile-apps Bot reviewed May 18, 2026

Comment thread app/agent/investigation.py Outdated

VibhorGautam reviewed May 18, 2026

Ankit Juneja added 8 commits May 20, 2026 12:18

fix(rca): pass confidence fields through streaming renderer

a66ffc9

fix(rca): exempt unknown category from gate; strip stale prefix; guar…

aaf7947

…d NaN score

fix(rca): cap confidence_band to medium when sufficiency gate fires

7e72b70

fix(rca): propagate confidence fields to all emit/event/output paths

8c0718c

fix(tests): update CLI out dict assertion to include confidence_band

654b608

fix(rca): downgrade confidence_band to low whenever sufficiency gate …

5cd2110

…fires

docs(rca): note cross-file band override at sufficiency gate

a7a6524

devankitjuneja force-pushed the feature/1368-confidence-aware-rca branch from 8592edb to a7a6524 Compare May 20, 2026 06:49
