feat(rca): confidence-aware RCA with sufficiency gate #2159

Open
devankitjuneja wants to merge 8 commits into Tracer-Cloud:main from devankitjuneja:feature/1368-confidence-aware-rca

Conversation

@devankitjuneja
Contributor

@devankitjuneja devankitjuneja commented May 18, 2026

Fixes #1368

Describe the changes you have made in this PR -

Adds a confidence band (HIGH/MEDIUM/LOW) to every investigation output, backed by a deterministic sufficiency gate.

  • Band thresholds: ≥0.75 → HIGH, 0.40 to <0.75 → MEDIUM, <0.40 → LOW
  • Sufficiency gate: weak conclusions (low score or insufficient validated claims) are prefixed with "Most likely:" to signal uncertainty
  • Per-claim evidence attribution: each validated claim tracks the specific evidence keys that supported it
  • Fallback fields: ranked_hypotheses and missing_evidence populated when confidence is MEDIUM/LOW
  • Confidence band surfaces in terminal output, Slack, Telegram, and delivery payloads
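A minimal sketch of the band mapping described in the thresholds bullet above. The function name classify_confidence_band comes from this PR, but the body here is an assumption reconstructed from the stated thresholds, not the merged implementation. It treats MEDIUM as the half-open range from 0.40 up to (but not including) 0.75.

```python
def classify_confidence_band(validity_score: float) -> str:
    """Map a 0-1 validity score to a band (assumed reconstruction of the PR logic)."""
    if validity_score >= 0.75:
        return "high"
    if validity_score >= 0.40:
        return "medium"
    return "low"


# Boundary values on either side of each threshold.
for score in (0.85, 0.75, 0.74, 0.40, 0.39):
    print(f"{score:.2f} -> {classify_confidence_band(score)}")
```

Because the mapping is a pure function of the score, the same score always yields the same band, which is what makes the band deterministic even though the score itself is LLM-assigned.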

Demo/Screenshot for feature changes and bug fixes -

image

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The confidence band is computed deterministically from validity_score (LLM-assigned 0–1) using fixed thresholds — no second LLM call for the band itself. The sufficiency gate runs after parse_diagnosis() and checks both score and validated claim count, because a high LLM-reported score with zero validated claims should not pass on its own. Per-claim evidence attribution uses a _ValidatedClaimSchema Pydantic model so the LLM attributes specific evidence keys per claim. The terminal renderer uses a filter pass to prevent the inline confidence section (embedded in the Slack message string) from double-printing alongside the styled confidence block.
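To illustrate the per-claim evidence attribution mentioned above, here is a rough sketch. It uses a plain dataclass as a stand-in for the PR's _ValidatedClaimSchema Pydantic model. The field names (claim, evidence_keys), the helper supported_claims, and the sample evidence keys are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ValidatedClaim:
    """Hypothetical stand-in for the PR's _ValidatedClaimSchema model."""
    claim: str
    evidence_keys: list[str] = field(default_factory=list)


def supported_claims(
    claims: list[ValidatedClaim], evidence: dict[str, str]
) -> list[ValidatedClaim]:
    """Keep only claims that cite at least one key present in the collected evidence."""
    return [c for c in claims if any(key in evidence for key in c.evidence_keys)]


# Illustrative data only; none of these keys or messages come from the PR.
evidence = {"worker_logs": "OOMKilled at 02:14", "node_metrics": "memory at 98%"}
claims = [
    ValidatedClaim("Worker ran out of memory", ["worker_logs", "node_metrics"]),
    ValidatedClaim("Upstream API was rate limiting", []),
]
print([c.claim for c in supported_claims(claims, evidence)])
```

Tracking evidence keys per claim is what lets a gate count only claims that are actually backed by collected evidence, instead of trusting the reported score alone.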


Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

@github-actions
Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@greptile-apps
Contributor

greptile-apps Bot commented May 18, 2026

Greptile Summary

This PR adds a deterministic confidence band (high/medium/low) to every investigation output, backed by a sufficiency gate that checks both validity_score and validated-claim count. All previously flagged edge cases — the "unknown" category receiving a contradictory "Most likely:" prefix, the LLM-written prefix not being stripped on a sufficient result, and MEDIUM/HIGH bands not being downgraded when the gate fires — are correctly handled in the updated code.

  • result.py introduces classify_confidence_band and check_sufficiency, and the InvestigationResult dataclass gains confidence_band, ranked_hypotheses, and missing_evidence; the "unknown" and "healthy" factory methods set their bands correctly and both now bypass the gate.
  • investigation.py applies the gate after parse_diagnosis, unconditionally sets confidence_band = "low" when the gate fires (covering both HIGH and MEDIUM starting bands), and strips a pre-existing "Most likely:" prefix when evidence is sufficient.
  • Delivery layer (formatters/report.py, renderers/terminal.py, node.py, remote/renderer.py) propagates the band through Slack strings, Block Kit blocks, Telegram HTML, terminal rich and plain renderers, and streaming payloads; _sanitize_for_slack and html.escape are applied correctly in every path.

Confidence Score: 5/5

Safe to merge — all gate/prefix/band-downgrade corrections are in place, test coverage is thorough, and no correctness regressions were found.

The sufficiency gate logic is deterministic and well-tested; the delivery layer changes are additive and isolated; all previously raised correctness issues have been addressed in the current diff.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| app/agent/result.py | Adds _ValidatedClaimSchema, classify_confidence_band, and check_sufficiency; unknown/healthy factory classmethods correctly set their bands; previously-flagged edge cases (unknown category, LLM prefix stripping, band downgrade) are all resolved. |
| app/agent/investigation.py | Applies sufficiency gate after parse_diagnosis, unconditionally downgrades band to "low" and adds "Most likely:" when gate fires; strips existing prefix when gate passes; error paths hard-code "low". |
| app/delivery/publish_findings/formatters/report.py | Renders confidence band + ranked_hypotheses + missing_evidence in Slack string, Slack Block Kit, and Telegram; _sanitize_for_slack is called for all LLM-sourced items in both the string and block-kit paths; Telegram pipeline correctly HTML-escapes through _to_telegram_html_body. |
| app/delivery/publish_findings/renderers/terminal.py | New _filter_confidence_sections and _render_rich_confidence_block work correctly for rich mode; plain renderer has a cosmetic issue where .strip() eats the intended blank-line separator before the confidence output. |
| app/state/agent_state.py | Adds confidence_band, ranked_hypotheses, missing_evidence to both AgentState TypedDict and AgentStateModel Pydantic model with correct defaults. |
| tests/agent/test_confidence_gating.py | Comprehensive new test file covering band thresholds, factory classmethod bands, all sufficiency gate scenarios (high-score/no-claims, medium-score/1-claim, medium-score/2-claims, healthy, unknown, prefix stripping), and band downgrade on gate fire. |
| app/pipeline/runners.py | Streams three new fields (confidence_band, ranked_hypotheses, missing_evidence) in the astream_investigation payload. |
| app/remote/renderer.py | Updates the diagnose node status string to include the confidence band (e.g. validity:HIGH(85%)) and passes new fields to render_report. |
| app/utils/openclaw_delivery.py | Appends bracketed band label (e.g. [HIGH]) to the confidence line in the OpenClaw report body. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[parse_diagnosis / LLM structured output] --> B[classify_confidence_band\nvalidity_score → high/medium/low]
    B --> C[InvestigationResult\nconfidence_band\nranked_hypotheses\nmissing_evidence]
    C --> D{check_sufficiency}
    D -->|"healthy / unknown category"| E[Gate PASSES\nstrip 'Most likely:' if present]
    D -->|"score ≥ 0.75 + ≥1 validated claim"| E
    D -->|"score ≥ 0.40 + ≥2 validated claims"| E
    D -->|otherwise| F[Gate FIRES\nadd 'Most likely:' prefix\nforce band = low]
    E --> G[_result_to_state]
    F --> G
    G --> H[build_report_context]
    H --> I[format_slack_message\nbuild_slack_blocks\nformat_telegram_message]
    H --> J[render_report\nrich terminal / plain terminal]
    I --> K[Slack / Telegram delivery]
    J --> L[Terminal output]

Reviews (6): Last reviewed commit: "fix(rca): downgrade confidence_band to l..." | Re-trigger Greptile
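The gate rules shown in the flowchart can be sketched as a single predicate. The name check_sufficiency is taken from this PR, but the signature and body below are assumptions: the actual function in app/agent/result.py may take an InvestigationResult or other arguments, and the category "resource_exhaustion" in the usage lines is a made-up example.

```python
def check_sufficiency(category: str, validity_score: float, validated_claims: int) -> bool:
    """Return True when the gate passes (assumed reconstruction from the flowchart)."""
    # Healthy and unknown results bypass the gate entirely.
    if category in {"healthy", "unknown"}:
        return True
    # A high score still needs at least one validated claim.
    if validity_score >= 0.75 and validated_claims >= 1:
        return True
    # A medium score needs at least two validated claims.
    if validity_score >= 0.40 and validated_claims >= 2:
        return True
    return False


print(check_sufficiency("resource_exhaustion", 0.90, 0))  # high score, no claims
print(check_sufficiency("resource_exhaustion", 0.50, 2))  # medium score, two claims
```

The first call returns False and the second True, which matches the "high-score/no-claims" and "medium-score/2-claims" scenarios listed for the test file.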

Comment thread app/agent/result.py
Comment thread app/agent/investigation.py Outdated
@devankitjuneja
Contributor Author

@greptile review

Comment thread app/agent/investigation.py Outdated
@devankitjuneja
Contributor Author

@greptile review

Comment thread app/delivery/publish_findings/formatters/report.py
@devankitjuneja
Contributor Author

@greptile review

@devankitjuneja
Contributor Author

@greptile review

Comment thread app/agent/investigation.py Outdated
@devankitjuneja
Contributor Author

@greptile review

Contributor

@VibhorGautam VibhorGautam left a comment

one edge worth documenting: the confidence band and sufficiency gate can intentionally disagree. a result can score high but still fail sufficiency because the evidence is thin, and investigation.py then downgrades the emitted band from high to low while adding the "Most likely:" qualifier

that cross-file override is the important behavior but it's not obvious from result.py alone. could you add a focused test for high-score + insufficient-evidence asserting the downgrade, and maybe a short inline note at the downgrade branch in investigation.py
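A rough sketch of what the requested inline note and focused test could look like. Per the Greptile summary, the band is forced to "low" whenever the gate fires. The helper apply_gate, its signature, and the sample root-cause text are hypothetical; the real branch lives in investigation.py and may be structured differently.

```python
PREFIX = "Most likely:"


def apply_gate(root_cause: str, band: str, sufficient: bool) -> tuple[str, str]:
    """Return (root_cause, band) after applying the gate outcome (hypothetical helper)."""
    # Normalize first: remove any prefix the LLM may have written itself.
    if root_cause.startswith(PREFIX):
        root_cause = root_cause[len(PREFIX):].lstrip()
    if sufficient:
        # Gate passed: keep the score-derived band and emit no qualifier.
        return root_cause, band
    # Gate fired: override the score-derived band. A high score with thin
    # evidence must not be reported as high, so the band is forced to "low".
    return f"{PREFIX} {root_cause}", "low"


def test_high_score_insufficient_evidence_downgrades_band() -> None:
    text, band = apply_gate("Disk full on worker node", band="high", sufficient=False)
    assert band == "low"
    assert text == "Most likely: Disk full on worker node"


test_high_score_insufficient_evidence_downgrades_band()
```

Passing the gate outcome in as a boolean keeps the override logic testable without constructing a full investigation result.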

@devankitjuneja
Contributor Author

Valid ask. An inline comment would definitely make our lives easier.
Also, the test is already covered by test_gate_downgrades_band_to_low_when_fired.
Thanks for the suggestion :)

Ankit Juneja added 8 commits May 20, 2026 12:18
Add confidence band (high/medium/low) to every investigation output,
a sufficiency gate that prefixes weak conclusions with "Most likely:",
per-claim evidence attribution, and ranked_hypotheses/missing_evidence
fields. Confidence band and uncertainty signals rendered in terminal,
Slack, Telegram, and delivery channels.
@devankitjuneja devankitjuneja force-pushed the feature/1368-confidence-aware-rca branch from 8592edb to a7a6524 Compare May 20, 2026 06:49
@cerencamkiran
Collaborator

One thing I’m a bit unsure about is that the sufficiency gate depends on validated_claims count + the LLM’s validity_score.

Right now, something like “2 validated claims + medium score” can still pass even if the actual evidence is fairly weak.

A few claims can still come from the same noisy signal or the same source, and the current logic doesn’t fully separate:

  • independent evidence vs repeating the same signal in different ways,
  • directly validated findings vs inferred correlations,
  • or conflicting evidence vs just missing evidence.

For a confidence-aware RCA system, I think the actual evidence quality and source diversity matter more than the raw claim count.

Good work overall. Thanks for your effort.
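One possible direction for the source-diversity concern raised above, sketched under loud assumptions: this is not part of the PR, the claim shape (a dict with an evidence_keys list) is hypothetical, and the function name independent_source_count is invented for illustration. The idea is to count distinct evidence sources across claims, not the claims themselves.

```python
def independent_source_count(claims: list[dict]) -> int:
    """Count distinct evidence keys cited across all validated claims."""
    sources: set[str] = set()
    for claim in claims:
        sources.update(claim.get("evidence_keys", []))
    return len(sources)


# Two claims that restate the same signal count as one source (illustrative data).
same_signal = [
    {"claim": "Error rate spiked", "evidence_keys": ["app_logs"]},
    {"claim": "Requests failed", "evidence_keys": ["app_logs"]},
]
print(independent_source_count(same_signal))
```

A follow-up gate could require a minimum number of distinct sources in addition to the claim count. This would only address repeated signals; it would not distinguish direct validation from inferred correlation, or conflicting evidence from missing evidence.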

@devankitjuneja
Contributor Author

Hi @cerencamkiran, thanks for the thorough review.
You're right: for a full confidence-aware RCA system, evidence quality and source diversity are non-negotiable.
This PR introduces an intentional MVP; happy to open a separate issue that extends it.

Development

Successfully merging this pull request may close these issues.

Confidence-Aware RCA: require evidence sufficiency before definitive root cause

3 participants