feat(rca): confidence-aware RCA with sufficiency gate #2159
devankitjuneja wants to merge 8 commits
Greptile Summary
This PR adds a deterministic confidence band (
Confidence Score: 5/5
Safe to merge: all gate, prefix, and band-downgrade corrections are in place, test coverage is thorough, and no correctness regressions were found. The sufficiency gate logic is deterministic and well-tested; the delivery layer changes are additive and isolated; all previously raised correctness issues have been addressed in the current diff. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[parse_diagnosis / LLM structured output] --> B[classify_confidence_band\nvalidity_score → high/medium/low]
B --> C[InvestigationResult\nconfidence_band\nranked_hypotheses\nmissing_evidence]
C --> D{check_sufficiency}
D -->|"healthy / unknown category"| E[Gate PASSES\nstrip 'Most likely:' if present]
D -->|"score ≥ 0.75 + ≥1 validated claim"| E
D -->|"score ≥ 0.40 + ≥2 validated claims"| E
D -->|otherwise| F[Gate FIRES\nadd 'Most likely:' prefix\nforce band = low]
E --> G[_result_to_state]
F --> G
G --> H[build_report_context]
H --> I[format_slack_message\nbuild_slack_blocks\nformat_telegram_message]
H --> J[render_report\nrich terminal / plain terminal]
I --> K[Slack / Telegram delivery]
J --> L[Terminal output]
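The gate in the flowchart above can be sketched in a few lines of Python. This is an illustration only, not the code in the diff: the `Result` dataclass, `apply_gate`, and the `EXEMPT_CATEGORIES` set are hypothetical names, and the band cut-offs are assumed to reuse the gate thresholds (0.75 and 0.40), which the flowchart does not state.

```python
from dataclasses import dataclass

# Gate thresholds taken from the flowchart edges.
HIGH_SCORE, MEDIUM_SCORE = 0.75, 0.40
PREFIX = "Most likely: "

# Assumption: how "healthy / unknown category" is represented is not shown.
EXEMPT_CATEGORIES = {"healthy", "unknown"}


@dataclass
class Result:
    """Hypothetical stand-in for InvestigationResult."""
    root_cause: str
    category: str
    validity_score: float
    validated_claims: int
    confidence_band: str = "low"


def classify_confidence_band(score: float) -> str:
    # Assumption: band cut-offs mirror the gate thresholds.
    if score >= HIGH_SCORE:
        return "high"
    if score >= MEDIUM_SCORE:
        return "medium"
    return "low"


def check_sufficiency(r: Result) -> bool:
    """Return True when the gate passes."""
    if r.category in EXEMPT_CATEGORIES:
        return True
    if r.validity_score >= HIGH_SCORE and r.validated_claims >= 1:
        return True
    return r.validity_score >= MEDIUM_SCORE and r.validated_claims >= 2


def apply_gate(r: Result) -> Result:
    r.confidence_band = classify_confidence_band(r.validity_score)
    if check_sufficiency(r):
        # Gate passes: strip the qualifier if the LLM already added it.
        if r.root_cause.startswith(PREFIX):
            r.root_cause = r.root_cause[len(PREFIX):]
    else:
        # Gate fires: add the qualifier once and force the band to low.
        if not r.root_cause.startswith(PREFIX):
            r.root_cause = PREFIX + r.root_cause
        r.confidence_band = "low"
    return r
```

Under these assumptions, a result scoring 0.9 with zero validated claims comes out with band low and the "Most likely:" prefix, while the same score with one validated claim keeps band high.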
Reviews (6): Last reviewed commit: "fix(rca): downgrade confidence_band to l..."
@greptile review
VibhorGautam left a comment
one edge worth documenting: the confidence band and sufficiency gate can intentionally disagree. a result can score high but still fail sufficiency because the evidence is thin, and investigation.py then downgrades the emitted band from high to low while adding the "Most likely:" qualifier.
that cross-file override is the important behavior, but it's not obvious from result.py alone. could you add a focused test for high-score + insufficient-evidence asserting the downgrade, and maybe a short inline note at the downgrade branch in investigation.py?
Valid ask. An inline comment would definitely make our lives easier.
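A focused test along the lines requested might look like the sketch below. It is self-contained for illustration: `emitted_band` is a hypothetical stand-in for the real gate (a real test would import from investigation.py, whose API is not shown here), the band cut-offs are assumed, and the forced band follows the flowchart's "force band = low".

```python
from typing import Tuple


def emitted_band(score: float, validated_claims: int) -> Tuple[str, bool]:
    """Hypothetical stand-in for the gate: returns (band, gate_fired)."""
    # Assumed band cut-offs; the gate thresholds come from the flowchart.
    band = "high" if score >= 0.75 else "medium" if score >= 0.40 else "low"
    sufficient = (score >= 0.75 and validated_claims >= 1) or (
        score >= 0.40 and validated_claims >= 2
    )
    if not sufficient:
        # Downgrade branch: the gate overrides the score-derived band.
        return "low", True
    return band, False


def test_high_score_without_validated_claims_is_downgraded():
    band, fired = emitted_band(score=0.92, validated_claims=0)
    assert fired
    assert band == "low"


def test_high_score_with_one_validated_claim_keeps_band():
    band, fired = emitted_band(score=0.92, validated_claims=1)
    assert not fired
    assert band == "high"
```

The two tests bracket the override: the same score yields different emitted bands depending only on the validated claim count.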
Add confidence band (high/medium/low) to every investigation output, a sufficiency gate that prefixes weak conclusions with "Most likely:", per-claim evidence attribution, and ranked_hypotheses/missing_evidence fields. Confidence band and uncertainty signals rendered in terminal, Slack, Telegram, and delivery channels.
8592edb to a7a6524
One thing I’m a bit unsure about is that the sufficiency gate depends on Right now, something like “2 validated claims + medium score” can still pass even if the actual evidence is fairly weak. A few claims can still come from the same noisy signal or the same source, and the current logic doesn’t fully separate:
For a confidence-aware RCA system, I think the actual evidence quality and source diversity matter more than the raw claim count. Good work overall. Thanks for your effort.
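One way to picture the source-diversity concern is to count distinct evidence sources instead of raw validated claims. The sketch below is not part of this PR: the `Claim` dataclass, its `source` field, and both helper functions are hypothetical, and the thresholds are simply the ones shown in the flowchart.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; `source` might be "logs", "metrics", etc."""
    text: str
    source: str
    validated: bool


def independent_support(claims: Iterable[Claim]) -> int:
    """Count distinct sources among validated claims only."""
    return len({c.source for c in claims if c.validated})


def is_sufficient(score: float, claims: Iterable[Claim]) -> bool:
    # Same thresholds as the flowchart, applied to distinct sources
    # rather than to the raw validated-claim count.
    n = independent_support(claims)
    return (score >= 0.75 and n >= 1) or (score >= 0.40 and n >= 2)
```

With this variant, two validated claims that both come from the same log stream would no longer clear the medium-score path, while one claim from logs plus one from metrics would.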
Hi @cerencamkiran, thanks for a thorough review.
Fixes #1368
Describe the changes you have made in this PR -
Adds a confidence band (HIGH/MEDIUM/LOW) to every investigation output, backed by a deterministic sufficiency gate.
ranked_hypotheses and missing_evidence populated when confidence is MEDIUM/LOW
Demo/Screenshot for feature changes and bug fixes -
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
The confidence band is computed deterministically from validity_score (LLM-assigned, 0–1) using fixed thresholds, with no second LLM call for the band itself. The sufficiency gate runs after parse_diagnosis() and checks both score and validated claim count, because a high LLM-reported score with zero validated claims should not pass on its own. Per-claim evidence attribution uses a _ValidatedClaimSchema Pydantic model so the LLM attributes specific evidence keys per claim. The terminal renderer uses a filter pass to prevent the inline confidence section (embedded in the Slack message string) from double-printing alongside the styled confidence block.
Checklist before requesting a review