fix(evaluator): RAGAS 0.4.x score extraction + sample_id propagation (v0.2.3)#42
Conversation
… + sample_id propagation
… shape (C4-3) Add two new strategies at the top of _extract_ragas_score(): walk result.scores (list of per-sample dicts with MetricResult.value) and fall back to result.to_pandas() for the 0.4.x DataFrame path. Legacy 0.2.x/.get()/__getitem__/getattr paths remain as fallbacks for backward compatibility.
…_id > id > sequential (C4-4)
Summary by CodeRabbit
WalkthroughThis PR releases v0.2.3 with two evaluator bug fixes. It updates RAGAS score extraction to support RAGAS 0.4.x result shapes via per-sample metric extraction with fallback strategies, adds sample_id population in JSONL loading with field priority and sequential fallback, and bumps version numbers across all packages. Documentation and comprehensive tests are included. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md`:
- Line 191: The implementation's broad exception handler (the "except
Exception:" block in the evaluator) is missing the cosmetic suppression comment
present in the plan; update that except Exception line to append the same
comment "# noqa: BLE001 — defensive fallback" so the code matches the
documentation and explicitly documents the intentional defensive fallback.
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`:
- Around line 43-113: The design spec references a missing helper
_extract_per_sample_scores() for populating sample_results, but the code
currently only implements aggregate extraction in _extract_ragas_score(); add a
short TODO comment near the top of this module (or immediately above
_extract_ragas_score) stating that per-sample extraction via
_extract_per_sample_scores() is planned as a follow-up, or update the
design/spec to mark that per-sample population is deferred, so reviewers know
this is intentional and not a bug.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d881b010-4d7c-41c1-9191-8c026ad067a9
📒 Files selected for processing (15)
docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.mddocs/superpowers/specs/2026-04-16-v023-evaluator-bugfixes-design.mdpackages/cli/package.jsonpackages/core/pyproject.tomlpackages/core/src/rag_forge_core/__init__.pypackages/evaluator/pyproject.tomlpackages/evaluator/src/rag_forge_evaluator/__init__.pypackages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.pypackages/evaluator/src/rag_forge_evaluator/input_loader.pypackages/evaluator/tests/test_input_loader.pypackages/evaluator/tests/test_ragas_extractor.pypackages/mcp/package.jsonpackages/observability/pyproject.tomlpackages/observability/src/rag_forge_observability/__init__.pypackages/shared/package.json
📜 Review details
🔇 Additional comments (24)
packages/shared/package.json (1)
3-3: Version bump is correct and consistent.The package version update to
0.2.3matches the coordinated release metadata changes.packages/mcp/package.json (1)
3-3: LGTM for release metadata update.
@rag-forge/mcpversion bump to0.2.3is clean and aligned with the PR scope.packages/observability/pyproject.toml (1)
3-3: Package metadata bump looks good.
rag-forge-observabilityversion is correctly updated to0.2.3.packages/cli/package.json (1)
3-3: Version update is correct.The
@rag-forge/climanifest bump to0.2.3is consistent with the lockstep release.packages/evaluator/pyproject.toml (1)
3-3: Evaluator package version bump is valid.
rag-forge-evaluatoris correctly set to0.2.3.packages/observability/src/rag_forge_observability/__init__.py (1)
5-5:__version__update is consistent.The exported runtime version now correctly reflects
0.2.3.packages/evaluator/src/rag_forge_evaluator/__init__.py (1)
3-3: Runtime version bump is correct.
__version__ = "0.2.3"is aligned with package metadata.packages/core/src/rag_forge_core/__init__.py (1)
3-3: Core version constant update looks good.The
rag_forge_corepublic version is correctly bumped to0.2.3.packages/core/pyproject.toml (1)
3-3: LGTM — Version bump aligns with lockstep release convention.packages/evaluator/src/rag_forge_evaluator/input_loader.py (2)
40-45: LGTM — Sample ID extraction with correct priority and fallback.The
orchaining correctly treats empty strings as falsy, ensuring they fall through to the next candidate. The sequential fallbacksample-{line_num:03d}provides deterministic IDs for files without identifiers.One minor edge case: whitespace-only strings like
" "would be truthy and used as the ID. Consider whether.strip()should be applied, though this is low priority if your JSONL sources don't produce such values.
47-56: LGTM — Sample construction correctly includes the extracted sample_id.The
sample_idparameter is correctly passed toEvaluationSample, which will propagate to downstream report generation (eliminating the"(unknown)"fallback).packages/evaluator/tests/test_ragas_extractor.py (4)
96-115: LGTM — Good coverage of RAGAS 0.4.xresult.scoreswithMetricResultobjects.The test correctly simulates the 0.4.x shape where
scoresis a list of per-sample dicts containingMetricResultobjects with.valueattributes. The averaging assertion (0.80for[0.90, 0.70]) validates the implementation.
118-130: LGTM — Covers the plain-float edge case inresult.scores.This test ensures the extractor handles cases where metrics return plain floats instead of
MetricResultobjects, exercising the fallback path that attemptsfloat(raw)directly.
133-168: LGTM — Well-designed mock forto_pandas()fallback without pandas dependency.The duck-typed
_FakeColumnand_FakeDataFrameclasses correctly simulate the pandas interfaces used by the implementation (dropna(),__len__(),mean(),columns,__getitem__). The emptyscoreslist forces the fallback path.
171-180: LGTM — ValidatesValueErrorwhen metric is missing from all extraction paths.This test ensures the extractor properly raises
ValueErrorwhen the metric isn't present inscoresand noto_pandas()fallback is available, matching the documented behavior of no silent0.0fallback.packages/evaluator/tests/test_input_loader.py (3)
54-92: LGTM — Comprehensive tests for sample_id field priority.The tests thoroughly cover the extraction priority order (
case_id>sample_id>id) and verify thatcase_idtakes precedence when both fields exist.
94-104: LGTM — Validates sequential fallback format.Testing both
sample-001andsample-002ensures the line-number-based fallback produces the expected zero-padded format.
106-120: LGTM — Critical test for empty-string fallthrough behavior.This test validates that empty strings are treated as falsy in the
orchain, correctly falling through to the next candidate (sample_id: "real-id"). This is essential for robustness when JSONL files have emptycase_idfields.packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py (3)
14-14: LGTM — Clean import for contextlib.suppress.
60-77: LGTM — Strategy 1 correctly handles RAGAS 0.4.xresult.scoresshape.The implementation properly:
- Guards against empty lists with
and scores_attr- Extracts
.valuefromMetricResultobjects first- Falls back to direct float conversion for plain floats
- Uses
contextlib.suppressfor defensive error handling- Returns the mean of collected values
79-89: LGTM — Strategy 2 providesto_pandas()DataFrame fallback.The implementation correctly:
- Checks
callable(to_pandas)before invocation- Validates column existence before access
- Guards against empty columns with
len(col) > 0- Uses broad exception handling for defensive fallback
docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md (1)
1-11: LGTM — Clear implementation plan with TDD approach.The plan follows a disciplined test-first workflow: failing tests committed first, then implementation, then verification. The task breakdown is logical and the verification commands are actionable.
docs/superpowers/specs/2026-04-16-v023-evaluator-bugfixes-design.md (2)
112-140: Spec-implementation gap: Per-sample extraction described but not implemented.Section 2 ("Add per-sample RAGAS score extraction") and Section 3 ("Update
RagasEvaluator.evaluate()caller logic") describe functionality for populatingsample_resultswith per-sample RAGAS scores. This is not implemented in the current PR — the implementation only extracts aggregate scores.Consider adding a note to the spec indicating this is planned for a future iteration, or update the spec to reflect the actual v0.2.3 scope (aggregate-only extraction fix).
184-238: LGTM — Sample ID fix section accurately describes the implementation.The root cause analysis, fix approach, and field priority order (
case_id>sample_id>id> sequential) match the implementation exactly. The documentation of downstream consumers benefiting automatically is helpful context.
| col = df[name].dropna() | ||
| if len(col) > 0: | ||
| return float(col.mean()) | ||
| except Exception: # noqa: BLE001 — defensive fallback |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Minor inconsistency: Plan includes # noqa: BLE001 comment not in implementation.
Line 191 shows except Exception: # noqa: BLE001 — defensive fallback but the actual implementation at ragas_evaluator.py:88 uses just except Exception: without the noqa comment. This is cosmetic — the broad exception handling is intentional for defensive fallback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md` at line 191,
The implementation's broad exception handler (the "except Exception:" block in
the evaluator) is missing the cosmetic suppression comment present in the plan;
update that except Exception line to append the same comment "# noqa: BLE001 —
defensive fallback" so the code matches the documentation and explicitly
documents the intentional defensive fallback.
| def _extract_ragas_score(result: object, name: str) -> float: | ||
| """Extract a metric score from a ragas result object. | ||
| """Extract an aggregate metric score from a ragas result object. | ||
|
|
||
| Raises ValueError if the score cannot be extracted — the caller | ||
| decides whether to record a SkipRecord or re-raise. No silent 0.0 | ||
| fallback (that was the bug surfaced by Cycle 2). | ||
| Tries extraction strategies in order of RAGAS version likelihood: | ||
|
|
||
| 1. ``.scores`` list (RAGAS 0.4.x) — per-sample ``MetricResult`` | ||
| objects whose float score lives at ``.value``. | ||
| 2. ``.to_pandas()`` (RAGAS 0.4.x fallback) — DataFrame with metric | ||
| names as columns and float values as cells. | ||
| 3. ``.get()`` (RAGAS 0.2.x) — dict-like access. | ||
| 4. ``[]`` indexing (generic). | ||
| 5. ``getattr`` (generic). | ||
|
|
||
| ragas 0.2.x returns a dict-like result supporting ``.get()``. | ||
| ragas 0.4.x returns an ``EvaluationResult`` dataclass; ``__getitem__`` | ||
| works on it but ``.get()`` does not. | ||
| ragas 0.3.x sits between the two with intermediate forms. | ||
| Raises ``ValueError`` if all strategies fail — the caller decides | ||
| whether to record a ``SkipRecord`` or re-raise. No silent 0.0 | ||
| fallback (that was the bug surfaced by Cycle 2). | ||
| """ | ||
| # --- Strategy 1: RAGAS 0.4.x .scores attribute --- | ||
| # result.scores is a list[dict[str, MetricResult | float]], one dict | ||
| # per sample. MetricResult wraps the float at .value. | ||
| scores_attr = getattr(result, "scores", None) | ||
| if isinstance(scores_attr, list) and scores_attr: | ||
| values: list[float] = [] | ||
| for entry in scores_attr: | ||
| if isinstance(entry, dict) and name in entry: | ||
| raw = entry[name] | ||
| val = getattr(raw, "value", None) | ||
| if val is not None: | ||
| with contextlib.suppress(TypeError, ValueError): | ||
| values.append(float(val)) | ||
| else: | ||
| with contextlib.suppress(TypeError, ValueError): | ||
| values.append(float(raw)) | ||
| if values: | ||
| return sum(values) / len(values) | ||
|
|
||
| # --- Strategy 2: RAGAS 0.4.x .to_pandas() fallback --- | ||
| to_pandas = getattr(result, "to_pandas", None) | ||
| if callable(to_pandas): | ||
| try: | ||
| df = to_pandas() | ||
| if name in df.columns: | ||
| col = df[name].dropna() | ||
| if len(col) > 0: | ||
| return float(col.mean()) | ||
| except Exception: | ||
| pass | ||
|
|
||
| # --- Strategy 3: RAGAS 0.2.x dict-like .get() --- | ||
| if hasattr(result, "get"): | ||
| try: | ||
| value = result.get(name, None) | ||
| if value is not None: | ||
| return float(value) | ||
| except (TypeError, ValueError): | ||
| pass | ||
|
|
||
| # --- Strategy 4: generic __getitem__ --- | ||
| try: | ||
| return float(result[name]) # type: ignore[index] | ||
| except (KeyError, TypeError, ValueError, IndexError): | ||
| pass | ||
|
|
||
| # --- Strategy 5: generic attribute access --- | ||
| if hasattr(result, name): | ||
| try: | ||
| return float(getattr(result, name)) | ||
| except (TypeError, ValueError): | ||
| pass | ||
|
|
||
| raise ValueError(f"could not extract ragas score for metric {name!r}") |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Design spec mentions _extract_per_sample_scores() helper that is not implemented.
The design spec (Section 2, lines 112-131) describes adding a _extract_per_sample_scores() function for populating sample_results with per-sample RAGAS scores. The current implementation only extracts aggregate scores (averaging in _extract_ragas_score()).
This appears intentional — the PR objective (C4-3) is to fix the ValueError, which this implementation achieves. The per-sample result population would be a follow-up enhancement. Consider either:
- Updating the design spec to mark per-sample extraction as future work, or
- Adding a TODO comment noting this planned enhancement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py` around
lines 43 - 113, The design spec references a missing helper
_extract_per_sample_scores() for populating sample_results, but the code
currently only implements aggregate extraction in _extract_ragas_score(); add a
short TODO comment near the top of this module (or immediately above
_extract_ragas_score) stating that per-sample extraction via
_extract_per_sample_scores() is planned as a follow-up, or update the
design/spec to mark that per-sample population is deferred, so reviewers know
this is intentional and not a bug.
Summary
Fixes the last two open RAG-Forge bugs identified during PearMedica Cycle 4 audit (2026-04-16):
ValueError— the evaluator runs all RAGAS evaluations to completion but can't read scores from the 0.4.xEvaluationResultobject. Rewrites_extract_ragas_score()to handle the.scoreslist ofMetricResultobjects (with.valueattribute), adds.to_pandas()fallback, preserves legacy 0.2.x/0.3.x paths.sample_id: "(unknown)"in reports — the JSONL input loader never extracted case identifiers. Now readscase_id > sample_id > id > sequential fallbackfrom telemetry JSONL.Three consecutive cycles of RAGAS failure (Cycles 2, 3, 4), each at a different layer. This fix addresses the final layer — result parsing.
Test plan
.value, plain floats,.to_pandas()fallback, missing metric)__version__matchespyproject.tomlacross all 3 Python packages--evaluator ragasand confirm scores are extracted for the first time