fix(evaluator): RAGAS 0.4.x score extraction + sample_id propagation (v0.2.3) by hallengray · Pull Request #42 · hallengray/rag-forge

hallengray · 2026-04-16T13:16:03Z

Summary

Fixes the last two open RAG-Forge bugs identified during the PearMedica Cycle 4 audit (2026-04-16):

- C4-3 (HIGH): RAGAS score extraction ValueError. The evaluator runs all RAGAS evaluations to completion but can't read scores from the 0.4.x EvaluationResult object. Rewrites _extract_ragas_score() to handle the .scores list of per-sample dicts holding MetricResult objects (with a .value attribute), adds a .to_pandas() fallback, and preserves the legacy 0.2.x/0.3.x paths.
- C4-4 (LOW): sample_id: "(unknown)" in reports. The JSONL input loader never extracted case identifiers. It now reads case_id > sample_id > id > sequential fallback from telemetry JSONL.
- Version bump: All 6 packages bumped to 0.2.3 (lockstep).

Three consecutive cycles of RAGAS failure (Cycles 2, 3, 4), each at a different layer. This fix addresses the final layer — result parsing.

Test plan

- 4 new RAGAS 0.4.x extraction tests (MetricResult .value, plain floats, .to_pandas() fallback, missing metric)
- 6 new sample_id tests (case_id, sample_id, id, priority ordering, sequential fallback, empty-string fallthrough)
- 296 tests pass, 3 skipped (pre-existing env skips: Playwright not in dev venv)
- ruff clean, mypy clean, pnpm build/lint/typecheck clean
- Version drift guard confirms __version__ matches pyproject.toml across all 3 Python packages
- Validate with PearMedica Cycle 5: re-run --evaluator ragas and confirm scores are extracted for the first time

… shape (C4-3) Add two new strategies at the top of _extract_ragas_score(): walk result.scores (list of per-sample dicts with MetricResult.value) and fall back to result.to_pandas() for the 0.4.x DataFrame path. Legacy 0.2.x/.get()/__getitem__/getattr paths remain as fallbacks for backward compatibility.

coderabbitai · 2026-04-16T13:17:49Z

Summary by CodeRabbit

Bug Fixes
- Enhanced RAGAS score extraction with additional fallback strategies
- Improved sample ID population in JSONL input loading with prioritized field selection
Tests
- Added test coverage for RAGAS score extraction variations
- Added test coverage for sample ID extraction from JSONL inputs
Documentation
- Added implementation plan for v0.2.3 evaluator improvements
- Added design specification for evaluator changes
Chores
- Version bumped to 0.2.3 across all packages

Walkthrough

This PR releases v0.2.3 with two evaluator bug fixes. It updates RAGAS score extraction to support RAGAS 0.4.x result shapes via per-sample metric extraction with fallback strategies, adds sample_id population in JSONL loading with field priority and sequential fallback, and bumps version numbers across all packages. Documentation and comprehensive tests are included.

Changes

Cohort / File(s)	Summary
Documentation & Planning `docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md`, `docs/superpowers/specs/2026-04-16-v023-evaluator-bugfixes-design.md`	Added implementation plan and design specification for v0.2.3 evaluator bug fixes covering RAGAS 0.4.x compatibility and sample_id extraction with detailed scope, implementation steps, and test coverage requirements.
Version Bumps `packages/cli/package.json`, `packages/core/pyproject.toml`, `packages/core/src/rag_forge_core/__init__.py`, `packages/evaluator/pyproject.toml`, `packages/evaluator/src/rag_forge_evaluator/__init__.py`, `packages/mcp/package.json`, `packages/observability/pyproject.toml`, `packages/observability/src/rag_forge_observability/__init__.py`, `packages/shared/package.json`	Updated package versions from 0.2.2 to 0.2.3 across all package manifests and `__init__.py` exports.
RAGAS Score Extraction `packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`, `packages/evaluator/tests/test_ragas_extractor.py`	Rewrote `_extract_ragas_score()` with five-strategy fallback chain: extract per-sample metrics from `.scores` list, fall back to `.to_pandas()` conversion with mean aggregation, then try legacy dict/index/attribute access patterns. Added tests for RAGAS 0.4.x mock shapes and fallback paths.
JSONL Sample ID Extraction `packages/evaluator/src/rag_forge_evaluator/input_loader.py`, `packages/evaluator/tests/test_input_loader.py`	Updated `load_jsonl()` to populate `sample_id` from JSONL fields with priority (`case_id` > `sample_id` > `id`) or generate deterministic `sample-{NNN}` sequential fallback. Added tests validating field priority, fallback behavior, and empty string handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Score extraction takes a hop,
Sampling IDs never stop,
RAGAS bounces, version bumps align,
Four-point-oh compatibility—simply divine!
New test cases, fallback chains so fine. 🔄✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the two main fixes (RAGAS 0.4.x score extraction and sample_id propagation) and version bump, directly corresponding to the core changes in the PR.
Description check	✅ Passed	The description is highly detailed and directly related to the changeset, explaining both high-priority bug fixes (C4-3 and C4-4), the technical approach, version bumping, test plan, and validation strategy.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md`:
- Line 191: The implementation's broad exception handler (the "except
Exception:" block in the evaluator) is missing the cosmetic suppression comment
present in the plan; update that except Exception line to append the same
comment "# noqa: BLE001 — defensive fallback" so the code matches the
documentation and explicitly documents the intentional defensive fallback.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`:
- Around line 43-113: The design spec references a missing helper
_extract_per_sample_scores() for populating sample_results, but the code
currently only implements aggregate extraction in _extract_ragas_score(); add a
short TODO comment near the top of this module (or immediately above
_extract_ragas_score) stating that per-sample extraction via
_extract_per_sample_scores() is planned as a follow-up, or update the
design/spec to mark that per-sample population is deferred, so reviewers know
this is intentional and not a bug.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d881b010-4d7c-41c1-9191-8c026ad067a9

📥 Commits

Reviewing files that changed from the base of the PR and between 8f45df5 and 5f08605.

📒 Files selected for processing (15)

docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md
docs/superpowers/specs/2026-04-16-v023-evaluator-bugfixes-design.md
packages/cli/package.json
packages/core/pyproject.toml
packages/core/src/rag_forge_core/__init__.py
packages/evaluator/pyproject.toml
packages/evaluator/src/rag_forge_evaluator/__init__.py
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
packages/evaluator/src/rag_forge_evaluator/input_loader.py
packages/evaluator/tests/test_input_loader.py
packages/evaluator/tests/test_ragas_extractor.py
packages/mcp/package.json
packages/observability/pyproject.toml
packages/observability/src/rag_forge_observability/__init__.py
packages/shared/package.json

📜 Review details

🔇 Additional comments (24)

packages/shared/package.json (1)

3-3: Version bump is correct and consistent.

The package version update to 0.2.3 matches the coordinated release metadata changes.

packages/mcp/package.json (1)

3-3: LGTM for release metadata update.

@rag-forge/mcp version bump to 0.2.3 is clean and aligned with the PR scope.

packages/observability/pyproject.toml (1)

3-3: Package metadata bump looks good.

rag-forge-observability version is correctly updated to 0.2.3.

packages/cli/package.json (1)

3-3: Version update is correct.

The @rag-forge/cli manifest bump to 0.2.3 is consistent with the lockstep release.

packages/evaluator/pyproject.toml (1)

3-3: Evaluator package version bump is valid.

rag-forge-evaluator is correctly set to 0.2.3.

packages/observability/src/rag_forge_observability/__init__.py (1)

5-5: __version__ update is consistent.

The exported runtime version now correctly reflects 0.2.3.

packages/evaluator/src/rag_forge_evaluator/__init__.py (1)

3-3: Runtime version bump is correct.

__version__ = "0.2.3" is aligned with package metadata.

packages/core/src/rag_forge_core/__init__.py (1)

3-3: Core version constant update looks good.

The rag_forge_core public version is correctly bumped to 0.2.3.

packages/core/pyproject.toml (1)

3-3: LGTM — Version bump aligns with lockstep release convention.

packages/evaluator/src/rag_forge_evaluator/input_loader.py (2)

40-45: LGTM — Sample ID extraction with correct priority and fallback.

The or chaining correctly treats empty strings as falsy, ensuring they fall through to the next candidate. The sequential fallback sample-{line_num:03d} provides deterministic IDs for files without identifiers.

One minor edge case: whitespace-only strings like " " would be truthy and used as the ID. Consider whether .strip() should be applied, though this is low priority if your JSONL sources don't produce such values.
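The priority chain and the whitespace edge case described above can be sketched as follows. This is a minimal illustration, not the code in `input_loader.py`: the function names `resolve_sample_id` and `resolve_sample_id_stripped` are hypothetical, and the 1-based `line_num` is an assumption based on the `sample-001` fallback mentioned in the review.

```python
from typing import Any


def resolve_sample_id(record: dict[str, Any], line_num: int) -> str:
    """Hypothetical helper: case_id > sample_id > id > sequential fallback."""
    # `or` treats empty strings (and missing keys) as falsy, so they fall
    # through to the next candidate.
    candidate = record.get("case_id") or record.get("sample_id") or record.get("id")
    return str(candidate) if candidate else f"sample-{line_num:03d}"


def resolve_sample_id_stripped(record: dict[str, Any], line_num: int) -> str:
    """Variant that also rejects whitespace-only strings (the optional .strip())."""
    for key in ("case_id", "sample_id", "id"):
        value = record.get(key)
        if isinstance(value, str):
            value = value.strip()
        if value:
            return str(value)
    return f"sample-{line_num:03d}"


if __name__ == "__main__":
    # Empty case_id falls through to sample_id.
    print(resolve_sample_id({"case_id": "", "sample_id": "real-id"}, 1))
    # Whitespace-only case_id is kept by the plain chain but skipped by the variant.
    print(repr(resolve_sample_id({"case_id": " ", "id": "x"}, 1)))
    print(resolve_sample_id_stripped({"case_id": " ", "id": "x"}, 1))
```

The stripped variant trades a slightly longer loop for IDs that are never blank in reports; whether that matters depends on what the JSONL sources actually emit.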

47-56: LGTM — Sample construction correctly includes the extracted sample_id.

The sample_id parameter is correctly passed to EvaluationSample, which will propagate to downstream report generation (eliminating the "(unknown)" fallback).

packages/evaluator/tests/test_ragas_extractor.py (4)

96-115: LGTM — Good coverage of RAGAS 0.4.x result.scores with MetricResult objects.

The test correctly simulates the 0.4.x shape where scores is a list of per-sample dicts containing MetricResult objects with .value attributes. The averaging assertion (0.80 for [0.90, 0.70]) validates the implementation.

118-130: LGTM — Covers the plain-float edge case in result.scores.

This test ensures the extractor handles cases where metrics return plain floats instead of MetricResult objects, exercising the fallback path that attempts float(raw) directly.

133-168: LGTM — Well-designed mock for to_pandas() fallback without pandas dependency.

The duck-typed _FakeColumn and _FakeDataFrame classes correctly simulate the pandas interfaces used by the implementation (dropna(), __len__(), mean(), columns, __getitem__). The empty scores list forces the fallback path.
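For readers without the test file open, a duck-typed stand-in for the pandas interfaces named above might look like the sketch below. The class names, the `FakeResult` wrapper, and the `mean_from_frame` helper are all hypothetical, and this fake treats `None` as the missing value, whereas real pandas `dropna()` drops NaN/NA entries.

```python
from typing import Optional


class FakeColumn:
    """Stands in for a pandas Series: dropna(), __len__(), mean()."""

    def __init__(self, values):
        self._values = list(values)

    def dropna(self) -> "FakeColumn":
        # Assumption: None marks a missing cell in this fake.
        return FakeColumn(v for v in self._values if v is not None)

    def __len__(self) -> int:
        return len(self._values)

    def mean(self) -> float:
        return sum(self._values) / len(self._values)


class FakeDataFrame:
    """Stands in for a pandas DataFrame: .columns and __getitem__."""

    def __init__(self, data: dict):
        self._data = data
        self.columns = list(data)

    def __getitem__(self, name: str) -> FakeColumn:
        return FakeColumn(self._data[name])


class FakeResult:
    """Result object with an empty .scores list, so only to_pandas() can succeed."""

    scores: list = []

    def __init__(self, frame: FakeDataFrame):
        self._frame = frame

    def to_pandas(self) -> FakeDataFrame:
        return self._frame


def mean_from_frame(result: object, name: str) -> Optional[float]:
    """Condensed sketch of the DataFrame path; returns None when it does not apply."""
    to_pandas = getattr(result, "to_pandas", None)
    if not callable(to_pandas):
        return None
    df = to_pandas()
    if name not in df.columns:
        return None
    col = df[name].dropna()
    return float(col.mean()) if len(col) > 0 else None


if __name__ == "__main__":
    result = FakeResult(FakeDataFrame({"faithfulness": [0.5, None, 1.0]}))
    print(mean_from_frame(result, "faithfulness"))
```

Keeping the fake this small means the test suite does not need pandas installed, at the cost of only exercising the handful of methods the extractor actually calls.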

171-180: LGTM — Validates ValueError when metric is missing from all extraction paths.

This test ensures the extractor properly raises ValueError when the metric isn't present in scores and no to_pandas() fallback is available, matching the documented behavior of no silent 0.0 fallback.

packages/evaluator/tests/test_input_loader.py (3)

54-92: LGTM — Comprehensive tests for sample_id field priority.

The tests thoroughly cover the extraction priority order (case_id > sample_id > id) and verify that case_id takes precedence when both fields exist.

94-104: LGTM — Validates sequential fallback format.

Testing both sample-001 and sample-002 ensures the line-number-based fallback produces the expected zero-padded format.

106-120: LGTM — Critical test for empty-string fallthrough behavior.

This test validates that empty strings are treated as falsy in the or chain, correctly falling through to the next candidate (sample_id: "real-id"). This is essential for robustness when JSONL files have empty case_id fields.

packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py (3)

14-14: LGTM — Clean import for contextlib.suppress.

60-77: LGTM — Strategy 1 correctly handles RAGAS 0.4.x result.scores shape.

The implementation properly:

- Guards against empty lists with `and scores_attr`
- Extracts `.value` from MetricResult objects first
- Falls back to direct float conversion for plain floats
- Uses `contextlib.suppress` for defensive error handling
- Returns the mean of collected values
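The behaviour listed above can be tried in isolation with a small stand-in for the 0.4.x result shape. This is a condensed sketch rather than the merged function: `FakeMetricResult`, `FakeEvaluationResult`, and `mean_from_scores` are hypothetical names, and the shape of `.scores` (a list of per-sample dicts) is taken from the PR description, not from the RAGAS source.

```python
import contextlib
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMetricResult:
    """Stand-in for a result object that wraps its float at .value."""

    value: float


class FakeEvaluationResult:
    """Stand-in exposing only the .scores attribute."""

    def __init__(self, scores: list):
        self.scores = scores


def mean_from_scores(result: object, name: str) -> Optional[float]:
    """Average one metric across per-sample dicts; None if nothing was usable."""
    scores = getattr(result, "scores", None)
    if not (isinstance(scores, list) and scores):
        return None
    values: list[float] = []
    for entry in scores:
        if isinstance(entry, dict) and name in entry:
            raw = entry[name]
            # Prefer .value when present, otherwise treat raw as a plain number.
            candidate = getattr(raw, "value", raw)
            with contextlib.suppress(TypeError, ValueError):
                values.append(float(candidate))
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    result = FakeEvaluationResult(
        [{"faithfulness": FakeMetricResult(0.90)}, {"faithfulness": 0.70}]
    )
    mean = mean_from_scores(result, "faithfulness")
    print(mean is not None and math.isclose(mean, 0.80))
```

Returning `None` instead of raising keeps this sketch composable with later fallbacks; the merged function raises `ValueError` only after every strategy has been tried.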

79-89: LGTM — Strategy 2 provides to_pandas() DataFrame fallback.

The implementation correctly:

- Checks `callable(to_pandas)` before invocation
- Validates column existence before access
- Guards against empty columns with `len(col) > 0`
- Uses broad exception handling for defensive fallback

docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md (1)

1-11: LGTM — Clear implementation plan with TDD approach.

The plan follows a disciplined test-first workflow: failing tests committed first, then implementation, then verification. The task breakdown is logical and the verification commands are actionable.

docs/superpowers/specs/2026-04-16-v023-evaluator-bugfixes-design.md (2)

112-140: Spec-implementation gap: Per-sample extraction described but not implemented.

Section 2 ("Add per-sample RAGAS score extraction") and Section 3 ("Update RagasEvaluator.evaluate() caller logic") describe functionality for populating sample_results with per-sample RAGAS scores. This is not implemented in the current PR — the implementation only extracts aggregate scores.

Consider adding a note to the spec indicating this is planned for a future iteration, or update the spec to reflect the actual v0.2.3 scope (aggregate-only extraction fix).

184-238: LGTM — Sample ID fix section accurately describes the implementation.

The root cause analysis, fix approach, and field priority order (case_id > sample_id > id > sequential) match the implementation exactly. The documentation of downstream consumers benefiting automatically is helpful context.

coderabbitai · 2026-04-16T13:21:21Z

+                col = df[name].dropna()
+                if len(col) > 0:
+                    return float(col.mean())
+        except Exception:  # noqa: BLE001 — defensive fallback


🧹 Nitpick | 🔵 Trivial

Minor inconsistency: Plan includes # noqa: BLE001 comment not in implementation.

Line 191 shows except Exception: # noqa: BLE001 — defensive fallback but the actual implementation at ragas_evaluator.py:88 uses just except Exception: without the noqa comment. This is cosmetic — the broad exception handling is intentional for defensive fallback.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md` at line 191, The implementation's broad exception handler (the "except Exception:" block in the evaluator) is missing the cosmetic suppression comment present in the plan; update that except Exception line to append the same comment "# noqa: BLE001 — defensive fallback" so the code matches the documentation and explicitly documents the intentional defensive fallback.

coderabbitai · 2026-04-16T13:21:21Z

 def _extract_ragas_score(result: object, name: str) -> float:
-    """Extract a metric score from a ragas result object.
+    """Extract an aggregate metric score from a ragas result object.

-    Raises ValueError if the score cannot be extracted — the caller
-    decides whether to record a SkipRecord or re-raise. No silent 0.0
-    fallback (that was the bug surfaced by Cycle 2).
+    Tries extraction strategies in order of RAGAS version likelihood:
+
+    1. ``.scores`` list (RAGAS 0.4.x) — per-sample ``MetricResult``
+       objects whose float score lives at ``.value``.
+    2. ``.to_pandas()`` (RAGAS 0.4.x fallback) — DataFrame with metric
+       names as columns and float values as cells.
+    3. ``.get()`` (RAGAS 0.2.x) — dict-like access.
+    4. ``[]`` indexing (generic).
+    5. ``getattr`` (generic).

-    ragas 0.2.x returns a dict-like result supporting ``.get()``.
-    ragas 0.4.x returns an ``EvaluationResult`` dataclass; ``__getitem__``
-    works on it but ``.get()`` does not.
-    ragas 0.3.x sits between the two with intermediate forms.
+    Raises ``ValueError`` if all strategies fail — the caller decides
+    whether to record a ``SkipRecord`` or re-raise.  No silent 0.0
+    fallback (that was the bug surfaced by Cycle 2).
    """
+    # --- Strategy 1: RAGAS 0.4.x .scores attribute ---
+    # result.scores is a list[dict[str, MetricResult | float]], one dict
+    # per sample. MetricResult wraps the float at .value.
+    scores_attr = getattr(result, "scores", None)
+    if isinstance(scores_attr, list) and scores_attr:
+        values: list[float] = []
+        for entry in scores_attr:
+            if isinstance(entry, dict) and name in entry:
+                raw = entry[name]
+                val = getattr(raw, "value", None)
+                if val is not None:
+                    with contextlib.suppress(TypeError, ValueError):
+                        values.append(float(val))
+                else:
+                    with contextlib.suppress(TypeError, ValueError):
+                        values.append(float(raw))
+        if values:
+            return sum(values) / len(values)
+
+    # --- Strategy 2: RAGAS 0.4.x .to_pandas() fallback ---
+    to_pandas = getattr(result, "to_pandas", None)
+    if callable(to_pandas):
+        try:
+            df = to_pandas()
+            if name in df.columns:
+                col = df[name].dropna()
+                if len(col) > 0:
+                    return float(col.mean())
+        except Exception:
+            pass
+
+    # --- Strategy 3: RAGAS 0.2.x dict-like .get() ---
    if hasattr(result, "get"):
        try:
            value = result.get(name, None)
            if value is not None:
                return float(value)
        except (TypeError, ValueError):
            pass
+
+    # --- Strategy 4: generic __getitem__ ---
    try:
        return float(result[name])  # type: ignore[index]
    except (KeyError, TypeError, ValueError, IndexError):
        pass
+
+    # --- Strategy 5: generic attribute access ---
    if hasattr(result, name):
        try:
            return float(getattr(result, name))
        except (TypeError, ValueError):
            pass
+
    raise ValueError(f"could not extract ragas score for metric {name!r}")


🧹 Nitpick | 🔵 Trivial

Design spec mentions _extract_per_sample_scores() helper that is not implemented.

The design spec (Section 2, lines 112-131) describes adding a _extract_per_sample_scores() function for populating sample_results with per-sample RAGAS scores. The current implementation only extracts aggregate scores (averaging in _extract_ragas_score()).

This appears intentional: the PR objective (C4-3) is to fix the ValueError, which this implementation achieves. The per-sample result population would be a follow-up enhancement. Consider either:

- Updating the design spec to mark per-sample extraction as future work, or
- Adding a TODO comment noting this planned enhancement.
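To make the deferred work concrete, one possible shape for such a helper is sketched below. Everything here is an assumption: the spec's actual signature and return type are not shown in this PR, the function name is spelled without the leading underscore to mark it as illustrative, and the metric names in the usage example are placeholders.

```python
import contextlib
from types import SimpleNamespace


def extract_per_sample_scores(result: object, metric_names: list[str]) -> list[dict[str, float]]:
    """Hypothetical sketch: one {metric: score} dict per sample, in input order."""
    scores = getattr(result, "scores", None)
    if not isinstance(scores, list):
        return []
    per_sample: list[dict[str, float]] = []
    for entry in scores:
        row: dict[str, float] = {}
        if isinstance(entry, dict):
            for name in metric_names:
                if name not in entry:
                    continue
                raw = entry[name]
                # Unparseable values are skipped rather than recorded as 0.0.
                with contextlib.suppress(TypeError, ValueError):
                    row[name] = float(getattr(raw, "value", raw))
        # Append even when empty so indices stay aligned with the samples.
        per_sample.append(row)
    return per_sample


if __name__ == "__main__":
    fake = SimpleNamespace(
        scores=[
            {"faithfulness": SimpleNamespace(value=0.9), "answer_relevancy": 0.5},
            {"faithfulness": "n/a"},
        ]
    )
    print(extract_per_sample_scores(fake, ["faithfulness", "answer_relevancy"]))
```

Keeping one row per sample, even an empty one, would let a caller zip the rows against the loaded samples and their sample_id values.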

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py` around lines 43 - 113, The design spec references a missing helper _extract_per_sample_scores() for populating sample_results, but the code currently only implements aggregate extraction in _extract_ragas_score(); add a short TODO comment near the top of this module (or immediately above _extract_ragas_score) stating that per-sample extraction via _extract_per_sample_scores() is planned as a follow-up, or update the design/spec to mark that per-sample population is deferred, so reviewers know this is intentional and not a bug.

hallengray added 9 commits April 16, 2026 13:30

docs(spec): v0.2.3 evaluator bugfixes design — RAGAS score extraction…

b6df8a1

… + sample_id propagation

docs(plan): v0.2.3 evaluator bugfixes implementation plan — 6 tasks, TDD

9ff8b29

test(evaluator): add failing tests for RAGAS 0.4.x score extraction (…

e44c05a

…C4-3)

test(evaluator): add failing tests for sample_id JSONL extraction (C4-4)

33cf90a

fix(evaluator): extract sample_id from JSONL input — case_id > sample…

0b7f8f5

…_id > id > sequential (C4-4)

test(evaluator): document empty-string case_id fallthrough behavior

c97b1dd

chore(release): bump all packages to v0.2.3

a74ad54

style(evaluator): fix ruff SIM105 and remove stale type-ignore/noqa c…

5f08605

…omments

coderabbitai Bot reviewed Apr 16, 2026

hallengray merged commit 4ab9aca into main Apr 16, 2026
3 checks passed

hallengray deleted the fix/v023-evaluator-bugfixes branch April 16, 2026 13:32
