Skip to content

fix(evaluator): RAGAS 0.4.x score extraction + sample_id propagation (v0.2.3)#42

Merged
hallengray merged 9 commits into
mainfrom
fix/v023-evaluator-bugfixes
Apr 16, 2026
Merged

fix(evaluator): RAGAS 0.4.x score extraction + sample_id propagation (v0.2.3)#42
hallengray merged 9 commits into
mainfrom
fix/v023-evaluator-bugfixes

Conversation

@hallengray

Copy link
Copy Markdown
Owner

Summary

Fixes the last two open RAG-Forge bugs identified during PearMedica Cycle 4 audit (2026-04-16):

  • C4-3 (HIGH): RAGAS score extraction ValueError — the evaluator runs all RAGAS evaluations to completion but can't read scores from the 0.4.x EvaluationResult object. Rewrites _extract_ragas_score() to handle the .scores list of MetricResult objects (with .value attribute), adds .to_pandas() fallback, preserves legacy 0.2.x/0.3.x paths.
  • C4-4 (LOW): sample_id: "(unknown)" in reports — the JSONL input loader never extracted case identifiers. Now reads case_id > sample_id > id > sequential fallback from telemetry JSONL.
  • Version bump: All 6 packages bumped to 0.2.3 (lockstep).

Three consecutive cycles of RAGAS failure (Cycles 2, 3, 4), each at a different layer. This fix addresses the final layer — result parsing.

Test plan

  • 4 new RAGAS 0.4.x extraction tests (MetricResult .value, plain floats, .to_pandas() fallback, missing metric)
  • 6 new sample_id tests (case_id, sample_id, id, priority ordering, sequential fallback, empty-string fallthrough)
  • 296 tests pass, 3 skipped (pre-existing env skips: Playwright not in dev venv)
  • ruff clean, mypy clean, pnpm build/lint/typecheck clean
  • Version drift guard confirms __version__ matches pyproject.toml across all 3 Python packages
  • Validate with PearMedica Cycle 5: re-run --evaluator ragas and confirm scores are extracted for the first time

@coderabbitai

coderabbitai Bot commented Apr 16, 2026

Copy link
Copy Markdown

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced RAGAS score extraction with additional fallback strategies
    • Improved sample ID population in JSONL input loading with prioritized field selection
  • Tests

    • Added test coverage for RAGAS score extraction variations
    • Added test coverage for sample ID extraction from JSONL inputs
  • Documentation

    • Added implementation plan for v0.2.3 evaluator improvements
    • Added design specification for evaluator changes
  • Chores

    • Version bumped to 0.2.3 across all packages

Walkthrough

This PR releases v0.2.3 with two evaluator bug fixes. It updates RAGAS score extraction to support RAGAS 0.4.x result shapes via per-sample metric extraction with fallback strategies, adds sample_id population in JSONL loading with field priority and sequential fallback, and bumps version numbers across all packages. Documentation and comprehensive tests are included.

Changes

Cohort / File(s) Summary
Documentation & Planning
docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md, docs/superpowers/specs/2026-04-16-v023-evaluator-bugfixes-design.md
Added implementation plan and design specification for v0.2.3 evaluator bug fixes covering RAGAS 0.4.x compatibility and sample_id extraction with detailed scope, implementation steps, and test coverage requirements.
Version Bumps
packages/cli/package.json, packages/core/pyproject.toml, packages/core/src/rag_forge_core/__init__.py, packages/evaluator/pyproject.toml, packages/evaluator/src/rag_forge_evaluator/__init__.py, packages/mcp/package.json, packages/observability/pyproject.toml, packages/observability/src/rag_forge_observability/__init__.py, packages/shared/package.json
Updated package versions from 0.2.2 to 0.2.3 across all package manifests and \init\.py exports.
RAGAS Score Extraction
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py, packages/evaluator/tests/test_ragas_extractor.py
Rewrote _extract_ragas_score() with five-strategy fallback chain: extract per-sample metrics from .scores list, fall back to .to_pandas() conversion with mean aggregation, then try legacy dict/index/attribute access patterns. Added tests for RAGAS 0.4.x mock shapes and fallback paths.
JSONL Sample ID Extraction
packages/evaluator/src/rag_forge_evaluator/input_loader.py, packages/evaluator/tests/test_input_loader.py
Updated load_jsonl() to populate sample_id from JSONL fields with priority (case_id > sample_id > id) or generate deterministic sample-{NNN} sequential fallback. Added tests validating field priority, fallback behavior, and empty string handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Score extraction takes a hop,
Sampling IDs never stop,
RAGAS bounces, version bumps align,
Four-point-oh compatibility—simply divine!
New test cases, fallback chains so fine. 🔄✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the two main fixes (RAGAS 0.4.x score extraction and sample_id propagation) and version bump, directly corresponding to the core changes in the PR.
Description check ✅ Passed The description is highly detailed and directly related to the changeset, explaining both high-priority bug fixes (C4-3 and C4-4), the technical approach, version bumping, test plan, and validation strategy.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/v023-evaluator-bugfixes

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md`:
- Line 191: The implementation's broad exception handler (the "except
Exception:" block in the evaluator) is missing the cosmetic suppression comment
present in the plan; update that except Exception line to append the same
comment "# noqa: BLE001 — defensive fallback" so the code matches the
documentation and explicitly documents the intentional defensive fallback.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`:
- Around line 43-113: The design spec references a missing helper
_extract_per_sample_scores() for populating sample_results, but the code
currently only implements aggregate extraction in _extract_ragas_score(); add a
short TODO comment near the top of this module (or immediately above
_extract_ragas_score) stating that per-sample extraction via
_extract_per_sample_scores() is planned as a follow-up, or update the
design/spec to mark that per-sample population is deferred, so reviewers know
this is intentional and not a bug.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d881b010-4d7c-41c1-9191-8c026ad067a9

📥 Commits

Reviewing files that changed from the base of the PR and between 8f45df5 and 5f08605.

📒 Files selected for processing (15)
  • docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md
  • docs/superpowers/specs/2026-04-16-v023-evaluator-bugfixes-design.md
  • packages/cli/package.json
  • packages/core/pyproject.toml
  • packages/core/src/rag_forge_core/__init__.py
  • packages/evaluator/pyproject.toml
  • packages/evaluator/src/rag_forge_evaluator/__init__.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
  • packages/evaluator/src/rag_forge_evaluator/input_loader.py
  • packages/evaluator/tests/test_input_loader.py
  • packages/evaluator/tests/test_ragas_extractor.py
  • packages/mcp/package.json
  • packages/observability/pyproject.toml
  • packages/observability/src/rag_forge_observability/__init__.py
  • packages/shared/package.json
📜 Review details
🔇 Additional comments (24)
packages/shared/package.json (1)

3-3: Version bump is correct and consistent.

The package version update to 0.2.3 matches the coordinated release metadata changes.

packages/mcp/package.json (1)

3-3: LGTM for release metadata update.

@rag-forge/mcp version bump to 0.2.3 is clean and aligned with the PR scope.

packages/observability/pyproject.toml (1)

3-3: Package metadata bump looks good.

rag-forge-observability version is correctly updated to 0.2.3.

packages/cli/package.json (1)

3-3: Version update is correct.

The @rag-forge/cli manifest bump to 0.2.3 is consistent with the lockstep release.

packages/evaluator/pyproject.toml (1)

3-3: Evaluator package version bump is valid.

rag-forge-evaluator is correctly set to 0.2.3.

packages/observability/src/rag_forge_observability/__init__.py (1)

5-5: __version__ update is consistent.

The exported runtime version now correctly reflects 0.2.3.

packages/evaluator/src/rag_forge_evaluator/__init__.py (1)

3-3: Runtime version bump is correct.

__version__ = "0.2.3" is aligned with package metadata.

packages/core/src/rag_forge_core/__init__.py (1)

3-3: Core version constant update looks good.

The rag_forge_core public version is correctly bumped to 0.2.3.

packages/core/pyproject.toml (1)

3-3: LGTM — Version bump aligns with lockstep release convention.

packages/evaluator/src/rag_forge_evaluator/input_loader.py (2)

40-45: LGTM — Sample ID extraction with correct priority and fallback.

The or chaining correctly treats empty strings as falsy, ensuring they fall through to the next candidate. The sequential fallback sample-{line_num:03d} provides deterministic IDs for files without identifiers.

One minor edge case: whitespace-only strings like " " would be truthy and used as the ID. Consider whether .strip() should be applied, though this is low priority if your JSONL sources don't produce such values.


47-56: LGTM — Sample construction correctly includes the extracted sample_id.

The sample_id parameter is correctly passed to EvaluationSample, which will propagate to downstream report generation (eliminating the "(unknown)" fallback).

packages/evaluator/tests/test_ragas_extractor.py (4)

96-115: LGTM — Good coverage of RAGAS 0.4.x result.scores with MetricResult objects.

The test correctly simulates the 0.4.x shape where scores is a list of per-sample dicts containing MetricResult objects with .value attributes. The averaging assertion (0.80 for [0.90, 0.70]) validates the implementation.


118-130: LGTM — Covers the plain-float edge case in result.scores.

This test ensures the extractor handles cases where metrics return plain floats instead of MetricResult objects, exercising the fallback path that attempts float(raw) directly.


133-168: LGTM — Well-designed mock for to_pandas() fallback without pandas dependency.

The duck-typed _FakeColumn and _FakeDataFrame classes correctly simulate the pandas interfaces used by the implementation (dropna(), __len__(), mean(), columns, __getitem__). The empty scores list forces the fallback path.


171-180: LGTM — Validates ValueError when metric is missing from all extraction paths.

This test ensures the extractor properly raises ValueError when the metric isn't present in scores and no to_pandas() fallback is available, matching the documented behavior of no silent 0.0 fallback.

packages/evaluator/tests/test_input_loader.py (3)

54-92: LGTM — Comprehensive tests for sample_id field priority.

The tests thoroughly cover the extraction priority order (case_id > sample_id > id) and verify that case_id takes precedence when both fields exist.


94-104: LGTM — Validates sequential fallback format.

Testing both sample-001 and sample-002 ensures the line-number-based fallback produces the expected zero-padded format.


106-120: LGTM — Critical test for empty-string fallthrough behavior.

This test validates that empty strings are treated as falsy in the or chain, correctly falling through to the next candidate (sample_id: "real-id"). This is essential for robustness when JSONL files have empty case_id fields.

packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py (3)

14-14: LGTM — Clean import for contextlib.suppress.


60-77: LGTM — Strategy 1 correctly handles RAGAS 0.4.x result.scores shape.

The implementation properly:

  • Guards against empty lists with and scores_attr
  • Extracts .value from MetricResult objects first
  • Falls back to direct float conversion for plain floats
  • Uses contextlib.suppress for defensive error handling
  • Returns the mean of collected values

79-89: LGTM — Strategy 2 provides to_pandas() DataFrame fallback.

The implementation correctly:

  • Checks callable(to_pandas) before invocation
  • Validates column existence before access
  • Guards against empty columns with len(col) > 0
  • Uses broad exception handling for defensive fallback
docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md (1)

1-11: LGTM — Clear implementation plan with TDD approach.

The plan follows a disciplined test-first workflow: failing tests committed first, then implementation, then verification. The task breakdown is logical and the verification commands are actionable.

docs/superpowers/specs/2026-04-16-v023-evaluator-bugfixes-design.md (2)

112-140: Spec-implementation gap: Per-sample extraction described but not implemented.

Section 2 ("Add per-sample RAGAS score extraction") and Section 3 ("Update RagasEvaluator.evaluate() caller logic") describe functionality for populating sample_results with per-sample RAGAS scores. This is not implemented in the current PR — the implementation only extracts aggregate scores.

Consider adding a note to the spec indicating this is planned for a future iteration, or update the spec to reflect the actual v0.2.3 scope (aggregate-only extraction fix).


184-238: LGTM — Sample ID fix section accurately describes the implementation.

The root cause analysis, fix approach, and field priority order (case_id > sample_id > id > sequential) match the implementation exactly. The documentation of downstream consumers benefiting automatically is helpful context.

col = df[name].dropna()
if len(col) > 0:
return float(col.mean())
except Exception: # noqa: BLE001 — defensive fallback

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Minor inconsistency: Plan includes # noqa: BLE001 comment not in implementation.

Line 191 shows except Exception: # noqa: BLE001 — defensive fallback but the actual implementation at ragas_evaluator.py:88 uses just except Exception: without the noqa comment. This is cosmetic — the broad exception handling is intentional for defensive fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-16-v023-evaluator-bugfixes.md` at line 191,
The implementation's broad exception handler (the "except Exception:" block in
the evaluator) is missing the cosmetic suppression comment present in the plan;
update that except Exception line to append the same comment "# noqa: BLE001 —
defensive fallback" so the code matches the documentation and explicitly
documents the intentional defensive fallback.

Comment on lines 43 to 113
def _extract_ragas_score(result: object, name: str) -> float:
"""Extract a metric score from a ragas result object.
"""Extract an aggregate metric score from a ragas result object.

Raises ValueError if the score cannot be extracted — the caller
decides whether to record a SkipRecord or re-raise. No silent 0.0
fallback (that was the bug surfaced by Cycle 2).
Tries extraction strategies in order of RAGAS version likelihood:

1. ``.scores`` list (RAGAS 0.4.x) — per-sample ``MetricResult``
objects whose float score lives at ``.value``.
2. ``.to_pandas()`` (RAGAS 0.4.x fallback) — DataFrame with metric
names as columns and float values as cells.
3. ``.get()`` (RAGAS 0.2.x) — dict-like access.
4. ``[]`` indexing (generic).
5. ``getattr`` (generic).

ragas 0.2.x returns a dict-like result supporting ``.get()``.
ragas 0.4.x returns an ``EvaluationResult`` dataclass; ``__getitem__``
works on it but ``.get()`` does not.
ragas 0.3.x sits between the two with intermediate forms.
Raises ``ValueError`` if all strategies fail — the caller decides
whether to record a ``SkipRecord`` or re-raise. No silent 0.0
fallback (that was the bug surfaced by Cycle 2).
"""
# --- Strategy 1: RAGAS 0.4.x .scores attribute ---
# result.scores is a list[dict[str, MetricResult | float]], one dict
# per sample. MetricResult wraps the float at .value.
scores_attr = getattr(result, "scores", None)
if isinstance(scores_attr, list) and scores_attr:
values: list[float] = []
for entry in scores_attr:
if isinstance(entry, dict) and name in entry:
raw = entry[name]
val = getattr(raw, "value", None)
if val is not None:
with contextlib.suppress(TypeError, ValueError):
values.append(float(val))
else:
with contextlib.suppress(TypeError, ValueError):
values.append(float(raw))
if values:
return sum(values) / len(values)

# --- Strategy 2: RAGAS 0.4.x .to_pandas() fallback ---
to_pandas = getattr(result, "to_pandas", None)
if callable(to_pandas):
try:
df = to_pandas()
if name in df.columns:
col = df[name].dropna()
if len(col) > 0:
return float(col.mean())
except Exception:
pass

# --- Strategy 3: RAGAS 0.2.x dict-like .get() ---
if hasattr(result, "get"):
try:
value = result.get(name, None)
if value is not None:
return float(value)
except (TypeError, ValueError):
pass

# --- Strategy 4: generic __getitem__ ---
try:
return float(result[name]) # type: ignore[index]
except (KeyError, TypeError, ValueError, IndexError):
pass

# --- Strategy 5: generic attribute access ---
if hasattr(result, name):
try:
return float(getattr(result, name))
except (TypeError, ValueError):
pass

raise ValueError(f"could not extract ragas score for metric {name!r}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Design spec mentions _extract_per_sample_scores() helper that is not implemented.

The design spec (Section 2, lines 112-131) describes adding a _extract_per_sample_scores() function for populating sample_results with per-sample RAGAS scores. The current implementation only extracts aggregate scores (averaging in _extract_ragas_score()).

This appears intentional — the PR objective (C4-3) is to fix the ValueError, which this implementation achieves. The per-sample result population would be a follow-up enhancement. Consider either:

  1. Updating the design spec to mark per-sample extraction as future work, or
  2. Adding a TODO comment noting this planned enhancement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py` around
lines 43 - 113, The design spec references a missing helper
_extract_per_sample_scores() for populating sample_results, but the code
currently only implements aggregate extraction in _extract_ragas_score(); add a
short TODO comment near the top of this module (or immediately above
_extract_ragas_score) stating that per-sample extraction via
_extract_per_sample_scores() is planned as a follow-up, or update the
design/spec to mark that per-sample population is deferred, so reviewers know
this is intentional and not a bug.

@hallengray hallengray merged commit 4ab9aca into main Apr 16, 2026
3 checks passed
@hallengray hallengray deleted the fix/v023-evaluator-bugfixes branch April 16, 2026 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant