fix(evaluator): route RAGAS exceptions into Skipped counter (v0.2.2 G3) by hallengray · Pull Request #38 · hallengray/rag-forge

hallengray · 2026-04-15T21:45:58Z

Summary

Closes two residual skip-handling gaps in v0.2.1's RAGAS engine that Cycle 3 documented, separate from the C3-2 generate() AttributeError (fixed in #36).

EvaluationResult.skipped_evaluations was never populated by RagasEvaluator. The integer counter the report's top-level "Skipped: N" line reads stayed at 0 even when every job had crashed. Users reading the TL;DR saw Scored: 0, Skipped: 0, success: true and assumed nothing had happened.
Individual-metric extraction failures produced one aggregate SkipRecord per metric name (sample_id="<aggregate>") instead of fanning out per (sample, metric) pair. A 12-sample × 4-metric run that failed entirely produced 4 skip records instead of 48 — the blast radius was under-reported by 12x.

Changes

Extracted skip-record creation into _fan_out_skip_records() — one record per (sample, metric) pair with the real sample_id attached. Both the whole-batch crash path and the per-metric extraction failure path now use it.
skipped_evaluations = len(skipped_samples) set on both return sites in RagasEvaluator.evaluate(). The two fields can never drift apart silently again.
Reason strings truncated to 400 chars with trailing ellipsis so long Python tracebacks don't blow up HTML/PDF rendering.
Silenced two pre-existing mypy arg-type errors on the ragas_evaluate() kwargs with targeted # type: ignore comments + a docstring note pointing readers at the adapter contract test as the real interface guard — duck typing is the deliberate design.

Tests

tests/test_ragas_skip_counter.py — new, three regression tests:
- test_skipped_evaluations_counter_matches_skip_record_length — Cycle 3's exact complaint
- test_whole_batch_crash_fans_out_to_every_sample_metric_pair — asserts 3 samples × 4 metrics = 12 skip records with real sample IDs, not 4 aggregate records
- test_skip_reason_truncated_to_400_chars — long traceback handling
tests/test_cycle2_regression.py — updated two assertions that pre-dated the fan-out change. The original assertions demanded "metrics must be populated"; with fan-out, a MockJudge run legitimately produces zero scored metrics and a full skip list. Tests now assert the absence of Cycle 2 Finding feat: Phase 2C — Evaluation Enhancements + CI/CD Gate #4 (OpenAIEmbeddings.embed_query AttributeError) and Finding feat: Phase 2D — MCP Server Wiring + Templates #5 (InstructorRetryException / finish_reason='length') signatures in the skip records instead — which is the actual regression guard.
tests/test_evaluator_factory.py — gated test_ragas_not_installed_raises_on_evaluate with a reverse importorskip so it runs only on systems without the [ragas] extra. CI matrices with ragas installed now skip it instead of failing.

Local run of the affected files: 13 passed, 1 skipped, 0 failed.

Merge order

Depends on #36 (G1). This branch is already rebased onto G1 — the diff here is G3's commit only, on top of G1's 91d10ca. Once #36 merges, GitHub will show this PR as mergeable without conflicts. Alternatively you can merge this directly after #36 since the rebase is clean.

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes. Every RAGAS metric job crashed on first contact with AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate' during the PearMedica Cycle 3 audit (2026-04-15), producing a Scored: 0, Overall: 0.0000 report across all 48 metric evaluations. Root cause: the adapter's design doc asserted "ragas only calls generate_text / agenerate_text / model_name" and deliberately avoided subclassing BaseRagasLLM to keep ragas as a soft dependency. That assumption was wrong — BaseRagasLLM.generate is a concrete async helper that ragas's metric code invokes on every LLM regardless of subclass status. The duck-typed wrapper must re-declare it. Fix preserves the no-hard-import design by adding duck-typed shims for every public method on BaseRagasLLM and BaseRagasEmbeddings: RagForgeRagasLLM: - async generate() — the specific shim Cycle 3 caught missing - is_finished() — was abstract on base, returns True - get_temperature(n) — matches base convention (0.01 / 0.3) - set_run_config() — stores ragas RunConfig for compatibility - run_config attribute — defaults to None RagForgeRagasEmbeddings: - embed_text(is_async) — dispatch helper on base class - embed_texts(is_async) — batch variant - set_run_config() - run_config attribute Tests: a contract test iterates every public method on the real ragas base classes and asserts our wrappers declare a callable of the same name, so the next ragas release that adds a method fails in CI not in a user audit. An end-to-end smoke test runs ragas.evaluate() against our wrappers on a 1-sample dataset and asserts it never raises an AttributeError naming our wrapper classes — the exact regression signature from Cycle 3. Three pre-existing test failures in test_cycle2_regression.py and test_evaluator_factory.py are unrelated to this change (pre-existing in v0.2.1 when ragas is installed; they assert on compound conditions that break when MockJudge triggers the skip path). They belong in workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.

coderabbitai · 2026-04-15T21:46:12Z

Summary by CodeRabbit

Release Notes

Enhancements
- Strengthened RAGAS evaluation framework integration and adapter compatibility
- Improved tracking and reporting of skipped evaluations with detailed per-sample error information
- Enhanced failure diagnostics for metric calculations with granular error details

Walkthrough

Adapters for ragas LLMs and embeddings now implement BaseRagas-compatible methods (async generate, temperature handling, run_config, embed helpers, result fusion). Evaluator skip handling was refactored to fan out skip records per sample×metric and to surface skipped_evaluations. Multiple contract, e2e, and regression tests were added/updated.

Changes

Cohort / File(s)	Summary
Adapter Contract Expansion `packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`	Added `RagForgeRagasLLM.generate(...)` async shim with `get_temperature`, multi-response fan-out using `asyncio.gather`, `_fuse_llm_results` to build expected `LLMResult.generations` shape, `is_finished(...)` (defaults True), `run_config` state and `set_run_config(...)`. Added `RagForgeRagasEmbeddings` async `embed_text(...)` / `embed_texts(...)`, `run_config`, and `set_run_config(...)` that delegate to `aembed_query`/`aembed_documents`.
Evaluator Skip Handling Refactor `packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`	Added `RagasEvaluator._fan_out_skip_records(samples, exc, metric_names)` to emit one `SkipRecord` per (sample, metric). Replaced aggregate skip construction with fan-out usage for whole-batch and per-metric failures, added `skipped_evaluations=len(skipped_samples)` to returned `EvaluationResult`, and annotated ragas call sites with `# type: ignore[arg-type, unused-ignore]`.
Adapter Unit & Contract Tests `packages/evaluator/tests/test_ragas_adapters.py`, `packages/evaluator/tests/test_ragas_adapters_contract.py`	Normalized LLM result extraction helper; added async tests for `RagForgeRagasLLM.generate(...)` (forwarding, temperature resolution, multi-generation fusion, error cases for `_fuse_llm_results`), and contract tests asserting parity with `BaseRagasLLM`/`BaseRagasEmbeddings` (method presence, async/sync parity, required params, `is_finished`, `get_temperature`, `set_run_config` behavior).
E2E, Regression & Integration Tests `packages/evaluator/tests/test_ragas_adapters_e2e.py`, `packages/evaluator/tests/test_cycle2_regression.py`, `packages/evaluator/tests/test_ragas_skip_counter.py`, `packages/evaluator/tests/test_evaluator_factory.py`	Added e2e smoke test using deterministic in-process judge stub; updated cycle2 regression tests to assert on `result.skipped_samples` patterns; added skip-counter integration tests to validate fan-out semantics, `skipped_evaluations` parity, and truncated skip reasons; adjusted evaluator factory test to skip when `ragas` is available.
Test Harness Adjustments `packages/evaluator/tests/*`	Multiple new and adjusted tests, helpers, and stubs to exercise new adapter methods and evaluator skip semantics; added import guards (`pytest.importorskip`) where appropriate.

Sequence Diagram(s)

sequenceDiagram
    participant C as Client/Test
    participant E as RagasEvaluator
    participant R as ragas.evaluate
    participant F as _fan_out_skip_records

    C->>E: evaluate(llm, embeddings, samples, metrics)
    E->>R: invoke ragas.evaluate(...)  # type: ignore[arg-type, unused-ignore]
    alt ragas.evaluate succeeds
        R-->>E: raw results
        E->>E: extract per-metric scores
        alt extraction raises per-metric exception
            E->>F: _fan_out_skip_records(samples, exc, [metric])
            F-->>E: list of SkipRecord (one per sample×metric)
            E-->>C: EvaluationResult(skipped_evaluations=len(skips), skipped_samples=skips)
        else extraction succeeds
            E-->>C: EvaluationResult(skipped_evaluations=0, scores=...)
        end
    else ragas.evaluate raises
        R-->>E: exception
        E->>F: _fan_out_skip_records(samples, exc, metric_names)
        F-->>E: list of SkipRecord (one per sample×metric)
        E-->>C: EvaluationResult(skipped_evaluations=len(skips), skipped_samples=skips)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped in to add async shims and more,

I fused the outputs, counted skips by score.
Run-configs held tight, temperatures set true,
Tests now guard parity — I cheered, then ate stew.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 59.42% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: fixing RAGAS exception routing into the Skipped counter, with version notation and cycle identifier.
Description check	✅ Passed	The description comprehensively explains the skip-handling gaps fixed, the specific changes made, new tests added, and dependency information.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/v0.2.2-g3-ragas-skip-counter

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 442-462: The current use of asyncio.run(...) in embed_text and
embed_texts will raise in a running event loop; instead, when is_async is True
delegate to a thread-safe call using asyncio.to_thread so the sync embedding
methods are executed off the loop. Update embed_text to call
asyncio.to_thread(self.embed_query, text) (matching the aembed_query pattern)
and update embed_texts to call asyncio.to_thread(self.embed_documents, texts)
when is_async is True; keep the existing direct returns for the non-async branch
and preserve function names embed_text, embed_texts, aembed_query,
aembed_documents.

In `@packages/evaluator/tests/test_ragas_skip_counter.py`:
- Around line 96-111: Remove the dead copy of mod.__dict__ by deleting the
unused real_import assignment (real_import = mod.__dict__.copy()) and the later
no-op assignment (_ = real_import), or if the intent was to restore module
state, replace the no-op with restoring mod.__dict__ from real_import; ensure
the patching block still saves original ragas.evaluate to original_evaluate,
assigns the failure stub _boom to ragas.evaluate, calls
evaluator.evaluate(samples), and finally restores ragas.evaluate from
original_evaluate (and optionally restores mod.__dict__ if you choose the
restore approach).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9a51db37-1a61-44c3-9475-77387c6aa260

📥 Commits

Reviewing files that changed from the base of the PR and between 18947f4 and 16664c2.

📒 Files selected for processing (8)

packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
packages/evaluator/tests/test_cycle2_regression.py
packages/evaluator/tests/test_evaluator_factory.py
packages/evaluator/tests/test_ragas_adapters.py
packages/evaluator/tests/test_ragas_adapters_contract.py
packages/evaluator/tests/test_ragas_adapters_e2e.py
packages/evaluator/tests/test_ragas_skip_counter.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: RAG Quality Gate
GitHub Check: Lint, Typecheck & Test

🔇 Additional comments (22)

packages/evaluator/tests/test_evaluator_factory.py (1)

26-37: LGTM — conditional skip logic is correct.

The try/except pattern properly detects when ragas is installed and skips the test accordingly. This prevents false failures in CI matrices that include the [ragas] extra while preserving the test's purpose on systems without it.

packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py (5)

145-156: LGTM — well-documented type suppression.

The docstring clearly explains why # type: ignore[arg-type] is necessary (duck typing to avoid hard ragas dependency) and points to the contract test as the real guard against interface drift.

164-174: LGTM — whole-batch crash path correctly populates both skip fields.

The fan-out now produces one SkipRecord per (sample, metric) pair, and skipped_evaluations is set to match len(skipped_samples), fixing the counter drift issue from Cycle 3.

181-192: LGTM — per-metric failure path also uses fan-out.

Correctly fans out per-metric extraction failures across all samples, ensuring the Skipped counter reflects the true blast radius rather than just one aggregate record per metric.

223-256: LGTM — _fan_out_skip_records implementation is correct.

The method properly:

Truncates reason to 400 chars with trailing "..." (397 + 3 = 400)

Falls back to sample.query[:40] when sample_id is not set

Iterates the Cartesian product of samples × metric_names

Captures exception_type from the exception class name

214-221: LGTM — normal completion path also sets skipped_evaluations.

Both return sites (whole-batch crash at line 173 and normal completion here) now consistently set skipped_evaluations = len(skipped_samples), ensuring the counter never drifts from the detail list.

packages/evaluator/tests/test_ragas_skip_counter.py (3)

49-71: LGTM — counter consistency test is well-structured.

The test validates the core invariant that skipped_evaluations must equal len(skipped_samples), with a clear assertion message explaining the expected lockstep relationship.

113-136: LGTM — fan-out assertions are thorough.

The test correctly validates:

Expected count: len(samples) * 4 (3 samples × 4 metrics = 12)

Real sample IDs (not "<aggregate>")

exception_type matches the injected exception

reason contains the expected message

139-173: LGTM — truncation test validates the 400-char limit.

Correctly injects a 5000-char exception message and verifies each skip record's reason is ≤400 chars and ends with "...".

packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py (3)

10-54: LGTM — comprehensive contract documentation.

The expanded docstring thoroughly documents the ragas 0.4.x base class surface, including the historical context of the v0.2.1 AttributeError regression. This is valuable reference material for future maintainers.

213-256: LGTM — generate() shim correctly mirrors base class behavior.

The implementation:

Resolves temperature=None via get_temperature(n)

Delegates to agenerate_text()

Validates completion via is_finished()

Intentionally skips double-retry wrapping (documented rationale)

258-292: LGTM — helper shims match ragas conventions.

is_finished() conservatively returns True (documented rationale)

get_temperature() follows ragas's n-based convention (0.01 for n≤1, 0.3 otherwise)

set_run_config() stores the value for contract compatibility

packages/evaluator/tests/test_ragas_adapters.py (2)

15-26: LGTM — helper function handles both environments.

_llm_result_text() correctly extracts text from either a real LLMResult (when langchain is installed) or the _StringLLMResult stub, making tests work in both CI configurations.

71-92: LGTM — new async tests cover the v0.2.2 shims.

Both tests properly exercise the generate() method:

test_wrapper_async_generate_forwards_to_agenerate_text validates the async shim exists and forwards to the judge

test_wrapper_async_generate_resolves_default_temperature_when_none validates temperature=None doesn't cause type errors
packages/evaluator/tests/test_cycle2_regression.py (3)
131-161: LGTM — test updated to validate via skip record inspection.

The assertion now correctly checks that no skip record contains the Finding #4 signature (OpenAIEmbeddings or embed_query AttributeError) rather than demanding metrics be produced, aligning with the fan-out behavior.

196-211: LGTM — Finding #5 regression guard updated correctly.

The test now validates that skip records don't contain InstructorRetryException or finish_reason='length' signatures, properly guarding against the max_tokens overflow regression without requiring scored metrics.

216-221: The MetricResult dataclass includes the skipped field at line 52 of packages/evaluator/src/rag_forge_evaluator/engine.py. The code at lines 216-221 is correct and will not raise an AttributeError.
			> Likely an incorrect or invalid review comment.
packages/evaluator/tests/test_ragas_adapters_contract.py (3)

1-53: LGTM — excellent contract test foundation.

The file provides a strong regression guard against future ragas releases by:

Reflecting over the base classes to enumerate public methods

Asserting the wrappers declare matching callables

Clear failure messages listing missing shims

This is exactly the "tripwire" that would have caught the v0.2.1 generate() regression.

67-98: LGTM — comprehensive method coverage assertions.

The tests verify that RagForgeRagasLLM and RagForgeRagasEmbeddings implement every public method from their respective base classes, with actionable failure messages.

101-137: LGTM — targeted shim behavior tests.

Individual tests verify:

generate is async and callable

is_finished returns True by default

get_temperature follows the n-based convention

set_run_config stores the value on both wrappers

packages/evaluator/tests/test_ragas_adapters_e2e.py (2)

26-59: LGTM — well-designed stub for smoke testing.

_DeterministicJudge returns valid JSON that ragas can parse in both statement-extraction and verdict stages. The canned responses are sufficient to prove wrapper dispatch works without requiring schema-perfect output.

74-151: LGTM — focused e2e smoke test with clear purpose.

The test correctly:

Documents what it proves (wrapper dispatch) vs. what it doesn't (metric quality)

Uses raise_exceptions=True to surface errors

Distinguishes wrapper AttributeError (fail) from other exceptions (acceptable)

Would have caught the v0.2.1 generate() regression before release

coderabbitai · 2026-04-15T21:50:34Z

+    real_import = mod.__dict__.copy()
+
+    def _boom(*args: object, **kwargs: object) -> None:
+        _ = args, kwargs
+        raise RuntimeError("synthetic whole-batch ragas failure")
+
+    # Patch ragas_evaluate inside the deferred import block.
+    import ragas
+
+    original_evaluate = ragas.evaluate
+    ragas.evaluate = _boom  # type: ignore[assignment]
+    try:
+        result = evaluator.evaluate(samples)
+    finally:
+        ragas.evaluate = original_evaluate  # type: ignore[assignment]
+        _ = real_import


🧹 Nitpick | 🔵 Trivial

Dead code: real_import is copied but never used.

Line 96 copies mod.__dict__ but line 111 only assigns it to _ without restoring anything. This appears to be leftover from an earlier approach.

🧹 Proposed fix to remove unused code

- import rag_forge_evaluator.engines.ragas_evaluator as mod - - real_import = mod.__dict__.copy() - def _boom(*args: object, **kwargs: object) -> None: _ = args, kwargs raise RuntimeError("synthetic whole-batch ragas failure") - # Patch ragas_evaluate inside the deferred import block. import ragas original_evaluate = ragas.evaluate ragas.evaluate = _boom # type: ignore[assignment] try: result = evaluator.evaluate(samples) finally: ragas.evaluate = original_evaluate # type: ignore[assignment] - _ = real_import

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/tests/test_ragas_skip_counter.py` around lines 96 - 111, Remove the dead copy of mod.__dict__ by deleting the unused real_import assignment (real_import = mod.__dict__.copy()) and the later no-op assignment (_ = real_import), or if the intent was to restore module state, replace the no-op with restoring mod.__dict__ from real_import; ensure the patching block still saves original ragas.evaluate to original_evaluate, assigns the failure stub _boom to ragas.evaluate, calls evaluator.evaluate(samples), and finally restores ragas.evaluate from original_evaluate (and optionally restores mod.__dict__ if you choose the restore approach).

Five fixes from CodeRabbit review on PR #36. All were real issues the original contract test missed because it only checked method names, not signatures. ### generate(n > 1) now produces the correct LLMResult shape ragas uses n > 1 for multi-sample metrics (answer_correctness consistency checks, etc.) and expects LLMResult.generations shaped [[gen1, gen2, ..., genN]] — one prompt run with N candidate generations. The original shim ignored n and returned a single generation, silently breaking any metric that relied on sample diversity. Fix: when n > 1, fan out n independent agenerate_text calls via asyncio.gather and fuse the per-call results into a single [[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper. n == 1 (the common case) stays a single call. New test: test_wrapper_async_generate_n_greater_than_one_produces_ fused_shape uses a counting judge to verify all N calls fire and the fused result carries all N distinct outputs. ### embed_text / embed_texts now async (crash fix) Critical bug: the original shims were sync methods wrapping asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings declares embed_text and embed_texts as async coroutines and metric code invokes them with await embeddings.embed_text(...) from inside ragas's evaluation event loop. asyncio.run() inside a running loop crashes with RuntimeError: asyncio.run() cannot be called from a running event loop — a real live-fire crash that would have taken down any Cycle 4 run. Fix: make both methods async. They now just await the existing aembed_query / aembed_documents paths. The is_async parameter is accepted for signature parity with the base class but ignored — our underlying clients are synchronous and aembed_query already runs them in a worker thread via asyncio.to_thread, so both flag values land on the same code path. New tests: - test_embeddings_embed_text_is_async_and_awaitable: asserts both methods are inspect.iscoroutinefunction and round-trips through asyncio.run(embed.embed_text(...)). - test_embeddings_embed_text_callable_from_running_event_loop: exercises the exact asyncio.run-inside-loop path ragas creates. ### Contract test now checks async/sync parity, not just names CodeRabbit correctly pointed out that the original contract test compared method names but not their async/sync shape — exactly why the sync embed_text / embed_texts slipped past review. Fix: two new tests (test_llm_wrapper_async_signature_matches_base and test_embeddings_wrapper_async_signature_matches_base) iterate every public method on the real ragas base classes and assert inspect.iscoroutinefunction matches on both sides. A future release that adds a method in one shape or the other will fail in CI. ### generate() default temperature: None -> 0.01 BaseRagasLLM.generate's default is 0.01. Our shim had None, which is functionally equivalent (None triggers get_temperature()) but diverges from the base signature. Changed to 0.01 to match. Tests that exercise the None fallback still pass None explicitly. ### Tightened existing temperature-fallback test The original test asserted "doesn't raise" but would have passed even if get_temperature() was never called. Now patches llm.get_temperature with wraps= and asserts assert_called_once_with(1). Added inverse test that patches get_temperature and asserts it was NOT called when an explicit temperature is passed — guards against over-eager fallback firing. ### Narrowed e2e smoke test exception handler Original except Exception accepted any non-AttributeError as a pass. A bad shim raising TypeError or a nested-loop bug raising RuntimeError would have counted as success. Tightened to allow only RagasOutputParserException (expected downstream parser failure from the stub judge's canned JSON); everything else propagates.

CodeRabbit on PR #38 spotted the leftover real_import = mod.__dict__.copy() # line 96 _ = real_import # line 111 in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The dict was copied but never used — an artefact of an earlier monkeypatching approach that got simplified but not cleaned up. Removing it and the unnecessary module-level import that fed it. No behavioural change; test still passes.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 119-148: The fallback branch in _fuse_llm_results currently
returns results[0] on import/attribute/index errors, dropping generations;
instead, catch the exception and manually concatenate all inner generations from
each result (e.g., fused_generations = [gen for r in results for gen in
r.generations[0]]), then return the no-langchain stub _StringLLMResult with
generations=[fused_generations]; update the except block to build
fused_generations and return _StringLLMResult(generations=[fused_generations])
rather than results[0].

In `@packages/evaluator/tests/test_ragas_adapters_contract.py`:
- Around line 46-52: Update the contract tests to validate parameter
compatibility using inspect.signature: for each method discovered by
_public_methods on the ragas base class and its wrapper, call
inspect.signature() on both and assert the wrapper's signature can accept the
base signature (i.e., every parameter in the base method exists in the wrapper
with the same kind and default behavior, allowing the wrapper to also include
extra optional parameters or **kwargs/*args), and keep the existing async/sync
parity checks; implement this check where the tests currently compare method
names and async/sync parity so any change in a base method's parameters (e.g.,
added required args) will fail the test.

In `@packages/evaluator/tests/test_ragas_adapters_e2e.py`:
- Around line 133-164: The test's allowed_exception_types tuple only includes
RagasOutputParserException but the docstring says downstream parser/schema
failures can be RagasOutputParserException or any RagasException subclass;
update the try/except import block to also import RagasException (e.g., from
ragas.exceptions import RagasOutputParserException, RagasException) and include
RagasException in allowed_exception_types so the except allowed_exception_types:
branch catches both; check symbols allowed_exception_types,
RagasOutputParserException, and RagasException in the test and ensure the
behavior around evaluate(...) and the wrapper names remains unchanged.

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 128-150: Replace the racy dict counter with a thread-safe counter
inside CountingJudge: make CountingJudge hold an integer attribute (e.g., self.i
= 0) and a threading.Lock (e.g., self.lock = threading.Lock()), and in
CountingJudge.judge() acquire the lock, increment self.i, then release the lock
before returning the sample string; construct the judge instance (e.g.,
counting_judge = CountingJudge()) and pass it to RagForgeRagasLLM, and update
the assertion to check counting_judge.i == 3 instead of counter["i"] to ensure
deterministic, thread-safe counting.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1bf5d064-d222-438f-a5a1-08d700d52402

📥 Commits

Reviewing files that changed from the base of the PR and between 16664c2 and 9c2f457.

📒 Files selected for processing (8)

packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
packages/evaluator/tests/test_cycle2_regression.py
packages/evaluator/tests/test_evaluator_factory.py
packages/evaluator/tests/test_ragas_adapters.py
packages/evaluator/tests/test_ragas_adapters_contract.py
packages/evaluator/tests/test_ragas_adapters_e2e.py
packages/evaluator/tests/test_ragas_skip_counter.py

coderabbitai · 2026-04-15T22:23:40Z

+    allowed_exception_types: tuple[type[BaseException], ...]
+    try:
+        from ragas.exceptions import RagasOutputParserException  # type: ignore[attr-defined]
+
+        allowed_exception_types = (RagasOutputParserException,)
+    except ImportError:
+        allowed_exception_types = ()
+
+    try:
+        evaluate(
+            dataset=dataset,
+            metrics=[faithfulness],
+            llm=llm,
+            embeddings=embeddings,
+            raise_exceptions=True,
+            show_progress=False,
+        )
+    except AttributeError as exc:
+        wrapper_names = ("RagForgeRagasLLM", "RagForgeRagasEmbeddings")
+        if any(name in str(exc) for name in wrapper_names):
+            pytest.fail(
+                f"ragas called a method on our wrapper that doesn't "
+                f"exist: {exc!r}. Add the missing shim in "
+                f"ragas_adapters.py and update the contract test."
+            )
+        raise
+    except allowed_exception_types:
+        # Parser / schema failure downstream of our wrappers. The
+        # wrapper dispatch worked; ragas is now doing its own thing
+        # with a stub judge that returns JSON not matching its
+        # internal pydantic schemas. Treat as success.
+        pass


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

ragas library exception types RagasException RagasOutputParserException

💡 Result:

The Ragas library (Python framework for LLM evaluation) defines custom exception types in src/ragas/exceptions.py. RagasException is the base exception class for all Ragas-specific errors. RagasOutputParserException inherits from RagasException and is raised when the output parser fails to parse LLM output, even after retries. Message: "The output parser failed to parse the output including retries." Other notable RagasException subclasses: - ExceptionInRunner: Raised when an exception occurs in the executor/runner thread. - LLMDidNotFinishException: Raised when LLM generation is incomplete (e.g., max tokens reached). Additionally, there is a parallel hierarchy under RagasError (direct Exception subclass, migrated from experimental module): - ValidationError - DuplicateError - NotFoundError and subclasses (ResourceNotFoundError, ProjectNotFoundError, DatasetNotFoundError, ExperimentNotFoundError) - Various Duplicate*Error subclasses (e.g., DuplicateResourceError) Official docs primarily cover experimental exceptions under RagasError at https://exp.docs.ragas.io/exceptions.html, but core RagasException types are in the source code. Recent changes (PR #2219, Aug 2025) moved exceptions to the main package.

Citations:

1: https://github.com/explodinggradients/ragas/blob/298b6827/src/ragas/exceptions.py

2: https://exp.docs.ragas.io/exceptions.html

3: Move Utils / Exceptions module to main package and update imports vibrantlabsai/ragas#2219

🏁 Script executed:

# Examine the test file and ragas_adapters.py for context fd -t f "test_ragas_adapters_e2e.py" -o -type f "ragas_adapters.py" | head -5

Repository: hallengray/rag-forge

Length of output: 180

🏁 Script executed:

# Search for ragas_adapters.py to understand how the wrappers are implemented rg "ragas_adapters" --type py -l

Repository: hallengray/rag-forge

Length of output: 359

🏁 Script executed:

# Look at the test file to understand the full context find . -name "test_ragas_adapters_e2e.py" -type f -exec wc -l {} \; | head -1

Repository: hallengray/rag-forge

Length of output: 121

🏁 Script executed:

# Read the test file to understand the context cat -n packages/evaluator/tests/test_ragas_adapters_e2e.py | sed -n '120,164p'

Repository: hallengray/rag-forge

Length of output: 2357

🏁 Script executed:

# Check if there's more context before line 133 cat -n packages/evaluator/tests/test_ragas_adapters_e2e.py | sed -n '100,145p'

Repository: hallengray/rag-forge

Length of output: 2151

🏁 Script executed:

# Examine ragas_adapters.py to understand the wrapper implementation wc -l packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py

Repository: hallengray/rag-forge

Length of output: 136

🏁 Script executed:

# Look at imports and exception handling in ragas_adapters.py head -50 packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py

Repository: hallengray/rag-forge

Length of output: 2656

Update exception allowlist to match documented intent.

The docstring (lines 122–123) explicitly states that downstream parser/schema failures raise "either RagasOutputParserException or a RagasException subclass," but the code only accepts RagasOutputParserException. This creates a version-sensitivity risk: if ragas wraps the same parser failure as a broader RagasException in a newer version, this test will fail despite the wrapper contract succeeding.

Add RagasException to the allowed exceptions tuple to align the implementation with the documented behavior:

Proposed fix

- allowed_exception_types: tuple[type[BaseException], ...] + allowed_exception_types: tuple[type[BaseException], ...] = () try: - from ragas.exceptions import RagasOutputParserException # type: ignore[attr-defined] - - allowed_exception_types = (RagasOutputParserException,) + from ragas.exceptions import ( # type: ignore[attr-defined] + RagasException, + RagasOutputParserException, + ) + allowed_exception_types = ( + RagasOutputParserException, + RagasException, + ) except ImportError: - allowed_exception_types = () + pass

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/tests/test_ragas_adapters_e2e.py` around lines 133 - 164, The test's allowed_exception_types tuple only includes RagasOutputParserException but the docstring says downstream parser/schema failures can be RagasOutputParserException or any RagasException subclass; update the try/except import block to also import RagasException (e.g., from ragas.exceptions import RagasOutputParserException, RagasException) and include RagasException in allowed_exception_types so the except allowed_exception_types: branch catches both; check symbols allowed_exception_types, RagasOutputParserException, and RagasException in the test and ensure the behavior around evaluate(...) and the wrapper names remains unchanged.

The n>1 fan-out fix in the previous commit built a fused LLMResult via langchain_core.outputs.LLMResult. On CI without the [ragas] extra installed, langchain isn't available and generate_text falls back to the _StringLLMResult stub — which _fuse_llm_results had no fuse path for, so it hit the ImportError fallback and returned results[0] (a single generation). The n>1 test then asserted inner length == 3 and failed with 1. Fix: separate the flatten step (build the fused_generations list) from the wrap step (pick LLMResult or _StringLLMResult). The flatten works on either shape because both expose the same .generations[0] interface. Wrap picks langchain if available, otherwise constructs via a new _StringLLMResult._from_generations alt constructor that carries a pre-fused list. Local verification: - without [ragas]: 18/18 adapter tests pass (contract + e2e skip) - with [ragas]: 28/28 adapter tests pass

CodeRabbit on PR #38 spotted the leftover real_import = mod.__dict__.copy() # line 96 _ = real_import # line 111 in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The dict was copied but never used — an artefact of an earlier monkeypatching approach that got simplified but not cleaned up. Removing it and the unnecessary module-level import that fed it. No behavioural change; test still passes.

Three more findings from the second CodeRabbit review on PR #36 / #38. ### _fuse_llm_results now fails loud on malformed input The previous commit's AttributeError/IndexError fallback silently returned results[0] — turning a requested n>1 generation into a single-sample result with no signal. Downstream ragas metrics that rely on sample diversity would consume the degraded result as if it were a valid N-sample fuse. CodeRabbit rightly flagged this as hiding real correctness bugs. Fix: raise ValueError with the observed result types and the underlying exception chained. Added a targeted test (test_fuse_llm_results_raises_on_malformed_input) covering three malformed-input shapes: objects without .generations, empty outer lists, and empty input lists. ### CountingJudge in n>1 test is now thread-safe RagForgeRagasLLM.generate fans out via asyncio.gather, and each agenerate_text call runs in a worker thread via asyncio.to_thread. counter["i"] += 1 is not atomic across threads — the read-modify- write race would produce duplicate sample labels or an undercounted total, flaking the test. Added a threading.Lock around the increment. Test now deterministic under concurrent fan-out. ### Contract tests now check parameter parity, not just name + async Previously we asserted that each public method on BaseRagasLLM / BaseRagasEmbeddings exists on our wrapper and matches async/sync. That missed parameter drift — a future ragas release that adds max_tokens to generate_text would silently break our wrapper until a user audit caught it. Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class and test_embeddings_wrapper_parameter_names_cover_base_class) use inspect.signature to enumerate the named parameters on each base- class method and assert the wrapper accepts every name. Variadic *args / **kwargs are ignored (they can absorb any kwarg by definition). A helper _required_param_names() shares the logic. Local: 30 passed on adapter + contract + e2e suites with ragas installed, 18 passed on adapter suite alone without the [ragas] extra. Ruff + mypy clean.

Cycle 3's PearMedica audit (2026-04-15) documented two residual skip- handling gaps in v0.2.1, separate from the C3-2 generate() AttributeError: 1. EvaluationResult.skipped_evaluations (the integer counter the report's TL;DR "Skipped: N" line reads) was never set by RagasEvaluator. skipped_samples held real SkipRecords, but the counter stayed at 0. Users reading the top-level summary saw "Scored: 0, Skipped: 0" and thought nothing had happened when in reality every job had crashed. 2. Individual-metric extraction failures created one SkipRecord per metric name with sample_id="<aggregate>". A 12-sample x 4-metric run that failed entirely produced 4 records instead of 48 — the blast radius was under-reported by 12x. Fix: - Extract skip-record creation into _fan_out_skip_records() — fans a single coarse failure out into one record per (sample, metric) pair with the real sample_id attached. - Both the whole-batch crash path and the per-metric extraction failure path now use it. - Set EvaluationResult.skipped_evaluations = len(skipped_samples) on both return sites so the counter and the detail list can never drift apart silently again. - Truncate reason strings to 400 chars (with trailing ellipsis) so long Python tracebacks don't blow up HTML/PDF rendering downstream. Tests: - test_ragas_skip_counter.py — three new tests covering: (a) skipped_evaluations counter equals len(skipped_samples), (b) whole-batch crashes fan out to sample_count * metric_count records with real sample_ids (not "<aggregate>"), (c) reason truncation at 400 chars. - test_cycle2_regression.py — updated two assertions that pre-dated the fan-out change. The original assertions demanded "metrics must be populated"; with fan-out, a MockJudge run legitimately produces zero scored metrics and a full skip list. Tests now assert the absence of Finding #4 (embed_query AttributeError) and Finding #5 (InstructorRetryException / max_tokens truncation) signatures in the skip records instead, which is the actual regression guard. - test_evaluator_factory.py — gate test_ragas_not_installed_raises_on_evaluate with a reverse importorskip so it runs only on systems without the [ragas] extra. CI matrices with ragas installed now skip it instead of failing. Pre-existing mypy arg-type errors on the ragas_evaluate() kwargs silenced with targeted # type: ignore comments and a docstring note pointing readers at the adapter contract test as the real interface guard — duck typing is the deliberate design, not an oversight.

…ores The # type: ignore[arg-type] comments on the ragas_evaluate() kwargs only fire when mypy can see ragas's real type signatures — i.e. when the [ragas] extra is installed (local dev). On CI without the extra, mypy falls back to Any and flags the ignores as [unused-ignore]. Adding unused-ignore to the ignore codes tells mypy to tolerate the comment when it has nothing to suppress. Both environments are now clean: local mypy with the extra still sees the arg-type mismatch and honours the suppression; CI mypy without the extra silently accepts the comment as dead.

CodeRabbit on PR #38 spotted the leftover real_import = mod.__dict__.copy() # line 96 _ = real_import # line 111 in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The dict was copied but never used — an artefact of an earlier monkeypatching approach that got simplified but not cleaned up. Removing it and the unnecessary module-level import that fed it. No behavioural change; test still passes.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py (1)
179-200: ⚠️ Potential issue | 🟠 Major

Add non-finite score validation to _extract_ragas_score or its caller to ensure failed metrics are tracked as skips.

The _extract_ragas_score() function accepts float("nan") and infinity values without raising ValueError. If ragas returns non-finite scores, the code will emit MetricResult(score=nan) instead of entering the skip path, leaving skipped_evaluations wrong for the per-metric failure mode this PR addresses.

Either validate in _extract_ragas_score() itself, or add a check immediately after extraction (before MetricResult creation) to reject non-finite values:
Suggested fix
         for metric_name in _METRIC_NAMES:
             try:
                 score = _extract_ragas_score(result, metric_name)
+                if score != score or score in (float("inf"), float("-inf")):
+                    raise ValueError(
+                        f"non-finite ragas score for metric {metric_name!r}: {score!r}"
+                    )
             except ValueError as exc:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py` around
lines 179 - 200, After extracting score via _extract_ragas_score, ensure
non-finite values are treated as failures: validate that the returned score is a
finite number (not NaN, +Inf, -Inf) and if it is non-finite raise a ValueError
(or call the same skip path) so control flows into the existing except handler
that calls self._fan_out_skip_records; alternatively implement this validation
inside _extract_ragas_score itself so any non-finite returns cause a ValueError
and the MetricResult creation (and appended aggregated record with
MetricResult(name=..., score=...)) only ever receives finite scores.

♻️ Duplicate comments (1)

packages/evaluator/tests/test_ragas_adapters_contract.py (1)
149-215: ⚠️ Potential issue | 🟠 Major

The contract test still misses breaking signature drift.

This only compares parameter names. A ragas change from optional → required, positional-only → keyword-only, or a kind/default change will still pass here and fail at runtime. _required_param_names() also strips **kwargs, so a wrapper that is intentionally future-proofed with a keyword catch-all would be flagged as incompatible even though it could accept the new API surface. Please compare full inspect.Signature compatibility instead of set subtraction.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters_contract.py` around lines 149 -
215, The tests currently use _required_param_names to compare only parameter
names, which misses changes in parameter kinds, defaults, or
required-vs-optional and incorrectly flags wrappers that accept **kwargs; update
the two tests (test_llm_wrapper_parameter_names_cover_base_class and
test_embeddings_wrapper_parameter_names_cover_base_class) to fetch full
inspect.signature(...) for each method on ragas_base_* and the corresponding
RagForge wrapper, then verify compatibility by ensuring the wrapper's Signature
can accept calls valid for the base Signature (e.g., use
Signature.bind_partial/try-except or compare each Parameter's .kind and .default
and treat a wrapper with a VAR_KEYWORD parameter as compatible for extra
keywords); replace use of _required_param_names or change it to return full
Signatures and perform the compatibility checks so changes from
optional→required, positional-only→keyword-only, and default changes are
detected and **kwargs are handled as allowed compatibility.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/tests/test_ragas_skip_counter.py`:
- Around line 74-168: Add a new test that mirrors
test_whole_batch_crash_fans_out_to_every_sample_metric_pair but forces the
per-metric extraction branch in RagasEvaluator.evaluate to raise an exception
(patch the function RagasEvaluator calls for per-metric extraction — e.g., the
ragas.extract_metric helper or the per-metric code path that
RagasEvaluator.evaluate invokes — to raise a synthetic exception), then run
evaluator.evaluate(samples) and assert that result.skipped_samples and
result.skipped_evaluations equal len(samples) * <number of metrics in
RagasEvaluator>, that every skip.sample_id is one of the real sample IDs, and
that each skip.exception_type and skip.reason reflect the synthetic per-metric
error (reason truncated/contains message as in the other tests).

---

Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`:
- Around line 179-200: After extracting score via _extract_ragas_score, ensure
non-finite values are treated as failures: validate that the returned score is a
finite number (not NaN, +Inf, -Inf) and if it is non-finite raise a ValueError
(or call the same skip path) so control flows into the existing except handler
that calls self._fan_out_skip_records; alternatively implement this validation
inside _extract_ragas_score itself so any non-finite returns cause a ValueError
and the MetricResult creation (and appended aggregated record with
MetricResult(name=..., score=...)) only ever receives finite scores.

---

Duplicate comments:
In `@packages/evaluator/tests/test_ragas_adapters_contract.py`:
- Around line 149-215: The tests currently use _required_param_names to compare
only parameter names, which misses changes in parameter kinds, defaults, or
required-vs-optional and incorrectly flags wrappers that accept **kwargs; update
the two tests (test_llm_wrapper_parameter_names_cover_base_class and
test_embeddings_wrapper_parameter_names_cover_base_class) to fetch full
inspect.signature(...) for each method on ragas_base_* and the corresponding
RagForge wrapper, then verify compatibility by ensuring the wrapper's Signature
can accept calls valid for the base Signature (e.g., use
Signature.bind_partial/try-except or compare each Parameter's .kind and .default
and treat a wrapper with a VAR_KEYWORD parameter as compatible for extra
keywords); replace use of _required_param_names or change it to return full
Signatures and perform the compatibility checks so changes from
optional→required, positional-only→keyword-only, and default changes are
detected and **kwargs are handled as allowed compatibility.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ad5389ad-696d-47b7-b744-13c64096c8c2

📥 Commits

Reviewing files that changed from the base of the PR and between 9c2f457 and 13dbf26.

📒 Files selected for processing (7)

packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
packages/evaluator/tests/test_cycle2_regression.py
packages/evaluator/tests/test_evaluator_factory.py
packages/evaluator/tests/test_ragas_adapters.py
packages/evaluator/tests/test_ragas_adapters_contract.py
packages/evaluator/tests/test_ragas_skip_counter.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: RAG Quality Gate
GitHub Check: Lint, Typecheck & Test

coderabbitai · 2026-04-15T22:52:07Z

+def test_whole_batch_crash_fans_out_to_every_sample_metric_pair() -> None:
+    """A whole-batch ragas crash must produce one SkipRecord per
+    (sample, metric) pair, not one per metric name. Cycle 3 reported
+    4 aggregate skip records for a run that should have had 48.
+    """
+    pytest.importorskip("ragas")
+
+    # Construct an evaluator that will trigger the whole-batch catch
+    # path: we patch the ragas import at call time to force an
+    # exception before ragas_evaluate() ever runs.
+    evaluator = RagasEvaluator(
+        judge=MockJudge(),
+        thresholds={},
+        embeddings_provider="mock",
+    )
+    samples = _make_samples(3)
+
+    # Inject a synthetic whole-batch failure by patching ragas.evaluate.
+    # This is the exact code path the Cycle 3 AttributeError took before
+    # v0.2.2 G1 fixed the generate() shim.
+    def _boom(*args: object, **kwargs: object) -> None:
+        _ = args, kwargs
+        raise RuntimeError("synthetic whole-batch ragas failure")
+
+    import ragas
+
+    original_evaluate = ragas.evaluate
+    ragas.evaluate = _boom  # type: ignore[assignment]
+    try:
+        result = evaluator.evaluate(samples)
+    finally:
+        ragas.evaluate = original_evaluate  # type: ignore[assignment]
+
+    # 3 samples x 4 metrics = 12 skip records, not 4.
+    expected = len(samples) * 4  # 4 metric names in RagasEvaluator
+    assert len(result.skipped_samples) == expected, (
+        f"expected {expected} skip records (3 samples x 4 metrics), "
+        f"got {len(result.skipped_samples)}. Whole-batch failures must "
+        f"fan out to every (sample, metric) pair."
+    )
+    assert result.skipped_evaluations == expected, (
+        f"skipped_evaluations counter should equal the fan-out size "
+        f"({expected}), got {result.skipped_evaluations}"
+    )
+
+    # Every skip record should name a real sample_id (not "<aggregate>")
+    # so post-hoc analysis in the report can attribute failures to
+    # specific samples.
+    real_sample_ids = {s.sample_id for s in samples}
+    for skip in result.skipped_samples:
+        assert skip.sample_id in real_sample_ids, (
+            f"skip record has sample_id={skip.sample_id!r} which is not "
+            f"one of the real sample IDs {real_sample_ids!r} — aggregate "
+            f"attribution has regressed"
+        )
+        assert skip.exception_type == "RuntimeError"
+        assert "synthetic whole-batch" in skip.reason
+
+
+def test_skip_reason_truncated_to_400_chars() -> None:
+    """Long Python tracebacks must be truncated to prevent HTML / PDF
+    rendering blowing up. The limit is 400 chars with a trailing '...'.
+    """
+    pytest.importorskip("ragas")
+
+    evaluator = RagasEvaluator(
+        judge=MockJudge(),
+        thresholds={},
+        embeddings_provider="mock",
+    )
+    samples = _make_samples(1)
+
+    # Synthetic exception with a very long message.
+    import ragas
+
+    long_message = "x" * 5000
+    original_evaluate = ragas.evaluate
+
+    def _boom(*args: object, **kwargs: object) -> None:
+        _ = args, kwargs
+        raise ValueError(long_message)
+
+    ragas.evaluate = _boom  # type: ignore[assignment]
+    try:
+        result = evaluator.evaluate(samples)
+    finally:
+        ragas.evaluate = original_evaluate  # type: ignore[assignment]
+
+    for skip in result.skipped_samples:
+        assert len(skip.reason) <= 400, (
+            f"skip reason longer than 400 chars: {len(skip.reason)}"
+        )
+        assert skip.reason.endswith("..."), (
+            "long reasons should end with ellipsis to indicate truncation"
+        )


🧹 Nitpick | 🔵 Trivial

Add a regression for the per-metric fan-out branch too.

These tests only force the whole-batch except Exception path by patching ragas.evaluate. The other behavior changed in this PR lives in RagasEvaluator.evaluate()'s per-metric extraction branch, and nothing here proves that a single failed metric now fans out to one SkipRecord per sample instead of regressing back to an aggregate record.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/tests/test_ragas_skip_counter.py` around lines 74 - 168, Add a new test that mirrors test_whole_batch_crash_fans_out_to_every_sample_metric_pair but forces the per-metric extraction branch in RagasEvaluator.evaluate to raise an exception (patch the function RagasEvaluator calls for per-metric extraction — e.g., the ragas.extract_metric helper or the per-metric code path that RagasEvaluator.evaluate invokes — to raise a synthetic exception), then run evaluator.evaluate(samples) and assert that result.skipped_samples and result.skipped_evaluations equal len(samples) * <number of metrics in RagasEvaluator>, that every skip.sample_id is one of the real sample IDs, and that each skip.exception_type and skip.reason reflect the synthetic per-metric error (reason truncated/contains message as in the other tests).

* fix(evaluator): add missing ragas BaseRagasLLM methods (G1) RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes. Every RAGAS metric job crashed on first contact with AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate' during the PearMedica Cycle 3 audit (2026-04-15), producing a Scored: 0, Overall: 0.0000 report across all 48 metric evaluations. Root cause: the adapter's design doc asserted "ragas only calls generate_text / agenerate_text / model_name" and deliberately avoided subclassing BaseRagasLLM to keep ragas as a soft dependency. That assumption was wrong — BaseRagasLLM.generate is a concrete async helper that ragas's metric code invokes on every LLM regardless of subclass status. The duck-typed wrapper must re-declare it. Fix preserves the no-hard-import design by adding duck-typed shims for every public method on BaseRagasLLM and BaseRagasEmbeddings: RagForgeRagasLLM: - async generate() — the specific shim Cycle 3 caught missing - is_finished() — was abstract on base, returns True - get_temperature(n) — matches base convention (0.01 / 0.3) - set_run_config() — stores ragas RunConfig for compatibility - run_config attribute — defaults to None RagForgeRagasEmbeddings: - embed_text(is_async) — dispatch helper on base class - embed_texts(is_async) — batch variant - set_run_config() - run_config attribute Tests: a contract test iterates every public method on the real ragas base classes and asserts our wrappers declare a callable of the same name, so the next ragas release that adds a method fails in CI not in a user audit. An end-to-end smoke test runs ragas.evaluate() against our wrappers on a 1-sample dataset and asserts it never raises an AttributeError naming our wrapper classes — the exact regression signature from Cycle 3. Three pre-existing test failures in test_cycle2_regression.py and test_evaluator_factory.py are unrelated to this change (pre-existing in v0.2.1 when ragas is installed; they assert on compound conditions that break when MockJudge triggers the skip path). They belong in workstream G3 (skip counter plumbing) and the G1 PR leaves them alone. * fix(evaluator): address CodeRabbit review on G1 adapter shims Five fixes from CodeRabbit review on PR #36. All were real issues the original contract test missed because it only checked method names, not signatures. ### generate(n > 1) now produces the correct LLMResult shape ragas uses n > 1 for multi-sample metrics (answer_correctness consistency checks, etc.) and expects LLMResult.generations shaped [[gen1, gen2, ..., genN]] — one prompt run with N candidate generations. The original shim ignored n and returned a single generation, silently breaking any metric that relied on sample diversity. Fix: when n > 1, fan out n independent agenerate_text calls via asyncio.gather and fuse the per-call results into a single [[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper. n == 1 (the common case) stays a single call. New test: test_wrapper_async_generate_n_greater_than_one_produces_ fused_shape uses a counting judge to verify all N calls fire and the fused result carries all N distinct outputs. ### embed_text / embed_texts now async (crash fix) Critical bug: the original shims were sync methods wrapping asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings declares embed_text and embed_texts as async coroutines and metric code invokes them with await embeddings.embed_text(...) from inside ragas's evaluation event loop. asyncio.run() inside a running loop crashes with RuntimeError: asyncio.run() cannot be called from a running event loop — a real live-fire crash that would have taken down any Cycle 4 run. Fix: make both methods async. They now just await the existing aembed_query / aembed_documents paths. The is_async parameter is accepted for signature parity with the base class but ignored — our underlying clients are synchronous and aembed_query already runs them in a worker thread via asyncio.to_thread, so both flag values land on the same code path. New tests: - test_embeddings_embed_text_is_async_and_awaitable: asserts both methods are inspect.iscoroutinefunction and round-trips through asyncio.run(embed.embed_text(...)). - test_embeddings_embed_text_callable_from_running_event_loop: exercises the exact asyncio.run-inside-loop path ragas creates. ### Contract test now checks async/sync parity, not just names CodeRabbit correctly pointed out that the original contract test compared method names but not their async/sync shape — exactly why the sync embed_text / embed_texts slipped past review. Fix: two new tests (test_llm_wrapper_async_signature_matches_base and test_embeddings_wrapper_async_signature_matches_base) iterate every public method on the real ragas base classes and assert inspect.iscoroutinefunction matches on both sides. A future release that adds a method in one shape or the other will fail in CI. ### generate() default temperature: None -> 0.01 BaseRagasLLM.generate's default is 0.01. Our shim had None, which is functionally equivalent (None triggers get_temperature()) but diverges from the base signature. Changed to 0.01 to match. Tests that exercise the None fallback still pass None explicitly. ### Tightened existing temperature-fallback test The original test asserted "doesn't raise" but would have passed even if get_temperature() was never called. Now patches llm.get_temperature with wraps= and asserts assert_called_once_with(1). Added inverse test that patches get_temperature and asserts it was NOT called when an explicit temperature is passed — guards against over-eager fallback firing. ### Narrowed e2e smoke test exception handler Original except Exception accepted any non-AttributeError as a pass. A bad shim raising TypeError or a nested-loop bug raising RuntimeError would have counted as success. Tightened to allow only RagasOutputParserException (expected downstream parser failure from the stub judge's canned JSON); everything else propagates. * fix(evaluator): fuse _StringLLMResult stubs in n>1 path for no-ragas CI The n>1 fan-out fix in the previous commit built a fused LLMResult via langchain_core.outputs.LLMResult. On CI without the [ragas] extra installed, langchain isn't available and generate_text falls back to the _StringLLMResult stub — which _fuse_llm_results had no fuse path for, so it hit the ImportError fallback and returned results[0] (a single generation). The n>1 test then asserted inner length == 3 and failed with 1. Fix: separate the flatten step (build the fused_generations list) from the wrap step (pick LLMResult or _StringLLMResult). The flatten works on either shape because both expose the same .generations[0] interface. Wrap picks langchain if available, otherwise constructs via a new _StringLLMResult._from_generations alt constructor that carries a pre-fused list. Local verification: - without [ragas]: 18/18 adapter tests pass (contract + e2e skip) - with [ragas]: 28/28 adapter tests pass * fix(evaluator): address CodeRabbit round-2 findings on G1 Three more findings from the second CodeRabbit review on PR #36 / #38. ### _fuse_llm_results now fails loud on malformed input The previous commit's AttributeError/IndexError fallback silently returned results[0] — turning a requested n>1 generation into a single-sample result with no signal. Downstream ragas metrics that rely on sample diversity would consume the degraded result as if it were a valid N-sample fuse. CodeRabbit rightly flagged this as hiding real correctness bugs. Fix: raise ValueError with the observed result types and the underlying exception chained. Added a targeted test (test_fuse_llm_results_raises_on_malformed_input) covering three malformed-input shapes: objects without .generations, empty outer lists, and empty input lists. ### CountingJudge in n>1 test is now thread-safe RagForgeRagasLLM.generate fans out via asyncio.gather, and each agenerate_text call runs in a worker thread via asyncio.to_thread. counter["i"] += 1 is not atomic across threads — the read-modify- write race would produce duplicate sample labels or an undercounted total, flaking the test. Added a threading.Lock around the increment. Test now deterministic under concurrent fan-out. ### Contract tests now check parameter parity, not just name + async Previously we asserted that each public method on BaseRagasLLM / BaseRagasEmbeddings exists on our wrapper and matches async/sync. That missed parameter drift — a future ragas release that adds max_tokens to generate_text would silently break our wrapper until a user audit caught it. Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class and test_embeddings_wrapper_parameter_names_cover_base_class) use inspect.signature to enumerate the named parameters on each base- class method and assert the wrapper accepts every name. Variadic *args / **kwargs are ignored (they can absorb any kwarg by definition). A helper _required_param_names() shares the logic. Local: 30 passed on adapter + contract + e2e suites with ragas installed, 18 passed on adapter suite alone without the [ragas] extra. Ruff + mypy clean.

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

hallengray force-pushed the fix/v0.2.2-g3-ragas-skip-counter branch from e61ee27 to 9c2f457 Compare April 15, 2026 22:15

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

hallengray force-pushed the fix/v0.2.2-g3-ragas-skip-counter branch from 9c2f457 to 7112da7 Compare April 15, 2026 22:26

hallengray added 4 commits April 15, 2026 23:45

hallengray force-pushed the fix/v0.2.2-g3-ragas-skip-counter branch from 7112da7 to 13dbf26 Compare April 15, 2026 22:46

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

hallengray merged commit 70c1d69 into main Apr 15, 2026
3 checks passed

hallengray deleted the fix/v0.2.2-g3-ragas-skip-counter branch April 15, 2026 22:53

hallengray mentioned this pull request Apr 15, 2026

chore(release): prep v0.2.2 — lockstep version bump (6 packages) #41

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(evaluator): route RAGAS exceptions into Skipped counter (v0.2.2 G3)#38

fix(evaluator): route RAGAS exceptions into Skipped counter (v0.2.2 G3)#38
hallengray merged 7 commits into
mainfrom
fix/v0.2.2-g3-ragas-skip-counter

hallengray commented Apr 15, 2026

Uh oh!

coderabbitai Bot commented Apr 15, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 15, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 15, 2026

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hallengray commented Apr 15, 2026

Summary

Changes

Tests

Merge order

Uh oh!

coderabbitai Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Apr 15, 2026 •

edited

Loading