feat(adapter): add experimental live endpoint response collector #142
csoceanu wants to merge 2 commits into
Add a collection utility to the adapter SDK that queries a live chatbot endpoint with test questions and produces an evaluation-ready JSONL dataset.
Supports OpenAI-compatible chat completions and generic HTTP endpoints (MCP, Langflow, custom APIs) with configurable request templates and response path extraction. Integrates with existing SDK auth (resolve_model_credentials) and TLS patterns (CA bundle auto-detection, insecure flag).
Config is passed through JobSpec.parameters["live_collection"] as an experimental shape.
Refs eval-hub#135
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough
This pull request adds a new experimental "live endpoint response collection" module to the adapter SDK, enabling adapters to query chatbot endpoints with test questions and collect responses during the evaluation pipeline. The implementation includes configuration models, protocol handlers for both OpenAI-compatible and generic HTTP endpoints, comprehensive error handling with retries, and a complete test suite.
Changes
Live Endpoint Response Collection Feature
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
This is a substantial new module introducing multiple interacting components: configuration validation, protocol handlers for two endpoint types, retry logic with exponential backoff, authentication/TLS infrastructure, multi-format question loading, and response flattening. The logic density is moderate but spread across many functions with distinct responsibilities. Comprehensive test coverage validates behavior across all major paths, reducing verification burden per reviewer.
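The "retry logic with exponential backoff" mentioned in the summary is not shown on this page. As a rough sketch of the general technique only (the helper name `call_with_retries`, its parameters, and the doubling schedule are assumptions for illustration, not the collector's actual API):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with a delay that doubles each time.

    Hypothetical helper: the real collector may retry only on specific
    HTTP errors, add jitter, or cap the delay.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts:
                raise
            # Delays follow base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the schedule testable without real waiting.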
🚥 Pre-merge checks | ✅ 5 passed
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/evalhub/adapter/collector.py (1)
247-251: 💤 Low value. Consider writing a partial manifest on fail_fast failure.
When `fail_fast=True` and collection fails, the exception propagates before the manifest is written (lines 261-276). This leaves `responses.jsonl` with partial data but no `manifest.json`, making it harder for operators to inspect the failure state.
Consider either:
- Writing a partial manifest with `failed=1` before raising, or
- Adding a clear log message with progress info before raising.
Also applies to: 261-276
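The first option could look roughly like the sketch below. Only `CollectorError`, `fail_fast`, `responses.jsonl`, and `manifest.json` come from the review; the `collect` and `write_manifest` functions, their signatures, and the manifest fields are hypothetical stand-ins, since the actual code in `collector.py` is not shown here.

```python
import json
from pathlib import Path


class CollectorError(RuntimeError):
    """Stand-in for the collector's error type (name taken from the review)."""


def write_manifest(out_dir: Path, total: int, succeeded: int, failed: int, partial: bool) -> Path:
    # Hypothetical manifest layout; the real fields may differ.
    manifest = {"total": total, "succeeded": succeeded, "failed": failed, "partial": partial}
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return path


def collect(questions, ask, out_dir: Path, fail_fast: bool = True) -> Path:
    """Query `ask` for each question, writing responses.jsonl and manifest.json."""
    succeeded = failed = 0
    with (out_dir / "responses.jsonl").open("w", encoding="utf-8") as fh:
        for question in questions:
            try:
                answer = ask(question)
            except Exception as exc:
                failed += 1
                if fail_fast:
                    # Persist progress first so operators can inspect the failure state.
                    write_manifest(out_dir, len(questions), succeeded, failed, partial=True)
                    raise CollectorError(
                        f"collection aborted after {succeeded} successes: {exc}"
                    ) from exc
                fh.write(json.dumps({"question": question, "error": str(exc)}) + "\n")
                continue
            succeeded += 1
            fh.write(json.dumps({"question": question, "response": answer}) + "\n")
    return write_manifest(out_dir, len(questions), succeeded, failed, partial=False)
```

Writing the manifest before raising keeps the on-disk state self-describing: a `partial` flag next to the counts tells an operator that `responses.jsonl` is incomplete.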
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/evalhub/adapter/collector.py` around lines 247 - 251, When config.fail_fast is true and you raise CollectorError in the collection loop, write a partial manifest before raising so operators can see progress; update the code path that currently raises CollectorError (the block referencing config.fail_fast and CollectorError) to first build a manifest object containing progress fields (total attempted, succeeded, failed, and a failed=1 or partial=true flag) and call the existing manifest-writer used elsewhere in this module (look for the manifest writing code near lines 261-276 or functions named write_manifest/_write_manifest) to persist that manifest, or at minimum emit a concise log with those progress numbers, then re-raise the CollectorError. Ensure the same change is applied to the other fail_fast sites mentioned (around lines 261-276).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/evalhub/adapter/collector.py`:
- Around line 334-342: substitute_template currently calls format_map() on
strings which raises a bare KeyError for missing placeholders; wrap the string
formatting in a try/except that catches KeyError from
template.format_map(variables) and re-raise a clearer error (ValueError or
KeyError) that includes the missing placeholder name and the template string
(and chain the original exception) so callers can see which template/value
failed; update the substitute_template function to perform this localized catch
for the str branch while leaving dict/list recursion as-is.
- Around line 494-508: _flatten_record currently overwrites keys from
record.source_fields when they collide with reserved output names (response,
raw_response, error, latency_ms) or with record.extra_fields; update
_flatten_record to detect collisions before merging: compute the intersection
between record.source_fields.keys() and the reserved set
{"response","raw_response","error","latency_ms"} and also between source_fields
and record.extra_fields.keys(), and emit a warning (use the module/logger used
elsewhere in this file) listing the colliding keys and the record identifier if
available; after logging, proceed with the existing merge behavior so output
remains unchanged but collisions are surfaced for debugging.
---
Nitpick comments:
In `@src/evalhub/adapter/collector.py`:
- Around line 247-251: When config.fail_fast is true and you raise
CollectorError in the collection loop, write a partial manifest before raising
so operators can see progress; update the code path that currently raises
CollectorError (the block referencing config.fail_fast and CollectorError) to
first build a manifest object containing progress fields (total attempted,
succeeded, failed, and a failed=1 or partial=true flag) and call the existing
manifest-writer used elsewhere in this module (look for the manifest writing
code near lines 261-276 or functions named write_manifest/_write_manifest) to
persist that manifest, or at minimum emit a concise log with those progress
numbers, then re-raise the CollectorError. Ensure the same change is applied to
the other fail_fast sites mentioned (around lines 261-276).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 47fdfff3-877a-4fa8-8c65-bd8911550c89
📒 Files selected for processing (4)
- README.md
- src/evalhub/adapter/__init__.py
- src/evalhub/adapter/collector.py
- tests/unit/test_collector.py
…rnings
Address CodeRabbit review findings:
- Wrap template format_map() to catch KeyError and provide a clear message listing available placeholders
- Log a warning when input source fields collide with reserved collector output field names (response, raw_response, error, latency_ms)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
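A minimal sketch of what these two fixes might look like, for readers without the diff at hand. The function name `substitute_template` and the reserved field names come from the review; the signatures, the choice of `ValueError`, and the `warn_on_collisions` helper are assumptions for illustration and may not match the actual commit.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Reserved output names listed in the review comment.
RESERVED_FIELDS = {"response", "raw_response", "error", "latency_ms"}


def substitute_template(template: Any, variables: dict[str, Any]) -> Any:
    """Recursively fill {placeholders} in strings nested inside dicts and lists."""
    if isinstance(template, str):
        try:
            return template.format_map(variables)
        except KeyError as exc:
            missing = exc.args[0]
            # Re-raise with the template and the known placeholders for context.
            raise ValueError(
                f"Unknown placeholder {{{missing}}} in template {template!r}; "
                f"available: {sorted(variables)}"
            ) from exc
    if isinstance(template, dict):
        return {key: substitute_template(value, variables) for key, value in template.items()}
    if isinstance(template, list):
        return [substitute_template(item, variables) for item in template]
    return template


def warn_on_collisions(source_fields: dict[str, Any], extra_fields: dict[str, Any]) -> set[str]:
    """Log and return source keys that would be overwritten during flattening."""
    collisions = set(source_fields) & (RESERVED_FIELDS | set(extra_fields))
    if collisions:
        logger.warning(
            "Source fields collide with collector output fields: %s", sorted(collisions)
        )
    return collisions
```

The warning helper only reports collisions and leaves the merge untouched, matching the review's request that output stay unchanged while collisions are surfaced.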
What and why
Adds an experimental adapter-side utility for collecting responses from live chatbot endpoints during `JobPhase.LOADING_DATA`. This addresses the gap identified in #135: chatbot teams currently need to write custom scripts to query their endpoint and format responses before submitting an EvalHub job.
The collector supports:
- Request templates with `{question}` placeholder substitution
- Response path extraction (`output.answer`, `result.content.0.text`)
- Extra response fields via `extra_response_paths`
- Authentication (`resolve_model_credentials()`, `api_key_env`, custom headers)
Config is passed through `JobSpec.parameters["live_collection"]` as an experimental shape, with the intent to promote to `test_data_ref` once validated (per discussion in #135).
This utility would be useful for evaluation adapters like the RAGAS adapter (eval-hub-contrib#36) that evaluate chatbot/RAG responses against quality metrics.
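To illustrate how dotted response paths such as `output.answer` and `result.content.0.text` can be resolved, here is a small sketch. It assumes numeric segments index into lists and all other segments are dict keys; `extract_path` is a hypothetical name, and the collector's real resolver may behave differently (for example in how it reports missing keys).

```python
from typing import Any


def extract_path(payload: Any, path: str) -> Any:
    """Walk a dotted path through nested dicts and lists.

    Assumption: a segment applied to a list is parsed as an integer index.
    """
    current = payload
    for segment in path.split("."):
        if isinstance(current, list):
            current = current[int(segment)]
        elif isinstance(current, dict):
            current = current[segment]
        else:
            raise KeyError(
                f"cannot descend into {type(current).__name__} at segment {segment!r}"
            )
    return current


# Example payloads shaped to match the two paths named in the description.
simple = {"output": {"answer": "Paris"}}
nested = {"result": {"content": [{"type": "text", "text": "Paris"}]}}

print(extract_path(simple, "output.answer"))
print(extract_path(nested, "result.content.0.text"))
```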
Closes #135
Type
Testing
Also tested live against a real chatbot endpoint using the generic HTTP protocol with CA bundle authentication.
Breaking changes
None. This is new experimental adapter SDK surface.
Summary by CodeRabbit
New Features
Documentation