feat(adapter): add experimental live endpoint response collector by csoceanu · Pull Request #142 · eval-hub/eval-hub-sdk

csoceanu · 2026-06-09T13:42:09Z

What and why

Adds an experimental adapter-side utility for collecting responses from live chatbot endpoints during JobPhase.LOADING_DATA. This addresses the gap identified in #135 — chatbot teams currently need to write custom scripts to query their endpoint and format responses before submitting an EvalHub job.

The collector supports:

- OpenAI-compatible chat completions and generic HTTP endpoints (MCP, Langflow, custom APIs)
- Configurable request templates with {question} placeholder substitution
- Configurable response extraction via dot-separated paths (e.g. output.answer, result.content.0.text)
- Additional field extraction (e.g. retrieved contexts for RAG evaluation) via extra_response_paths
- SDK auth integration (resolve_model_credentials(), api_key_env, custom headers)
- TLS/CA bundle support following existing SDK patterns
- Configurable failure handling (fail-fast or best-effort)
- Progress callback support
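
To illustrate the dot-separated response paths mentioned in the list above, here is a minimal sketch of how such a path could be resolved against a JSON response. It is not the code from this PR: the function body is an assumption, and only the name `extract_by_path` and the example paths come from the PR text.

```python
from typing import Any


def extract_by_path(data: Any, path: str) -> Any:
    """Walk a nested dict/list structure along a dot-separated path.

    Segments index into lists when the current value is a list and are
    treated as dict keys otherwise. Raises KeyError naming the failing segment.
    """
    current = data
    for segment in path.split("."):
        if isinstance(current, list):
            try:
                current = current[int(segment)]
            except (ValueError, IndexError) as exc:
                raise KeyError(f"cannot resolve {segment!r} in path {path!r}") from exc
        elif isinstance(current, dict):
            if segment not in current:
                raise KeyError(f"cannot resolve {segment!r} in path {path!r}")
            current = current[segment]
        else:
            raise KeyError(f"cannot resolve {segment!r} in path {path!r}")
    return current


# Hypothetical response payload, shaped to match the example paths in the PR.
payload = {"result": {"content": [{"text": "Paris"}]}, "output": {"answer": "42"}}
print(extract_by_path(payload, "result.content.0.text"))  # Paris
print(extract_by_path(payload, "output.answer"))  # 42
```

Raising a `KeyError` that names both the segment and the full path is one way to make a misconfigured path easy to diagnose; the actual collector may report such failures differently.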

Config is passed through JobSpec.parameters["live_collection"] as an experimental shape, with the intent to promote to test_data_ref once validated (per discussion in #135).

This utility would be useful for evaluation adapters like the RAGAS adapter (eval-hub-contrib#36) that evaluate chatbot/RAG responses against quality metrics.

Closes #135

Type

Testing

Tests added or updated
Tested manually

uv run ruff check src/evalhub/adapter/collector.py tests/unit/test_collector.py
uv run mypy --config-file=pyproject.toml src/evalhub/adapter/collector.py
uv run pytest tests/unit/test_collector.py -v  (46 passed)
uv run pytest tests/unit/ -q  (299 passed, 0 failures)

Also tested live against a real chatbot endpoint using the generic HTTP protocol with CA bundle authentication.

Breaking changes

None. This is new experimental adapter SDK surface.

Summary by CodeRabbit

New Features

Added experimental "Live Endpoint Response Collection" feature to query chatbot endpoints with test questions and save responses for evaluation
Supports OpenAI-compatible and generic HTTP endpoint configurations with customizable request/response path mappings
Loads test questions from CSV, JSON, and JSONL formats with automatic retry logic and progress tracking
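
As a rough illustration of loading questions from the three formats listed above, the sketch below dispatches on the file suffix. The PR names a `load_questions` function, but this signature, the return type, and the error messages are assumptions made for the example.

```python
import csv
import json
from pathlib import Path
from typing import Any


def load_questions(path: Path) -> list[dict[str, Any]]:
    """Load question rows from a CSV, JSON, or JSONL file based on its suffix."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # DictReader uses the header row as keys for every question row.
        with path.open(newline="", encoding="utf-8") as handle:
            return [dict(row) for row in csv.DictReader(handle)]
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError(f"{path}: expected a JSON array of question objects")
        return data
    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as handle:
            # Skip blank lines so a trailing newline does not break parsing.
            return [json.loads(line) for line in handle if line.strip()]
    raise ValueError(f"{path}: unsupported question file format {suffix!r}")
```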

Documentation

Added comprehensive "Live Endpoint Response Collection" section to README with workflow examples and configuration patterns

Add a collection utility to the adapter SDK that queries a live chatbot endpoint with test questions and produces an evaluation-ready JSONL dataset.

Supports OpenAI-compatible chat completions and generic HTTP endpoints (MCP, Langflow, custom APIs) with configurable request templates and response path extraction. Integrates with existing SDK auth (resolve_model_credentials) and TLS patterns (CA bundle auto-detection, insecure flag).

Config is passed through JobSpec.parameters["live_collection"] as an experimental shape.

Refs eval-hub#135

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-09T13:42:27Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7be0b978-14be-42ab-81d0-e31a1cd7727d

📥 Commits

Reviewing files that changed from the base of the PR and between a40f551 and 75230ec.

📒 Files selected for processing (1)

src/evalhub/adapter/collector.py

📝 Walkthrough

Walkthrough

This pull request adds a new experimental "live endpoint response collection" module to the adapter SDK, enabling adapters to query chatbot endpoints with test questions and collect responses during the evaluation pipeline. The implementation includes configuration models, protocol handlers for both OpenAI-compatible and generic HTTP endpoints, comprehensive error handling with retries, and a complete test suite.

Changes

Live Endpoint Response Collection Feature

| Layer / File(s) | Summary |
| --- | --- |
| Configuration Models and Public API Exports: `src/evalhub/adapter/collector.py` (1–183), `src/evalhub/adapter/__init__.py` | `CollectorProtocol` (OpenAI vs generic HTTP), `CollectorError`, `CollectorConfig` with validation, and data models (`LiveQuestion`, `CollectedRecord`, `CollectionManifest`) define the collection contract. Public exports from `__init__.py` expose all collector utilities. |
| Question Loading and Data Utilities: `src/evalhub/adapter/collector.py` (288–343) | `load_questions` dispatches CSV/JSON/JSONL file loading; `extract_by_path` and `substitute_template` support dotted-path traversal and recursive variable substitution for request templates and response extraction. |
| Authentication and TLS Setup: `src/evalhub/adapter/collector.py` (345–377) | `_resolve_auth_headers` merges credential-derived bearer tokens, environment API keys, and config headers. `_resolve_verify` derives TLS verification from configuration, CA bundle paths, and service-account certificates. |
| Protocol-Specific Collection Handlers: `src/evalhub/adapter/collector.py` (390–436, 523–533) | `_collect_openai` builds chat-completions request bodies with system prompts; `_collect_generic_http` performs template substitution and delegates to request execution. Protocol-specific extractors (`_extract_openai`, `_extract_generic`) parse responses. |
| Request Execution and Response Processing: `src/evalhub/adapter/collector.py` (379–492) | `_collect_one` dispatches by protocol. `_send_request` handles exponential backoff retries, latency measurement, redirect blocking, JSON validation, response extraction, and error aggregation. Response records are flattened to JSONL with optional extra-field extraction. |
| Main Collection Orchestration: `src/evalhub/adapter/collector.py` (191–286) | `collect_responses` orchestrates the full flow: resolves credentials, loads questions, owns/manages HTTP client, iterates collection with fail-fast semantics, writes `responses.jsonl` and `manifest.json`, and reports progress. `collect_responses_from_parameters` wraps configuration construction. |
| Comprehensive Test Suite: `tests/unit/test_collector.py` | Tests cover configuration validation, auth resolution, question loading (CSV/JSON/JSONL), utility functions (path extraction, template substitution), OpenAI collection with manifest output and retry behavior, generic HTTP collection with request templating and extra-field extraction, redirect/error handling, progress callbacks, and fail-fast semantics. |
| User Documentation: `README.md` (532–592) | Documents the experimental live response collection feature with example configurations for OpenAI-compatible and generic HTTP endpoints, parameter structures, credential resolution, and TLS auto-detection. |
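
The walkthrough above says `_send_request` combines exponential backoff retries with latency measurement. The sketch below shows that general pattern using only the standard library. It is not the PR's implementation: `call_with_backoff`, its parameters, and the delay schedule are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    send: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[T, float]:
    """Call send() until it succeeds, doubling the wait after each failure.

    Returns the result and the latency in milliseconds of the successful
    attempt. Re-raises the last exception once all retries are used up.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        started = time.perf_counter()
        try:
            result = send()
        except Exception:
            if attempt == max_retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
            continue
        latency_ms = (time.perf_counter() - started) * 1000.0
        return result, latency_ms
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter keeps the retry schedule testable without real waiting; a production version would likely retry only on transient errors (timeouts, 5xx responses) rather than on every exception.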

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

This is a substantial new module introducing multiple interacting components: configuration validation, protocol handlers for two endpoint types, retry logic with exponential backoff, authentication/TLS infrastructure, multi-format question loading, and response flattening. The logic density is moderate but spread across many functions with distinct responsibilities. Comprehensive test coverage validates behavior across all major paths, reducing verification burden per reviewer.

Suggested labels

enhancement

Suggested reviewers

tarilabs

Poem

🐰 A question walks into an endpoint,
The collector gathers answers with care,
OpenAI or HTTP, it doesn't care—
Retries with backoff, manifests appear,
Responses flow to JSONL, crystal clear! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title concisely describes the main feature: adding an experimental live endpoint response collector to the adapter module. |
| Description check | ✅ Passed | The PR description comprehensively covers objectives, implementation details, testing, and breaking changes; it follows the template with proper sections and demonstrates thorough work. |
| Linked Issues check | ✅ Passed | The implementation fully addresses issue `#135` by providing SDK-level capability for collecting responses from live chatbot endpoints with configurable templates, auth integration, and failure handling. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: README documentation of the feature, new collector module with implementation, public API exports, and comprehensive unit tests directly support the `#135` objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/evalhub/adapter/collector.py (1)
247-251: 💤 Low value

Consider writing a partial manifest on fail_fast failure.

When fail_fast=True and collection fails, the exception propagates before the manifest is written (lines 261-276). This leaves responses.jsonl with partial data but no manifest.json, making it harder for operators to inspect the failure state.

Consider either:

Writing a partial manifest with failed=1 before raising, or

Adding a clear log message with progress info before raising.

Also applies to: 261-276
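
One possible shape for the first option in this nitpick is sketched below: the manifest is written before the error propagates, so `responses.jsonl` is never left without a companion `manifest.json`. Everything here is an assumption for illustration, including the `CollectorError` stand-in, the manifest field names, and the `collect_all` and `write_manifest` helpers; the real module may be structured differently.

```python
import json
from pathlib import Path
from typing import Callable


class CollectorError(Exception):
    """Stand-in for the SDK's CollectorError; the real class may differ."""


def write_manifest(output_dir: Path, total: int, succeeded: int, failed: int, partial: bool) -> Path:
    """Persist collection progress as manifest.json (hypothetical field names)."""
    manifest_path = output_dir / "manifest.json"
    manifest = {"total": total, "succeeded": succeeded, "failed": failed, "partial": partial}
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


def collect_all(
    questions: list[str],
    ask: Callable[[str], str],
    output_dir: Path,
    fail_fast: bool = True,
) -> None:
    """Run ask() per question, writing responses.jsonl and always a manifest."""
    succeeded = failed = 0
    with (output_dir / "responses.jsonl").open("w", encoding="utf-8") as out:
        for question in questions:
            try:
                answer = ask(question)
            except Exception as exc:
                failed += 1
                if fail_fast:
                    # Record how far collection got before surfacing the failure.
                    write_manifest(output_dir, len(questions), succeeded, failed, partial=True)
                    raise CollectorError(f"collection failed on {question!r}") from exc
                continue  # best-effort mode: skip this question (simplified)
            succeeded += 1
            out.write(json.dumps({"question": question, "response": answer}) + "\n")
    write_manifest(output_dir, len(questions), succeeded, failed, partial=False)
```

A `partial` flag lets an operator tell an aborted run from a completed one without comparing counts by hand.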
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/evalhub/adapter/collector.py` around lines 247 - 251, When
config.fail_fast is true and you raise CollectorError in the collection loop,
write a partial manifest before raising so operators can see progress; update
the code path that currently raises CollectorError (the block referencing
config.fail_fast and CollectorError) to first build a manifest object containing
progress fields (total attempted, succeeded, failed, and a failed=1 or
partial=true flag) and call the existing manifest-writer used elsewhere in this
module (look for the manifest writing code near lines 261-276 or functions named
write_manifest/_write_manifest) to persist that manifest, or at minimum emit a
concise log with those progress numbers, then re-raise the CollectorError.
Ensure the same change is applied to the other fail_fast sites mentioned (around
lines 261-276).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/evalhub/adapter/collector.py`:
- Around line 334-342: substitute_template currently calls format_map() on
strings which raises a bare KeyError for missing placeholders; wrap the string
formatting in a try/except that catches KeyError from
template.format_map(variables) and re-raise a clearer error (ValueError or
KeyError) that includes the missing placeholder name and the template string
(and chain the original exception) so callers can see which template/value
failed; update the substitute_template function to perform this localized catch
for the str branch while leaving dict/list recursion as-is.
- Around line 494-508: _flatten_record currently overwrites keys from
record.source_fields when they collide with reserved output names (response,
raw_response, error, latency_ms) or with record.extra_fields; update
_flatten_record to detect collisions before merging: compute the intersection
between record.source_fields.keys() and the reserved set
{"response","raw_response","error","latency_ms"} and also between source_fields
and record.extra_fields.keys(), and emit a warning (use the module/logger used
elsewhere in this file) listing the colliding keys and the record identifier if
available; after logging, proceed with the existing merge behavior so output
remains unchanged but collisions are surfaced for debugging.
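
The first inline comment above asks for a clearer error when a template references an unknown placeholder. A minimal sketch of that suggestion follows; it assumes `substitute_template` takes a template value and a variables dict, which is inferred from the review text rather than taken from the PR's code.

```python
from typing import Any


def substitute_template(template: Any, variables: dict[str, str]) -> Any:
    """Recursively replace {name} placeholders in strings nested in dicts/lists."""
    if isinstance(template, str):
        try:
            return template.format_map(variables)
        except KeyError as exc:
            missing = exc.args[0]
            available = ", ".join(sorted(variables)) or "<none>"
            # Chain the original KeyError so the traceback keeps its origin.
            raise ValueError(
                f"unknown placeholder {{{missing}}} in template {template!r}; "
                f"available placeholders: {available}"
            ) from exc
    if isinstance(template, dict):
        return {key: substitute_template(value, variables) for key, value in template.items()}
    if isinstance(template, list):
        return [substitute_template(item, variables) for item in template]
    return template  # numbers, booleans, and None pass through unchanged


# Hypothetical request template for a generic HTTP endpoint.
body = substitute_template(
    {"input": {"query": "{question}"}, "top_k": 3},
    {"question": "What is RAG?"},
)
print(body)  # {'input': {'query': 'What is RAG?'}, 'top_k': 3}
```

Because this relies on `str.format_map`, any literal braces in a template string would need to be doubled (`{{` and `}}`) to survive substitution.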



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47fdfff3-877a-4fa8-8c65-bd8911550c89

📥 Commits

Reviewing files that changed from the base of the PR and between 162fa69 and a40f551.

📒 Files selected for processing (4)

README.md
src/evalhub/adapter/__init__.py
src/evalhub/adapter/collector.py
tests/unit/test_collector.py

…rnings

Address CodeRabbit review findings:

- Wrap template format_map() to catch KeyError and provide a clear message listing available placeholders
- Log a warning when input source fields collide with reserved collector output field names (response, raw_response, error, latency_ms)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
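
The collision warning described in this commit could look roughly like the sketch below. The reserved field names come from the commit message; the function signature, merge order, and log wording are assumptions, and the PR's `_flatten_record` operates on a record object rather than plain dicts.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Output field names the collector writes itself (from the commit message).
RESERVED_FIELDS = {"response", "raw_response", "error", "latency_ms"}


def flatten_record(
    source_fields: dict[str, Any],
    collected: dict[str, Any],
    extra_fields: dict[str, Any],
) -> dict[str, Any]:
    """Merge input columns, collector output, and extra fields into one flat row.

    Later sources win on key collisions; collisions are logged, not rejected.
    """
    collisions = sorted(set(source_fields) & (RESERVED_FIELDS | set(extra_fields)))
    if collisions:
        logger.warning("source fields overwritten by collector output: %s", ", ".join(collisions))
    return {**source_fields, **collected, **extra_fields}


row = flatten_record(
    {"question": "q", "response": "old"},
    {"response": "new", "latency_ms": 12.0},
    {"contexts": ["c"]},
)
print(row["response"])  # new
```

Logging instead of raising keeps the output unchanged, as the review comment requested, while still surfacing the overwritten input column.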

coderabbitai Bot reviewed Jun 9, 2026

feat(adapter): add experimental live endpoint response collector #142

csoceanu wants to merge 2 commits into eval-hub:main from csoceanu:feat/live-endpoint-collector
