feat(adapter): add experimental live endpoint response collector #142

Open
csoceanu wants to merge 2 commits into eval-hub:main from csoceanu:feat/live-endpoint-collector
Conversation

@csoceanu

@csoceanu csoceanu commented Jun 9, 2026

What and why

Adds an experimental adapter-side utility for collecting responses from live chatbot endpoints during JobPhase.LOADING_DATA. This addresses the gap identified in #135 — chatbot teams currently need to write custom scripts to query their endpoint and format responses before submitting an EvalHub job.

The collector supports:

  • OpenAI-compatible chat completions and generic HTTP endpoints (MCP, Langflow, custom APIs)
  • Configurable request templates with {question} placeholder substitution
  • Configurable response extraction via dot-separated paths (e.g. output.answer, result.content.0.text)
  • Additional field extraction (e.g. retrieved contexts for RAG evaluation) via extra_response_paths
  • SDK auth integration (resolve_model_credentials(), api_key_env, custom headers)
  • TLS/CA bundle support following existing SDK patterns
  • Configurable failure handling (fail-fast or best-effort)
  • Progress callback support
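The dot-separated response paths in the list above (for example `result.content.0.text`) can be illustrated with a minimal sketch. This is not the PR's implementation: the function name `get_by_dotted_path` and the choice to return `None` for a missing segment are assumptions made for illustration.

```python
from typing import Any


def get_by_dotted_path(data: Any, path: str) -> Any:
    """Walk a dot-separated path through nested dicts and lists.

    Numeric segments index into lists; other segments are dict keys.
    Returning None for a missing segment is an assumption of this sketch.
    """
    current = data
    for segment in path.split("."):
        if isinstance(current, dict):
            if segment not in current:
                return None
            current = current[segment]
        elif isinstance(current, list):
            # Only non-negative integer segments are valid list indexes here.
            if not segment.isdigit() or int(segment) >= len(current):
                return None
            current = current[int(segment)]
        else:
            # Reached a scalar before the path was exhausted.
            return None
    return current


payload = {"result": {"content": [{"text": "Paris"}]}, "output": {"answer": "42"}}
print(get_by_dotted_path(payload, "result.content.0.text"))  # Paris
print(get_by_dotted_path(payload, "output.answer"))          # 42
print(get_by_dotted_path(payload, "result.missing"))         # None
```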

Config is passed through JobSpec.parameters["live_collection"] as an experimental shape, with the intent to promote to test_data_ref once validated (per discussion in #135).

This utility would be useful for evaluation adapters like the RAGAS adapter (eval-hub-contrib#36) that evaluate chatbot/RAG responses against quality metrics.

Closes #135

Type

  • feat
  • fix
  • docs
  • refactor / chore
  • test / ci

Testing

  • Tests added or updated
  • Tested manually
uv run ruff check src/evalhub/adapter/collector.py tests/unit/test_collector.py
uv run mypy --config-file=pyproject.toml src/evalhub/adapter/collector.py
uv run pytest tests/unit/test_collector.py -v  (46 passed)
uv run pytest tests/unit/ -q  (299 passed, 0 failures)

Also tested live against a real chatbot endpoint using the generic HTTP protocol with CA bundle authentication.

Breaking changes

None. This is new experimental adapter SDK surface.

Summary by CodeRabbit

New Features

  • Added experimental "Live Endpoint Response Collection" feature to query chatbot endpoints with test questions and save responses for evaluation
  • Supports OpenAI-compatible and generic HTTP endpoint configurations with customizable request/response path mappings
  • Loads test questions from CSV, JSON, and JSONL formats with automatic retry logic and progress tracking
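As a rough illustration of loading questions from several formats, the sketch below dispatches on the file extension. The name `load_questions_sketch`, the one-dict-per-question row shape, and the requirement that a `.json` file holds a top-level array are assumptions, not details confirmed by the PR.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Any


def load_questions_sketch(path: Path) -> list[dict[str, Any]]:
    """Load question rows from a CSV, JSON, or JSONL file, chosen by suffix."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return [dict(row) for row in csv.DictReader(handle)]
    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as handle:
            # Skip blank lines so trailing newlines do not break parsing.
            return [json.loads(line) for line in handle if line.strip()]
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError(f"{path}: expected a JSON array of question objects")
        return data
    raise ValueError(f"Unsupported question file format: {suffix!r}")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "questions.csv"
    csv_path.write_text("question,expected\nWhat is 2+2?,4\n", encoding="utf-8")
    print(load_questions_sketch(csv_path))
```

Note that CSV values always arrive as strings, so any numeric columns would need explicit conversion downstream.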

Documentation

  • Added comprehensive "Live Endpoint Response Collection" section to README with workflow examples and configuration patterns

Add a collection utility to the adapter SDK that queries a live chatbot
endpoint with test questions and produces an evaluation-ready JSONL dataset.
Supports OpenAI-compatible chat completions and generic HTTP endpoints
(MCP, Langflow, custom APIs) with configurable request templates and
response path extraction.

Integrates with existing SDK auth (resolve_model_credentials) and TLS
patterns (CA bundle auto-detection, insecure flag). Config is passed
through JobSpec.parameters["live_collection"] as an experimental shape.
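One way the credential sources mentioned above could be combined is sketched below. The precedence shown (resolved credential first, then the `api_key_env` variable, then explicit config headers overriding both) is an assumption for illustration; the PR text does not state the actual order. The function name and the `CHATBOT_API_KEY` variable are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_auth_headers_sketch(
    credential_token: Optional[str],
    api_key_env: Optional[str],
    config_headers: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> dict[str, str]:
    """Merge auth sources into one header dict (precedence is assumed)."""
    headers: dict[str, str] = {}
    if credential_token:
        headers["Authorization"] = f"Bearer {credential_token}"
    elif api_key_env and environ.get(api_key_env):
        headers["Authorization"] = f"Bearer {environ[api_key_env]}"
    # Explicit config headers are applied last so they can override the above.
    headers.update(config_headers)
    return headers


print(
    resolve_auth_headers_sketch(
        credential_token=None,
        api_key_env="CHATBOT_API_KEY",  # hypothetical variable name
        config_headers={"X-Tenant": "demo"},
        environ={"CHATBOT_API_KEY": "s3cret"},
    )
)
```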

Refs eval-hub#135

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7be0b978-14be-42ab-81d0-e31a1cd7727d

📥 Commits

Reviewing files that changed from the base of the PR and between a40f551 and 75230ec.

📒 Files selected for processing (1)
  • src/evalhub/adapter/collector.py
📝 Walkthrough


This pull request adds a new experimental "live endpoint response collection" module to the adapter SDK, enabling adapters to query chatbot endpoints with test questions and collect responses during the evaluation pipeline. The implementation includes configuration models, protocol handlers for both OpenAI-compatible and generic HTTP endpoints, comprehensive error handling with retries, and a complete test suite.

Changes

Live Endpoint Response Collection Feature

| Layer | File(s) | Summary |
|---|---|---|
| Configuration Models and Public API Exports | src/evalhub/adapter/collector.py (1–183), src/evalhub/adapter/__init__.py | CollectorProtocol (OpenAI vs generic HTTP), CollectorError, CollectorConfig with validation, and data models (LiveQuestion, CollectedRecord, CollectionManifest) define the collection contract. Public exports from __init__.py expose all collector utilities. |
| Question Loading and Data Utilities | src/evalhub/adapter/collector.py (288–343) | load_questions dispatches CSV/JSON/JSONL file loading; extract_by_path and substitute_template support dotted-path traversal and recursive variable substitution for request templates and response extraction. |
| Authentication and TLS Setup | src/evalhub/adapter/collector.py (345–377) | _resolve_auth_headers merges credential-derived bearer tokens, environment API keys, and config headers. _resolve_verify derives TLS verification from configuration, CA bundle paths, and service-account certificates. |
| Protocol-Specific Collection Handlers | src/evalhub/adapter/collector.py (390–436, 523–533) | _collect_openai builds chat-completions request bodies with system prompts; _collect_generic_http performs template substitution and delegates to request execution. Protocol-specific extractors (_extract_openai, _extract_generic) parse responses. |
| Request Execution and Response Processing | src/evalhub/adapter/collector.py (379–492) | _collect_one dispatches by protocol. _send_request handles exponential backoff retries, latency measurement, redirect blocking, JSON validation, response extraction, and error aggregation. Response records are flattened to JSONL with optional extra-field extraction. |
| Main Collection Orchestration | src/evalhub/adapter/collector.py (191–286) | collect_responses orchestrates the full flow: resolves credentials, loads questions, owns/manages HTTP client, iterates collection with fail-fast semantics, writes responses.jsonl and manifest.json, and reports progress. collect_responses_from_parameters wraps configuration construction. |
| Comprehensive Test Suite | tests/unit/test_collector.py | Tests cover configuration validation, auth resolution, question loading (CSV/JSON/JSONL), utility functions (path extraction, template substitution), OpenAI collection with manifest output and retry behavior, generic HTTP collection with request templating and extra-field extraction, redirect/error handling, progress callbacks, and fail-fast semantics. |
| User Documentation | README.md (532–592) | Documents the experimental live response collection feature with example configurations for OpenAI-compatible and generic HTTP endpoints, parameter structures, credential resolution, and TLS auto-detection. |
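The retry behaviour summarized for request execution (exponential backoff plus latency measurement) can be sketched generically. This is a simplified illustration under stated assumptions, not the PR's `_send_request`: the helper name `call_with_backoff`, the default delays, and the choice to report only the latency of the successful attempt are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    send: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[T, float]:
    """Call send() until it succeeds; return (result, latency in ms).

    The delay doubles after each failed attempt: base, 2*base, 4*base, ...
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        started = time.perf_counter()
        try:
            result = send()
        except Exception as exc:  # a real collector would catch only transport errors
            last_error = exc
            if attempt < max_retries:
                sleep(base_delay * (2 ** attempt))
            continue
        return result, (time.perf_counter() - started) * 1000.0
    raise RuntimeError(f"request failed after {max_retries + 1} attempts") from last_error


attempts = {"count": 0}


def flaky() -> dict:
    """Fail twice, then succeed, to exercise the retry path."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"answer": "ok"}


delays: list[float] = []
result, latency_ms = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # {'answer': 'ok'} [0.5, 1.0]
```

Injecting `sleep` keeps the example fast and makes the backoff schedule observable without actually waiting.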

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

This is a substantial new module introducing multiple interacting components: configuration validation, protocol handlers for two endpoint types, retry logic with exponential backoff, authentication/TLS infrastructure, multi-format question loading, and response flattening. The logic density is moderate but spread across many functions with distinct responsibilities. Comprehensive test coverage validates behavior across all major paths, reducing verification burden per reviewer.

Suggested labels

enhancement

Suggested reviewers

  • tarilabs

Poem

🐰 A question walks into an endpoint,
The collector gathers answers with care,
OpenAI or HTTP, it doesn't care—
Retries with backoff, manifests appear,
Responses flow to JSONL, crystal clear! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title concisely describes the main feature: adding an experimental live endpoint response collector to the adapter module. |
| Description check | ✅ Passed | The PR description comprehensively covers objectives, implementation details, testing, and breaking changes; it follows the template with proper sections and demonstrates thorough work. |
| Linked Issues check | ✅ Passed | The implementation fully addresses issue #135 by providing SDK-level capability for collecting responses from live chatbot endpoints with configurable templates, auth integration, and failure handling. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: README documentation of the feature, new collector module with implementation, public API exports, and comprehensive unit tests directly support the #135 objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/evalhub/adapter/collector.py (1)

247-251: 💤 Low value

Consider writing a partial manifest on fail_fast failure.

When fail_fast=True and collection fails, the exception propagates before the manifest is written (lines 261-276). This leaves responses.jsonl with partial data but no manifest.json, making it harder for operators to inspect the failure state.

Consider either:

  1. Writing a partial manifest with failed=1 before raising, or
  2. Adding a clear log message with progress info before raising.

Also applies to: 261-276
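A minimal sketch of the first option, assuming a simplified collection loop: the manifest is written in a `finally` block, so it is persisted whether the loop completes or a fail-fast error propagates. `CollectionFailed`, `collect_with_partial_manifest`, and the manifest field names are hypothetical stand-ins, not the PR's `CollectorError` or `CollectionManifest`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


class CollectionFailed(Exception):
    """Hypothetical stand-in for the PR's CollectorError."""


def collect_with_partial_manifest(
    questions: Iterable[str],
    ask: Callable[[str], str],
    output_dir: Path,
    fail_fast: bool = True,
) -> dict:
    """Collect answers and always write manifest.json, even on early failure."""
    questions = list(questions)
    manifest = {"total": len(questions), "succeeded": 0, "failed": 0, "partial": False}
    with (output_dir / "responses.jsonl").open("w", encoding="utf-8") as out:
        try:
            for question in questions:
                try:
                    answer = ask(question)
                except Exception as exc:
                    manifest["failed"] += 1
                    if fail_fast:
                        manifest["partial"] = True  # flag the run as incomplete
                        raise CollectionFailed(f"failed on {question!r}") from exc
                    continue
                manifest["succeeded"] += 1
                out.write(json.dumps({"question": question, "response": answer}) + "\n")
        finally:
            # Runs on success and on failure, so operators can inspect progress.
            (output_dir / "manifest.json").write_text(
                json.dumps(manifest, indent=2), encoding="utf-8"
            )
    return manifest


def flaky_ask(question: str) -> str:
    """Toy endpoint that fails for any question containing 'fail'."""
    if "fail" in question:
        raise ConnectionError("endpoint unavailable")
    return f"answer to {question}"


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    try:
        collect_with_partial_manifest(["q1", "please fail", "q3"], flaky_ask, out_dir)
    except CollectionFailed as exc:
        print("stopped early:", exc)
    print((out_dir / "manifest.json").read_text(encoding="utf-8"))
```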

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/evalhub/adapter/collector.py` around lines 247 - 251, When
config.fail_fast is true and you raise CollectorError in the collection loop,
write a partial manifest before raising so operators can see progress; update
the code path that currently raises CollectorError (the block referencing
config.fail_fast and CollectorError) to first build a manifest object containing
progress fields (total attempted, succeeded, failed, and a failed=1 or
partial=true flag) and call the existing manifest-writer used elsewhere in this
module (look for the manifest writing code near lines 261-276 or functions named
write_manifest/_write_manifest) to persist that manifest, or at minimum emit a
concise log with those progress numbers, then re-raise the CollectorError.
Ensure the same change is applied to the other fail_fast sites mentioned (around
lines 261-276).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/evalhub/adapter/collector.py`:
- Around line 334-342: substitute_template currently calls format_map() on
strings which raises a bare KeyError for missing placeholders; wrap the string
formatting in a try/except that catches KeyError from
template.format_map(variables) and re-raise a clearer error (ValueError or
KeyError) that includes the missing placeholder name and the template string
(and chain the original exception) so callers can see which template/value
failed; update the substitute_template function to perform this localized catch
for the str branch while leaving dict/list recursion as-is.
- Around line 494-508: _flatten_record currently overwrites keys from
record.source_fields when they collide with reserved output names (response,
raw_response, error, latency_ms) or with record.extra_fields; update
_flatten_record to detect collisions before merging: compute the intersection
between record.source_fields.keys() and the reserved set
{"response","raw_response","error","latency_ms"} and also between source_fields
and record.extra_fields.keys(), and emit a warning (use the module/logger used
elsewhere in this file) listing the colliding keys and the record identifier if
available; after logging, proceed with the existing merge behavior so output
remains unchanged but collisions are surfaced for debugging.
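The collision check described in the second comment could look roughly like the sketch below. The reserved field names come from the review text; the function names, the flat-dict record shape, and the merge order (source fields first, then extra fields, then collector output fields) are assumptions for illustration.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger("collector_sketch")

# Reserved output names, as listed in the review comment.
RESERVED_FIELDS = frozenset({"response", "raw_response", "error", "latency_ms"})


def find_collisions(
    source_fields: Mapping[str, Any], extra_fields: Mapping[str, Any]
) -> list[str]:
    """Return source keys that clash with reserved or extra-field names."""
    return sorted(set(source_fields) & (RESERVED_FIELDS | set(extra_fields)))


def flatten_record_sketch(
    source_fields: Mapping[str, Any],
    output_fields: Mapping[str, Any],
    extra_fields: Mapping[str, Any],
    record_id: Optional[str] = None,
) -> dict[str, Any]:
    """Warn about colliding keys, then merge exactly as before."""
    collisions = find_collisions(source_fields, extra_fields)
    if collisions:
        logger.warning(
            "Record %s: source fields %s collide with collector output fields",
            record_id,
            collisions,
        )
    # Assumed merge order: later mappings win, so source values are overwritten.
    return {**source_fields, **extra_fields, **output_fields}


row = flatten_record_sketch(
    source_fields={"id": "q1", "response": "old"},
    output_fields={"response": "new", "latency_ms": 12.5},
    extra_fields={},
    record_id="q1",
)
print(row)  # {'id': 'q1', 'response': 'new', 'latency_ms': 12.5}
```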

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47fdfff3-877a-4fa8-8c65-bd8911550c89

📥 Commits

Reviewing files that changed from the base of the PR and between 162fa69 and a40f551.

📒 Files selected for processing (4)
  • README.md
  • src/evalhub/adapter/__init__.py
  • src/evalhub/adapter/collector.py
  • tests/unit/test_collector.py

…rnings

Address CodeRabbit review findings:
- Wrap template format_map() to catch KeyError and provide a clear message
  listing available placeholders
- Log a warning when input source fields collide with reserved collector
  output field names (response, raw_response, error, latency_ms)
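The first fix can be sketched as follows. This is an illustrative reconstruction rather than the committed code: the name `substitute_template_sketch` and the exact error wording are assumptions, while the use of `str.format_map` and the recursion over dicts and lists follow the review comment.

```python
from typing import Any, Mapping


def substitute_template_sketch(template: Any, variables: Mapping[str, Any]) -> Any:
    """Recursively fill {placeholder} fields in strings nested in dicts/lists."""
    if isinstance(template, str):
        try:
            return template.format_map(variables)
        except KeyError as exc:
            available = ", ".join(sorted(variables)) or "<none>"
            raise ValueError(
                f"Unknown placeholder {exc.args[0]!r} in template {template!r}; "
                f"available placeholders: {available}"
            ) from exc
    if isinstance(template, dict):
        return {
            key: substitute_template_sketch(value, variables)
            for key, value in template.items()
        }
    if isinstance(template, list):
        return [substitute_template_sketch(item, variables) for item in template]
    # Numbers, booleans, and None pass through unchanged.
    return template


request_template = {
    "input": {"text": "{question}"},
    "stream": False,
    "tags": ["q:{question}"],
}
print(substitute_template_sketch(request_template, {"question": "What is RAG?"}))

try:
    substitute_template_sketch("Hello {user}", {"question": "x"})
except ValueError as err:
    print(err)
```

One caveat of any `format_map`-based approach: literal braces inside a template string must be escaped as `{{` and `}}`, otherwise they are parsed as placeholders.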

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

feat: add live endpoint response collection capability to the adapter SDK

1 participant