Skip to content

feat: add experimental live collection helper#139

Open
ai-hustle-bro wants to merge 5 commits into
eval-hub:mainfrom
ai-hustle-bro:codex/live-collection-helper
Open

feat: add experimental live collection helper#139
ai-hustle-bro wants to merge 5 commits into
eval-hub:mainfrom
ai-hustle-bro:codex/live-collection-helper

Conversation

@ai-hustle-bro

@ai-hustle-bro ai-hustle-bro commented Jun 6, 2026

Copy link
Copy Markdown

What and why

Closes #135

Adds an experimental adapter-side live response collection helper that adapters can call during JobPhase.LOADING_DATA while the longer-term API shape is still being decided.

The helper:

  • reads questions from CSV, JSON, or JSONL inputs
  • calls an OpenAI-compatible /v1/chat/completions endpoint
  • writes responses.jsonl plus manifest.json for evaluation framework loaders
  • records per-row request failures instead of aborting the whole collection run
  • keeps httpx optional via lazy import and supports dependency injection for tests
  • documents the trusted-submitter boundary for endpoint URLs, request headers, and api_key_env

Type

  • feat
  • fix
  • docs
  • refactor / chore
  • test / ci

Testing

  • Tests added or updated
  • Tested manually

Commands run:

uv run ruff check src/evalhub/adapter/live_collection.py src/evalhub/adapter/__init__.py tests/unit/test_live_collection.py
uv run mypy src/evalhub/adapter/live_collection.py tests/unit/test_live_collection.py
uv run pytest tests/unit/test_live_collection.py -q
uv run pytest tests/unit/test_adapter_models.py tests/unit/test_live_collection.py -q

I also ran the full unit suite after uv sync --extra cli --extra mcp --group dev; it reached 492 passed with 2 failures in tests/unit/test_cli_config.py that assert POSIX owner-only permission bits on Windows.

Breaking changes

None.

Summary by CodeRabbit

  • New Features

    • Added an experimental live response collection adapter to capture chatbot/RAG responses during data-loading jobs; produces per-item response records and a manifest summary.
  • Documentation

    • Added README section with usage examples, configuration schema, behavior notes (no redirect following, per-row failure recording, capped retry backoff), and output file descriptions (responses.jsonl, manifest.json).
  • Public API

    • Exposed live-collection helpers and models as part of the adapter package.
  • Tests

    • Added unit tests for input formats, retry/error handling, outputs, and config validation.

@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ac0dddfd-75b2-4026-bb38-94a08075f76d

📥 Commits

Reviewing files that changed from the base of the PR and between 89be251 and 04c5563.

📒 Files selected for processing (1)
  • tests/unit/test_live_collection.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_live_collection.py

📝 Walkthrough

Walkthrough

Adds an experimental live-response collection module: Pydantic config/models, CSV/JSON/JSONL question loading, OpenAI-compatible chat-completions HTTP orchestration with retry/backoff and non-redirect enforcement, JSONL/manifest outputs, package re-exports, README docs, and unit tests.

Changes

Live Response Collection Feature

Layer / File(s) Summary
Configuration and Data Models
src/evalhub/adapter/live_collection.py
LiveCollectionConfig, LiveQuestion, LiveCollectionRecord, and LiveCollectionManifest Pydantic models with URL validation and from_parameters.
Question Loading and Parsing
src/evalhub/adapter/live_collection.py
load_live_questions dispatches CSV/JSON/JSONL parsing, normalizes IDs (1-based fallback), skips empty questions, trims values, and extracts metadata from CSV columns or JSON metadata.
Collection Orchestration and HTTP Interaction
src/evalhub/adapter/live_collection.py
collect_openai_chat_completions loads questions, builds headers (env-sourced bearer token optional), lazily creates an httpx client with follow_redirects=False, posts per-question chat-completions with max_retries/exponential backoff, rejects 3xx redirects, extracts assistant content (string or parts list), writes responses.jsonl rows (content or error), and writes manifest.json. Helper functions build payloads and extract message content.
Package Exports
src/evalhub/adapter/__init__.py
Re-exports LiveCollectionConfig, LiveCollectionManifest, LiveCollectionRecord, LiveQuestion, collect_live_responses_from_parameters, collect_openai_chat_completions, and load_live_questions in __all__.
Usage Documentation
README.md
Experimental section documents live_collection parameters, Python usage example, expected parameters["live_collection"] shape, behavior for redirects and per-row failures, retry/backoff behavior, and output files responses.jsonl/manifest.json.
Test Coverage
tests/unit/test_live_collection.py
Tests include StubClient/StubResponse, validate CSV/JSONL loading with metadata and blanks handling, successful collection writing JSONL and manifest, canonical request behavior, redirect (302) handling recorded as per-row error, retry/backoff behavior, auth env var validation, non-HTTP URL rejection, and parameters-based wrapper.

Sequence Diagram

sequenceDiagram
  participant Adapter
  participant Collector as collect_openai_chat_completions
  participant FileSystem
  participant ChatbotEndpoint

  Adapter->>Collector: call with LiveCollectionConfig
  Collector->>FileSystem: load questions from CSV/JSON/JSONL
  FileSystem-->>Collector: List[LiveQuestion]
  Collector->>Collector: build headers (env API key + configured headers)

  loop for each question
    Collector->>ChatbotEndpoint: POST /v1/chat/completions (model + messages + extra_body)
    alt success (2xx)
      ChatbotEndpoint-->>Collector: JSON response with choices[0].message.content
      Collector->>Collector: extract assistant text
    else non-2xx or exception
      ChatbotEndpoint-->>Collector: error status or exception
      Collector->>Collector: retry with exponential backoff (until max_retries) or record error
    end
    Collector->>FileSystem: append LiveCollectionRecord to responses.jsonl
  end

  Collector->>FileSystem: write manifest.json with totals
  Collector-->>Adapter: return LiveCollectionManifest
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

🐰 I hopped through CSV and JSONL clear,
I poked the endpoint, far and near,
Retries and backoffs kept me steady,
I logged each answer, blank or heady,
Responses saved — a tidy trail, hooray!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding an experimental live collection helper to the adapter. It accurately summarizes the primary contribution without being vague or off-topic.
Description check ✅ Passed The description follows the template with all required sections completed: "What and why" (with issue reference), Type checkboxes marked, Testing section with commands, and Breaking changes noted.
Linked Issues check ✅ Passed All coding requirements from issue #135 are met: live response collection from CSV/JSON/JSONL inputs, OpenAI-compatible endpoint querying, responses.jsonl and manifest.json output, per-row error recording without aborting, httpx optional with dependency injection, and security boundary documentation.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #135 objectives. README documentation, public API exports, live collection implementation, and comprehensive unit tests all support the stated goal of adding an experimental live collection helper.
Docstring Coverage ✅ Passed Docstring coverage is 96.55% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/evalhub/adapter/live_collection.py (1)

343-380: 💤 Low value

Consider adding backoff delay between retries.

The retry loop has no delay between attempts, which could be aggressive for transient server issues (e.g., rate limiting). For an experimental feature with max_retries=0 default, this is acceptable, but a short backoff would improve reliability.

💡 Optional: Add exponential backoff
+import time
+
 def _collect_one_question(
     config: LiveCollectionConfig,
     client: Any,
     headers: Mapping[str, str],
     question: LiveQuestion,
 ) -> LiveCollectionRecord:
     last_error: str | None = None
     for _attempt in range(config.max_retries + 1):
         try:
             response = client.post(
                 ...
             )
             ...
         except Exception as exc:
             last_error = f"{type(exc).__name__}: {exc}"
+            if _attempt < config.max_retries:
+                time.sleep(min(2 ** _attempt, 10))  # Cap at 10 seconds
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/evalhub/adapter/live_collection.py` around lines 343 - 380, The retry
loop in _collect_one_question is missing any delay between attempts; add a short
backoff (preferably exponential with jitter) between retries to avoid aggressive
hammering on transient failures: inside _collect_one_question, after catching
the exception and before the next retry, compute a backoff delay using the retry
attempt index (e.g., base_delay * 2**_attempt capped by a max_backoff and add
small random jitter via random.uniform) and call time.sleep(delay) (or
asyncio.sleep if this code becomes async), ensuring you do not sleep after the
final attempt; reference config.max_retries and the _attempt loop when
implementing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/evalhub/adapter/live_collection.py`:
- Around line 343-380: The retry loop in _collect_one_question is missing any
delay between attempts; add a short backoff (preferably exponential with jitter)
between retries to avoid aggressive hammering on transient failures: inside
_collect_one_question, after catching the exception and before the next retry,
compute a backoff delay using the retry attempt index (e.g., base_delay *
2**_attempt capped by a max_backoff and add small random jitter via
random.uniform) and call time.sleep(delay) (or asyncio.sleep if this code
becomes async), ensuring you do not sleep after the final attempt; reference
config.max_retries and the _attempt loop when implementing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66f8c691-c660-4a2b-b73e-81e07f281855

📥 Commits

Reviewing files that changed from the base of the PR and between 3ca8447 and 3a67f53.

📒 Files selected for processing (4)
  • README.md
  • src/evalhub/adapter/__init__.py
  • src/evalhub/adapter/live_collection.py
  • tests/unit/test_live_collection.py

@ai-hustle-bro

Copy link
Copy Markdown
Author

Addressed CodeRabbit's retry backoff nitpick in beebeec.

Changes:

  • added retry_backoff_seconds and max_retry_backoff_seconds to LiveCollectionConfig
  • sleep only between failed row attempts, never after the final attempt
  • documented the capped exponential backoff behavior
  • added a unit assertion that monkeypatches time.sleep and verifies the retry delay

Validation:

  • uv run ruff check src/evalhub/adapter/live_collection.py src/evalhub/adapter/__init__.py tests/unit/test_live_collection.py
  • uv run mypy src/evalhub/adapter/live_collection.py tests/unit/test_live_collection.py
  • uv run pytest tests/unit/test_live_collection.py -q
  • uv run pytest tests/unit/test_adapter_models.py tests/unit/test_live_collection.py -q

@ai-hustle-bro

Copy link
Copy Markdown
Author

Follow-up on the CodeRabbit docstring coverage warning in e0d9ea5.

Changes:

  • added docstrings for the live collection helper functions and endpoint URL validator
  • kept the change documentation-only; no behavior changes
  • local AST check now reports functions=13 documented=13 coverage=100.00% for src/evalhub/adapter/live_collection.py

Validation:

  • uv run ruff format --check src/evalhub/adapter/live_collection.py tests/unit/test_live_collection.py
  • uv run ruff check src/evalhub/adapter/live_collection.py src/evalhub/adapter/__init__.py tests/unit/test_live_collection.py
  • uv run mypy src/evalhub/adapter/live_collection.py tests/unit/test_live_collection.py
  • uv run pytest tests/unit/test_live_collection.py -q
  • uv run pytest tests/unit/test_adapter_models.py tests/unit/test_live_collection.py -q
  • Claude Code CLI review: PASS

@ai-hustle-bro

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/evalhub/adapter/live_collection.py (2)

417-421: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent extra_body from overriding the canonical request fields.

body.update(config.extra_body) runs after model and messages are set, so a caller can silently replace the configured model or the loaded prompt payload. That breaks the contract between the on-wire request and the persisted manifest/row metadata. Reserve those keys or merge extra_body first and write the required fields last.

Proposed fix
-    body: dict[str, Any] = {
-        "model": config.model,
-        "messages": messages,
-    }
-    body.update(config.extra_body)
+    forbidden_keys = {"model", "messages"}
+    overlapping_keys = forbidden_keys & config.extra_body.keys()
+    if overlapping_keys:
+        raise ValueError(
+            "extra_body cannot override reserved chat-completions fields: "
+            + ", ".join(sorted(overlapping_keys))
+        )
+
+    body: dict[str, Any] = {
+        **config.extra_body,
+        "model": config.model,
+        "messages": messages,
+    }
     return body
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/evalhub/adapter/live_collection.py` around lines 417 - 421, Prevent
config.extra_body from overriding canonical fields by merging it before setting
required keys or by filtering out reserved keys; specifically, when building the
request body in the block that creates body and uses config.extra_body, either
apply body.update(config.extra_body) first and then set body["model"] =
config.model and body["messages"] = messages, or sanitize config.extra_body to
remove reserved keys ("model", "messages", any other canonical keys) before
calling body.update(config.extra_body) so that the values in config.model and
messages (from the manifest/row) always win.

327-345: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize blank JSON/JSONL rows the same way as CSV.

A JSON/JSONL row with "question": "" reaches this helper and raises, which aborts the entire load before any collection happens. _load_csv_questions() skips the same case and also trims whitespace-only IDs, so identical data behaves differently depending on file format. Treat blank question strings as skippable rows and strip raw_id before deciding whether to fall back to row_index.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/evalhub/adapter/live_collection.py` around lines 327 - 345, The helper
currently raises on empty or whitespace-only JSON/JSONL questions; change it to
treat blank question strings as skippable (same behavior as _load_csv_questions)
by returning a sentinel (e.g., None) or otherwise skipping the row instead of
raising when raw_question is a string that is empty after raw_question.strip();
also normalize the id logic by trimming raw_id before deciding the fallback
(compute question_id from str(raw_id).strip() and if that is empty use
str(row_index)); keep the metadata assembly and LiveQuestion(...) construction
unchanged for non-skippable rows so callers receive a LiveQuestion with a
trimmed question and normalized question_id.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/evalhub/adapter/live_collection.py`:
- Around line 417-421: Prevent config.extra_body from overriding canonical
fields by merging it before setting required keys or by filtering out reserved
keys; specifically, when building the request body in the block that creates
body and uses config.extra_body, either apply body.update(config.extra_body)
first and then set body["model"] = config.model and body["messages"] = messages,
or sanitize config.extra_body to remove reserved keys ("model", "messages", any
other canonical keys) before calling body.update(config.extra_body) so that the
values in config.model and messages (from the manifest/row) always win.
- Around line 327-345: The helper currently raises on empty or whitespace-only
JSON/JSONL questions; change it to treat blank question strings as skippable
(same behavior as _load_csv_questions) by returning a sentinel (e.g., None) or
otherwise skipping the row instead of raising when raw_question is a string that
is empty after raw_question.strip(); also normalize the id logic by trimming
raw_id before deciding the fallback (compute question_id from
str(raw_id).strip() and if that is empty use str(row_index)); keep the metadata
assembly and LiveQuestion(...) construction unchanged for non-skippable rows so
callers receive a LiveQuestion with a trimmed question and normalized
question_id.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df5f002d-1319-47a4-a86c-43bf6e60a86f

📥 Commits

Reviewing files that changed from the base of the PR and between 3a67f53 and e0d9ea5.

📒 Files selected for processing (3)
  • README.md
  • src/evalhub/adapter/live_collection.py
  • tests/unit/test_live_collection.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • README.md
  • tests/unit/test_live_collection.py

@ai-hustle-bro

Copy link
Copy Markdown
Author

Pushed 89be251 to address the two live review findings:

  • extra_body no longer overrides canonical model / messages request fields.
  • JSON/JSONL question loading now skips blank questions and normalizes blank IDs consistently with CSV.

Local validation on the pushed commit:

.\.venv\Scripts\ruff.exe check src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\mypy.exe src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\pytest.exe tests\unit\test_live_collection.py -q

Result: ruff passed, mypy passed, and 11 passed.

@ai-hustle-bro

ai-hustle-bro commented Jun 7, 2026

Copy link
Copy Markdown
Author

Pushed 04c5563 as a tiny follow-up for the CodeRabbit docstring-coverage warning.

What changed:

  • Added short docstrings to the live collection test stubs and test cases only.
  • No production logic changed.

Validation on the pushed branch:

.\.venv\Scripts\ruff.exe check src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\mypy.exe src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\pytest.exe tests\unit\test_live_collection.py -q
git diff --check

Result: ruff passed, mypy passed, 11 passed, and git diff --check passed.

Claude Code CLI review of this docstring-only diff returned READY.

@ruivieira ruivieira added the kind/feat Categorizes issue as a feature request label Jun 8, 2026
@ruivieira ruivieira moved this from Todo to In Progress in EvalHub Planning Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/feat Categorizes issue as a feature request

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

feat: add live endpoint response collection capability to the adapter SDK

3 participants