feat: add experimental live collection helper by ai-hustle-bro · Pull Request #139 · eval-hub/eval-hub-sdk

ai-hustle-bro · 2026-06-06T02:52:07Z

What and why

Closes #135

Adds an experimental adapter-side live response collection helper that adapters can call during JobPhase.LOADING_DATA while the longer-term API shape is still being decided.

The helper:

reads questions from CSV, JSON, or JSONL inputs
calls an OpenAI-compatible /v1/chat/completions endpoint
writes responses.jsonl plus manifest.json for evaluation framework loaders
records per-row request failures instead of aborting the whole collection run
keeps httpx optional via lazy import and supports dependency injection for tests
documents the trusted-submitter boundary for endpoint URLs, request headers, and api_key_env

Type

Testing

Tests added or updated
Tested manually

Commands run:

uv run ruff check src/evalhub/adapter/live_collection.py src/evalhub/adapter/__init__.py tests/unit/test_live_collection.py
uv run mypy src/evalhub/adapter/live_collection.py tests/unit/test_live_collection.py
uv run pytest tests/unit/test_live_collection.py -q
uv run pytest tests/unit/test_adapter_models.py tests/unit/test_live_collection.py -q

I also ran the full unit suite after uv sync --extra cli --extra mcp --group dev; it reached 492 passed with 2 failures in tests/unit/test_cli_config.py that assert POSIX owner-only permission bits on Windows.

Breaking changes

None.

Summary by CodeRabbit

New Features
- Added an experimental live response collection adapter to capture chatbot/RAG responses during data-loading jobs; produces per-item response records and a manifest summary.
Documentation
- Added README section with usage examples, configuration schema, behavior notes (no redirect following, per-row failure recording, capped retry backoff), and output file descriptions (responses.jsonl, manifest.json).
Public API
- Exposed live-collection helpers and models as part of the adapter package.
Tests
- Added unit tests for input formats, retry/error handling, outputs, and config validation.

coderabbitai · 2026-06-06T02:52:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ac0dddfd-75b2-4026-bb38-94a08075f76d

📥 Commits

Reviewing files that changed from the base of the PR and between 89be251 and 04c5563.

📒 Files selected for processing (1)

tests/unit/test_live_collection.py

🚧 Files skipped from review as they are similar to previous changes (1)

tests/unit/test_live_collection.py

📝 Walkthrough

Walkthrough

Adds an experimental live-response collection module: Pydantic config/models, CSV/JSON/JSONL question loading, OpenAI-compatible chat-completions HTTP orchestration with retry/backoff and non-redirect enforcement, JSONL/manifest outputs, package re-exports, README docs, and unit tests.

Changes

Live Response Collection Feature

Layer / File(s)	Summary
Configuration and Data Models `src/evalhub/adapter/live_collection.py`	`LiveCollectionConfig`, `LiveQuestion`, `LiveCollectionRecord`, and `LiveCollectionManifest` Pydantic models with URL validation and `from_parameters`.
Question Loading and Parsing `src/evalhub/adapter/live_collection.py`	`load_live_questions` dispatches CSV/JSON/JSONL parsing, normalizes IDs (1-based fallback), skips empty questions, trims values, and extracts metadata from CSV columns or JSON `metadata`.
Collection Orchestration and HTTP Interaction `src/evalhub/adapter/live_collection.py`	`collect_openai_chat_completions` loads questions, builds headers (env-sourced bearer token optional), lazily creates an `httpx` client with `follow_redirects=False`, posts per-question chat-completions with `max_retries`/exponential backoff, rejects 3xx redirects, extracts assistant content (string or parts list), writes `responses.jsonl` rows (content or `error`), and writes `manifest.json`. Helper functions build payloads and extract message content.
Package Exports `src/evalhub/adapter/__init__.py`	Re-exports `LiveCollectionConfig`, `LiveCollectionManifest`, `LiveCollectionRecord`, `LiveQuestion`, `collect_live_responses_from_parameters`, `collect_openai_chat_completions`, and `load_live_questions` in `__all__`.
Usage Documentation `README.md`	Experimental section documents `live_collection` parameters, Python usage example, expected `parameters["live_collection"]` shape, behavior for redirects and per-row failures, retry/backoff behavior, and output files `responses.jsonl`/`manifest.json`.
Test Coverage `tests/unit/test_live_collection.py`	Tests include `StubClient`/`StubResponse`, validate CSV/JSONL loading with metadata and blanks handling, successful collection writing JSONL and manifest, canonical request behavior, redirect (302) handling recorded as per-row `error`, retry/backoff behavior, auth env var validation, non-HTTP URL rejection, and parameters-based wrapper.

Sequence Diagram

sequenceDiagram
  participant Adapter
  participant Collector as collect_openai_chat_completions
  participant FileSystem
  participant ChatbotEndpoint

  Adapter->>Collector: call with LiveCollectionConfig
  Collector->>FileSystem: load questions from CSV/JSON/JSONL
  FileSystem-->>Collector: List[LiveQuestion]
  Collector->>Collector: build headers (env API key + configured headers)

  loop for each question
    Collector->>ChatbotEndpoint: POST /v1/chat/completions (model + messages + extra_body)
    alt success (2xx)
      ChatbotEndpoint-->>Collector: JSON response with choices[0].message.content
      Collector->>Collector: extract assistant text
    else non-2xx or exception
      ChatbotEndpoint-->>Collector: error status or exception
      Collector->>Collector: retry with exponential backoff (until max_retries) or record error
    end
    Collector->>FileSystem: append LiveCollectionRecord to responses.jsonl
  end

  Collector->>FileSystem: write manifest.json with totals
  Collector-->>Adapter: return LiveCollectionManifest

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

🐰 I hopped through CSV and JSONL clear,
I poked the endpoint, far and near,
Retries and backoffs kept me steady,
I logged each answer, blank or heady,
Responses saved — a tidy trail, hooray!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: adding an experimental live collection helper to the adapter. It accurately summarizes the primary contribution without being vague or off-topic.
Description check	✅ Passed	The description follows the template with all required sections completed: "What and why" (with issue reference), Type checkboxes marked, Testing section with commands, and Breaking changes noted.
Linked Issues check	✅ Passed	All coding requirements from issue `#135` are met: live response collection from CSV/JSON/JSONL inputs, OpenAI-compatible endpoint querying, responses.jsonl and manifest.json output, per-row error recording without aborting, httpx optional with dependency injection, and security boundary documentation.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with issue `#135` objectives. README documentation, public API exports, live collection implementation, and comprehensive unit tests all support the stated goal of adding an experimental live collection helper.
Docstring Coverage	✅ Passed	Docstring coverage is 96.55% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

src/evalhub/adapter/live_collection.py (1)

343-380: 💤 Low value

Consider adding backoff delay between retries.

The retry loop has no delay between attempts, which could be aggressive for transient server issues (e.g., rate limiting). For an experimental feature with max_retries=0 default, this is acceptable, but a short backoff would improve reliability.

💡 Optional: Add exponential backoff

+import time
+
 def _collect_one_question(
     config: LiveCollectionConfig,
     client: Any,
     headers: Mapping[str, str],
     question: LiveQuestion,
 ) -> LiveCollectionRecord:
     last_error: str | None = None
     for _attempt in range(config.max_retries + 1):
         try:
             response = client.post(
                 ...
             )
             ...
         except Exception as exc:
             last_error = f"{type(exc).__name__}: {exc}"
+            if _attempt < config.max_retries:
+                time.sleep(min(2 ** _attempt, 10))  # Cap at 10 seconds

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/evalhub/adapter/live_collection.py` around lines 343 - 380, The retry
loop in _collect_one_question is missing any delay between attempts; add a short
backoff (preferably exponential with jitter) between retries to avoid aggressive
hammering on transient failures: inside _collect_one_question, after catching
the exception and before the next retry, compute a backoff delay using the retry
attempt index (e.g., base_delay * 2**_attempt capped by a max_backoff and add
small random jitter via random.uniform) and call time.sleep(delay) (or
asyncio.sleep if this code becomes async), ensuring you do not sleep after the
final attempt; reference config.max_retries and the _attempt loop when
implementing.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/evalhub/adapter/live_collection.py`:
- Around line 343-380: The retry loop in _collect_one_question is missing any
delay between attempts; add a short backoff (preferably exponential with jitter)
between retries to avoid aggressive hammering on transient failures: inside
_collect_one_question, after catching the exception and before the next retry,
compute a backoff delay using the retry attempt index (e.g., base_delay *
2**_attempt capped by a max_backoff and add small random jitter via
random.uniform) and call time.sleep(delay) (or asyncio.sleep if this code
becomes async), ensuring you do not sleep after the final attempt; reference
config.max_retries and the _attempt loop when implementing.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66f8c691-c660-4a2b-b73e-81e07f281855

📥 Commits

Reviewing files that changed from the base of the PR and between 3ca8447 and 3a67f53.

📒 Files selected for processing (4)

README.md
src/evalhub/adapter/__init__.py
src/evalhub/adapter/live_collection.py
tests/unit/test_live_collection.py

ai-hustle-bro · 2026-06-06T03:09:21Z

Addressed CodeRabbit's retry backoff nitpick in beebeec.

Changes:

added retry_backoff_seconds and max_retry_backoff_seconds to LiveCollectionConfig
sleep only between failed row attempts, never after the final attempt
documented the capped exponential backoff behavior
added a unit assertion that monkeypatches time.sleep and verifies the retry delay

Validation:

uv run ruff check src/evalhub/adapter/live_collection.py src/evalhub/adapter/__init__.py tests/unit/test_live_collection.py
uv run mypy src/evalhub/adapter/live_collection.py tests/unit/test_live_collection.py
uv run pytest tests/unit/test_live_collection.py -q
uv run pytest tests/unit/test_adapter_models.py tests/unit/test_live_collection.py -q

ai-hustle-bro · 2026-06-06T03:40:43Z

Follow-up on the CodeRabbit docstring coverage warning in e0d9ea5.

Changes:

added docstrings for the live collection helper functions and endpoint URL validator
kept the change documentation-only; no behavior changes
local AST check now reports functions=13 documented=13 coverage=100.00% for src/evalhub/adapter/live_collection.py

Validation:

uv run ruff format --check src/evalhub/adapter/live_collection.py tests/unit/test_live_collection.py
uv run ruff check src/evalhub/adapter/live_collection.py src/evalhub/adapter/__init__.py tests/unit/test_live_collection.py
uv run mypy src/evalhub/adapter/live_collection.py tests/unit/test_live_collection.py
uv run pytest tests/unit/test_live_collection.py -q
uv run pytest tests/unit/test_adapter_models.py tests/unit/test_live_collection.py -q
Claude Code CLI review: PASS

ai-hustle-bro · 2026-06-06T04:06:08Z

@coderabbitai review

coderabbitai · 2026-06-06T04:06:14Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/evalhub/adapter/live_collection.py (2)

417-421: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent extra_body from overriding the canonical request fields.

body.update(config.extra_body) runs after model and messages are set, so a caller can silently replace the configured model or the loaded prompt payload. That breaks the contract between the on-wire request and the persisted manifest/row metadata. Reserve those keys or merge extra_body first and write the required fields last.

Proposed fix

-    body: dict[str, Any] = {
-        "model": config.model,
-        "messages": messages,
-    }
-    body.update(config.extra_body)
+    forbidden_keys = {"model", "messages"}
+    overlapping_keys = forbidden_keys & config.extra_body.keys()
+    if overlapping_keys:
+        raise ValueError(
+            "extra_body cannot override reserved chat-completions fields: "
+            + ", ".join(sorted(overlapping_keys))
+        )
+
+    body: dict[str, Any] = {
+        **config.extra_body,
+        "model": config.model,
+        "messages": messages,
+    }
     return body

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/evalhub/adapter/live_collection.py` around lines 417 - 421, Prevent
config.extra_body from overriding canonical fields by merging it before setting
required keys or by filtering out reserved keys; specifically, when building the
request body in the block that creates body and uses config.extra_body, either
apply body.update(config.extra_body) first and then set body["model"] =
config.model and body["messages"] = messages, or sanitize config.extra_body to
remove reserved keys ("model", "messages", any other canonical keys) before
calling body.update(config.extra_body) so that the values in config.model and
messages (from the manifest/row) always win.

327-345: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize blank JSON/JSONL rows the same way as CSV.

A JSON/JSONL row with "question": "" reaches this helper and raises, which aborts the entire load before any collection happens. _load_csv_questions() skips the same case and also trims whitespace-only IDs, so identical data behaves differently depending on file format. Treat blank question strings as skippable rows and strip raw_id before deciding whether to fall back to row_index.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/evalhub/adapter/live_collection.py` around lines 327 - 345, The helper
currently raises on empty or whitespace-only JSON/JSONL questions; change it to
treat blank question strings as skippable (same behavior as _load_csv_questions)
by returning a sentinel (e.g., None) or otherwise skipping the row instead of
raising when raw_question is a string that is empty after raw_question.strip();
also normalize the id logic by trimming raw_id before deciding the fallback
(compute question_id from str(raw_id).strip() and if that is empty use
str(row_index)); keep the metadata assembly and LiveQuestion(...) construction
unchanged for non-skippable rows so callers receive a LiveQuestion with a
trimmed question and normalized question_id.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/evalhub/adapter/live_collection.py`:
- Around line 417-421: Prevent config.extra_body from overriding canonical
fields by merging it before setting required keys or by filtering out reserved
keys; specifically, when building the request body in the block that creates
body and uses config.extra_body, either apply body.update(config.extra_body)
first and then set body["model"] = config.model and body["messages"] = messages,
or sanitize config.extra_body to remove reserved keys ("model", "messages", any
other canonical keys) before calling body.update(config.extra_body) so that the
values in config.model and messages (from the manifest/row) always win.
- Around line 327-345: The helper currently raises on empty or whitespace-only
JSON/JSONL questions; change it to treat blank question strings as skippable
(same behavior as _load_csv_questions) by returning a sentinel (e.g., None) or
otherwise skipping the row instead of raising when raw_question is a string that
is empty after raw_question.strip(); also normalize the id logic by trimming
raw_id before deciding the fallback (compute question_id from
str(raw_id).strip() and if that is empty use str(row_index)); keep the metadata
assembly and LiveQuestion(...) construction unchanged for non-skippable rows so
callers receive a LiveQuestion with a trimmed question and normalized
question_id.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df5f002d-1319-47a4-a86c-43bf6e60a86f

📥 Commits

Reviewing files that changed from the base of the PR and between 3a67f53 and e0d9ea5.

📒 Files selected for processing (3)

README.md
src/evalhub/adapter/live_collection.py
tests/unit/test_live_collection.py

🚧 Files skipped from review as they are similar to previous changes (2)

README.md
tests/unit/test_live_collection.py

ai-hustle-bro · 2026-06-06T07:09:05Z

Pushed 89be251 to address the two live review findings:

extra_body no longer overrides canonical model / messages request fields.
JSON/JSONL question loading now skips blank questions and normalizes blank IDs consistently with CSV.

Local validation on the pushed commit:

.\.venv\Scripts\ruff.exe check src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\mypy.exe src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\pytest.exe tests\unit\test_live_collection.py -q

Result: ruff passed, mypy passed, and 11 passed.

ai-hustle-bro · 2026-06-07T16:09:30Z

Pushed 04c5563 as a tiny follow-up for the CodeRabbit docstring-coverage warning.

What changed:

Added short docstrings to the live collection test stubs and test cases only.
No production logic changed.

Validation on the pushed branch:

.\.venv\Scripts\ruff.exe check src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\mypy.exe src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\pytest.exe tests\unit\test_live_collection.py -q
git diff --check

Result: ruff passed, mypy passed, 11 passed, and git diff --check passed.

Claude Code CLI review of this docstring-only diff returned READY.

feat: add experimental live collection helper

3a67f53

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

fix: add backoff to live collection retries

beebeec

docs: add live collection helper docstrings

e0d9ea5

coderabbitai Bot reviewed Jun 6, 2026

View reviewed changes

fix: address live collection review gaps

89be251

test: add live collection test docstrings

04c5563

ai-hustle-bro mentioned this pull request Jun 7, 2026

Add experimental live endpoint collection helper #137

Closed

ruivieira added the kind/feat Categorizes issue as a feature request label Jun 8, 2026

ruivieira added this to EvalHub Planning Jun 8, 2026

github-project-automation Bot moved this to Todo in EvalHub Planning Jun 8, 2026

ruivieira moved this from Todo to In Progress in EvalHub Planning Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add experimental live collection helper#139

feat: add experimental live collection helper#139
ai-hustle-bro wants to merge 5 commits into
eval-hub:mainfrom
ai-hustle-bro:codex/live-collection-helper

ai-hustle-bro commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ai-hustle-bro commented Jun 6, 2026

Uh oh!

ai-hustle-bro commented Jun 6, 2026

Uh oh!

ai-hustle-bro commented Jun 6, 2026

Uh oh!

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ai-hustle-bro commented Jun 6, 2026

Uh oh!

ai-hustle-bro commented Jun 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

ai-hustle-bro commented Jun 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What and why

Type

Testing

Breaking changes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ai-hustle-bro commented Jun 6, 2026

Uh oh!

ai-hustle-bro commented Jun 6, 2026

Uh oh!

ai-hustle-bro commented Jun 6, 2026

Uh oh!

coderabbitai Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ai-hustle-bro commented Jun 6, 2026

Uh oh!

ai-hustle-bro commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ai-hustle-bro commented Jun 6, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

coderabbitai Bot commented Jun 6, 2026 •

edited

Loading

ai-hustle-bro commented Jun 7, 2026 •

edited

Loading