feat: add experimental live collection helper#139
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds an experimental live-response collection module: Pydantic config/models, CSV/JSON/JSONL question loading, OpenAI-compatible chat-completions HTTP orchestration with retry/backoff and non-redirect enforcement, JSONL/manifest outputs, package re-exports, README docs, and unit tests. ChangesLive Response Collection Feature
Sequence DiagramsequenceDiagram
participant Adapter
participant Collector as collect_openai_chat_completions
participant FileSystem
participant ChatbotEndpoint
Adapter->>Collector: call with LiveCollectionConfig
Collector->>FileSystem: load questions from CSV/JSON/JSONL
FileSystem-->>Collector: List[LiveQuestion]
Collector->>Collector: build headers (env API key + configured headers)
loop for each question
Collector->>ChatbotEndpoint: POST /v1/chat/completions (model + messages + extra_body)
alt success (2xx)
ChatbotEndpoint-->>Collector: JSON response with choices[0].message.content
Collector->>Collector: extract assistant text
else non-2xx or exception
ChatbotEndpoint-->>Collector: error status or exception
Collector->>Collector: retry with exponential backoff (until max_retries) or record error
end
Collector->>FileSystem: append LiveCollectionRecord to responses.jsonl
end
Collector->>FileSystem: write manifest.json with totals
Collector-->>Adapter: return LiveCollectionManifest
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~40 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/evalhub/adapter/live_collection.py (1)
343-380: 💤 Low valueConsider adding backoff delay between retries.
The retry loop has no delay between attempts, which could be aggressive for transient server issues (e.g., rate limiting). For an experimental feature with
max_retries=0default, this is acceptable, but a short backoff would improve reliability.💡 Optional: Add exponential backoff
+import time + def _collect_one_question( config: LiveCollectionConfig, client: Any, headers: Mapping[str, str], question: LiveQuestion, ) -> LiveCollectionRecord: last_error: str | None = None for _attempt in range(config.max_retries + 1): try: response = client.post( ... ) ... except Exception as exc: last_error = f"{type(exc).__name__}: {exc}" + if _attempt < config.max_retries: + time.sleep(min(2 ** _attempt, 10)) # Cap at 10 seconds🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/evalhub/adapter/live_collection.py` around lines 343 - 380, The retry loop in _collect_one_question is missing any delay between attempts; add a short backoff (preferably exponential with jitter) between retries to avoid aggressive hammering on transient failures: inside _collect_one_question, after catching the exception and before the next retry, compute a backoff delay using the retry attempt index (e.g., base_delay * 2**_attempt capped by a max_backoff and add small random jitter via random.uniform) and call time.sleep(delay) (or asyncio.sleep if this code becomes async), ensuring you do not sleep after the final attempt; reference config.max_retries and the _attempt loop when implementing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/evalhub/adapter/live_collection.py`:
- Around line 343-380: The retry loop in _collect_one_question is missing any
delay between attempts; add a short backoff (preferably exponential with jitter)
between retries to avoid aggressive hammering on transient failures: inside
_collect_one_question, after catching the exception and before the next retry,
compute a backoff delay using the retry attempt index (e.g., base_delay *
2**_attempt capped by a max_backoff and add small random jitter via
random.uniform) and call time.sleep(delay) (or asyncio.sleep if this code
becomes async), ensuring you do not sleep after the final attempt; reference
config.max_retries and the _attempt loop when implementing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 66f8c691-c660-4a2b-b73e-81e07f281855
📒 Files selected for processing (4)
README.mdsrc/evalhub/adapter/__init__.pysrc/evalhub/adapter/live_collection.pytests/unit/test_live_collection.py
|
Addressed CodeRabbit's retry backoff nitpick in beebeec. Changes:
Validation:
|
|
Follow-up on the CodeRabbit docstring coverage warning in e0d9ea5. Changes:
Validation:
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/evalhub/adapter/live_collection.py (2)
417-421:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPrevent
extra_bodyfrom overriding the canonical request fields.
body.update(config.extra_body)runs aftermodelandmessagesare set, so a caller can silently replace the configured model or the loaded prompt payload. That breaks the contract between the on-wire request and the persisted manifest/row metadata. Reserve those keys or mergeextra_bodyfirst and write the required fields last.Proposed fix
- body: dict[str, Any] = { - "model": config.model, - "messages": messages, - } - body.update(config.extra_body) + forbidden_keys = {"model", "messages"} + overlapping_keys = forbidden_keys & config.extra_body.keys() + if overlapping_keys: + raise ValueError( + "extra_body cannot override reserved chat-completions fields: " + + ", ".join(sorted(overlapping_keys)) + ) + + body: dict[str, Any] = { + **config.extra_body, + "model": config.model, + "messages": messages, + } return body🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/evalhub/adapter/live_collection.py` around lines 417 - 421, Prevent config.extra_body from overriding canonical fields by merging it before setting required keys or by filtering out reserved keys; specifically, when building the request body in the block that creates body and uses config.extra_body, either apply body.update(config.extra_body) first and then set body["model"] = config.model and body["messages"] = messages, or sanitize config.extra_body to remove reserved keys ("model", "messages", any other canonical keys) before calling body.update(config.extra_body) so that the values in config.model and messages (from the manifest/row) always win.
327-345:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNormalize blank JSON/JSONL rows the same way as CSV.
A JSON/JSONL row with
"question": ""reaches this helper and raises, which aborts the entire load before any collection happens._load_csv_questions()skips the same case and also trims whitespace-only IDs, so identical data behaves differently depending on file format. Treat blank question strings as skippable rows and stripraw_idbefore deciding whether to fall back torow_index.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/evalhub/adapter/live_collection.py` around lines 327 - 345, The helper currently raises on empty or whitespace-only JSON/JSONL questions; change it to treat blank question strings as skippable (same behavior as _load_csv_questions) by returning a sentinel (e.g., None) or otherwise skipping the row instead of raising when raw_question is a string that is empty after raw_question.strip(); also normalize the id logic by trimming raw_id before deciding the fallback (compute question_id from str(raw_id).strip() and if that is empty use str(row_index)); keep the metadata assembly and LiveQuestion(...) construction unchanged for non-skippable rows so callers receive a LiveQuestion with a trimmed question and normalized question_id.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/evalhub/adapter/live_collection.py`:
- Around line 417-421: Prevent config.extra_body from overriding canonical
fields by merging it before setting required keys or by filtering out reserved
keys; specifically, when building the request body in the block that creates
body and uses config.extra_body, either apply body.update(config.extra_body)
first and then set body["model"] = config.model and body["messages"] = messages,
or sanitize config.extra_body to remove reserved keys ("model", "messages", any
other canonical keys) before calling body.update(config.extra_body) so that the
values in config.model and messages (from the manifest/row) always win.
- Around line 327-345: The helper currently raises on empty or whitespace-only
JSON/JSONL questions; change it to treat blank question strings as skippable
(same behavior as _load_csv_questions) by returning a sentinel (e.g., None) or
otherwise skipping the row instead of raising when raw_question is a string that
is empty after raw_question.strip(); also normalize the id logic by trimming
raw_id before deciding the fallback (compute question_id from
str(raw_id).strip() and if that is empty use str(row_index)); keep the metadata
assembly and LiveQuestion(...) construction unchanged for non-skippable rows so
callers receive a LiveQuestion with a trimmed question and normalized
question_id.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: df5f002d-1319-47a4-a86c-43bf6e60a86f
📒 Files selected for processing (3)
README.mdsrc/evalhub/adapter/live_collection.pytests/unit/test_live_collection.py
🚧 Files skipped from review as they are similar to previous changes (2)
- README.md
- tests/unit/test_live_collection.py
|
Pushed
Local validation on the pushed commit: .\.venv\Scripts\ruff.exe check src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\mypy.exe src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\pytest.exe tests\unit\test_live_collection.py -qResult: ruff passed, mypy passed, and |
|
Pushed What changed:
Validation on the pushed branch: .\.venv\Scripts\ruff.exe check src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\mypy.exe src\evalhub\adapter\live_collection.py tests\unit\test_live_collection.py
.\.venv\Scripts\pytest.exe tests\unit\test_live_collection.py -q
git diff --checkResult: ruff passed, mypy passed, Claude Code CLI review of this docstring-only diff returned |
What and why
Closes #135
Adds an experimental adapter-side live response collection helper that adapters can call during
JobPhase.LOADING_DATAwhile the longer-term API shape is still being decided.The helper:
/v1/chat/completionsendpointresponses.jsonlplusmanifest.jsonfor evaluation framework loadershttpxoptional via lazy import and supports dependency injection for testsapi_key_envType
Testing
Commands run:
I also ran the full unit suite after
uv sync --extra cli --extra mcp --group dev; it reached492 passedwith 2 failures intests/unit/test_cli_config.pythat assert POSIX owner-only permission bits on Windows.Breaking changes
None.
Summary by CodeRabbit
New Features
Documentation
Public API
Tests