refactor(evaluatorq-py): route simulate() through evaluatorq() (RES-594)#130
…() (RES-594)
Pre-step for RES-594: break the wrap_simulation_agent -> simulate cycle so simulate() can route through evaluatorq() without recursion.
wrap_simulation_agent now:
- Owns a SimulationRunner per call (shared across DataPoints)
- Runs SimulationRunner.run() directly instead of recursing into simulate()
- Exposes the runner on the returned job_fn as __closure_runner__ for lifecycle management
- Adds an _extract_single_datapoint helper that canonicalises the four legacy input shapes (datapoint / datapoints / persona+scenario / personas+scenarios) into a single Datapoint
Tests updated to mock SimulationRunner.run instead of simulate.
…ork (RES-594)
simulate() and generate_and_simulate() now delegate to evaluatorq()
instead of running their own asyncio.gather loop and bespoke upload
call. Both entry points inherit auto-upload, OTel tracing, results
table display, CI gating, and dataset-id support for free — eliminating
duplicate parallelism and divergence between the direct and
wrap_simulation_agent paths.
Architecture:
- simulate() pre-materialises persona × scenario into a list of
Datapoints, then wraps each as an evaluatorq DataPoint
({"datapoint": dp.model_dump()}).
- _build_simulation_job_and_cache constructs one shared SimulationRunner
per simulate() call and a job_fn that stashes the raw SimulationResult
by id(DataPoint) in a closure-scoped cache.
- _adapt_simulation_scorer wraps SimulationScorer as an evaluatorq
Evaluator: it reads back from the cache by id(DataPoint), runs the
scorer, and also mirrors the value onto result.metadata.evaluator_scores
to preserve backward compatibility with callers that inspect the
metadata directly.
- evaluatorq() is called with _send_results=upload_results, description,
path, and parallelism. The shared runner is closed in the finally.
- generate_and_simulate() reuses the same path after a persona/scenario
generation step at the front.
Tests rewritten to patch sys.modules['evaluatorq.evaluatorq'].evaluatorq
(the package re-exports the function at evaluatorq.evaluatorq, so the
dotted name resolves to the function instead of the submodule).
Five obsolete tests that mocked upload_simulation_results or
SimulationRunner.run_batch were dropped — those paths no longer exist.
Verified: 1255 unit tests pass, basedpyright clean, live smoke run
uploaded to Orq with evaluator scores intact.
Claude finished @currentlycodinng's task in 4m 19s
Review
The refactor is architecturally sound and the code is generally clean. The delegation to
Notable issues
Minor issues
What looks good
The test strategy is correct: patching
user_simulator=user_simulator,
judge=judge,
)
result_cache: dict[int, SimulationResult] = {}
The id(data) cache key creates an implicit contract with evaluatorq's internals: correctness depends on process_data_point (in processings.py) passing the exact same DataPoint object instance to both job_fn and the scorer. This is true today (verified in processings.py:144 and :250), but if evaluatorq ever shallow-copies a DataPoint (e.g., for retries, parallelism isolation, or tracing enrichment), this will silently break — every scorer returns value=0.0, explanation="simulation result missing from cache" with no error raised.
A more resilient approach would be to embed a correlation key into each DataPoint's inputs:
eq_datapoints = [
    DataPoint(inputs={"datapoint": dp.model_dump(mode="json"), "_sim_idx": i})
    for i, dp in enumerate(sim_datapoints)
]
Then use params["data"].inputs["_sim_idx"] as the cache key instead of id(data).
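To make the failure mode concrete, here is a minimal, self-contained sketch. The `DataPoint` class below is a stand-in dataclass, not evaluatorq's real type, and the two lookup helpers are hypothetical names used only for illustration. It shows that a shallow copy defeats an `id()`-keyed cache while a key carried in `inputs` survives the copy.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DataPoint:
    """Stand-in for evaluatorq's DataPoint (assumption: it carries an inputs dict)."""
    inputs: dict = field(default_factory=dict)


def lookup_by_id(cache: dict, data: DataPoint) -> Optional[Any]:
    # Keyed on object identity: only the exact same instance hits.
    return cache.get(id(data))


def lookup_by_index(cache: dict, data: DataPoint) -> Optional[Any]:
    # Keyed on a correlation value stored in the inputs themselves.
    return cache.get(data.inputs["_sim_idx"])


original = DataPoint(inputs={"datapoint": {"persona": "p0"}, "_sim_idx": 0})
id_cache = {id(original): "result-0"}
idx_cache = {original.inputs["_sim_idx"]: "result-0"}

# Simulate the framework copying the DataPoint between job and scorer.
copied = copy.copy(original)

hit_same_instance = lookup_by_id(id_cache, original)
miss_after_copy = lookup_by_id(id_cache, copied)
hit_after_copy = lookup_by_index(idx_cache, copied)
```

With the `id()` key the copied object misses the cache, which corresponds to the silent `value=0.0` path described above; the index key still resolves.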
finally:
    await flush_tracing()
    await runner.close()
When any individual simulation job_fn raises (e.g., the target callback throws, or the runner fails), process_job in processings.py catches the exception and returns JobResult(error=..., evaluator_scores=[]) without calling scorers. The DataPoint's id is never added to result_cache, so it's silently skipped here.
Callers receive a shorter list than len(sim_datapoints) with no indication of which runs failed or why. The previous implementation would propagate the exception. Consider either logging a warning when the result count doesn't match, or documenting this "partial results on failure" behaviour in the docstring.
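One way the warning could look, as a sketch only: `collect_results` and the logger name are hypothetical, and the cache is simplified to a dict of strings keyed by whatever key the job uses. It returns the available results in input order and logs which keys produced nothing.

```python
import logging

logger = logging.getLogger("simulation_sketch")  # hypothetical logger name


def collect_results(keys: list, result_cache: dict) -> list:
    """Return cached results in input order and warn about any missing keys."""
    missing = [k for k in keys if k not in result_cache]
    if missing:
        # Surface the gap instead of silently returning a shorter list.
        logger.warning(
            "%d of %d simulations produced no result (missing keys: %s)",
            len(missing),
            len(keys),
            missing,
        )
    return [result_cache[k] for k in keys if k in result_cache]


# Key 1 failed, so only two results come back and a warning is logged.
partial = collect_results([0, 1, 2], {0: "ok-0", 2: "ok-2"})
```

This keeps the "partial results on failure" behaviour but makes it visible to the caller's logs.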
"""
from evaluatorq.simulation.convert import to_open_responses
from evaluatorq.simulation.runner.simulation import SimulationRunner
from evaluatorq.simulation.wrap_agent import _extract_single_datapoint
_extract_single_datapoint is a private helper (_-prefixed) in wrap_agent.py, but it's now imported here from that module. This creates a cross-module dependency on an implementation detail. Since api.py is now the primary caller (the simulate path always creates {"datapoint": ...} shaped inputs), consider either moving this helper to a shared _internals.py / convert.py, or documenting it as part of the internal shared interface.
)

async def _run_one_simulation(
The docstring says this function exists "so both the public wrap_simulation_agent job and simulate()'s internal job_fn share one execution path", but api.py's job_fn (in _build_simulation_job_and_cache) calls runner.run() directly — it doesn't go through _run_one_simulation. The two paths are not actually shared. The docstring is misleading and _run_one_simulation is only used by wrap_simulation_agent's job_fn. Either remove the misleading claim from the docstring, or actually route api.py's job_fn through this helper.
target_callback: Callable[[list[ChatMessage]], str | Awaitable[str]] | None = None,
agent_key: str | None = None,
max_turns: int = 10,
model: str | None = None,
The evaluators: list[str] | None = None parameter was silently removed. Any callers passing evaluators=[...] to wrap_simulation_agent will now get a TypeError at runtime. If this is intentional (callers should configure evaluators on the evaluatorq() call instead), it's a breaking change that should be called out in the PR or a deprecation warning added.
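If a deprecation path is preferred over a hard break, a shim along these lines might work. This is a sketch under assumptions: `wrap_simulation_agent_sketch` is a hypothetical stand-in that only returns its configuration, and the warning text is illustrative rather than taken from the codebase.

```python
import warnings
from typing import Any, Optional


def wrap_simulation_agent_sketch(
    *,
    max_turns: int = 10,
    model: Optional[str] = None,
    evaluators: Optional[list] = None,  # deprecated, accepted only to warn
) -> dict:
    """Accept the legacy keyword, warn, and ignore it."""
    if evaluators is not None:
        warnings.warn(
            "evaluators is deprecated and ignored; "
            "configure evaluators on the evaluatorq() call instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
    config: dict = {"max_turns": max_turns, "model": model}
    return config


# A legacy call keeps working and emits exactly one DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    config = wrap_simulation_agent_sketch(max_turns=5, evaluators=["tone"])
```

Existing callers then get a warning instead of a TypeError, and the parameter can be dropped in a later release.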
from evaluatorq.simulation.generators import FirstMessageGenerator

return results
api_key = _require_orq_api_key("simulate")
When generate_and_simulate triggers first-message generation (it passes personas and scenarios, never datapoints), this will call _require_orq_api_key("simulate"), producing the error message "Set it before calling simulate()" — but the user called generate_and_simulate(). The context string should be a parameter or derived from the caller.
api_key = _require_orq_api_key("simulate") # misleading when called from generate_and_simulate
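A sketch of threading the caller name through, assuming the helper reads an environment variable: the variable name `ORQ_API_KEY`, the exception type, the `env` parameter, and the first half of the message are assumptions, and only the "Set it before calling ...()" wording comes from the error quoted above.

```python
import os
from typing import Mapping, Optional


def require_orq_api_key(caller: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key or raise an error that names the public entry point."""
    source = os.environ if env is None else env
    api_key = source.get("ORQ_API_KEY")  # assumed variable name
    if not api_key:
        raise RuntimeError(
            f"ORQ_API_KEY is not set. Set it before calling {caller}()."
        )
    return api_key


# generate_and_simulate() would pass its own name down to the shared path.
try:
    require_orq_api_key("generate_and_simulate", env={})
except RuntimeError as exc:
    message = str(exc)
```

The shared internal path would then take the caller name as an argument instead of hard-coding "simulate".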
Summary
Refactors simulate() and generate_and_simulate() to delegate execution to the evaluatorq() framework instead of running their own parallel loop. Both entry points inherit auto-upload, OTel tracing, results-table display, CI gating, and dataset-id support for free.
Stacks on: PR #119 (RES-598). Once that merges, this rebases onto main.
Architecture
wrap_simulation_agent decoupled from simulate() in a separate commit on this branch: it now drives SimulationRunner.run() directly so there's no recursion.
What you get for free
Test Plan
basedpyright clean
🤖 Generated with Claude Code