Skip to content

refactor(evaluatorq-py): route simulate() through evaluatorq() (RES-594)#130

Draft
currentlycodinng wants to merge 2 commits into
aminaakhmedova/res-598-add-auto-upload-for-direct-simulate-pathfrom
aminaakhmedova/res-594-refactor-simulate-route-through-evaluatorq
Draft

refactor(evaluatorq-py): route simulate() through evaluatorq() (RES-594)#130
currentlycodinng wants to merge 2 commits into
aminaakhmedova/res-598-add-auto-upload-for-direct-simulate-pathfrom
aminaakhmedova/res-594-refactor-simulate-route-through-evaluatorq

Conversation

@currentlycodinng
Copy link
Copy Markdown
Contributor

Summary

Refactors simulate() and generate_and_simulate() to delegate execution to the evaluatorq() framework instead of running their own parallel loop. Both entry points inherit auto-upload, OTel tracing, results-table display, CI gating, and dataset-id support for free.

Stacks on: PR #119 (RES-598) — once that merges, this rebases onto main.

Architecture

simulate(personas, scenarios, target, ...)
  ├── Pre-materialise persona × scenario → list[Datapoint]
  ├── Wrap each as evaluatorq DataPoint(inputs={"datapoint": dp.model_dump()})
  ├── Build one shared SimulationRunner + closure result_cache
  ├── job_fn runs runner.run(datapoint=...) and stashes raw SimulationResult
  ├── Adapter scorers read SimulationResult from cache, run the scorer,
  │     mirror value onto result.metadata.evaluator_scores
  └── evaluatorq(name, data=DPs, jobs=[job_fn], evaluators=..., _send_results=upload_results)

wrap_simulation_agent decoupled from simulate() in a separate commit on this branch: it now drives SimulationRunner.run() directly so there's no recursion.

What you get for free

Capability Before After
Auto-upload to Orq Manual call inside simulate() Via evaluatorq()'s _send_results
OTel tracing Custom pipeline span evaluatorq's spans
Results table No Yes
Dataset ID support No Yes
Single parallelism layer No Yes

Test Plan

  • 1255 unit tests pass (5 obsolete upload tests dropped, 4 new wiring tests added)
  • basedpyright clean
  • Live smoke: real simulate() call uploaded an experiment with evaluator scores intact
  • CI on push

🤖 Generated with Claude Code

…() (RES-594)

Pre-step for RES-594: break the wrap_simulation_agent -> simulate cycle
so simulate() can route THROUGH evaluatorq() without recursion.

wrap_simulation_agent now:
- Owns a SimulationRunner per call (shared across DataPoints)
- Runs SimulationRunner.run() directly instead of recursing into simulate()
- Exposes the runner on the returned job_fn as __closure_runner__ for
  lifecycle management
- Adds _extract_single_datapoint helper that canonicalises the four
  legacy input shapes (datapoint / datapoints / persona+scenario /
  personas+scenarios) into a single Datapoint

Tests updated to mock SimulationRunner.run instead of simulate.
…ork (RES-594)

simulate() and generate_and_simulate() now delegate to evaluatorq()
instead of running their own asyncio.gather loop and bespoke upload
call. Both entry points inherit auto-upload, OTel tracing, results
table display, CI gating, and dataset-id support for free — eliminating
duplicate parallelism and divergence between the direct and
wrap_simulation_agent paths.

Architecture:
- simulate() pre-materialises persona × scenario into a list of
  Datapoints, then wraps each as an evaluatorq DataPoint
  ({"datapoint": dp.model_dump()}).
- _build_simulation_job_and_cache constructs one shared SimulationRunner
  per simulate() call and a job_fn that stashes the raw SimulationResult
  by id(DataPoint) in a closure-scoped cache.
- _adapt_simulation_scorer wraps SimulationScorer as an evaluatorq
  Evaluator: it reads back from the cache by id(DataPoint), runs the
  scorer, and also mirrors the value onto result.metadata.evaluator_scores
  to preserve backward compatibility with callers that inspect the
  metadata directly.
- evaluatorq() is called with _send_results=upload_results, description,
  path, and parallelism. The shared runner is closed in the finally.
- generate_and_simulate() reuses the same path after a persona/scenario
  generation step at the front.

Tests rewritten to patch sys.modules['evaluatorq.evaluatorq'].evaluatorq
(the package re-exports the function at evaluatorq.evaluatorq, so the
dotted name resolves to the function instead of the submodule).
Five obsolete tests that mocked upload_simulation_results or
SimulationRunner.run_batch were dropped — those paths no longer exist.

Verified: 1255 unit tests pass, basedpyright clean, live smoke run
uploaded to Orq with evaluator scores intact.
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 15, 2026

Claude finished @currentlycodinng's task in 4m 19s —— View job


Review

  • Gather context (PR description, changed files)
  • Check existing review threads (none found)
  • Read changed files and analyze diff
  • Provide inline feedback
  • Post summary

The refactor is architecturally sound and the code is generally clean. The delegation to evaluatorq() eliminates a lot of bespoke parallelism/tracing/upload logic and that's a clear win. A few issues worth addressing before merging:

Notable issues

  • id(DataPoint) cache key (api.py:303) — The scorer adapter recovers SimulationResults via result_cache[id(data)], which works because processings.py passes the same DataPoint instance to both the job and the scorer. This is a silent implicit contract: if evaluatorq ever copies a DataPoint (retries, tracing enrichment, etc.), every scorer silently returns value=0.0 with no error. Embedding a stable _sim_idx in each DataPoint's inputs would make this robust without any performance cost.

  • Silent result dropping on job failure (api.py:413) — If an individual job_fn throws, processings.py swallows the exception and the DataPoint's id is never written to result_cache, so it's filtered out by the final list comprehension. Callers get a shorter list than they supplied DataPoints with no indication of what failed. The old code raised the exception. Consider at least logging a warning when len(result_cache) < len(eq_datapoints).

  • Breaking removal of evaluators param (wrap_agent.py:130) — evaluators: list[str] | None was silently dropped from wrap_simulation_agent. Any existing caller passing evaluators=[...] gets a TypeError at runtime. If intentional, needs a changelog note or a **kwargs-based deprecation shim.

Minor issues

  • _run_one_simulation docstring (wrap_agent.py:108) — Claims both wrap_simulation_agent and simulate()'s internal job_fn use it, but api.py's job_fn calls runner.run() directly. The paths aren't actually shared; the docstring is misleading.

  • _extract_single_datapoint cross-module coupling (api.py:294) — Private helper from wrap_agent.py imported by api.py. The simulate path always produces {"datapoint": ...} shaped inputs so only the first branch of _extract_single_datapoint ever fires in this path. Worth moving to a shared module or at minimum documenting the dependency.

  • Misleading error message (api.py:233) — When generate_and_simulate generates first messages, _require_orq_api_key("simulate") fires with "Set it before calling simulate()" — but the user called generate_and_simulate().

What looks good

The test strategy is correct — patching evaluatorq at the module level and asserting the kwargs (_send_results, description, path, run_name format) tests exactly the right seam. The _adapt_simulation_scorer except-and-return-0.0 pattern is appropriate as a safety net during evaluation. generate_and_simulate correctly closes its HTTP client before triggering first-message generation.

user_simulator=user_simulator,
judge=judge,
)
result_cache: dict[int, SimulationResult] = {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The id(data) cache key creates an implicit contract with evaluatorq's internals: correctness depends on process_data_point (in processings.py) passing the exact same DataPoint object instance to both job_fn and the scorer. This is true today (verified in processings.py:144 and :250), but if evaluatorq ever shallow-copies a DataPoint (e.g., for retries, parallelism isolation, or tracing enrichment), this will silently break — every scorer returns value=0.0, explanation="simulation result missing from cache" with no error raised.

A more resilient approach would be to embed a correlation key into each DataPoint's inputs:

eq_datapoints = [
    DataPoint(inputs={"datapoint": dp.model_dump(mode="json"), "_sim_idx": i})
    for i, dp in enumerate(sim_datapoints)
]

Then use params["data"].inputs["_sim_idx"] as the cache key instead of id(data).

Fix this →

finally:
await flush_tracing()
await runner.close()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When any individual simulation job_fn raises (e.g., the target callback throws, or the runner fails), process_job in processings.py catches the exception and returns JobResult(error=..., evaluator_scores=[]) without calling scorers. The DataPoint's id is never added to result_cache, so it's silently skipped here.

Callers receive a shorter list than len(sim_datapoints) with no indication of which runs failed or why. The previous implementation would propagate the exception. Consider either logging a warning when the result count doesn't match, or documenting this "partial results on failure" behaviour in the docstring.

"""
from evaluatorq.simulation.convert import to_open_responses
from evaluatorq.simulation.runner.simulation import SimulationRunner
from evaluatorq.simulation.wrap_agent import _extract_single_datapoint
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_extract_single_datapoint is a private helper (_-prefixed) in wrap_agent.py, but it's now imported here from that module. This creates a cross-module dependency on an implementation detail. Since api.py is now the primary caller (the simulate path always creates {"datapoint": ...} shaped inputs), consider either moving this helper to a shared _internals.py / convert.py, or documenting it as part of the internal shared interface.

)


async def _run_one_simulation(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says this function exists "so both the public wrap_simulation_agent job and simulate()'s internal job_fn share one execution path", but api.py's job_fn (in _build_simulation_job_and_cache) calls runner.run() directly — it doesn't go through _run_one_simulation. The two paths are not actually shared. The docstring is misleading and _run_one_simulation is only used by wrap_simulation_agent's job_fn. Either remove the misleading claim from the docstring, or actually route api.py's job_fn through this helper.

target_callback: Callable[[list[ChatMessage]], str | Awaitable[str]] | None = None,
agent_key: str | None = None,
max_turns: int = 10,
model: str | None = None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The evaluators: list[str] | None = None parameter was silently removed. Any callers passing evaluators=[...] to wrap_simulation_agent will now get a TypeError at runtime. If this is intentional (callers should configure evaluators on the evaluatorq() call instead), it's a breaking change that should be called out in the PR or a deprecation warning added.

from evaluatorq.simulation.generators import FirstMessageGenerator

return results
api_key = _require_orq_api_key("simulate")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When generate_and_simulate triggers first-message generation (it passes personas and scenarios, never datapoints), this will call _require_orq_api_key("simulate"), producing the error message "Set it before calling simulate()" — but the user called generate_and_simulate(). The context string should be a parameter or derived from the caller.

api_key = _require_orq_api_key("simulate")  # misleading when called from generate_and_simulate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant