refactor(evaluatorq-py): route simulate() through evaluatorq() (RES-594) by currentlycodinng · Pull Request #130 · orq-ai/orqkit

currentlycodinng · 2026-05-15T13:39:37Z

Summary

Refactors simulate() and generate_and_simulate() to delegate execution to the evaluatorq() framework instead of running their own parallel loop. Both entry points inherit auto-upload, OTel tracing, results-table display, CI gating, and dataset-id support for free.

Stacks on: PR #119 (RES-598) — once that merges, this rebases onto main.

Architecture

simulate(personas, scenarios, target, ...)
  ├── Pre-materialise persona × scenario → list[Datapoint]
  ├── Wrap each as evaluatorq DataPoint(inputs={"datapoint": dp.model_dump()})
  ├── Build one shared SimulationRunner + closure result_cache
  ├── job_fn runs runner.run(datapoint=...) and stashes raw SimulationResult
  ├── Adapter scorers read SimulationResult from cache, run the scorer,
  │     mirror value onto result.metadata.evaluator_scores
  └── evaluatorq(name, data=DPs, jobs=[job_fn], evaluators=..., _send_results=upload_results)

wrap_simulation_agent decoupled from simulate() in a separate commit on this branch: it now drives SimulationRunner.run() directly so there's no recursion.

What you get for free

Capability	Before	After
Auto-upload to Orq	Manual call inside simulate()	Via evaluatorq()'s _send_results
OTel tracing	Custom pipeline span	evaluatorq's spans
Results table	No	Yes
Dataset ID support	No	Yes
Single parallelism layer	No	Yes

Test Plan

1255 unit tests pass (5 obsolete upload tests dropped, 4 new wiring tests added)
basedpyright clean
Live smoke: real simulate() call uploaded an experiment with evaluator scores intact
CI on push

🤖 Generated with Claude Code

…() (RES-594) Pre-step for RES-594: break the wrap_simulation_agent -> simulate cycle so simulate() can route THROUGH evaluatorq() without recursion. wrap_simulation_agent now: - Owns a SimulationRunner per call (shared across DataPoints) - Runs SimulationRunner.run() directly instead of recursing into simulate() - Exposes the runner on the returned job_fn as __closure_runner__ for lifecycle management - Adds _extract_single_datapoint helper that canonicalises the four legacy input shapes (datapoint / datapoints / persona+scenario / personas+scenarios) into a single Datapoint Tests updated to mock SimulationRunner.run instead of simulate.

…ork (RES-594) simulate() and generate_and_simulate() now delegate to evaluatorq() instead of running their own asyncio.gather loop and bespoke upload call. Both entry points inherit auto-upload, OTel tracing, results table display, CI gating, and dataset-id support for free — eliminating duplicate parallelism and divergence between the direct and wrap_simulation_agent paths. Architecture: - simulate() pre-materialises persona × scenario into a list of Datapoints, then wraps each as an evaluatorq DataPoint ({"datapoint": dp.model_dump()}). - _build_simulation_job_and_cache constructs one shared SimulationRunner per simulate() call and a job_fn that stashes the raw SimulationResult by id(DataPoint) in a closure-scoped cache. - _adapt_simulation_scorer wraps SimulationScorer as an evaluatorq Evaluator: it reads back from the cache by id(DataPoint), runs the scorer, and also mirrors the value onto result.metadata.evaluator_scores to preserve backward compatibility with callers that inspect the metadata directly. - evaluatorq() is called with _send_results=upload_results, description, path, and parallelism. The shared runner is closed in the finally. - generate_and_simulate() reuses the same path after a persona/scenario generation step at the front. Tests rewritten to patch sys.modules['evaluatorq.evaluatorq'].evaluatorq (the package re-exports the function at evaluatorq.evaluatorq, so the dotted name resolves to the function instead of the submodule). Five obsolete tests that mocked upload_simulation_results or SimulationRunner.run_batch were dropped — those paths no longer exist. Verified: 1255 unit tests pass, basedpyright clean, live smoke run uploaded to Orq with evaluator scores intact.

claude · 2026-05-15T13:39:53Z

Claude finished @currentlycodinng's task in 4m 19s —— View job

Review

Gather context (PR description, changed files)
Check existing review threads (none found)
Read changed files and analyze diff
Provide inline feedback
Post summary

The refactor is architecturally sound and the code is generally clean. The delegation to evaluatorq() eliminates a lot of bespoke parallelism/tracing/upload logic and that's a clear win. A few issues worth addressing before merging:

Notable issues

id(DataPoint) cache key (api.py:303) — The scorer adapter recovers SimulationResults via result_cache[id(data)], which works because processings.py passes the same DataPoint instance to both the job and the scorer. This is a silent implicit contract: if evaluatorq ever copies a DataPoint (retries, tracing enrichment, etc.), every scorer silently returns value=0.0 with no error. Embedding a stable _sim_idx in each DataPoint's inputs would make this robust without any performance cost.
Silent result dropping on job failure (api.py:413) — If an individual job_fn throws, processings.py swallows the exception and the DataPoint's id is never written to result_cache, so it's filtered out by the final list comprehension. Callers get a shorter list than they supplied DataPoints with no indication of what failed. The old code raised the exception. Consider at least logging a warning when len(result_cache) < len(eq_datapoints).
Breaking removal of evaluators param (wrap_agent.py:130) — evaluators: list[str] | None was silently dropped from wrap_simulation_agent. Any existing caller passing evaluators=[...] gets a TypeError at runtime. If intentional, needs a changelog note or a **kwargs-based deprecation shim.

Minor issues

_run_one_simulation docstring (wrap_agent.py:108) — Claims both wrap_simulation_agent and simulate()'s internal job_fn use it, but api.py's job_fn calls runner.run() directly. The paths aren't actually shared; the docstring is misleading.
_extract_single_datapoint cross-module coupling (api.py:294) — Private helper from wrap_agent.py imported by api.py. The simulate path always produces {"datapoint": ...} shaped inputs so only the first branch of _extract_single_datapoint ever fires in this path. Worth moving to a shared module or at minimum documenting the dependency.
Misleading error message (api.py:233) — When generate_and_simulate generates first messages, _require_orq_api_key("simulate") fires with "Set it before calling simulate()" — but the user called generate_and_simulate().

What looks good

The test strategy is correct — patching evaluatorq at the module level and asserting the kwargs (_send_results, description, path, run_name format) tests exactly the right seam. The _adapt_simulation_scorer except-and-return-0.0 pattern is appropriate as a safety net during evaluation. generate_and_simulate correctly closes its HTTP client before triggering first-message generation.

claude · 2026-05-15T13:43:07Z

+        user_simulator=user_simulator,
+        judge=judge,
+    )
+    result_cache: dict[int, SimulationResult] = {}


The id(data) cache key creates an implicit contract with evaluatorq's internals: correctness depends on process_data_point (in processings.py) passing the exact same DataPoint object instance to both job_fn and the scorer. This is true today (verified in processings.py:144 and :250), but if evaluatorq ever shallow-copies a DataPoint (e.g., for retries, parallelism isolation, or tracing enrichment), this will silently break — every scorer returns value=0.0, explanation="simulation result missing from cache" with no error raised.

A more resilient approach would be to embed a correlation key into each DataPoint's inputs:

eq_datapoints = [ DataPoint(inputs={"datapoint": dp.model_dump(mode="json"), "_sim_idx": i}) for i, dp in enumerate(sim_datapoints) ]

Then use params["data"].inputs["_sim_idx"] as the cache key instead of id(data).

Fix this →

claude · 2026-05-15T13:43:16Z

    finally:
-        await flush_tracing()
+        await runner.close()
+


When any individual simulation job_fn raises (e.g., the target callback throws, or the runner fails), process_job in processings.py catches the exception and returns JobResult(error=..., evaluator_scores=[]) without calling scorers. The DataPoint's id is never added to result_cache, so it's silently skipped here.

Callers receive a shorter list than len(sim_datapoints) with no indication of which runs failed or why. The previous implementation would propagate the exception. Consider either logging a warning when the result count doesn't match, or documenting this "partial results on failure" behaviour in the docstring.

claude · 2026-05-15T13:43:24Z

+    """
+    from evaluatorq.simulation.convert import to_open_responses
+    from evaluatorq.simulation.runner.simulation import SimulationRunner
+    from evaluatorq.simulation.wrap_agent import _extract_single_datapoint


_extract_single_datapoint is a private helper (_-prefixed) in wrap_agent.py, but it's now imported here from that module. This creates a cross-module dependency on an implementation detail. Since api.py is now the primary caller (the simulate path always creates {"datapoint": ...} shaped inputs), consider either moving this helper to a shared _internals.py / convert.py, or documenting it as part of the internal shared interface.

claude · 2026-05-15T13:43:33Z

+    )
+
+
+async def _run_one_simulation(


The docstring says this function exists "so both the public wrap_simulation_agent job and simulate()'s internal job_fn share one execution path", but api.py's job_fn (in _build_simulation_job_and_cache) calls runner.run() directly — it doesn't go through _run_one_simulation. The two paths are not actually shared. The docstring is misleading and _run_one_simulation is only used by wrap_simulation_agent's job_fn. Either remove the misleading claim from the docstring, or actually route api.py's job_fn through this helper.

claude · 2026-05-15T13:43:43Z

    target_callback: Callable[[list[ChatMessage]], str | Awaitable[str]] | None = None,
    agent_key: str | None = None,
    max_turns: int = 10,
    model: str | None = None,


The evaluators: list[str] | None = None parameter was silently removed. Any callers passing evaluators=[...] to wrap_simulation_agent will now get a TypeError at runtime. If this is intentional (callers should configure evaluators on the evaluatorq() call instead), it's a breaking change that should be called out in the PR or a deprecation warning added.

claude · 2026-05-15T13:43:50Z

+    from evaluatorq.simulation.generators import FirstMessageGenerator

-        return results
+    api_key = _require_orq_api_key("simulate")


When generate_and_simulate triggers first-message generation (it passes personas and scenarios, never datapoints), this will call _require_orq_api_key("simulate"), producing the error message "Set it before calling simulate()" — but the user called generate_and_simulate(). The context string should be a parameter or derived from the caller.

api_key = _require_orq_api_key("simulate") # misleading when called from generate_and_simulate

currentlycodinng added 2 commits May 15, 2026 15:13

claude Bot reviewed May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(evaluatorq-py): route simulate() through evaluatorq() (RES-594)#130

refactor(evaluatorq-py): route simulate() through evaluatorq() (RES-594)#130
currentlycodinng wants to merge 2 commits into
aminaakhmedova/res-598-add-auto-upload-for-direct-simulate-pathfrom
aminaakhmedova/res-594-refactor-simulate-route-through-evaluatorq

currentlycodinng commented May 15, 2026

Uh oh!

claude Bot commented May 15, 2026 •

edited

Loading

Uh oh!

claude Bot May 15, 2026

Uh oh!

claude Bot May 15, 2026

Uh oh!

claude Bot May 15, 2026

Uh oh!

claude Bot May 15, 2026

Uh oh!

claude Bot May 15, 2026

Uh oh!

claude Bot May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

currentlycodinng commented May 15, 2026

Summary

Architecture

What you get for free

Test Plan

Uh oh!

claude Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 15, 2026 •

edited

Loading