refactor(evaluatorq-py): consolidate TokenUsage / Message types across simulation and redteam (RES-596)#127
Open
currentlycodinng wants to merge 55 commits into
Conversation
… response, missing tests (RES-651)
…rn AgentResponse, add scorer tests (RES-651)
…le SDK objects (RES-651)
…tor prompts (RES-651)
…e typecheck (RES-651)
Add tool_calls_per_turn to JobOutputPayload and RedTeamResult so intercepted tool calls from OrchestratorResult are accessible in final reports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…terception (RES-651) - Add default=str to json.dumps in evaluator to prevent silent TypeError on non-serializable tool arguments - Add AIMessage type guard in LangGraphTarget to prevent phantom tool calls from HumanMessage/ToolMessage - Track _prev_msg_count in LangGraphTarget to extract only current-turn tool calls in multi-turn flows - Emit logger.warning in reset_conversation() when _prev_msg_count > 0 to surface stale-data risk - Narrow except clause and add logger.warning in orq.py JSON arg parse for observability - Update tests to reflect AgentResponse return type and new reset_conversation semantics
…nse}} substitution (RES-651)
Response is the terminal substitution — no further placeholder expansion follows it.
Sanitizing here corrupts legitimate agent output containing {{ (e.g. template syntax explanations).
Cross-expansion is already prevented by sanitizing {{input.all_messages}} and {{output.tool_calls}}.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…essage list (RES-651) Replace separate text/tool_calls/intermediate_steps fields on AgentResponse with a unified output: list[OutputMessage], preserving the order in which items were produced. This allows evaluators to reason about tool call order and interleaving with text responses. OutputMessage is a union of: - TextOutputItem (type: "output_text") — assistant text response - ToolCallOutputItem (type: "function_call") — tool call with optional result - ReasoningOutputItem (type: "reasoning") — chain-of-thought / thinking steps Type discriminators align with the OpenResponses intermediate data format. Backward-compatible .text and .tool_calls properties are preserved. All backends and integration targets updated to build the output list.
…eption (RES-651) - Fix ORQ test mocks: use attribute assignment instead of MagicMock constructor kwargs so _extract_text correctly resolves part.kind and part.text; add response.text assertions to cover text extraction path - Remove dead/orphaned code from test_extracts_tool_call_with_dict_arguments (no-op with patch block and unused call_results list) - Add test_reset_uses_different_thread_id to verify thread_id changes after reset_conversation() in LangGraphTarget - Fix StrategyToolCall rename (ToolCall → StrategyToolCall) in contracts.py and __init__.py exports - Fix orchestrator tool_calls_per_turn tracking: replace boolean flag with _turn_calls local variable per turn to eliminate double-append race - Fix CallableTarget to re-raise TypeError and AttributeError directly instead of wrapping in RuntimeError
…r AgentResponse output (RES-651)
- Merge TokenUsageCollector into LangGraphTarget.send_prompt (kept AgentResponse return type) - Add reset_conversation() and consume_last_token_usage() to LangGraphTarget - Add clone()/reset_conversation() to OpenAIAgentTarget for new tests from main - Keep _coerce_to_agent_response for all static pipeline send_prompt call sites - Use getattr for msg.tool_calls in OpenAIModelTarget to support mock clients - Integrate LLMCallConfig/LLMConfig/xml_escape imports from main - Remove duplicate except asyncio.TimeoutError block (merge artifact in evaluator.py) - Remove test_timeout_error_propagates (superseded by main's inconclusive-on-timeout behavior) - Fix tests: add evaluator._cfg=LLMCallConfig() for __new__-constructed OWASPEvaluator, use new() not clone() for LangGraphTarget
…id CredentialError
…erge - Add ExecutedToolCall to evaluator.py imports - Pass annotations=[] to TextOutputItem constructors (Annotated Field default not recognized by basedpyright) - Serialize dict arguments to JSON strings for ToolCallOutputItem constructors - Use append pattern instead of list comprehension to avoid list invariance errors - Use cast(Any, ...) in tests for intentionally-wrong-type assertions
…construction (RES-651)
…ck failure (RES-651)
…ursor stability, update docstring (RES-651)
…t in CallableTarget (RES-651)
Move LLMCallConfig, AgentResponse, ExecutedToolCall and their supporting types (ReasoningOutputItem, OutputMessage) to the new shared evaluatorq/contracts.py. Add api field (Literal["chat_completions", "responses"], default "chat_completions") to LLMCallConfig. Re-export all moved names from evaluatorq.redteam.contracts so existing import paths remain unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t, and AgentTarget conformance Covers BaseAgent._call_llm API dispatch, OrqResponsesTarget call/send_prompt/new/get_usage/timeout behaviour, and OrqResponsesTarget conformance with the redteam AgentTarget protocol. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…put extractor Create `simulation/_client.py` with two shared helpers: - `build_simulation_client()`: replaces near-identical `_build_client` bodies in `BaseAgent` and `OrqResponsesTarget`; both now delegate to it. - `extract_responses_output()`: replaces fragile `hasattr(item, 'name')` discriminator in `base.py._call_responses` and the bespoke `_extract_output` method in `target.py` with a single type-correct implementation using `item.type == 'function_call'`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion failures Wrap the `user_simulator.update_context()` call in a try/except so that errors from a badly-implemented injected user simulator are reported with a clear `RuntimeError` message rather than being swallowed into a generic simulation error, making debugging much easier for callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… and update_context propagation - test_simulate_with_injected_target: verifies target= callback is invoked per turn - test_simulate_target_takes_precedence_over_target_callback: confirms target= wins over target_callback= - test_simulate_with_injected_user_simulator: verifies injected simulator's generate_first_message is called - test_update_context_called_per_simulation: asserts update_context receives persona/scenario kwargs - test_invalid_user_simulator_raises_type_error: confirms bad user_simulator surfaces as error result - Update test_timeout_exceeded_raises to expect RuntimeError (wrapping TimeoutError) from the new _invoke implementation that converts asyncio.TimeoutError to a descriptive RuntimeError Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, validation timing, and silent failures (RES-595) Comprehensive fixes to address multi-agent review findings: **OrqResponsesTarget (target.py):** - Fix context duplication: reset _previous_response_id=None in __call__ to avoid threading with full message history - Fix memory entity scoping: generate fresh UUID in new() for independent AgentTarget instances - Fix silent failures: raise RuntimeError for empty output items instead of logging warning **SimulationRunner (simulation.py):** - Fix late validation: move agent validation from run() to __init__ for fail-fast behavior - Fix duplicated API logic: use shared build_simulation_client() helper with proper ownership tracking - Update close() to only close owned clients **Tests:** - Update test_call_raises_error_on_no_output_items to expect RuntimeError instead of empty string All critical issues from Claude review comments now addressed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SON serialization (RES-651)
- LangGraph & OpenAI Agents targets: emit AgentResponse.output items in
message order so ReAct-style text -> tool_call -> text sequences are
preserved instead of collapsing to [tool_calls..., final_text].
- LangGraph target: pass default=str to json.dumps for tool call arguments
to avoid TypeError on non-serializable values (datetime, bytes, custom
SDK objects). pipeline.py re-raises TypeError unconditionally, which
would otherwise crash the whole attack job.
- openresponses.FunctionCall._serialize_arguments: retry with default=str
before falling back to '{}' and log a warning so silent data loss is
observable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…S-782) (#118) * feat(evaluatorq-py): port OTel tracing to agent simulation module (RES-782) Mirrors the TypeScript simulation tracing landed in RES-595 to the Python package, so simulations show as a single unified trace in the Orq dashboard with per-turn breakdown. - Add simulation/tracing.py: with_simulation_span, with_llm_span, record_llm_input/output/response, record_token_usage, set_span_attrs, get_trace_context_headers (W3C traceparent injection) - Wrap simulate()/generate_and_simulate() pipelines, run, turn, target_call, judge_evaluation, user_simulator_call, and the persona/ scenario/first_message generators in nested orq.simulation.* spans - Wrap _call_chat_completions and _call_responses in BaseAgent with with_llm_span (operation = "chat" / "responses"), record input/response via GenAI semconv, propagate trace context to the router via extra_headers so router-side spans link as children - Aggregate token usage on the run + pipeline spans; set termination / goal_achieved attrs Tested live against the Orq router: 11-span trace landed with full input/ output capture, parent links, token rollup, and billing all correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): address review findings on simulation tracing (RES-782) Fixes from internal review of 981af84: Bugs - record_llm_response: zero token counts (e.g. fully-cached request) no longer fall back to alternate attribute names; explicit None checks - record_llm_response: Responses API output extraction now descends into item.content[*].text (and prefers response.output_text when available) instead of looking for a flat .text on top-level items, which the SDK shape doesn't have - runner.run() error path: dropped dead inner try/except and restored partial token-usage reporting via a usage_holder dict so the outer except can call _get_total_usage from agents created in _run_inner (mirrors TS getTotalUsage closure) Parity - generate_structured: wrap with with_llm_span so persona/scenario generation produces GenAI client child spans (structured + json_object fallback paths both covered); inject W3C traceparent - ScenarioGenerator alt methods (with_coverage, edge_cases, boundary, security) and PersonaGenerator.generate_with_coverage now wrap their generate_structured call in orq.simulation.{persona,scenario}_generation spans with the matching mode attribute, matching TS Tests - record_token_usage_preserves_zero_prompt_tokens (regression for the falsy-or bug) - record_llm_response_responses_api_shape (item.content[].text path) - llm_span_records_exception (mirrors with_simulation_span coverage) - traceparent_injected_into_chat_completions_call (verifies extra_headers carries traceparent into the OpenAI client request) - concurrent_runs_share_pipeline_parent (asyncio.gather context isolation) Verified live against the Orq router post-fix; trace bd15b852556fa1283f2c6bd69982d8b9 has the expected 11-span tree, full input/output capture, and correct token rollup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): address Claude review on simulation tracing PR (RES-782) Findings from the PR review on #118: - _serialize_messages: switch to m.get() so messages without a content key (OpenAI tool-call-only messages) don't raise KeyError; also avoids stringifying None into the literal "None" - record_llm_response: rename second `choices` to `finish_choices` so reading the function doesn't suggest the assignments are linked - run_span / pipeline_span: type as Span | None under TYPE_CHECKING instead of Any so attribute access stays type-checked downstream - Generators (persona generate_with_coverage, scenario {with_coverage, edge_cases, boundary, security}): move parsing and post-processing inside the simulation span so timing and exception attribution are accurate (matches the basic generate() shape) - _capture_message_content: tighten docstring to make the deliberate spec deviation explicit (default True for TS parity vs spec opt-in) Skipped: the structured_output fallback-path billing attribution note — the failed parse() call's APIStatusError doesn't carry a usable response object to record on the span. Verified live: trace 8cd51c85d26bf1a8960932b0ac0c66ac landed with the expected 11-span tree, full input/output capture, and correct token rollup (3259 = 1234 + 529 + 1496). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): address second Claude review pass on PR #118 (RES-782) - with_simulation_span / with_llm_span: catch BaseException instead of Exception so asyncio.CancelledError (used by asyncio.wait_for timeouts) is recorded on the span as ERROR; previously timed-out spans ended with UNSET status, indistinguishable from normal completion in the Orq dashboard. KeyboardInterrupt / SystemExit also now appear in traces, which is desirable for visibility. - ScenarioGenerator.generate(): move the json.JSONDecodeError handler outside the with_simulation_span block so JSON parse failures mark the span ERROR (matches the pattern already used by the alt generators: with_coverage, edge_cases, boundary, security). Test added: test_simulation_span_records_asyncio_cancellation. 18/18 tracing tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): third Claude review pass — Responses API finish reason + LLM span cancellation test (RES-782) - record_llm_response: fall back to response.status when there are no .choices, so gen_ai.response.finish_reasons is populated for Responses API spans (the API uses .status — e.g. "completed", "failed", "incomplete" — instead of per-choice finish_reason) - Tests: add test_llm_span_records_asyncio_cancellation as a mirror of the simulation-span cancellation test, plus test_record_llm_response_responses_api_finish_reason for the finish_reasons fallback. 20/20 tracing tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): close tracing gaps in simulation spans * fix(evaluatorq-py): record run span metadata on cancellation --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Bauke Brenninkmeijer <bauke.brenninkmeijer@gmail.com>
Unifies the AgentTarget protocol around `send_prompt() -> AgentResponse`. `AgentResponse` now carries both the ordered output items (text + tool calls) and the per-call LLM metadata that previously lived on the now- removed `SendResult` (usage, model, response_id, finish_reason). A `SendResult = AgentResponse` alias is kept so existing tests/callers that import `SendResult` continue to work; new code should use `AgentResponse` directly. Updates: - All integrations (callable, LangGraph, OpenAI Agents, Vercel AI SDK) populate `AgentResponse.usage` from their respective usage surfaces. - OpenAI and ORQ backends preserve tool call extraction alongside usage/model/response_id propagation. - Orchestrator, pipeline, and runner read `usage` directly off the returned `AgentResponse`; the `consume_last_token_usage` shim is retained on the optional `SupportsTokenUsage` protocol but no longer wired into hot paths. - Tests migrated from `send_prompt_with_usage` to `send_prompt`. - `test_legacy_target_adapter.py` removed — its premise (distinct legacy `send_prompt` vs. new `send_prompt_with_usage`) no longer applies. Pre-existing CLI help tests (`TestVulnerabilityHelpText`) remain deselected; failures are unrelated to this merge (empty typer help output in the test environment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s simulation and redteam (RES-596) Eliminates duplicate type definitions across simulation and redteam: Shared types now live in evaluatorq.contracts (the home introduced by PR #117 for cross-subpackage shared types): - ``TokenUsage`` — uses redteam's superset (adds ``calls`` field + ``from_completion()`` classmethod). Replaces simulation's lighter variant. - ``Message`` — uses redteam's superset (supports ``"tool"`` role, ``tool_calls``, ``tool_call_id``, ``name``). Replaces simulation's simpler ``Message`` and the duplicated ``ChatMessage``. - ``FunctionCall`` and ``StrategyToolCall`` — moved alongside Message since Message references them. Both ``simulation/types.py`` and ``redteam/contracts.py`` now re-export these from ``evaluatorq.contracts`` so legacy import paths (``from evaluatorq.simulation.types import TokenUsage`` / ``from evaluatorq.redteam.contracts import Message``) keep working. ``ChatMessage`` is preserved as a deprecated alias (``ChatMessage = Message``) so external API consumers don't break. Internally, ``runner/simulation.py`` now passes ``list[Message]`` directly without the redundant per-turn ``ChatMessage(role=m.role, content=m.content)`` conversion (the two classes had identical schemas). Other cleanups: - Simulation enums (CommunicationStyle, StartingEmotion, EmotionalArc, CulturalContext, ConversationStrategy, InputFormat, TerminatedBy) now use the same StrEnum polyfill pattern as redteam/contracts.py instead of plain ``(str, Enum)``. - Adapter callbacks coerce ``m.content or ""`` since shared Message has optional content (for tool-call assistant messages). Verified: 832/833 sim+redteam tests pass (1 pre-existing failure unrelated). 0 typecheck errors. Live smoke produces full simulation trace, indexed in Orq dashboard. Stacked on PR #118 (RES-782 OTel tracing); peer of RES-598 branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment was marked as resolved.
This comment was marked as resolved.
Keeps HEAD re-exports from evaluatorq.contracts over inline defs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r (RES-596) Address Claude review on PR #127. _invert_roles_for_simulator was reconstructing Message(role=..., content=...) directly, which silently dropped tool_calls, tool_call_id, and name now that the shared Message type carries those fields. Switch to model_copy(update={"role": ...}) so all other fields are preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from the review pass on PR #127: 1. Increment ``calls`` field on token aggregation. Adopting redteam's superset ``TokenUsage`` added a ``calls`` counter, but simulation token accumulators in ``agents/base.py`` (chat completions and Responses API paths) and ``target.py`` (OrqResponsesTarget) were only summing the three token counts and never bumping ``calls``. ``runner/simulation._get_total_usage`` and ``_build_turn_metrics`` now include ``calls`` in their sum/diff so per-turn and total usage report it accurately. 2. Consolidate StrEnum polyfill. Both ``simulation/types.py`` and ``redteam/contracts.py`` carried their own Python 3.10 polyfill; moved to ``evaluatorq.contracts`` and import from there. 3. ``OrqResponsesTarget._messages_to_input`` now passes ``m.content or ""`` so a future tool-role message with ``content=None`` can't crash the Responses API call. Plus two regression tests in ``test_runner.py``: - ``_invert_roles_for_simulator`` preserves ``tool_calls``, ``tool_call_id``, ``name`` (covers the bug Claude's first review caught and the prior commit fixed). - ``Message`` with ``role="tool"`` round-trips through model_dump / model_validate without losing the new optional fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflicts in redteam/contracts.py where the base branch re-introduced inline ``Message`` and ``TokenUsage`` definitions while this branch had moved both to ``evaluatorq.contracts``. Kept the re-export approach (single source of truth in ``evaluatorq.contracts``). Picked up base's hardened TokenUsage and applied to the shared location: - ``model_config = ConfigDict(frozen=True)`` (immutable for predictable arithmetic). - ``ge=0`` validators on all token fields. - Improved ``from_completion`` total fallback (uses prompt+completion when provider reports total=0/None). - ``__add__`` / ``__radd__`` for ergonomic accumulation and ``sum()``. Updated simulation token accumulators in ``agents/base.py`` and ``target.py`` to use ``__add__`` reassignment instead of in-place ``+=`` mutation, since frozen models reject ``setattr``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #117 (Orq Responses v3) and PR #100 (tool call interception) landed on main, both touching the same simulation/redteam surface RES-596 was refactoring. Resolution: - Took main's version for files where 596's only divergence was the shared bauke/karinakalicka branch ancestry (_client, api, generators, tracing, structured_output, all test_* files). - Reconciled the 5 substantive conflicts (evaluatorq/contracts, redteam/contracts, simulation/agents/base, simulation/runner/simulation, simulation/target) by starting from main's feature-complete versions and applying 596's consolidation deltas on top: * Moved TokenUsage / Message / FunctionCall / StrategyToolCall into evaluatorq/contracts.py (joining LLMCallConfig + AgentResponse that #117 already moved there). * redteam/contracts.py and simulation/types.py now re-export from evaluatorq.contracts. * Dropped ExecutedToolCall — #100 removed it; AgentResponse.tool_calls returns list[ToolCallOutputItem] directly. * Runner passes Message all the way through (no ChatMessage rebuild), preserving tool_calls on role inversion (commit e083bfd). * TokenUsage.from_completion() + frozen-model accumulation pattern threaded through agents/base.py and target.py. Verified: 1243 unit tests pass, basedpyright clean.
- target.py: _messages_to_input now preserves tool_calls / tool_call_id / name when serializing to the Responses API, so callers who build a history with tool messages don't hit silent data loss. - contracts.py: drop StrEnum from __all__ — it's a Python-version compatibility shim, not a public domain type. - simulation/types.py: move the Message/TokenUsage import to the top of the module with the other imports; removes the # noqa: E402 suppressor.
Contributor
KarinaKKarinaK
left a comment
There was a problem hiding this comment.
simulation/target.py:247 — m.content or "" turns None into "" for assistant messages, but assistant messages with tool_calls need content: null
so I think a fix could be: skip the fallback when role == "assistant"?
more concretely:
simulation/target.py:247 (_messages_to_input) — when an assistant message has tool_calls, OpenAI expects content: null, not "". The new m.content or "" sends content="" alongside tool_calls=[...], which is the wrong shape. Don't apply the or "" fallback to assistant messages.
…sages In _messages_to_input, the 'or ""' fallback turned None into "" for assistant messages with tool_calls. OpenAI's spec requires content: null in that case, so the previous coercion produced an invalid payload. Keep the fallback for non-assistant roles where the API expects a string.
…ion/types.py The Message section header was left behind when the type itself moved into evaluatorq.contracts. Pure cleanup, no functional change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Consolidates duplicate
TokenUsageandMessagedefinitions that previously lived in bothsimulation/types.pyandredteam/contracts.py. Shared types now live inevaluatorq/contracts.py(the home introduced by PR #117 for cross-subpackage shared types).TokenUsage— uses redteam's superset (addscallsfield +from_completion()).Message— uses redteam's superset (supportstoolrole,tool_calls,tool_call_id,name).FunctionCallandStrategyToolCall— moved alongsideMessage.ChatMessageis preserved as a deprecated alias (ChatMessage = Message) to avoid breaking external API consumers.runner/simulation.pyno longer does the redundant per-turnChatMessage(role=..., content=...)conversion.Linear: https://linear.app/orqai/issue/RES-596
Test plan
bunx nx typecheck evaluatorq-pybunx nx test evaluatorq-py(simulation + redteam suites)from evaluatorq.simulation.types import TokenUsage,from evaluatorq.redteam.contracts import Message)