refactor(evaluatorq-py): consolidate TokenUsage / Message types across simulation and redteam (RES-596) by currentlycodinng · Pull Request #127 · orq-ai/orqkit

currentlycodinng · 2026-05-13T07:34:44Z

Summary

Consolidates duplicate TokenUsage and Message definitions that previously lived in both simulation/types.py and redteam/contracts.py. Shared types now live in evaluatorq/contracts.py (the home introduced by PR #117 for cross-subpackage shared types).

TokenUsage — uses redteam's superset (adds calls field + from_completion()).
Message — uses redteam's superset (supports tool role, tool_calls, tool_call_id, name).
FunctionCall and StrategyToolCall — moved alongside Message.
ChatMessage is preserved as a deprecated alias (ChatMessage = Message) to avoid breaking external API consumers.
runner/simulation.py no longer does the redundant per-turn ChatMessage(role=..., content=...) conversion.
Both legacy import paths still work via re-exports.

Linear: https://linear.app/orqai/issue/RES-596

Test plan

bunx nx typecheck evaluatorq-py
bunx nx test evaluatorq-py (simulation + redteam suites)
Verify legacy imports still resolve (from evaluatorq.simulation.types import TokenUsage, from evaluatorq.redteam.contracts import Message)

…output (RES-651)

… response, missing tests (RES-651)

…r (RES-651)

…rn AgentResponse, add scorer tests (RES-651)

…le SDK objects (RES-651)

…pe (RES-651)

…tor prompts (RES-651)

…type (RES-651)

…n type (RES-651)

…e typecheck (RES-651)

Add tool_calls_per_turn to JobOutputPayload and RedTeamResult so intercepted tool calls from OrchestratorResult are accessible in final reports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…terception (RES-651) - Add default=str to json.dumps in evaluator to prevent silent TypeError on non-serializable tool arguments - Add AIMessage type guard in LangGraphTarget to prevent phantom tool calls from HumanMessage/ToolMessage - Track _prev_msg_count in LangGraphTarget to extract only current-turn tool calls in multi-turn flows - Emit logger.warning in reset_conversation() when _prev_msg_count > 0 to surface stale-data risk - Narrow except clause and add logger.warning in orq.py JSON arg parse for observability - Update tests to reflect AgentResponse return type and new reset_conversation semantics

…nse}} substitution (RES-651) Response is the terminal substitution — no further placeholder expansion follows it. Sanitizing here corrupts legitimate agent output containing {{ (e.g. template syntax explanations). Cross-expansion is already prevented by sanitizing {{input.all_messages}} and {{output.tool_calls}}. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…essage list (RES-651) Replace separate text/tool_calls/intermediate_steps fields on AgentResponse with a unified output: list[OutputMessage], preserving the order in which items were produced. This allows evaluators to reason about tool call order and interleaving with text responses. OutputMessage is a union of: - TextOutputItem (type: "output_text") — assistant text response - ToolCallOutputItem (type: "function_call") — tool call with optional result - ReasoningOutputItem (type: "reasoning") — chain-of-thought / thinking steps Type discriminators align with the OpenResponses intermediate data format. Backward-compatible .text and .tool_calls properties are preserved. All backends and integration targets updated to build the output list.

…eption (RES-651) - Fix ORQ test mocks: use attribute assignment instead of MagicMock constructor kwargs so _extract_text correctly resolves part.kind and part.text; add response.text assertions to cover text extraction path - Remove dead/orphaned code from test_extracts_tool_call_with_dict_arguments (no-op with patch block and unused call_results list) - Add test_reset_uses_different_thread_id to verify thread_id changes after reset_conversation() in LangGraphTarget - Fix StrategyToolCall rename (ToolCall → StrategyToolCall) in contracts.py and __init__.py exports - Fix orchestrator tool_calls_per_turn tracking: replace boolean flag with _turn_calls local variable per turn to eliminate double-append race - Fix CallableTarget to re-raise TypeError and AttributeError directly instead of wrapping in RuntimeError

…r AgentResponse output (RES-651)

- Merge TokenUsageCollector into LangGraphTarget.send_prompt (kept AgentResponse return type) - Add reset_conversation() and consume_last_token_usage() to LangGraphTarget - Add clone()/reset_conversation() to OpenAIAgentTarget for new tests from main - Keep _coerce_to_agent_response for all static pipeline send_prompt call sites - Use getattr for msg.tool_calls in OpenAIModelTarget to support mock clients - Integrate LLMCallConfig/LLMConfig/xml_escape imports from main - Remove duplicate except asyncio.TimeoutError block (merge artifact in evaluator.py) - Remove test_timeout_error_propagates (superseded by main's inconclusive-on-timeout behavior) - Fix tests: add evaluator._cfg=LLMCallConfig() for __new__-constructed OWASPEvaluator, use new() not clone() for LangGraphTarget

…id CredentialError

…erge - Add ExecutedToolCall to evaluator.py imports - Pass annotations=[] to TextOutputItem constructors (Annotated Field default not recognized by basedpyright) - Serialize dict arguments to JSON strings for ToolCallOutputItem constructors - Use append pattern instead of list comprehension to avoid list invariance errors - Use cast(Any, ...) in tests for intentionally-wrong-type assertions

…construction (RES-651)

…ck failure (RES-651)

…ursor stability, update docstring (RES-651)

…t in CallableTarget (RES-651)

Move LLMCallConfig, AgentResponse, ExecutedToolCall and their supporting types (ReasoningOutputItem, OutputMessage) to the new shared evaluatorq/contracts.py. Add api field (Literal["chat_completions", "responses"], default "chat_completions") to LLMCallConfig. Re-export all moved names from evaluatorq.redteam.contracts so existing import paths remain unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…t, and AgentTarget conformance Covers BaseAgent._call_llm API dispatch, OrqResponsesTarget call/send_prompt/new/get_usage/timeout behaviour, and OrqResponsesTarget conformance with the redteam AgentTarget protocol. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…put extractor Create `simulation/_client.py` with two shared helpers: - `build_simulation_client()`: replaces near-identical `_build_client` bodies in `BaseAgent` and `OrqResponsesTarget`; both now delegate to it. - `extract_responses_output()`: replaces fragile `hasattr(item, 'name')` discriminator in `base.py._call_responses` and the bespoke `_extract_output` method in `target.py` with a single type-correct implementation using `item.type == 'function_call'`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tion failures Wrap the `user_simulator.update_context()` call in a try/except so that errors from a badly-implemented injected user simulator are reported with a clear `RuntimeError` message rather than being swallowed into a generic simulation error, making debugging much easier for callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… and update_context propagation - test_simulate_with_injected_target: verifies target= callback is invoked per turn - test_simulate_target_takes_precedence_over_target_callback: confirms target= wins over target_callback= - test_simulate_with_injected_user_simulator: verifies injected simulator's generate_first_message is called - test_update_context_called_per_simulation: asserts update_context receives persona/scenario kwargs - test_invalid_user_simulator_raises_type_error: confirms bad user_simulator surfaces as error result - Update test_timeout_exceeded_raises to expect RuntimeError (wrapping TimeoutError) from the new _invoke implementation that converts asyncio.TimeoutError to a descriptive RuntimeError Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…, validation timing, and silent failures (RES-595) Comprehensive fixes to address multi-agent review findings: **OrqResponsesTarget (target.py):** - Fix context duplication: reset _previous_response_id=None in __call__ to avoid threading with full message history - Fix memory entity scoping: generate fresh UUID in new() for independent AgentTarget instances - Fix silent failures: raise RuntimeError for empty output items instead of logging warning **SimulationRunner (simulation.py):** - Fix late validation: move agent validation from run() to __init__ for fail-fast behavior - Fix duplicated API logic: use shared build_simulation_client() helper with proper ownership tracking - Update close() to only close owned clients **Tests:** - Update test_call_raises_error_on_no_output_items to expect RuntimeError instead of empty string All critical issues from Claude review comments now addressed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…SON serialization (RES-651) - LangGraph & OpenAI Agents targets: emit AgentResponse.output items in message order so ReAct-style text -> tool_call -> text sequences are preserved instead of collapsing to [tool_calls..., final_text]. - LangGraph target: pass default=str to json.dumps for tool call arguments to avoid TypeError on non-serializable values (datetime, bytes, custom SDK objects). pipeline.py re-raises TypeError unconditionally, which would otherwise crash the whole attack job. - openresponses.FunctionCall._serialize_arguments: retry with default=str before falling back to '{}' and log a warning so silent data loss is observable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…S-782) (#118) * feat(evaluatorq-py): port OTel tracing to agent simulation module (RES-782) Mirrors the TypeScript simulation tracing landed in RES-595 to the Python package, so simulations show as a single unified trace in the Orq dashboard with per-turn breakdown. - Add simulation/tracing.py: with_simulation_span, with_llm_span, record_llm_input/output/response, record_token_usage, set_span_attrs, get_trace_context_headers (W3C traceparent injection) - Wrap simulate()/generate_and_simulate() pipelines, run, turn, target_call, judge_evaluation, user_simulator_call, and the persona/ scenario/first_message generators in nested orq.simulation.* spans - Wrap _call_chat_completions and _call_responses in BaseAgent with with_llm_span (operation = "chat" / "responses"), record input/response via GenAI semconv, propagate trace context to the router via extra_headers so router-side spans link as children - Aggregate token usage on the run + pipeline spans; set termination / goal_achieved attrs Tested live against the Orq router: 11-span trace landed with full input/ output capture, parent links, token rollup, and billing all correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): address review findings on simulation tracing (RES-782) Fixes from internal review of 981af84: Bugs - record_llm_response: zero token counts (e.g. fully-cached request) no longer fall back to alternate attribute names; explicit None checks - record_llm_response: Responses API output extraction now descends into item.content[*].text (and prefers response.output_text when available) instead of looking for a flat .text on top-level items, which the SDK shape doesn't have - runner.run() error path: dropped dead inner try/except and restored partial token-usage reporting via a usage_holder dict so the outer except can call _get_total_usage from agents created in _run_inner (mirrors TS getTotalUsage closure) Parity - generate_structured: wrap with with_llm_span so persona/scenario generation produces GenAI client child spans (structured + json_object fallback paths both covered); inject W3C traceparent - ScenarioGenerator alt methods (with_coverage, edge_cases, boundary, security) and PersonaGenerator.generate_with_coverage now wrap their generate_structured call in orq.simulation.{persona,scenario}_generation spans with the matching mode attribute, matching TS Tests - record_token_usage_preserves_zero_prompt_tokens (regression for the falsy-or bug) - record_llm_response_responses_api_shape (item.content[].text path) - llm_span_records_exception (mirrors with_simulation_span coverage) - traceparent_injected_into_chat_completions_call (verifies extra_headers carries traceparent into the OpenAI client request) - concurrent_runs_share_pipeline_parent (asyncio.gather context isolation) Verified live against the Orq router post-fix; trace bd15b852556fa1283f2c6bd69982d8b9 has the expected 11-span tree, full input/output capture, and correct token rollup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): address Claude review on simulation tracing PR (RES-782) Findings from the PR review on #118: - _serialize_messages: switch to m.get() so messages without a content key (OpenAI tool-call-only messages) don't raise KeyError; also avoids stringifying None into the literal "None" - record_llm_response: rename second `choices` to `finish_choices` so reading the function doesn't suggest the assignments are linked - run_span / pipeline_span: type as Span | None under TYPE_CHECKING instead of Any so attribute access stays type-checked downstream - Generators (persona generate_with_coverage, scenario {with_coverage, edge_cases, boundary, security}): move parsing and post-processing inside the simulation span so timing and exception attribution are accurate (matches the basic generate() shape) - _capture_message_content: tighten docstring to make the deliberate spec deviation explicit (default True for TS parity vs spec opt-in) Skipped: the structured_output fallback-path billing attribution note — the failed parse() call's APIStatusError doesn't carry a usable response object to record on the span. Verified live: trace 8cd51c85d26bf1a8960932b0ac0c66ac landed with the expected 11-span tree, full input/output capture, and correct token rollup (3259 = 1234 + 529 + 1496). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): address second Claude review pass on PR #118 (RES-782) - with_simulation_span / with_llm_span: catch BaseException instead of Exception so asyncio.CancelledError (used by asyncio.wait_for timeouts) is recorded on the span as ERROR; previously timed-out spans ended with UNSET status, indistinguishable from normal completion in the Orq dashboard. KeyboardInterrupt / SystemExit also now appear in traces, which is desirable for visibility. - ScenarioGenerator.generate(): move the json.JSONDecodeError handler outside the with_simulation_span block so JSON parse failures mark the span ERROR (matches the pattern already used by the alt generators: with_coverage, edge_cases, boundary, security). Test added: test_simulation_span_records_asyncio_cancellation. 18/18 tracing tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): third Claude review pass — Responses API finish reason + LLM span cancellation test (RES-782) - record_llm_response: fall back to response.status when there are no .choices, so gen_ai.response.finish_reasons is populated for Responses API spans (the API uses .status — e.g. "completed", "failed", "incomplete" — instead of per-choice finish_reason) - Tests: add test_llm_span_records_asyncio_cancellation as a mirror of the simulation-span cancellation test, plus test_record_llm_response_responses_api_finish_reason for the finish_reasons fallback. 20/20 tracing tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evaluatorq-py): close tracing gaps in simulation spans * fix(evaluatorq-py): record run span metadata on cancellation --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Bauke Brenninkmeijer <bauke.brenninkmeijer@gmail.com>

Unifies the AgentTarget protocol around `send_prompt() -> AgentResponse`. `AgentResponse` now carries both the ordered output items (text + tool calls) and the per-call LLM metadata that previously lived on the now- removed `SendResult` (usage, model, response_id, finish_reason). A `SendResult = AgentResponse` alias is kept so existing tests/callers that import `SendResult` continue to work; new code should use `AgentResponse` directly. Updates: - All integrations (callable, LangGraph, OpenAI Agents, Vercel AI SDK) populate `AgentResponse.usage` from their respective usage surfaces. - OpenAI and ORQ backends preserve tool call extraction alongside usage/model/response_id propagation. - Orchestrator, pipeline, and runner read `usage` directly off the returned `AgentResponse`; the `consume_last_token_usage` shim is retained on the optional `SupportsTokenUsage` protocol but no longer wired into hot paths. - Tests migrated from `send_prompt_with_usage` to `send_prompt`. - `test_legacy_target_adapter.py` removed — its premise (distinct legacy `send_prompt` vs. new `send_prompt_with_usage`) no longer applies. Pre-existing CLI help tests (`TestVulnerabilityHelpText`) remain deselected; failures are unrelated to this merge (empty typer help output in the test environment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s simulation and redteam (RES-596) Eliminates duplicate type definitions across simulation and redteam: Shared types now live in evaluatorq.contracts (the home introduced by PR #117 for cross-subpackage shared types): - ``TokenUsage`` — uses redteam's superset (adds ``calls`` field + ``from_completion()`` classmethod). Replaces simulation's lighter variant. - ``Message`` — uses redteam's superset (supports ``"tool"`` role, ``tool_calls``, ``tool_call_id``, ``name``). Replaces simulation's simpler ``Message`` and the duplicated ``ChatMessage``. - ``FunctionCall`` and ``StrategyToolCall`` — moved alongside Message since Message references them. Both ``simulation/types.py`` and ``redteam/contracts.py`` now re-export these from ``evaluatorq.contracts`` so legacy import paths (``from evaluatorq.simulation.types import TokenUsage`` / ``from evaluatorq.redteam.contracts import Message``) keep working. ``ChatMessage`` is preserved as a deprecated alias (``ChatMessage = Message``) so external API consumers don't break. Internally, ``runner/simulation.py`` now passes ``list[Message]`` directly without the redundant per-turn ``ChatMessage(role=m.role, content=m.content)`` conversion (the two classes had identical schemas). Other cleanups: - Simulation enums (CommunicationStyle, StartingEmotion, EmotionalArc, CulturalContext, ConversationStrategy, InputFormat, TerminatedBy) now use the same StrEnum polyfill pattern as redteam/contracts.py instead of plain ``(str, Enum)``. - Adapter callbacks coerce ``m.content or ""`` since shared Message has optional content (for tool-call assistant messages). Verified: 832/833 sim+redteam tests pass (1 pre-existing failure unrelated). 0 typecheck errors. Live smoke produces full simulation trace, indexed in Orq dashboard. Stacked on PR #118 (RES-782 OTel tracing); peer of RES-598 branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keeps HEAD re-exports from evaluatorq.contracts over inline defs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…r (RES-596) Address Claude review on PR #127. _invert_roles_for_simulator was reconstructing Message(role=..., content=...) directly, which silently dropped tool_calls, tool_call_id, and name now that the shared Message type carries those fields. Switch to model_copy(update={"role": ...}) so all other fields are preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three follow-ups from the review pass on PR #127: 1. Increment ``calls`` field on token aggregation. Adopting redteam's superset ``TokenUsage`` added a ``calls`` counter, but simulation token accumulators in ``agents/base.py`` (chat completions and Responses API paths) and ``target.py`` (OrqResponsesTarget) were only summing the three token counts and never bumping ``calls``. ``runner/simulation._get_total_usage`` and ``_build_turn_metrics`` now include ``calls`` in their sum/diff so per-turn and total usage report it accurately. 2. Consolidate StrEnum polyfill. Both ``simulation/types.py`` and ``redteam/contracts.py`` carried their own Python 3.10 polyfill; moved to ``evaluatorq.contracts`` and import from there. 3. ``OrqResponsesTarget._messages_to_input`` now passes ``m.content or ""`` so a future tool-role message with ``content=None`` can't crash the Responses API call. Plus two regression tests in ``test_runner.py``: - ``_invert_roles_for_simulator`` preserves ``tool_calls``, ``tool_call_id``, ``name`` (covers the bug Claude's first review caught and the prior commit fixed). - ``Message`` with ``role="tool"`` round-trips through model_dump / model_validate without losing the new optional fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflicts in redteam/contracts.py where the base branch re-introduced inline ``Message`` and ``TokenUsage`` definitions while this branch had moved both to ``evaluatorq.contracts``. Kept the re-export approach (single source of truth in ``evaluatorq.contracts``). Picked up base's hardened TokenUsage and applied to the shared location: - ``model_config = ConfigDict(frozen=True)`` (immutable for predictable arithmetic). - ``ge=0`` validators on all token fields. - Improved ``from_completion`` total fallback (uses prompt+completion when provider reports total=0/None). - ``__add__`` / ``__radd__`` for ergonomic accumulation and ``sum()``. Updated simulation token accumulators in ``agents/base.py`` and ``target.py`` to use ``__add__`` reassignment instead of in-place ``+=`` mutation, since frozen models reject ``setattr``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #117 (Orq Responses v3) and PR #100 (tool call interception) landed on main, both touching the same simulation/redteam surface RES-596 was refactoring. Resolution: - Took main's version for files where 596's only divergence was the shared bauke/karinakalicka branch ancestry (_client, api, generators, tracing, structured_output, all test_* files). - Reconciled the 5 substantive conflicts (evaluatorq/contracts, redteam/contracts, simulation/agents/base, simulation/runner/simulation, simulation/target) by starting from main's feature-complete versions and applying 596's consolidation deltas on top: * Moved TokenUsage / Message / FunctionCall / StrategyToolCall into evaluatorq/contracts.py (joining LLMCallConfig + AgentResponse that #117 already moved there). * redteam/contracts.py and simulation/types.py now re-export from evaluatorq.contracts. * Dropped ExecutedToolCall — #100 removed it; AgentResponse.tool_calls returns list[ToolCallOutputItem] directly. * Runner passes Message all the way through (no ChatMessage rebuild), preserving tool_calls on role inversion (commit e083bfd). * TokenUsage.from_completion() + frozen-model accumulation pattern threaded through agents/base.py and target.py. Verified: 1243 unit tests pass, basedpyright clean.

- target.py: _messages_to_input now preserves tool_calls / tool_call_id / name when serializing to the Responses API, so callers who build a history with tool messages don't hit silent data loss. - contracts.py: drop StrEnum from __all__ — it's a Python-version compatibility shim, not a public domain type. - simulation/types.py: move the Message/TokenUsage import to the top of the module with the other imports; removes the # noqa: E402 suppressor.

KarinaKKarinaK

simulation/target.py:247 — m.content or "" turns None into "" for assistant messages, but assistant messages with tool_calls need content: null
so I think a fix could be: skip the fallback when role == "assistant"?

more concretely:
simulation/target.py:247 (_messages_to_input) — when an assistant message has tool_calls, OpenAI expects content: null, not "". The new m.content or "" sends content="" alongside tool_calls=[...], which is the wrong shape. Don't apply the or "" fallback to assistant messages.

…sages In _messages_to_input, the 'or ""' fallback turned None into "" for assistant messages with tool_calls. OpenAI's spec requires content: null in that case, so the previous coercion produced an invalid payload. Keep the fallback for non-assistant roles where the API expects a string.

…ion/types.py The Message section header was left behind when the type itself moved into evaluatorq.contracts. Pure cleanup, no functional change.

KarinaKKarinaK and others added 30 commits April 23, 2026 16:15

feat(evaluatorq): add tool call interception to AgentTarget protocol …

073be04

…output (RES-651)

fix(evaluatorq): address PR review feedback (RES-651)

f862feb

fix(evaluatorq): address swarm review — pipeline tool_calls, sanitize…

9ac1c0d

… response, missing tests (RES-651)

fix(evaluatorq): pass tool_calls to scorer in create_dynamic_evaluato…

ea89390

…r (RES-651)

fix(evaluatorq): migrate CallableTarget and VercelAISdkTarget to retu…

9edb6f8

…rn AgentResponse, add scorer tests (RES-651)

fix(evaluatorq): fix test_sanitize.py import path (RES-651)

fef27bd

fix(evaluatorq): clarify reset_conversation, type dict[str,Any], hand…

f2c3966

…le SDK objects (RES-651)

fix(evaluatorq): update integration tests for AgentResponse return ty…

951a1c9

…pe (RES-651)

feat(evaluatorq): add {{output.tool_calls}} to ASI05 and ASI06 evalua…

9541575

…tor prompts (RES-651)

fix(evaluatorq): update manual test scripts for AgentResponse return …

145a94c

…type (RES-651)

fix(evaluatorq): resolve all typecheck errors for AgentResponse retur…

0d55409

…n type (RES-651)

fix(evaluatorq): replace length-guard with explicit flag, fix callabl…

aed6796

…e typecheck (RES-651)

feat(evaluatorq): surface tool_calls_per_turn in RedTeamResult (RES-651)

3004ab7

Add tool_calls_per_turn to JobOutputPayload and RedTeamResult so intercepted tool calls from OrchestratorResult are accessible in final reports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix evaluatorq tool-call interception edge cases

92b5235

test evaluatorq tool-call interception review coverage

29829a5

address evaluatorq tool-call review suggestions

8c34254

refactor(evaluatorq-py): reuse Pydantic models from convert_models fo…

feac32a

…r AgentResponse output (RES-651)

fix(evaluatorq-py): patch OWASPEvaluator in parse-failure test to avo…

86a8419

…id CredentialError

fix(evaluatorq-py): resolve ruff PERF401/TC006 errors in output list …

75937cb

…construction (RES-651)

fix(evaluatorq-py): suppress basedpyright warnings causing CI typeche…

a72da11

…ck failure (RES-651)

fix(evaluatorq-py): address PR review - preserve tool call IDs, fix c…

bdfea2d

…ursor stability, update docstring (RES-651)

fix(evaluatorq-py): remove TypeError/AttributeError from re-raise lis…

eba95ff

…t in CallableTarget (RES-651)

docs: add design spec for Orq Responses v3 simulation support

4be078a

Baukebrenninkmeijer and others added 9 commits May 5, 2026 17:10

This comment was marked as resolved.

Sign in to view

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread packages/evaluatorq-py/src/evaluatorq/simulation/runner/simulation.py Outdated

Baukebrenninkmeijer and others added 5 commits May 13, 2026 10:04

merge: integrate tool-call interception from karinakalicka branch

7441df9

Keeps HEAD re-exports from evaluatorq.contracts over inline defs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

merge: integrate origin/main, keep contracts re-exports over inline defs

85e177a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread packages/evaluatorq-py/src/evaluatorq/simulation/target.py Outdated

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread packages/evaluatorq-py/src/evaluatorq/contracts.py Outdated

Base automatically changed from bauke/sim-orq-responses-support to main May 14, 2026 15:07

claude Bot reviewed May 15, 2026

View reviewed changes

Comment thread packages/evaluatorq-py/src/evaluatorq/simulation/types.py Outdated

currentlycodinng self-assigned this May 15, 2026

currentlycodinng requested a review from KarinaKKarinaK May 15, 2026 07:14

KarinaKKarinaK reviewed May 15, 2026

View reviewed changes

claude Bot reviewed May 15, 2026

View reviewed changes

Comment thread packages/evaluatorq-py/src/evaluatorq/simulation/types.py Outdated

chore(evaluatorq-py): drop orphan # Message section header in simulat…

e7cf3cd

…ion/types.py The Message section header was left behind when the type itself moved into evaluatorq.contracts. Pure cleanup, no functional change.

currentlycodinng requested review from Baukebrenninkmeijer and KarinaKKarinaK May 15, 2026 12:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(evaluatorq-py): consolidate TokenUsage / Message types across simulation and redteam (RES-596)#127

refactor(evaluatorq-py): consolidate TokenUsage / Message types across simulation and redteam (RES-596)#127
currentlycodinng wants to merge 55 commits into
mainfrom
aminaakhmedova/res-596-consolidate-shared-types-tokenusage-message-across

currentlycodinng commented May 13, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KarinaKKarinaK left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

currentlycodinng commented May 13, 2026

Summary

Test plan

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KarinaKKarinaK left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants